# ExoStat Lab 09: Exam 2

**Administrative details:**

- This is the second midterm exam, and is due by the end of class today (5:20PM on Tuesday, April 2).

- You can use your notes, previous Labs, or the internet, but you may not collaborate with anyone else.  If you have questions, please speak **quietly** with the instructor.

**Submission:**

Submit your completed exam both as a .pdf and .ipynb (Jupyter notebook) in Canvas.  

To produce the .pdf, please do the following in order to preserve the cell structure of the notebook:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "HTML (.html)"
3.  After the .html has downloaded, open it and then select "File" and "Print" (note you will not actually be printing)
4.  From the print window, select the option to save as a .pdf

To produce the .ipynb, please do the following:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "Notebook (.ipynb)"

In [None]:
# These lines import the NumPy and datascience modules.
import numpy as np
from datascience import *

from sklearn.decomposition import PCA
from sklearn import preprocessing

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

# 1.  Principal Components analysis and Stellar Activity

For this section you will be considering a dataset of 25 spectra taken across 25 days, so that each spectrum represents one day of observing a star with a rotation period of 25 days.  The spectra show some type of variability: it may be stellar activity or it may be a Doppler shift (you will be asked about this in a later question).

Run the cells below to read in the data.  Note that `wavelength` is the set of wavelength values, and can be used for both the spot and planet data.

In [None]:
#Read in the wavelengths
wavelength = Table.read_table("wave.txt", header = None, names = ["Wave"])

In [None]:
#Create a header for the spectra data
spec_header = make_array()
for i in np.arange(wavelength.num_rows):
    spec_header = np.append(spec_header, str(i))

In [None]:
#Read in the spectra
spectra = Table.read_table("spectra.txt", sep = r"\s+", header = None, names = spec_header)

**Question 1.1.** How many rows and columns does `spectra` have?

In [None]:
...

**Question 1.2.** Make a plot with wavelength on the horizontal axis and the first row of `spectra` on the vertical axis. The horizontal axis label can be `Wavelength` and the vertical axis label can be `Intensity`.

In [None]:
...

In order to use the `PCA()` function, we need to convert our table to a dataframe, which is a different type of data object.  Run the cell below to do this.

In [None]:
# Run this cell
spectra_df = Table.to_df(spectra)
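
As a reminder of what a PCA produces, here is a minimal sketch that computes principal components by hand with NumPy's singular value decomposition on made-up data. It is not the `PCA()` function from `sklearn` and it does not use the exam data; the array `X`, its size, and all variable names are assumptions chosen for illustration. Results should agree with a library PCA up to the sign of each component.

```python
import numpy as np

# Synthetic stand-in for a set of spectra: 25 observations (rows) x 40 wavelengths (columns).
# These values are made up for illustration; they are not the exam data.
rng = np.random.default_rng(0)
n_obs, n_wave = 25, 40
phase = 2 * np.pi * np.arange(n_obs) / n_obs
pattern = rng.normal(size=n_wave)  # one fixed spectral pattern
X = np.outer(np.sin(phase), pattern) + 0.05 * rng.normal(size=(n_obs, n_wave))

# PCA by hand: center each column, then take the singular value decomposition.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_components = 6
components = Vt[:n_components]                   # one row per PC, one column per wavelength
scores = U[:, :n_components] * S[:n_components]  # one row per observation, one column per PC
explained_ratio = (S ** 2) / np.sum(S ** 2)      # fraction of total variance per PC

print(components.shape, scores.shape)
print(np.round(explained_ratio[:n_components], 3))
```

The rows of `components` are the principal components (one value per wavelength), while the columns of `scores` describe how strongly each observation expresses each component.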

**Question 1.3.** Run a PCA on `spectra_df` and print the principal components.  Set `n_components = 6`.

In [None]:
...

**Question 1.4.**  Create a plot that displays the percentage of variability accounted for by each of the 6 principal components for `spectra_df`.  What do you notice about the amount of variability the first PC accounts for?

In [None]:
...

[Add response here]

**Question 1.5.**  Create a scatterplot of the first three sets of PC scores for `spectra_df`.  The horizontal axis should have values 1 through 25 representing the 25 time points of the observations (label it `Time (days)`).  The vertical axis should show the scores (label it `PC Scores`).  Be sure to label your axes.

In [None]:
...

**Question 1.6.**  What do you notice about the plot of the PC scores?  Do you think the main source of variability is stellar activity or a Doppler shift?  *Hint:  What does the pattern of the PC1 scores look like?*

[Add response here]

**Question 1.7.** In this question, we focus on PC1 for the spectra.  Plot the first spectrum of `spectra`, but color each point according to the PC1 value at that wavelength.  I suggest using `plt.scatter` so you can easily assign a different color to each point.  
To set up the color assignment, write a function, `assign_color`, that takes a value and returns `blue` if the value is less than or equal to -0.005, `gray` if the value is greater than -0.005 and less than or equal to 0.005, and `red` if the value is greater than 0.005.

Apply this function to the PC1 values to get the colors.  Then use the array of colors for your scatter plot of the spectrum.

In [None]:
# Define assign_color function
...

In [None]:
# Plot 
...

**Question 1.8.** What do you notice about the variability of the spectra?  Returning to Question 1.6, do you think the main source of variability is stellar activity or a Doppler shift?  

[Add your response here]

# 2. Exploring Exoplanet Populations

In this section, we are going to use the confirmed exoplanet data, `confirmed_planets.csv`, which was collected from the [NASA Exoplanet Archive](https://exoplanetarchive.ipac.caltech.edu).  You can run the cell below to read in the data.  Note that we have to skip a number of rows to get rid of some of the header information.

In [None]:
exoplanets = Table.read_table("confirmed_planets.csv", skiprows = 71)
exoplanets

**Question 2.1.**  The [orbital eccentricity](https://en.wikipedia.org/wiki/Orbital_eccentricity) of an exoplanet (`pl_orbeccen` in the `exoplanets` table) describes how much the shape of the orbit differs from a circle.  An orbital eccentricity of 0 is a perfect circle.  We also know that the orbital eccentricity must be less than 1 in order for the planet to remain in orbit. 

Define a new table called `eccentricity` that only includes the orbital eccentricity column from the `exoplanets` table.  

In [None]:
eccentricity = ...

**Question 2.2.**  There are a number of missing values represented by `nan`.  Remove the `nan` from your `eccentricity` table.

In [None]:
...

**For the remainder of this section, use the eccentricity values with the `nan`'s removed.**
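
One common way to drop missing values from an array is a boolean mask built with `np.isnan`. The sketch below uses made-up numbers, not values from the archive, and the names `values`, `keep`, and `clean` are illustrative assumptions; the same idea can be applied to a column of a table.

```python
import numpy as np

# Made-up eccentricity values with missing entries; not taken from the archive.
values = np.array([0.02, np.nan, 0.31, 0.0, np.nan, 0.67])

# np.isnan flags the missing entries; ~ inverts the mask so we keep the rest.
keep = ~np.isnan(values)
clean = values[keep]

print(len(values), len(clean))
```

Checking the length before and after is a quick way to confirm how many values were removed.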

**Question 2.3.**  Make a histogram of the eccentricity values.  What is the (exact) range (min and max) of the eccentricities?

In [None]:
...

**Question 2.4.**  Suppose we are interested in estimating the population mean eccentricity.  Let's suppose that our sample of eccentricities can be considered a random sample of eccentricities from the population of exoplanets in our Milky Way galaxy.  Calculate the mean eccentricity of the sample and call it `sample_mean`.

In [None]:
sample_mean = ...
sample_mean

**Question 2.5.**  Now we want to get a sense of the uncertainty in our estimate.  Define a function called `bootstrap_sample_mean` that takes a table with a single column labeled `pl_orbeccen`, draws a bootstrap sample of its eccentricities, and returns the mean of that sample.

In [None]:
...

**Question 2.6.**  Now bring it all together to draw 5,000 bootstrap estimates of the eccentricity mean by filling in the gaps in the code below.

In [None]:
boot_means = ...
replications = ...

for i in np.arange(...):
    new_mean = ...
    boot_means = np.append(..., ...)   

boot_means

**Question 2.7.** Create a histogram of your `boot_means`.

In [None]:
...

**Question 2.8.** Give a 95% bootstrap confidence interval for the mean eccentricity.

In [None]:
left = ...
right = ...

make_array(left, right)
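
For reference, the percentile bootstrap can be sketched with plain NumPy on made-up data. This is an illustration under stated assumptions, not the exam's expected code: the sample is simulated, and the helper name `one_bootstrap_mean` is hypothetical. It resamples with replacement at the original sample size, records the mean each time, and takes the middle 95% of those means.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up sample standing in for cleaned eccentricities (values between 0 and 1).
sample = rng.beta(a=1.0, b=4.0, size=500)

def one_bootstrap_mean(data, rng):
    """Resample the data with replacement (same size) and return the mean."""
    resample = rng.choice(data, size=len(data), replace=True)
    return resample.mean()

replications = 5000
boot_means = np.array([one_bootstrap_mean(sample, rng) for _ in range(replications)])

# Percentile method: the 2.5th and 97.5th percentiles bound the middle 95%.
left, right = np.percentile(boot_means, [2.5, 97.5])
print(left, sample.mean(), right)
```

The interval endpoints change slightly from run to run unless the random seed is fixed, which is why a bootstrap interval is described as approximate.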

**Question 2.9.**  How would you explain how to interpret this approximate 95% confidence interval to someone unfamiliar with confidence intervals?  Be sure to explain what "confidence" means in your own words.

[Add your response here]

**Submission:** Once you're finished, follow the instructions at the top of this notebook to save as a .pdf and .ipynb.  Then submit the two files through Canvas.