# ExoStat Lab 06: Principal Components Analysis and Stellar Activity

**Administrative details:**

- This Lab will be turned in for credit.

- Some questions of this lab are the same as the Practice 06 questions found on the main [YData website](http://ydata123.org/sp19/).  

- Collaborating on the ExoStat Labs is encouraged. If you get stuck for a while on a question, feel free to ask a neighbor or come to the instructor's or TF's office hours for additional help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) Please don't just share answers, though.

This term we will be using Piazza for class discussion. Find our class page [here](https://piazza.com/yale/spring2019/sds170/home).

You can read more about course policies on our [canvas site](https://canvas.yale.edu).

**Deadline:**

This assignment is due Monday, March 4th at 11:59 P.M. Late work will not be accepted as per the course policies (see the Syllabus and Course policies on [Canvas](https://canvas.yale.edu)).

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

#### Today's ExoStat Lab

1.  Hypothesis testing

2. Principal Components Analysis (PCA)

3.  Stellar spectra

**Submission:**

Submit your assignment both as a .pdf and .ipynb (Jupyter notebook) in Canvas.  

To produce the .pdf, please do the following in order to preserve the cell structure of the notebook:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "HTML (.html)"
3.  After the .html has downloaded, open it and then select "File" and "Print" (note you will not actually be printing)
4.  From the print window, select the option to save as a .pdf

To produce the .ipynb, please do the following:  
1.  Go to "File" at the top-left of your Jupyter Notebook
2.  Under "Download as", select "Notebook (.ipynb)"

In [None]:
# Run this cell, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

## 1.  Hypothesis testing

**For those of you not in the main YData course, it is strongly suggested that you read through [Chapter 11](https://www.inferentialthinking.com/chapters/11/Testing_Hypotheses.html) of the textbook to learn about hypothesis testing.**

### What is the Therapeutic Touch

The Therapeutic Touch (TT) is the idea that everyone can feel the Human Energy Field (HEF) around individuals. Certain practitioners claim they have the ability to feel the HEF and can massage it in order to promote health and relaxation in individuals. Those who practice TT have described different people's HEFs as "warm as Jell-O" and "tactile as taffy".

TT was a popular technique in the late 20th century that was touted as a great way to bring balance to a person's health.

### Emily Rosa

Emily Rosa was a 4th grade student who had wide exposure to the world of TT due to her parents. Her parents were both medical practitioners and skeptics of the idea of TT. 

For her 4th grade science fair project, Emily decided to test whether or not TT practitioners could truly interact with a person's HEF. 

**Question 1.1:** How would you set up an experiment to test this?  Feel free to discuss with your neighbors.

*Write your answer here, replacing this text.*

### Emily's Experiment

Emily's experiment was clean, simple, and effective. Due to her parents' occupations in the medical field, she had wide access to people who claimed to be TT practitioners. 

Emily recruited 21 TT practitioners for her science experiment. She would ask a TT practitioner to extend their hands through a screen that they could not see through. Emily would be on the other side and would flip a coin. Depending on how the coin landed, she would put out either her left hand or her right hand. The TT practitioner would then have to correctly answer which hand Emily put out. Overall, across 210 trials, the practitioners picked the correct hand 44% of the time.

Emily's main goal here was to test whether or not the TT practitioners' guesses were random, like the flip of a coin. In most medical experiments, this is the norm. We want to test whether or not the treatment has an effect, *not* whether or not the treatment actually works.

We will now begin to formulate this experiment in terms of the terminology we learned in the YData course.  Read about hypothesis testing in [Chapter 11](https://www.inferentialthinking.com/chapters/11/Testing_Hypotheses.html) and [Chapter 12](https://www.inferentialthinking.com/chapters/12/Comparing_Two_Samples.html) of the textbook. 

**Question 1.2**: What are the null and alternative hypotheses for Emily's experiment? Discuss with students around you to come to a conclusion.

**Your Answer Here:**

Null Hypothesis: 

Alternative Hypothesis: 

**Question 3:** Remember that the practitioner got the correct answer 44% of the time. According to the null hypothesis, on average, what proportion of times do we expect the practitioner to guess the correct hand? Make sure your answer is between 0 and 1. 

In [None]:
expected_correct = ...
expected_correct

The goal now is to see if our deviation from this expected proportion of correct answers is due to something other than chance. 

**Question 4:** What is a valid test statistic we can use to test our model? Assign `valid_ts` to a list of integers representing the following options: 

1. The difference of the expected percent correct and the actual percent correct
2. The absolute difference of the expected percent correct and the actual percent correct
3. The sum of the expected percent correct and the actual percent correct

There may be more than one correct answer. 

In [None]:
valid_ts = ...
valid_ts

**Question 5:** Define the function `test_statistic` which takes in an expected proportion and an actual proportion, and returns the value of the test statistic chosen above. Assume that you are taking in proportions, but you want to return your answer as a percentage. 

*Hint:* Remember we are asking for a **percentage**, not a proportion. 

In [None]:
def test_statistic(expected_prop, actual_prop):
    ...


**Question 6:** Use your newly defined function to calculate the observed test statistic from Emily's experiment. 

In [None]:
observed_test_statistic = ...
observed_test_statistic

**Is this test statistic likely if the null hypothesis were true? Or is the deviation from the expected proportion due to something other than chance?**

In order to answer this question, we can simulate the experiment as though the null hypothesis were true, and calculate the test statistic for each simulation.

**Question 7:** To begin simulating, we should start by creating an array which has two items in it. The first item should be the proportion of times, assuming the null model is true, a TT practitioner picks the correct hand. The second item should be the proportion of times, under the same assumption, that the TT practitioner picks the incorrect hand. Assign `model_proportions` to this array. After this, use the `sample_proportions` function to simulate Emily running 210 trials of this experiment (as done in real life), and assign the proportion of correct answers to `simulation_proportion`. Lastly, assign `one_test_statistic` to the test statistic of this one simulation.

*Hint:* Usage of `sample_proportions` can be found [here](http://data8.org/sp18/python-reference.html).
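
As a rough illustration of what `sample_proportions` does, the sketch below mimics it with NumPy's multinomial sampler. The helper name `sample_proportions_sketch`, the three-category model, and the sample size of 50 are all made up for this example; they are not the lab's model and not part of the `datascience` library.

```python
import numpy as np

def sample_proportions_sketch(sample_size, probabilities):
    """Hypothetical stand-in for sample_proportions: draw `sample_size`
    items from the categories and return the proportion in each one."""
    counts = np.random.multinomial(sample_size, probabilities)
    return counts / sample_size

# Made-up model with three categories (not the lab's model).
example_model = np.array([0.2, 0.3, 0.5])
example_sample = sample_proportions_sketch(50, example_model)

print(example_sample)          # one proportion per category
print(example_sample.item(0))  # proportion that fell in the first category
```

The returned array has the same length as the model array, and `.item(0)` picks out the proportion for the first category.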

In [None]:
model_proportions = ...
simulation_proportion = ...
one_test_statistic = ...
one_test_statistic

**Question 8:** Let's now see what the distribution of test statistics is actually like under our fully specified model. Assign `simulated_test_statistics` to an array of 1000 test statistics that you simulated assuming the null hypothesis is true. 

*Hint:* This should follow the same pattern as normal simulations, in combination with the code you wrote in the previous problem.

In [None]:
num_repetitions = 1000
num_guesses = 210

simulated_test_statistics = ...

for ... in ...:
    ...


Let's view the distribution of the simulated test statistics under the null, and visually compare where the observed test statistic falls relative to the rest.

In [None]:
t = Table().with_column('Simulated Test Statistics', simulated_test_statistics)
t.hist()
plt.scatter(observed_test_statistic, 0, color='red', s=30)

We can make a visual argument as to whether or not we believe the observed test statistic is likely to occur under the null, or we can use the definition of p-values to help us make a more formal argument. 

**Question 9:** Assign `p_value` to the integer corresponding to the correct definition of what a p-value really is. 

1. The chance, under the null hypothesis, that the test statistic is equal to the value that was observed
2. The chance, under the null hypothesis, that the test statistic is equal to the value that was observed or is even further in the direction of the alternative
3. The chance, under the alternative hypothesis, that the test statistic is equal to the value that was observed or is even further in the direction of the null 
4. The number of times, under the null hypothesis, that the test statistic is equal to the value that was observed or is even further in the direction of the alternative

In [None]:
p_value = ...
p_value

**Question 10:** Using the definition above, calculate the p-value of Emily's observed value in this experiment. 

*Hint:* If our test statistic is further in the direction of the alternative, will it be a larger value or a smaller value?

*Hint:* [This section](https://www.inferentialthinking.com/chapters/11/3/decisions-and-uncertainty) of the textbook contains an example of calculating an empirical p-value.
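
A minimal sketch of an empirical p-value, assuming a statistic for which larger values point toward the alternative (the comparison would flip otherwise). The array of simulated statistics and the observed value are invented for illustration and have nothing to do with Emily's data.

```python
import numpy as np

# Made-up simulated statistics and observed value (not from the lab).
example_simulated = np.array([0.5, 1.0, 2.5, 4.0, 0.0, 3.5, 1.5, 6.0, 2.0, 0.5])
example_observed = 3.5

# Proportion of simulated statistics at least as extreme as the observed one.
num_as_extreme = np.count_nonzero(example_simulated >= example_observed)
example_p_value = num_as_extreme / len(example_simulated)
print(example_p_value)
```

Here three of the ten simulated values are at least 3.5, so the empirical p-value is the fraction 3/10.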

In [None]:
emily_p_val = ...
emily_p_val

If our p-value is less than or equal to 0.05, the evidence favors the alternative and we reject the null hypothesis. Otherwise, we do not have enough evidence against the null hypothesis. Note that this does **not** mean we side with the null hypothesis and accept it, but rather that we just fail to reject it.

This should help you make your own conclusions about Emily Rosa's experiment. 

Therapeutic touch fell out of use after this experiment, which was eventually accepted into one of the premier medical journals. TT practitioners hit back and accused Emily and her family of tampering with the results, while some claimed that Emily's bad spiritual mood towards therapeutic touch made it difficult to read her HEF. Whatever it may be, Emily's experiment is a classic example of how anyone, with the right resources, can test anything they want!

Think about the following questions: 

1. Do we reject the null hypothesis, or fail to reject it? 
2. What does this mean in terms of Emily's experiment? Do the TT practitioners' answers follow an even chance model or is there something else at play? 

[Put your response here]

## 2. Principal Components Analysis (PCA)

[Principal components analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) is a popular statistical method that is widely used in other fields as well.  There are many uses for PCA, such as reducing the dimension of data or defining a set of uncorrelated variables from the data.  For this lab, we are going to use PCA as a way of investigating, understanding, and visualizing the variability of our data.

To begin, we are going to learn some of the basics by exploring some problems found in [this guide](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html).


First let's import the module with the PCA functions.

In [None]:
#Run this
from sklearn.decomposition import PCA
from sklearn import preprocessing

There are a number of details about PCA that are beyond the scope of our course.  But let's look at a few aspects of what goes into it.

Below is code to create a simulated dataset for us.  The `rng` fixes the random seed so that we all will use the same randomly generated data.  The dataset is created and then plotted below.

In [None]:
#Generate the dataset
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');

Okay, now we can run the PCA on our data `X`.  Below is the code for it.  Notice that first we define some properties of the PCA using the `PCA()` option.  In this case, all we are doing is specifying the number of principal components.  We set this to `n_components=2` because we only have two dimensions to our data.  Sometimes you will have many dimensions, but will only specify a small number of components like 2 - 10 (which will often contain most of the variability in the data).

After specifying the details of the PCA, we run it on our data using `pca.fit(X)`.

In [None]:
# Run the PCA
pca = PCA(n_components=2)
pca.fit(X)

The output of the PCA is stored in `pca`, which we can look at.  For example, we can look at the two principal components by running the code below.

In [None]:
# These are the two principal component vectors
print(pca.components_)

The first principal component, PC1, is the vector [-0.94446029, -0.32862557].  This means that the direction of this vector explains the most variability in the data.  We are going to be plotting this vector below so you can get a sense of what that means.
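
To make "direction of most variability" more concrete, here is a minimal sketch that recovers such directions with NumPy alone, by taking the eigenvectors of the sample covariance matrix. This is one standard way to describe PCA, not necessarily how `sklearn` computes it internally, and the names `X_demo`, `cov`, and `directions` are made up for this example.

```python
import numpy as np

# Same recipe as the simulated dataset in the lab cell above.
rng_demo = np.random.RandomState(1)
X_demo = np.dot(rng_demo.rand(2, 2), rng_demo.randn(2, 200)).T

# Sample covariance matrix of the two columns (np.cov divides by n - 1).
cov = np.cov(X_demo, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder so PC1 comes first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
directions = eigenvectors[:, order].T   # one unit-length direction per row

print(directions)    # compare with pca.components_ (signs may be flipped)
print(eigenvalues)   # compare with pca.explained_variance_
```

The sign of each direction is arbitrary, so a row may come out as the negative of the vector that `pca.components_` reports.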

Another output of `pca` is `pca.explained_variance_`.  This specifies how much variability is explained by each of the two PCs.  

In [None]:
print(pca.explained_variance_)

Notice that PC1 accounts for a variance of 0.7625 and PC2 accounts for 0.0184.  Since there are only two dimensions to our data, the total variability in the data is the sum of these two values:

In [None]:
#Total variability
sum(pca.explained_variance_)

We can also calculate the total variability by adding the variance of the two columns of `X` together as is done in the cell below.  The factor `200/199` is to account for a difference in the way variance is calculated with `np.var` and the PCA function.

In [None]:
(np.var(X[:,0])+np.var(X[:,1]))*200/199
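
The `200/199` factor comes from the divisor used in the variance: `np.var` divides by `n` by default, while the PCA variances use `n - 1`. A small sketch on made-up data (the array `values` is an assumption for this example) shows that the two conventions differ by exactly `n / (n - 1)`.

```python
import numpy as np

rng_demo = np.random.RandomState(0)
values = rng_demo.randn(200)             # made-up data with n = 200
n = len(values)

population_var = np.var(values)          # divides by n (NumPy default, ddof=0)
sample_var = np.var(values, ddof=1)      # divides by n - 1

# Rescaling by n / (n - 1) converts one into the other.
print(np.isclose(population_var * n / (n - 1), sample_var))
```

Passing `ddof=1` to `np.var` is therefore an alternative to multiplying by `200/199` by hand.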

Instead of looking at the raw variance, often people are interested in the proportion of the variability accounted for by each component.  This information is stored in `pca.explained_variance_ratio_`.

In [None]:
print(pca.explained_variance_ratio_)

This is simply the ratio of the variability of each component divided by the total variability.  We could calculate it manually below:

In [None]:
pca.explained_variance_/sum(pca.explained_variance_)

The mean vector is also stored, which is simply the mean of the different columns of `X` (as seen in the second cell below).

In [None]:
pca.mean_

In [None]:
[np.mean(X[:,0]),np.mean(X[:,1])]

Now, let's visualize the PCs!  Below is code for plotting the data and then adding the principal component vectors as arrows.  PC1 is the longest vector and indicates the direction of maximum variability.  The second arrow is for PC2 and is perpendicular to the PC1 vector.  Both of the vectors are drawn from the mean, `pca.mean_`.

In [None]:
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=2,
                    shrinkA=0, shrinkB=0, color = "blue")
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3* np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');

**Question 2.1.** Now it is your turn!  Below is a new dataset.  Run the cell below to see what it looks like.  Where do you think PC1 is going to be?  Put your answer in the indicated cell below.

In [None]:
# Run this to get the new dataset.
A = [[-1,-1], 
    [.5,1]]
rng = np.random.RandomState(1)
X = np.dot(A, rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');

[Add your response here]

**Question 2.2.** Run a PCA on the new data.  What are the values of the PC1 vector?

In [None]:
#Code to run PCA here
...

In [None]:
#Values of PC1
pc1 = ...
pc1

**Question 2.3.** Now produce a plot like the one above with the arrows.  You should be able to simply copy the plotting code from above and run it.

In [None]:
...

**Question 2.4.**  What percentage of variability is explained by PC2?

In [None]:
...

Next we are going to look at the `exams.txt` dataset.  These are the scores of 88 students in five different math areas:  vectors, mechanics, algebra, analysis, statistics.  Read in the data below. 

In [None]:
exams0 = Table.read_table("exams.txt", sep = r"\s+", header = None, 
                         names = ["Vectors","Mechanics","Algebra","Analysis","Stat"])
exams0

In order to use the PCA functions, we need the dataset to be an array.  The cell below converts our table into an array.

In [None]:
exams = np.zeros((exams0.num_rows, exams0.num_columns))
for i in np.arange(exams0.num_rows):
    exams[i] = exams0.values[i]

**Question 2.5.**  Run a PCA on the `exams` data and print out the principal components.  Since there are five columns in our dataset, set the `n_components` to be 5.

In [None]:
...

**Question 2.6.**  Next we are going to produce a plot like the one discussed during the lecture portion of the class, which is called a [biplot](https://en.wikipedia.org/wiki/Biplot).  The code is provided for you below; for this question you just need to run the code.  The points that are plotted are the 88 students projected onto the first two principal component vectors.

In [None]:
## project data into PC space
dat = exams
labels = ["Vectors","Mechanics","Algebra","Analysis","Stat"]
# 0,1 denote PC1 and PC2; change values for other PCs
xvector = pca.components_[0] # see 'prcomp(my_data)$rotation' in R
yvector = pca.components_[1]

xs = pca.transform(dat)[:,0] # see 'prcomp(my_data)$x' in R
ys = pca.transform(dat)[:,1]

In [None]:
## visualize projections
for i in range(len(xvector)):
# arrows project features (ie columns from csv) as vectors onto PC axes
    plt.arrow(0, 0, xvector[i]*max(xs), yvector[i]*max(ys),
              color='r', width=0.0005, head_width=0.0025)
    plt.text(xvector[i]*max(xs)*1.2, yvector[i]*max(ys)*1.2,
             labels[i], color='r')

for i in range(len(xs)):
# circles project documents (ie rows from csv) as points onto PC axes
    plt.plot(xs[i], ys[i], 'bo')
    plt.text(xs[i]*1.2, ys[i]*1.2, i, color='b')

plt.show()
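
The "projection" used for the biplot points can be sketched with NumPy alone: center each column, then take the dot product of every row with each principal direction. The data below are random made-up scores, not the `exams` data, and the names `scores_demo`, `first_two`, and `projected` are hypothetical; the directions are taken from an SVD of the centered data as a stand-in for `pca.components_`.

```python
import numpy as np

rng_demo = np.random.RandomState(0)
scores_demo = rng_demo.rand(10, 5) * 100   # made-up: 10 students x 5 subjects

# Center each column, then find orthonormal directions with the SVD.
center = scores_demo.mean(axis=0)
centered = scores_demo - center
_, _, vt = np.linalg.svd(centered, full_matrices=False)
first_two = vt[:2]                         # rows play the role of PC1 and PC2

# Projection: dot product of each centered row with each direction.
projected = centered @ first_two.T
print(projected.shape)   # (10, 2)
```

Each row of `projected` holds one student's coordinates along the first two directions, which is the role `xs` and `ys` play in the biplot code.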

**Question 2.7.**  Now we want to interpret the biplot from above.  As we did during lecture, find a few students that are well-separated in the PC1 direction and explain what it is about their scores that seems to make them different.  That is, what seems to be leading to the variability in this direction.

In [None]:
...

**Question 2.8.**  Same question as above, except for the PC2 direction.

In [None]:
...

## 3. Stellar activity

In this section, we are going to look at some simulated stellar spectra.  These spectra were generated from [SOAP 2.0](https://arxiv.org/abs/1409.3594), which is a tool for simulating stellar spectra with different types of stellar activity such as spots.

Next week we are going to consider running a PCA on these data, but today we are only going to make a plot.

In [None]:
#Read in the wavelengths
wavelength = Table.read_table("spot_1_percent_wave.txt", header = None, names = ["Wave"])

In [None]:
#Create a header for the spectra data
spec_header = make_array()
for i in np.arange(wavelength.num_rows):
    spec_header = np.append(spec_header, str(i))

In [None]:
#Read in the spectra data
spec = Table.read_table("spot_1_percent.txt",sep = r"\s+", header = None, names = spec_header)
spec

**Question 3.1.** How many rows and columns does `spec` have?

In [None]:
spec_rows = ...
spec_columns = ...
print("Rows: ", spec_rows)
print("Columns: ", spec_columns)

**Question 3.2.** Make a plot with `wavelength` on the horizontal axis and the first row of `spec` on the vertical axis.  The horizontal axis label can be `Wavelength` and the vertical axis label can be `Intensity`. Hint:  You may find `spec.values[0]` helpful in producing the plot.

In [None]:
...

As noted above, we will look at these data again next week and run a PCA on them in order to investigate stellar activity.

## 4. Final Project Ideas

The final project is an opportunity for you to explore, in more detail, a question of interest related to exoplanets and data science.  It is expected that the topic of the question will be related to exoplanets and the project will include an analysis using data science methods.  

The final project will culminate in a 5 - 10 page written report introducing your question, describing the methodology you employ to answer the question, a discussion of the results, and finally the conclusions you can draw from the analysis.

**Question 4.1.** It is time to start thinking about ideas for your final project.  For this question, describe an idea (or several) that you are interested in pursuing for your final project.  Explain the question or topic you plan to explore along with the type or source of data you would need for addressing your question.

[Answer here]

**Submission**: Once you're finished, follow the instructions at the top of this notebook to save as a .pdf and .ipynb. Then submit the two files through Canvas.