# Week 3: Distributions Continued

## Recall from last week

We are going to continue with our discussion from last week on distributions. To rehash... __there are lots of different distributions__. The most commonly used in astronomy are the Gaussian (normal), power law, binomial, Poisson, and Lorentz.

When working with a data set we may be faced with __two questions__:  
- Is a set of samples consistent with following one of these distributions?  
- Are two sets of samples drawn from the same distribution?  

The tests discussed last week should help us answer these questions. Depending on the situation, __different tests may be appropriate__. The lists below summarize which test suits which situation.

### Comparing a data set to a known (analytic) distribution

- $\chi^2$: Widely used, rigorously defined for known, Gaussian uncertainties.
- Kolmogorov-Smirnov (KS): Widely used, compares maximum difference in CDF. Not great for small numbers or outliers.
- Anderson-Darling: Integrated version of the KS test. Therefore better than KS for small numbers and outliers.
- t-test: Determines whether a data set could be consistent with having a mean at some value.
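
As a rough illustration of the one-sample tests listed above, here is a minimal sketch that runs a KS test and a t-test on synthetic Gaussian draws. The data, the seed, and the names `demo_sample`, `ks_p`, and `t_p` are made up for this illustration and are not part of the course data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption): 200 draws from a Gaussian
# with mean 5 and standard deviation 2
rng = np.random.default_rng(42)
demo_sample = rng.normal(loc=5.0, scale=2.0, size=200)

# KS test against a fully specified normal distribution N(5, 2)
ks_stat, ks_p = stats.kstest(demo_sample, 'norm', args=(5.0, 2.0))

# One-sample t-test: is the sample mean consistent with 5?
t_stat, t_p = stats.ttest_1samp(demo_sample, popmean=5.0)

print("KS statistic =", ks_stat, " p =", ks_p)
print("t statistic  =", t_stat, " p =", t_p)
```

Keep in mind that the KS p-value is only valid when the parameters of the reference distribution are fixed in advance, not fitted to the same data being tested.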

### Comparing two data sets

- Mann-Whitney U (Wilcoxon rank sum): Assumes nothing about underlying distributions. Really only compares the medians.
- Kolmogorov-Smirnov (KS): Widely used, compares maximum difference in CDF. Not great for small numbers or outliers.
- Anderson-Darling: Integrated version of the KS test. Therefore better than KS for small numbers and outliers.
- t-test: Tests if the means of two (assumed to be Gaussian) distributions are equal. The distributions may have different variances, in which case the math is slightly different (Welch's t-test).
- F-test: Tests if the variances of two Gaussian distributions are equal.
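
To give a feel for how the two-sample tests are called, here is a minimal sketch on synthetic data. The two samples (`demo_a`, `demo_b`) are made-up Gaussians offset by 0.5 and are assumptions for illustration only, not the course data.

```python
import numpy as np
from scipy import stats

# Two synthetic samples (assumption): same spread, means offset by 0.5
rng = np.random.default_rng(0)
demo_a = rng.normal(loc=0.0, scale=1.0, size=300)
demo_b = rng.normal(loc=0.5, scale=1.0, size=300)

# Two-sample KS test: compares the empirical CDFs
ks_stat, ks_p = stats.ks_2samp(demo_a, demo_b)

# Mann-Whitney U: rank-based, no assumption of Gaussianity
u_stat, u_p = stats.mannwhitneyu(demo_a, demo_b, alternative='two-sided')

# Welch's t-test: compares means without assuming equal variances
t_stat, t_p = stats.ttest_ind(demo_a, demo_b, equal_var=False)

print("KS p =", ks_p)
print("Mann-Whitney p =", u_p)
print("Welch t-test p =", t_p)
```

In each case a small p value argues against the two samples sharing the property being tested (the full distribution for KS, the location for Mann-Whitney and the t-test).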

### Correlation tests

- Pearson r or $\rho$: Widely used, simple to implement. Easily skewed by outliers. Works for linear correlations.
- Spearman r or $\rho$: Handles outliers better. Linear not required -> works for any monotonic function.
- Kendall $\tau$: Also a rank test. Used to determine whether two variables are independent.
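
The difference between linear and rank correlation is easiest to see on a toy case. The sketch below uses a made-up, noise-free exponential relation (`x_demo`, `y_demo` are hypothetical names): it is perfectly monotonic but far from linear, so the rank-based coefficients should come out at 1 while Pearson's r falls short of it.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): a noise-free, monotonic, nonlinear relation
rng = np.random.default_rng(1)
x_demo = rng.uniform(0.0, 5.0, size=100)
y_demo = np.exp(x_demo)

# Pearson measures linear correlation
r_pearson, p_pearson = stats.pearsonr(x_demo, y_demo)

# Spearman and Kendall work on ranks, so any monotonic relation scores highly
rho_spearman, p_spearman = stats.spearmanr(x_demo, y_demo)
tau_kendall, p_kendall = stats.kendalltau(x_demo, y_demo)

print("Pearson r    =", r_pearson)
print("Spearman rho =", rho_spearman)
print("Kendall tau  =", tau_kendall)
```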

## Now, on to a few examples

In [None]:
# First, let's load the libraries we will need
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

%matplotlib inline

### Exercise 1: Characterizing two distributions

In [None]:
# First let's load up the data

folder = "../data/Week_3/"

sample_1 = np.genfromtxt(folder + "sample_1.dat")
sample_2 = np.genfromtxt(folder + "sample_2.dat")

# Let's see how big the data sets are
print("Data set 1 has", len(sample_1), "elements")
print("Data set 2 has", len(sample_2), "elements")

In [None]:
# Now, let's plot histograms of the two data sets to see how they compare
# (density=True replaces the removed normed=True keyword)
plt.hist(sample_1, density=True, color='r', alpha=0.3, label="Sample 1")
plt.hist(sample_2, density=True, color='g', alpha=0.3, label="Sample 2")

plt.legend()
plt.show()

First, let's see if these two data sets are consistent with being a Gaussian. Let's start with Sample 1.

In [None]:
# We will use the Anderson-Darling test in the scipy package.

statistic, critical_values, significance = stats.anderson(sample_1, dist='norm')

print("Statistic =", statistic)
print("Critical values =", critical_values)
print("Significance =", significance)

Since the statistic, 0.685, is larger than 0.577 but smaller than 0.692, the distribution is non-Gaussian only at the 90% to 95% level. The statistic would have to be above 0.96 to be ruled non-Gaussian at the 99% level.
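
Reading the critical values by eye gets tedious, so here is a minimal sketch of doing the comparison in a loop. It runs on synthetic Gaussian draws (`demo_sample` is a made-up stand-in, not Sample 1) and relies on the `statistic`, `critical_values`, and `significance_level` attributes of the result returned by `stats.anderson`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption): 500 draws from a unit Gaussian
rng = np.random.default_rng(3)
demo_sample = rng.normal(loc=0.0, scale=1.0, size=500)

result = stats.anderson(demo_sample, dist='norm')
print("Statistic =", result.statistic)

# Compare the statistic to each critical value in turn.
# significance_level is given in percent (e.g. 5.0 means 5%).
for crit, sig in zip(result.critical_values, result.significance_level):
    verdict = "rejected" if result.statistic > crit else "not rejected"
    print(f"{sig:4.1f}% significance: critical value {crit:.3f} -> Gaussianity {verdict}")
```

The statistic has to exceed a critical value for Gaussianity to be rejected at the corresponding significance level.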

### Exercise: In the code block below, repeat the above test with data from Sample 2

In [None]:
# Test for Gaussianity of the data for Sample 2


Now, let's compare the two distributions to each other. Pick two of the tests above and apply them. 

Remember that in Jupyter, it is easy to look up the docs for packages and functions. Look at the examples in the following code blocks.

In [None]:
stats?

In [None]:
stats.uniform?

In [None]:
# Code up two different comparison tests here


### Example 2: Using radial velocities to constrain an unseen companion to a star

This is an example from my own research (see Andrews et al. 2016). 

Let's say you have a star that looks strange for some reason, and you want to know whether that could be because the star is actually a binary system. A faint companion will not appear in the photometry or spectroscopy. However, by taking several consecutive measurements of the radial velocity, the possibility that the star hosts a companion can be constrained.

First, let's load up the radial velocity data and look at it.

In [None]:
folder = "../data/Week_3/"

RV_1 = np.genfromtxt(folder + "RV_sample_1.dat", names=True)

print(RV_1.dtype)

In [None]:
# Our RV plotting script

def generate_RV_plot(times, RV, RV_err, xmin=None, xmax=None, color='k', ax=None, 
                     xlabel=None, ylabel=None, label=None):
    
    if ax is None:
        if label is None:
            plt.errorbar(times, RV, yerr=RV_err, fmt='o', color=color)
        else:
            plt.errorbar(times, RV, yerr=RV_err, fmt='o', color=color, label=label)
            
        if xmin is not None and xmax is not None: plt.xlim(xmin, xmax)
        if xlabel is not None: plt.xlabel(xlabel)
        if ylabel is not None: plt.ylabel(ylabel)
        
    else:
        if label is None:
            ax.errorbar(times, RV, yerr=RV_err, fmt='o', color=color)
        else:
            ax.errorbar(times, RV, yerr=RV_err, fmt='o', color=color, label=label)  
            
        if xmin is not None and xmax is not None: ax.set_xlim(xmin,xmax)
        if xlabel is not None: ax.set_xlabel(xlabel)
        if ylabel is not None: ax.set_ylabel(ylabel)
        

In [None]:
# Now, we plot the three different sets of observations

fig, ax = plt.subplots(1, 3, figsize=(12,3), sharey=True)

# First observation
generate_RV_plot(RV_1["date"], RV_1['vel'], RV_1['err'], xmin=7215.8, xmax=7216.0, color='b', 
                 ax=ax[0], xlabel='Time (MJD)', ylabel='RV (km/s)')

# Second observation
generate_RV_plot(RV_1["date"], RV_1['vel'], RV_1['err'], xmin=7224.8, xmax=7225.0, color='b', 
                 ax=ax[1], xlabel='Time (MJD)')

# Third observation
generate_RV_plot(RV_1["date"], RV_1['vel'], RV_1['err'], xmin=7311.6, xmax=7311.8, color='b', 
                 ax=ax[2], xlabel='Time (MJD)')



These data were taken with the FLWO 1.5-meter telescope, not the VLT, so the radial velocities are not super precise. But we can still use them: for instance, the last observation shows what could be periodic oscillations indicative of the orbital motion of a short period binary. However, these are not seen in the other two observations. Likewise, the first observation looks like it could indicate a slow increase in the radial velocity, but the other two observations do not clearly show anything similar. 

**So, how do we deal with these data?**

To really squeeze every last bit of information out of these data, we'll need to use time series analysis. Since that is (maybe) the subject of a future session, we'll ignore it for now. Instead, let's perform the first-order analysis, using **hypothesis testing.** Here's how it goes.

We adopt the following *null hypothesis*:

**Hypothesis: These radial velocities have no variations**

Now, we ask the question: how likely is it that these data are consistent with the null hypothesis? The goal is to be able to reject the null hypothesis with some statistical significance. We will adopt a critical **p** value (in much of the scientific and medical literature, a critical **p** value of 0.05 is used).

An easy way to answer this question is by calculating the $\chi^2$ value and the reduced $\chi^2$ value. The following two equations may look familiar:

$$ \chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - \mu}{\sigma_i} \right)^2, $$

$$ \chi^2_{\rm red} = \frac{1}{N-k} \sum_{i=1}^{N} \left( \frac{y_i - \mu}{\sigma_i} \right)^2, $$

where $N$ is the number of data points, $k$ is the number of model parameters, $y_i$ and $\sigma_i$ are the radial velocities and their associated uncertainties, respectively, and $\mu$ is the average radial velocity. 

### Question: What is the value of $k$ in the above equation, and why?

**Answer: ??**

First, let's calculate $\mu$. For heteroscedastic data (each data point has its own associated uncertainty), we need to use a weighted mean in which the weights are the inverse of $\sigma_i^2$. Use the function below to calculate the weighted mean, and replot the RV data with the weighted mean shown in the background.
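
Written out, the inverse-variance weighted mean is

$$ \mu = \frac{\sum_{i=1}^{N} y_i / \sigma_i^2}{\sum_{i=1}^{N} 1 / \sigma_i^2}, $$

which is what `np.average` evaluates when given `weights=1.0/err**2`. For independent Gaussian uncertainties, the standard result for the uncertainty on this mean is

$$ \sigma_\mu = \left( \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \right)^{-1/2}. $$

The symbol $\sigma_\mu$ is introduced here for illustration only and is not used elsewhere in this notebook.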

In [None]:
# Again, we plot the three different sets of observations
fig, ax = plt.subplots(1, 3, figsize=(12,3), sharey=True)


# Calculate and plot the weighted mean
mu_1 = np.average(RV_1['vel'], weights=1.0/RV_1['err']**2)

for a in ax:    
    a.axhline(mu_1, color='k', linestyle='--')



# First observation
generate_RV_plot(RV_1["date"], RV_1['vel'], RV_1['err'], xmin=7215.8, xmax=7216.0, color='b', 
                 ax=ax[0], xlabel='Time (MJD)', ylabel='RV (km/s)')

# Second observation
generate_RV_plot(RV_1["date"], RV_1['vel'], RV_1['err'], xmin=7224.8, xmax=7225.0, color='b', 
                 ax=ax[1], xlabel='Time (MJD)')

# Third observation
generate_RV_plot(RV_1["date"], RV_1['vel'], RV_1['err'], xmin=7311.6, xmax=7311.8, color='b', 
                 ax=ax[2], xlabel='Time (MJD)')

plt.show()

With a partner, **code up a function** that takes in the measurement values, their uncertainties, and returns $\chi_{\rm red}^2$ using the equation provided. Start by adapting the line in the above block of code that calculates $\mu$ to the inputs in the function below.

In [None]:
def calc_reduced_chi_2(y, y_err):
    

    
    
    
    return reduced_chi_2

In [None]:
reduced_chi_2 = calc_reduced_chi_2(RV_1['vel'], RV_1['err'])
print("Reduced chi^2:", reduced_chi_2)

For "perfectly" random data, $\chi_{\rm red}^2$ should be unity. The value we obtain is pretty close. To quantify this statement, we want to calculate the **p** value we discussed above. Use the code block below to calculate this. Note, we have to use the $\chi^2$, not the $\chi_{\rm red}^2$ here.

In [None]:
N_dof = len(RV_1['vel'])-1  # Number of degrees of freedom
chi_2 = reduced_chi_2 * (len(RV_1['vel'])-1)

p_value = stats.chi2.cdf(chi_2, N_dof)
print(p_value)

How do we interpret this number? Because we used the CDF, **p** here is the probability that random noise alone would produce a $\chi^2$ *smaller* than the one we measured. Had **p** been less than 0.05, it would have indicated that the data are *too* consistent given the uncertainties. Typically this means the uncertainties are overestimated for some reason. Had **p** been greater than 0.95, it would have indicated that our null hypothesis could have been ruled out at the 95% (or roughly 2-$\sigma$) level. Practically speaking, this means that only 5% of the time (or 1 in 20) could random noise alone have produced this much scatter.
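
To make the two tails of the $\chi^2$ distribution concrete, here is a small sketch with made-up numbers (`n_dof_demo` and `chi2_demo` are hypothetical values, not our RV results). `stats.chi2.cdf` gives the probability of a $\chi^2$ at or below the observed value, and `stats.chi2.sf` (the survival function) gives the complement.

```python
from scipy import stats

n_dof_demo = 20                   # hypothetical number of degrees of freedom
chi2_demo = [8.0, 20.0, 40.0]     # made-up chi^2 values: low, typical, high

# P(chi^2 <= observed): close to 0 when the scatter is smaller than expected
cdf_demo = [stats.chi2.cdf(c, n_dof_demo) for c in chi2_demo]

# P(chi^2 > observed) = 1 - CDF: close to 0 when the scatter is larger than expected
sf_demo = [stats.chi2.sf(c, n_dof_demo) for c in chi2_demo]

for c, lower, upper in zip(chi2_demo, cdf_demo, sf_demo):
    print(f"chi^2 = {c:5.1f}: CDF = {lower:.3f}, SF = {upper:.3f}")
```

A $\chi^2$ well above the number of degrees of freedom pushes the CDF toward 1 (more scatter than the uncertainties allow), while a $\chi^2$ well below it pushes the CDF toward 0 (less scatter than expected).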

Note that, depending on what you are doing, a 95% confidence threshold may not be stringent enough. The annals of particle physics history are replete with 3-$\sigma$ detections (roughly 99.7% confidence) that were later shown to be noise.

So, our conclusion from this exercise is the following: **Given the p value of 0.45, we cannot rule out the null hypothesis. Therefore, our data are consistent with non-varying radial velocities.**

### Example 3: But wait, there's a second star!

The star we have been looking at has a wide binary companion, and we have radial velocity data for both. Load up the data below, and plot both RVs together, along with each of their means.

In [None]:
# Load the radial velocity data for star 2
RV_2 = np.genfromtxt(folder + "RV_sample_2.dat", names=True)

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(12,3), sharey=True)


# Calculate and plot the weighted mean
mu_1 = np.average(RV_1['vel'], weights=1.0/RV_1['err']**2)
mu_2 = np.average(RV_2['vel'], weights=1.0/RV_2['err']**2)

for a in ax:    
    a.axhline(mu_1, color='b', linestyle='--')
    a.axhline(mu_2, color='r', linestyle='--')



# First observation
generate_RV_plot(RV_1["date"], RV_1['vel'], RV_1['err'], xmin=7215.8, xmax=7216.0, color='b', 
                 ax=ax[0], xlabel='Time (MJD)', ylabel='RV (km/s)', label='Star 1')
generate_RV_plot(RV_2["date"], RV_2['vel'], RV_2['err'], color='r', ax=ax[0], label='Star 2')

# Second observation
generate_RV_plot(RV_1["date"], RV_1['vel'], RV_1['err'], xmin=7224.8, xmax=7225.0, color='b', 
                 ax=ax[1], xlabel='Time (MJD)')
generate_RV_plot(RV_2["date"], RV_2['vel'], RV_2['err'], color='r', ax=ax[1])

# Third observation
generate_RV_plot(RV_1["date"], RV_1['vel'], RV_1['err'], xmin=7311.6, xmax=7311.8, color='b', 
                 ax=ax[2], xlabel='Time (MJD)')
generate_RV_plot(RV_2["date"], RV_2['vel'], RV_2['err'], color='r', ax=ax[2])


# Add the legend
ax[0].legend()



plt.show()

### Exercise: In the code block below, perform the same analysis that we did above for Star 1, to determine if Star 2 is consistent with the null hypothesis of zero variability.

In [None]:
# Is Star 2 consistent with the null hypothesis?


We can conclude that Star 2 is also consistent with the null hypothesis.

### Wait a second...

...this is supposed to be a binary, but the radial velocities above are different. Is this difference significant? Again, we can use hypothesis testing.

### Question: What is the null hypothesis? How can we test it using the tools we've discussed thus far? Once you have a plan, go ahead and calculate a p-value.

**Answer: ??**

In [None]:
# Test the null hypothesis and calculate the p-value here


## The Central Limit Theorem (CLT)

Sample from an arbitrary distribution f(x), say N samples, and take their mean. The mean will not necessarily be the same as the mean of f(x). But if you repeat this a number of times, you'll see that, for sufficiently large N, the sample means are distributed approximately *normally* around the mean of f(x) with a standard deviation $\sigma_N = \sigma_{f(x)}/\sqrt{N}$, where $\sigma_{f(x)}$ is the spread of the original distribution.

Assumptions: 
* the initial distribution has a well-defined standard deviation (the tails of f(x) fall off more rapidly than $x^{-3}$)
* data are uncorrelated
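
To see the theorem at work on a distribution that is clearly not Gaussian, here is a minimal sketch using an exponential parent. The choice of parent, the seed, and the names `parent_scale`, `n_size`, `n_repeats`, and `demo_means` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Parent distribution (assumption): exponential with scale 2,
# which has mean 2 and standard deviation 2, and is strongly skewed
parent_scale = 2.0
n_size = 50         # number of samples averaged into each mean
n_repeats = 5000    # number of sample means to collect

# Draw all samples at once: one row per repeat, then average along each row
draws = rng.exponential(scale=parent_scale, size=(n_repeats, n_size))
demo_means = draws.mean(axis=1)

expected_spread = parent_scale / np.sqrt(n_size)
print("Mean of sample means   =", demo_means.mean(), "(parent mean = 2.0)")
print("Spread of sample means =", demo_means.std(), "(sigma/sqrt(N) =", expected_spread, ")")
```

Drawing a 2D array and averaging along one axis replaces the explicit loop over repeats; either approach gives the same kind of result.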

### CLT example

How does the spread of the sample mean change with the number of samples N? Let's compare the distributions of the sample means for N = 15 and N = 100. Let's also see how the spread of these distributions varies as a function of N.

In [None]:
# Select mean and spread of parent distribution f(x)
parent_mean = 6.0
parent_spread = 1.5

# Number of samples to average
Nsize_1 = 15

# define empty list to store sample means
sample_means = []

# Draw samples several times to see distribution
Nrepeats = 1000
for i in range(Nrepeats):
    # draw Nsize samples
    y = stats.norm.rvs(loc=parent_mean, scale=parent_spread, size=Nsize_1)
    # compute their mean
    y_mean = np.mean(y)
    # collect mean in a list
    sample_means.append(y_mean)
    
# Now select other number of samples to compare
Nsize_2 = 100
# define empty list to store sample means
sample_means_2 = []
for i in range(Nrepeats):
    y = stats.norm.rvs(loc=parent_mean, scale=parent_spread, size=Nsize_2)
    y_mean = np.mean(y)
    sample_means_2.append(y_mean)
    
# How does the spread of the distribution of sample means change with sample size N?
# To explore this dependence, select several sample sizes 
Ns = [5,10,50,100,200,500]
spread_N = []
for N in Ns:
    sample_means_i = []
    # Repeat drawing samples and averaging this many times
    Nrepeats = 100
    for i in range(Nrepeats):
        # draw N samples
        y = stats.norm.rvs(loc=parent_mean, scale=parent_spread, size=N)
        # compute their mean
        y_mean = np.mean(y)
        # collect mean in a list
        sample_means_i.append(y_mean)
    spread_N.append(np.std(sample_means_i))

In [None]:
# Plot histogram of sample means for both cases of N
fig, ax = plt.subplots(1,2, figsize = (9,3.5))
ax[0].hist(sample_means, histtype = 'step', label = 'N = %d'%Nsize_1,  bins = 10, density = True, linewidth=2)
ax[0].hist(sample_means_2, histtype = 'step', label = 'N = %d'%Nsize_2,  bins = 10, density = True, linewidth=2)
# plot also original distribution
ax[0].hist(stats.norm.rvs(loc=parent_mean, scale=parent_spread, size=2000), 
           histtype = 'step', label = 'parent', density = True, linewidth=2)
ax[0].legend(loc = 1)
ax[0].set_xlabel('$x$')
ax[0].set_ylabel('Probability density')
ax[0].set_ylim(ymax = 2.5)

# Plot spread of sample means versus N samples
# (spread_N is the list filled in the previous cell; edgecolor keeps white markers visible)
ax[1].scatter(Ns, spread_N, facecolor = 'w', edgecolor = 'k')
xs = np.arange(1,max(Ns),10)
ax[1].plot(xs, parent_spread/np.sqrt(xs), ls = 'dashed', c = 'k', label = r'$\sigma/\sqrt{N}$')
ax[1].set_ylabel('Spread of sample means')
ax[1].set_xlabel('N samples')
ax[1].set_xlim(xmin = -5)
ax[1].set_ylim(ymin = 0,ymax = 1)
ax[1].legend()

plt.show()

Bonus: the more you repeat the exercise, the more the distribution of sample means approaches Gaussianity (repeat with 500 iterations).

### Example 4: Show the CLT holds for the uniform distribution

Let's apply the CLT to the uniform distribution, using **x_min=5** and **x_max=26**. 

### Question: The CLT implies that the distribution of sample means converges on one value in the limit of large N. What is that value for this particular uniform distribution?

**Answer: ??**

### Exercise: Adapt the above code block for the Gaussian distribution to the uniform distribution

In [None]:
# Test the CLT for the uniform distribution here


### Exercise: Show that as you increase the number of sample means (Nrepeats), their distribution approaches a Gaussian. I will assert, without proof, that the standard deviation of single draws from a uniform distribution (Nsize=1, in the limit of large Nrepeats) is $(x_{\rm max} - x_{\rm min})/\sqrt{12}$. What are the parameters of the Gaussian that the sample means will converge to for Nsize = 1000?

In [None]:
# Code up the exercise here
