# Statistical Tests of Samples

## Authors
B.W. Holwerda

## Learning Goals
* How to test against a distribution
* How to test if two subsamples have been drawn from the same distribution.
* Kolmogorov-Smirnov test
* Anderson-Darling test

## Keywords
distributions, Kolmogorov-Smirnov test, Anderson-Darling test, K-S test, A-D test, 2 sample tests



## Companion Content


## Summary

In physics, we often have to compare a set of data to an idealized distribution, or we have two sets of data and need to know whether they come from the same parent distribution. Two statistical tests help answer both questions: the Kolmogorov-Smirnov test and the Anderson-Darling test.

<hr>


## Student Name and ID:



## Date:

<hr>

## Statistical tests of whether data come from the same parent sample or are consistent with a distribution

The **Kolmogorov-Smirnov (K-S)** test is a nonparametric test of the equality of continuous (or discontinuous), one-dimensional probability distributions. It can be used to compare a sample with a reference probability distribution (one-sample K-S test) or to compare two samples (two-sample K-S test). *The K-S statistic is the largest absolute difference between the two cumulative distributions: it approaches its highest value (1) when the two samples are dissimilar and its lowest value (0) when both are from the same parent distribution.*
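To illustrate what the K-S statistic measures, the minimal sketch below computes the largest gap between the empirical cumulative distribution of a sample and the normal CDF by hand, then compares it with the value reported by kstest. The synthetic sample, its size, and the seed are assumptions chosen for the example; they are not part of the GAMA data.

```python
import numpy as np
from scipy.stats import kstest, norm

# Synthetic stand-in sample (assumption): 500 draws from a standard normal
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

x = np.sort(sample)
n = x.size
model_cdf = norm.cdf(x)  # reference CDF evaluated at each sorted data point

# The empirical CDF jumps from (i-1)/n to i/n at the i-th sorted point,
# so the gap has to be checked on both sides of every jump.
d_plus = np.max(np.arange(1, n + 1) / n - model_cdf)
d_minus = np.max(model_cdf - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

result = kstest(sample, "norm")
print(f"manual D = {d_manual:.4f}, kstest D = {result.statistic:.4f}, p = {result.pvalue:.3f}")
```

The two D values should agree, and both lie between 0 and 1.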

The **Anderson-Darling (A-D)** test is a statistical test of whether a given sample of data is drawn from a given probability distribution. In its basic form, the test assumes that there are no parameters to be estimated in the distribution being tested, in which case the test and its set of critical values are distribution-free. However, the test is most often used in contexts where a family of distributions is being tested. When testing for normality with the mean and variance estimated from the data, the null hypothesis is rejected if the adjusted statistic $A^{*2}$ exceeds 0.631, 0.752, 0.873, or 1.035 at the 10%, 5%, 2.5%, and 1% significance levels, respectively.
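In practice the comparison against critical values can be read directly from the result of scipy.stats.anderson, which returns its own table of critical values and significance levels (for the normal case these are listed at 15%, 10%, 5%, 2.5%, and 1%, so the numbers will not match the ones quoted above exactly). The sketch below is a hedged illustration: the helper name rejected_levels and the two synthetic samples are assumptions made for this example.

```python
import numpy as np
from scipy.stats import anderson

# Synthetic stand-in samples (assumptions, not GAMA data)
rng = np.random.default_rng(0)
gaussian = rng.normal(size=1000)      # drawn from a normal distribution
skewed = rng.exponential(size=1000)   # clearly not normal


def rejected_levels(values):
    """Hypothetical helper: return the A-D statistic and the significance
    levels (in percent) at which normality is rejected."""
    res = anderson(values, dist="norm")
    levels = [
        float(level)
        for level, crit in zip(res.significance_level, res.critical_values)
        if res.statistic > crit
    ]
    return float(res.statistic), levels


stat_gauss, levels_gauss = rejected_levels(gaussian)
stat_skewed, levels_skewed = rejected_levels(skewed)

print(f"Gaussian sample: A2 = {stat_gauss:.3f}, rejected at {levels_gauss} %")
print(f"Skewed sample:   A2 = {stat_skewed:.3f}, rejected at {levels_skewed} %")
```

A larger statistic means a worse match to the normal distribution; the null hypothesis is rejected at every level whose critical value the statistic exceeds.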

SciPy provides two versions of each test: one where a single sample is tested against a distribution (kstest, anderson) and one where two samples are compared against each other (ks_2samp, anderson_ksamp). Both have their uses.
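As a hedged sketch of the two-sample versions, the example below compares one pair of samples drawn from the same parent distribution and one pair drawn from shifted distributions. All three samples are synthetic assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp, anderson_ksamp

# Synthetic stand-in samples (assumptions, not GAMA data)
rng = np.random.default_rng(1)
sample_a = rng.normal(loc=0.0, scale=1.0, size=300)
sample_b = rng.normal(loc=0.0, scale=1.0, size=300)  # same parent as sample_a
sample_c = rng.normal(loc=1.0, scale=1.0, size=300)  # parent shifted by one sigma

# ks_2samp takes the two samples as separate arguments
ks_same = ks_2samp(sample_a, sample_b)
ks_diff = ks_2samp(sample_a, sample_c)

# anderson_ksamp takes a list of samples; the result unpacks into
# (statistic, critical values, approximate p-value)
ad_same_stat, _, ad_same_p = anderson_ksamp([sample_a, sample_b])
ad_diff_stat, _, ad_diff_p = anderson_ksamp([sample_a, sample_c])

print(f"K-S same parent:    D = {ks_same.statistic:.3f}, p = {ks_same.pvalue:.3g}")
print(f"K-S shifted parent: D = {ks_diff.statistic:.3f}, p = {ks_diff.pvalue:.3g}")
print(f"A-D same parent:    A2 = {ad_same_stat:.3f}, p = {ad_same_p:.3g}")
print(f"A-D shifted parent: A2 = {ad_diff_stat:.3f}, p = {ad_diff_p:.3g}")
```

Note that anderson_ksamp only reports p-values between 0.1% and 25% and may emit a warning when the true value falls outside that range.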

In [1]:
import matplotlib.pyplot as plt
from astropy.io import ascii
import numpy as np
from scipy.stats import kstest
from scipy.stats import ks_2samp
from scipy.stats import anderson
from scipy.stats import anderson_ksamp

# reading in the GAMA data
data = ascii.read("GAMA.csv", format='csv', names=['cataid','Mstar','u-r','n','r50','sSFR'],fast_reader=False)




### Exercise 1 - test Gaussian distribution of values with K-S and A-D 

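The single-sample K-S and A-D tests compare a sample against a reference distribution, and both can be used for a variety of distributions (pretty flexible). The anderson function tests against the normal (Gaussian) distribution by default, while kstest needs the distribution named explicitly (for example 'norm'). Note that kstest with 'norm' compares against the standard normal (mean 0, standard deviation 1) unless a mean and standard deviation are passed through args.

Define a Gaussian distribution with np.random.normal and test it with both tests. Use a Gaussian sample of the *same size* as the data in the GAMA catalog read in above.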
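One pitfall worth seeing before starting: a sample can be perfectly Gaussian and still fail kstest if its mean and width differ from the standard normal. The sketch below is a hint under stated assumptions (a synthetic sample with mean 10 and standard deviation 2, not the catalog data) and shows two ways around it.

```python
import numpy as np
from scipy.stats import kstest

# Synthetic Gaussian sample that is NOT standard normal (assumption)
rng = np.random.default_rng(7)
sample = rng.normal(loc=10.0, scale=2.0, size=1000)

# Compared against N(0, 1): fails even though the sample is Gaussian
naive = kstest(sample, "norm")

# Option 1: pass the mean and standard deviation of the reference distribution
with_args = kstest(sample, "norm", args=(10.0, 2.0))

# Option 2: standardize the sample first, then compare against N(0, 1)
standardized = kstest((sample - sample.mean()) / sample.std(), "norm")

print(f"naive:        D = {naive.statistic:.3f}")
print(f"with args:    D = {with_args.statistic:.3f}")
print(f"standardized: D = {standardized.statistic:.3f}")
```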

In [2]:
# student work


### Exercise 2 -- sample size

Now increase the sample size by a lot (100x the above sample) for the Gaussian test distribution. 

What happens to the K-S test value and the A-D test value?

In [3]:
#student work


*your answer here*

### Exercise 3 --  Test Mstar sample against the normal distribution with K-S and A-D

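As in Exercise 1, the K-S and A-D tests compare the sample against a reference distribution, here the normal (Gaussian) distribution. Because kstest with 'norm' compares against the standard normal, either standardize the Mstar values first or pass their mean and standard deviation through args.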

Is this distribution Gaussian according to the K-S test? Is it Gaussian according to the A-D test at 2.5% significance?

In [4]:
# student work


*your answer here*

### Exercise 4 -- Least and most Gaussian of the data 

Run through all the columns in the data and determine which one is the most and which one is the least like a Gaussian distribution.
Use the K-S test and the A-D test at 5% (p=0.05) significance.


In [5]:
# student work


*your answer here*

### Exercise 5 -- plot the K-S test values against the A-D values

Plot the test values for each column in the GAMA data. Do both test values agree?


In [6]:
# student work


### Exercise 6 -- Plot the histogram of the most and least Gaussian data distributions.

In [7]:
# student work


In [8]:
# student work


### Exercise 7 -- Compare the most and least Gaussian distribution to the Gaussian distribution in a normalized, cumulative histogram.

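The first set of distributions should be pretty close according to the K-S and A-D tests.
Use: plt.hist(data,bins=100,cumulative=True,histtype='step',density=True); (density replaces the older normed keyword, which recent Matplotlib versions no longer accept.)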
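A minimal sketch of such a plot is given below. The lognormal stand-in column, the matching Gaussian reference, and the axis labels are assumptions for illustration; the real exercise should use the catalog columns identified in Exercise 4.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins (assumptions, not GAMA data)
rng = np.random.default_rng(3)
values = rng.lognormal(mean=0.0, sigma=0.5, size=2000)
# Gaussian with the same mean and standard deviation as the stand-in column
reference = rng.normal(values.mean(), values.std(), size=values.size)

fig, ax = plt.subplots()
# density=True together with cumulative=True makes the curve end at 1
counts, edges, _ = ax.hist(values, bins=100, cumulative=True,
                           histtype="step", density=True, label="data (stand-in)")
ax.hist(reference, bins=100, cumulative=True,
        histtype="step", density=True, label="Gaussian")
ax.set_xlabel("value")
ax.set_ylabel("cumulative fraction")
ax.legend(loc="lower right")
# In a script (outside a notebook), call plt.show() to display the figure.
```

The largest vertical gap between the two curves is what the K-S statistic measures.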

In [9]:
# student work here



### Exercise 8 -- Comparing two samples.

First we draw two random samples from the Mstar data to compare. These should be indistinguishable from each other. If the draw is truly random, there should be no difference, and the 2-sample versions of the K-S and A-D tests should reflect that. Pick 100 objects at random with np.random.choice.

What do both tests conclude?
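The drawing step can be sketched as follows, assuming a synthetic stand-in array in place of the Mstar column; the seed and the choice of replace=False (so that no object is picked twice within one draw) are also assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

np.random.seed(5)  # fixed seed so the draw is repeatable
# Synthetic stand-in for the Mstar column (assumption)
parent = np.random.normal(10.5, 0.6, size=5000)

# Two independent draws of 100 objects each, without repeats inside a draw
draw_1 = np.random.choice(parent, size=100, replace=False)
draw_2 = np.random.choice(parent, size=100, replace=False)

result = ks_2samp(draw_1, draw_2)
print(f"D = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```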

In [10]:
# student work here


*your answer here*

### Exercise 9 -- samples based on a criterion

Now we will select the blue and red galaxies using the u-r color criterion and np.where to split the Mstar sample into two distinct populations: blue galaxies with u-r > 0 and red galaxies with u-r < 0.

Using the K-S and A-D 2-sample tests, are these populations from the same distribution?
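The splitting step might look like the sketch below. Both arrays are synthetic stand-ins, and the correlation between them is built in by hand, so the outcome says nothing about the actual GAMA populations; only the np.where pattern carries over.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins (assumptions, not GAMA data)
rng = np.random.default_rng(11)
color = rng.normal(0.0, 1.0, size=2000)                      # stand-in for u-r
mass = 10.0 + 0.5 * color + rng.normal(0.0, 0.3, size=2000)  # stand-in for Mstar

# np.where with a single condition returns a tuple; [0] is the index array
blue_idx = np.where(color > 0)[0]
red_idx = np.where(color < 0)[0]

mass_blue = mass[blue_idx]
mass_red = mass[red_idx]

result = ks_2samp(mass_blue, mass_red)
print(f"{mass_blue.size} blue, {mass_red.size} red")
print(f"D = {result.statistic:.3f}, p = {result.pvalue:.3g}")
```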



*your answer here*