In [1]:
# Here are some necessary packages that we need to import to run this notebook

import numpy as np
import matplotlib.pyplot as plt

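import os

# Mount Google Drive only when running inside Google Colab.
# Outside Colab the import fails, so we skip the mount and stay in the current directory.
try:
    from google.colab import drive
    drive.mount("/content/drive/")
    os.chdir("/content/drive/My Drive/DSECOP/Colab Notebooks/")
except ImportError:
    print("Not running in Google Colab; skipping the Drive mount.")

print(os.getcwd())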

# Quantifying goodness of fit BONUS notebook: The Kolmogorov-Smirnov goodness of fit test



In this notebook, we've focused on the $\chi^2$ test statistic. But this test is only truly useful for categorical data (i.e. datasets of counts or frequencies). For histogram-based analyses, the number of "categories" could be taken to be the number of bins, but it is often easier to work with normalized histograms anyway. In that case the raw counts are no longer available, and the $\chi^2$ test statistic might fail to gauge goodness-of-fit for a given model.

A useful goodness-of-fit procedure for histogram-based analyses is the *Kolmogorov-Smirnov* test.

Consider a data sample of size $n$ with an empirical probability distribution $p_n(x)$ (in other words, the normalized histogram) and a cumulative distribution function (CDF) $F_n(x)$. Let's say that you want to test whether the data can be well-modeled by the probability distribution $p_0(x)$ with a cumulative distribution function $F_0(x)$. The Kolmogorov-Smirnov test statistic is then defined as

$$D_n = \max_x |F_n(x) - F_0(x)|.$$

In essence, we're calculating the largest difference between the CDFs of the data and the postulated model. (Try to prove to yourself that for a model that perfectly fits the data, $D_n = 0$.)
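To make the definition concrete, here is a minimal sketch that evaluates $D_n$ for *unbinned* data (not the binned version asked for in the activity further down). The toy sample, the standard-normal model, and the helper name `ks_statistic_unbinned` are assumptions made for illustration only. Because the empirical CDF is a step function, the largest gap can occur just above or just below each step, so both sides are checked.

```python
import numpy as np
from scipy import stats

def ks_statistic_unbinned(sample, model_cdf):
    """Return D_n = max_x |F_n(x) - F_0(x)| for an unbinned sample."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    f0 = model_cdf(x)  # model CDF F_0 evaluated at each sorted data point

    # The empirical CDF F_n jumps from (i-1)/n to i/n at the i-th sorted point,
    # so compare F_0 with the value just after and just before each jump.
    d_plus = np.max(np.arange(1, n + 1) / n - f0)
    d_minus = np.max(f0 - np.arange(0, n) / n)
    return max(d_plus, d_minus)

# Toy data (assumed): 200 draws from a standard normal distribution
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

d_n = ks_statistic_unbinned(sample, stats.norm(loc=0.0, scale=1.0).cdf)
print(f"D_n = {d_n:.4f}")
```

Since the sample here really is drawn from the model, $D_n$ should be small but not exactly zero: a finite sample never reproduces the model CDF perfectly.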

Each pairing of sample size $n$ and significance level $\alpha$ is associated with a critical value $d_{n, \alpha}$ of the test statistic. You can find these in a [lookup table](https://www.real-statistics.com/statistics-tables/two-sample-kolmogorov-smirnov-table/). If $D_n > d_{n, \alpha}$, then the null hypothesis (that the data are drawn from the postulated model) is rejected at significance level $\alpha$. 
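For large $n$, a critical value can also be approximated without a table. The sketch below uses the limiting distribution of $\sqrt{n}\,D_n$, available as `scipy.stats.kstwobign`. This is an asymptotic approximation for the one-sample test, not a replacement for tabulated values at small $n$; the helper name and the choices of `n` and `alpha` are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def ks_critical_value_asymptotic(n, alpha):
    """Large-n approximation to the one-sample critical value d_{n, alpha}."""
    # kstwobign describes sqrt(n) * D_n in the limit of large n,
    # so its (1 - alpha) quantile divided by sqrt(n) approximates d_{n, alpha}.
    return stats.kstwobign.ppf(1.0 - alpha) / np.sqrt(n)

n, alpha = 200, 0.05  # illustrative values (assumed)
d_crit = ks_critical_value_asymptotic(n, alpha)
print(f"Approximate d_(n={n}, alpha={alpha}) = {d_crit:.4f}")
```

For $\alpha = 0.05$ this reproduces the familiar rule of thumb $d_{n, \alpha} \approx 1.36 / \sqrt{n}$.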

*Note*: the Kolmogorov-Smirnov test is defined for *unbinned* data with a continuous CDF. Here, we're asking you to run the test on Poisson-distributed data, which is discrete (as you can only measure an integer number of counts). So take this example with a grain of salt. In writing code to do a pseudo-Kolmogorov-Smirnov test, you'll gain a conceptual understanding of what the test actually measures. But you won't be able to meaningfully accept or reject the null hypothesis. 



**Activity**: write code to run a *binned* Kolmogorov-Smirnov test on the (1) Gaussian and (2) Poisson fits to the data. Plot the CDFs of the two fits and of the actual data on the same plot. For each fit, print out the test statistic $D_n$ and the bin in which it occurs.

In [2]:
def KS_test(pdf_1, pdf_2):

  # Your code here
  return 

The relevant built-in function for the Kolmogorov-Smirnov test is ```scipy.stats.kstest()```. 

The function takes three main arguments: 

1. ```rvs```: a ```np.array``` (or ```list```) containing the raw observed data. Note that this is NOT the histogrammed data. 
2. ```cdf```: this is usually a string describing the random variable of the model that you want to fit. If you wanted to fit a Gaussian distribution, you could use the string ```"norm"```.
3. ```args```: parameters of the model. For a Gaussian distribution, this would be ```(mu, sigma)```.

The function returns the test statistic $D_n$ and the associated $p$-value. 

You can read more about the function [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest).
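As a minimal sketch of the call described above, the example below tests a toy Gaussian sample against a Gaussian model. The data and the values of `mu` and `sigma` are assumptions chosen for illustration; in practice you would pass your own raw measurements.

```python
import numpy as np
from scipy import stats

# Toy raw (unbinned) data, assumed for illustration
rng = np.random.default_rng(seed=1)
x = rng.normal(loc=5.0, scale=2.0, size=500)

# Model parameters under test, fixed in advance rather than fit to x
mu, sigma = 5.0, 2.0

result = stats.kstest(x, "norm", args=(mu, sigma))
print(f"D_n = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
```

One caveat worth keeping in mind: the returned $p$-value assumes the model parameters were specified independently of the data. If `mu` and `sigma` are estimated from the same sample, the $p$-value tends to be too large.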

---

Another common application of the Kolmogorov-Smirnov test is to compare two samples to determine whether they came from the same underlying distribution, as opposed to comparing a single sample to a model. In that case, you might want to use the function ```scipy.stats.ks_2samp()```.


The function takes two main arguments: 

1. ```data1```: a ```np.array``` (or ```list```) containing the raw observed data of sample 1. Note that this is NOT the histogrammed data. 
2. ```data2```: a ```np.array``` (or ```list```) containing the raw observed data of sample 2.


The function returns the test statistic $D_n$ and the associated $p$-value. 


You can read more about the function [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html).
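A short sketch of the two-sample test, using toy samples that are assumptions for illustration: two drawn from the same Gaussian and a third drawn from a shifted one. The two samples do not need to have the same size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
sample_a = rng.normal(loc=0.0, scale=1.0, size=300)
sample_b = rng.normal(loc=0.0, scale=1.0, size=400)  # same distribution as sample_a
sample_c = rng.normal(loc=1.0, scale=1.0, size=400)  # mean shifted by one sigma

same = stats.ks_2samp(sample_a, sample_b)
shifted = stats.ks_2samp(sample_a, sample_c)

print(f"same model:    D = {same.statistic:.4f}, p-value = {same.pvalue:.4f}")
print(f"shifted model: D = {shifted.statistic:.4f}, p-value = {shifted.pvalue:.2e}")
```

The shifted pair should give a noticeably larger test statistic and a much smaller $p$-value than the pair drawn from the same distribution.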

*Move on to notebook 04*