In [1]:
## Import required Python modules
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy, scipy.stats
import io
import base64
#from IPython.core.display import display
from IPython.display import display, HTML, Image
from urllib.request import urlopen

try:
    import astropy as apy
    import astropy.table
    _apy = True
    #print('Loaded astropy')
except ImportError:
    _apy = False
    #print('Could not load astropy')

## Customising the font size of figures
plt.rcParams.update({'font.size': 14})

## Customising the look of the notebook
display(HTML("<style>.container { width:95% !important; }</style>"))
## This custom file is adapted from https://github.com/lmarti/jupyter_custom/blob/master/custom.include
HTML(filename='custom.css')
#HTML(urlopen('https://raw.githubusercontent.com/bretonr/intro_data_science/master/custom.css').read().decode('utf-8'))

In [2]:
## Custom imports
from scipy.stats import binom, poisson, chi2, norm, uniform
from scipy.optimize import curve_fit
from math import ceil, pi
from numpy import exp
from matplotlib.collections import PatchCollection
from matplotlib.patches import Circle, Rectangle
from matplotlib.colors import makeMappingArray

In [3]:
## Adding a button to hide the Python source code
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the Python code."></form>''')

<div class="container-fluid">
    <div class="row">
        <div class="col-md-9" align="center">
            <h1>PHYS10791: Introduction to Data Science</h1>
            <!--<h3>2019-2020 Academic Year</h3><br>-->
        </div>
        <div class="col-md-3">
            <img align='center' style="border-width:0" src="images/UoM_logo.png"/>
        </div>
    </div>
</div>

<div class="container-fluid">
    <div class="row">
        <div class="col-md-3" align="right">
            <b>Course instructors:&nbsp;&nbsp;</b>
        </div>
        <div class="col-md-9" align="left">
            <a href="http://www.renebreton.org">Prof. Rene Breton</a> - Twitter <a href="https://twitter.com/BretonRene">@BretonRene</a><br>
            <a href="http://www.hep.manchester.ac.uk/u/gersabec">Dr. Marco Gersabeck</a> - Twitter <a href="https://twitter.com/MarcoGersabeck">@MarcoGersabeck</a>
        </div>
    </div>
</div>

# Chapter 7 - Summary Sheet

## Topics

**[7 Goodness of fit tests](#7-Goodness-of-fit-tests)**

**[7.1 Introduction](#7.1-Introduction)**

**[7.2 Chi-squared test](#7.2-Chi-squared-test)**
- 7.2.1 General formulae
- 7.2.2 Application
- 7.2.3 Rescaling $\chi^2$ distributions

**[7.3 Comparing fit models](#7.3-Comparing-fit-models)**
- 7.3.1 Definitions
- 7.3.2 Interpretation
- 7.3.3 Examples

**[7.4 Two-sample problem](#7.4-Two-sample-problem)**
- 7.4.1 Comparing samples with known $\sigma$

**[7.5 Kolmogorov-Smirnov test and its application to the two-sample problem](#7.5-Kolmogorov-Smirnov-test-and-its-application-to-the-two-sample-problem)**
- 7.5.1 The Kolmogorov-Smirnov test
- 7.5.2 The Kolmogorov-Smirnov test with two samples

### 7.1 Introduction

Goodness-of-fit tests assess and compare the quality of fits.

### 7.2 Chi-squared test

#### 7.2.1 General formulae

The formulae below are a recap from Chapter 5.

Main formula to calculate $\chi^2$ for $n$ measurements $y_i$ with uncertainties $\sigma_i$ and model predictions $f(x_i)$:

$$\chi^2=\sum_{i=1}^n\frac{(y_i-f(x_i))^2}{\sigma_i^2}.$$

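Main formula on which the test is based, i.e. the probability of obtaining a $\chi^2$ at least as large as the observed one for $N$ degrees of freedom: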

$${\rm Prob}(\chi^2;N)=\int_{\chi^2}^{\infty}P(\chi'^2;N)d\chi'^2.$$

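It is important to distinguish $n$ and $N$: $n$ is the number of data points entering the sum, while $N$ is the number of degrees of freedom, i.e. $n$ minus the number of fitted parameters.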
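The two formulae above can be sketched in a few lines of Python. This is only an illustration: the data values, the helper names `chi2_statistic` and `chi2_probability`, and the assumption of two fitted parameters are invented for the example. The integral is evaluated with the survival function `scipy.stats.chi2.sf`.

```python
import numpy as np
from scipy.stats import chi2

def chi2_statistic(y, f, sigma):
    """Sum of squared, uncertainty-weighted residuals."""
    y, f, sigma = (np.asarray(a, dtype=float) for a in (y, f, sigma))
    return float(np.sum(((y - f) / sigma) ** 2))

def chi2_probability(chi2_value, n_dof):
    """Probability of a chi2 at least this large for n_dof degrees of freedom."""
    return float(chi2.sf(chi2_value, n_dof))

# Illustrative numbers only (not taken from the course material)
y = np.array([1.1, 1.9, 3.2, 3.9])      # measurements y_i
f = np.array([1.0, 2.0, 3.0, 4.0])      # model predictions f(x_i)
sigma = np.full(4, 0.1)                 # uncertainties sigma_i

chi2_value = chi2_statistic(y, f, sigma)
n_dof = len(y) - 2                      # assumption: 2 fitted parameters
p_value = chi2_probability(chi2_value, n_dof)
print(f"chi2 = {chi2_value:.2f}, N = {n_dof}, Prob = {p_value:.4f}")
```

Note that the number of terms in the sum (4 here) differs from the number of degrees of freedom used in the probability (2 here).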

#### 7.2.2 Application

Setting a threshold of $\chi^2$ or $\chi^2/N$ requires taking into account the corresponding probability.

As a consequence, a unique $\chi^2/N$ threshold for all $N$ does not make sense.
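A short numerical check makes this point concrete. The sketch below evaluates the probability for a fixed $\chi^2/N$ at several values of $N$; the value 1.5 and the list of $N$ are arbitrary choices for illustration.

```python
from scipy.stats import chi2

reduced_chi2 = 1.5                 # arbitrary chi2/N used for illustration
dofs = [2, 10, 50, 200]            # arbitrary numbers of degrees of freedom

# Same chi2/N, very different probabilities
p_values = {N: float(chi2.sf(reduced_chi2 * N, N)) for N in dofs}
for N, p in p_values.items():
    print(f"N = {N:3d}: chi2 = {reduced_chi2 * N:6.1f}, Prob = {p:.3g}")
```

The probability drops steadily as $N$ grows, so a cut on $\chi^2/N$ alone corresponds to a different significance for each $N$.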

##### Example: fit to binned dataset

When applying a $\chi^2$ test to a binned dataset, the number of measurements is the number of bins, and the uncertainty on each measurement is the uncertainty on the count in that bin. In most cases this is the Poisson error, i.e. the square root of the count.
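As a minimal sketch of this recipe, the code below computes $\chi^2$ for a histogram. The bin contents and the model prediction are hypothetical numbers, and the model is assumed to be fixed in advance (no fitted parameters), so the number of degrees of freedom is taken to be the number of bins.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical bin contents and model prediction (illustrative numbers only)
observed = np.array([12, 25, 38, 24, 11], dtype=float)
expected = np.array([10, 25, 40, 25, 10], dtype=float)

# Poisson error on each bin: square root of the observed count.
# Empty bins would give sigma = 0 and need separate treatment.
sigma = np.sqrt(observed)

chi2_value = float(np.sum(((observed - expected) / sigma) ** 2))
n_dof = len(observed)              # assumption: no parameters were fitted
p_value = float(chi2.sf(chi2_value, n_dof))
print(f"chi2 = {chi2_value:.3f} for {n_dof} bins, Prob = {p_value:.3f}")
```

If parameters of the model had been fitted to the histogram, their number would have to be subtracted from `n_dof`.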

#### 7.2.3 Rescaling $\chi^2$ distributions

[SECTION 7.2.3 WILL NOT BE COVERED IN THE LECTURE AND IS NOT EXAMINABLE]

### 7.3 Comparing fit models

The comparison of fit models discussed here is based on comparing the relative information loss of different models.

#### 7.3.1 Definitions

##### The Akaike Information Criterion

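$$AIC \equiv -2\ln {\mathcal L}(x|\hat{a},M) + 2k,$$

where ${\mathcal L}(x|\hat{a},M)$ is the likelihood of the data $x$ for model $M$ evaluated at the best-fit parameters $\hat{a}$, and $k$ is the number of free parameters of the model.

##### The Bayesian Information Criterion

$$BIC \equiv -2\ln {\mathcal L}(x|\hat{a},M) + k\ln n,$$

where $n$ is the number of data points.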

#### 7.3.2 Interpretation

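In the case of a least-squares minimisation with Gaussian uncertainties, $-2\ln{\mathcal L}$ equals $\chi^2$ up to a constant that is the same for all models, so the two criteria can be written as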

$$AIC = \chi^2 + 2k,$$

and

$$BIC = \chi^2 + k\ln n.$$

The probability that model $i$ is the one minimising the information loss is proportional to

$$e^{(AIC_{\rm min} - AIC_i)/2},$$

where $AIC_{\rm min}$ is the smallest AIC value among the models being compared.
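The least-squares forms of the criteria and the relative probability can be sketched as follows. The $\chi^2$ values, parameter counts and number of data points are hypothetical, and the function names are chosen for this example only.

```python
import numpy as np

def aic(chi2_value, k):
    """AIC for a least-squares fit with k free parameters."""
    return chi2_value + 2 * k

def bic(chi2_value, k, n):
    """BIC for a least-squares fit with k free parameters and n data points."""
    return chi2_value + k * np.log(n)

def relative_likelihoods(aic_values):
    """exp((AIC_min - AIC_i) / 2) for each model i."""
    aic_values = np.asarray(aic_values, dtype=float)
    return np.exp((aic_values.min() - aic_values) / 2.0)

# Hypothetical fit results for two models fitted to the same n = 20 points
n = 20
models = {"linear": (25.0, 2), "quadratic": (18.0, 3)}   # (chi2, k)

aic_values = [aic(c, k) for c, k in models.values()]
bic_values = [bic(c, k, n) for c, k in models.values()]
weights = relative_likelihoods(aic_values)

for name, a, b, w in zip(models, aic_values, bic_values, weights):
    print(f"{name:9s}: AIC = {a:.2f}, BIC = {b:.2f}, relative likelihood = {w:.3f}")
```

With these numbers the quadratic model has the lower AIC, and the linear model is about 0.08 times as probable to minimise the information loss.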

### 7.4 Two-sample problem

We often encounter situations where, rather than comparing a dataset to a fit curve, we need to compare two datasets. The questions asked can vary, from the comparison of a single parameter describing an aspect of their shape, e.g. mean or width, to a general test of whether two distributions are consistent with being drawn from a single parent distribution.

#### 7.4.1 Comparing samples with known $\sigma$

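For two measurements, $x_1$ and $x_2$, with known uncertainties, $\sigma_1$ and $\sigma_2$, the question is whether their difference is compatible with zero:

$$x_1-x_2=0?$$

For independent measurements, the variance of the difference is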

$$V_{12} = \sigma_1^2 + \sigma_2^2.$$

Compare the difference, $x_1-x_2$, to the combined uncertainty $\sigma_{12} = \sqrt{V_{12}}$.
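A minimal sketch of this comparison is given below. The measurement values are invented, the function name is hypothetical, and converting the number of standard deviations into a two-sided Gaussian probability is an assumption added for the example (it presumes Gaussian, independent uncertainties).

```python
import numpy as np
from scipy.stats import norm

def compare_measurements(x1, sigma1, x2, sigma2):
    """Return sigma_12, the difference in units of sigma_12, and a two-sided p-value."""
    sigma12 = np.sqrt(sigma1**2 + sigma2**2)    # sqrt of V_12
    n_sigma = (x1 - x2) / sigma12
    p_value = 2.0 * norm.sf(abs(n_sigma))       # assumes Gaussian uncertainties
    return float(sigma12), float(n_sigma), float(p_value)

# Illustrative measurements only
sigma12, n_sigma, p_value = compare_measurements(10.3, 0.3, 9.5, 0.4)
print(f"sigma_12 = {sigma12:.2f}, difference = {n_sigma:.2f} sigma, p = {p_value:.3f}")
```

Here the two values differ by 1.6 combined standard deviations, which would usually not be regarded as a significant disagreement.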

### 7.5 Kolmogorov-Smirnov test and its application to the two-sample problem

#### 7.5.1 The Kolmogorov-Smirnov test

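The KS test compares the normalised cumulative distribution of the data, ${\rm cum}(x)$, with that of the expected distribution, ${\rm cum}(P)$, and evaluates their greatest difference:

$$D={\rm max}|{\rm cum}(x)-{\rm cum}(P)|.$$

This needs to be scaled according to the sample size, $N$: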

$$d = D \sqrt{N}.$$

The value of $d$ is then compared to a table of critical values, $c(\alpha)$, where $\alpha$ is the significance level of the test. The hypothesis that the two distributions are compatible is retained if $d<c(\alpha)$ and rejected otherwise. (The tabulated values of $c(\alpha)$ do not need to be learned by heart.)
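The calculation of $D$ and $d$ can be sketched as below for a sample tested against a standard normal distribution. The sample is randomly generated for illustration and the helper `ks_distance` is a hypothetical name. Instead of a table lookup, the cross-check uses `scipy.stats.kstest`, which returns the same $D$ together with a p-value.

```python
import numpy as np
from scipy.stats import kstest, norm

def ks_distance(sample, cdf):
    """Greatest distance between the empirical and the model cumulative distribution."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    model = cdf(x)
    # The empirical distribution is a step function: check both sides of each step
    d_plus = np.max(np.arange(1, n + 1) / n - model)
    d_minus = np.max(model - np.arange(0, n) / n)
    return float(max(d_plus, d_minus))

rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=0.0, scale=1.0, size=200)   # illustrative sample

D = ks_distance(sample, norm.cdf)
d = D * np.sqrt(len(sample))                        # scaled for the sample size
result = kstest(sample, norm.cdf)                   # SciPy cross-check
print(f"D = {D:.4f}, d = {d:.3f}, SciPy D = {result.statistic:.4f}, p = {result.pvalue:.3f}")
```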

#### 7.5.2 The Kolmogorov-Smirnov test with two samples

For a two-sample test the formula becomes

$$D={\rm max}|{\rm cum}(x)-{\rm cum}(y)|,$$

with the normalisation

$$d=\sqrt{\frac{N_xN_y}{N_x+N_y}}D.$$
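The two-sample version can be sketched in the same way. The two samples are randomly generated for illustration, the helper name is hypothetical, and `scipy.stats.ks_2samp` is used as a cross-check of $D$ and to obtain a p-value in place of a table of critical values.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_two_sample_distance(x, y):
    """Greatest distance between the cumulative distributions of two samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.concatenate([x, y])                   # all points where a step occurs
    cum_x = np.searchsorted(x, grid, side="right") / len(x)
    cum_y = np.searchsorted(y, grid, side="right") / len(y)
    return float(np.max(np.abs(cum_x - cum_y)))

rng = np.random.default_rng(seed=7)
x = rng.normal(loc=0.0, scale=1.0, size=150)        # illustrative sample x
y = rng.normal(loc=0.5, scale=1.0, size=100)        # illustrative sample y

D = ks_two_sample_distance(x, y)
n_x, n_y = len(x), len(y)
d = np.sqrt(n_x * n_y / (n_x + n_y)) * D            # two-sample normalisation
result = ks_2samp(x, y)                             # SciPy cross-check
print(f"D = {D:.4f}, d = {d:.3f}, SciPy D = {result.statistic:.4f}, p = {result.pvalue:.3g}")
```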

<div class="well" align="center">
    <div class="container-fluid">
        <div class="row">
            <div class="col-md-3" align="center">
                <img align="center" alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" width="60%">
            </div>
            <div class="col-md-8">
            This work is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
            </div>
        </div>
    </div>
    <br>
    <br>
    <i>Note: The content of this Jupyter Notebook is provided for educational purposes only.</i>
</div>