# Pearson $\chi^2$ test
**Assumptions**:  
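1. $\Delta y \gg \Delta x$ (the uncertainty in $x$ is negligible compared with that in $y$)
2. the error on each $y_i$ is Gaussian with standard deviation $\sigma_i$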
3. all data points are independent

Given a model $y = y(x, m)$ with parameters $m$, these assumptions give the probability of observing the data:
\begin{equation}
\begin{split}
P_i &= \frac{1}{\sqrt{2\pi\,\sigma_i^2}}\exp\Bigl(-\frac{(y(x_i, m)-y_i)^2}{2\sigma_i^2}\Bigr)\\
P &= \prod_i P_i\\
&=\Bigl[\prod_i \frac{1}{\sqrt{2\pi\,\sigma_i^2}}\Bigr] \exp\Bigl(-\frac{1}{2} \sum_i \frac{(y(x_i, m)-y_i)^2}{\sigma_i^2}\Bigr)
\end{split}
\end{equation}
**Define $\chi^2$** (maximizing $P$ with respect to $m$ is then equivalent to minimizing $\chi^2$):
\begin{equation}
\chi^2 = \sum_i \frac{(y(x_i, m)-y_i)^2}{\sigma_i^2}
\end{equation}

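## If $y(x, m)=a + bx$ is a linear model:
Define the weighted sums
$S = \sum_i\frac{1}{\sigma_i^2}$, $S_x = \sum_i\frac{x_i}{\sigma_i^2}$, $S_y = \sum_i\frac{y_i}{\sigma_i^2}$, $S_{xx} = \sum_i\frac{x_i^2}{\sigma_i^2}$, $S_{xy} = \sum_i\frac{x_i y_i}{\sigma_i^2}$.

Setting $\partial\chi^2/\partial a = \partial\chi^2/\partial b = 0$ gives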
\begin{align}
\Delta &= S\,S_{xx} - S_x^2\\
a &= \frac{S_{xx}S_y - S_xS_{xy}}{\Delta}\\
b &= \frac{S\,S_{xy} - S_xS_y}{\Delta}
\end{align}
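
As a quick numerical check of the closed-form solution, the sketch below evaluates the sums directly and compares the result with `np.polyfit`. It takes $S$, $S_y$, $S_{xx}$ and $S_{xy}$ to be the same kind of inverse-variance weighted sums as $S_x$; the function name `linear_fit_sums` and the synthetic data are illustrative choices, not part of the notes.

```python
import numpy as np

def linear_fit_sums(x, y, sigma):
    """Weighted least-squares line y = a + b*x from the closed-form sums."""
    # Inverse-variance weights, broadcast so a scalar sigma also works
    w = np.ones_like(x, dtype=float) / np.asarray(sigma, dtype=float)**2
    S, Sx, Sy = w.sum(), (w*x).sum(), (w*y).sum()
    Sxx, Sxy = (w*x*x).sum(), (w*x*y).sum()
    Delta = S*Sxx - Sx**2
    a = (Sxx*Sy - Sx*Sxy) / Delta
    b = (S*Sxy - Sx*Sy) / Delta
    return a, b

rng = np.random.default_rng(0)
x = np.linspace(1, 4, 10)
sigma = 0.05
y = 1 + 2*x + rng.normal(0.0, sigma, x.size)

a, b = linear_fit_sums(x, y, sigma)
# np.polyfit expects weights 1/sigma (not 1/sigma**2) and returns [slope, intercept]
b_ref, a_ref = np.polyfit(x, y, 1, w=np.full(x.size, 1.0/sigma))
print(a, b)
print(a_ref, b_ref)
```

Both pairs should agree to floating-point precision, since `np.polyfit` with these weights minimizes the same $\chi^2$.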

### Another form of expression
\begin{align}
t_i &=\frac{1}{\sigma_i}\Bigl[x_i - \frac{S_x}{S}\Bigr]\\
S_{tt} & = \sum_i t_i^2\\
b & = \frac{1}{S_{tt}} \sum_i \frac{t_i\, y_i}{\sigma_i}\\
a & = \frac{S_y -S_x\,b}{S}
\end{align}

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib osx
#plt.style.use('ggplot')
import seaborn as sns

In [6]:
x = np.linspace(1, 4, 10)
sigma = 0.05


def linearFit(x, y, sigma=1.0):
    """Weighted least-squares fit of y = a + b*x; returns a, b and their errors."""
    # Broadcast a scalar or array-like sigma to one value per data point
    sigma = np.asarray(sigma, dtype=float) * np.ones(len(x))
    Sx = np.sum(x/sigma**2)
    Sy = np.sum(y/sigma**2)
    S = np.sum(1/sigma**2)
    t = 1/sigma * (x - Sx/S)
    Stt = np.sum(t**2)
    b = 1.0/(Stt) * np.sum(t * y / sigma)
    a = (Sy - Sx * b) / S

    # Analytic parameter uncertainties
    Sxx = np.sum(x**2/sigma**2)
    Delta = S*Sxx - Sx**2
    da = np.sqrt(Sxx/Delta)
    db = np.sqrt(S/Delta)
    return a, b, da, db

aList = []
bList = []
daList = []
dbList = []
chisqList = []
for i in range(10000):
    y = 2*x + 1 + np.random.normal(0.0, scale = sigma, size = len(x))
    a, b, da, db = linearFit(x, y, sigma)
    chisq = np.sum((y - a - b*x)**2/sigma**2)
    chisqList.append(chisq)
    aList.append(a)
    bList.append(b)
    daList.append(da)
    dbList.append(db)
plt.close('all')   

da = daList[0]
db = dbList[0]
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(121)
na, bina = np.histogram(aList, bins=30)
na = na/float(na.max())
bina = (bina[:-1] + bina[1:])/2
binsizea = bina[1] - bina[0]
ax1.plot(bina + 0.5 * binsizea, na, drawstyle='steps')
ax1.plot(bina, np.exp(-(bina - 1.0)**2/(2*da**2)))
ax1.set_xlabel('$a$')
ax2 = fig.add_subplot(122)
nb, binb = np.histogram(bList, bins=30)
nb = nb/float(nb.max())
binb = (binb[:-1] + binb[1:])/2
binsizeb = binb[1] - binb[0]
ax2.plot(binb + 0.5 * binsizeb, nb, drawstyle='steps')
ax2.plot(binb, np.exp(-(binb - 2.0)**2/(2*db**2)))
ax2.set_xlabel('$b$')
for ax in fig.axes:
    ax.set_ylabel('P density')


fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(aList, bList, xerr=da, yerr=db, fmt='.', ms=3, lw=0.1)


## Error Estimation
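Since $a$ and $b$ are linear in the $y_i$, propagating the independent errors $\sigma_i$ gives $\sigma_a^2 = \sum_i \sigma_i^2 \bigl(\partial a/\partial y_i\bigr)^2$, and likewise for $b$, with
\begin{align}
\frac{\partial{a}}{\partial{y_i}} &= \frac{1}{\Delta} \Bigl[S_{xx}\frac{1}{\sigma_i^2}-S_x\frac{x_i}{\sigma_i^2}\Bigr]\\
\frac{\partial{b}}{\partial{y_i}} &= \frac{1}{\Delta} \Bigl[S\frac{x_i}{\sigma_i^2}-S_x\frac{1}{\sigma_i^2}\Bigr]
\end{align}
Carrying out the sums:
\begin{align}
\sigma_a^2 &= \frac{S_{xx}}{\Delta}\\
\sigma_b^2 &= \frac{S}{\Delta}
\end{align}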
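
The formulas for $\sigma_a$ and $\sigma_b$ can be checked against the scatter of fitted parameters over many simulated data sets. The sketch below assumes a true line $y = 1 + 2x$ with $\sigma = 0.05$ for every point; the variable names and the number of realizations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 4, 10)
sigma = 0.05
w = np.full(x.size, 1.0 / sigma**2)   # inverse-variance weights

S, Sx, Sxx = w.sum(), (w*x).sum(), (w*x*x).sum()
Delta = S*Sxx - Sx**2
sigma_a = np.sqrt(Sxx / Delta)   # analytic uncertainty on the intercept
sigma_b = np.sqrt(S / Delta)     # analytic uncertainty on the slope

# Monte Carlo: refit many noisy realizations of the same line at once
n_real = 5000
y = 1 + 2*x + rng.normal(0.0, sigma, (n_real, x.size))
Sy, Sxy = (w*y).sum(axis=1), (w*x*y).sum(axis=1)
a = (Sxx*Sy - Sx*Sxy) / Delta
b = (S*Sxy - Sx*Sy) / Delta

print(sigma_a, a.std())   # analytic vs. empirical scatter of a
print(sigma_b, b.std())   # analytic vs. empirical scatter of b
```

The analytic and empirical values should agree to within a few percent for this many realizations.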

## Pearson's theorem
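Suppose there are $K$ independent variables $Z_k$, each Gaussian with zero mean and unit variance. Their sum of squares is
$$
\chi_K^2 = Q = \sum_{k=1}^{K} Z_k^2
$$
and it follows the distribution
$$
P(\chi_K^2) = \frac{1}{2^{K/2}\Gamma(K/2)} \, (\chi_K^2)^{K/2 - 1} \, \exp(-\chi_K^2/2)
$$
$P$ is called the $\chi^2$ distribution with $K$ degrees of freedom. For a linear fit to $N$ points, two parameters are estimated from the data, so $K = N - 2$.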
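
A small simulation illustrates the theorem: sum the squares of $K$ standard normal draws many times and compare the result with the $\chi^2$ density from `scipy.stats.chi2`. The value $K = 8$, the sample size and the binning are arbitrary choices for this sketch.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
K = 8
z = rng.normal(0.0, 1.0, (20000, K))   # independent zero-mean, unit-variance Gaussians
q = (z**2).sum(axis=1)                 # one chi^2_K sample per row

# A chi^2 distribution with K degrees of freedom has mean K and variance 2K
print(q.mean(), q.var())

# Compare the normalized histogram with the analytic density at the bin centers
hist, edges = np.histogram(q, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_dev = np.max(np.abs(hist - chi2.pdf(centers, K)))
print(max_dev)
```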

In [10]:
from scipy.special import gamma
nchisq, binchisq = np.histogram(chisqList, bins=30)
binsizechisq = binchisq[1] - binchisq[0]
plt.close('all')
plt.bar(binchisq[:-1], nchisq/float(max(nchisq)), binsizechisq, align='edge')

def pearson(chisq, K):   
    return 1./(2**(K/2.) * gamma(K/2.)) * chisq**(K/2. - 1) * np.exp(-chisq/2)

probchisq = pearson(binchisq, len(x) - 2)
plt.plot(binchisq, probchisq/float(max(probchisq)))


In [2]:
def linearFit(x, y, sigma=1.0):
    """Weighted least-squares fit of y = a + b*x; returns a and b."""
    # Broadcast a scalar or array-like sigma to one value per data point
    sigma = np.asarray(sigma, dtype=float) * np.ones(len(x))
    Sx = np.sum(x/sigma**2)
    Sy = np.sum(y/sigma**2)
    S = np.sum(1/sigma**2)
    t = 1/sigma * (x - Sx/S)
    Stt = np.sum(t**2)
    b = 1.0/(Stt) * np.sum(t * y / sigma)
    a = (Sy - Sx * b) / S

    return a, b
x = np.linspace(1, 4, 100)
sigma = 0.05
# 1 realization
y0 = 1 + 2*x + 0.2*x*x + np.random.normal(0, sigma, len(x))
a0, b0 = linearFit(x, y0, sigma)
plt.close('all')
fig1 = plt.figure()

ax1 = fig1.add_subplot(211)
ax1.plot(x, y0, '.')
ax1.plot(x, a0+x*b0)
ax2 = fig1.add_subplot(212)
residual = (y0 - (a0 + x*b0)) / sigma
ax2.plot(x, residual, '+', mew=1)

nRealization = 10000
aList = np.zeros(nRealization)
bList = np.zeros(nRealization)
chisqList = np.zeros(nRealization)
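for i in range(nRealization):
    # The data contain a quadratic term that the linear model cannot describe
    y = 1 + 2*x + 0.2*x*x + np.random.normal(0, sigma, len(x))
    a, b = linearFit(x, y, sigma)
    aList[i] = a
    bList[i] = b
    # Reduced chi^2: chi^2 divided by the degrees of freedom
    chisqList[i] = np.sum((y - (a + b*x))**2/sigma**2) / (len(x) - 2)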

fig2 = plt.figure(figsize=(12, 6))
ax21 = fig2.add_subplot(121)
ax21.hist(chisqList, bins=20)
ax22 = fig2.add_subplot(122)
ax22.plot(aList, bList, '.')


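To use $\chi^2$ to calculate parameter errors, the assumed model must be the true one. In the example above the data contain a quadratic term that the linear model lacks, so the reduced $\chi^2$ is well above 1 and the error formulas do not apply.
* a contour of the posterior likelihood is a contour of constant $\chi^2$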
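
For Gaussian errors, the $\Delta\chi^2$ values that bound joint confidence regions follow from the $\chi^2$ distribution with as many degrees of freedom as there are jointly estimated parameters. A minimal sketch, assuming two parameters ($a$ and $b$) and the usual 68.3%, 90% and 95.4% levels:

```python
from scipy.stats import chi2

n_params = 2                        # a and b are estimated jointly
probs = (0.683, 0.90, 0.954)        # confidence levels of interest
# Inverse CDF of the chi^2 distribution gives the Delta chi^2 thresholds
levels = [chi2.ppf(p, n_params) for p in probs]
for p, level in zip(probs, levels):
    print(p, round(level, 2))
```

These thresholds are measured from the minimum $\chi^2$, so they are applied to $\Delta\chi^2 = \chi^2(a, b) - \chi^2_\mathrm{min}$.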

In [38]:
x = np.linspace(1, 4, 10)
sigma = 0.05
y = 1 + 2*x + np.random.normal(0, sigma, len(x))
plt.close('all')
aa, bb = np.meshgrid(np.linspace(0.9, 1.1, 1000), np.linspace(1.8, 2.2, 1000))
# chi^2 on the (a, b) grid, accumulated one data point at a time
chisqMat = np.zeros(aa.shape)
for xi, yi in zip(x, y):
    chisqMat += (yi - (aa + bb*xi))**2/sigma**2

# Delta chi^2 relative to the grid minimum (approximates the best fit)
chisqMat = chisqMat - chisqMat.min()
# Joint 68.3%, 90% and 95.4% regions for two parameters; 0 fills the innermost one
plt.contourf(aa, bb, chisqMat, [0, 2.3, 4.61, 6.17], cmap='viridis')


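* If both $x$ and $y$ have uncertainties, minimizing $\chi^2$ still determines $a$ and $b$ for a linear model, with each term weighted by the combined variance $\sigma_{y,i}^2 + b^2\sigma_{x,i}^2$. Because $b$ appears in the weights, the problem is no longer linear in $b$.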

* For non-linear problems, use a generalized $\chi^2$ over the $M$ data points:
\begin{equation}
\chi^2 = \sum_{i=1}^{M} \frac{(y_i - f(\vec{x}_i, \vec{m}))^2}{\sigma_i^2}
\end{equation}
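
A generalized $\chi^2$ of this kind usually has to be minimized numerically. The sketch below uses `scipy.optimize.curve_fit` for that; the exponential-decay `model`, its parameters `amp` and `tau`, and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, amp, tau):
    """Hypothetical non-linear model: exponential decay."""
    return amp * np.exp(-x / tau)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 5.0, 50)
sigma = 0.02
y = model(x, 1.0, 2.0) + rng.normal(0.0, sigma, x.size)

# absolute_sigma=True treats sigma as true measurement errors,
# so pcov is the parameter covariance without rescaling
popt, pcov = curve_fit(model, x, y, p0=[0.5, 1.0],
                       sigma=np.full(x.size, sigma), absolute_sigma=True)
chisq = np.sum((y - model(x, *popt))**2 / sigma**2)
print(popt, np.sqrt(np.diag(pcov)), chisq)
```

With a correct model, `chisq` should come out near the number of degrees of freedom, here $50 - 2 = 48$.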
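* Estimate errors: if the measurement uncertainty $\sigma$ is unknown but the same for all points, it can be estimated by requiring
\begin{equation}
\chi^2 = \frac{1}{\sigma^2}\sum_i (y_i - a - bx_i)^2 = N_\mathrm{DoF}
\end{equation}
that is, $\sigma^2 = \frac{1}{N_\mathrm{DoF}}\sum_i (y_i - a - bx_i)^2$. This is only an estimate, not an exact calculation, and it assumes the model is correct.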

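* If the distribution of errors is unknown, use bootstrapping.  
Suppose there are $N$ data points. Generate many realizations, each made by drawing $N$ points from the data with replacement, refit each realization, and take the scatter of the fitted parameters as their uncertainty.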
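
A minimal sketch of the bootstrap for a straight-line fit, assuming synthetic data around $y = 1 + 2x$ and an unweighted `np.polyfit` as the fitting step; the number of resamples `n_boot` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 4, 50)
y = 1 + 2*x + rng.normal(0.0, 0.05, x.size)

n_boot = 2000
slopes = np.empty(n_boot)
intercepts = np.empty(n_boot)
for k in range(n_boot):
    # Draw N indices with replacement and refit the resampled data
    idx = rng.integers(0, x.size, x.size)
    slopes[k], intercepts[k] = np.polyfit(x[idx], y[idx], 1)

# The scatter of the refitted parameters estimates their uncertainties
print(intercepts.std(), slopes.std())
```

No error model enters the resampling itself, which is why the approach is useful when the error distribution is not known.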
