In [49]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import cycle
from scipy.stats import chi2

In [4]:
# matplotlib.cm.get_cmap was removed in matplotlib 3.9; plt.get_cmap works across versions
cmap = plt.get_cmap('cool')
colors = cycle(cmap(np.random.permutation(np.linspace(0, 1, 10))))

The following data come from <b>Michael D. Smith, “Sociodemographic risk factors in wife abuse: Results from a survey of Toronto women,”</b> <br/>
More info here: http://uregina.ca/~gingrich/ch10.pdf

In [8]:
data = pd.DataFrame.from_dict({
    'age': ['20-24', '25-34', '35-44'],
    'count-in-sample': [103, 216, 171],
    'percent-in-census': [18, 50, 32]
})
data

Unnamed: 0,age,count-in-sample,percent-in-census
0,20-24,103,18
1,25-34,216,50
2,35-44,171,32


We have a sample of the ages of 490 women who live in Toronto (aged between 20 and 44). <br/> The data is categorical: ages are divided into 3 groups (20-24, 25-34, 35-44) <br/>
We also know the percentage of women in each group according to the Toronto census (only for those aged between 20 and 44)

Our task is to test whether our sample is a good representation of the age distribution of women in Toronto <br/>
We will perform a chi-square goodness-of-fit test comparing the observed values (our sample) with the expected values (derived from the census) <br/>

In [18]:
obvserved = data['count-in-sample']
expected = data['percent-in-census'] / 100 * obvserved.sum()

pd.DataFrame.from_dict(dict(age=data['age'], expected=expected, obvserved=obvserved))

Unnamed: 0,age,expected,obvserved
0,20-24,88.2,103
1,25-34,245.0,216
2,35-44,156.8,171


Now we calculate the chi-square statistic using the formula: <br/>
$$\chi^2 = \sum_{i=1}^n \frac{(E_i - O_i)^2}{E_i}$$ <br/>
where $n$ is the number of categories (3), $E_i$ is the expected count and $O_i$ is the observed count in category $i$

In [28]:
statistic = ((expected - obvserved) ** 2 / expected).sum()
statistic

7.202069160997729
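
As a cross-check on the manual calculation, the same statistic can be obtained with `scipy.stats.chisquare`. This is only a minimal sketch: the names `observed_counts`, `census_percent` and `expected_counts` are introduced here for illustration, and the numbers are copied from the tables above.

```python
import numpy as np
from scipy.stats import chisquare

# Counts and census percentages copied from the table above
observed_counts = np.array([103, 216, 171])
census_percent = np.array([18, 50, 32])

# Expected counts under the census distribution, scaled to the sample size
expected_counts = census_percent / 100 * observed_counts.sum()

# With the default ddof=0, the p-value uses (number of categories - 1)
# degrees of freedom, which matches the test described in this notebook
result = chisquare(f_obs=observed_counts, f_exp=expected_counts)
print('statistic:', result.statistic)
print('p-value:', result.pvalue)
```

The statistic should agree with the value computed by hand, and the p-value is the one compared against $\alpha$ at the end of the notebook.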

The null hypothesis states that the sample follows the census distribution, so the observed counts differ from the expected ones only by sampling variation: <br/>
$H_0 : E_i = O_i$ <br/>
$H_1 : E_i \neq O_i$ for at least one $i$

To decide whether to reject the null hypothesis, we first need to calculate the critical value $l$ for a chosen level of significance ($\alpha$) <br/>
The acceptance region will be $[0, l]$ where $P(\chi_k^2 \geq l) = 1 - P(\chi_k^2 \leq l) = \alpha$ <br/> where $k$ is the number of degrees of freedom (number of categories minus 1 in this case) <br/>

In [38]:
alpha = 0.05

k = len(obvserved) - 1
l = chi2.ppf(1 - alpha, k)
print('Confidence level: {}%'.format((1 - alpha) * 100))
print('Critical value is: {}'.format(l))

Confidence level: 95.0%
Critical value is: 5.991464547107979
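
For the special case $k = 2$ the critical value can be verified by hand, because the chi-square distribution with 2 degrees of freedom has CDF $1 - e^{-x/2}$, which gives $l = -2 \ln(\alpha)$. The sketch below compares that closed form with `chi2.ppf`; the names `closed_form` and `from_scipy` are illustrative only.

```python
import math
from scipy.stats import chi2

alpha = 0.05
k = 2  # three age groups minus one

# For k = 2: P(X <= x) = 1 - exp(-x / 2), so solving 1 - exp(-l / 2) = 1 - alpha
# gives l = -2 * ln(alpha)
closed_form = -2 * math.log(alpha)
from_scipy = chi2.ppf(1 - alpha, k)
print(closed_form, from_scipy)
```

This shortcut only holds for 2 degrees of freedom; for other values of $k$ the quantile function is still needed.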


We will reject the null hypothesis if our calculated statistic $\chi^2$ is greater than the critical value $l$

In [40]:
if statistic <= l:
    print('Failed to reject the null hypothesis. Our sample is consistent with ' +
          'the age distribution of women in Toronto')
else:
    print('Null hypothesis rejected. Our sample is not statistically representative ' +
          'of the age distribution of women in Toronto')

Null hypothesis rejected. Our sample is not statistically representative of the age distribution of women in Toronto


We could also compare the p-value with $\alpha$ <br/>
If p-value >= $\alpha$ we fail to reject the null hypothesis; otherwise we reject it <br/>

In [53]:
# Survival function: P(chi2_k >= statistic), equivalent to 1 - cdf
p = chi2.sf(statistic, k)
if p >= alpha:
    print('Failed to reject the null hypothesis...')
else:
    print('Null hypothesis rejected...')

Null hypothesis rejected...
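
The critical-value rule and the p-value rule always lead to the same decision. A small sketch that checks this for a few significance levels, reusing the statistic computed earlier (the `decisions` dictionary and the extra levels 0.10 and 0.01 are assumptions added for illustration):

```python
from scipy.stats import chi2

statistic = 7.202069160997729  # value computed earlier in the notebook
k = 2                          # degrees of freedom

p_value = chi2.sf(statistic, k)  # P(chi2_k >= statistic)
decisions = {}
for alpha in (0.10, 0.05, 0.01):
    critical = chi2.ppf(1 - alpha, k)
    reject_by_critical = bool(statistic > critical)
    reject_by_p = bool(p_value < alpha)
    # Both rules must agree for every significance level
    assert reject_by_critical == reject_by_p
    decisions[alpha] = reject_by_p
    print(alpha, round(critical, 3), reject_by_p)
```

Note that the conclusion depends on the chosen level: the null hypothesis is rejected at 10% and 5%, but not at 1%.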
