# Statistics

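This notebook covers the chi-square test and uses it to check whether the answers to a multiple-choice exam are consistent with being truly random, that is, with every choice being equally likely.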




## Important Concepts

Hypothesis testing uses sample data to decide between two mutually exclusive statements.

$H_0$ -- Null hypothesis

$H_A$ -- Alternate hypothesis

$\alpha$ -- Significance level (the probability of rejecting $H_0$ when it is actually true)



https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests/chi-square-goodness-of-fit-tests/v/chi-square-distribution-introduction


# T Test

In [14]:
X = [120,105,100,130,115,100,185,105,130,170]

In [15]:
len(X)

10

In [16]:
import numpy as np

In [19]:
np.mean(X), np.std(X,ddof=1)

(126.0, 29.514591494904874)

In [20]:
from scipy import stats

In [35]:
def t_test(sample, mu, side='two'):
    """One-sample t-test of H_0: population mean == mu."""
    assert side in ('two', 'left', 'right')
    mean = np.mean(sample)
    var = np.var(sample, ddof=1)  # sample variance (n - 1 in the denominator)
    sem = (var / len(sample)) ** .5  # standard error of the mean
    t = (mean - mu) / sem  # keep the sign; the one-sided tests depend on it
    df = len(sample) - 1
    if side == 'two':
        p = 2 * stats.t.sf(abs(t), df)  # both tails
    elif side == 'left':
        p = stats.t.cdf(t, df)  # H_A: mean < mu
    else:
        p = stats.t.sf(t, df)  # H_A: mean > mu
    return (t, p)

In [34]:
t_test(X, 150, side='left')

(-2.571428571428571, 0.01505856593796...)
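As a cross-check, SciPy's built-in one-sample t-test can be run on the same data. This is only a sketch: it assumes SciPy 1.6 or later, where `ttest_1samp` accepts the `alternative` argument, and `t_manual` is a name introduced here for the comparison.

```python
import numpy as np
from scipy import stats

X = [120, 105, 100, 130, 115, 100, 185, 105, 130, 170]

# Left-tailed test: H_0 is mean = 150, H_A is mean < 150
res = stats.ttest_1samp(X, popmean=150, alternative='less')

# Manual statistic for comparison: (sample mean - mu) / standard error
sem = np.std(X, ddof=1) / np.sqrt(len(X))
t_manual = (np.mean(X) - 150) / sem

print(res.statistic, res.pvalue)
print(t_manual)
```

The signed statistic is negative because the sample mean (126) sits below the hypothesized mean (150), and that sign is what makes the left-tailed p-value small.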

## Background

The chi-square test relies on three conditions:

1. Random -- the observations come from a random sample
2. Large counts -- the expected count in each category is at least five
3. Independence -- sample with replacement, or keep the sample to no more than 10% of the population
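The expected-count condition can be checked in a couple of lines. A minimal sketch, assuming 100 responses spread over four equally likely answers; the helper name `expected_counts` is hypothetical.

```python
import numpy as np

def expected_counts(n, probs):
    """Expected count per category under the null hypothesis."""
    return n * np.asarray(probs, dtype=float)

# Assumption: 100 responses, four answers with probability .25 each
exp = expected_counts(100, [0.25, 0.25, 0.25, 0.25])

# The condition holds when every expected count is at least five
large_enough = bool(np.all(exp >= 5))
print(exp, large_enough)
```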

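The chi-square distribution with $k$ degrees of freedom is the distribution of a sum of $k$ squared independent standard normal variables. Its mean is equal to the degrees of freedom ($k$).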

### k=1 (d.f. = one )

$X \sim N(0,1)$

$\chi^2 = X^2$

### k=2 (d.f. = two)

$\chi^2 = X_1^2 + X_2^2$

### k=3 (d.f. = three)

$\chi^2 = X_1^2 + X_2^2 + X_3^2$


### Chi Square Properties



As $k$ gets bigger, the chi-square distribution looks more like a normal distribution with mean $k$ and variance $2k$.
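A quick simulation illustrates the construction: square $k$ independent standard normal draws, add them up, and compare the sample mean and variance with $k$ and $2k$. This is a sketch; the function name `simulate_chi_square`, the seed, and the number of draws are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def simulate_chi_square(k, n_draws=200_000):
    """Sum the squares of k independent N(0, 1) draws, n_draws times."""
    z = rng.standard_normal(size=(n_draws, k))
    return (z ** 2).sum(axis=1)

for k in (1, 2, 3, 30):
    draws = simulate_chi_square(k)
    # Sample mean should be close to k, sample variance close to 2k
    print(k, draws.mean(), draws.var())
```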



In [1]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [2]:
location = '../data/'
files = os.listdir(location)
files

['CrossStats20150102.txt',
 'multiple_choice.csv',
 'state_codes.csv',
 'nst-est2019-popchg2010-2019.pdf',
 'mount_rainier_daily.csv',
 'COVID_by_State.csv',
 'nst-est2019-popchg2010_2019.csv']

In [3]:
df = pd.read_csv(location + 'multiple_choice.csv')
df

index | Question | Response
--|--|--
0 | 1 | D
1 | 2 | B
2 | 3 | C
3 | 4 | D
4 | 5 | C
... | ... | ...
95 | 96 | D
96 | 97 | A
97 | 98 | B
98 | 99 | D


In [4]:
grp = df.groupby('Response').count()
grp

Response | Question
--|--
A | 22
B | 11
C | 27
D | 40
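The observed counts can also be pulled out as a plain array, which is the form `stats.chisquare` expects. The sketch below uses a stand-in DataFrame rebuilt from the grouped counts so that it runs without `multiple_choice.csv`; the names `demo`, `observed`, and `f_obs` are illustrative, and the stand-in does not preserve the original question order.

```python
import pandas as pd

# Stand-in data rebuilt from the grouped counts: 22 A, 11 B, 27 C, 40 D
responses = ['A'] * 22 + ['B'] * 11 + ['C'] * 27 + ['D'] * 40
demo = pd.DataFrame({'Question': range(1, len(responses) + 1),
                     'Response': responses})

# value_counts tallies each answer; sort_index orders them A through D
observed = demo['Response'].value_counts().sort_index()
f_obs = observed.to_numpy()
print(observed)
```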


# Chi Square Test

### Let's set a Null Hypothesis

$H_0$ -- Each answer is equally likely

Answer | Prob
--|--
A | .25
B | .25
C | .25
D | .25

$H_A$ -- At least one answer has a probability other than .25



#### Significance Level

$\alpha = 0.05$

### Assumptions

Under $H_0$, the test statistic approximately follows a $\chi^2$ distribution.


### Conditions



# Math


$$
\begin{aligned}
\chi^2 &= \frac{(22 - 25)^2}{25} + \frac{(11 - 25)^2}{25} + \frac{(27 - 25)^2}{25} + \frac{(40 - 25)^2}{25} \\
&= \frac{9 + 14^2 + 4 + 15^2}{25} = \frac{434}{25} = 17.36
\end{aligned}
$$
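The same arithmetic can be reproduced with NumPy. A minimal sketch; the variable names are illustrative, and the expected counts assume four equally likely answers.

```python
import numpy as np

observed = np.array([22, 11, 27, 40])
expected = np.full(4, observed.sum() / 4)  # 25 per answer under H_0

# Sum of (observed - expected)^2 / expected over the four answers
chi_sq = ((observed - expected) ** 2 / expected).sum()
print(chi_sq)
```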


### Degrees of Freedom

With 4 answer categories, the degrees of freedom are $k = 4 - 1 = 3$.
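Given the degrees of freedom, the critical value and the p-value both come from the chi-square distribution. A sketch using `scipy.stats.chi2`, assuming $\alpha = 0.05$ and the statistic 17.36 for the observed counts 22, 11, 27, 40; the variable names are illustrative.

```python
from scipy import stats

k = 4            # number of answer choices
dof = k - 1      # degrees of freedom
alpha = 0.05     # significance level
chi_sq = 17.36   # statistic for the observed counts 22, 11, 27, 40

critical = stats.chi2.ppf(1 - alpha, dof)  # reject H_0 above this value
p_value = stats.chi2.sf(chi_sq, dof)       # upper-tail probability
print(critical, p_value)
```

Comparing `chi_sq` with `critical` and comparing `p_value` with `alpha` are two views of the same decision.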



In [5]:
from scipy import stats

In [13]:
result = stats.chisquare(f_obs=[22,11,27,40])
result

Power_divergenceResult(statistic=17.36, pvalue=0.0005959141426149805)

In [7]:
result  # executed earlier in the session (In [7]), so this shows an older `result`, not the In [13] value

Power_divergenceResult(statistic=3.6, pvalue=nan)

In [8]:
# ddof is left at its default of 0, so the test uses k - 1 = 3 degrees of freedom
result = stats.chisquare(f_obs=[25,5,25,45], f_exp=[25,25,25,25])
result

Power_divergenceResult(statistic=32.0, pvalue=5.233466447749424e-07)

In [9]:
stats.chisquare(f_obs=[25,5,25,45])

Power_divergenceResult(statistic=32.0, pvalue=5.233466447749424e-07)

In [10]:
stats.chisquare(f_obs=[22,11,27,40])

Power_divergenceResult(statistic=17.36, pvalue=0.0005959141426149805)

In [11]:
stats.chisquare(f_obs=[25,23,25,27])

Power_divergenceResult(statistic=0.32, pvalue=0.9562241644890547)
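The results can be turned into decisions by comparing each p-value with the significance level. A sketch assuming $\alpha = 0.05$; the labels `exam`, `skewed`, and `near uniform` are names introduced here for the three sets of counts.

```python
from scipy import stats

alpha = 0.05
samples = {
    'exam': [22, 11, 27, 40],
    'skewed': [25, 5, 25, 45],
    'near uniform': [25, 23, 25, 27],
}

decisions = {}
for name, f_obs in samples.items():
    # With no f_exp given, chisquare tests against equal expected counts
    stat, p = stats.chisquare(f_obs=f_obs)
    decisions[name] = bool(p < alpha)  # True means reject H_0
    verdict = 'reject H_0' if p < alpha else 'fail to reject H_0'
    print(f'{name}: chi2={stat:.2f}, p={p:.3g}, {verdict}')
```

For the exam data the p-value is below 0.05, so the hypothesis that all four answers are equally likely is rejected.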