# Statistics

This week is a compressed review of undergraduate statistics. This notebook applies a chi-square goodness-of-fit test to check whether the correct responses on a multiple-choice exam are distributed evenly across the answer letters, as they would be if the answer key were truly random.

## Load the Data
1. Load the data
2. Inspect and Clean
3. Report the correct response distribution by letter

## Solve Mathematically
1. Apply the Chi Square test manually

## Use SciPy stats to Solve
1. Look up the SciPy documentation for [Chi Square](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) and solve the problem


# Load & Inspect

In [1]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [2]:
location = '../../data/'
files = os.listdir(location)
files

['CrossStats20150102.txt',
 'multiple_choice.csv',
 'iris_names.txt',
 'state_codes.csv',
 'iris.csv',
 'nst-est2019-popchg2010-2019.pdf',
 'mount_rainier_daily.csv',
 'COVID_by_State.csv',
 'Candidate Assessment.xlsx',
 'nst-est2019-popchg2010_2019.csv']

In [3]:
df = pd.read_csv(location + 'multiple_choice.csv')
df

Unnamed: 0,Question,Response
0,1,D
1,2,B
2,3,C
3,4,D
4,5,C
...,...,...
95,96,D
96,97,A
97,98,B
98,99,D


In [4]:
grp = df.groupby('Response').count()
grp

Unnamed: 0_level_0,Question
Response,Unnamed: 1_level_1
A,22
B,11
C,27
D,40
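The same distribution can also be reported as proportions alongside the raw counts. The sketch below is only illustrative: `responses` is a hypothetical stand-in for `df['Response']`, rebuilt from the counts shown above so that the example runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for df['Response'], rebuilt from the counts shown above.
counts_shown = {'A': 22, 'B': 11, 'C': 27, 'D': 40}
responses = pd.Series(
    [letter for letter, n in counts_shown.items() for _ in range(n)],
    name='Response',
)

# Count and proportion of each answer letter, sorted alphabetically.
observed = responses.value_counts().sort_index()
proportions = responses.value_counts(normalize=True).sort_index()

summary = pd.DataFrame({'count': observed, 'proportion': proportions})
print(summary)
```

With the real data, replacing `responses` with `df['Response']` should give the same table.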


# Solve Mathematically

### Let's set a Null Hypothesis

$H_0$: each answer letter is equally likely to be the correct response.

Answer | Prob
--|--
A | .25
B | .25
C | .25
D | .25

$H_A$: at least one answer letter's probability differs from 0.25.
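Under $H_0$, the expected count for each letter is the number of questions times its probability. A minimal sketch, assuming the 100 responses counted earlier in the notebook; the variable names are illustrative only:

```python
import numpy as np

n_questions = 100  # 22 + 11 + 27 + 40 responses counted earlier
null_probs = np.array([0.25, 0.25, 0.25, 0.25])  # A, B, C, D under H_0

# Expected count per letter if H_0 is true.
expected = n_questions * null_probs
print(expected)
```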



#### Significance Level

$\alpha = 0.05$

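### Assumptions

Under $H_0$ the test statistic approximately follows a $\chi^2$ distribution, provided the questions are treated as independent observations.


### Conditions

Each expected count should be at least 5. Here every expected count is $100 \times 0.25 = 25$, so the condition holds.
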

### Calculate Chi Square Test Value


$$
\chi^2 = \frac{(22 - 25)^2}{25} + \frac{(11 - 25)^2}{25} + \frac{(27 - 25)^2}{25} + \frac{(40 - 25)^2}{25}
= \frac{3^2 + 14^2 + 2^2 + 15^2}{25} = \frac{434}{25} = 17.36
$$
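The same arithmetic can be checked term by term with NumPy. This is a sketch that hard-codes the observed counts from the grouped table; the variable names are illustrative.

```python
import numpy as np

observed = np.array([22, 11, 27, 40])      # counts for A, B, C, D
expected = np.full(4, observed.sum() / 4)  # 25 per letter under H_0

# Each letter's contribution: (observed - expected)^2 / expected
contributions = (observed - expected) ** 2 / expected
chi_square = contributions.sum()
print(contributions, chi_square)
```

Looking at the individual contributions shows which letters drive the statistic; here B and D account for most of it.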


### Degrees of Freedom

Number of categories: $k = 4$

Degrees of freedom: $k - 1 = 4 - 1 = 3$



In [5]:
from scipy import stats

In [6]:
rv = stats.chi2(df=3)

In [7]:
chi_square = (9 + 14 ** 2 + 4 + 15 **2) / 25
chi_square

17.36

### Apply Right Tail Test

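Larger values of $\chi^2$ indicate larger departures from the expected counts, so the p-value is the area under the $\chi^2$ density to the right of the observed statistic.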

In [8]:
1 - rv.cdf(chi_square)

0.0005959141426149506
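As an alternative sketch of the same right-tail test, the survival function `sf` returns the upper-tail area directly, and `ppf` gives the critical value for the chosen significance level. The variable names below are illustrative; the statistic and degrees of freedom are hard-coded from the cells above.

```python
from scipy import stats

alpha = 0.05
chi_square = 17.36
dof = 3

# Survival function = 1 - CDF, evaluated directly in the upper tail.
p_value = stats.chi2.sf(chi_square, df=dof)

# Critical value: the statistic must exceed this to reject H_0 at level alpha.
critical_value = stats.chi2.ppf(1 - alpha, df=dof)

print(p_value, critical_value)
print(chi_square > critical_value, p_value < alpha)
```

Comparing the statistic with the critical value and comparing the p-value with $\alpha$ are two views of the same decision, so both comparisons should agree.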

### Conclusion

Because the p-value (about 0.0006) is below $\alpha = 0.05$, we reject the null hypothesis that the correct responses are equally distributed among A, B, C, and D.

# Test with SciPy

Look up the documentation: [Chi Square](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html).

In [9]:
result = stats.chisquare(f_obs=[22,11,27,40])
result

Power_divergenceResult(statistic=17.36, pvalue=0.0005959141426149805)
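When `f_exp` is omitted, `stats.chisquare` assumes all categories are equally likely, which matches $H_0$ here. The sketch below passes the expected counts explicitly to make that assumption visible; it would also be the form to use if the null probabilities were not uniform.

```python
from scipy import stats

observed = [22, 11, 27, 40]
expected = [25, 25, 25, 25]  # uniform expectation, written out explicitly

# f_obs and f_exp must sum to the same total (100 here).
result = stats.chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)
```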