## Chi Square Test of Independence 

Problem: A professor complains that students' ratings of teachers depend on the grades the students receive.
To test this claim, the Student Assembly took a random sample of 300 teacher ratings on which the students'
grades for the course were also indicated. The results are given in the following table. Test the hypothesis
that teacher ratings and student grades are independent at the 1% significance level.

|Rating  | A | B | C | F |
|---------|----|----|----|----| 
|Excellent | 14 | 18 | 15 | 3 |
|Average | 25 | 35 | 75 | 15 |
|Poor | 21 | 27 | 40 | 12 |

## Hypotheses
$H_0$: Teacher ratings and student grades are independent.\
$H_1$: Teacher ratings are dependent on student grades.

In [168]:
import numpy as np
import pandas as pd
import sympy as sym
from IPython.display import Math,display
from scipy import stats

In [169]:
# first re-create the table from the problem:
# enter the data, one list per rating row
a1 = [14,18,15,3]
a2 = [25,35,75,15]
a3 = [21,27,40,12]

In [170]:
data = np.array([a1,a2,a3])#define the array
df = pd.DataFrame(data,['Excellent','Average','Poor'],['A','B','C','F']) # format the dataframe

In [171]:
print(df) # display the dataframe 

            A   B   C   F
Excellent  14  18  15   3
Average    25  35  75  15
Poor       21  27  40  12


In [172]:
#now define variables
n = 300 # from problem
r_e_t = sum(df.loc['Excellent'])#row total for excellent
r_a_t = sum(df.loc['Average'])#row total for average
r_p_t = sum(df.loc['Poor'])#row total for poor
c_A_t = sum(df['A'])#column total for A
c_B_t = sum(df['B'])#column total for B
c_C_t = sum(df['C'])#column total for C
c_F_t = sum(df['F'])#column total for F
num_R = len(df['A'])#number of rows
num_C = len(df.loc['Poor'])#number of columns
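The totals above can also be obtained without naming each row and column. The following is a minimal sketch, assuming the same observed counts; the names `observed`, `row_totals`, `col_totals`, and `grand_total` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Same observed counts as the table in the problem statement.
observed = pd.DataFrame(
    [[14, 18, 15, 3], [25, 35, 75, 15], [21, 27, 40, 12]],
    index=["Excellent", "Average", "Poor"],
    columns=["A", "B", "C", "F"],
)

row_totals = observed.sum(axis=1)  # one total per rating (sums across grades)
col_totals = observed.sum(axis=0)  # one total per grade (sums across ratings)
grand_total = int(row_totals.sum())  # should match n from the problem
n_rows, n_cols = observed.shape

print(row_totals)
print(col_totals)
print(grand_total, n_rows, n_cols)
```

Computing the grand total from the data rather than typing it in gives a quick check that the table was entered correctly.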

In [173]:
#define observed values as 'o'
o = a1 + a2 + a3 # observed counts flattened row by row

The formula for the expected value E of a cell is:
$$E = \frac{(\text{row total}) \times (\text{column total})}{n}$$
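The expected count for every cell can be computed in one step as an outer product of the row totals and column totals. This is a sketch of one possible vectorized approach, not the method used in the cells of this notebook; `observed` and `expected` are illustrative names.

```python
import numpy as np

# Observed counts: rows are ratings, columns are grades A, B, C, F.
observed = np.array([[14, 18, 15, 3], [25, 35, 75, 15], [21, 27, 40, 12]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# expected[i, j] = row_totals[i] * col_totals[j] / n
expected = np.outer(row_totals, col_totals) / n
print(expected)
```

The expected counts always sum to the same grand total as the observed counts, which is a useful sanity check.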

In [174]:
#define expected values for each observed value as 'e'
e = [
    r_e_t*c_A_t/n, r_e_t*c_B_t/n, r_e_t*c_C_t/n, r_e_t*c_F_t/n,  # Excellent row
    r_a_t*c_A_t/n, r_a_t*c_B_t/n, r_a_t*c_C_t/n, r_a_t*c_F_t/n,  # Average row
    r_p_t*c_A_t/n, r_p_t*c_B_t/n, r_p_t*c_C_t/n, r_p_t*c_F_t/n,  # Poor row
]

In [175]:
# create the table to calculate chi sq stats 
table = np.array([o,e])
df2 = pd.DataFrame(table,['observed','expected'],['1','2','3','4','5','6','7','8','9','10','11','12'])
df2

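| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| observed | 14.0 | 18.0 | 15.0 | 3.0 | 25.0 | 35.0 | 75.0 | 15.0 | 21.0 | 27.0 | 40.0 | 12.0 |
| expected | 10.0 | 13.333333 | 21.666667 | 5.0 | 30.0 | 40.0 | 65.0 | 15.0 | 20.0 | 26.666667 | 43.333333 | 10.0 |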


### The formula for the test statistic for the independence of two variables is:


$$\chi^2 = \sum \frac{(o-e)^2}{e}$$

First we need to calculate each cell's contribution to the statistic (stored in the `residual` row below): $$\frac{(o-e)^2}{e}$$
for each column
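
As a worked instance of this formula, take column 1 of the table (Excellent rating, grade A), where $o = 14$ and $e = 10$:

$$\frac{(o-e)^2}{e} = \frac{(14-10)^2}{10} = \frac{16}{10} = 1.6$$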

In [176]:
#calculate the residual and add the values to the table then display the table again
df2.loc['residual'] = ((df2.loc['observed']-df2.loc['expected'])**2)/df2.loc['expected']
df2
# df2 = df2.drop("pick a row", axis=0) -- use if you need to drop a row added by accident

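| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| observed | 14.0 | 18.0 | 15.0 | 3.0 | 25.0 | 35.0 | 75.0 | 15.0 | 21.0 | 27.0 | 40.0 | 12.0 |
| expected | 10.0 | 13.333333 | 21.666667 | 5.0 | 30.0 | 40.0 | 65.0 | 15.0 | 20.0 | 26.666667 | 43.333333 | 10.0 |
| residual | 1.6 | 1.633333 | 2.051282 | 0.8 | 0.833333 | 0.625 | 1.538462 | 0.0 | 0.05 | 0.004167 | 0.25641 | 0.4 |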


Now we take the sum of the `residual` row to get
$$\chi^2$$


In [177]:
css = sum(df2.loc['residual'])
display(Math('\\chi^2 = %s'%(css)))

$$\chi^2 \approx 9.792$$
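
The same statistic can be computed directly on the 3×4 arrays, without flattening them into a one-row-per-quantity table. This is a minimal sketch under the assumption that the observed counts are those given in the problem; `contributions` and `chi_sq` are illustrative names.

```python
import numpy as np

observed = np.array([[14, 18, 15, 3], [25, 35, 75, 15], [21, 27, 40, 12]])

# Expected counts under independence: (row total * column total) / n.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Per-cell contribution (o - e)^2 / e, then the sum over all cells.
contributions = (observed - expected) ** 2 / expected
chi_sq = contributions.sum()
print(chi_sq)
```

Keeping `contributions` in the original table shape makes it easy to see which rating and grade combinations depart most from independence.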

### The formula for degrees of freedom of a Chi Square test is:


$$ df = (R-1)(C-1)\text { where } R = \text{ number of rows and } C = \text{ number of columns.}$$ 
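
For the table in this problem, with $R = 3$ rating rows and $C = 4$ grade columns, the formula works out to:

$$df = (3-1)(4-1) = 2 \times 3 = 6$$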


In [178]:
df = (num_R - 1) * (num_C - 1) # note: this rebinds df from the DataFrame to the degrees of freedom
display(Math('df = %g'%(df)))

$$df = 6$$

In [186]:
# now enter alpha (the significance level) from your problem
alpha = input('Enter the significance level from the problem as a decimal:')
display(Math('\\alpha = %s'%(alpha)))

Enter the significance level from the problem as a decimal:.01


$$\alpha = .01$$

In [187]:
#now calculate the p-value of a chi square test for the calculated statistic and degree of freedom

In [188]:
p_value = stats.chi2.sf(css,df)
display(Math('\\text {p-value} = %s'%(p_value)))

$$\text{p-value} \approx 0.1337$$
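
An equivalent way to reach the decision is to compare the statistic with the critical value of the chi-square distribution instead of comparing the p-value with alpha. The sketch below assumes the statistic and degrees of freedom reported in this notebook; `critical_value` and `reject_null` are illustrative names.

```python
from scipy import stats

chi_sq = 9.791987179487181  # statistic reported in this notebook
dof = 6                     # (3 - 1) * (4 - 1)
alpha = 0.01                # significance level from the problem

# Critical value: the point with upper-tail area alpha under chi-square(dof).
critical_value = stats.chi2.ppf(1 - alpha, dof)

# Reject H_0 only if the statistic falls in the upper tail.
reject_null = bool(chi_sq > critical_value)
print(critical_value, reject_null)
```

Both approaches always agree: the statistic exceeds the critical value exactly when the p-value is below alpha.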

In [191]:
if float(p_value) > float(alpha):
    print("We cannot reject the null hypothesis. There is insufficient evidence at the 1% significance level\nto support the professor's claim that teacher ratings are dependent on grades.")
else:
    print("We must reject the null hypothesis. There is sufficient evidence at the 1% significance level to support the professor's claim.")

We cannot reject the null hypothesis. There is insufficient evidence at the 1% significance level
to support the professor's claim that teacher ratings are dependent on grades.


In [163]:
# If you are thinking there must be a better way, you are correct. Here is the shortcut:
# the first number is the chi-squared statistic, the second is the p-value, and the third is the df.
# The array is a matrix of expected values that corresponds to the array of observed values.

In [164]:
answer = stats.chi2_contingency(data)

In [165]:
display(list(answer))

[9.791987179487181,
 0.13368965273402839,
 6,
 array([[10.        , 13.33333333, 21.66666667,  5.        ],
        [30.        , 40.        , 65.        , 15.        ],
        [20.        , 26.66666667, 43.33333333, 10.        ]])]
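
The four values returned by `stats.chi2_contingency` can be unpacked into named variables, which avoids indexing into a list by position. This is a small sketch assuming the same `data` array as above; the variable names on the left-hand side are illustrative.

```python
import numpy as np
from scipy import stats

data = np.array([[14, 18, 15, 3], [25, 35, 75, 15], [21, 27, 40, 12]])

# Unpack statistic, p-value, degrees of freedom, and expected counts.
chi_sq, p_value, dof, expected = stats.chi2_contingency(data)

alpha = 0.01  # significance level from the problem
print(f"chi^2 = {chi_sq:.4f}, p = {p_value:.4f}, df = {dof}")
print("reject H_0" if p_value <= alpha else "fail to reject H_0")
```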

In [166]:
display(Math('\\chi^2 = {}'.format(list(answer)[0])))
display(Math('\\text{p} = %s'%(list(answer)[1])))
display(Math('\\ df = %s'%(list(answer)[2])))
#display(Math('\\text {The Expected Values} = %s'%((list(answer)[3]))))

$$\chi^2 = 9.791987179487181$$

$$\text{p} = 0.13368965273402839$$

$$df = 6$$