<a href="https://colab.research.google.com/github/allan-gon/DS-Unit-1-Sprint-2-Statistics/blob/master/module2/LS_DS_122_Chi2_Tests_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment - Practice Chi^2 Tests

Use the following dataset relating to math scores of students in two different Portuguese schools:

<https://archive.ics.uci.edu/ml/datasets/Student+Performance>

### 1) Load the dataset specifically relating to math scores as a new dataframe.

There are two datasets in the `student.zip` file; make sure you use `student-mat.csv`.


In [0]:
import pandas as pd
import numpy as np
from scipy import stats

In [0]:
df = pd.read_csv("student-mat.csv", delimiter=';')
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10


### 2) Use Chi^2 tests and `stats.chi2_contingency()` to identify:
 - Two pairs of variables that are dependent (are associated with one another).
 - Two pairs of variables that are independent (have no significant relationship).

Does it make intuitive sense why the variables in these pairs might or might not show a relationship to one another? 


# Checking if variables are dependent

**Null Hypothesis:** Studytime and freetime are independent of each other

**Alternative Hypothesis:** Studytime and freetime are not independent of each other

**Significance Level:** 0.05 (95% confidence level)

In [0]:
#make a crosstab and pass it into stats.chi2_contingency() and compare p to 1 - confidence
contin = pd.crosstab(df['studytime'], df['freetime'])
chi2, p_value, dof, expected = stats.chi2_contingency(contin)

print("chi2 statistic", chi2)
print("p value", p_value)
print("degrees of freedom",dof)
print("expected frequencies table", expected)

chi2 statistic 23.53045426913059
p value 0.023545617681757027
degrees of freedom 12
expected frequencies table [[ 5.05063291 17.01265823 41.73417722 30.56962025 10.63291139]
 [ 9.52405063 32.08101266 78.69873418 57.64556962 20.05063291]
 [ 3.12658228 10.53164557 25.83544304 18.92405063  6.58227848]
 [ 1.29873418  4.37468354 10.73164557  7.86075949  2.73417722]]


**Conclusion:** Reject the null hypothesis. With a chi2 statistic of 23.53 and a p-value of 0.0235, which is below 0.05, there is evidence that studytime and freetime are associated.
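
The same decision can also be reached by comparing the statistic with a critical value rather than comparing the p-value with the significance level. The sketch below assumes a significance level of 0.05 and reuses the statistic and degrees of freedom printed above; `alpha`, `chi2_stat`, and `critical_value` are illustrative names, not part of the original notebook.

```python
from scipy import stats

alpha = 0.05                      # assumed significance level (1 - 0.95 confidence)
chi2_stat = 23.53045426913059     # statistic reported by chi2_contingency above
dof = 12                          # (4 - 1) * (5 - 1)

# Critical value: the point that leaves `alpha` probability in the upper tail
critical_value = stats.chi2.ppf(1 - alpha, dof)
# Upper-tail probability of the observed statistic (the p-value)
p_value = stats.chi2.sf(chi2_stat, dof)

reject_by_statistic = chi2_stat > critical_value
reject_by_p_value = p_value < alpha
print(critical_value, p_value, reject_by_statistic, reject_by_p_value)
```

Both comparisons are two views of the same rule, so they should always agree.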

# Checking for independence

**Null Hypothesis:** age and sex are independent of each other

**Alternative Hypothesis:** age and sex are not independent of each other

**Significance Level:** 0.05 (95% confidence level)

In [0]:
contingency = pd.crosstab(df['sex'], df['age'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print("chi2 statistic", chi2)
print("p value", p_value)
print("degrees of freedom",dof)
print("expected frequencies table", expected)

chi2 statistic 5.99460281380294
p value 0.5403796955381378
degrees of freedom 7
expected frequencies table [[43.17974684 54.76455696 51.60506329 43.17974684 12.63797468  1.57974684
   0.52658228  0.52658228]
 [38.82025316 49.23544304 46.39493671 38.82025316 11.36202532  1.42025316
   0.47341772  0.47341772]]


**Conclusion:** Fail to reject the null hypothesis. With a chi2 statistic of 5.995 and a p-value of 0.54, there is not enough evidence to conclude that age and sex are related.
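
One caveat for this table: the chi^2 approximation is usually considered less reliable when expected counts are small. A common rule of thumb (an assumption here, not something the assignment states) is that expected counts should be at least 5. The sketch below copies the expected frequencies from the output above and counts how many cells fall below that threshold; `threshold` and `small_cells` are illustrative names.

```python
import numpy as np

# Expected frequencies for sex x age, copied from the output above
expected = np.array([
    [43.17974684, 54.76455696, 51.60506329, 43.17974684,
     12.63797468, 1.57974684, 0.52658228, 0.52658228],
    [38.82025316, 49.23544304, 46.39493671, 38.82025316,
     11.36202532, 1.42025316, 0.47341772, 0.47341772],
])

threshold = 5  # common rule of thumb, not a hard requirement
small_cells = int((expected < threshold).sum())
share_small = small_cells / expected.size
print(f"{small_cells} of {expected.size} cells have expected counts below {threshold}")
```

The small cells all come from the oldest age groups, so one possible follow-up, not applied here, would be to combine those groups before testing.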

### 3) Use NumPy to perform your own chi^2 test "from scratch" 

Pick any one of the chi^2 tests that you ran in part 2 and reproduce it on your own. You should get the same results that SciPy returned for all four values from `chi2_contingency()`.

In [0]:
observed = pd.crosstab(df['studytime'], df['freetime'], margins=True)
observed

studytime \ freetime,1,2,3,4,5,All
1,2,19,35,29,20,105
2,11,28,82,61,16,198
3,3,11,29,21,1,65
4,3,6,11,4,3,27
All,19,64,157,115,40,395


# Get Observed

In [0]:
observe = observed.values[:-1, :-1]  # drop the "All" margin row and column
observe

array([[ 2, 19, 35, 29, 20],
       [11, 28, 82, 61, 16],
       [ 3, 11, 29, 21,  1],
       [ 3,  6, 11,  4,  3]])

In [0]:
col_tot = observed.values[-1][:-1]
row_tot = observed.values[:-1,-1]
tot = observed.values[-1, -1]  # grand total (395), read from the table instead of hard-coded

In [0]:
print(col_tot)
print(row_tot)

[ 19  64 157 115  40]
[105 198  65  27]


# Get Expected

In [0]:
# Expected count per cell = (row total * column total) / grand total
expected = np.outer(row_tot, col_tot) / tot
expected

array([[ 5.05063291, 17.01265823, 41.73417722, 30.56962025, 10.63291139],
       [ 9.52405063, 32.08101266, 78.69873418, 57.64556962, 20.05063291],
       [ 3.12658228, 10.53164557, 25.83544304, 18.92405063,  6.58227848],
       [ 1.29873418,  4.37468354, 10.73164557,  7.86075949,  2.73417722]])


# Get sample size

In [0]:
sample_size = df.shape[0]
sample_size

395

# DOF

In [0]:
# Degrees of freedom = (number of rows - 1) * (number of columns - 1)
n_rows, n_cols = observe.shape
dof = (n_rows - 1) * (n_cols - 1)
dof

12

# Get Chi2 stat

In [0]:
((observe - expected)**2/expected).sum()

23.53045426913059
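
The steps above reproduce the statistic, the degrees of freedom, and the expected table, but not the p-value, which is the fourth value returned by `chi2_contingency()`. A minimal sketch of that last step follows, assuming the upper-tail probability is taken from `scipy.stats.chi2.sf` rather than computed in pure NumPy. The observed counts are copied from the crosstab above so the block runs on its own.

```python
import numpy as np
from scipy import stats

# Observed studytime x freetime counts, copied from the crosstab above
observe = np.array([
    [2, 19, 35, 29, 20],
    [11, 28, 82, 61, 16],
    [3, 11, 29, 21, 1],
    [3, 6, 11, 4, 3],
])

row_tot = observe.sum(axis=1)
col_tot = observe.sum(axis=0)
expected = np.outer(row_tot, col_tot) / observe.sum()

chi2_stat = ((observe - expected) ** 2 / expected).sum()
dof = (observe.shape[0] - 1) * (observe.shape[1] - 1)

# p-value: probability of a statistic at least this large under the null
p_value = stats.chi2.sf(chi2_stat, dof)
print(chi2_stat, dof, p_value)
```

The result should match the p-value reported by SciPy in part 2.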

## Stretch goals:

### 1. Refactor your code so it is elegant, readable, and holds reusable code in functions.

In [0]:
# YOUR WORK HERE
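
One possible refactor, offered as a sketch rather than a reference solution, wraps the steps from part 3 in a single function. The name `chi2_independence` is hypothetical, and the function applies no continuity correction, so it is only expected to match `chi2_contingency()` for tables with more than one degree of freedom (or when SciPy is called with `correction=False`).

```python
import numpy as np
from scipy import stats


def chi2_independence(observed):
    """Return (chi2, p_value, dof, expected) for a 2-D table of counts.

    No continuity correction is applied.
    """
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    # Expected count per cell = (row total * column total) / grand total
    expected = np.outer(row_totals, col_totals) / observed.sum()
    chi2 = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    p_value = stats.chi2.sf(chi2, dof)
    return chi2, p_value, dof, expected


# Observed studytime x freetime counts, copied from part 3
table = np.array([
    [2, 19, 35, 29, 20],
    [11, 28, 82, 61, 16],
    [3, 11, 29, 21, 1],
    [3, 6, 11, 4, 3],
])
chi2, p_value, dof, expected = chi2_independence(table)
print(chi2, p_value, dof)
```

Because the function converts its input with `np.asarray`, it should also accept a crosstab directly, for example `chi2_independence(pd.crosstab(df['studytime'], df['freetime']))`.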



### 2. Check For Understanding - Study and write your own explanations/definitions for these topics:

- What is a sample "estimate" in statistics?

- What is a hypothesis test? How is it useful?

- What is a "null hypothesis?"

- What is a p-value and what does it represent?

- What does it mean for something to be "statistically significant?"

- What is a test statistic and how does it relate to a p-value?

- What are "degrees of freedom" and how are they calculated in a 1-sample, 2-sample, and chi^2 test for independence?

## Resources

- [Interactive visualization of the Chi-Squared test](https://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html)
- [Calculation of Chi-Squared test statistic](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)
- [Visualization of a confidence interval generated by R code](https://commons.wikimedia.org/wiki/File:Confidence-interval.svg)
- [Expected value of a squared standard normal](https://math.stackexchange.com/questions/264061/expected-value-calculation-for-squared-normal-distribution) (it's 1 - which is why the expected value of a Chi-Squared with $n$ degrees of freedom is $n$, as it's the sum of $n$ squared standard normals)
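
The claim in the last resource can be written out in two short steps, assuming $Z_1, \dots, Z_n$ are independent standard normal variables (mean 0, variance 1):

$$
\mathbb{E}\left[Z_i^2\right] = \operatorname{Var}(Z_i) + \left(\mathbb{E}[Z_i]\right)^2 = 1 + 0 = 1,
\qquad
\mathbb{E}\left[\chi^2_n\right] = \mathbb{E}\left[\sum_{i=1}^{n} Z_i^2\right] = \sum_{i=1}^{n} \mathbb{E}\left[Z_i^2\right] = n
$$

As a rough point of comparison, the studytime and freetime test above has 12 degrees of freedom, so a statistic near 12 would be typical if the variables were independent; the observed value was 23.53.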