<a href="https://colab.research.google.com/github/John-G-Thomas/DS-Unit-1-Sprint-2-Statistics/blob/master/module2/LS_DS_122_Chi2_Tests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lambda School Data Science Module 122
## Hypothesis Testing - Chi2 Tests

## Prepare - examine other available hypothesis tests

If you had to pick a single hypothesis test for your toolbox, the t-test would probably be the best choice, but the good news is you don't have to pick just one! Here are some of the others to be aware of:

In [1]:
import numpy as np
from scipy.stats import chisquare  # One-way chi square test

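# scipy.stats.chisquare is a one-way (goodness-of-fit) test: with axis=None it
# flattens the table and compares every count against a uniform expected count
# The null hypothesis is that all counts are equal -> low chi square
# The alternative is that the counts differ -> high chi square
# (To test independence of rows/cols, use scipy.stats.chi2_contingency instead)
# Be aware! Chi square does *not* tell you direction/causation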

ind_obs = np.array([[1, 1], [2, 2]]).T
print(ind_obs)
print(chisquare(ind_obs, axis=None))

dep_obs = np.array([[16, 18, 16, 14, 12, 12], [32, 24, 16, 28, 20, 24]]).T
print(dep_obs)
print(chisquare(dep_obs, axis=None))

[[1 2]
 [1 2]]
Power_divergenceResult(statistic=0.6666666666666666, pvalue=0.8810148425137847)
[[16 32]
 [18 24]
 [16 16]
 [14 28]
 [12 20]
 [12 24]]
Power_divergenceResult(statistic=23.31034482758621, pvalue=0.015975692534127565)
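
The cell above runs a one-way test on the flattened table. As a hedged sketch of how the same counts could be tested for independence of rows and columns instead, `scipy.stats.chi2_contingency` derives the expected counts from the row and column totals. The table is copied from `dep_obs`; no output values are claimed here.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Same counts as dep_obs above: 6 rows (categories) by 2 columns (groups)
dep_obs = np.array([[16, 18, 16, 14, 12, 12],
                    [32, 24, 16, 28, 20, 24]]).T

# Expected counts come from the row and column totals, and the null
# hypothesis is that rows and columns are independent
statistic, p_value, dof, expected = chi2_contingency(dep_obs)

print("chi2 statistic:", statistic)
print("p-value:", p_value)
print("degrees of freedom:", dof)  # (6 - 1) * (2 - 1)
```

The degrees of freedom differ between the two tests: the flattened one-way test treats the table as 12 categories, while the independence test uses (rows - 1) * (columns - 1).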


In [2]:
# Distribution tests:
# We often assume that something is normal, but it can be important to *check*

# For example, later on with predictive modeling, a typical assumption is that
# residuals (prediction errors) are normal - checking is a good diagnostic

from scipy.stats import normaltest
# Poisson models the number of arrivals in a fixed interval and is related to the binomial (coinflip)
sample = np.random.poisson(5, 1000)
print(normaltest(sample))  # Pretty clearly not normal

NormaltestResult(statistic=55.13512400708235, pvalue=1.0655159185086688e-12)


In [3]:
# Kruskal-Wallis H-test - compare the median rank between 2+ groups
# Can be applied to ranking decisions/outcomes/recommendations
# The underlying math comes from chi-square distribution, and is best for n>5
from scipy.stats import kruskal

x1 = [1, 3, 5, 7, 9]
y1 = [2, 4, 6, 8, 10]
print(kruskal(x1, y1))  # x1 is a little better, but not "significantly" so

x2 = [1, 1, 1]
y2 = [2, 2, 2]
z = [2, 2]  # Hey, a third group, and of different size!
print(kruskal(x2, y2, z))  # x clearly dominates

KruskalResult(statistic=0.2727272727272734, pvalue=0.6015081344405895)
KruskalResult(statistic=7.0, pvalue=0.0301973834223185)


And there are many more! `scipy.stats` is fairly comprehensive, though there are even more available if you delve into the extended world of statistics packages. As tests get increasingly obscure and specialized, the importance of knowing them by heart becomes small, but being able to look them up and figure them out when they *are* relevant is still important.

## T-test Assumptions

<https://statistics.laerd.com/statistical-guides/independent-t-test-statistical-guide.php>

- Independence of means

Are the means of our voting data independent (do not affect the outcome of one another)?
  
The best way to increase the likelihood of our means being independent is to randomly sample (which we did not do).


In [0]:
from scipy.stats import ttest_ind

?ttest_ind

- "Homogeneity" of Variance? 

Is the magnitude of the variance of the two groups roughly the same?

I think we're OK on this one for the voting data, although it probably could be better, since one party was larger than the other.

If we suspect this to be a problem, then we can use Welch's t-test (`ttest_ind` with `equal_var=False`).

In [0]:
?ttest_ind
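
As a minimal sketch of the difference, the example below runs both the standard and Welch versions of the test on two made-up groups. The groups, their sizes, and their spreads are hypothetical and are not the voting data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical groups with different sizes and clearly different spreads
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.5, scale=3.0, size=100)

student = ttest_ind(group_a, group_b)                 # assumes equal variances
welch = ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

When the group sizes and variances are similar the two results nearly coincide, so using Welch's version by default costs little.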

- "Dependent Variable" (sample means) are Distributed Normally

<https://stats.stackexchange.com/questions/9573/t-test-for-non-normal-when-n50>

Lots of statistical tests depend on normal distributions. We can test for normality using Scipy as was shown above.

This assumption is often taken for granted even when the evidence for it is weak. If you strongly suspect that your data are not normally distributed, you can transform the data so that they look more normal and then run your test. The problem typically matters less for large sample sizes (yay Central Limit Theorem), because the sample means become approximately normal, which is often why you don't hear it brought up. People declare the assumption to be satisfied either way.
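
A rough illustration of the Central Limit Theorem point, using a hypothetical right-skewed population (exponential) rather than any data from this notebook: the raw values are strongly skewed, while the means of repeated samples are much closer to symmetric.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# Hypothetical, strongly right-skewed population (exponential)
raw = rng.exponential(scale=1.0, size=2000)

# Means of 2000 samples, each of size 50, drawn from the same population
sample_means = rng.exponential(scale=1.0, size=(2000, 50)).mean(axis=1)

print("skewness of raw values:  ", skew(raw))
print("skewness of sample means:", skew(sample_means))
```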



## Degrees of Freedom 

<https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-degrees-of-freedom-in-statistics>

![Degrees of Freedom](https://blog.minitab.com/hubfs/Imported_Blog_Media/hats.png)

In [6]:
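# Degrees of freedom is n-1 because we've already estimated the sample mean:
# one value of our sample is "used up" in keeping the mean at a fixed value
sample_mean = 20
n = 5

known_values = [40, 10, 25, 30]  # these four are free to vary; they sum to 105

# (105 + last_num) / 5 == 20, so the last value is fully determined
last_num = sample_mean * n - sum(known_values)
sample1 = known_values + [last_num]

last_num

-5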
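
The same n-1 shows up in the sample variance. The sketch below uses five illustrative values whose mean is 20: the deviations from the mean must sum to zero, so only four of them are free, and NumPy's `ddof=1` option divides by n-1 accordingly.

```python
import numpy as np

# Five illustrative values with a mean of 20
sample = np.array([40, 10, 25, 30, -5])
n = len(sample)

# Deviations from the mean always sum to zero, so only n-1 are free to vary
deviations = sample - sample.mean()
print(deviations.sum())

# Sample variance divides by n-1; ddof=1 tells NumPy to do the same
variance_by_hand = (deviations ** 2).sum() / (n - 1)
print(variance_by_hand, np.var(sample, ddof=1))
```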

## T Statistic -> P-value

[U of Iowa T-statistic Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)

![T-statistic table](https://www.biologyforlife.com/uploads/2/2/3/9/22392738/ttable.png)
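
Instead of an applet or a printed table, the conversion can be sketched with `scipy.stats.t`. The t statistic and degrees of freedom below are hypothetical, chosen close to the tabulated two-tailed 0.05 critical value for 10 degrees of freedom.

```python
from scipy import stats

t_statistic = 2.228  # hypothetical value, near the df=10 critical value
dof = 10             # hypothetical degrees of freedom

# Two-sided p-value: probability of a t at least this extreme in either tail
p_value = 2 * stats.t.sf(abs(t_statistic), df=dof)
print(p_value)
```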

# Chi^2 Tests

##  $\chi^2$ Test for goodness of fit

(One sample chi^2 test - this will **not** be on the sprint challenge)


| Roll:     |  1  |  2  |  3  |  4  |  5  |  6  |
|-----------|-----|-----|-----|-----|-----|-----|
| Observed: |  27 | 13  |  10 | 15  | 30  |  32 |
| Expected: |  21.16 | 21.16  | 21.16  |  21.16 | 21.16  | 21.16  |

Being able to do chi^2 tests with only 1 categorical variable is **NOT** an objective of this sprint. I'm merely starting simple to introduce the concept. You will need to know the version of the chi^2 test that compares two categorical variables (test for independence).


Chi^2 tests measure the degree to which observed frequencies match expected frequencies across many categories. 

For a fair die, every category is equally likely under the null hypothesis, so the expected frequency of each category is:

\begin{align}
\frac{\text{total observations}}{\text{number of categories}}
\end{align}

In [0]:
import numpy as np

In [8]:
observed = np.array([27,13,10,15,30,32])
observed

array([27, 13, 10, 15, 30, 32])

In [9]:
n = observed.sum()
n

127

In [10]:
expected_frequency = 127/6
expected_frequency

21.166666666666668

Null Hypothesis: This is a fair die. Expected frequencies == observed frequencies

Alternative Hypothesis: This is not a fair die. Expected frequencies != observed frequencies

In [11]:
expected = np.array([21.16, 21.16, 21.16, 21.16, 21.16, 21.16])
expected

array([21.16, 21.16, 21.16, 21.16, 21.16, 21.16])

### Calculate the chi^2 statistic (test statistic)

\begin{align}
\chi^2 = \sum \frac{(observed_i-expected_i)^2}{(expected_i)}
\end{align}

In [12]:
# For cell 3 (index position 2)

observed[2] - expected[2]

-11.16

In [13]:
observed - expected

array([  5.84,  -8.16, -11.16,  -6.16,   8.84,  10.84])

The squared term makes all the values positive and emphasizes places where we saw a large deviation between observed and expected frequencies.

In [14]:
(observed - expected)**2

array([ 34.1056,  66.5856, 124.5456,  37.9456,  78.1456, 117.5056])

In [0]:
# Dividing by the expected frequency scales each squared deviation (smaller values)

In [15]:
(observed - expected)**2 / expected

array([1.61179584, 3.14676749, 5.88589792, 1.79327032, 3.69308129,
       5.55319471])

In [0]:
# Summarizes the difference between all observed and expected values

In [16]:
chi2 = ((observed - expected)**2 / expected).sum()
chi2

21.684007561436673

Based on a chi^2 statistic of 21.68 (with 5 degrees of freedom) and a p-value of .0006, I reject the null hypothesis that the observed frequencies are equal to the expected frequencies (that this is a fair die) and suggest the alternative, that this is an unfair die.
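
One way to obtain the p-value quoted above is sketched here with `scipy.stats`. Note that `chisquare` uses the exact expected frequency 127/6 rather than the rounded 21.16, so its statistic differs slightly from the hand calculation.

```python
import numpy as np
from scipy import stats

observed = np.array([27, 13, 10, 15, 30, 32])

# With no f_exp argument, chisquare assumes equal expected frequencies
result = stats.chisquare(observed)
print(result.statistic, result.pvalue)

# The same p-value from the chi^2 distribution's survival function
dof = len(observed) - 1  # number of categories - 1
p_value = stats.chi2.sf(result.statistic, df=dof)
print(p_value)
```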

## $\chi^2$ Test for independence

(two sample chi^2 test)

<https://en.wikipedia.org/wiki/Chi-squared_test>

We'll use this dataset of student performance from UCI as it has a lot of good variables to use: 

<https://archive.ics.uci.edu/ml/datasets/Student+Performance>

In [17]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip"

--2020-06-09 17:53:24--  https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20478 (20K) [application/x-httpd-php]
Saving to: ‘student.zip.1’


2020-06-09 17:53:24 (154 KB/s) - ‘student.zip.1’ saved [20478/20478]



In [21]:
!unzip -o student.zip


In [0]:
from scipy import stats


In [20]:
import pandas as pd
df = pd.read_csv('student-mat.csv', sep=';')

print(df.shape)
df.head()

(395, 33)


Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10


In [22]:
df['failures'].value_counts()

0    312
1     50
2     17
3     16
Name: failures, dtype: int64

In [23]:
df['studytime'].value_counts()

2    198
1    105
3     65
4     27
Name: studytime, dtype: int64

In [24]:
test_contingency = pd.crosstab(df['failures'], df['studytime'])
test_contingency

| failures \ studytime |  1  |  2  |  3  |  4  |
|----------------------|-----|-----|-----|-----|
| 0                    |  74 | 158 |  54 |  26 |
| 1                    |  16 |  26 |   7 |   1 |
| 2                    |   6 |   7 |   4 |   0 |
| 3                    |   9 |   7 |   0 |   0 |


In [0]:
# If the p-value is above .05, we don't have enough evidence that these are related. They "might" be independent

In [0]:
# The p-value has to be below .05 for us to conclude that they are related

In [25]:
chi2, p_value, dof, expected = stats.chi2_contingency(test_contingency)

print("chi2 statistic", chi2)
print("p value", p_value)
print("degrees of freedom",dof)
print("expected frequencies table", expected)

chi2 statistic 16.21199080868576
p value 0.06258448399974005
degrees of freedom 9
expected frequencies table [[ 82.93670886 156.39493671  51.34177215  21.32658228]
 [ 13.29113924  25.06329114   8.2278481    3.41772152]
 [  4.51898734   8.52151899   2.79746835   1.16202532]
 [  4.25316456   8.02025316   2.63291139   1.09367089]]


In [26]:
df['sex'].value_counts()

F    208
M    187
Name: sex, dtype: int64

In [27]:
df['internet'].value_counts()

yes    329
no      66
Name: internet, dtype: int64

In [28]:
contingency = pd.crosstab(df['sex'], df['internet'])
contingency

| sex \ internet | no  | yes |
|----------------|-----|-----|
| F              |  38 | 170 |
| M              |  28 | 159 |


In [29]:
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print("chi2 statistic", chi2)
print("p value", p_value)
print("degrees of freedom",dof)
print("expected frequencies table", expected)

chi2 statistic 0.5500617337279294
p value 0.45829247086513125
degrees of freedom 1
expected frequencies table [[ 34.75443038 173.24556962]
 [ 31.24556962 155.75443038]]


In [0]:
# Females slightly outnumber males, so we'd expect slightly higher counts in the F row just by eye. The high p-value is consistent with the null hypothesis

Null Hypothesis: These two variables are "independent": there's no relationship between the two.

Alternative Hypothesis: The two variables are "dependent": there is a relationship or association between them.

## Run a $\chi^{2}$ Test "by hand" (Using Numpy)

In [30]:
df['sex'].value_counts()

F    208
M    187
Name: sex, dtype: int64

In [31]:
df['internet'].value_counts()

yes    329
no      66
Name: internet, dtype: int64

In [32]:
observed = pd.crosstab(df['sex'], df['internet'])
observed

| sex \ internet | no  | yes |
|----------------|-----|-----|
| F              |  38 | 170 |
| M              |  28 | 159 |


## Expected Value Calculation
\begin{align}
expected_{i,j} =\frac{(row_{i} \text{total})(column_{j} \text{total}) }{(\text{total observations})}  
\end{align}
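
The formula can be applied to every cell at once. This is a sketch using `np.outer` on the row and column totals, with the observed counts from the sex/internet crosstab typed in by hand so the block runs on its own.

```python
import numpy as np

# Observed counts from the sex (rows: F, M) by internet (cols: no, yes) crosstab
observed = np.array([[38, 170],
                     [28, 159]])

row_totals = observed.sum(axis=1)  # [208, 187]
col_totals = observed.sum(axis=0)  # [ 66, 329]
total = observed.sum()             # 395

# expected[i, j] = row_totals[i] * col_totals[j] / total
expected = np.outer(row_totals, col_totals) / total
print(expected)
```

Taking the totals directly from the observed table keeps the rows and columns in the same order as the crosstab, so no reordering is needed afterwards.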

In [33]:
observed_with_margins = pd.crosstab(df['sex'], df['internet'], margins = True)
observed_with_margins

| sex \ internet | no  | yes | All |
|----------------|-----|-----|-----|
| F              |  38 | 170 | 208 |
| M              |  28 | 159 | 187 |
| All            |  66 | 329 | 395 |


In [0]:
# Make an expected value table that matches the size and shape of our observed value table

In [34]:
observed.values

array([[ 38, 170],
       [ 28, 159]])

In [46]:
observed = observed.values
observed

array([[ 38, 170],
       [ 28, 159]])

In [35]:
# (row_total)(column_total) / sample_size

expected_row1_col1 = 208*66/395

expected_row1_col1

34.754430379746836

In [36]:
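# Note: 'internet' is the column variable of the crosstab, so these are really
# the column totals, and value_counts() orders them yes, no (not no, yes)
row_totals  = df['internet'].value_counts().values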

row_totals

array([329,  66])

In [38]:
# Note: 'sex' is the row variable of the crosstab, so these are really the row totals
col_totals  = df['sex'].value_counts().values

col_totals

array([208, 187])

In [39]:
sample_size = df.shape[0]

sample_size

395

In [44]:
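# Note: this array has an integer dtype, so each expected value is truncated
# to a whole number when stored (pass dtype=float to keep the decimals)
expected = np.array([[0,0],
                     [0,0]])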

for i, row in enumerate(row_totals): 
  for j, col in enumerate(col_totals):
    expected_value = (row*col / sample_size)
    expected[j][i] = expected_value

expected

array([[173,  34],
       [155,  31]])

In [54]:
## Swap column positions so the columns are ordered no, yes like the observed table
# my_array = my_array[:, [1, 0]]
expected = expected[:, [1, 0]]

expected

array([[ 34, 173],
       [ 31, 155]])

In [47]:
observed

array([[ 38, 170],
       [ 28, 159]])

### Calculate the chi^2 statistic (test statistic)

\begin{align}
\chi^2 = \sum \frac{(observed_i-expected_i)^2}{(expected_i)}
\end{align}

In [55]:
((observed - expected)**2 / expected)

array([[0.47058824, 0.05202312],
       [0.29032258, 0.10322581]])

In [56]:
((observed - expected)**2 / expected).sum()

0.9161597437781752
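
The `expected` array in the cells above was created from integer zeros, so NumPy truncated each expected value to a whole number, and the statistic shown reflects those truncated values. A sketch of the same by-hand calculation with floating-point expected values (observed counts typed in by hand) looks like this:

```python
import numpy as np

observed = np.array([[38, 170],
                     [28, 159]])

# Floating-point expected values: (row total * column total) / sample size
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Sum of (observed - expected)^2 / expected over all four cells
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()
print(chi2_by_hand)
```

The untruncated statistic comes out at roughly 0.77 rather than 0.92. With 1 degree of freedom, both values are far too small to reject the null hypothesis.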

Degrees of freedom are calculated differently in the 2-variable chi^2 test (test for independence):

1-sample (goodness of fit), DOF = # categories-1

2-sample (test for independence), DOF = (# rows_crosstab-1)*(# cols_crosstab-1)


DOF: (2-1)*(2-1) = 1

Use the chi^2 statistic to get to a p-value:

[U Iowa chi^2 applet](https://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html)
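
The applet lookup can also be done in code. As a minimal sketch, the survival function of `scipy.stats.chi2` gives the upper-tail area, here for the statistic computed by hand above with 1 degree of freedom:

```python
from scipy import stats

chi2_statistic = 0.9161597437781752  # value from the by-hand calculation above
dof = 1                              # (2 - 1) * (2 - 1) for a 2x2 crosstab

# Upper-tail area of the chi^2 distribution beyond the observed statistic
p_value = stats.chi2.sf(chi2_statistic, df=dof)
print(p_value)
```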

In [0]:
# Conclusion

Conclusion: Based on a chi2 statistic of .916 and a p-value of .3385, we fail to reject the null hypothesis that sex and internet access at home are independent.

## Run a $\chi^{2}$ Test using Scipy
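
A hedged sketch of the SciPy version, with the sex/internet counts typed in by hand so the block is self-contained. For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, which is why its statistic is smaller than an uncorrected by-hand calculation; passing `correction=False` turns the correction off.

```python
import numpy as np
from scipy import stats

# Sex (rows: F, M) by internet access (cols: no, yes)
contingency = np.array([[38, 170],
                        [28, 159]])

# Default: Yates' continuity correction is applied when dof == 1
chi2_corrected, p_corrected, dof, expected = stats.chi2_contingency(contingency)

# Without the correction, matching the plain sum((O - E)^2 / E) formula
chi2_plain, p_plain, _, _ = stats.chi2_contingency(contingency, correction=False)

print("corrected:  ", chi2_corrected, p_corrected)
print("uncorrected:", chi2_plain, p_plain)
print("degrees of freedom:", dof)
```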

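1) Null Hypothesis: Sex and internet access at home are independent.

2) Alternative Hypothesis: Sex and internet access at home are dependent (there is an association between them).

3) Confidence Level: 95%

Conclusion: Based on the chi2 statistic of 0.55 and the p-value of 0.458 that `stats.chi2_contingency` reported for the sex/internet crosstab, we fail to reject the null hypothesis that sex and internet access at home are independent.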