
# Lambda School Data Science Module 122
## Hypothesis Testing - Chi2 Tests

## Prepare - examine other available hypothesis tests

If you had to pick a single hypothesis test for your toolbox, the t-test would probably be the best choice, but the good news is you don't have to pick just one! Here are some of the others to be aware of:

In [None]:
import numpy as np
from scipy.stats import chisquare  # One-way chi square test

# With axis=None, chisquare flattens the table and runs a one-way
# (goodness-of-fit) test: the null hypothesis is that every cell is equally
# likely. Counts close to uniform -> low chi square; far from it -> high.
# (To test the independence of rows/cols, use chi2_contingency instead.)
# Be aware! Chi square does *not* tell you direction/causation

ind_obs = np.array([[1, 1], [2, 2]]).T
print(ind_obs)
print(chisquare(ind_obs, axis=None))

dep_obs = np.array([[16, 18, 16, 14, 12, 12], [32, 24, 16, 28, 20, 24]]).T
print(dep_obs)
print(chisquare(dep_obs, axis=None))

[[1 2]
 [1 2]]
Power_divergenceResult(statistic=0.6666666666666666, pvalue=0.8810148425137847)
[[16 32]
 [18 24]
 [16 16]
 [14 28]
 [12 20]
 [12 24]]
Power_divergenceResult(statistic=23.31034482758621, pvalue=0.015975692534127565)
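
`chisquare` is the one-way test, so the results above compare every cell against a single uniform expected count. A minimal sketch of the two-way test for independence on the same two tables, assuming `scipy.stats.chi2_contingency` (not imported in the cell above):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Same tables as in the cell above
ind_obs = np.array([[1, 1], [2, 2]]).T
dep_obs = np.array([[16, 18, 16, 14, 12, 12], [32, 24, 16, 28, 20, 24]]).T

# chi2_contingency derives the expected counts from the row and column totals
chi2_ind, p_ind, dof_ind, expected_ind = chi2_contingency(ind_obs)
chi2_dep, p_dep, dof_dep, expected_dep = chi2_contingency(dep_obs)

# Both rows of ind_obs are [1 2], so observed equals expected exactly
print(chi2_ind, p_ind, dof_ind)
print(chi2_dep, p_dep, dof_dep)
```

Here the degrees of freedom are (rows - 1) * (columns - 1), so 1 for the first table and 5 for the second.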


In [None]:
# Distribution tests:
# We often assume that something is normal, but it can be important to *check*

# For example, later on with predictive modeling, a typical assumption is that
# residuals (prediction errors) are normal - checking is a good diagnostic

from scipy.stats import normaltest
# Poisson models the number of arrivals in a fixed interval and is related to the binomial (coinflip)
sample = np.random.poisson(5, 1000)
print(normaltest(sample))  # Pretty clearly not normal

NormaltestResult(statistic=38.69323106073592, pvalue=3.961609200867749e-09)


In [None]:
# Kruskal-Wallis H-test - compare the median rank between 2+ groups
# Can be applied to ranking decisions/outcomes/recommendations
# The underlying math comes from chi-square distribution, and is best for n>5
from scipy.stats import kruskal

x1 = [1, 3, 5, 7, 9]
y1 = [2, 4, 6, 8, 10]
print(kruskal(x1, y1))  # x1 is a little better, but not "significantly" so

x2 = [1, 1, 1]
y2 = [2, 2, 2]
z = [2, 2]  # Hey, a third group, and of different size!
print(kruskal(x2, y2, z))  # x2 clearly ranks lower than the other groups

KruskalResult(statistic=0.2727272727272734, pvalue=0.6015081344405895)
KruskalResult(statistic=7.0, pvalue=0.0301973834223185)


And there are many more! `scipy.stats` is fairly comprehensive, though there are even more available if you delve into the extended world of statistics packages. As tests get increasingly obscure and specialized, knowing them by heart matters less, but being able to look them up and figure them out when they *are* relevant is still important.

## T-test Assumptions

<https://statistics.laerd.com/statistical-guides/independent-t-test-statistical-guide.php>

- Independence of observations

Are the observations in our voting data independent (does one observation not affect the outcome of another)?

The best way to increase the likelihood that our observations are independent is to sample randomly (which we did not do).


In [None]:
from scipy.stats import ttest_ind

?ttest_ind

- "Homogeneity" of Variance? 

Is the variance of the two groups roughly the same?

I think we're OK on this one for the voting data, although it could probably be better: one party was larger than the other.

If we suspect this to be a problem, we can use Welch's t-test, which does not assume equal variances (`equal_var=False` in `ttest_ind`).

In [None]:
?ttest_ind
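
A minimal sketch of the difference between the two tests, using made-up samples (not the voting data) with unequal sizes and variances. The group names and parameters are illustrative assumptions. `equal_var=False` is the `ttest_ind` argument that switches from Student's t-test to Welch's:

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups: same mean, but different spread and different size
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=40)
group_b = rng.normal(loc=0.0, scale=3.0, size=15)

# Student's t-test pools the two variances (assumes they are equal)
student = ttest_ind(group_a, group_b)

# Welch's t-test estimates each group's variance separately
welch = ttest_ind(group_a, group_b, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

When group sizes and variances both differ, the two tests generally give different statistics and p-values, and Welch's version is the safer default.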

- "Dependent Variable" (sample means) are Distributed Normally

<https://stats.stackexchange.com/questions/9573/t-test-for-non-normal-when-n50>

Lots of statistical tests depend on normal distributions. We can test for normality using SciPy, as shown above with `normaltest`.

This assumption is often taken for granted, even when the evidence for it is weak. If you strongly suspect that your data are not normally distributed, you can transform them (a log transform, for example) to look more normal and then run your test. The problem typically matters less for large sample sizes (yay Central Limit Theorem), which is often why you don't hear it brought up. People declare the assumption to be satisfied either way.
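
A rough sketch of the "transform, then re-check" idea on synthetic data. The lognormal sample below is an assumption chosen because its logarithm is normal by construction; real data will rarely behave this neatly:

```python
import numpy as np
from scipy.stats import normaltest

# Synthetic right-skewed sample (hypothetical data, not from this notebook)
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Test the raw values, then the log-transformed values
raw = normaltest(skewed)
logged = normaltest(np.log(skewed))

print("raw p-value:", raw.pvalue)     # tiny: normality is rejected
print("log p-value:", logged.pvalue)  # much larger after the transform
```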



## T Statistic -> P-value

[U of Iowa T-statistic Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)

![T-statistic table](https://www.biologyforlife.com/uploads/2/2/3/9/22392738/ttable.png)

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

In [None]:
column_headers = ['symboling', 'normalized-losses', 'make', 'fuel-type', 
                  'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 
                  'engine-location', 'wheel-base', 'length', 'width', 'height', 
                  'curb-weight', 'engine-type', 'num-of-cylinders', 
                  'engine-size', 'fuel-system', 'bore', 'stroke', 
                  'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 
                  'highway-mpg', 'price']

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', 
                 names=column_headers, 
                 na_values='?')

print(df.shape)
df.head()

(205, 26)


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [None]:
df.describe()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,205.0,164.0,205.0,205.0,205.0,205.0,205.0,205.0,201.0,201.0,205.0,203.0,203.0,205.0,205.0,201.0
mean,0.834146,122.0,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,3.329751,3.255423,10.142537,104.256158,5125.369458,25.219512,30.75122,13207.129353
std,1.245307,35.442168,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,0.273539,0.316717,3.97204,39.714369,479.33456,6.542142,6.886443,7947.066342
min,-2.0,65.0,86.6,141.1,60.3,47.8,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,0.0,94.0,94.5,166.3,64.1,52.0,2145.0,97.0,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7775.0
50%,1.0,115.0,97.0,173.2,65.5,54.1,2414.0,120.0,3.31,3.29,9.0,95.0,5200.0,24.0,30.0,10295.0
75%,2.0,150.0,102.4,183.1,66.9,55.5,2935.0,141.0,3.59,3.41,9.4,116.0,5500.0,30.0,34.0,16500.0
max,3.0,256.0,120.9,208.1,72.3,59.8,4066.0,326.0,3.94,4.17,23.0,288.0,6600.0,49.0,54.0,45400.0


In [None]:
# numeric, categorical variable
df['symboling'].value_counts()

 0    67
 1    54
 2    32
 3    27
-1    22
-2     3
Name: symboling, dtype: int64

In [None]:
# numeric, continuous variable
df['price'].value_counts()

8495.0     2
18150.0    2
7295.0     2
6229.0     2
8845.0     2
          ..
15580.0    1
6377.0     1
30760.0    1
16925.0    1
18920.0    1
Name: price, Length: 186, dtype: int64

In [None]:
sample = df.sample(20, random_state=30)

1) Null Hypothesis:

Highway Miles Per Gallon is equal to 30

2) Alternative Hypothesis:

Highway Miles Per Gallon is not equal to 30

3) Confidence Level: 95%

In [None]:
stats.ttest_1samp(df['highway-mpg'], 30)

Ttest_1sampResult(statistic=1.561884175957824, pvalue=0.11986523177827152)

4) Conclusion:

Based on a t-statistic of 1.56 and a p-value of 0.12, which is greater than 0.05, we (reject/**fail to reject**) the null hypothesis that highway miles per gallon is equal to 30.

![t-statistic equation](https://www.ahajournals.org/cms/asset/850f8023-e028-4694-a946-bbdbdaa9009b/15mm6.jpg)
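
As a check on the one-sample result, here is a sketch that rebuilds the t-statistic and p-value from summary statistics alone. The mean, standard deviation, and count are copied from the `df.describe()` output for `highway-mpg`; the variable names are illustrative:

```python
import numpy as np
from scipy import stats

# Summary statistics for highway-mpg, copied from the describe() output
n = 205
sample_mean = 30.75122
sample_std = 6.886443  # pandas reports the sample standard deviation (ddof=1)
null_mean = 30

# t = (sample mean - hypothesized mean) / standard error of the mean
standard_error = sample_std / np.sqrt(n)
t_stat = (sample_mean - null_mean) / standard_error

# Two-sided p-value: area in both tails beyond |t|, with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(t_stat, p_value)
```

Both values should agree with the `ttest_1samp` result above to within rounding of the summary statistics.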

## Degrees of Freedom 

<https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-degrees-of-freedom-in-statistics>

![Degrees of Freedom](https://blog.minitab.com/hubfs/Imported_Blog_Media/hats.png)

In [None]:
# Constraints
sample_size = 5
sample_mean = 10

sample = [3, 6, 12, 19]  # four values chosen freely; the fifth, x, is fixed by the mean

In [None]:
# The mean constraint fixes the total: 3 + 6 + 12 + 19 + x = 5 * 10 = 50

In [None]:
50 -(3+6+12+19)

10

# Chi^2 Tests

WE ONLY USE CATEGORICAL VARIABLES FOR CHI^2 TESTS


##  $\chi^2$ Test for goodness of fit

(One sample chi^2 test - this will **not** be on the sprint challenge)


| Roll:     |   1   |   2   |   3   |   4   |   5   |   6   |
|-----------|-------|-------|-------|-------|-------|-------|
| Observed: |  27   |  13   |  10   |  15   |  30   |  32   |
| Expected: | 21.17 | 21.17 | 21.17 | 21.17 | 21.17 | 21.17 |

Being able to do chi^2 tests with only 1 categorical variable is **NOT** an objective of this sprint. I'm merely starting simple to introduce the concept. You will need to know the version of the chi^2 test that compares two categorical variables (test for independence).


Chi^2 tests measure the degree to which observed frequencies match expected frequencies across many categories. 

For a fair die, where every category is equally likely, the expected frequency of each category is:

\begin{align}
\frac{\text{total observations}}{\text{number of categories}}
\end{align}

In [None]:
(27+13+10+15+30+32) / 6

21.166666666666668

In [None]:
observed = np.array([27,13,10,15,30,32])

In [None]:
# expected is the average of the observed frequencies
expected_frequency = observed.sum()/len(observed)

expected_frequency

21.166666666666668

My expected frequencies represent my null hypothesis.

That's how I would have expected the die rolls to turn out in a perfect world.

Null Hypothesis: 

This is a fair die.

How close are the observed frequencies (sample), to the expected frequencies?

### Why a NumPy array?

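They support element-wise arithmetic and something cool called "Array Broadcasting": operations are applied position by position, without writing a loop.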

In [None]:
# Python Lists
a = [1,2,3,4]
b = [6,3,4,7]

In [None]:
# concatenation of Python List
a+b

[1, 2, 3, 4, 6, 3, 4, 7]

In [None]:
a-b

TypeError: unsupported operand type(s) for -: 'list' and 'list'

In [None]:
# Numpy Arrays

np_a = np.array(a)
np_b = np.array(b)

print(np_a)
print(np_b)

[1 2 3 4]
[6 3 4 7]


In [None]:
# added up "element-wise" based on the position of the item in the list
np_a + np_b

array([ 7,  5,  7, 11])

In [None]:
np_a - np_b

array([-5, -1, -1, -3])

In [None]:
np_a * np_b

array([ 6,  6, 12, 28])

In [None]:
np_a / np_b

array([0.16666667, 0.66666667, 0.75      , 0.57142857])

In [None]:
df = pd.DataFrame({'a': [1,2,3,4], 'b': [6,3,4,7]})

df.head()

|   | a | b |
|---|---|---|
| 0 | 1 | 6 |
| 1 | 2 | 3 |
| 2 | 3 | 4 |
| 3 | 4 | 7 |


In [None]:
df['a']

0    1
1    2
2    3
3    4
Name: a, dtype: int64

In [None]:
df['a'].values + df['b'].values

array([ 7,  5,  7, 11])

In [None]:
# pandas columns are backed by NumPy arrays, so column arithmetic is element-wise too
df['c'] = df['a'] + df['b']

In [None]:
df.head()

|   | a | b | c  |
|---|---|---|----|
| 0 | 1 | 6 | 7  |
| 1 | 2 | 3 | 5  |
| 2 | 3 | 4 | 7  |
| 3 | 4 | 7 | 11 |
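
The examples above all combine arrays of the same shape. Strictly speaking, "broadcasting" is what NumPy does when the shapes differ. A small sketch, with illustrative names and values that are not part of the lecture data except for the die-roll counts:

```python
import numpy as np

rolls = np.array([27, 13, 10, 15, 30, 32])

# A scalar is stretched to match the array's shape
deviations = rolls - 21.0  # 21.0 is an arbitrary illustrative value

# A (4, 1) column and a (3,) row broadcast to a (4, 3) grid
column = np.array([[1], [2], [3], [4]])
row = np.array([10, 20, 30])
grid = column * row

print(deviations)
print(grid.shape)  # (4, 3)
```

This is why an expression such as `observed - expected` works whether `expected` is a full array or a single number.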


### Calculate the chi^2 statistic (test statistic)

\begin{align}
\chi^2 = \sum \frac{(observed_i-expected_i)^2}{(expected_i)}
\end{align}

In [None]:
observed

array([27, 13, 10, 15, 30, 32])

In [None]:
# one expected frequency per category, all equal under the null hypothesis
expected = np.full(len(observed), expected_frequency)

expected

array([21.16666667, 21.16666667, 21.16666667, 21.16666667, 21.16666667,
       21.16666667])

Squared term makes all the values positive and emphasizes places where we saw a large deviation between observed and expected frequencies.

In [None]:
(observed[0] - expected[0])**2 / expected[0]

1.6076115485564297

In [None]:
(observed[1] - expected[1])**2 / expected[1]

3.150918635170604

In [None]:
(observed[2] - expected[2])**2 / expected[2]

5.891076115485565

In [None]:
# The whole formula in one line, thanks to NumPy element-wise arithmetic
chi2 = ((observed - expected)**2 / expected).sum()

chi2

21.67716535433071

In [None]:
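# p-value: area to the right of our statistic under a chi^2 distribution with (6 - 1) degrees of freedom
stats.chi2.sf(chi2, df=len(observed) - 1)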

Degrees of Freedom: 

Chi^2 test: DOF  = # of categories - 1

T-test: DOF = Sample Size - 1 

Conclusion: 

Based on a chi^2 statistic of 21.677 and a p-value of about 0.0006, we reject the null hypothesis that this is a fair die.
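
The whole goodness-of-fit test can also be run in one call. A minimal sketch using `scipy.stats.chisquare`, which assumes equally likely categories when no expected frequencies are passed:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([27, 13, 10, 15, 30, 32])

# With no f_exp argument, every category gets the same expected frequency
result = chisquare(observed)

print(result.statistic)  # matches the 21.677 computed by hand
print(result.pvalue)
```

The degrees of freedom default to the number of categories minus 1, the same rule stated above.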

## $\chi^2$ Test for independence

(two sample chi^2 test)

<https://en.wikipedia.org/wiki/Chi-squared_test>

We'll use this dataset of student performance from UCI as it has a lot of good variables to use: 

<https://archive.ics.uci.edu/ml/datasets/Student+Performance>

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip

--2020-08-11 17:49:44--  https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20478 (20K) [application/x-httpd-php]
Saving to: ‘student.zip’


2020-08-11 17:49:44 (688 KB/s) - ‘student.zip’ saved [20478/20478]



In [None]:
!unzip student.zip

Archive:  student.zip
  inflating: student-mat.csv         
  inflating: student-por.csv         
  inflating: student-merge.R         
  inflating: student.txt             


In [None]:
# The archive unzips into the current directory, so no `cd` is needed.
# Peek at the raw file to confirm that it is semicolon-delimited.
!head -n 3 student-mat.csv


In [None]:
df = pd.read_csv('student-mat.csv', sep=";")

In [None]:
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10


1) Null Hypothesis:

The two categorical variables are independent

2) Alternative Hypothesis:

The two categorical variables are dependent


In [None]:
contingency = pd.crosstab(df['failures'], df['studytime'])

contingency

| failures (rows) / studytime (columns) | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | 74 | 158 | 54 | 26 |
| 1 | 16 | 26 | 7 | 1 |
| 2 | 6 | 7 | 4 | 0 |
| 3 | 9 | 7 | 0 | 0 |


In [None]:
# include the row and column totals "margins"
observed_with_margins = pd.crosstab(df['failures'], df['studytime'], margins=True)

observed_with_margins

| failures (rows) / studytime (columns) | 1 | 2 | 3 | 4 | All |
|---|---|---|---|---|---|
| 0 | 74 | 158 | 54 | 26 | 312 |
| 1 | 16 | 26 | 7 | 1 | 50 |
| 2 | 6 | 7 | 4 | 0 | 17 |
| 3 | 9 | 7 | 0 | 0 | 16 |
| All | 105 | 198 | 65 | 27 | 395 |


In [None]:
# unpacking
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print("chi2:", chi2)
print("p value:", p_value)
print("dof:", dof)
print("expected_frequencies: \n", expected)

chi2: 16.21199080868576
p value: 0.06258448399974005
dof: 9
expected_frequencies: 
 [[ 82.93670886 156.39493671  51.34177215  21.32658228]
 [ 13.29113924  25.06329114   8.2278481    3.41772152]
 [  4.51898734   8.52151899   2.79746835   1.16202532]
 [  4.25316456   8.02025316   2.63291139   1.09367089]]


Conclusion: 

Based on a chi2 statistic of 16.21 and a p-value of 0.06, which is greater than 0.05, we (reject/**fail to reject**) the null hypothesis that study time and failures are independent.

## Run a $\chi^{2}$ Test "by hand" (Using Numpy)

In [None]:
312 * 105 / 395

82.9367088607595

In [None]:
observed_with_margins

| failures (rows) / studytime (columns) | 1 | 2 | 3 | 4 | All |
|---|---|---|---|---|---|
| 0 | 74 | 158 | 54 | 26 | 312 |
| 1 | 16 | 26 | 7 | 1 | 50 |
| 2 | 6 | 7 | 4 | 0 | 17 |
| 3 | 9 | 7 | 0 | 0 | 16 |
| All | 105 | 198 | 65 | 27 | 395 |


## Expected Value Calculation
\begin{align}
expected_{i,j} =\frac{(row_{i} \text{total})(column_{j} \text{total}) }{(\text{total observations})}  
\end{align}

In [None]:
# .values to turn our pandas output into a numpy array
# sort_index() so the totals follow the crosstab's row order (0, 1, 2, 3)
row_totals = df['failures'].value_counts().sort_index().values

row_totals

array([312,  50,  17,  16])

In [None]:
column_totals = df['studytime'].value_counts().sort_index().values

column_totals

array([105, 198,  65,  27])

In [None]:
sample_size = df.shape[0]

sample_size

395

In [None]:
# placeholder values that we're going to fill with expected frequencies
expected = np.array([[0.0,0.0,0.0,0.0],
                     [0.0,0.0,0.0,0.0],
                     [0.0,0.0,0.0,0.0],
                     [0.0,0.0,0.0,0.0]])

for row_index, row in enumerate(row_totals):
  for column_index, col in enumerate(column_totals):
    # calculate the expected value
    expected_value = (row*col)/sample_size
    # put that value into the expected value table
    # at the correct location.
    expected[row_index][column_index] = expected_value
    # print(expected_value, row_index, column_index)

expected

array([[ 82.93670886, 156.39493671,  51.34177215,  21.32658228],
       [ 13.29113924,  25.06329114,   8.2278481 ,   3.41772152],
       [  4.51898734,   8.52151899,   2.79746835,   1.16202532],
       [  4.25316456,   8.02025316,   2.63291139,   1.09367089]])
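
The nested loop spells out the formula one cell at a time. As an alternative sketch, the same table can be built in one step with `np.outer`; the totals below are hard-coded from the crosstab margins so the example runs on its own:

```python
import numpy as np

# Totals copied from the crosstab margins
row_totals = np.array([312, 50, 17, 16])
column_totals = np.array([105, 198, 65, 27])
sample_size = 395

# Outer product: every row total times every column total, then divide by n
expected = np.outer(row_totals, column_totals) / sample_size

print(expected)
```

The expected frequencies always sum to the sample size, which is a quick sanity check on the result.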

In [None]:
observed = contingency.values

### Calculate the chi^2 statistic (test statistic)

\begin{align}
\chi^2 = \sum_{i,j} \frac{(observed_{i,j}-expected_{i,j})^2}{expected_{i,j}}
\end{align}

In [None]:
chi2 = ((observed - expected)**2 / expected).sum()

chi2

16.21199080868576

Degrees of freedom are calculated differently in the 2-variable chi^2 test (test for independence).

1-sample (goodness of fit), DOF = # categories-1

2-sample (test for independence), DOF = (# rows_crosstab-1)*(# cols_crosstab-1)

or

DOF = (# categories of var1 - 1)*(# categories of var2 - 1)

DOF: (4 - 1)*(4 - 1) = 9

Use the chi^2 statistic to get to a p-value:

[U Iowa chi^2 applet](https://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html)
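
The same lookup can be done in code. A minimal sketch using the survival function of SciPy's chi^2 distribution, with the statistic and degrees of freedom from the by-hand calculation (the variable names are illustrative):

```python
from scipy import stats

chi2_stat = 16.21199080868576  # statistic computed by hand above
dof = (4 - 1) * (4 - 1)        # 4 failure levels x 4 study-time levels

# Survival function = 1 - CDF = area to the right of the statistic
p_value = stats.chi2.sf(chi2_stat, df=dof)

print(p_value)  # agrees with the chi2_contingency p-value of about 0.0626
```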

## Run a $\chi^{2}$ Test using Scipy

1) Null Hypothesis:

2) Alternative Hypothesis:

3) Confidence Level: 

Conclusion: