### Task 2: 
***
November 2nd, 2020 : The Chi-squared test for independence is a statistical hypothesis test like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia article gives the table below as an example [4], stating the Chi-squared value based on it is approximately 24.6. Use scipy.stats to verify this value and calculate the associated p value. You should include a short note with references justifying your analysis in a markdown cell.
     
      
|              | A   | B   | C   | D   | Total |
| :----------- | --: | --: | --: | --: | ----: |
| White collar | 90  | 60  | 104 | 95  | 349   |
| Blue collar  | 30  | 50  | 51  | 20  | 151   |
| No collar    | 30  | 40  | 45  | 35  | 150   |
| Total        | 150 | 150 | 200 | 150 | 650   |


In [1]:
# Import libraries.
import numpy as np
# Data frames.
import pandas as pd
# Alternative statistics package.
import statsmodels.stats.weightstats as stat
import scipy.stats as ss   
from scipy.stats import chi2

In [2]:
# initialise data of lists. 
data = {'A':[90, 30, 30, ], 'B':[60, 50, 40, ], 'C':[104, 51, 45, ],'D':[95, 20, 35, ], } 
  
# Creates pandas DataFrame. 
df = pd.DataFrame(data, index =['White Collar', 'Blue Collar', 'No Collar']) 
  
# print the data 
df 

Unnamed: 0,A,B,C,D
White Collar,90,60,104,95
Blue Collar,30,50,51,20
No Collar,30,40,45,35


In [3]:
# Show all the values in the DataFrame above as a NumPy array.
df.values

array([[ 90,  60, 104,  95],
       [ 30,  50,  51,  20],
       [ 30,  40,  45,  35]], dtype=int64)

In [4]:
# Observed Values Code adapted [2.1]
Observed_Val = df.values 
print("Observed Values :-\n",Observed_Val)

Observed Values :-
 [[ 90  60 104  95]
 [ 30  50  51  20]
 [ 30  40  45  35]]


In [5]:
# scipy stats function to calculate chi square and p value [2.2]
values=ss.chi2_contingency(df) 

In [6]:
values

(24.5712028585826,
 0.0004098425861096696,
 6,
 array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
        [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
        [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]]))

The function above returns the chi-square value, the p value, the degrees of freedom and an array of the expected values. I have calculated the same figures manually below to see how they are arrived at, whether they are accurate and what they mean in relation to the information in the data frame.
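As a small aside, the four results of `chi2_contingency` can be unpacked into named variables, which avoids indexing by position later on. The sketch below rebuilds the table as a NumPy array so that it runs on its own; the variable names (`chi2_stat`, `p_val`, `dof`, `expected`) are my own choice and not part of the scipy API. Yates' continuity correction, which `chi2_contingency` applies by default, only affects tables with one degree of freedom, so it plays no role here.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows are collar types, columns are areas A to D.
observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])

# Unpack the statistic, p value, degrees of freedom and expected counts.
chi2_stat, p_val, dof, expected = chi2_contingency(observed)

print(f"chi-squared = {chi2_stat:.4f}")
print(f"p value     = {p_val:.6f}")
print(f"dof         = {dof}")
```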

In [7]:
# Double-check the calculation of the degrees of freedom.
# Number of rows and columns in the table.
norows, nocolumns = df.shape
# Degrees of freedom = (number of rows - 1) * (number of columns - 1).
ddof=(norows-1)*(nocolumns-1)
print("Degrees of Freedom:-", ddof)
# Set the significance level (alpha).
alpha = 0.05

Degrees of Freedom:- 6


In [8]:
# Expected values returned by ss.chi2_contingency.
Expected_Val=values[3]

# Take the chi-square value from ss.chi2_contingency rather than retyping it.
chi_square_statistic=values[0]
print("chi-square statistic:-",chi_square_statistic)

# Critical value for the chosen alpha and degrees of freedom.
critical_value=chi2.ppf(q=1-alpha,df=ddof)
print('critical_value:',critical_value)

chi-square statistic:- 24.5712028585826
critical_value: 12.591587243743977


In [9]:
# Calculate p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)
print('p-value:',p_value)

Significance level:  0.05
Degree of Freedom:  6
p-value: 0.0004098425861096544


In [10]:
# Which hypothesis do the results above (chi-square, p value etc.) support?
if chi_square_statistic>=critical_value:
    print("Chi square value is greater than critical value, Reject H0, there is a relationship between location and collar status.")
else:
    print("Chi square value is less than critical value, Retain H0, there is no evidence of a relationship between location and collar status.")
    
if p_value<=alpha:
    print("P Value less than alpha (5%), Reject H0, there is a relationship between location and collar status.")
else:
    print("P Value greater than alpha (5%), Retain H0, there is no evidence of a relationship between location and collar status.")

Chi square value is greater than critical value, Reject H0, there is a relationship between location and collar status.
P Value less than alpha (5%), Reject H0, there is a relationship between location and collar status.


### Run a Hypothesis Test [2.3]
***

**1. Null & Alternative Hypothesis** <br> <br> 
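Take A, B, C and D to be four locations, with the table giving the number of people of each collar type in each location.<br>
Null (H0): Collar status and location are not related (they are independent).<br>
Alternative (H1): Collar status and location are related.<br>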
<br>
**2. Alpha Level 0.05 (5%)** <br> <br>
Alpha is the probability of rejecting a true null hypothesis (a Type I error). Raising alpha to 10% would make it easier to reject the null hypothesis, including in cases where it is actually true. Lowering it to 1% would make the test stricter and increase the chance of **NOT** rejecting the null when it is in fact false. So it is a balancing act between these two kinds of error, and a 5% level is widely accepted as a good compromise. I also look at it as a tolerance for chance variation: even if the two variables were truly independent, sampling variation alone would push the statistic past the critical value about 5% of the time. Note that alpha does not correct for data collection problems such as mishearing a reply or incorrectly recording a response. [2.4]
<br>
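To make the trade-off concrete, the sketch below (an illustration, not part of the original analysis) prints the chi-square critical value for three alpha levels with 6 degrees of freedom. A smaller alpha pushes the critical value up, so the statistic has to be larger before the null hypothesis is rejected. The name `critical_values` is my own.

```python
from scipy.stats import chi2

dof = 6  # degrees of freedom for a 3 x 4 table

# Critical value for each alpha: the point with probability alpha above it.
critical_values = {alpha: chi2.ppf(1 - alpha, dof) for alpha in (0.10, 0.05, 0.01)}

for alpha, critical in critical_values.items():
    print(f"alpha = {alpha:.2f} -> critical value = {critical:.4f}")
```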

**3. Calculate Degrees of Freedom** <br> <br>
The degrees of freedom (often abbreviated as df or d) tell you how many numbers in your grid are actually independent. For a Chi-square grid, the degrees of freedom can be said to be the number of cells you need to fill in before, given the totals in the margins, you can fill in the rest of the grid using a formula. [2.5] <br> <br>

df = (rows - 1) (columns - 1)<br>
df = (3-1)(4-1)<br>
df = (2)(3) = 6.  This analysis will use 6 degrees of freedom.<br>
<br>
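The idea that only 6 cells are free can be checked directly. The sketch below keeps just the top-left 2 x 3 block of the observed table, then recovers every other cell from the row and column totals. It is an illustration under the assumption that the margins are known; the variable names are my own.

```python
import numpy as np

observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

n_rows, n_cols = observed.shape
dof = (n_rows - 1) * (n_cols - 1)

# Start from an empty grid and fill in only the (rows-1) x (columns-1) block.
grid = np.zeros_like(observed)
grid[:-1, :-1] = observed[:-1, :-1]

# The last column of those rows follows from the row totals.
grid[:-1, -1] = row_totals[:-1] - grid[:-1, :-1].sum(axis=1)
# The whole last row follows from the column totals.
grid[-1, :] = col_totals - grid[:-1, :].sum(axis=0)

print("Degrees of freedom:", dof)
print("Grid recovered from", dof, "cells plus margins:")
print(grid)
```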
**4. State Decision Rule**<br> <br>
Looking up a chi-square table with alpha 0.05 and 6 degrees of freedom gives a critical value of 12.5916.<br>
So our decision rule is: if the calculated chi-square is greater than 12.59, we reject the null hypothesis. <br>
<br>
**5. Calculate chi-square**<br> <br>

$$\chi^2 = \sum \frac {(f_o - f_e)^2}{f_e}$$


To get the expected frequency for each cell, multiply the row total by the column total and then divide by the total number of subjects. For example, how many white collar workers would we expect in area A? Take the column total for A, which is 150, multiply it by the row total for white collar, which is 349, and divide by the total number of subjects, which is 650. 
(150*349) / 650 = 80.54.  If collar status and location were independent, we would have expected about 80.54 white collar workers to be from location A.  Repeat the calculation to get the expected values for all the cells.
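The same row-total-times-column-total calculation can be done for every cell at once. This is a minimal sketch using NumPy's outer product; it rebuilds the observed table so that it runs on its own, and the variable names are illustrative.

```python
import numpy as np

observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])

row_totals = observed.sum(axis=1)   # 349, 151, 150
col_totals = observed.sum(axis=0)   # 150, 150, 200, 150
grand_total = observed.sum()        # 650

# Expected count for each cell = row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / grand_total

print(np.round(expected, 2))
```

The result should match the expected array returned by `ss.chi2_contingency` earlier in the notebook.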



In [11]:
# initialise data of lists: expected values rounded to two decimal places. 
data = {'A (Expected)':[80.54, 34.85, 34.62 ], 'B (Expected)':[80.54, 34.85, 34.62 ], 'C (Expected)':[107.38, 46.46, 46.15],'D (Expected)':[80.54, 34.85, 34.62], } 
  
# Creates pandas DataFrame. 
df2 = pd.DataFrame(data, index =['Expected White Collar', 'Expected Blue Collar', 'Expected No Collar']) 
df2

Unnamed: 0,A (Expected),B (Expected),C (Expected),D (Expected)
Expected White Collar,80.54,80.54,107.38,80.54
Expected Blue Collar,34.85,34.85,46.46,34.85
Expected No Collar,34.62,34.62,46.15,34.62


In [13]:
# initialise data of lists: observed values with the expected values next to them in brackets. 
data = {'A':["90 (80.54)", "30 (34.85)", "30 (34.62)", ], 'B':["60 (80.54)", "50 (34.85)", "40 (34.62)", ], 'C':["104 (107.38)", "51 (46.46)", "45 (46.15)", ],'D':["95 (80.54)", "20 (34.85)", "35 (34.62)", ], } 
  
# Creates pandas DataFrame. 
df3 = pd.DataFrame(data, index =['White Collar', 'Blue Collar', 'No Collar']) 
  
# print the data 
df3 

Unnamed: 0,A,B,C,D
White Collar,90 (80.54),60 (80.54),104 (107.38),95 (80.54)
Blue Collar,30 (34.85),50 (34.85),51 (46.46),20 (34.85)
No Collar,30 (34.62),40 (34.62),45 (46.15),35 (34.62)


Now calculate chi-squared by hand: for each cell, subtract the expected value from the observed value, square the result and divide by the expected value. This gives 12 fractions, one per cell, using the expected values rounded to two decimal places.

$$\begin{aligned}
\chi^2 = {} & \frac {(90 - 80.54)^2}{80.54} + \frac {(60 - 80.54)^2}{80.54} + \frac {(104 - 107.38)^2}{107.38} + \frac {(95 - 80.54)^2}{80.54} \\
& + \frac {(30 - 34.85)^2}{34.85} + \frac {(50 - 34.85)^2}{34.85} + \frac {(51 - 46.46)^2}{46.46} + \frac {(20 - 34.85)^2}{34.85} \\
& + \frac {(30 - 34.62)^2}{34.62} + \frac {(40 - 34.62)^2}{34.62} + \frac {(45 - 46.15)^2}{46.15} + \frac {(35 - 34.62)^2}{34.62}
\end{aligned}$$

Adding all of the above together results in a chi-squared of approximately 24.57.  Allowing for rounding differences, this agrees with the 24.5712 returned by scipy and with the 24.6 quoted in the Wikipedia article.  The chi-square is greater than 12.5916, so we reject the null hypothesis that collar and location are not related and can say that there is a relationship between location and collar status.
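The hand calculation can be repeated without any rounding by letting NumPy evaluate the twelve terms. This is a minimal sketch that rebuilds the observed table and its expected values so it runs on its own; `contributions`, `chi2_manual` and `p_manual` are illustrative names. The total should agree with the 24.5712 reported by `ss.chi2_contingency` earlier.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])

# Expected counts under independence (row total * column total / grand total).
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# One (f_o - f_e)^2 / f_e term per cell, then sum them.
contributions = (observed - expected) ** 2 / expected
chi2_manual = contributions.sum()

# p value: probability of a statistic at least this large with 6 degrees of freedom.
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof)

print(np.round(contributions, 4))
print(f"chi-squared = {chi2_manual:.4f}, p value = {p_manual:.6f}")
```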

#### Rough Work to double check the P value
***

In [14]:
from scipy import stats
# The survival function (1 - CDF) of the chi-square distribution gives the p value
# for a chi-square value of 24.571 with 6 degrees of freedom. [2.6]
stats.chi2.sf(24.571, 6)

0.00040987793499886133

In [15]:
# Calculate critical value. [2.7]
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for a 5% significance level
                      df = 6)   # df = (rows - 1) * (columns - 1) = 6
print("Critical value")
print(crit)

# Set the degree of freedom and chi-square value to calculate p value.
p_value = 1 - stats.chi2.cdf(x=24.571,  # Find the p-value
                             df=6)
print("P value")
print(p_value)

Critical value
12.591587243743977
P value
0.00040987793499891456


In [16]:
# Cumulative distribution function: the probability mass of the distribution up to a given point, so 1 - CDF is the proportion of the distribution lying above that point [2.8]
1 - stats.chi2.cdf(24.571, 6)


0.00040987793499891456

***

### Conclusion:  
***

Both the Chi-squared value and its associated p value can be easily obtained using scipy.stats.chi2_contingency.  I went a bit further and looked at how to calculate them manually, both to get a better understanding of the theory behind the test and to verify the results.  

My results are that the Chi-squared value is approximately 24.57, in line with the 24.6 quoted in the Wikipedia article, and its associated p value is approximately 0.00041.  If the chi-squared value was 0, that would indicate that the actual data and expected data were identical.  The lower the number, the closer the actual data are to the expected data.  That is not the case here: 24.57 is almost twice the critical value of 12.59 for 6 degrees of freedom, which indicates that the actual values differ from the expected values by more than chance alone would readily explain.  

The p value of roughly 0.00041 is far less than the alpha of 0.05, so the result is statistically significant.  It indicates strong evidence against the null hypothesis.  Therefore the null hypothesis is rejected in favour of the alternative, and I conclude that location and collar status are not independent of each other and there is a relationship between them.  What the test doesn't do is provide insight into how the variables are dependent or what kind of relationship exists, only that there is one.

<br> <br>From a quick look at the actual vs expected values, there is a big difference for white collar in area B, where the actual count (60) is well below the expected count (about 80.5).  Blue collar in the same area shows the opposite effect, with the actual count (50) well above the expected count (about 34.8).  

The no collar category also has its biggest difference in area B, where actual (40) is greater than expected (about 34.6).  Area C is broadly in line, with actual values close to expected in every row.  It may be worth noting that area B, with the biggest differences between expected and actual, sits beside area C, with the nearest matching values, although the column order in the table need not reflect geography.  

There is a noticeable difference in area D for blue collar, with the actual value (20) a lot smaller than the expected value (about 34.8).  The opposite appears for white collar in D, where the actual value (95) is a lot higher than expected (about 80.5).  There seems to be an opposite effect between blue collar and white collar: where one is higher than expected in an area, the other tends to be lower than expected.  Without further information it would be hard to identify a reason for this, but it may be that area D is more expensive, hence the greater white collar values.  If that was the case, then the opposite could be said for area B, with far fewer white collar but more blue collar than expected.  
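One way to put numbers on these observations is to look at the Pearson residual of each cell, (observed - expected) / sqrt(expected), whose squares add up to the chi-squared value. The sketch below is an illustration rather than part of the original analysis; the labels and variable names are my own. Cells with large positive or negative residuals are the ones contributing most to the result.

```python
import numpy as np

observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])
rows = ["White collar", "Blue collar", "No collar"]
cols = ["A", "B", "C", "D"]

# Expected counts under independence.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Pearson residual for each cell: positive means more than expected.
residuals = (observed - expected) / np.sqrt(expected)

for name, row in zip(rows, residuals):
    print(f"{name:<13}", "  ".join(f"{c}: {r:+.2f}" for c, r in zip(cols, row)))

# Locate the cell with the largest absolute residual.
i, j = np.unravel_index(np.abs(residuals).argmax(), residuals.shape)
print("Largest deviation:", rows[i], "in area", cols[j])
```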
***
***

#### References:
    
[2.1] krishnaik06, (2020), Hypothesis_Testing, https://github.com/krishnaik06/T-test-an-Correlation-using-python/blob/master/Hypothesis_Testing.ipynb, accessed October 2020.<br>
[2.2] docs.scipy.org, scipy.stats.chi2_contingency, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html, accessed October 2020. <br>
[2.3] statslectures, (2010), Chi-Square Test for Independence, https://www.youtube.com/watch?v=LE3AIyY_cn8&ab_channel=statslectures, accessed November 2020. <br>
[2.4] Stephanie, (2012), Statistics How To, Alpha Level (Significance Level): What is it?, https://www.statisticshowto.com/what-is-an-alpha-level/, accessed November 2020.  <br> 
[2.5] Ling 300, (2008), Tutorial: Pearson's Chi-square Test for Independence, https://www.ling.upenn.edu/~clight/chisquared.htm, accessed November 2020. <br> 
[2.6] docs.scipy.org, (2020), scipy.stats.chi2_contingency, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html, accessed November 2020.   <br> 
[2.7] Greg Hamel, (2018), Python for Data 25: Chi-Squared Tests, https://www.kaggle.com/hamelg/python-for-data-25-chi-squared-tests, accessed November 2020.    <br>
[2.8] learner, (2013), P-value from Chi sq test statistic in Python, https://stackoverflow.com/questions/11725115/p-value-from-chi-sq-test-statistic-in-python, accessed November 2020.
    

Other sites accessed:

 *   ResearchGate, (2015), https://www.researchgate.net/post/What_is_the_role_of_p-value_in_chi_square_test_of_difference, accessed November 2020.
    
 *   Stephanie Glen, (2020), Chi-Square Statistic: How to Calculate It / Distribution, https://www.statisticshowto.com/probability-and-statistics/chi-square/#:~:text=First%20state%20the%20null%20hypothesis,or%20%E2%80%9Csmall%20enough%E2%80%9D)., accessed November 2020.
    
 *   Mathsisfun, (2019), Chi-Square Test, https://www.mathsisfun.com/data/chi-square-test.html, accessed November 2020.
    
 *   Saul McLeod, (2019), What a p-value tells you about statistical significance, https://www.simplypsychology.org/p-value.html, accessed November 2020.
    
    
    