<p align="center">
    <h1 align="center">Machine Learning and Statistics Tasks 2020</h1>
    <h1 align="center"> Task 2: Chi-Squared (χ<sup>2</sup>) Test of Independence </h1>
    <h2 align="center"> Author: Ezekiel Onaloye</h2>
    <h2 align="center"> Created: December 2020 </h2> 
</p>

![Chi-square test formula](img/Chi-square.jpg)

### Task 2
Test whether two categorical variables are associated, i.e. whether they are significantly dependent on one another.

### Introduction

The chi-square test is a non-parametric method used to determine whether a relationship exists between categorical variables in a contingency table [1]. The chi-square test of independence tests for dependence between categorical variables.


Independence is a crucial concept in probability that describes a situation where knowing the value of one variable tells you nothing about another's value. For instance, our birth month doesn't indicate which web browser we use, so birth month and browser preference are likely independent. On the other hand, the month of birth might be related to the month and age at which pupils gain admission into primary school, so the month of birth and school admission might not be independent [3].

The chi-squared test of independence tests whether two categorical variables are independent. The independence test determines whether variables like education, political views, and other preferences vary based on demographic factors like gender, race, and religion [3]. 

If the p-value of the test is below the 0.05 significance level, the variables are considered significantly dependent. For this task, the scipy.stats package was used to conduct the test of independence. I used the scipy.stats.chi2_contingency() function, which runs a test of independence automatically given a frequency table of observed counts.

### Hypothesis

<b>H<sub>0</sub></b> - <em>Null hypothesis</em>: The null hypothesis is the default hypothesis, which states that there is no effect or relationship between the variables. Here, the two categorical variables are independent (there is no association between them).

<b>H<sub>a</sub></b> - <em>Alternative hypothesis</em>: The alternative hypothesis states that an effect or relationship exists. Here, the two categorical variables are dependent (there is an association between them).

### Assumptions
The variables are categorical (nominal), and the observations are randomly sampled and independent of one another.

The levels of variables are mutually exclusive.

The expected frequency count is at least 5 in at least 80% of the cells in the contingency table.

No cell has an expected count of 0.

The observed data should be frequency counts, not percentages or transformed data.
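
The expected-count assumptions above can be checked before running the test. The following is a minimal sketch, assuming the observed counts are the ones from the neighbourhood/occupation example used later in this task; `check_expected_counts` is a hypothetical helper name introduced here, not part of scipy.

```python
import numpy as np
from scipy.stats.contingency import expected_freq


def check_expected_counts(observed):
    """Return (share of cells with expected count >= 5, smallest expected count)."""
    expected = expected_freq(np.asarray(observed))
    share_at_least_5 = float((expected >= 5).mean())
    return share_at_least_5, float(expected.min())


# Observed counts from the neighbourhood/occupation example in this task.
observed = [[90, 60, 104, 95],
            [30, 50, 51, 20],
            [30, 40, 45, 35]]

share, smallest = check_expected_counts(observed)
print(f"Cells with expected count >= 5: {share:.0%}")
print(f"Smallest expected count: {smallest:.2f}")
```

If the share falls below 80% or the smallest expected count is too close to zero, the chi-square approximation may not be reliable for that table.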

### Example of chi-squared test for categorical data for task 2:

Suppose there is a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification [2]. 

The data was copied, saved as a CSV file, and read into a Jupyter notebook using pandas, as shown below.

In [1]:
# Efficient numerical arrays.
import numpy as np
# Data frames.
import pandas as pd
import matplotlib.pyplot as plt
# Main statistics package.
import scipy.stats as ss
from scipy.stats import chi2_contingency

# Better sized plots.
plt.rcParams['figure.figsize'] = (12, 8)

# Nicer colours and styles for plots.
plt.style.use("fivethirtyeight")

In [2]:
# read in csv file 
# Adapted from https://www.youtube.com/watch?v=hTsxJqw2zMM
# http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-25-chi.html

ezedf = pd.read_csv("task2.csv")
ezedf

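|              | A   | B   | C   | D   | total |
|--------------|-----|-----|-----|-----|-------|
| White collar | 90  | 60  | 104 | 95  | 349   |
| Blue collar  | 30  | 50  | 51  | 20  | 151   |
| No collar    | 30  | 40  | 45  | 35  | 150   |
| Total        | 150 | 150 | 200 | 150 | 650   |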
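
If `task2.csv` is not to hand, the same table can be rebuilt in memory. This is a sketch that assumes the file holds exactly the counts displayed above, with the occupation labels as the row index; the names `counts` and `ezedf_rebuilt` are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

# Observed counts as displayed above (rows: occupation, columns: neighbourhood).
counts = pd.DataFrame(
    [[90, 60, 104, 95],
     [30, 50, 51, 20],
     [30, 40, 45, 35]],
    index=["White collar", "Blue collar", "No collar"],
    columns=["A", "B", "C", "D"],
)

# Append the marginal totals so the layout matches the displayed table.
ezedf_rebuilt = counts.copy()
ezedf_rebuilt["total"] = counts.sum(axis=1)
ezedf_rebuilt.loc["Total"] = ezedf_rebuilt.sum(axis=0)
print(ezedf_rebuilt)
```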


In [3]:
observed = pd.DataFrame(ezedf.iloc[0:3,0:4])
observed

|              | A  | B  | C   | D  |
|--------------|----|----|-----|----|
| White collar | 90 | 60 | 104 | 95 |
| Blue collar  | 30 | 50 | 51  | 20 |
| No collar    | 30 | 40 | 45  | 35 |


In [4]:
expected =  np.outer(ezedf["total"][0:3], ezedf.loc["Total"][0:4]) / 650

expected = pd.DataFrame(expected, 
            index=["White collar","Blue collar","No collar"], 
            columns=["A","B","C","D"])
expected

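|              | A         | B         | C          | D         |
|--------------|-----------|-----------|------------|-----------|
| White collar | 80.538462 | 80.538462 | 107.384615 | 80.538462 |
| Blue collar  | 34.846154 | 34.846154 | 46.461538  | 34.846154 |
| No collar    | 34.615385 | 34.615385 | 46.153846  | 34.615385 |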
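
The expected counts above follow from the independence assumption. As a sketch, write R_i for the i-th row total, C_j for the j-th column total and N for the sample size (symbols introduced here for illustration only). Under independence the joint probability of a cell is the product of its marginal probabilities, so the expected count is:

```latex
E_{ij} = N \cdot \frac{R_i}{N} \cdot \frac{C_j}{N} = \frac{R_i \, C_j}{N},
\qquad \text{e.g. } E_{\text{White collar},\,A} = \frac{349 \times 150}{650} \approx 80.54
```

This is the quantity the `np.outer(...) / 650` call computes for every cell at once.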


### Chi-Squared Test of Independence using Scipy 

In [5]:
# Adapted from https://www.youtube.com/watch?v=hTsxJqw2zMM
# http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-25-chi.html
# Calculate the chi-square value, p-value, degrees of freedom and expected values

chiVal, pVal, df, exp = chi2_contingency(observed)

expected =  np.outer(ezedf["total"][0:3], ezedf.loc["Total"][0:4]) / 650

expected = pd.DataFrame(expected, 
            index=["White collar","Blue collar","No collar"], 
            columns=["A","B","C","D"])

print('Chi-squared is:',chiVal, '\n\nP-value is:',pVal, '\n\nDegrees of freedom is:',df, '\n\nExpected Values are:\n',expected)

Chi-squared is: 24.5712028585826 

P-value is: 0.0004098425861096696 

Degrees of freedom is: 6 

Expected Values are:
                       A          B           C          D
White collar  80.538462  80.538462  107.384615  80.538462
Blue collar   34.846154  34.846154   46.461538  34.846154
No collar     34.615385  34.615385   46.153846  34.615385


### Interpretation

The output above shows the chi-square statistic, the p-value and the degrees of freedom followed by the expected values.

Given the low p-value, the result shows a significant relationship between the variables. A p-value below 0.05 is conventionally considered low, so in this case we state that there is a significant association between the variables.

<b> The p value obtained from chi-square test for independence is significant (p < 0.05). </b>

The degrees of freedom reflect the size of the table: they equal the number of rows minus one multiplied by the number of columns minus one, i.e. (rows - 1) × (cols - 1). For this 3 × 4 table, that is (3 - 1) × (4 - 1) = 6.
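
The degrees of freedom also determine which chi-square distribution the statistic is compared against. As a hedged illustration (not part of the original notebook), the sketch below computes the degrees of freedom for the 3 × 4 table and the corresponding critical value at the 0.05 level using `scipy.stats.chi2.ppf`; a statistic above this value corresponds to p < 0.05.

```python
from scipy.stats import chi2

n_rows, n_cols = 3, 4          # occupations x neighbourhoods
dof = (n_rows - 1) * (n_cols - 1)

# Upper 5% critical value of the chi-square distribution with `dof` degrees of freedom.
critical_value = chi2.ppf(0.95, df=dof)

chi_stat = 24.5712028585826    # statistic reported by chi2_contingency above
print("Degrees of freedom:", dof)
print("Critical value at alpha = 0.05:", round(critical_value, 3))
print("Statistic exceeds critical value:", chi_stat > critical_value)
```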

The expected values are the counts we would expect if there were no association between the variables.
The chi-square test checks whether these expected values differ substantially from the observed values. No expected value should be less than one.
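
The comparison between observed and expected counts can be written out directly. The sketch below, which assumes the observed table from this task, sums (observed − expected)² / expected over all cells and converts the result to a p-value with the chi-square survival function. It should agree with the `chi2_contingency` output above up to floating-point error.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[90, 60, 104, 95],
                     [30, 50, 51, 20],
                     [30, 40, 45, 35]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected counts under independence: (row total * column total) / N.
expected = np.outer(row_totals, col_totals) / n

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E.
chi_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Survival function: P(X >= chi_stat) for a chi-square variable with `dof` degrees of freedom.
p_value = chi2.sf(chi_stat, dof)

print("Chi-squared:", chi_stat)
print("P-value:", p_value)
```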

### Evaluate Null / Alternative Hypothesis

In [6]:
# Evaluate whether to reject or fail to reject null hypothesis
if pVal <= 0.05:
    print('Dependent, reject null hypothesis (H0)')
else:
    print('Independent, fail to reject null hypothesis (H0)')

Dependent, reject null hypothesis (H0)


### Conclusion

As the p-value (0.00041) is below the significance level of 0.05, we reject the <b>H<sub>0</sub> - null hypothesis</b>. There is evidence of an association, which supports the alternative hypothesis that a relationship exists.

<b> The p value obtained from chi-square test for independence is significant (p < 0.05), and therefore, we conclude that there is a significant association between neighborhood of residence and the person's occupational classification.</b>

### References

[1] <em>Chi-square test in Python</em> - https://reneshbedre.github.io/blog/chisq.html

[2] <em>Chi-squared test</em> - https://en.wikipedia.org/wiki/Chi-squared_test

[3] <em>Python for Data Analysis Part 25: Chi-Squared Tests</em> - http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-25-chi.html

[4] <em>Python - Pearson chi square test of independence</em> - https://www.youtube.com/watch?v=hTsxJqw2zMM

[5] <em>Chi-Square Test of Independence</em> - https://www.pythonfordatascience.org/chi-square-test-of-independence-python/