<p align="center">
    <h1 align="center">Machine Learning and Statistics Tasks 2020</h1>
    <h1 align="center">Task 2: Chi-square (χ²) test for independence</h1>
    <h2 align="center"> Author: Ezekiel Onaloye</h2>
    <h2 align="center"> Created: December 2020 </h2> 
    <br>
</p>

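*Figure: chi-square test formula.*

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed count and $E_i$ is the expected count in cell $i$ of the contingency table.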

> Chi-square test is a non-parametric (distribution-free) method used to compare the relationship between the two categorical (nominal) variables in a contingency table[1].


### Purpose of Task
To test whether two categorical variables are associated, that is, whether they are statistically dependent on one another.
If the p-value of the test is below 0.05, the association is considered statistically significant at the 5% level.

### Assumptions
The two variables are categorical (nominal) and the data are randomly sampled.

The levels of each variable are mutually exclusive.

The expected frequency count is at least 5 in at least 80% of the cells of the contingency table.

No expected frequency count is less than 1.

Observations are independent of each other.

The data are frequency counts, not percentages or transformed data.
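The two expected-count assumptions can be checked numerically before running the test. The sketch below is one way to do it, assuming the neighborhood-by-occupation counts from the Wikipedia example used in this task, entered by hand without the Total row and column. The variable names are chosen here for illustration; `expected_freq` comes from `scipy.stats.contingency`.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts (rows: white, blue, no collar; columns: neighborhoods A-D).
# Assumption: the Total row and Total column are deliberately left out.
observed = np.array([
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
])

# Expected counts under independence: row total * column total / grand total.
expected = expected_freq(observed)

# Share of cells whose expected count is at least 5 (rule of thumb: >= 80%).
share_at_least_5 = float(np.mean(expected >= 5))
# Smallest expected count (rule of thumb: not below 1).
smallest_expected = float(expected.min())

assumptions_ok = share_at_least_5 >= 0.8 and smallest_expected >= 1

print(f"Cells with expected count >= 5: {share_at_least_5:.0%}")
print(f"Smallest expected count: {smallest_expected:.2f}")
print("Expected-count assumptions met:", assumptions_ok)
```

The remaining assumptions (random sampling, independent observations, mutually exclusive levels) concern how the data were collected and cannot be verified from the counts alone.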

An example from Wikipedia is used for Task 2:
Suppose there is a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification [2]. The data were copied, saved as a CSV file, and read into a Jupyter notebook using pandas, as shown below.
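For readers without the CSV file, the same counts can be typed in directly. This is a minimal sketch, not the notebook's own loading step: the name `observed` is chosen here, and the totals are derived from the counts instead of being stored in the table, which keeps them out of any later test.

```python
import pandas as pd

# Counts transcribed from the Wikipedia example (no Total row or column).
observed = pd.DataFrame(
    {
        "A": [90, 30, 30],
        "B": [60, 50, 40],
        "C": [104, 51, 45],
        "D": [95, 20, 35],
    },
    index=["White collar", "Blue collar", "No collar"],
)

# Margins computed from the counts rather than kept as table entries.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = int(observed.to_numpy().sum())

print(observed)
print("Row totals:", row_totals.to_dict())
print("Column totals:", col_totals.to_dict())
print("Sample size:", grand_total)
```

The derived sample size should equal the 650 residents stated in the example, which is a quick check that the counts were transcribed correctly.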

In [2]:
# Efficient numerical arrays.
import numpy as np
# Data frames.
import pandas as pd
# Plotting (needed for the plt settings below).
import matplotlib.pyplot as plt
# Main statistics package.
import scipy.stats as ss
from scipy.stats import chi2_contingency

# Better sized plots.
plt.rcParams['figure.figsize'] = (12, 8)

# Nicer colours and styles for plots.
plt.style.use("fivethirtyeight")

In [15]:
ezedf = pd.read_csv("task2.csv")
ezedf

| Unnamed: 0 | A | B | C | D | total |
|---|---|---|---|---|---|
| White collar | 90 | 60 | 104 | 95 | 349 |
| Blue collar | 30 | 50 | 51 | 20 | 151 |
| No collar | 30 | 40 | 45 | 35 | 150 |
| Total | 150 | 150 | 200 | 150 | 650 |


In [17]:
# One Series per neighborhood (kept for inspection; not used in the test below).
myField1 = ezedf['A']
myField2 = ezedf['B']
myField3 = ezedf['C']
myField4 = ezedf['D']

In [19]:
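# chi2_contingency needs a purely numeric table, so the label column becomes the index.
# Note: the Total row and Total column are still part of this table, which is why
# 12 degrees of freedom are reported; the counts alone form a 3 x 4 table with 6.
chiVal, pVal, df, exp = chi2_contingency(ezedf.set_index("Unnamed: 0"))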

In [20]:
chiVal, pVal, df, exp

(24.571202858582602,
 0.016990737760739776,
 12,
 array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154,
         349.        ],
        [ 34.84615385,  34.84615385,  46.46153846,  34.84615385,
         151.        ],
        [ 34.61538462,  34.61538462,  46.15384615,  34.61538462,
         150.        ],
        [150.        , 150.        , 200.        , 150.        ,
         650.        ]]))

In [23]:
print("The chi-square value is", chiVal)
print("P-value is", pVal)
print("Degrees of freedom:", df)
print("Expected values:", exp)

The chi-square value is 24.571202858582602
P-value is 0.016990737760739776
Degrees of freedom: 12
Expected values: [[ 80.53846154  80.53846154 107.38461538  80.53846154 349.        ]
 [ 34.84615385  34.84615385  46.46153846  34.84615385 151.        ]
 [ 34.61538462  34.61538462  46.15384615  34.61538462 150.        ]
 [150.         150.         200.         150.         650.        ]]
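The call above received the table with its Total row and Total column included, which is why 12 degrees of freedom are reported. A hedged sketch of the same test on the 3 × 4 counts alone follows; the counts are re-entered by hand and the variable names are chosen here, not taken from the notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts only: the Total row and Total column are left out.
observed = np.array([
    [90, 60, 104, 95],   # White collar
    [30, 50, 51, 20],    # Blue collar
    [30, 40, 45, 35],    # No collar
])

# Pearson chi-square test of independence on the 3 x 4 table.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("Chi-square statistic:", round(chi2_stat, 4))
print("Degrees of freedom:", dof)   # (3 - 1) * (4 - 1)
print("p-value:", p_value)
```

The statistic should match the value reported above, because a total cell has identical observed and expected counts and so contributes zero. Only the degrees of freedom, and therefore the p-value, change.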


### Interpretation

The test gives a chi-square statistic of about 24.57. The reported p-value, about 0.017, is the probability of obtaining a statistic at least this large in a sample if there is no association between the two variables in the population. The p-value is not itself the significance level; it is compared against the chosen significance level, here 0.05.

Since 0.017 is below 0.05, the result is considered significant: we reject the null hypothesis of independence and conclude that there is a significant association between neighborhood of residence and occupational classification. An association of this kind does not by itself show that one variable has an impact on the other.

The 12 degrees of freedom reported above arise because the table passed to chi2_contingency included the Total row and column. For the 3 × 4 table of counts alone, the degrees of freedom are (3 − 1)(4 − 1) = 6. The totals add nothing to the statistic, because their observed and expected values coincide, so the chi-square value is unchanged. With 6 degrees of freedom the p-value is roughly 0.0004, which strengthens the same conclusion.
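The statistic and its p-value can also be worked out by hand from the observed counts, which makes the role of the degrees of freedom explicit. The sketch below assumes the 3 × 4 table without totals and uses names chosen here for illustration. It compares the p-value under 6 degrees of freedom with the one under the 12 reported in the output above, and checks the statistic against the 5% critical value.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts without the Total row and column.
observed = np.array([
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
])

# Expected counts under independence: row total * column total / sample size.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()
expected = row_totals * col_totals / n

# Pearson statistic: sum over cells of (O - E)^2 / E.
chi2_stat = float(((observed - expected) ** 2 / expected).sum())

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probabilities of the chi-square distribution.
p_value = float(chi2.sf(chi2_stat, dof))
p_with_totals = float(chi2.sf(chi2_stat, 12))  # df reported when totals are included

# Decision at the 5% level using the critical value.
alpha = 0.05
critical = float(chi2.ppf(1 - alpha, dof))
reject_independence = chi2_stat > critical

print("Chi-square statistic:", round(chi2_stat, 4))
print("Degrees of freedom:", dof)
print("p-value with 6 df:", p_value)
print("p-value with 12 df:", p_with_totals)
print("Critical value at alpha = 0.05:", round(critical, 3))
print("Reject independence:", reject_independence)
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two forms of the same decision rule, so both lead to the same conclusion.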

### References

1. https://reneshbedre.github.io/blog/chisq.html
2. https://en.wikipedia.org/w/index.php?title=Chi-squared_test&oldid=983024096