<p align="center">
    <h1 align="center">Machine Learning and Statistics Tasks 2020</h1>
    <h1 align="center"> Task 2: Chi-Squared (χ2) Test of Independence </h1>
    <h2 align="center"> Author: Ezekiel Onaloye</h2>
    <h2 align="center"> Created: December 2020 </h2> 
</p>

![Chi-square test formula](img/Chi-square.jpg)

### Task 2
The aim is to test whether two categorical variables are associated, that is, whether one depends significantly on the other.
If the p-value of the test is below 0.05, the variables are considered to have a significant dependency. SciPy makes this quick: the `stats.chi2_contingency()` function conducts a test of independence automatically, given a frequency table of observed counts.

### Introduction

The chi-square test is a non-parametric method used to examine the relationship between two categorical variables in a contingency table [1]. The chi-square test of independence tests for dependence between categorical variables.

Independence is a crucial concept in probability that describes a situation where knowing the value of one variable tells you nothing about the value of another. For instance, our birth month doesn't indicate which web browser we use, so birth month and browser preference are likely to be independent. On the other hand, our month of birth might be related to the month and age at which we gain admission into primary school, so month of birth and school admission might not be independent [3].

The chi-squared test of independence tests whether two categorical variables are independent. It is commonly used to determine whether variables like education, political views, and other preferences vary based on demographic factors like gender, race, and religion [3].
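
Stated formally (a standard textbook formulation, not taken from the cited sources), two categorical variables X and Y are independent when every joint probability factorises into the product of the marginal probabilities. For a contingency table with row totals R_i, column totals C_j and sample size n (symbols introduced here for illustration), this is what gives the expected count E_ij of each cell:

```latex
P(X = x_i,\, Y = y_j) = P(X = x_i)\,P(Y = y_j)
\quad\Longrightarrow\quad
E_{ij} = n \cdot \frac{R_i}{n} \cdot \frac{C_j}{n} = \frac{R_i\,C_j}{n}
```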



### Hypothesis

<em>Null hypothesis</em> (H_0): the two categorical variables are independent (there is no association between them).

<em>Alternative hypothesis</em> (H_A): the two categorical variables are dependent (there is an association between them).

### Assumptions
The variables are categorical (nominal), the sample is random, and the observations are independent of one another.

The levels of each variable are mutually exclusive.

The expected frequency count is at least 5 in at least 80% of the cells of the contingency table.

No cell has an expected count of 0.

The observed data are frequency counts, not percentages or transformed data.
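
The expected-count conditions above can be checked before running the test. The following is a minimal sketch, assuming the observed counts from the example table in this task with the totals row and column left out; `check_expected_counts` is a hypothetical helper name, not part of SciPy.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts (rows: occupation, columns: neighbourhoods A-D),
# taken from the example table in this task, totals excluded.
observed = np.array([
    [90, 60, 104, 95],   # White collar
    [30, 50, 51, 20],    # Blue collar
    [30, 40, 45, 35],    # No collar
])


def check_expected_counts(table):
    """Hypothetical helper: summarise the expected-count assumptions."""
    expected = expected_freq(table)
    share_at_least_5 = float(np.mean(expected >= 5))
    return {
        "min_expected": float(expected.min()),
        "share_at_least_5": share_at_least_5,
        "assumptions_met": bool(expected.min() > 0 and share_at_least_5 >= 0.8),
    }


report = check_expected_counts(observed)
print(report)
```

If `assumptions_met` comes back false, the chi-square approximation may be unreliable for that table.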

An example from Wikipedia is used for Task 2 below:
Suppose there is a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification [2]. The data were copied, saved as a CSV file, and read into a Jupyter notebook using pandas below.

The test of independence uses a cross tabulation (contingency table) of the variables of interest, with r rows and c columns. Based on the cell counts, it is possible to test whether there is a relationship (dependence) between the variables and to estimate the strength of that relationship [5].
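
As an illustration of how such a cross tabulation is built from raw records, the sketch below uses `pd.crosstab` on a small made-up data set. The column names `occupation` and `neighbourhood` and the six records are assumptions for illustration only; they are not the task data.

```python
import pandas as pd

# Made-up records, one row per surveyed resident (illustrative only).
records = pd.DataFrame({
    "occupation": ["White collar", "Blue collar", "No collar",
                   "White collar", "Blue collar", "White collar"],
    "neighbourhood": ["A", "A", "B", "B", "C", "C"],
})

# Rows are occupation categories, columns are neighbourhoods, and each
# cell holds the number of records with that combination.
table = pd.crosstab(records["occupation"], records["neighbourhood"])
print(table)
```

A table built this way contains only the cell counts, which is the form `chi2_contingency` expects.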

In [2]:
# Efficient numerical arrays.
import numpy as np
# Data frames.
import pandas as pd
# Plotting.
import matplotlib.pyplot as plt
# Main statistics package.
import scipy.stats as ss
from scipy.stats import chi2_contingency

# Better sized plots.
plt.rcParams['figure.figsize'] = (12, 8)

# Nicer colours and styles for plots.
plt.style.use("fivethirtyeight")

In [15]:
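# The first CSV column holds the occupation labels, so use it as the index;
# chi2_contingency needs a purely numeric table.
ezedf = pd.read_csv("task2.csv", index_col=0)
ezedf

| | A | B | C | D | total |
|---|---|---|---|---|---|
| White collar | 90 | 60 | 104 | 95 | 349 |
| Blue collar | 30 | 50 | 51 | 20 | 151 |
| No collar | 30 | 40 | 45 | 35 | 150 |
| Total | 150 | 150 | 200 | 150 | 650 |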


In [17]:
# Individual neighbourhood columns (not used by the test below).
myField1 = ezedf['A']
myField2 = ezedf['B']
myField3 = ezedf['C']
myField4 = ezedf['D']

In [19]:
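# Run the test. Note that ezedf still includes the Total row and the total
# column, so SciPy sees a 4x5 table here.
chiVal, pVal, df, exp = chi2_contingency(ezedf)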

In [20]:
chiVal, pVal, df, exp

(24.571202858582602,
 0.016990737760739776,
 12,
 array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154,
         349.        ],
        [ 34.84615385,  34.84615385,  46.46153846,  34.84615385,
         151.        ],
        [ 34.61538462,  34.61538462,  46.15384615,  34.61538462,
         150.        ],
        [150.        , 150.        , 200.        , 150.        ,
         650.        ]]))

In [23]:
print("Chi-square value is", chiVal)
print("P value is", pVal)
print("Degrees of freedom is", df)
print("Expected values:", exp)

Chi-square value is 24.571202858582602
P value is 0.016990737760739776
Degrees of freedom is 12
Expected values: [[ 80.53846154  80.53846154 107.38461538  80.53846154 349.        ]
 [ 34.84615385  34.84615385  46.46153846  34.84615385 151.        ]
 [ 34.61538462  34.61538462  46.15384615  34.61538462 150.        ]
 [150.         150.         200.         150.         650.        ]]
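
The table passed to `chi2_contingency` above still contains the `Total` row and the `total` column. A hedged sketch of running the same test on only the 3x4 body is shown here. The DataFrame is rebuilt inline from the counts printed earlier so that the snippet is self-contained, and the name `body` is an assumption.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Same counts as task2.csv, rebuilt inline so the snippet runs on its own.
ezedf = pd.DataFrame(
    {"A": [90, 30, 30, 150], "B": [60, 50, 40, 150],
     "C": [104, 51, 45, 200], "D": [95, 20, 35, 150],
     "total": [349, 151, 150, 650]},
    index=["White collar", "Blue collar", "No collar", "Total"],
)

# Drop the marginal totals: only the 3x4 body is the contingency table.
body = ezedf.drop(index="Total", columns="total")

chi_val, p_val, dof, expected = chi2_contingency(body)
print("Chi-square value is", chi_val)
print("P value is", p_val)
print("Degrees of freedom is", dof)
```

With the totals removed the statistic is unchanged, because each total cell equals its expected value and contributes zero, but the degrees of freedom drop from 12 to 6.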


The output above shows the chi-square statistic, the p-value and the degrees of freedom, followed by the expected counts.
Given the low p-value, the test detects a significant relationship between the variables.

24.571202858582602 is the chi-square value.
0.016990737760739776 is the p-value, which means there is roughly a 0.02 chance of
obtaining a chi-square value this large or larger if the assumption
about the population is true, the assumption being that there is no association between the two
variables. A p-value below 0.05 is usually considered low,
so in this case we would state that there is a significant association between the variables.
The degrees of freedom reflect the size of the table: the number of rows minus one
multiplied by the number of columns minus one.
The expected values are the counts that we would expect
if there were no association.
The chi-square test checks whether the observed counts differ substantially from
these expected counts.
No expected count should be less than one.

The table passed to the function has 4 rows and 5 columns because it includes the totals,
so SciPy reports df = (4 - 1) x (5 - 1) = 3 x 4 = 12.
The contingency table itself, without the totals, has 3 rows and 4 columns,
so the degrees of freedom for the test of independence are (3 - 1) x (4 - 1) = 2 x 3 = 6.
The total cells equal their expected values and add nothing to the statistic, which stays at about 24.57;
with 6 degrees of freedom the p-value is smaller still (roughly 0.0004), so the conclusion is unchanged.
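
The quantities described above can also be computed by hand. The sketch below is an illustration under stated assumptions: it uses the 3x4 body of the example table (totals excluded), derives the expected counts from the row and column totals, sums (O - E)^2 / E over all cells, and takes the p-value as the upper tail of the chi-square distribution.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts, totals excluded (rows: occupation, columns: A-D).
observed = np.array([
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 4)
n = observed.sum()

# Expected count of each cell under independence: row total * column total / n.
expected = row_totals * col_totals / n

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: probability of a statistic at least this large under H0.
p_value = chi2.sf(chi_square, dof)

print(chi_square, dof, p_value)
```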

### Interpretation

The test resulted in a chi-square value of 24.571202858582602. The probability of obtaining such a value or higher in the sample, if there is no association in the population, is 0.016990737760739776; this is the p-value, which is compared against the significance level of 0.05.

A result is usually considered 'significant' if the p-value is below 0.05, which in this case it is. This indicates that there is an association between the two variables; it does not by itself show that one causes the other.

As the p-value is below 0.05, we reject H0 in favour of H1 and conclude that there is a significant association between neighborhood of residence and occupational classification [1].
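
The same decision can be expressed with a critical value instead of a p-value. This is a minimal sketch, assuming a significance level of 0.05 and the statistic reported in the output above; `reject_null` is a hypothetical helper name. It is evaluated for the 12 degrees of freedom SciPy reported for the table with totals, and for the 6 degrees of freedom of the 3x4 table without them.

```python
from scipy.stats import chi2


def reject_null(chi_val, dof, alpha=0.05):
    """Hypothetical helper: compare the statistic with the critical value."""
    critical = chi2.ppf(1 - alpha, dof)
    return bool(chi_val > critical), float(critical)


chi_val = 24.571202858582602  # statistic reported in the output above

for dof in (12, 6):
    reject, critical = reject_null(chi_val, dof)
    print(f"dof={dof}: critical value={critical:.3f}, reject H0={reject}")
```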


### References

1. https://reneshbedre.github.io/blog/chisq.html
2. https://en.wikipedia.org/wiki/Chi-squared_test
3. http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-25-chi.html
4. https://www.youtube.com/watch?v=hTsxJqw2zMM
5. https://www.pythonfordatascience.org/chi-square-test-of-independence-python/