**Chi-Square Test** - $χ^2$ Test
> The Chi-square test is a statistical method used to determine whether there is a significant association between two categorical variables. There is no equivalent parametric test. Two types of chi-square test are commonly used: <br>
> &emsp; → **Chi-Square Goodness-of-Fit Test** <br>
> &emsp; → **Chi-Square Test of Independence** <br>


**Chi-Square Goodness-of-Fit Test**
> It is a statistical test used to evaluate how well the frequency distribution (observed) of a categorical variable fits a theoretical expected frequency distribution in a sample. This test is used to determine whether the difference between observed and expected frequencies is statistically significant.
> 
> |  |  |  |
> |--|--|--|
> | $H_0$ | $o_i=e_i$ | → There is no difference between the observed frequencies (o) and the expected frequencies (e). |
> | $H_1$ | $o_i≠e_i$ | → There is a difference between the observed frequencies (o) and the expected frequencies (e). |
>
> **Frequency table** shows the names of each group and the number of occurrences as a result of classifying a variable. <br>
> **Expected value** is the number of observations expected to fall in each group under $H_0$. When all categories are assumed to be equally likely, it is:
> 
> | $e_i\ =\ total\ observations\ /\ number\ of\ categories$ |
> |-|
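
The test statistic itself is not written out above. As a minimal sketch, assuming the usual Pearson form $\chi^2 = \sum_i (o_i - e_i)^2 / e_i$ and equal expected frequencies, it can be computed by hand and checked against `scipy.stats.chisquare`. The counts below are hypothetical and serve only as an illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for three categories (illustrative only).
observed = np.array([18, 22, 20])

# Equal expected frequencies: total observations / number of categories.
expected = np.full(observed.shape, observed.sum() / observed.size)

# Pearson statistic: sum over categories of (o_i - e_i)^2 / e_i.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for a goodness-of-fit test: number of categories - 1.
dof = observed.size - 1
p_manual = stats.chi2.sf(chi2_manual, dof)

# SciPy's built-in test should agree with the manual calculation.
chi2_scipy, p_scipy = stats.chisquare(observed)
print(chi2_manual, p_manual)
print(chi2_scipy, p_scipy)
```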


**Chi-Square Test of Independence**
> It is a statistical test used to determine whether the relationship between two categorical variables is statistically significant. It assesses the difference between the observed and expected frequencies.
> 
> |  |  |
> |--|--|
> | $H_0$ | → The variables are independent. (no association) |
> | $H_1$ | → There is a relationship between the variables. (there is an association) |
>
> 
> **Contingency tables** are used to visualize and analyze the relationship between two categorical variables.
>
> 
> **2x2 Tables:** These are simple contingency tables used when both categorical variables have two levels. The test to be used is selected by examining the smallest value in the table of expected frequencies.
>
> |  | Variable 1 / Group 1 | Variable 1 / Group 2 |
> |--|--|--|
> | **Variable 2 / Group 1** | frequency 11 | frequency 12 |
> | **Variable 2 / Group 2** | frequency 21 | frequency 22 |
>
> | Smallest Expected Frequency | Test to Use |
> |--|--|
> | $SEF\ \ge\ 25$ | Pearson's Chi-Square Test |
> | $5\ \le\ SEF\ <\ 25$ | Yates' Chi-Square Test (Yates' continuity correction) |
> | $SEF\ <\ 5$ | Fisher's Exact Test |
>
> 
> **RxC Tables:** These contingency tables are used when at least one of the two categorical variables has three or more levels. The test is selected by checking whether the expected frequencies smaller than 5 make up more than 20% of the cells in the table.
> 
> |  | Variable 1 / Group 1 | ... | Variable 1 / Group C |
> |:-:|--|--|--|
> | **Variable 2 / Group 1** | frequency 11 | ... | frequency 1C |
> | **Variable 2 / Group 2** | frequency 21 | ... | frequency 2C |
> | **︙** | ︙ |  | ︙ |
> | **Variable 2 / Group R** | frequency R1 | ... | frequency RC |
>
> | Percentage of Expected Frequencies Smaller than 5 (SFP) | Test to Use |
> |:-:|--|
> | $SFP\ \le\ 20\%$ | Pearson's Chi-Square Test |
> | $SFP\ >\ 20\%$ | Fisher's Exact Test |
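
Both selection rules depend on the table of expected frequencies. As a rough sketch, assuming the standard independence formula (row total × column total / grand total), the expected counts and the share of cells below 5 can be computed directly. The 2x3 table here is hypothetical, and `scipy.stats.contingency.expected_freq` is used only as a cross-check.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical 2x3 table of observed counts (illustrative only).
observed = np.array([[12, 5, 3],
                     [8, 15, 7]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / n

# Percentage of cells whose expected count is below 5.
pct_small = (expected < 5).mean() * 100

print(expected)
print(pct_small)

# SciPy's helper should return the same table.
print(np.allclose(expected, expected_freq(observed)))
```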

<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()

<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; Functions to Use </p>

In [2]:
α = alpha = 0.05

def decision(p, alpha=0.05):
    'acceptance or rejection of the null hypothesis'
    if p < alpha: return 'H0 rejected.'
    else: return 'H0 cannot be rejected.'

def test_selection_2x2(value):
    'test choice for a 2x2 table from the smallest expected frequency'
    if value >= 25: return "Pearson's Chi-Square Test"
    if value >= 5: return "Yates' Chi-Square Test"
    return "Fisher's Exact Test"

def test_selection_RxC(value):
    'test choice for an RxC table from the percentage of expected frequencies below 5'
    if value <= 20: return "Pearson's Chi-Square Test"
    return "Fisher's Exact Test"

<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
 &nbsp; CHI-SQUARE GOODNESS-OF-FIT TEST </p>
    
|  |  |
|--|--|
| $H_0$ | → The observed data fits the expected distribution. |
| $H_1$ | → The observed data does not fit the expected distribution. |

In [3]:
data = ['Underweight'] * 42 + ['Healthy'] * 30 + ['Overweight'] * 28
np.random.shuffle(data)

data = pd.DataFrame(data, columns=['Weight'])
data.sample(3)

Unnamed: 0,Weight
18,Overweight
97,Overweight
5,Overweight


In [4]:
frequency = data['Weight'].value_counts()
frequency

Weight
Underweight    42
Healthy        30
Overweight     28
Name: count, dtype: int64

In [5]:
chi, p = stats.chisquare(frequency)

print('Chi-Square:', chi)
print('p:', round(p, 4))
print('Decision:', decision(p))

Chi-Square: 3.44
p: 0.1791
Decision: H0 cannot be rejected.
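
When only the observed counts are passed, `stats.chisquare` assumes equal expected frequencies. A sketch of testing against a non-uniform distribution passes the expected counts through the `f_exp` parameter. The expected counts of 40/30/30 are a hypothetical assumption, not values from the source.

```python
from scipy import stats

# Observed counts from the frequency table above.
observed = [42, 30, 28]   # Underweight, Healthy, Overweight

# Hypothetical expected counts (assumption); they must sum to the same total.
expected = [40, 30, 30]

chi, p = stats.chisquare(f_obs=observed, f_exp=expected)
print('Chi-Square:', round(chi, 4))
print('p:', round(p, 4))
```

The degrees of freedom are still the number of categories minus one, because the expected counts are fixed in advance rather than estimated from the data.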


<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%);"> 
<b> &nbsp; CHI-SQUARE TEST OF INDEPENDENCE </b> - Contingency 2x2 </p>
    
|  |  |
|--|--|
| $H_0$ | → There is no relationship between gender and handedness. |
| $H_1$ | → There is a relationship between gender and handedness. |

In [6]:
data = pd.read_csv('data/12_gender_and_handedness.csv')
data.sample(3)

Unnamed: 0,Gender,Handedness
65,Female,Right
3,Male,Right
85,Female,Right


**1. Data Preparation**

In [7]:
table = pd.crosstab(index=data['Gender'], columns=data['Handedness'])
display(table)

Handedness,Left,Right
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,9,43
Male,4,44


**2. Finding the Expected Frequency Value**

In [8]:
test, p, df, expected = stats.chi2_contingency(table)

print(expected)
print(f'Test for the smallest expected value {expected.min()} :', test_selection_2x2(expected.min()))

[[ 6.76 45.24]
 [ 6.24 41.76]]
Test for the smallest expected value 6.24 : Yates' Chi-Square Test


**3. Test Implementation** - Yates' Chi-Square Test

In [9]:
test, p, df, expected = stats.chi2_contingency(table, correction=True)
print(f'p: {p:.4f} \t Decision: {decision(p)}')

p: 0.3004 	 Decision: H0 cannot be rejected.
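
To make the continuity correction concrete, the sketch below recomputes the Yates statistic by hand for the same 2x2 table: each absolute difference between observed and expected counts is reduced by 0.5 before squaring. This shortcut assumes every difference exceeds 0.5, which holds for this table.

```python
import numpy as np
from scipy import stats

# Observed 2x2 table from the crosstab above
# (rows: Female, Male; columns: Left, Right).
observed = np.array([[9, 43],
                     [4, 44]])

chi2_yates, p_yates, dof, expected = stats.chi2_contingency(observed, correction=True)

# Manual Yates statistic: shrink each |o - e| by 0.5 before squaring.
chi2_manual = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()
p_manual = stats.chi2.sf(chi2_manual, dof)

print(f'SciPy : chi2={chi2_yates:.4f}, p={p_yates:.4f}')
print(f'Manual: chi2={chi2_manual:.4f}, p={p_manual:.4f}')
```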


<p style="background-image: linear-gradient(#f87674, #ffffff 10%);"> 
<b> 3.1. Test Implementation</b> - Pearson's Chi-Square Test </p>

In [10]:
test, p, df, expected = stats.chi2_contingency(table, correction=False)
print(f'p: {p:.4f} \t Decision: {decision(p)}')

p: 0.1825 	 Decision: H0 cannot be rejected.


**3.2. Test Implementation** - Fisher's Exact Test

In [11]:
test, p = stats.fisher_exact(table)
print(f'p: {p:.4f} \t Decision: {decision(p)}')

p: 0.2392 	 Decision: H0 cannot be rejected.
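
Fisher's exact test conditions on the row and column totals, so the count in one cell follows a hypergeometric distribution. The sketch below rebuilds the two-sided p-value on that basis, assuming the common definition that sums the probabilities of all tables no more likely than the observed one. The small tolerance factor is an implementation choice that guards against floating-point ties.

```python
import numpy as np
from scipy import stats

# Observed 2x2 table (rows: Female, Male; columns: Left, Right).
observed = np.array([[9, 43],
                     [4, 44]])

n_total = observed.sum()         # all participants
n_left = observed[:, 0].sum()    # left-handed participants
n_female = observed[0].sum()     # female participants

# Possible counts of left-handed women given the fixed margins.
k = np.arange(max(0, n_left + n_female - n_total), min(n_left, n_female) + 1)
pmf = stats.hypergeom.pmf(k, n_total, n_left, n_female)

# Probability of the table actually observed.
p_observed = stats.hypergeom.pmf(observed[0, 0], n_total, n_left, n_female)

# Two-sided p-value: total probability of tables no more likely than the observed one.
p_manual = pmf[pmf <= p_observed * (1 + 1e-7)].sum()

odds_ratio, p_scipy = stats.fisher_exact(observed)
print(f'Manual p: {p_manual:.4f} \t SciPy p: {p_scipy:.4f}')
```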


<p style="background-image: linear-gradient(to right, #ee2965, #e31837)"> &nbsp; </p>

<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%);"> 
<b> &nbsp; CHI-SQUARE TEST OF INDEPENDENCE </b> - Contingency RxC </p>
    
|  |  |
|--|--|
| $H_0$ | → There is no relationship between gender and brand preference. |
| $H_1$ | → There is a relationship between gender and brand preference. |

In [12]:
data = pd.read_csv('data/13_gender_brand_preferences.csv')
data.sample(3)

Unnamed: 0,Gender,Brand
12,Female,D
23,Female,D
16,Female,C


**1. Data Preparation**

In [13]:
table = pd.crosstab(index=data['Gender'], columns=data['Brand'])
display(table)

Brand,A,B,C,D
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,3,4,6,5
Male,1,1,3,1


**2. Finding the Expected Frequency Value**

In [14]:
test, p, df, expected = stats.chi2_contingency(table)
print(expected)

percentage = (expected<5).sum() / expected.size * 100
print(f'The test to be applied for {percentage}% :', test_selection_RxC(percentage))

[[3.   3.75 6.75 4.5 ]
 [1.   1.25 2.25 1.5 ]]
The test to be applied for 87.5% : Fisher's Exact Test


**3. Test Implementation** - Fisher's Exact Test

In [15]:
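# R's fisher.test accepts RxC tables; with pandas2ri active, the crosstab is
# converted to an R data.frame automatically when it is passed in.
fisher = robjects.r['fisher.test']

# Alternative without automatic conversion: build the R matrix by hand.
# values = table.to_numpy().T.flatten() 
# r_table = robjects.r.matrix(robjects.IntVector(values), nrow=2)

test = fisher(table)
print(test)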


	Fisher's Exact Test for Count Data

data:  structure(list(A = c(3L, 1L), B = c(4L, 1L), C = c(6L, 3L), D = c(5L, 1L)), class = "data.frame", row.names = c("Female", "Male"))
p-value = 0.9198
alternative hypothesis: two.sided




<p style="background-image: linear-gradient(#f87674, #ffffff 10%);"> 
<b>3.1. Test Implementation</b> - Pearson's Chi-Square Test </p>

In [16]:
test, p, df, expected = stats.chi2_contingency(table, correction=False)
print(f'p: {p:.4f} \t Decision: {decision(p)}')

p: 0.8913 	 Decision: H0 cannot be rejected.
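
For an RxC table the p-value also depends on the degrees of freedom, which the cell above unpacks but does not show. A short sketch, assuming the standard formula $(R-1)(C-1)$, reproduces the p-value from the statistic for the same gender-by-brand counts:

```python
import numpy as np
from scipy import stats

# Observed counts from the crosstab above (rows: Female, Male; columns: A, B, C, D).
observed = np.array([[3, 4, 6, 5],
                     [1, 1, 3, 1]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

# Degrees of freedom for an RxC table: (rows - 1) * (columns - 1).
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Pearson statistic and its upper-tail probability under the chi-square distribution.
chi2_manual = ((observed - expected) ** 2 / expected).sum()
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

print(f'df: {dof_manual} \t chi2: {chi2_manual:.4f} \t p: {p_manual:.4f}')
```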


<p style="background-image: linear-gradient(to right, #ee2965, #e31837)"> &nbsp; </p>

<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>