# Chi-square Test of Independence (Contingency tables)

Chi-squared tests use the same calculations and the same probability distribution for different applications:

- Chi-squared tests for variance are used to determine whether a normal population has a specified variance. The null hypothesis is that it does.
- Chi-squared tests of independence are used to decide whether two variables are associated or independent. The variables are categorical rather than numeric. For example, such a test can be used to decide whether left-handedness is associated with height, once height is grouped into categories. The null hypothesis is that the variables are independent. The calculation uses the observed and expected frequencies of occurrence (from contingency tables).
- Chi-squared goodness-of-fit tests are used to determine the adequacy of curves fit to data. The null hypothesis is that the curve fit is adequate. Curve shapes are commonly chosen to minimize the mean squared error, so it is appropriate that the goodness-of-fit calculation sums the squared errors.

https://en.wikipedia.org/wiki/Test_statistic

- One variance test
  1. Conditions for the one variance test
     - Random samples
     - Each observation should be independent of the others
       - Sampling with replacement, or
       - If sampling without replacement, the sample size should be no more than 10% of the population
     - The data follow a normal distribution
  2. Variance tests
     - Chi-square test
       - Testing the population variance against a specified value (this is the one variance test)
       - Testing the goodness of fit of a probability distribution
       - Testing the independence of two attributes (contingency tables)
     - F test
       - Testing the equality of two variances from different populations
       - Testing the equality of several means with the ANOVA technique
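
The chi-square test for one variance listed above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the sample values and the hypothesized variance `sigma0_sq` are made up for the example, and the data are assumed to come from a normal population. The statistic is (n − 1)s²/σ₀², compared against a chi-square distribution with n − 1 degrees of freedom, computed here directly with `scipy.stats.chi2`.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, assumed to come from a normal population
sample = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
sigma0_sq = 4.0  # hypothesized population variance (assumption for illustration)

n = sample.size
sample_var = sample.var(ddof=1)              # unbiased sample variance s^2
var_stat = (n - 1) * sample_var / sigma0_sq  # chi-square test statistic
var_dof = n - 1

# Upper-tail p-value for Ha: population variance > sigma0_sq
var_p_upper = stats.chi2.sf(var_stat, var_dof)
print(var_stat, var_dof, var_p_upper)
```

A small upper-tail p-value suggests the population variance is larger than the hypothesized value; a lower-tail or two-sided alternative would use `stats.chi2.cdf` instead of, or in addition to, `stats.chi2.sf`.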



## Chi-squared test for goodness of fit
![image.png](attachment:image.png)

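The degrees of freedom are df = k − 1 − (number of parameters estimated), where k is the number of categories. In addition, one of the following conditions must hold:

- All expected counts are at least 5. [4]
- All expected counts are greater than 1, and no more than 20% of expected counts are less than 5. [5]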
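
As a rough sketch of a goodness-of-fit test, `scipy.stats.chisquare` can compare observed counts with the counts expected under the null hypothesis. The counts below are hypothetical and chosen only for illustration. The `ddof` argument is the number of parameters estimated from the data, so the degrees of freedom used are k − 1 − `ddof`.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts in k = 3 categories
observed = np.array([10, 20, 30])
# Expected counts under H0 (equal proportions); every expected count is at least 5
expected = np.array([20.0, 20.0, 20.0])

# No parameters were estimated from the data, so ddof=0 and df = k - 1 = 2
gof_stat, gof_p = stats.chisquare(f_obs=observed, f_exp=expected, ddof=0)
print(gof_stat, gof_p)
```

The observed and expected counts should sum to the same total; recent SciPy versions raise an error when they do not.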

## Test of Independence (Contingency tables)

In [1]:
import scipy.stats as stats
import scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

![image.png](attachment:image.png)

In [2]:
shift_operator = np.array([[22, 26, 23], [28, 62, 26], [72, 22, 66]])
shift_operator

array([[22, 26, 23],
       [28, 62, 26],
       [72, 22, 66]])

In [3]:
stats.chi2_contingency(shift_operator)

(50.09315721064659,
 3.4527076339398545e-10,
 4,
 array([[24.96253602, 22.50720461, 23.53025937],
        [40.78386167, 36.77233429, 38.44380403],
        [56.25360231, 50.7204611 , 53.0259366 ]]))
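
The four values returned above are the test statistic, the p-value, the degrees of freedom, and the expected frequencies. A minimal sketch of the same calculation by hand, using only NumPy and `scipy.stats.chi2`, is shown below; the variable names are illustrative. `chi2_contingency` applies Yates' continuity correction only when there is one degree of freedom, so for this 3×3 table the hand calculation should agree with it.

```python
import numpy as np
from scipy import stats

observed = np.array([[22, 26, 23], [28, 62, 26], [72, 22, 66]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total

# Sum of (observed - expected)^2 / expected over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2_stat, dof)
print(chi2_stat, p_value, dof)
```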

In [4]:
print("Reject H0, there is a relationship between Shift and Operator")

Reject H0, there is a relationship between Shift and Operator


In [6]:
# How do we handle data that is passed as a DataFrame?
# Construct a DataFrame from the same data as above
df_shift_operator = pd.DataFrame(shift_operator, columns=['Shift1', 'Shift2', 'Shift3'], index=['Worker1', 'Worker2', 'Worker3'])
df_shift_operator

|         | Shift1 | Shift2 | Shift3 |
| ------- | ------ | ------ | ------ |
| Worker1 | 22 | 26 | 23 |
| Worker2 | 28 | 62 | 26 |
| Worker3 | 72 | 22 | 66 |


In [7]:
stats.chi2_contingency(df_shift_operator)

(50.09315721064659,
 3.4527076339398545e-10,
 4,
 array([[24.96253602, 22.50720461, 23.53025937],
        [40.78386167, 36.77233429, 38.44380403],
        [56.25360231, 50.7204611 , 53.0259366 ]]))

### Example 2:
Load the tips dataset that ships with seaborn.

In [10]:
df_tips = sns.load_dataset('tips')
df_tips.sample(10)

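|     | total_bill | tip | sex | smoker | day | time | size |
| --- | ---------- | --- | --- | ------ | --- | ---- | ---- |
| 160 | 21.5 | 3.5 | Male | No | Sun | Dinner | 4 |
| 147 | 11.87 | 1.63 | Female | No | Thur | Lunch | 2 |
| 189 | 23.1 | 4.0 | Male | Yes | Sun | Dinner | 3 |
| 233 | 10.77 | 1.47 | Male | No | Sat | Dinner | 2 |
| 44 | 30.4 | 5.6 | Male | No | Sun | Dinner | 4 |
| 90 | 28.97 | 3.0 | Male | Yes | Fri | Dinner | 2 |
| 145 | 8.35 | 1.5 | Female | No | Thur | Lunch | 2 |
| 49 | 18.04 | 3.0 | Male | No | Sun | Dinner | 2 |
| 142 | 41.19 | 5.0 | Male | No | Thur | Lunch | 5 |
| 101 | 15.38 | 3.0 | Female | Yes | Fri | Dinner | 2 |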


In [11]:
df_tips.groupby(['day', 'smoker']).count()

| day | smoker | total_bill | tip | sex | time | size |
| --- | ------ | ---------- | --- | --- | ---- | ---- |
| Thur | Yes | 17 | 17 | 17 | 17 | 17 |
| Thur | No | 45 | 45 | 45 | 45 | 45 |
| Fri | Yes | 15 | 15 | 15 | 15 | 15 |
| Fri | No | 4 | 4 | 4 | 4 | 4 |
| Sat | Yes | 42 | 42 | 42 | 42 | 42 |
| Sat | No | 45 | 45 | 45 | 45 | 45 |
| Sun | Yes | 19 | 19 | 19 | 19 | 19 |
| Sun | No | 57 | 57 | 57 | 57 | 57 |


In [12]:
df_tips.pivot_table(index='day', columns='smoker', aggfunc='count')

| day | sex (Yes) | sex (No) | size (Yes) | size (No) | time (Yes) | time (No) | tip (Yes) | tip (No) | total_bill (Yes) | total_bill (No) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Thur | 17 | 45 | 17 | 45 | 17 | 45 | 17 | 45 | 17 | 45 |
| Fri | 15 | 4 | 15 | 4 | 15 | 4 | 15 | 4 | 15 | 4 |
| Sat | 42 | 45 | 42 | 45 | 42 | 45 | 42 | 45 | 42 | 45 |
| Sun | 19 | 57 | 19 | 57 | 19 | 57 | 19 | 57 | 19 | 57 |

The value in parentheses in each column header is the `smoker` level.


In [14]:
df_day_smokers = df_tips.pivot_table(index='day', columns='smoker', aggfunc='count')['tip']
df_day_smokers

| day | smoker = Yes | smoker = No |
| --- | ------------ | ----------- |
| Thur | 17 | 45 |
| Fri | 15 | 4 |
| Sat | 42 | 45 |
| Sun | 19 | 57 |
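
An alternative way to build the same contingency table is `pd.crosstab`, which counts co-occurrences directly and does not require selecting a value column such as `'tip'`. The sketch below is self-contained: rather than downloading the tips dataset, it rebuilds one row per bill from the day and smoker counts shown above, so the variable names and the reconstruction step are illustrative assumptions.

```python
import pandas as pd
from scipy import stats

# Rebuild one row per bill from the (day, smoker) counts shown above,
# so the example does not need to download the tips dataset
counts = {("Thur", "Yes"): 17, ("Thur", "No"): 45,
          ("Fri", "Yes"): 15, ("Fri", "No"): 4,
          ("Sat", "Yes"): 42, ("Sat", "No"): 45,
          ("Sun", "Yes"): 19, ("Sun", "No"): 57}
rows = [(day, smoker) for (day, smoker), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["day", "smoker"])

# crosstab tabulates how often each (day, smoker) pair occurs
day_smoker = pd.crosstab(df["day"], df["smoker"])
ct_stat, ct_p, ct_dof, ct_expected = stats.chi2_contingency(day_smoker)
print(day_smoker)
print(ct_stat, ct_p, ct_dof)
```

With plain string columns, `crosstab` orders rows and columns alphabetically, which differs from the ordering in the pivot table; the chi-square statistic does not depend on that ordering.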


In [15]:
stats.chi2_contingency(df_day_smokers)

(25.787216672396262,
 1.0567572499836523e-05,
 3,
 array([[23.63114754, 38.36885246],
        [ 7.24180328, 11.75819672],
        [33.15983607, 53.84016393],
        [28.96721311, 47.03278689]]))

In [16]:
print('H0 is rejected, There is a relationship between day and smokers')

H0 is rejected, There is a relationship between day and smokers


### Example Ind_1 

The table below shows the number of perfect, satisfactory, and defective products manufactured by male and female workers.

| Gender  | Perfect | Satisfactory | Defective |
| ------- | ---- | --------- | -------- |
| Male    | 138 | 83 | 64 |
| Female  | 64 | 67 | 84 |


Do these data provide sufficient evidence at the 5% significance level to infer that product quality differs between male and female workers?

In [17]:
print("""
Ho: There is no difference in quality between the parts manufactured by Male and Female

Ha: There is a difference in quality between the parts manufactured by Male and Female""")


Ho: There is no difference in quality between the parts manufactured by Male and Female

Ha: There is a difference in quality between the parts manufactured by Male and Female


In [18]:
print("""
Significance alpha: 0.05""")


Significance alpha: 0.05


In [19]:
quality_array = np.array([[138, 83, 64], [64, 67, 84]])
chi_sq_ind_stat, p_value, deg_free, exp_freq = stats.chi2_contingency(quality_array)

In [20]:
print(f"""
We reject the Ho since {p_value} is < 0.05""")


We reject the Ho since 1.547578021398957e-05 is < 0.05


In [21]:
print(chi_sq_ind_stat)

22.152468645918482
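
The same decision can also be reached by comparing the statistic with a critical value instead of comparing the p-value with alpha. The sketch below assumes the statistic printed above and the 2×3 table from this example; the variable names are illustrative.

```python
from scipy import stats

alpha = 0.05
observed_stat = 22.152468645918482  # chi-square statistic reported above
dof = (2 - 1) * (3 - 1)             # (rows - 1) * (columns - 1) for the 2x3 table

# Critical value: the point whose upper-tail probability equals alpha
critical_value = stats.chi2.ppf(1 - alpha, dof)

# Reject H0 when the statistic falls beyond the critical value
reject_h0 = observed_stat > critical_value
print(critical_value, reject_h0)
```

Both approaches are equivalent: the statistic exceeds the critical value exactly when the p-value is below alpha.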
