# Chi-square test

Chi-squared tests apply the same calculations and the same probability distribution to several different problems:

- Chi-squared tests for variance are used to determine whether a normal population has a specified variance. The null hypothesis is that it does.
- Chi-squared tests of independence are used to decide whether two variables are associated or independent. The variables are categorical rather than numeric. For example, such a test can be used to decide whether left-handedness is associated with height (with height grouped into categories). The null hypothesis is that the variables are independent. The numbers used in the calculation are the observed and expected frequencies of occurrence (from contingency tables).
- Chi-squared goodness-of-fit tests are used to determine how adequately a curve fits the data. The null hypothesis is that the fit is adequate. Curve shapes are commonly chosen to minimize the mean squared error, so it is appropriate that the goodness-of-fit calculation sums the squared errors.

https://en.wikipedia.org/wiki/Test_statistic

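**Chi-squared test for variance**

$$\chi^2 = \frac{(n-1)\,s^2}{\sigma_0^2}$$

where $n$ is the sample size, $s^2$ is the sample variance and $\sigma_0^2$ is the variance specified by the null hypothesis.

df = n − 1

- Normal population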
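
The variance test is not worked through elsewhere in this notebook, so here is a minimal sketch of how it could be computed. It assumes the usual statistic, (n − 1)·s² / σ₀², with n − 1 degrees of freedom. The helper name `chi2_variance_test`, the two-sided p-value convention (twice the smaller tail) and the sample data are all illustrative assumptions, not part of the original material.

```python
import numpy as np
from scipy import stats

def chi2_variance_test(sample, sigma0_sq, alpha=0.05):
    """Two-sided chi-squared test of Ho: population variance == sigma0_sq.

    Hypothetical helper; assumes the sample comes from a normal population.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s_sq = sample.var(ddof=1)            # unbiased sample variance
    stat = (n - 1) * s_sq / sigma0_sq    # test statistic
    df = n - 1
    # Two-sided p-value: twice the smaller tail probability, capped at 1
    tail = min(stats.chi2.cdf(stat, df), stats.chi2.sf(stat, df))
    p_value = min(1.0, 2 * tail)
    return stat, df, p_value, bool(p_value < alpha)

# Made-up measurements, tested against a hypothesized variance of 4.0
sample = [9.8, 10.4, 11.9, 8.1, 10.0, 12.3, 7.6, 9.9]
stat, df, p_value, reject = chi2_variance_test(sample, sigma0_sq=4.0)
print(f"chi2 = {stat:.3f}, df = {df}, p-value = {p_value:.3f}, reject Ho: {reject}")
```

For a one-sided alternative, only the relevant tail (`stats.chi2.sf` or `stats.chi2.cdf`) would be used instead of doubling the smaller one.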

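**Chi-squared test for goodness of fit**

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ and $E_i$ are the observed and expected counts in category $i$, and $k$ is the number of categories.

df = k − 1 − (number of parameters estimated), and one of these conditions must hold:

- All expected counts are at least 5.
- All expected counts are greater than 1, and no more than 20% of expected counts are less than 5.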
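
The two rules of thumb above can be checked mechanically before running a test. The following is a small sketch; the function name `expected_counts_ok` and the example count vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def expected_counts_ok(expected):
    """Return True if either rule of thumb for expected counts holds."""
    expected = np.asarray(expected, dtype=float)
    # Rule 1: every expected count is at least 5
    all_at_least_5 = bool(np.all(expected >= 5))
    # Rule 2: every expected count exceeds 1 and at most 20% are below 5
    share_below_5 = np.mean(expected < 5)
    relaxed_rule = bool(np.all(expected > 1) and share_below_5 <= 0.20)
    return all_at_least_5 or relaxed_rule

print(expected_counts_ok([150, 225, 100, 25]))  # every count is at least 5
print(expected_counts_ok([40, 30, 20, 6, 4]))   # one of five counts is below 5
print(expected_counts_ok([10, 3, 2, 0.5]))      # a count below 1 breaks both rules
```

When neither rule holds, a common remedy is to merge sparse categories before testing.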

## Test for Goodness of Fit

###  Example 1

A1 Airlines operates daily flights to several Indian cities. The operations manager believes that 30% of passengers prefer vegan food, 45% prefer vegetarian food, 20% prefer non-vegetarian food, and 5% request Jain food.

A sample of 500 passengers was chosen to analyse the food preferences and the data is shown in the following table:

| Food type | Vegan | Vegetarian | Non-Vegetarian | Jain |
| --- | --- | --- | --- | --- |
| Number of passengers | 190 | 185 | 90 | 35 |

At the 5% level of significance, are the observed meal preferences consistent with the operations manager's belief?

In [3]:
print("""
Ho: Meal preferences follow the proportions believed by the operations manager
Ha: Meal preferences differ from the proportions believed by the operations manager""")


Ho: Meal preferences follow the proportions believed by the operations manager
Ha: Meal preferences differ from the proportions believed by the operations manager


In [4]:
print("""
Significance: alpha is chosen as 0.05""")


Significance: alpha is chosen as 0.05


In [5]:
print("""
The test statistic we will use is the chi-squared goodness-of-fit statistic""")


The test statistic we will use is the chi-squared goodness-of-fit statistic


In [8]:
# Import the libraries used in the calculations below

import scipy.stats as stats
import numpy as np
import pandas as pd

In [9]:
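# Observed passenger counts: vegan, vegetarian, non-vegetarian, Jain
observed_values = np.array([190, 185, 90, 35])

total_samples = observed_values.sum()

# Expected counts under Ho: sample size times the manager's believed proportions
expected_values = total_samples * np.array([.3, .45, .2, .05])
expected_values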

array([150., 225., 100.,  25.])

In [10]:
chi_sq_stat, p_value = stats.chisquare(f_obs=observed_values, f_exp=expected_values)

In [12]:
print(p_value)

4.492718590376291e-05


In [13]:
print(f"""
Observation: the p-value ({p_value}) is less than 0.05, so we reject Ho in favour of Ha.

The observed food choices differ from the proportions predicted by the manager.""")


Observation: the p-value (4.492718590376291e-05) is less than 0.05, so we reject Ho in favour of Ha.

The observed food choices differ from the proportions predicted by the manager.
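
As a cross-check, the result of `stats.chisquare` for Example 1 can be reproduced directly from the definition of the statistic, the sum over categories of (observed − expected)² / expected, with k − 1 = 3 degrees of freedom since no parameters are estimated from the data. This is a sketch only; the variable names `chi_sq_manual`, `p_manual` and `critical` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([190, 185, 90, 35])
proportions = np.array([0.30, 0.45, 0.20, 0.05])  # manager's belief in Example 1
expected = observed.sum() * proportions

# Goodness-of-fit statistic computed from its definition
chi_sq_manual = np.sum((observed - expected) ** 2 / expected)
df = len(observed) - 1                        # k - 1, no estimated parameters
p_manual = stats.chi2.sf(chi_sq_manual, df)   # upper-tail probability
critical = stats.chi2.ppf(1 - 0.05, df)       # rejection threshold at alpha = 0.05

print(f"chi2 = {chi_sq_manual:.4f}, df = {df}")
print(f"p-value = {p_manual:.3e}, critical value = {critical:.3f}")
```

Rejecting Ho when the statistic exceeds the critical value is equivalent to rejecting when the p-value is below alpha.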


### Practice Exercise 2
Refer to Example 1 above. The operations manager now believes that 28% of passengers prefer vegan food, 42% prefer vegetarian food, 25% prefer non-vegetarian food, and 5% request Jain food.

At the 5% level of significance, are the observed meal preferences consistent with the operations manager's revised belief?

In [15]:
observed_values = np.array([190, 185, 90, 35])
total_samples = observed_values.sum()
expected_values = np.array([total_samples*.28, total_samples* .42, total_samples* .25, total_samples* .05])
expected_values

array([140., 210., 125.,  25.])

In [16]:
chi_sq_stat, p_value = stats.chisquare(f_obs=observed_values, f_exp=expected_values)
p_value

1.4561004918754443e-07

In [17]:
print(f"""
Observation: the p-value ({p_value}) is less than 0.05, so we reject Ho in favour of Ha.

The observed food choices differ from the proportions predicted by the manager.""")


Observation: the p-value (1.4561004918754443e-07) is less than 0.05, so we reject Ho in favour of Ha.

The observed food choices differ from the proportions predicted by the manager.
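
Since Example 1 and this exercise differ only in the hypothesized proportions, the steps could be wrapped in one reusable function. The sketch below assumes a hypothetical helper named `goodness_of_fit`; it relies only on `stats.chisquare`, which is already used in this notebook.

```python
import numpy as np
from scipy import stats

def goodness_of_fit(observed, proportions, alpha=0.05):
    """Hypothetical helper: test observed counts against hypothesized proportions."""
    observed = np.asarray(observed, dtype=float)
    proportions = np.asarray(proportions, dtype=float)
    if not np.isclose(proportions.sum(), 1.0):
        raise ValueError("hypothesized proportions must sum to 1")
    expected = observed.sum() * proportions   # expected counts under Ho
    stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
    return stat, p_value, bool(p_value < alpha)

observed = [190, 185, 90, 35]
beliefs = [("Example 1", [0.30, 0.45, 0.20, 0.05]),
           ("Exercise 2", [0.28, 0.42, 0.25, 0.05])]
for label, props in beliefs:
    stat, p_value, reject = goodness_of_fit(observed, props)
    print(f"{label}: chi2 = {stat:.3f}, p-value = {p_value:.3e}, reject Ho: {reject}")
```

Validating that the proportions sum to 1 guards against typing mistakes, which would otherwise produce expected counts whose total does not match the sample size.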


## Test of Independence

### Example Ind_1 

The table below shows the number of perfect, satisfactory and defective products manufactured by male and female workers.

| Gender  | Perfect | Satisfactory | Defective |
| ------- | ---- | --------- | -------- |
| Male    | 138 | 83 | 64 |
| Female  | 64 | 67 | 84 |


Do these data provide sufficient evidence, at the 5% significance level, to infer that product quality differs between genders (male and female)?

In [19]:
print("""
Ho: Product quality is independent of the worker's gender

Ha: Product quality is associated with the worker's gender""")


Ho: Product quality is independent of the worker's gender

Ha: Product quality is associated with the worker's gender


In [20]:
print("""
Significance alpha: 0.05""")


Significance alpha: 0.05


In [21]:
quality_array = np.array([[138, 83, 64], [64, 67, 84]])
chi_sq_ind_stat, p_value, deg_free, exp_freq = stats.chi2_contingency(quality_array)

In [22]:
print(f"""
We reject Ho since the p-value ({p_value}) is less than 0.05""")


We reject Ho since the p-value (1.547578021398957e-05) is less than 0.05


In [25]:
print(chi_sq_ind_stat)

22.152468645918482
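
To see where this statistic comes from, the expected frequencies under independence can be rebuilt from the row and column totals: each expected count is (row total × column total) / grand total, and the degrees of freedom are (rows − 1) × (columns − 1). The sketch below uses illustrative variable names. It should agree with `stats.chi2_contingency` here because the table is 2×3, and SciPy applies its continuity correction only when there is one degree of freedom.

```python
import numpy as np
from scipy import stats

quality = np.array([[138, 83, 64],    # Male: perfect, satisfactory, defective
                    [64, 67, 84]])    # Female

row_totals = quality.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = quality.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = quality.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total
chi_sq_manual = np.sum((quality - expected) ** 2 / expected)
df = (quality.shape[0] - 1) * (quality.shape[1] - 1)
p_manual = stats.chi2.sf(chi_sq_manual, df)       # upper-tail probability

print(np.round(expected, 2))
print(f"chi2 = {chi_sq_manual:.4f}, df = {df}, p-value = {p_manual:.3e}")
```

The same expected frequencies are available as the fourth value returned by `stats.chi2_contingency` (`exp_freq` in the cell above).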
