# Chi-square Goodness Of Fit

Chi-squared tests use the same calculations and the same probability distribution for several different applications:

- Chi-squared tests for variance are used to determine whether a normal population has a specified variance. The null hypothesis is that it does.
- Chi-squared tests of independence are used to decide whether two variables are associated or independent. The variables are categorical rather than numeric. For example, the test can be used to decide whether left-handedness is associated with height, once height has been grouped into categories. The null hypothesis is that the variables are independent. The numbers used in the calculation are the observed and expected frequencies of occurrence (from contingency tables).
- Chi-squared goodness-of-fit tests are used to determine how adequately a curve or distribution fits the data. The null hypothesis is that the fit is adequate. Curves are commonly fit by minimizing the mean squared error, so it is appropriate that the goodness-of-fit calculation sums squared errors.
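
The test of independence described above can be sketched with `scipy.stats.chi2_contingency`. The contingency table below is hypothetical (the counts are invented for illustration and do not come from this notebook); rows are handedness and columns are height groups.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, invented for illustration only.
# Rows: left-handed, right-handed. Columns: short, medium, tall.
table = np.array([
    [12, 18, 10],
    [88, 132, 90],
])

# chi2_contingency derives the expected counts from the row and column totals
# under the null hypothesis that the two variables are independent.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p-value = {p_value:.3f}")
print("Expected counts under independence:")
print(expected.round(2))
```

The degrees of freedom here are (rows − 1) × (columns − 1), which is 2 for a 2 × 3 table.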

https://en.wikipedia.org/wiki/Test_statistic

- One-variance test
  1. Conditions for the one-variance test
     - Random samples
     - Each observation should be independent of the others
       - Sampling with replacement, or
       - If sampling without replacement, the sample size should not be more than 10% of the population
     - The data follow a normal distribution
  2. Variance tests
     - Chi-square test
       - Testing the population variance against a specified value (this is the one-variance test)
       - Testing goodness of fit of a probability distribution
       - Testing independence of two attributes (contingency tables)
     - F test
       - Testing equality of two variances from different populations
       - Testing equality of several means with the ANOVA technique
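
SciPy has no single function for the one-variance test, so the following is a minimal sketch built on `scipy.stats.chi2`. The helper name `one_variance_chi2_test`, the simulated sample, and the two-sided p-value convention (twice the smaller tail) are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def one_variance_chi2_test(sample, sigma0_sq):
    """Chi-square test of H0: population variance == sigma0_sq.

    Hypothetical helper. Assumes the sample is drawn independently
    from a normal population.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s_sq = sample.var(ddof=1)            # unbiased sample variance
    stat = (n - 1) * s_sq / sigma0_sq    # follows chi2 with n - 1 df under H0
    df = n - 1
    # Two-sided p-value: twice the smaller of the two tail probabilities.
    p_value = 2 * min(stats.chi2.cdf(stat, df), stats.chi2.sf(stat, df))
    return stat, min(p_value, 1.0)

# Simulated data for illustration: normal with standard deviation 2 (variance 4).
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=30)

stat, p_value = one_variance_chi2_test(sample, sigma0_sq=4.0)
print(f"statistic = {stat:.3f}, p-value = {p_value:.3f}")
```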



Chi-squared test for goodness of fit	
![image.png](attachment:image.png)
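
For reference, the Pearson chi-squared statistic usually used for this test can be written out as below. The symbols are notation chosen here rather than taken from the source: $O_i$ and $E_i$ are the observed and expected counts in category $i$, and $k$ is the number of categories.

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$

For instance, observed counts of 40 and 60 against expected counts of 50 and 50 give $\frac{(40-50)^2}{50} + \frac{(60-50)^2}{50} = 2 + 2 = 4$.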

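The degrees of freedom are df = k − 1 − (number of parameters estimated from the data), where k is the number of categories. In addition, one of the following conditions on the expected counts must hold:

• All expected counts are at least 5.[4]

• All expected counts are greater than 1, and no more than 20% of the expected counts are less than 5.[5]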
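
The conditions above can be checked before running the test. The sketch below uses a hypothetical helper, `expected_counts_ok`, and invented counts. It also shows the `ddof` argument of `scipy.stats.chisquare`, which reduces the degrees of freedom from k − 1 to k − 1 − ddof when parameters have been estimated from the data.

```python
import numpy as np
from scipy import stats

def expected_counts_ok(expected):
    """Return True if either expected-count condition holds (hypothetical helper)."""
    expected = np.asarray(expected, dtype=float)
    # Condition 1: every expected count is at least 5.
    all_at_least_5 = np.all(expected >= 5)
    # Condition 2: every expected count exceeds 1 and at most 20% are below 5.
    relaxed = np.all(expected > 1) and np.mean(expected < 5) <= 0.20
    return bool(all_at_least_5 or relaxed)

print(expected_counts_ok([50, 50]))
print(expected_counts_ok([0.5, 3, 96.5]))

# Invented counts for illustration; both arrays sum to 100.
obs = np.array([18, 30, 28, 24])
exp = np.array([20, 30, 30, 20])

# ddof=1 stands for one parameter estimated from the data,
# so the p-value uses k - 1 - 1 = 2 degrees of freedom.
result = stats.chisquare(obs, exp, ddof=1)
print(result.statistic, result.pvalue)
```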

## Test for Goodness of Fit

In [1]:
import scipy.stats as stats
import scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Example 1: 
A coin is flipped 100 times and the numbers of heads and tails are recorded. Is the coin biased? Test at the 5% significance level (95% confidence level).

Head = 40

Tail = 60

In [3]:
exp = [50, 50]
obs = [40, 60]

In [9]:
stats.chisquare(obs, exp)

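Power_divergenceResult(statistic=4.0, pvalue=0.04550026389635857)

The p-value (about 0.046) is below 0.05, so we reject the null hypothesis of a fair coin at the 5% significance level, although only by a narrow margin.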
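
As a cross-check, the same result can be reproduced by hand from the chi-square distribution. This is a sketch of the calculation, not a replacement for `stats.chisquare`; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

obs = np.array([40, 60])   # observed heads, tails
exp = np.array([50, 50])   # expected counts for a fair coin

# Sum of squared deviations, each scaled by its expected count.
stat = np.sum((obs - exp) ** 2 / exp)      # (100/50) + (100/50) = 4.0

df = len(obs) - 1                          # k - 1, no parameters estimated
critical = stats.chi2.ppf(0.95, df)        # critical value at alpha = 0.05
p_value = stats.chi2.sf(stat, df)          # upper-tail probability

reject = stat > critical
print(f"statistic = {stat}, critical = {critical:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject else "Fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with alpha are equivalent decisions.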

### Example 2:
A die is rolled 65 times and the number of times each value appears is recorded. Is the die biased? Test at the 5% significance level (95% confidence level).

[10, 6, 8, 22, 11, 8]


In [7]:
obs_d = [10, 6, 8, 22, 11, 8]

In [8]:
stats.chisquare(f_obs=obs_d)

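Power_divergenceResult(statistic=15.215384615384615, pvalue=0.009480606629220312)

The p-value (about 0.009) is below 0.05, so we reject the null hypothesis that all six values are equally likely. When `f_exp` is omitted, `stats.chisquare` assumes equal expected frequencies across the categories.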
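
To make the uniform expectation explicit, the expected counts can be built by hand and passed as `f_exp`. A short sketch, with `exp_d` as a name introduced here:

```python
import numpy as np
from scipy import stats

obs_d = np.array([10, 6, 8, 22, 11, 8])

# A fair die spreads the 65 rolls evenly: 65 / 6 expected rolls per value.
exp_d = np.full(obs_d.size, obs_d.sum() / obs_d.size)

explicit = stats.chisquare(f_obs=obs_d, f_exp=exp_d)
default = stats.chisquare(f_obs=obs_d)   # f_exp omitted: uniform by default

print(explicit.statistic, explicit.pvalue)
print(np.isclose(explicit.statistic, default.statistic))
```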

### Example 3: 
A T-shirt manufacturer compares the expected proportion of sales for each size with the actual sales counts.

- H0: The data follow the specified distribution.
- Ha: The data do not follow the specified distribution.
- Alpha = 0.05

| Size | Proportions | Counts |
| ---- | ---- | ---- |
| Small | 0.1 | 25 |
| Medium | 0.2 | 41 |
| Large | 0.4 | 91 |
| Extra Large | 0.3 | 68 |

In [10]:
exp_shirt_r = pd.Series([0.1, 0.2, 0.4, 0.3])
obs_shirt = pd.Series([25, 41, 91, 68])

In [12]:
exp_shirt = exp_shirt_r * sum(obs_shirt)

In [14]:
stats.chisquare(obs_shirt, exp_shirt)

Power_divergenceResult(statistic=0.648148148148148, pvalue=0.8853267818237286)

In [15]:
print("Fail to reject H0: there is no significant difference between the expected and actual T-shirt sales by size")

Fail to reject H0: there is no significant difference between the expected and actual T-shirt sales by size


###  Example 4:

A1 Airlines operates daily flights to several Indian cities. The operations manager believes that 30% of passengers prefer vegan food, 45% prefer vegetarian food, 20% prefer non-vegetarian food, and 5% request Jain food.

A sample of 500 passengers was chosen to analyse the food preferences and the data is shown in the following table:

| Food type | Vegan | Vegetarian | Non-Vegetarian | Jain |
| ---- | ---- | ---- | ---- | ---- |
| Number of passengers | 190 | 185 | 90 | 35 |

At the 5% level of significance, are the observed meal preferences consistent with the belief of the operations manager?

In [16]:
print("""
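H0: The observed meal preferences match the proportions believed by the operations manager
Ha: The observed meal preferences differ from the proportions believed by the operations manager""")


H0: The observed meal preferences match the proportions believed by the operations manager
Ha: The observed meal preferences differ from the proportions believed by the operations manager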


In [17]:
print("""
Significance: alpha is chosen as 0.05""")


Significance: alpha is chosen as 0.05


In [18]:
print("""
The test we will use is the Chi-Square goodness-of-fit test""")


The test we will use is the Chi-Square goodness-of-fit test


In [19]:
observed_values = np.array([190, 185, 90, 35])

total_samples = observed_values.sum()

expected_values = np.array([total_samples*.3, total_samples*.45, total_samples*.2, total_samples*.05])
expected_values

array([150., 225., 100.,  25.])

In [20]:
chi_sq_stat, p_value = stats.chisquare(f_obs=observed_values, f_exp=expected_values)

In [21]:
print(p_value)

4.492718590376291e-05


In [22]:
print(f"""
Observation: the p-value {p_value} is less than 0.05. Reject H0 in favour of Ha.

The observed food choices differ from the proportions predicted by the manager.""")


Observation: the p-value 4.492718590376291e-05 is less than 0.05. Reject H0 in favour of Ha.

The observed food choices differ from the proportions predicted by the manager.
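
The overall test says only that the observed counts differ from the believed proportions. To see which categories drive the result, one option is to break the statistic into per-category contributions. The sketch below is an illustration of that idea; the column names and the `breakdown` table are introduced here.

```python
import numpy as np
import pandas as pd

categories = ["Vegan", "Vegetarian", "Non-Vegetarian", "Jain"]
observed = np.array([190, 185, 90, 35])
believed = np.array([0.30, 0.45, 0.20, 0.05])   # manager's believed proportions
expected = believed * observed.sum()

breakdown = pd.DataFrame(
    {"observed": observed, "expected": expected},
    index=categories,
)
# Pearson residual: signed deviation in units of sqrt(expected count).
breakdown["pearson_residual"] = (observed - expected) / np.sqrt(expected)
# Squared residuals add up to the chi-square statistic.
breakdown["contribution"] = breakdown["pearson_residual"] ** 2

print(breakdown.round(3))
print("chi-square statistic:", breakdown["contribution"].sum())
```

Positive residuals mark categories chosen more often than believed, and negative residuals mark categories chosen less often.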


### Practice Exercise 2
Refer to Example 4 above. The operations manager revises his belief and now believes that 28% of passengers prefer vegan food, 42% prefer vegetarian food, 25% prefer non-vegetarian food, and 5% request Jain food.

At the 5% level of significance, are the observed meal preferences consistent with the belief of the operations manager?

In [23]:
observed_values = np.array([190, 185, 90, 35])
total_samples = observed_values.sum()
expected_values = np.array([total_samples*.28, total_samples* .42, total_samples* .25, total_samples* .05])
expected_values

array([140., 210., 125.,  25.])

In [24]:
chi_sq_stat, p_value = stats.chisquare(f_obs=observed_values, f_exp=expected_values)
p_value

1.4561004918754443e-07

In [25]:
print(f"""
Observation: the p-value {p_value} is less than 0.05. Reject H0 in favour of Ha.

The observed food choices differ from the proportions predicted by the manager.""")


Observation: the p-value 1.4561004918754443e-07 is less than 0.05. Reject H0 in favour of Ha.

The observed food choices differ from the proportions predicted by the manager.
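
The examples in this section repeat the same steps: scale the hypothesized proportions to expected counts, run `stats.chisquare`, and compare the p-value with alpha. A small wrapper can capture that pattern. The function `goodness_of_fit` below is a hypothetical helper written for illustration, not part of SciPy.

```python
import numpy as np
from scipy import stats

def goodness_of_fit(observed, proportions, alpha=0.05):
    """Chi-square goodness-of-fit test from hypothesized proportions.

    Hypothetical helper. Returns (statistic, p_value, reject_h0).
    """
    observed = np.asarray(observed, dtype=float)
    proportions = np.asarray(proportions, dtype=float)
    if not np.isclose(proportions.sum(), 1.0):
        raise ValueError("proportions must sum to 1")
    # Expected counts share the observed total.
    expected = proportions * observed.sum()
    stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
    return stat, p_value, bool(p_value < alpha)

# T-shirt sales from Example 3.
print(goodness_of_fit([25, 41, 91, 68], [0.1, 0.2, 0.4, 0.3]))

# Meal preferences with the manager's revised belief.
print(goodness_of_fit([190, 185, 90, 35], [0.28, 0.42, 0.25, 0.05]))
```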
