In [1]:
import numpy as np
import pandas as pd

from scipy import stats as st

# Reference

> [Unit: Inference for categorical data (chi-square tests)](https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests)<br>

---

# Chi-square goodness-of-fit tests

> [Chi-squared distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution)<br>
> [Chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test)

Test if one variable is likely to come from a given distribution.

---

## Chi-square distribution introduction

A standard normal distribution: $X \sim N(0, 1), E[X] = 0, Var(X) = 1$

If we sample values from this standard normal distribution and square each one, the squared values follow a chi-square distribution: $Q = X^2 \Rightarrow Q \sim \chi^{2}_{1}$, where $1$ is the number of degrees of freedom.

If we sample from two independent standard normal variables $X_1$ and $X_2$ (they share the same distribution but are drawn independently), square each value, and add the squares, then we have $Q = X_{1}^{2} + X_{2}^{2} \Rightarrow Q \sim \chi^{2}_{2}$, where $2$ is the number of degrees of freedom.


![Chi-squared probability density function](https://upload.wikimedia.org/wikipedia/commons/3/35/Chi-square_pdf.svg 'Chi-squared probability density function')

![Chi-squared cumulative distribution function](https://upload.wikimedia.org/wikipedia/commons/0/01/Chi-square_cdf.svg 'Chi-squared cumulative distribution function')
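The construction above can be checked with a small simulation. This is only a sketch: the seed and the sample size are arbitrary choices, and the comparison relies on the fact that a chi-square variable with $k$ degrees of freedom has mean $k$ and variance $2k$.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)   # fixed seed, arbitrary choice
n_samples = 100_000              # arbitrary sample size

# two independent standard normal samples
x1 = rng.standard_normal(n_samples)
x2 = rng.standard_normal(n_samples)

q1 = x1 ** 2              # should behave like chi-square with 1 degree of freedom
q2 = x1 ** 2 + x2 ** 2    # should behave like chi-square with 2 degrees of freedom

for df, q in [(1, q1), (2, q2)]:
    # compare simulated mean/variance with the theoretical values df and 2*df
    print(df, q.mean(), q.var(), st.chi2.mean(df), st.chi2.var(df))
```

The simulated means and variances should land close to the theoretical values, with small differences caused by sampling noise.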

---

## Formula

Suppose that $n$ observations in a random sample from a population are classified into $k$ mutually exclusive classes with respective observed numbers $x_i$ (for $i = 1,2,…,k$), and a null hypothesis gives the probability $p_i$ that an observation falls into the $i^{th}$ class. So we have the expected numbers $m_i = np_i$ for all $i$, where

$\displaystyle \sum^{k}_{i=1}p_{i} = 1$

$\displaystyle \sum^{k}_{i=1}m_{i} = n\sum^{k}_{i=1}p_{i} = n = \sum^{k}_{i=1}x_{i}$

Pearson proposed that, under the circumstance of the null hypothesis being correct, as $n \rightarrow \infty$ the limiting distribution of the quantity given below is the $\chi^2$ distribution.

$\displaystyle \chi^2 = \sum^{k}_{i=1} \frac{(x_i - m_i)^2}{m_i} = \sum^{k}_{i=1} \frac{x_{i}^{2}}{m_i} - n$

The expression on the right is of the form that [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson "Karl Pearson") would generalize to the form

$\displaystyle \chi^2 = \sum^{n}_{i=1} \frac{(O_i - E_i)^2}{E_i}$

where

${\displaystyle \chi ^{2}}$ = Pearson's cumulative test statistic, which asymptotically approaches a ${\displaystyle \chi ^{2}}$ distribution.

${\displaystyle O_{i}}$ = the number of observations of type ${\displaystyle i}$.

${\displaystyle E_{i}=Np_{i}}$ = the expected (theoretical) frequency of type ${\displaystyle i}$, asserted by the null hypothesis that the fraction of type ${\displaystyle i}$ in the population is ${\displaystyle p_{i}}$, where ${\displaystyle N}$ is the total number of observations.

${\displaystyle n}$ = the number of cells in the table.
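The two equivalent forms of Pearson's statistic, $\sum (x_i - m_i)^2 / m_i$ and $\sum x_i^2 / m_i - n$, can be compared numerically. The counts and probabilities below are made up purely for illustration.

```python
import numpy as np

x = np.array([18, 22, 20])       # hypothetical observed counts x_i
p = np.array([0.3, 0.3, 0.4])    # hypothetical null probabilities p_i
n = x.sum()                      # total number of observations
m = n * p                        # expected counts m_i = n * p_i

definition = np.sum((x - m) ** 2 / m)   # sum of (x_i - m_i)^2 / m_i
shortcut = np.sum(x ** 2 / m) - n       # sum of x_i^2 / m_i, minus n

print(definition, shortcut)
```

Both expressions give the same value because $\sum x_i = \sum m_i = n$, so the cross term and the $\sum m_i$ term collapse to $-n$.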

In the case of a binomial outcome (flipping a coin), the binomial distribution may be approximated by a normal distribution (for sufficiently large ${\displaystyle n}$, here the number of trials). Because the square of a standard normal variable follows the chi-squared distribution with one degree of freedom, the probability of a result such as 1 heads in 10 trials can be approximated either by using the normal distribution directly, or by using the chi-squared distribution for the normalised, squared difference between the observed and expected values.

However, many problems involve more than the two possible outcomes of a binomial and instead require 3 or more categories, which leads to the multinomial distribution. Just as de Moivre and Laplace sought and found the normal approximation to the binomial, Pearson sought and found a degenerate multivariate normal approximation to the multinomial distribution (the numbers in each category add up to the total sample size, which is considered fixed). Pearson showed that the chi-squared distribution arises from such a multivariate normal approximation to the multinomial distribution, taking careful account of the statistical dependence (negative correlations) between the numbers of observations in different categories.
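For the coin example (1 heads in 10 tosses), the normal route and the chi-squared route can be compared directly. The sketch below assumes a fair coin and a two-sided normal test without continuity correction; those choices are assumptions made here, not something stated in the text.

```python
import numpy as np
from scipy import stats as st

n_trials, p_heads = 10, 0.5   # assumed: fair coin, 10 tosses
heads = 1                     # observed number of heads

# normal approximation: standardise the observed count
z = (heads - n_trials * p_heads) / np.sqrt(n_trials * p_heads * (1 - p_heads))
p_normal = 2 * st.norm.sf(abs(z))    # two-sided tail probability

# chi-squared route: two categories (heads, tails), 1 degree of freedom
observed = np.array([heads, n_trials - heads])
expected = n_trials * np.array([p_heads, 1 - p_heads])
chisq = np.sum((observed - expected) ** 2 / expected)
p_chisq = st.chi2.sf(chisq, df=1)

print(z ** 2, chisq)        # the squared z-score equals the chi-squared statistic
print(p_normal, p_chisq)    # and the two p-values agree
```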

---

## Expected counts in a goodness-of-fit test

---

### Example 1

Jiao works as an usher at a theater. The theater has $1000$ seats that are accessed through five entrances. Each guest should use the entrance that's marked on their ticket. Jiao wants to test if the distribution of guests according to entrances matches the official distribution. He collects information about the number of guests that went through each entrance on a certain night. Here are the results:

|Entrance|A|B|C|D|E|Total|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Expected|30%|30%|20%|10%|10%|100%|
|# of people|398|202|205|87|108|1000|

Jiao wants to perform a $\chi^2$ goodness-of-fit test to determine if these results suggest that the actual distribution of people doesn't match the expected distribution.

**What is the expected count of guests in entrance $\text A$ in Jiao's sample?**  
_You may round your answer to the nearest hundredth._

In [2]:
total = 1000
total * 0.3  # expected count = total guests * P(entrance A)

300.0

Explain:

The expected count of guests in entrance $A$ in Jiao's sample is $Total \times P(A) = 1000 \times 30\% = 300$.
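The same calculation extends to every entrance at once. A short sketch using the table above:

```python
import numpy as np

entrances = ['A', 'B', 'C', 'D', 'E']
p = np.array([0.30, 0.30, 0.20, 0.10, 0.10])     # official distribution
observed = np.array([398, 202, 205, 87, 108])    # Jiao's observed counts

# expected count for each entrance: m_i = n * p_i
expected = observed.sum() * p

for name, e in zip(entrances, expected):
    print(name, round(e, 2))
```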

---

## Conditions for a goodness-of-fit test

> [Chi-Square Goodness of Fit Test](https://stattrek.com/chi-square-test/goodness-of-fit.aspx)


The chi-square goodness of fit test is appropriate when the following conditions are met:
- **Random**: The data came from a random sample from the population of interest, or a randomized experiment.
- **Large counts**: All expected counts are at least $5$. There are no conditions attached to the observed counts.
- **Independent**: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than $10\%$ of the population.
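Two of these conditions are numeric and can be checked in code; the random condition depends on how the data were collected and cannot be verified from the counts. The helper names and the example numbers below are hypothetical, chosen only to illustrate the checks.

```python
import numpy as np

def large_counts_ok(expected, minimum=5):
    """Return True if every expected count is at least `minimum`."""
    return bool(np.all(np.asarray(expected) >= minimum))

def ten_percent_ok(sample_size, population_size):
    """Return True if the sample is at most 10% of the population."""
    return sample_size <= 0.10 * population_size

# made-up inputs for illustration
print(large_counts_ok([8.0, 8.0, 8.0]))     # all expected counts >= 5
print(large_counts_ok([12.0, 4.5, 7.5]))    # one expected count below 5
print(ten_percent_ok(80, 1000))             # 80 is under 10% of 1000
```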

---

### Example 1

Pauline sits near the snacks in the office. There are $5$ flavors of chocolate, and she wonders if some flavors get chosen more than others. She plans to record how often each flavor gets chosen in a sample of selections to carry out a $\chi^2$ goodness-of-fit test on the resulting data.

**Which of these are conditions for carrying out this test?**

- She expects each snack to be selected at least $5$ times.
- She takes a random sample of selections.

---

### Example 2

Miriam wants to test if her $10$-sided die is fair. In other words, she wants to test if some sides get rolled more often than others. She plans on recording how often each side appears in a series of rolls and carrying out a $\chi^2$ goodness-of-fit test on the results.

**What is the smallest sample size Miriam can take to pass the large counts condition?**

In [3]:
print(10 * 5, 'total rolls')

50 total rolls
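The reasoning generalizes: the smallest category probability determines the smallest sample size, because $n \cdot \min(p_i) \geq 5$ must hold. The helper below is a hypothetical sketch; it uses `Fraction` so that the ceiling is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import ceil

def min_sample_size(probs, minimum=5):
    """Smallest n such that n * min(probs) >= minimum."""
    return ceil(Fraction(minimum) / min(probs))

fair_die = [Fraction(1, 10)] * 10    # Miriam's 10-sided die under H0
print(min_sample_size(fair_die))     # matches the 50 rolls computed above
```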


---

## Test statistic and P-value in a goodness-of-fit test

---

### Example 1

In the game rock-paper-scissors, Kenny expects to win, tie, and lose with equal frequency. Kenny plays rock-paper-scissors often, but he suspected his own games were not following that pattern, so he took a random sample of $24$ games and recorded their outcomes. Here are his results:

|Outcome|Win|Loss|Tie|
|:-:|:-:|:-:|:-:|
|Games|4|13|7|

He wants to use these results to carry out a $\chi^2$ goodness-of-fit test to determine if the distribution of his outcomes disagrees with an even distribution.

**What are the values of the test statistic and P-value for Kenny's test?**

In [4]:
p = np.array([1/3, 1/3, 1/3])
total = 24
k = len(p) - 1  # degrees of freedom: number of categories minus 1
expected = p * total
observed = np.array([4, 13, 7])

# calculate chi-square statistic manually
chisq = np.sum((observed - expected)**2 / expected)

# calculate p-value by standard chi-square distribution
pval = st.chi2.sf(x=chisq, df=k)

chisq, pval

(5.25, 0.07243975703425146)

In [5]:
# OR
p = np.array([1/3, 1/3, 1/3])
total = 24
expected = p * total
observed = np.array([4, 13, 7])
chisq, pval = st.chisquare(f_obs=observed, f_exp=expected, ddof=0)

chisq, pval

(5.25, 0.07243975703425146)
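An equivalent way to read this result is to compare the statistic with a critical value. The sketch below assumes a significance level of $\alpha = 0.05$, which the problem itself does not state. It also relies on `st.chisquare` treating all categories as equally likely when `f_exp` is omitted.

```python
import numpy as np
from scipy import stats as st

observed = np.array([4, 13, 7])

# with f_exp omitted, scipy assumes equal expected frequencies
chisq, pval = st.chisquare(observed)

alpha = 0.05                                  # assumed significance level
dof = len(observed) - 1                       # categories minus 1
critical = st.chi2.ppf(1 - alpha, df=dof)     # reject H0 if chisq exceeds this

print(chisq, pval, critical)
print(chisq > critical)
```

Comparing `chisq` with `critical` always leads to the same decision as comparing `pval` with `alpha`.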

---

### Example 2

In the following table, Meryem modeled the number of rooms she believes are in use at any given time at the veterinary hospital where she works.

|Number of rooms in use|1|2|3|4|5|
|:-:|:-:|:-:|:-:|:-:|:-:|
|Percent of the time|10%|10%|25%|45%|10%|

To test her model, she took a random sample of $80$ times and recorded the numbers of rooms in use at those times. Here are her results:

|Number of rooms in use|1|2|3|4|5|
|:-:|:-:|:-:|:-:|:-:|:-:|
|Number of times observed|12|4|20|36|8|

She wants to use these results to carry out a $\chi^2$ goodness-of-fit test to determine if the distribution of numbers of rooms in use at her veterinary hospital disagrees with the claimed percentages.

**What are the values of the test statistic and P-value for Meryem's test?**

In [6]:
p = np.array([.1, .1, .25, .45, .1])
total = 80
expected = p * total
observed = np.array([12, 4, 20, 36, 8])
chisq, pval = st.chisquare(f_obs=observed, f_exp=expected, ddof=0)

chisq, pval

(4.0, 0.40600584970983794)
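Because the $\chi^2$ distribution is only the limiting distribution of the statistic, the P-value can be sanity-checked by simulating samples under the null hypothesis. This is a rough sketch; the seed and the number of simulations are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed, arbitrary choice
p = np.array([.1, .1, .25, .45, .1])
n = 80
expected = n * p
observed = np.array([12, 4, 20, 36, 8])
chisq_obs = np.sum((observed - expected) ** 2 / expected)

# draw many samples of size n from Meryem's model (the null hypothesis)
sims = rng.multinomial(n, p, size=20_000)
chisq_sim = np.sum((sims - expected) ** 2 / expected, axis=1)

# share of simulated statistics at least as large as the observed one
# (tiny tolerance guards against floating-point ties)
p_sim = np.mean(chisq_sim >= chisq_obs - 1e-9)

print(chisq_obs, p_sim)
```

The simulated proportion should be in the neighborhood of the P-value reported above, though it will not match exactly.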

---

## Conclusions in a goodness-of-fit test

---

### Example 1

A forest is divided into $4$ regions by a road and a stream. A wildlife researcher was curious if the number of deer living in each region corresponded to the total area of each region. They used drones to take overhead images and counted how many deer were in each region. The following table shows the proportion of total area for each region of the forest and how many deer were observed in each region along with a chi-square goodness-of-fit test.

|Region|A|B|C|D|
|:-:|:-:|:-:|:-:|:-:|
|Land area distribution|38%|30%|31%|1%|
|Observed deer|408|329|330|3|
|Expected deer|406.6|321|331.7|10.7|
|Components|0.01|0.2|0.01|5.54|

$\chi^2 = 5.75, \text{DF}=3, \text{P-value}=0.124$

**Which region contributed the most to the test statistic?**

Region D

Explain:

The terms that make up the sum of a $\chi^2$ test statistic are called **components.** We look for which components contributed the most to a test statistic because large components show instances where there was a large discrepancy between the observed and expected counts.

Region $\text{D}$ has the largest component because its observed count was farthest away from its expected count (relative to the expected count). So we can say that region D contributed the most to the $\chi^2$ test-statistic.

In [7]:
p = np.array([.38, .3, .31, .01])
observed = np.array([408, 329, 330, 3])
total = observed.sum()
expected = p * total

# calculate chi-square statistic manually
chisq = np.sum((observed - expected)**2 / expected)

# calculate components
(observed - expected)**2 / expected

array([4.82046237e-03, 1.99376947e-01, 8.71269219e-03, 5.54112150e+00])
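The largest component can also be picked out programmatically. A short sketch that labels the components by region and reports what share of the statistic comes from the largest one:

```python
import numpy as np

regions = np.array(['A', 'B', 'C', 'D'])
p = np.array([.38, .3, .31, .01])
observed = np.array([408, 329, 330, 3])
expected = p * observed.sum()
components = (observed - expected) ** 2 / expected

largest = regions[np.argmax(components)]           # region with the biggest component
share = components.max() / components.sum()        # its fraction of the chi-square statistic

print(largest, round(share, 3))
print(expected.min() >= 5)                         # large counts condition
```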

---

### Example 2

When Silvia feeds her cat, she places food in $3$ different locations. She wondered if her cat was equally likely to eat first from each of the locations, so she tallied which location her cat chose first in a sample of $30$ feedings. Here are the results along with a chi-square goodness-of-fit test:

|Location|Kitchen|Bedroom|Living room|
|:-:|:-:|:-:|:-:|
|Hypothesized|0.333|0.333|0.333|
|Observed|13|13|4|
|Expected|10|10|10|
|Components|0.9|0.9|3.6|

$\chi^2 = 5.4, \text{DF}=2, \text{P-value}=0.067$

Assume that all conditions for inference were met.

**At the $\alpha=0.05$ significance level, what should Silvia conclude about her cat's feeding preferences?**

There's not enough evidence to conclude that her cat prefers to eat first from some locations more than others.

Explain:

**Hypotheses and conclusions**

In a $\chi^2$ goodness-of-fit test, our null and alternative hypotheses generally look like this:

- $\text{H}_0$: The hypothesized distribution is correct.  
- $\text{H}_\text{a}$: The hypothesized distribution is not correct.

We compare the P-value to the significance level to make a conclusion:

- $\text{P-value} < \alpha \rightarrow \text{reject } \text{H}_0 \rightarrow \text{accept } \text{H}_\text{a}$
- $\text{P-value} \geq \alpha \rightarrow \text{fail to reject } \text{H}_0 \rightarrow \text{cannot accept } \text{H}_\text{a}$

**In this context**

We could state the null and alternative hypotheses in this problem as:

- $\text{H}_0$: Her cat is equally likely to eat first from each of the locations.
- $\text{H}_\text{a}$: Her cat is not equally likely to eat first from each of the locations.

Since the $\chi^2$ goodness-of-fit test produced a P-value that wasn't low enough $(0.067>\alpha=0.05)$, Silvia has failed to reject $\text{H}_0$. She doesn't have enough evidence to conclude $\text{H}_\text{a}$.

**Answer**

There's not enough evidence to conclude that her cat prefers to eat first from some locations more than others.
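The decision rule can be written as a small function and applied to Silvia's data. The function name `decide` is a hypothetical helper introduced here for illustration.

```python
import numpy as np
from scipy import stats as st

def decide(pval, alpha=0.05):
    """Apply the decision rule: reject H0 only when the P-value is below alpha."""
    return "reject H0" if pval < alpha else "fail to reject H0"

observed = np.array([13, 13, 4])          # kitchen, bedroom, living room
chisq, pval = st.chisquare(observed)      # equal expected counts by default

print(round(chisq, 2), round(pval, 3), decide(pval))
```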

---

# Chi-square tests for relationships

> [Chi-squared distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution)<br>
> [Chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test)

Test whether two categorical variables are associated (that is, not independent).

---

## Expected counts in chi-squared tests with two-way tables

---

### Example 1

Edison, Xiu, and Amia are childhood friends who go to different high schools. They were wondering if there's a difference between their schools in the number of friends people have on social media.

They each surveyed a random sample of students from their schools, asking for the number of friends they have on social media. Here are the results:

They want to perform a $\chi^2$ test of homogeneity on these results.

**What is the expected count for the cell corresponding to students from Edison's school that have more than $300$ friends?**  
_You may round your answer to the nearest hundredth._

In [8]:
# data
df = pd.DataFrame(
    {'Less than 100 friends': [8, 9, 7],
     '100 to 300 friends': [61, 54, 58],
     'More than 300 friends': [21, 25, 19]},
    index = ["Edison's school", "Xiu's school", "Amia's school"]
)
df_copy = df.copy()
df_copy['Total'] = df_copy.sum(axis=1)  # row totals (sum across the columns)
df_copy.loc['Total'] = df_copy.sum(axis=0)  # column totals (sum down the rows)
df_copy

Unnamed: 0,Less than 100 friends,100 to 300 friends,More than 300 friends,Total
Edison's school,8,61,21,90
Xiu's school,9,54,25,88
Amia's school,7,58,19,84
Total,24,173,65,262


In [9]:
# manually calculate expected by pandas
p = df.sum() / df.sum().sum()
expected = df.sum(axis=1).to_numpy()[:, None] * p.to_numpy()

expected

array([[ 8.24427481, 59.42748092, 22.32824427],
       [ 8.0610687 , 58.10687023, 21.83206107],
       [ 7.69465649, 55.46564885, 20.83969466]])

In [10]:
# OR manually calculate expected by numpy
observed = np.array([[8, 61, 21], [9, 54, 25], [7, 58, 19]])

# calculate total, p, expected manually
total = observed.sum(axis=1)
p = observed.sum(axis=0) / observed.sum()
expected = total[:, None] * p  # convert total from 1D to 2D, then multiply them

expected

array([[ 8.24427481, 59.42748092, 22.32824427],
       [ 8.0610687 , 58.10687023, 21.83206107],
       [ 7.69465649, 55.46564885, 20.83969466]])
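As a cross-check, SciPy can compute the expected table directly with `expected_freq` from `scipy.stats.contingency`. The sketch below then reads off the cell the question asks about (row 0 is Edison's school, column 2 is "More than 300 friends").

```python
import numpy as np
from scipy.stats.contingency import expected_freq

observed = np.array([[8, 61, 21], [9, 54, 25], [7, 58, 19]])

# expected counts under homogeneity: row total * column total / grand total
expected = expected_freq(observed)

# Edison's school, more than 300 friends
answer = expected[0, 2]
print(round(answer, 2))
```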

---

## Test statistic and P-value in chi-square tests with two-way tables

---

### Example 1

A group of researchers wondered whether vegetarians favored different seasons than non-vegetarians. The researchers took a random sample of people and surveyed them about their favorite season and whether or not they were vegetarian. Here are the responses and partial results of a chi-square test (expected counts appear below observed counts):

In [11]:
# data
df = pd.DataFrame(
    {'Fall': [6, 78],
     'Spring': [9, 55],
     'Summer': [15, 133],
     'Winter': [6, 58]},
    index = ["Vegetarian", "Not vegetarian"]
)
df_copy = df.copy()
df_copy['Total'] = df_copy.sum(axis=1)  # row totals (sum across the columns)
df_copy.loc['Total'] = df_copy.sum(axis=0)  # column totals (sum down the rows)
df_copy

Unnamed: 0,Fall,Spring,Summer,Winter,Total
Vegetarian,6,9,15,6,36
Not vegetarian,78,55,133,58,324
Total,84,64,148,64,360


In [12]:
# manually calculate test statistic and P-value by pandas
df = pd.DataFrame(
    {'Fall': [6, 78],
     'Spring': [9, 55],
     'Summer': [15, 133],
     'Winter': [6, 58]},
    index = ["Vegetarian", "Not vegetarian"]
)

# calculate p, expected
p = df.sum() / df.sum().sum()
expected = df.sum(axis=1).to_numpy()[:, None] * p.to_numpy()

# calculate chi2 manually
chi2 = ((df - expected)**2 / expected).sum().sum()

# calculate the p-value from the chi-square distribution
dof = (df.shape[0] - 1) * (df.shape[1] - 1)
pval = st.chi2.sf(x=chi2, df=dof)

chi2, pval, dof, expected

(1.9662966537966544,
 0.5794313456996103,
 3,
 array([[  8.4,   6.4,  14.8,   6.4],
        [ 75.6,  57.6, 133.2,  57.6]]))

In [13]:
# OR manually calculate test statistic and P-value by numpy
observed = np.array([[6, 9, 15, 6], [78, 55, 133, 58]])

# calculate total, p, expected manually
total = observed.sum(axis=1)
p = observed.sum(axis=0) / observed.sum()
expected = total[:, None] * p  # convert total from 1D to 2D, then multiply them

# calculate chi2 manually
chi2 = np.sum((observed - expected)**2 / expected)

# calculate the p-value from the chi-square distribution
dof = (len(observed) - 1) * (len(observed[0]) - 1)
pval = st.chi2.sf(x=chi2, df=dof)

chi2, pval, dof, expected

(1.966296653796654,
 0.5794313456996103,
 3,
 array([[  8.4,   6.4,  14.8,   6.4],
        [ 75.6,  57.6, 133.2,  57.6]]))

In [14]:
# OR calculate test statistic and P-value by pandas and scipy
df = pd.DataFrame(
    {'Fall': [6, 78],
     'Spring': [9, 55],
     'Summer': [15, 133],
     'Winter': [6, 58]},
    index = ["Vegetarian", "Not vegetarian"]
)
chi2, pval, dof, expected = st.chi2_contingency(observed=df)

chi2, pval, dof, expected

(1.966296653796654,
 0.5794313456996103,
 3,
 array([[  8.4,   6.4,  14.8,   6.4],
        [ 75.6,  57.6, 133.2,  57.6]]))

In [15]:
# OR calculate test statistic and P-value by numpy and scipy
observed = np.array([[6, 9, 15, 6], [78, 55, 133, 58]])
chi2, pval, dof, expected = st.chi2_contingency(observed=observed)

chi2, pval, dof, expected

(1.966296653796654,
 0.5794313456996103,
 3,
 array([[  8.4,   6.4,  14.8,   6.4],
        [ 75.6,  57.6, 133.2,  57.6]]))
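Two details are worth checking alongside these results. First, `chi2_contingency` has a `correction` parameter (Yates' continuity correction) that only applies when there is one degree of freedom, so it should make no difference for this table. Second, the returned expected counts let us confirm the large counts condition. A brief sketch:

```python
import numpy as np
from scipy import stats as st

observed = np.array([[6, 9, 15, 6], [78, 55, 133, 58]])

chi2_default, p_default, dof, expected = st.chi2_contingency(observed)
chi2_plain, p_plain, _, _ = st.chi2_contingency(observed, correction=False)

# the correction only applies when dof == 1, so both runs should agree here
print(dof, np.isclose(chi2_default, chi2_plain))

# large counts condition: every expected count should be at least 5
print(expected.min(), (expected >= 5).all())
```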

---

## Making conclusions in chi-square tests for two-way tables

---

### Example 1: One sample (test of independence)

Joslyn wondered if there was a relationship between a student's exercise habits and their typical mood. She surveyed **a random sample** of $80$ students. Here is a summary of the data and the results from a chi-square test:

Chi-square test: Exercise vs. Mood

||Good mood|Bad mood|
|:-|:-|:-|
|Doesn't exercise|3|9|
|Exercises sometimes|28|18|
|Exercises often|15|7|

$\chi^2=6.428, \text{DF}=2, \text{P-Value}=0.040$

Assume that all conditions for inference were met.

**At the $\alpha=0.05$ significance level, what is the most appropriate conclusion to draw from this test?**

This is convincing evidence that exercise and mood are not independent.

The data in this study came from one sample, so a **test of independence** is appropriate. Since the P-value is less than the significance level, we should reject the null hypothesis of independence.

Explain:

**Selecting appropriate hypotheses**

The chi-square statistic is such a versatile tool that we can use the exact same calculations to answer very different questions with it, depending on whether we draw our data from one sample or from independent samples or groups.

**Separate, independent samples or groups**

A chi-square test can help us when we want to know whether different populations or groups are alike with regards to the distribution of a variable. Our hypotheses would look something like this:

- $\text{H}_0$: The distribution of a variable is the same in each population or group.
- $\text{H}_\text{a}$: The distribution of a variable differs between some of the populations or groups.

We call this the chi-square test for **homogeneity** (the state of being alike or of the same kind).

**One sample or group**

A chi-square test can help us see whether individuals from a sample who belong to a certain category are more likely than others in the sample to also belong to another category. Our hypotheses would look something like this:

- $\text{H}_0$: There is no association between the two variables (they are independent).  
- $\text{H}_\text{a}$: There is an association between the two variables (they are not independent).

We call this the chi-square test of **association** (or **independence**).

**Using a P-value to make a conclusion**

We compare the P-value to the significance level to make a conclusion, just like we do in our other significance tests.

- $\text{P-value} < \alpha \rightarrow \text{reject } \text{H}_0 \rightarrow \text{accept } \text{H}_\text{a}$
- $\text{P-value} \geq \alpha \rightarrow \text{fail to reject } \text{H}_0 \rightarrow \text{cannot accept } \text{H}_\text{a}$

**In this context**

The students in this study came from **one sample**, so a test of **independence/association** is most appropriate. We could state the null and alternative hypotheses in this problem as:

- $\text{H}_0$: Exercise and mood are independent.  
- $\text{H}_\text{a}$: Exercise and mood are not independent.

Since the $\chi^2$ test produced a low P-value ($0.04<\alpha=0.05$), we should reject $\text{H}_0$ and accept $\text{H}_\text{a}$.

**Answer**

This is convincing evidence that exercise and mood are not independent.
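The reported test results can be reproduced from the table with `chi2_contingency`. A short sketch using pandas, in the same style as the earlier cells:

```python
import pandas as pd
from scipy import stats as st

df = pd.DataFrame(
    {'Good mood': [3, 28, 15],
     'Bad mood': [9, 18, 7]},
    index=["Doesn't exercise", "Exercises sometimes", "Exercises often"]
)

chi2, pval, dof, expected = st.chi2_contingency(observed=df)

alpha = 0.05
print(round(chi2, 3), dof, round(pval, 3))
print(pval < alpha)    # True means reject the null hypothesis of independence
```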

---

### Example 2: Two samples (test of homogeneity)

A market researcher was curious about the colors of different types of vehicles. They obtained **a random sample of $180$ sedans and a separate random sample of $180$ trucks**. Here is a summary of the colors in each sample and the results from a chi-squared test:

Chi-square test: Color vs. type

||Trucks|Sedans|
|:-|:-|:-|
|Red|37|57|
|Blue|36|41|
|Black|77|48|
|Other|30|34|

$\chi^2=11.558, \text{DF}=3, \text{P-value}=0.009$

Assume that all conditions for inference were met.

**At the $\alpha=0.05$ significance level, what is the most appropriate conclusion to draw from this test?**

This is convincing evidence that the distribution of color differs between trucks and sedans in this population.

The data in this study came from separate samples, so a test for homogeneity is appropriate. Since the P-value is less than the significance level, we should reject the null hypothesis that the color distribution is the same for trucks and sedans.

Explain:

**In this context**

The vehicles in this study came from **separate samples**, so a test for **homogeneity** is most appropriate. We could state the null and alternative hypotheses in this problem as:

- $\text{H}_0$: The distribution of color is the same between trucks and sedans.
- $\text{H}_\text{a}$: The distribution of color differs between trucks and sedans.

Since the $\chi^2$ test produced a low P-value ($0.009<\alpha=0.05$), we should reject $\text{H}_0$ and accept $\text{H}_\text{a}$.

**Answer**

This is convincing evidence that the distribution of color differs between trucks and sedans in this population.
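The same calculation works for the vehicle data. This sketch also prints the color proportions within each sample, which is the comparison a test of homogeneity is about.

```python
import numpy as np
from scipy import stats as st

colors = ['Red', 'Blue', 'Black', 'Other']
# rows follow `colors`; columns are trucks, sedans
observed = np.array([[37, 57], [36, 41], [77, 48], [30, 34]])

# color distribution within each sample (each column sums to 1)
col_props = observed / observed.sum(axis=0)

chi2, pval, dof, expected = st.chi2_contingency(observed)

print(col_props.round(3))
print(round(chi2, 3), dof, round(pval, 3))
```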