## 1 Chi-Squared Test

### 1.1 Conditions for the Chi-Squared Test

1. The variables must be categorical.
2. Each data point in the dataset must be independent of the others.
3. The cells of the contingency table must be mutually exclusive: each observation is counted in exactly one cell.
4. The expected count should be at least 5 in at least 80% of the cells of the contingency table.
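Condition 4 can be checked before running the test. The following is a minimal sketch, assuming a hypothetical 2x3 table of observed counts (the numbers are illustrative only) and reading the threshold as "at least 5". It uses `scipy.stats.contingency.expected_freq` to compute the expected counts under independence.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical 2x3 contingency table of observed counts (illustrative data only)
observed = np.array([[12, 5, 9],
                     [8, 3, 13]])

# Expected counts under independence: row_total * column_total / grand_total
expected = expected_freq(observed)

# Fraction of cells whose expected count is at least 5
share_ok = np.mean(expected >= 5)

print(expected.round(2))
print(f"Cells with expected count >= 5: {share_ok:.0%}")
print("Condition 4 satisfied:", bool(share_ok >= 0.8))
```

In this made-up table only 4 of the 6 cells reach an expected count of 5, so the condition would not hold and the Chi-Squared approximation should be treated with caution.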

### 1.2 Degrees of Freedom (DOF)

<div style="display: inline-block">

| Data Structure                            | Formula                 |
| :---------------------------------------- | :---------------------- |
| Single sample                             | $n - 1$                 |
| Two samples of same length                | $2 \cdot (n - 1)$       |
| Two samples of different length           | $(n_1 - 1) + (n_2 - 1)$ |
| Contingency table ($n$ rows, $m$ columns) | $(n - 1) \cdot (m - 1)$ |

</div>
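As a quick check of the contingency-table formula, the sketch below compares $(n - 1) \cdot (m - 1)$ with the degrees of freedom reported by `scipy.stats.chi2_contingency`. The table values are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table with 2 rows and 3 columns (illustrative data only)
table = np.array([[10, 20, 30],
                  [20, 20, 20]])

n_rows, n_cols = table.shape
dof_formula = (n_rows - 1) * (n_cols - 1)  # (n - 1) * (m - 1)

# chi2_contingency returns (statistic, p-value, dof, expected frequencies)
_, _, dof_scipy, _ = chi2_contingency(table)

print(dof_formula, dof_scipy)
```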

## 2 Goodness of Fit

### 2.1 Nature of hypothesis

- $H_0: \text{There is NO significant difference between the observed and expected results}$
- $H_a: \text{There is a significant difference between the observed and expected results}$

### 2.2 Test Statistic

1. The test statistic of the Chi-Squared test is called the Chi-Squared value $(\chi^2)$.
2. Under $H_0$, the test statistic approximately follows a Chi-Squared distribution.

As $\chi^2$ increases, the evidence against the null hypothesis grows. The null hypothesis is rejected when $\chi^2$ exceeds the critical value for the chosen significance level $(\alpha)$, or equivalently when the p-value is less than $\alpha$.
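The decision rule can be illustrated by computing the statistic by hand with the standard formula $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, where $O_i$ and $E_i$ are the observed and expected counts. The sketch below assumes hypothetical counts over five categories, equal expected counts, and $\alpha = 0.05$; none of these values come from the notes.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical observed counts over 5 categories (illustrative data only)
observed = np.array([18, 22, 20, 25, 15])
# Assumed null hypothesis: all categories are equally likely
expected = np.full(observed.shape, observed.sum() / observed.size)
alpha = 0.05  # assumed significance level

# Chi-Squared statistic: sum of (O - E)^2 / E
chi2_stat = np.sum((observed - expected) ** 2 / expected)
dof = observed.size - 1

critical_value = chi2.ppf(1 - alpha, dof)  # rejection threshold for the statistic
p_value = chi2.sf(chi2_stat, dof)          # right-tail probability of the statistic

reject_h0 = chi2_stat > critical_value
print(f"chi2 = {chi2_stat:.3f}, critical = {critical_value:.3f}, p = {p_value:.4f}")
print("Reject H0:", bool(reject_h0))
```

Comparing the statistic with the critical value and comparing the p-value with $\alpha$ always lead to the same decision; here the statistic falls below the critical value, so $H_0$ is not rejected.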

### 2.3 API

```python
from scipy.stats import chisquare
```

https://docs.scipy.org/doc/scipy-1.16.2/reference/generated/scipy.stats.chisquare.html
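A minimal usage sketch for `chisquare`, assuming hypothetical observed counts and hypothesized proportions of 25%, 50%, and 25%. `f_exp` takes expected counts, not proportions, so the proportions are scaled by the sample size; the observed and expected totals need to agree.

```python
from scipy.stats import chisquare

# Hypothetical observed counts for three categories (illustrative data only)
observed = [30, 50, 20]
# Assumed proportions under the null hypothesis
proportions = [0.25, 0.50, 0.25]

n = sum(observed)
expected = [n * p for p in proportions]  # expected counts: [25.0, 50.0, 25.0]

# Degrees of freedom default to (number of categories - 1)
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.4f}, p-value = {p_value:.4f}")
```

If `f_exp` is omitted, `chisquare` assumes all categories are equally likely.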

## 3 Test of independence

### 3.1 Nature of hypothesis

- $H_0: \text{There is NO significant association between the two categorical variables (they are independent)}$
- $H_a: \text{There is a significant association between the two categorical variables (they are dependent)}$

### 3.2 Test Statistic

The test statistic of the Chi-Squared test is called the Chi-Squared value $(\chi^2)$.

### 3.3 API

```python
from scipy.stats import chi2_contingency
```

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
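A minimal usage sketch for `chi2_contingency`, assuming a hypothetical table with two groups in the rows and three categories in the columns, and an assumed significance level of 0.05. The function takes the observed counts and derives the expected counts itself.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts (illustrative data only):
# rows = two groups, columns = three categories
table = np.array([[30, 20, 50],
                  [70, 80, 50]])
alpha = 0.05  # assumed significance level

# Returns the statistic, the p-value, the degrees of freedom,
# and the expected counts under independence
stat, p_value, dof, expected = chi2_contingency(table)

print(f"chi2 = {stat:.4f}, dof = {dof}, p-value = {p_value:.6f}")
print(expected.round(2))
print("Reject H0 (variables are independent):", bool(p_value < alpha))
```

For 2x2 tables, which have one degree of freedom, `chi2_contingency` applies Yates' continuity correction by default; this can be turned off with `correction=False`.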

## Quizzes

```python
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
```

### Quiz #1

A researcher is studying the preferences of people in a city for three different modes of transportation:  
car, bicycle, and public transit.

The researcher surveyed 500 individuals and found that 240 prefer cars, 160 prefer bicycles, and 100 prefer public transit.

The researcher wants to know if there is a significant difference between the observed preferences  
and the expected preferences based on historical data.

Which statistical test should the researcher use?

A. Chi-square goodness of fit test  
B. Two Sample Independent T-test  
C. Two Sample Z-test  
D. One Sample T-test  

#### Solution

##### PART 1

Option A: Chi-square goodness of fit test

##### PART 2

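```python
n = 500
x = [240, 160, 100]  # Observed counts: car, bicycle, public transit
# Expected counts, using the historical shares of 50%, 30%, and 20% assumed in this solution
exp = [n * 0.5, n * 0.3, n * 0.2]

test_stat, p_value = chisquare(f_obs=x, f_exp=exp)
test_stat.round(6).item(), p_value.round(6).item()
```

Output:

```
(1.066667, 0.586646)
```

The p-value (about 0.587) is well above a conventional significance level such as 0.05, so the null hypothesis is not rejected: the observed preferences do not differ significantly from the expected ones.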

### Quiz #2

A marketing manager wants to determine if there is a relationship between the type of advertising (online, print, or TV)  
and the purchase decision (buy or not buy) of a product.

The manager collects data from 300 customers and records their advertising exposure and purchase decisions.

What statistical test should the manager use to analyze this data?

A. Chi-square independence test  
B. Chi-square goodness of fit test  
C. Two Sample Independent T-test  
D. Two Sample Z-test

#### Solution

##### PART 1

Option A: Chi-square independence test

##### PART 2