## Chi-Square

The chi-square test is a non-parametric test: it compares frequency counts rather than estimating a mean or variance.

It is mainly used to analyze categorical data.

### $\chi^2$ Goodness-of-Fit Test

Compares the observed frequencies of a categorical variable with the expected frequencies.

Hypothesis Question: Does the observed behavior follow the expected behavior?

e.g. Customer visits to a shop

Record: a total of 250 customers visited the shop in a week

Assumption: the expected number of customers each day is 50

| Day | Observed Customers | Expected Customers |
| :---: | :---: | :---: |
| Monday | 50 | 50 |
| Tuesday | 60 | 50 |
| Wednesday | 40 | 50 |
| Thursday | 47 | 50 |
| Friday | 53 | 50 |

$$\chi^2 = \sum_i{\frac{(expected_i - observed_i)^2}{expected_i}}$$

Therefore, the $\chi^2$ statistic equals

$\chi^2 = (50-50)^2/50 + (50-60)^2/50 + \dots + (50-53)^2/50 = 4.36$
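The arithmetic above can be checked with a few lines of NumPy. This is a minimal sketch; the variable names `observed`, `expected`, and `chi2_stat` are illustrative and do not come from the original notes.

``` python
import numpy as np

# Observed and expected customer counts, Monday through Friday
observed = np.array([50, 60, 40, 47, 53])
expected = np.full(5, 50.0)

# Sum of squared differences, each scaled by the expected count
chi2_stat = np.sum((expected - observed) ** 2 / expected)
print(round(chi2_stat, 2))  # 4.36
```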

Determine the p-value from the $\chi^2$ distribution with $df = 5 - 1 = 4$ degrees of freedom:

``` python
from scipy import stats

# Upper-tail probability via the CDF
p1 = 1 - stats.chi2.cdf(4.36, df=5-1)

# Alternatively, use the survival function chi2.sf(), which already equals 1 - cdf
p2 = stats.chi2.sf(4.36, df=5-1)

# Given the dataset
obs = [50, 60, 40, 47, 53]
exp = [50, 50, 50, 50, 50]

# chi^2 test with chisquare(); returns the statistic and the p-value
result = stats.chisquare(f_obs=obs, f_exp=exp)
```
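To turn the test result into a decision, the p-value is compared with a significance level. The sketch below assumes $\alpha = 0.05$, a common convention that the notes themselves do not state; `alpha` and `decision` are illustrative names.

``` python
from scipy import stats

obs = [50, 60, 40, 47, 53]
exp = [50, 50, 50, 50, 50]

result = stats.chisquare(f_obs=obs, f_exp=exp)

alpha = 0.05  # assumed significance level, not given in the notes

# A small p-value means the observed counts deviate from the expected ones
if result.pvalue < alpha:
    decision = "reject H0"
else:
    decision = "fail to reject H0"

print(result.statistic, result.pvalue, decision)
```

With these counts the p-value is roughly 0.36, so at the assumed level there is no evidence that daily visits deviate from 50 per day.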

### $\chi^2$ Test for Independence

Used when we have two categorical variables and want to know whether membership in a category of one variable is independent of membership in a category of the other.

e.g. Customer visits to a shop

Record: a total of 250 customers visited the shop in a week

Hypothesis Question: Is there an association between the day and the shift (morning or afternoon) in which customers visit the shop?

**Observed Record:**

| Day | Morning | Afternoon |
| :---: | :---: | :---: |
| Monday | 25 | 25 |
| Tuesday | 20 | 30 |
| Wednesday | 30 | 15 |
| Thursday | 24 | 28 |
| Friday | 15 | 38 |

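Calculate the expected frequencies from the observed data. Each expected frequency is (column total)(row total) / (grand total):

e.g.

Expected Frequency for Monday Morning = (114)(50)/250 = 22.8, where 114 is the Morning total and 50 is the Monday total

Expected Frequency for Wednesday Afternoon = (136)(45)/250 = 24.48, where 136 is the Afternoon total and 45 is the Wednesday total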

**Expected Frequency:**

| Day | Morning | Afternoon |
| :---: | :---: | :---: |
| Monday | 22.8 | 27.2 |
| Tuesday | 22.8 | 27.2 |
| Wednesday | 20.52 | 24.48 |
| Thursday | 23.712 | 28.288 |
| Friday | 24.168 | 28.832 |

$\chi^2 = (25-22.8)^2/22.8 + (20 - 22.8)^2/22.8 + ... + (38-28.832)^2/28.832 = 15.473$
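The expected table and the statistic can also be reproduced by hand with NumPy, which makes the row-total and column-total logic explicit. This is a sketch under the same data; the names `row_totals`, `col_totals`, `grand_total`, and `dof` are illustrative.

``` python
import numpy as np
from scipy import stats

# Rows are Monday..Friday, columns are Morning and Afternoon
observed = np.array([
    [25, 25],
    [20, 30],
    [30, 15],
    [24, 28],
    [15, 38],
])

row_totals = observed.sum(axis=1)  # visits per day
col_totals = observed.sum(axis=0)  # visits per shift
grand_total = observed.sum()

# Expected count per cell = row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total

chi2_stat = np.sum((observed - expected) ** 2 / expected)

# Degrees of freedom = (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2_stat, df=dof)

print(expected)
print(round(chi2_stat, 3), dof, p_value)
```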

Determine the p-value from the $\chi^2$ distribution with $df = (5-1)(2-1) = 4$ degrees of freedom:

``` python
import numpy as np
import scipy.stats as stats

# Create data set
morning = np.array([25, 20, 30, 24, 15])
afternoon = np.array([25, 30, 15, 28, 38])

# Use the chi2_contingency() from stats module
stats.contingency.chi2_contingency([morning, afternoon])
```


The call returns the $\chi^2$ statistic, the p-value, the degrees of freedom, and the expected frequencies:

```
(15.472644542501133,
 0.0038149300005095744,
 4,
 array([[22.8  , 22.8  , 20.52 , 23.712, 24.168],
        [27.2  , 27.2  , 24.48 , 28.288, 28.832]]))
```
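The four returned values can be unpacked and used for a decision. As a hedged sketch, the example below assumes a significance level of $\alpha = 0.05$, which the notes do not specify; `alpha` and `reject_independence` are illustrative names.

``` python
import numpy as np
import scipy.stats as stats

morning = np.array([25, 20, 30, 24, 15])
afternoon = np.array([25, 30, 15, 28, 38])

# Unpack statistic, p-value, degrees of freedom, and expected frequencies
chi2_stat, p_value, dof, expected = stats.contingency.chi2_contingency(
    [morning, afternoon]
)

alpha = 0.05  # assumed significance level, not given in the notes

# A p-value below alpha suggests day and shift are not independent
reject_independence = p_value < alpha
print(chi2_stat, p_value, dof, reject_independence)
```

Since the p-value (about 0.0038) is below the assumed level, the data suggest an association between the day and the shift.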