The **Chi-Square Test** is a statistical method used to determine whether there is a significant relationship between **categorical variables** or whether the observed data fit an **expected distribution**.

---

## **1. Why We Need It**

In many real-world situations, our data isn’t numerical like heights or weights — it’s **categorical**:

* **Nominal data** (no natural order): gender, color, brand choice, product category.
* **Ordinal data** (has order but not equal spacing): customer satisfaction ratings ("low", "medium", "high").

The Chi-Square Test helps us answer:

* *"Are these differences just due to random chance?"*
* *"Do these variables actually have a meaningful association?"*

---

## **2. The Problems It Solves**

### **A. Testing Independence (Chi-Square Test of Independence)**

* Checks if two categorical variables are related.
* Example: Is **gender** related to **brand preference**?
  If men and women choose brands differently enough (statistically), we can say the variables are **dependent**.

**Problem it solves:** Tells us whether two factors are statistically associated, which is crucial in marketing, surveys, epidemiology, etc. An association on its own does not show that one factor causes the other.

---

### **B. Testing Goodness of Fit (Chi-Square Goodness-of-Fit Test)**

* Checks if observed data fits an expected distribution.
* Example: A fair die should have each face appear **1/6 of the time**. Roll it 60 times — does the observed frequency match the expectation?

**Problem it solves:** Detects whether your observed distribution significantly deviates from what theory predicts.

---

## **3. How It Works (Intuition)**

1. **Compare** observed counts (O) with expected counts (E).
2. Calculate:

$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$

3. The bigger the difference between O and E, the larger the Chi-square statistic.
4. Compare this statistic to a **Chi-square distribution** (depends on **degrees of freedom**) to get a **p-value**.
5. Small p-value (commonly < 0.05) → a gap this large between O and E would be unlikely if the null hypothesis were true → reject the null hypothesis.
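
The five steps above can be sketched in a few lines of Python using only the standard library. The function names `chi_square_statistic` and `chi_square_p_value` and the counts `[18, 22]` are illustrative assumptions, not part of the original text. The p-value helper relies on the closed-form upper-tail formulas that hold for whole-number degrees of freedom; in practice a statistics library would normally be used instead.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories or cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_value(stat, df):
    """Upper-tail probability P(X >= stat) for a chi-square variable
    with a positive integer number of degrees of freedom."""
    if df < 1 or stat < 0:
        raise ValueError("df must be >= 1 and stat must be >= 0")
    half = stat / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: erfc(sqrt(x/2))
    #         + exp(-x/2) * sum_{j=1}^{(df-1)/2} (x/2)^(j - 1/2) / Gamma(j + 1/2)
    terms = sum(
        half ** (j - 0.5) / math.gamma(j + 0.5)
        for j in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


# Hypothetical two-category example: 40 trials, 20 expected in each category.
observed = [18, 22]
expected = [20, 20]
stat = chi_square_statistic(observed, expected)
p_value = chi_square_p_value(stat, df=len(observed) - 1)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
```

A larger gap between `observed` and `expected` pushes `stat` up and `p_value` down, which is the intuition in steps 3 to 5.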

---

## **4. Where We Use It**

* **Market research**: Do age groups prefer different product categories?
* **Medical studies**: Is there an association between a treatment and recovery rate?
* **Manufacturing**: Do defect types occur in expected proportions?
* **Social sciences**: Are political preferences linked to education level?
* **Game testing**: Is a die or roulette wheel fair?

---

## **5. Summary Table**

| **Test Type**                   | **Question Answered**                              | **Data Required**             |
| ------------------------------- | -------------------------------------------------- | ----------------------------- |
| Chi-Square Test of Independence | Are two categorical variables related?             | Contingency table (2D counts) |
| Chi-Square Goodness-of-Fit      | Does observed distribution match expected pattern? | Single categorical variable   |

---





# 1) Chi-Square **Test of Independence** (Numerical Example)

**Question:** Are two categorical variables related?
**Scenario:** Is smoking status related to having a certain disease?

### Data (contingency table)

|              | Disease = Yes | Disease = No | Row total |
| ------------ | ------------: | -----------: | --------: |
| Smoker       |            30 |           70 |       100 |
| Non-smoker   |            10 |           90 |       100 |
| Column total |            40 |          160 |       200 |

### Step 1 — Expected counts (if independent)

$$
E_{ij}=\frac{(\text{row total})(\text{col total})}{\text{grand total}}
$$

* Smoker & Yes: $E=\frac{100\cdot 40}{200}=20$
* Smoker & No: $E=\frac{100\cdot 160}{200}=80$
* Non & Yes: $E=\frac{100\cdot 40}{200}=20$
* Non & No: $E=\frac{100\cdot 160}{200}=80$
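
As a quick check of Step 1, the expected counts can be computed directly from the row and column totals. This is a minimal sketch; the helper name `expected_counts` is an assumption made for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence:
    (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Rows: smoker, non-smoker. Columns: disease yes, disease no.
observed = [[30, 70], [10, 90]]
print(expected_counts(observed))  # [[20.0, 80.0], [20.0, 80.0]]
```

The same function works for larger tables, since it only relies on the marginal totals.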

### Step 2 — Chi-square statistic

$$
\chi^2=\sum \frac{(O - E)^2}{E}
$$

$$
\chi^2=\frac{(30-20)^2}{20}+\frac{(70-80)^2}{80}+\frac{(10-20)^2}{20}+\frac{(90-80)^2}{80}
=5+1.25+5+1.25=12.5
$$

### Step 3 — Degrees of freedom & decision

* $df=(r-1)(c-1)=(2-1)(2-1)=1$
* Critical value at $\alpha=0.05$ for $df=1$ is **3.84**.
* Since **12.5 > 3.84**, **reject** independence (p ≈ 0.0004).
  **Conclusion:** Smoking status and disease are associated in this sample.

> Assumptions check: counts are frequencies, observations independent, and expected counts ≥ 5 (all good here).
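
The whole smoking example can be reproduced with a short standard-library sketch. The function name `independence_test` is an assumption for illustration, and the p-value line uses a shortcut that is valid only when df = 1 (the upper tail of a chi-square variable with one degree of freedom equals `erfc(sqrt(x / 2))`).

```python
import math


def independence_test(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


observed = [[30, 70], [10, 90]]
stat, df = independence_test(observed)

# Shortcut valid for df = 1 only.
p_value = math.erfc(math.sqrt(stat / 2))
print(stat, df, round(p_value, 4))  # 12.5 1 0.0004
```

If SciPy is available, `scipy.stats.chi2_contingency(observed, correction=False)` should report the same statistic. Its default setting applies Yates' continuity correction to 2×2 tables, which gives a smaller value than the hand calculation above.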

---

# 2) Chi-Square **Goodness-of-Fit**

**Question:** Do observed category frequencies match a specified distribution?
**Scenario:** Is a six-sided die fair?

### Data

* Rolls = 60. Expected for a fair die = 10 per face.
* Observed: [4, 8, 10, 12, 14, 12]

### Step 1 — Expected counts

* $E_i = 60\times \frac{1}{6} = 10$ for each face.

### Step 2 — Chi-square statistic

$$
\chi^2=\sum_{i=1}^{6}\frac{(O_i-E_i)^2}{E_i}
=\frac{(4-10)^2}{10}+\frac{(8-10)^2}{10}+\frac{(10-10)^2}{10}+\frac{(12-10)^2}{10}+\frac{(14-10)^2}{10}+\frac{(12-10)^2}{10}
$$

$$
=3.6+0.4+0+0.4+1.6+0.4=6.4
$$

### Step 3 — Degrees of freedom & decision

* $df = k-1 = 6-1 = 5$ (no parameters estimated).
* Critical value at $\alpha=0.05$, $df=5$ is **11.07**.
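* Since **6.4 < 11.07**, **fail to reject** fairness (p ≈ 0.27).
  **Conclusion:** No significant evidence at the 5% level that the die is biased. This is not proof that the die is fair, only that 60 rolls do not show a detectable bias.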

> Assumptions check: counts are frequencies, trials independent, expected counts ≥ 5 (all are 10).
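
One way to sanity-check the p-value is to simulate it rather than look it up. The sketch below is a Monte Carlo cross-check, not the chi-square distribution lookup used in Step 3: it rolls a simulated fair die 60 times, repeats that many times, and counts how often the statistic is at least as large as the observed 6.4. The function names, the number of simulations, and the seed are all illustrative assumptions.

```python
import random
from collections import Counter


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulated_p_value(observed, n_sims=20000, seed=0):
    """Estimate P(statistic >= observed statistic) under a uniform null
    by simulating fair rolls (Monte Carlo, not the chi-square table)."""
    rng = random.Random(seed)
    k = len(observed)          # number of faces
    n = sum(observed)          # number of rolls
    expected = [n / k] * k
    observed_stat = chi_square_statistic(observed, expected)
    hits = 0
    for _ in range(n_sims):
        rolls = Counter(rng.choices(range(k), k=n))
        counts = [rolls[face] for face in range(k)]
        # Small tolerance guards against floating-point ties.
        if chi_square_statistic(counts, expected) >= observed_stat - 1e-9:
            hits += 1
    return hits / n_sims


p = simulated_p_value([4, 8, 10, 12, 14, 12])
print(round(p, 2))
```

The estimate should land in the neighborhood of the 0.27 reported above. It will not match exactly, because the simulation has sampling noise and the chi-square distribution is itself an approximation for count data.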

---

## Quick checklist (when you use Chi-square)

* Data are **counts** in categories (not means).
* Observations are **independent**.
* **Expected** counts ideally ≥ 5 in each cell.
* Use **Independence** test for 2-way tables; **Goodness-of-Fit** for 1-way tables.

