### **Chi-Square Test: Detailed Explanation with Dataset and Step-by-Step Calculations**

#### **When to Use It?**
The **Chi-Square test** of independence is used to test whether two categorical variables are independent or associated. It is suitable for count data organized into a contingency table.

A contingency table displays the categories of one variable in rows and the categories of the other variable in columns, with each cell holding the number of observations for that combination.


#### **Scenario**
Suppose you want to analyze whether there is an association between **gender** (Male/Female) and **preference for a drink** (Tea/Coffee/Soft Drinks). You collect a small dataset:

| Gender  | Tea | Coffee | Soft Drinks | Total |
|---------|-----|--------|-------------|-------|
| Male    | 10  | 20     | 15          | 45    |
| Female  | 25  | 15     | 15          | 55    |
| Total   | 35  | 35     | 30          | 100   |

**Question:** Is there a significant association between gender and drink preference?









#### **Step-by-Step Calculation**

1. **Define Hypotheses**:
   - **Null Hypothesis ($H_0$)**: Gender and drink preference are independent.
   - **Alternative Hypothesis ($H_1$)**: Gender and drink preference are associated.

2. **Significance Level**:

   Set at $\alpha = 0.05$, which corresponds to a **95%** confidence level.

3. **Calculate Expected Frequencies**:

   The formula for expected frequency ($E$) is:

   $$
   E = \frac{\text{(Row Total)} \times \text{(Column Total)}}{\text{Grand Total}}
   $$

   Let's calculate the expected frequency for each cell:

   - For Male & Tea:

     $$
     E = \frac{\text{(Row Total for Male)} \times \text{(Column Total for Tea)}}{\text{Grand Total}}
     = \frac{45 \times 35}{100} = 15.75
     $$

     
   - For Male & Coffee:

     $$
     E = \frac{45 \times 35}{100} = 15.75
     $$


   - For Male & Soft Drinks:

     $$
     E = \frac{45 \times 30}{100} = 13.5
     $$


   - Repeat for Female:
     - Female & Tea:

       $$E = \frac{55 \times 35}{100} = 19.25$$

     - Female & Coffee:

       $$E = \frac{55 \times 35}{100} = 19.25$$

     - Female & Soft Drinks:

       $$E = \frac{55 \times 30}{100} = 16.5$$

   The expected frequency table is:


   | Gender  | Tea   | Coffee | Soft Drinks | Total |
   |---------|-------|--------|-------------|-------|
   | Male    | 15.75 | 15.75  | 13.5        | 45    |
   | Female  | 19.25 | 19.25  | 16.5        | 55    |
   | Total   | 35    | 35     | 30          | 100   |
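The expected-frequency formula can be checked with a few lines of plain Python. This is a minimal sketch using the observed counts from the scenario table; the function name `expected_frequencies` is chosen here for illustration only.

```python
# Observed counts from the scenario table
# (rows: Male, Female; columns: Tea, Coffee, Soft Drinks)
observed = [
    [10, 20, 15],  # Male
    [25, 15, 15],  # Female
]


def expected_frequencies(table):
    """Return E = (row total * column total) / grand total for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for label, row in zip(["Male", "Female"], expected):
    print(label, row)
# Male [15.75, 15.75, 13.5]
# Female [19.25, 19.25, 16.5]
```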


4. **Calculate Chi-Square Statistic ($\chi^2$)**:

   The formula is:
   $$
   \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
   $$
   where:

   - $O_i$ = Observed frequency,
   - $E_i$ = Expected frequency.

   Calculate the contribution to $\chi^2$ from each cell:

   - Male & Tea:

     $$\frac{(10 - 15.75)^2}{15.75} = \frac{(-5.75)^2}{15.75} = \frac{33.0625}{15.75} \approx 2.10$$


   - Male & Coffee:

   $$\frac{(20 - 15.75)^2}{15.75} = \frac{(4.25)^2}{15.75} = \frac{18.0625}{15.75} \approx 1.15$$

   - Male & Soft Drinks:

    $$\frac{(15 - 13.5)^2}{13.5} = \frac{(1.5)^2}{13.5} = \frac{2.25}{13.5} \approx 0.17$$


   - Female & Tea:

   $$\frac{(25 - 19.25)^2}{19.25} = \frac{(5.75)^2}{19.25} = \frac{33.0625}{19.25} \approx 1.72$$


   - Female & Coffee:

   $$\frac{(15 - 19.25)^2}{19.25} = \frac{(-4.25)^2}{19.25} = \frac{18.0625}{19.25} \approx 0.94$$


   - Female & Soft Drinks:

   $$\frac{(15 - 16.5)^2}{16.5} = \frac{(-1.5)^2}{16.5} = \frac{2.25}{16.5} \approx 0.14$$

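   Sum up all the values:
   $$
   \chi^2 \approx 2.10 + 1.15 + 0.17 + 1.72 + 0.94 + 0.14 = 6.22
   $$

   Each term above is rounded to two decimals. Summing the unrounded terms gives $\chi^2 \approx 6.20$, which is the value statistical software reports.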
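The per-cell contributions and their total can also be recomputed without intermediate rounding. The sketch below is plain Python; it assumes the observed and expected tables from the previous steps, and the helper name `chi_square_statistic` is illustrative.

```python
# Observed and expected counts (rows: Male, Female; columns: Tea, Coffee, Soft Drinks)
observed = [[10, 20, 15], [25, 15, 15]]
expected = [[15.75, 15.75, 13.5], [19.25, 19.25, 16.5]]


def chi_square_statistic(obs, exp):
    """Sum (O - E)^2 / E over every cell of the table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
    )


chi2_stat = chi_square_statistic(observed, expected)
print(f"chi-square = {chi2_stat:.2f}")  # chi-square = 6.20
```

Working with unrounded terms avoids the small drift that comes from rounding each contribution before adding.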

5. **Degrees of Freedom (df)**:
   The formula is:
   $$
   df = (r - 1) \times (c - 1)
   $$
   where:

   - $r = 2$ (number of rows: Male, Female),
   - $c = 3$ (number of columns: Tea, Coffee, Soft Drinks).

   $$
   df = (2 - 1) \times (3 - 1) = 1 \times 2 = 2
   $$

6. **Compare with Critical Value**:
   - From the Chi-Square table, for $df = 2$ and $\alpha = 0.05$, the critical value is **5.991**.

   - Since $\chi^2 = 6.22 > 5.991$, we **reject the null hypothesis**.
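The table lookup can also be reproduced numerically. The sketch below assumes SciPy is installed and uses `scipy.stats.chi2`; the variable names are illustrative, and the statistic is the hand-calculated value from step 4.

```python
from scipy.stats import chi2

alpha = 0.05
df = 2
chi2_stat = 6.22  # hand-calculated statistic from step 4

# Upper-tail critical value: the point with probability alpha to its right
critical_value = chi2.ppf(1 - alpha, df)

# p-value: probability of a statistic at least this large if H0 is true
p_value = chi2.sf(chi2_stat, df)

print(f"Critical value: {critical_value:.3f}")  # Critical value: 5.991
print(f"P-value: {p_value:.4f}")
print("Reject H0" if chi2_stat > critical_value else "Fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with $\alpha$ are equivalent decisions; both lead to rejecting $H_0$ here.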

### **Conclusion**
At the 5% significance level, there is a significant association between gender and drink preference.

### **Python Implementation**


In [1]:
import numpy as np
from scipy.stats import chi2_contingency

# Observed data
observed = np.array([
    [10, 20, 15],  # Male
    [25, 15, 15]   # Female
])

# Perform Chi-Square test
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi-Square Statistic: {chi2_stat:.2f}")
print(f"P-Value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies:\n{expected}")

# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: Gender and drink preference are associated.")
else:
    print("Fail to reject the null hypothesis: No significant association.")

Chi-Square Statistic: 6.20
P-Value: 0.0449
Degrees of Freedom: 2
Expected Frequencies:
[[15.75 15.75 13.5 ]
 [19.25 19.25 16.5 ]]
Reject the null hypothesis: Gender and drink preference are associated.




---



### **Analysis of Variance (ANOVA): Detailed Explanation with Dataset and Step-by-Step Calculations**

#### **When to Use It?**
**ANOVA** is used to compare the means of three or more groups to determine whether there are statistically significant differences among them. It extends the two-sample t-test to more than two groups.


#### **Scenario**
Suppose we want to test whether three different teaching methods (A, B, C) result in significantly different average student scores.

#### **Dataset**
The scores of students taught with these methods are:

| Teaching Method | Scores               |
|------------------|----------------------|
| **A**           | [85, 90, 88, 86, 89] |
| **B**           | [78, 82, 84, 80, 83] |
| **C**           | [91, 94, 89, 92, 90] |


#### **Objective**
**Question:** Do the teaching methods (A, B, C) lead to different average scores?

### **Step-by-Step ANOVA Calculation**

1. **Define Hypotheses**:
   - **Null Hypothesis ($H_0$)**: The means of all groups are equal ($\mu_A = \mu_B = \mu_C$).

   - **Alternative Hypothesis ($H_1$)**: At least one group mean differs from the others.

2. **Calculate the Means**:

   Calculate the mean ($\bar{x}$) for each group and the grand mean ($\bar{X}_{grand}$):

   - **Method A**:

     $$
     \bar{x}_A = \frac{85 + 90 + 88 + 86 + 89}{5} = 87.6
     $$


   - **Method B**:

     $$
     \bar{x}_B = \frac{78 + 82 + 84 + 80 + 83}{5} = 81.4
     $$

   - **Method C**:

     $$
     \bar{x}_C = \frac{91 + 94 + 89 + 92 + 90}{5} = 91.2
     $$

   - **Grand Mean ($\bar{X}_{grand}$)**:

     $$
     \bar{X}_{grand} = \frac{85 + 90 + 88 + 86 + 89 + 78 + 82 + 84 + 80 + 83 + 91 + 94 + 89 + 92 + 90}{15} = 86.733
     $$
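The group means and the grand mean can be verified with a short plain-Python sketch. The dictionary layout and variable names below are illustrative choices, not part of the original dataset description.

```python
# Scores for each teaching method, taken from the dataset table
scores = {
    "A": [85, 90, 88, 86, 89],
    "B": [78, 82, 84, 80, 83],
    "C": [91, 94, 89, 92, 90],
}

# Mean of each group
group_means = {method: sum(vals) / len(vals) for method, vals in scores.items()}

# Grand mean: mean of all 15 scores pooled together
all_scores = [x for vals in scores.values() for x in vals]
grand_mean = sum(all_scores) / len(all_scores)

for method, mean in group_means.items():
    print(f"Mean of method {method}: {mean:.1f}")
print(f"Grand mean: {grand_mean:.3f}")  # Grand mean: 86.733
```

Because all three groups have the same size, the grand mean here also equals the average of the three group means.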

3. **Calculate the Sum of Squares**:

   ANOVA partitions the total variability into two components: **between-group** variability and **within-group** variability, each measured by a sum of squares.


  - **Sum of Squares Between (SSB)**:


  $$
  SSB = n_A (\bar{x}_A - \bar{X}_{grand})^2 + n_B (\bar{x}_B - \bar{X}_{grand})^2 + n_C (\bar{x}_C - \bar{X}_{grand})^2
  $$


  Where $n_A = n_B = n_C = 5$.


  $$
  SSB = 5 (87.6 - 86.733)^2 + 5 (81.4 - 86.733)^2 + 5 (91.2 - 86.733)^2
  $$

  $$
  SSB = 5 (0.867)^2 + 5 (-5.333)^2 + 5 (4.467)^2
  $$


  $$
  SSB = 5 (0.752) + 5 (28.444) + 5 (19.953) = 3.76 + 142.22 + 99.765 = 245.745
  $$

  - **Sum of Squares Within (SSW)**:

  For each group, calculate $(x - \bar{x})^2$ for every score, sum within the group, and then sum across groups:

  $$
  SSW = \sum_{i=1}^n (x_{Ai} - \bar{x}_A)^2 + \sum_{i=1}^n (x_{Bi} - \bar{x}_B)^2 + \sum_{i=1}^n (x_{Ci} - \bar{x}_C)^2
  $$

  Method A: $$(85 - 87.6)^2 + (90 - 87.6)^2 + (88 - 87.6)^2 + (86 - 87.6)^2 + (89 - 87.6)^2 = 6.76 + 5.76 + 0.16 + 2.56 + 1.96 = 17.2$$

  Method B: $$(78 - 81.4)^2 + (82 - 81.4)^2 + (84 - 81.4)^2 + (80 - 81.4)^2 + (83 - 81.4)^2 = 11.56 + 0.36 + 6.76 + 1.96 + 2.56 = 23.2$$

  Method C: $$(91 - 91.2)^2 + (94 - 91.2)^2 + (89 - 91.2)^2 + (92 - 91.2)^2 + (90 - 91.2)^2 = 0.04 + 7.84 + 4.84 + 0.64 + 1.44 = 14.8$$

  Total:

  $$
  SSW = 17.2 + 23.2 + 14.8 = 55.2
  $$

  - **Total Sum of Squares (SST)**:

  $$
  SST = SSB + SSW = 245.745 + 55.2 = 300.945
  $$

4. **Calculate Mean Squares**:

   Here $k = 3$ is the number of groups and $N = 15$ is the total number of observations.

   - **Between-Group Mean Square (MSB)**:

     $$
     MSB = \frac{SSB}{k - 1} = \frac{245.745}{3 - 1} = 122.8725
     $$

   - **Within-Group Mean Square (MSW)**:

     $$
     MSW = \frac{SSW}{N - k} = \frac{55.2}{15 - 3} = \frac{55.2}{12} = 4.6
     $$

5. **Calculate the F-Statistic**:

   $$
   F = \frac{MSB}{MSW} = \frac{122.8725}{4.6} \approx 26.71
   $$

6. **Degrees of Freedom**:

   - Between Groups: $df_{between} = k - 1 = 3 - 1 = 2$
   - Within Groups: $df_{within} = N - k = 15 - 3 = 12$

7. **Compare with Critical Value**:

   For $df_{between} = 2$, $df_{within} = 12$, and $\alpha = 0.05$, the critical F-value from the F-table is approximately **3.89**.

   - Since $F = 26.71 > 3.89$, we **reject the null hypothesis**.
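As a check on the hand arithmetic, the sums of squares, mean squares, and F-statistic can be recomputed directly from the raw scores. The sketch below is plain Python; the function name `one_way_anova` and its return layout are illustrative assumptions. It yields $F \approx 26.71$, in line with the `f_oneway` output in the Python implementation.

```python
def one_way_anova(groups):
    """Return (SSB, SSW, F) for a list of groups of numeric scores."""
    k = len(groups)                            # number of groups
    n_total = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group size times squared distance
    # of each group mean from the grand mean
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))

    # Within-group sum of squares: squared distance of each score
    # from its own group mean
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    msb = ssb / (k - 1)        # between-group mean square
    msw = ssw / (n_total - k)  # within-group mean square
    return ssb, ssw, msb / msw


scores_A = [85, 90, 88, 86, 89]
scores_B = [78, 82, 84, 80, 83]
scores_C = [91, 94, 89, 92, 90]

ssb, ssw, f_stat = one_way_anova([scores_A, scores_B, scores_C])
print(f"SSB = {ssb:.3f}, SSW = {ssw:.1f}, F = {f_stat:.2f}")
# SSB = 245.733, SSW = 55.2, F = 26.71
```

The small difference between this SSB and the hand-calculated 245.745 comes from rounding the deviations to three decimals in the manual steps.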

### **Conclusion**
At the 5% significance level, there is a significant difference in mean scores among the three teaching methods, meaning at least one method's mean differs from the others.
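The critical value and p-value behind this conclusion can be obtained from the F distribution instead of a printed table. The sketch below assumes SciPy is installed and uses `scipy.stats.f`; the variable names are illustrative, and the F-statistic is the value that `f_oneway` reports for this dataset.

```python
from scipy.stats import f

alpha = 0.05
df_between, df_within = 2, 12
f_stat = 26.71  # F-statistic reported by f_oneway for this dataset

# Upper-tail critical value of the F distribution
critical_value = f.ppf(1 - alpha, df_between, df_within)

# p-value: probability of an F at least this large if H0 is true
p_value = f.sf(f_stat, df_between, df_within)

print(f"Critical F-value: {critical_value:.2f}")  # Critical F-value: 3.89
print(f"P-value: {p_value:.6f}")
print("Reject H0" if f_stat > critical_value else "Fail to reject H0")
```

Printing the p-value with more decimals shows that it is small but not exactly zero, which a four-decimal format hides.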

### **Python Implementation**


In [3]:
from scipy.stats import f_oneway

# Scores for each group
scores_A = [85, 90, 88, 86, 89]
scores_B = [78, 82, 84, 80, 83]
scores_C = [91, 94, 89, 92, 90]

# Perform one-way ANOVA
f_stat, p_value = f_oneway(scores_A, scores_B, scores_C)

print(f"F-Statistic: {f_stat:.2f}")
print(f"P-Value: {p_value:.4f}")

# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: Significant differences exist among the groups.")
else:
    print("Fail to reject the null hypothesis: No significant differences among the groups.")

F-Statistic: 26.71
P-Value: 0.0000
Reject the null hypothesis: Significant differences exist among the groups.




---



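#### **1. Z-test**

- Use when the population parameters are known, in particular the population standard deviation.

- It is the preferred test whenever those population parameters are available.

#### **2. One-sample t-test**

- One numerical variable measured once, with its mean compared against a known or hypothesized value.

#### **3. Paired t-test**

- One numerical variable measured on the same subjects under two conditions (for example, before and after a treatment).

#### **4. Two-sample t-test**

- Two independent samples of a numerical variable, compared by their means.

#### **5. Chi-Square test**

- Use when working with categorical variables.

#### **6. ANOVA**

- Use when comparing the means of three or more groups.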
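As a rough guide to how the three t-tests listed above map onto `scipy.stats`, the sketch below reuses the scores for teaching methods A and B. The hypothesized mean of 85 and the pairing of scores by position are assumptions made purely for illustration; they are not part of the original dataset. The Chi-Square test and ANOVA calls appear in the Python implementations earlier.

```python
from scipy.stats import ttest_1samp, ttest_rel, ttest_ind

scores_A = [85, 90, 88, 86, 89]
scores_B = [78, 82, 84, 80, 83]

# One-sample t-test: does the mean of A differ from a hypothesized value?
# The benchmark of 85 is an assumption made up for this illustration.
one_sample = ttest_1samp(scores_A, popmean=85)

# Paired t-test: assumes (hypothetically) that position i in A and B
# belongs to the same student measured under two conditions.
paired = ttest_rel(scores_A, scores_B)

# Two-sample t-test: treats A and B as independent groups
# (equal variances are assumed by default).
two_sample = ttest_ind(scores_A, scores_B)

for name, result in [
    ("One-sample", one_sample),
    ("Paired", paired),
    ("Two-sample", two_sample),
]:
    print(f"{name} t-test: t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

The same two lists give different statistics under the paired and two-sample tests, which is why the study design, not the data layout, should determine the choice.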

