# 🔬 T-Tests: Comparing Means Between Groups  

---

## 🎯 **Purpose**
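T-tests are statistical tests used to determine **whether a sample mean differs significantly from a reference value, or whether the means of two groups differ significantly from each other**.  
They are commonly used when the data are approximately normally distributed and the population standard deviation is unknown, which matters most when sample sizes are relatively small.  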

---

## 🧰 **Types of T-Tests**

### 1️⃣ **One-Sample T-Test**  
- **What it tests:** Compares the mean of a sample to a known value or **population mean (\(\mu\))**.  
- **Example Use Case:** Testing if the **average test score** of a class differs from the **national average**.  

---

### 2️⃣ **Two-Sample T-Test (Independent T-Test)**  
- **What it tests:** Compares the means of **two independent groups**.  
- **Assumptions:**  
  - The two groups are independent.  
  - Variances are equal (if not, use Welch’s t-test).  
- **Example Use Case:** Comparing **test scores between two different classes**.  

---

### 3️⃣ **Paired Sample T-Test (Dependent T-Test)**  
- **What it tests:** Compares the means of **two related groups** (e.g., measurements on the same subjects before and after an intervention).  
- **Example Use Case:** Comparing **weight before and after a diet program**.  

---

## 📊 **Formulas**

### 🔹 **One-Sample T-Test**
$$
t = \frac{\bar{x} - \mu}{s / \sqrt{n}}
$$  
- \(\bar{x}\) = Sample mean  
- \(\mu\) = Population mean  
- \(s\) = Sample standard deviation  
- \(n\) = Sample size  
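
As a quick check of the formula, the sketch below computes the one-sample statistic by hand with NumPy and compares it with `scipy.stats.ttest_1samp`. The sample values and the reference mean of 15 are illustrative only; the variable names are not from any particular dataset.

```python
import numpy as np
from scipy.stats import ttest_1samp

sample = np.array([12, 14, 15, 16, 17])  # illustrative data
mu = 15                                  # hypothesized population mean

x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)
n = sample.size

# t = (x_bar - mu) / (s / sqrt(n))
t_manual = (x_bar - mu) / (s / np.sqrt(n))

# SciPy computes the same statistic and also returns a two-sided p-value
t_scipy, p_value = ttest_1samp(sample, mu)

print(t_manual, t_scipy, p_value)
```

Note the `ddof=1` argument: NumPy's default is the population standard deviation, which would not match the formula.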

---

### 🔹 **Two-Sample T-Test**
$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$  
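- \(\bar{x}_1, \bar{x}_2\) = Sample means  
- \(s_1, s_2\) = Sample standard deviations  
- \(n_1, n_2\) = Sample sizes  

This is the unpooled (Welch) form of the statistic, which does not require equal variances. Under the equal-variance assumption, the denominator is instead \(s_p \sqrt{1/n_1 + 1/n_2}\), with pooled variance \(s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\).  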
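
The two-sample statistic above uses a separate variance estimate for each group. A minimal sketch, assuming illustrative data, that computes it by hand and compares it with `scipy.stats.ttest_ind` called with `equal_var=False` (Welch's t-test):

```python
import numpy as np
from scipy.stats import ttest_ind

group1 = np.array([12, 14, 15, 16, 17])  # illustrative data
group2 = np.array([11, 13, 14, 15, 16])

# Standard error built from each group's own sample variance (no pooling)
se = np.sqrt(group1.var(ddof=1) / group1.size + group2.var(ddof=1) / group2.size)
t_manual = (group1.mean() - group2.mean()) / se

# equal_var=False requests Welch's t-test, which uses the same statistic
t_welch, p_welch = ttest_ind(group1, group2, equal_var=False)

print(t_manual, t_welch, p_welch)
```

With SciPy's default `equal_var=True`, the pooled-variance version is used instead; the two versions give the same statistic when the group sizes are equal, but their degrees of freedom, and so their p-values, can differ.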

---

### 🔹 **Paired Sample T-Test**
$$
t = \frac{\bar{d}}{s_d / \sqrt{n}}
$$  
- \(\bar{d}\) = Mean of the differences between paired observations  
- \(s_d\) = Standard deviation of the differences  
- \(n\) = Number of pairs  
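
The paired statistic is a one-sample t-test on the per-pair differences. The sketch below, using illustrative before/after values, shows this by computing the formula directly and comparing it with `scipy.stats.ttest_rel`:

```python
import numpy as np
from scipy.stats import ttest_rel

before = np.array([12, 14, 15, 16, 17])  # illustrative paired data
after = np.array([13, 14, 16, 17, 18])

d = before - after  # per-pair differences

# t = d_bar / (s_d / sqrt(n))
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

t_scipy, p_value = ttest_rel(before, after)

print(t_manual, t_scipy, p_value)
```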

---

## ✅ **Example Use Cases**
| **T-Test Type**      | **Scenario**                                               |
|----------------------|-----------------------------------------------------------|
| One-Sample           | Test if a factory’s **average battery life** is 10 hours.  |
| Two-Sample           | Compare **math scores** of two different schools.          |
| Paired Sample        | Compare **blood pressure** before and after medication.    |

---

## 🧠 **Key Insights**
- T-tests assume approximately **normal distributions** of the data.  
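- When the population standard deviation is unknown (the usual case), t-tests are preferred over z-tests. The difference matters most for **small samples** (\(n < 30\)), because the t-distribution approaches the normal distribution as \(n\) grows.  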
- Use **p-values** and **confidence intervals** to interpret the results:  
  - \(p \leq \alpha\) → Reject \(H_0\) (significant difference).  
  - \(p > \alpha\) → Fail to reject \(H_0\) (no significant difference).  
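
A minimal sketch of the decision rule together with a confidence interval, assuming illustrative data, a hypothesized mean of 15, and \(\alpha = 0.05\). It uses `scipy.stats.sem` for the standard error and `scipy.stats.t.interval` for the interval:

```python
import numpy as np
from scipy import stats

sample = np.array([12, 14, 15, 16, 17])  # illustrative data
mu0 = 15                                 # value of the mean under H0
alpha = 0.05

t_stat, p_value = stats.ttest_1samp(sample, mu0)

# 95% confidence interval for the mean, based on the t-distribution
se = stats.sem(sample)  # standard error of the mean, s / sqrt(n)
ci_low, ci_high = stats.t.interval(1 - alpha, sample.size - 1,
                                   loc=sample.mean(), scale=se)

reject = p_value <= alpha
print(f"p = {p_value:.4f}, CI = ({ci_low:.2f}, {ci_high:.2f}), reject H0: {reject}")
```

The two views agree: for a two-sided test, \(H_0\) is rejected at level \(\alpha\) exactly when the hypothesized mean falls outside the \(1 - \alpha\) confidence interval.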

---

💡 *Tip:* Always check **assumptions** (normality, independence, and equal variances) before using t-tests.
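
One possible way to check two of these assumptions in Python is sketched below, using the Shapiro-Wilk test for normality and Levene's test for equal variances. The data are illustrative. Independence cannot be checked this way; it depends on how the data were collected.

```python
from scipy.stats import shapiro, levene

group1 = [12, 14, 15, 16, 17]  # illustrative data
group2 = [11, 13, 14, 15, 16]

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution
w1, p_norm1 = shapiro(group1)
w2, p_norm2 = shapiro(group2)

# Levene: H0 is that the groups have equal variances
lev_stat, p_var = levene(group1, group2)

print(p_norm1, p_norm2, p_var)
```

A small p-value in either test suggests the corresponding assumption may not hold. With very small samples these tests have little power, so a large p-value is weak reassurance at best.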


# 📊 Chi-Square (χ²) Test  

---

## 🎯 **Purpose**
The **Chi-Square test** is a non-parametric statistical test used to:
1. **Test for Independence** – Checks whether **two categorical variables** are independent.  
2. **Test for Goodness-of-Fit** – Checks whether **observed frequencies** match expected frequencies under a specific hypothesis.  

---

## 🧪 **Types of Chi-Square Tests**

### 1️⃣ **Chi-Square Test of Independence**
- **Goal:** Determines if two categorical variables are **statistically independent**.  
- **Example Use Case:** Testing if **gender** is independent of **preference for a product**.  

---

### 2️⃣ **Chi-Square Goodness-of-Fit Test**
- **Goal:** Tests whether the **observed frequency distribution** matches a **theoretical or expected distribution**.  
- **Example Use Case:** Testing if a die is **fair** (each side should appear equally often).  
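
The die example can be sketched with `scipy.stats.chisquare`. The roll counts below are made up for illustration. When no expected frequencies are passed, the function assumes all categories are equally likely, which is the fair-die hypothesis.

```python
from scipy.stats import chisquare

# Hypothetical counts for faces 1-6 over 60 rolls (made-up numbers)
observed = [8, 12, 9, 11, 10, 10]

# With no f_exp argument, equal expected frequencies are assumed (10 per face here)
stat, p_value = chisquare(observed)

print(stat, p_value)
```

Here the statistic is \((4 + 4 + 1 + 1 + 0 + 0) / 10 = 1.0\) with 5 degrees of freedom, so these counts give no evidence against fairness.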

---

## 🧰 **Steps for Chi-Square Test of Independence**
1. **Create a Contingency Table**  
   Organize observed frequencies for two categorical variables.  
2. **Calculate Expected Frequencies**  
   Use the formula:  
   $$
   E_{ij} = \frac{(\text{Row Total})_i \times (\text{Column Total})_j}{\text{Grand Total}}
   $$  
3. **Compute the Chi-Square Statistic**  
   $$
   \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
   $$  
   - \(O_{ij}\) = Observed frequency  
   - \(E_{ij}\) = Expected frequency  
4. **Find the Degrees of Freedom**  
   $$
   df = (r-1)(c-1)
   $$  
   - \(r\) = Number of rows  
   - \(c\) = Number of columns  
5. **Compare to Critical Value or Use p-value**  
   - If \(p \leq \alpha\) → Reject \(H_0\) (evidence that the variables are **associated**).  
   - If \(p > \alpha\) → Fail to reject \(H_0\) (no significant evidence of association; this does not prove the variables are **independent**).  

---

## ✅ **Example Table**

|          | **Prefer A** | **Prefer B** | **Total** |
|----------|--------------|--------------|-----------|
| **Male**   | 30           | 20           | 50        |
| **Female** | 10           | 40           | 50        |
| **Total**  | 40           | 60           | 100       |
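
The steps for the test of independence can be sketched on this table as follows. The manual computation uses the uncorrected formula; `correction=False` is passed to `chi2_contingency` for the comparison because SciPy applies Yates' continuity correction to 2×2 tables by default.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from the table (rows: Male, Female; columns: Prefer A, Prefer B)
observed = np.array([[30, 20],
                     [10, 40]])

# Step 2: expected counts from row totals, column totals, and the grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Step 3: chi-square statistic
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Step 4: degrees of freedom
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Step 5: p-value from the upper tail of the chi-square distribution
p_value = chi2.sf(chi2_stat, dof)

# Cross-check against SciPy without the continuity correction
check_stat, check_p, check_dof, check_expected = chi2_contingency(observed, correction=False)

print(chi2_stat, p_value, dof)
```

For this table the expected counts are 20 and 30 in each row, and the uncorrected statistic is about 16.67 with 1 degree of freedom.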

---

## 🧠 **Key Insights**
- The **Chi-Square test** works only for **categorical data**.  
- Expected frequencies should generally be **≥ 5** in each cell for validity.  
- The test relies on a large-sample approximation, so larger sample sizes make it more reliable.  

---

💡 *Tip:* Use `scipy.stats.chi2_contingency` for independence tests or `chisquare` for goodness-of-fit in Python.  


In [3]:
from scipy.stats import chi2_contingency

# Contingency Table
data = [[50, 30], [20, 40]]

# Perform Chi-square Test
chi2, p, dof, expected = chi2_contingency(data)

print("Chi-Square Statistic:", chi2)
print("P-Value:", p)

Chi2ContingencyResult(statistic=10.529166666666667, pvalue=0.0011750518530845063, dof=1, expected_freq=array([[40., 40.],
       [30., 30.]]))

# 📈 ANOVA (Analysis of Variance)

---

## 🎯 **Purpose**
ANOVA is a statistical method used to **compare the means of three or more groups** to see if at least one differs significantly.

---

## 🧪 **Hypotheses**
- **Null Hypothesis (\(H_0\))**: All group means are equal.  
  $$
  \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k
  $$  
- **Alternative Hypothesis (\(H_a\))**: At least one group mean is different.  

---

## 🧰 **Example Use Case**
- Testing if **mean exam scores** of students from **three different schools** are significantly different.  

---

## 🧠 **Key Concepts**
- **Between-group variability**: Variability of group means around the overall mean.  
- **Within-group variability**: Variability of data points inside each group.  
- ANOVA compares these two variabilities using the **F-statistic**:  
  $$
  F = \frac{\text{Between-group variance}}{\text{Within-group variance}}
  $$  
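
To make the ratio concrete, the sketch below computes the between-group and within-group mean squares by hand and compares the resulting F with `scipy.stats.f_oneway`. The three groups are illustrative data.

```python
import numpy as np
from scipy.stats import f_oneway

groups = [
    np.array([12, 14, 15, 16, 17]),  # illustrative data
    np.array([11, 13, 14, 15, 16]),
    np.array([10, 12, 13, 14, 15]),
]

k = len(groups)                          # number of groups
n_total = sum(g.size for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group mean square: spread of group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Within-group mean square: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
f_scipy, p_value = f_oneway(*groups)

print(f_manual, f_scipy, p_value)
```

The "variances" in the ratio are mean squares: sums of squares divided by their degrees of freedom, \(k - 1\) between groups and \(N - k\) within groups.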

---

## 🐍 **Python Implementation**

### ✅ **Using SciPy**
```python
from scipy.stats import f_oneway

# Example data: Scores from three schools
school_A = [85, 90, 88, 92, 87]
school_B = [78, 82, 80, 79, 81]
school_C = [90, 92, 94, 91, 89]

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(school_A, school_B, school_C)

print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Decision
alpha = 0.05
if p_value <= alpha:
    print("Reject H0: At least one group mean is different.")
else:
    print("Fail to Reject H0: No significant difference between group means.")
```


In [4]:
from scipy.stats import f_oneway

# Data for three groups
group1 = [12, 14, 15, 16, 17]
group2 = [11, 13, 14, 15, 16]
group3 = [10, 12, 13, 14, 15]

# Perform ANOVA
f_stat, p_value = f_oneway(group1, group2, group3)
print("F-Statistic: ", f_stat)
print("P-Value: ", p_value)

F-Statistic:  1.3513513513513515
P-Value:  0.2955999950829255


In [5]:
#################################### Exercise 1 ####################################
# Conduct T-Test
# Perform one-sample, two-sample, paired t-tests
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel
# One-Sample T-Test
data = [12, 14, 15, 16, 17]
population_mean = 15
t_stat, p_value = ttest_1samp(data, population_mean)

print("One-Sample T-Test: ", t_stat, p_value)

# Two-Sample T-Test
group1 = [12, 14, 15, 16, 17]
group2 = [11, 13, 14, 15, 16]
t_stat, p_value = ttest_ind(group1, group2)
print("Two-Sample T-Test: ", t_stat, p_value)

# Paired T-Test
pre_test = [12, 14, 15, 16, 17]
post_test = [13, 14, 16, 17, 18]
t_stat, p_value = ttest_rel(pre_test, post_test)
print("Paired T-Test: ", t_stat, p_value)

One-Sample T-Test:  -0.23249527748763774 0.8275647196020324
Two-Sample T-Test:  0.8219949365267863 0.43489229767474047
Paired T-Test:  -3.9999999999999996 0.01613008990009254


In [6]:
#################################### Exercise 2 ####################################
# Perform a Chi-Square Test
from scipy.stats import chi2_contingency

# Contingency Table
data = [[50, 30, 20], [30, 40, 30]]

# Perform Chi-square Test
chi2, p, dof, expected = chi2_contingency(data)
print("Chi-square Statistic: ", chi2)
print("P-values: ", p)
print("Expected Frequencies: \n", expected)

Chi-square Statistic:  8.428571428571429
P-values:  0.01478287719483942
Expected Frequencies: 
 [[40. 35. 25.]
 [40. 35. 25.]]


In [7]:
#################################### Exercise 3 ####################################
from scipy.stats import f_oneway

# Data for three groups
group1 = [12, 14, 15, 16, 17]
group2 = [11, 13, 14, 15, 16]
group3 = [10, 12, 13, 14, 15]

# Perform ANOVA
f_stat, p_value = f_oneway(group1, group2, group3)
print("F-Statistic: ", f_stat)
print("P-Value: ", p_value)

F-Statistic:  1.3513513513513515
P-Value:  0.2955999950829255


In [None]:
#################################### Exercise 4 ####################################
from scipy.stats import ttest_1samp, ttest_ind, f_oneway, ttest_rel
import requests
from pathlib import Path
import pandas as pd
url = "https://raw.githubusercontent.com/shubhamtamhane/student-performance-python/refs/heads/master/StudentsPerformance.csv"

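file_path = Path("data/student_grades")
file_path.parent.mkdir(parents=True, exist_ok=True)

# Download the CSV and fail early on HTTP errors
response = requests.get(url, timeout=30)
response.raise_for_status()
file_path.write_text(response.text, encoding="utf-8")

# Load the data and show summary statistics for the numeric columns
data = pd.read_csv(file_path)
data.describe()
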

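|       | math score | reading score | writing score |
|-------|------------|---------------|---------------|
| count | 1000.0     | 1000.0        | 1000.0        |
| mean  | 66.089     | 69.169        | 68.054        |
| std   | 15.16308   | 14.600192     | 15.195657     |
| min   | 0.0        | 17.0          | 10.0          |
| 25%   | 57.0       | 59.0          | 57.75         |
| 50%   | 66.0       | 70.0          | 69.0          |
| 75%   | 77.0       | 79.0          | 79.0          |
| max   | 100.0      | 100.0         | 100.0         |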


In [26]:
female_grades = data[(data["gender"] == "female") & (data["race/ethnicity"] == "group B")]
male_grades = data[(data["gender"] == "male") & (data["race/ethnicity"] == "group B")]

In [25]:
female_grades

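|     | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|-----|--------|----------------|-----------------------------|-------|-------------------------|------------|---------------|---------------|
| 0   | female | group B | bachelor's degree  | standard     | none      | 72  | 72  | 74  |
| 2   | female | group B | master's degree    | standard     | none      | 90  | 95  | 93  |
| 5   | female | group B | associate's degree | standard     | none      | 71  | 83  | 78  |
| 6   | female | group B | some college       | standard     | completed | 88  | 95  | 92  |
| 9   | female | group B | high school        | free/reduced | none      | 38  | 60  | 50  |
| ... | ...    | ...     | ...                | ...          | ...       | ... | ... | ... |
| 923 | female | group B | associate's degree | free/reduced | none      | 54  | 65  | 65  |
| 944 | female | group B | high school        | standard     | none      | 58  | 68  | 61  |
| 969 | female | group B | bachelor's degree  | standard     | none      | 75  | 84  | 80  |
| 980 | female | group B | high school        | free/reduced | none      | 8   | 24  | 23  |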


In [27]:
male_grades

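|     | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|-----|--------|----------------|-----------------------------|-------|-------------------------|------------|---------------|---------------|
| 7   | male | group B | some college       | free/reduced | none      | 40  | 43  | 39  |
| 26  | male | group B | some college       | standard     | none      | 69  | 54  | 55  |
| 39  | male | group B | associate's degree | free/reduced | none      | 57  | 56  | 57  |
| 43  | male | group B | some college       | free/reduced | completed | 59  | 65  | 66  |
| 45  | male | group B | associate's degree | standard     | none      | 65  | 54  | 57  |
| ... | ...  | ...     | ...                | ...          | ...       | ... | ... | ... |
| 919 | male | group B | some college       | standard     | completed | 91  | 96  | 91  |
| 946 | male | group B | high school        | standard     | none      | 82  | 82  | 80  |
| 948 | male | group B | some high school   | free/reduced | completed | 49  | 50  | 52  |
| 976 | male | group B | some college       | free/reduced | completed | 60  | 62  | 60  |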


In [30]:
male_grades_math = male_grades["math score"]
female_grades_math = female_grades["math score"]

In [39]:
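# Use the female group's mean math score as the reference value.
# Note: this treats that mean as a fixed constant and ignores its sampling
# variability; ttest_ind(male_grades_math, female_grades_math) compares both samples directly.
population_mean = female_grades_math.mean()
t_stat, p_value = ttest_1samp(male_grades_math, population_mean)
print(f"T-statistic: {t_stat}, P-value: {p_value}")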

T-statistic: 2.965048722317664, P-value: 0.003926917775604343
