## 1 T-test

### 1.1 Assumptions for T-test

1. The population standard deviation $\sigma$ does NOT need to be known.
2. The T-test is used to test hypotheses about the population mean.
3. The sampling distribution of the sample mean is (approximately) normal, and its mean (the hypothesized population mean) is available.

### 1.2 Conditions for T-test

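1. The data (sample) are values of a continuous random variable.
2. The sample is drawn by random sampling.
3. Observations are independent of each other.
4. When the sample size $n < 30$, the sampling distribution of the mean should be normal, which in practice requires the population itself to be approximately normal.
5. When the sample size $n \ge 30$, the Central Limit Theorem (CLT) makes the sampling distribution of the mean approximately normal.
6. The population standard deviation $\sigma$ is unknown and is estimated using the sample standard deviation $s$.
7. Homogeneity of variance: when two groups are compared, they should have similar variances. The sample should also be free of extreme outliers.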

#### Z-Test vs T-Test

1. As per the Law of Large Numbers, as the sample size $n \to \infty$ the sample standard deviation $s$ converges to the population standard deviation $\sigma$.
2. Typically, for sample size $n \ge 30$, the T-test behaves like the Z-test.
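
The convergence described above can be illustrated by comparing critical values. The sketch below assumes a two-tailed test at $\alpha = 0.05$ and an arbitrary set of degrees of freedom chosen only for illustration; it prints how the T critical value approaches the Z critical value as the degrees of freedom grow.

```python
from scipy.stats import norm, t

alpha = 0.05
# Two-tailed critical value of the standard normal distribution.
z_crit = norm.ppf(1 - alpha / 2)

# Degrees of freedom (n - 1); these values are illustrative assumptions.
gaps = {}
for df in (5, 10, 29, 100, 1000):
    t_crit = t.ppf(1 - alpha / 2, df=df)
    gaps[df] = t_crit - z_crit
    print(f"df={df:>4}  t_crit={t_crit:.4f}  z_crit={z_crit:.4f}  gap={gaps[df]:.4f}")
```

The gap is always positive (the T distribution has heavier tails) and shrinks as the sample size increases.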

Sample size and population standard deviation criteria for Z-test vs T-test:

<div style="display: inline-block">

| Sample size $n$ and Population Standard Deviation $\sigma$                      | Hypothesis Test |
| :------------------------------------------------------------------------------ | :-------------- |
| Sample size $n <  30$ and population standard deviation $\sigma$ is known.      | Z-test          |
| Sample size $n <  30$ and population standard deviation $\sigma$ is NOT known.  | T-test          |
| Sample size $n \ge 30$ and population standard deviation $\sigma$ is known.     | Z-test          |
| Sample size $n \ge 30$ and population standard deviation $\sigma$ is NOT known. | Z-test / T-test |

</div>
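
The decision table above can be written as a small helper. `choose_test` is a hypothetical function name introduced here only for illustration; it simply encodes the four rows of the table and is not part of any library.

```python
def choose_test(n: int, sigma_known: bool) -> str:
    """Return the test suggested by the table for sample size n.

    Hypothetical helper that mirrors the table rows; the threshold of 30
    is the rule of thumb used in this document.
    """
    if sigma_known:
        # Known sigma: Z-test regardless of sample size.
        return "Z-test"
    # Unknown sigma: T-test for small samples, either test for large ones.
    return "T-test" if n < 30 else "Z-test / T-test"


for n, known in [(12, True), (12, False), (50, True), (50, False)]:
    print(f"n={n}, sigma known={known}: {choose_test(n, known)}")
```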

### 1.3 Types of T-test

1. One Sample T-test
2. Independent Two Sample T-test
3. Paired Two sample T-test

## 2 One Sample T-test

### 2.1 Nature of hypothesis

- $H_0: \mu = \mu_0$
- $H_a: \text{$\mu \neq \mu_0$ (or $\mu > \mu_0$, $\mu < \mu_0$)}$

### 2.2 Test Statistic

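1. The test statistic of the T-test is called the T-statistic.
2. Under $H_0$, the T-statistic follows a T distribution with $n - 1$ degrees of freedom, where $\bar{x}$ is the sample mean, $\mu_0$ the hypothesized population mean, $s$ the sample standard deviation, and $n$ the sample size.

$
\begin{align}
\large
\text{T-Statistic} = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}
\end{align}
$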
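
As a sanity check, the T-statistic can be computed by hand and compared with `scipy.stats.ttest_1samp`. This is a minimal sketch: the `sample` values and the hypothesized mean `mu_0` are made up for illustration, and the sample standard deviation uses `ddof=1`.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Made-up measurements and hypothesized mean (illustrative assumptions).
sample = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.3])
mu_0 = 12.0

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation (divides by n - 1)

# T-statistic: (sample mean - hypothesized mean) / standard error.
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))
# Two-sided p-value from the T distribution with n - 1 degrees of freedom.
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)

result = ttest_1samp(sample, popmean=mu_0)
print("manual:", t_manual, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

Both lines should print the same statistic and p-value up to floating-point error.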

### 2.3 API

```python
from scipy.stats import ttest_1samp
```

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html

## 3 Independent Two Sample T-test

### 3.1 Nature of hypothesis

- $H_0: \mu_1 = \mu_2$
- $H_a: \text{$\mu_1 \neq \mu_2$ (or $\mu_1 > \mu_2$, $\mu_1 < \mu_2$)}$

### 3.2 Test Statistic

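1. The test statistic of the Independent Two Sample T-test is also called the T-statistic.
2. Under $H_0$, the T-statistic follows a T distribution. The form below uses the separate sample variances $s_1^2$ and $s_2^2$ (the unequal-variance, or Welch, form); $\bar{x}_1, \bar{x}_2$ are the sample means and $n_1, n_2$ the sample sizes.

$
\begin{align}
\large
\text{T-Statistic} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
\end{align}
$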
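
The statistic above, with a separate variance term per sample, corresponds to what `scipy.stats.ttest_ind` computes when `equal_var=False`; the default `equal_var=True` pools the two variances instead. The sketch below uses made-up groups (an assumption for illustration) to compare a manual computation with both variants.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up samples of unequal size (illustrative assumptions).
group_1 = np.array([101.0, 98.0, 110.0, 105.0, 99.0, 107.0, 103.0])
group_2 = np.array([95.0, 102.0, 97.0, 91.0, 100.0])

n_1, n_2 = group_1.size, group_2.size
# Standard error built from the two sample variances (ddof=1).
se = np.sqrt(group_1.var(ddof=1) / n_1 + group_2.var(ddof=1) / n_2)
t_manual = (group_1.mean() - group_2.mean()) / se

welch = ttest_ind(group_1, group_2, equal_var=False)  # unequal variances
pooled = ttest_ind(group_1, group_2)                  # default: pooled variance

print("manual t:", t_manual)
print("Welch t: ", welch.statistic)
print("pooled t:", pooled.statistic)
```

The manual value should match the Welch result; the pooled statistic generally differs when the sample sizes and variances differ.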

### 3.3 API

```python
from scipy.stats import ttest_ind
```

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

## 4 Paired Two sample T-test

### 4.1 Nature of hypothesis

- $H_0: \mu_d = 0$
- $H_a: \text{$\mu_d \neq 0$ (or $\mu_d > 0$, $\mu_d < 0$)}$
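
Because the hypotheses are stated on the mean difference $\mu_d$, a paired test is equivalent to a one sample T-test on the per-pair differences. The sketch below uses made-up `before` and `after` scores (an assumption for illustration) to show that `scipy.stats.ttest_rel` and `scipy.stats.ttest_1samp` agree.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Made-up paired measurements on the same subjects (illustrative assumptions).
before = np.array([40.0, 49.0, 65.0, 59.0, 44.0, 52.0])
after = np.array([43.0, 50.0, 69.0, 58.0, 49.0, 55.0])

# Per-pair differences; H_0 states that their mean is 0.
d = after - before

paired = ttest_rel(after, before)
one_sample = ttest_1samp(d, popmean=0)

print("paired:    ", paired.statistic, paired.pvalue)
print("one-sample:", one_sample.statistic, one_sample.pvalue)
```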

### 4.2 API

```python
from scipy.stats import ttest_rel
```

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html

## 5 Examples

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import norm, ttest_1samp, ttest_ind, ttest_rel

### Example #1

Let's say you are a Research Scientist working on a new cognitive enhancement pill:

- The goal is to develop a pill that can significantly improve IQ scores in individuals.
- The researchers believe that the new pill will lead to a significant increase in average IQ scores for the population.

In [2]:
iq_scores = [110, 105, 98, 102, 99, 104, 115, 95]
print("Sample size:", len(iq_scores))

Sample size: 8


#### Solution

In [3]:
# The average human IQ is taken to be 100.
mu = 100  # Population mean.
n = len(iq_scores)

# H_0: mu <= 100  The pill does not raise the mean IQ above the global average.
# H_a: mu > 100   The pill raises the mean IQ above the global average.

In [4]:
# Distribution: T distribution
# Test statistic: T-score
# Significance level: 0.05
alpha = 0.05

In [5]:
# Since H_a has > symbol, perform right tailed test.

In [6]:
x_bar = np.mean(iq_scores)  # Sample mean
x_bar.item()

103.5

In [7]:
t_stat, p_value = ttest_1samp(a=iq_scores, popmean=mu, alternative="greater")
print("t-statistic:", t_stat.round(6).item())
print("p-value", p_value.round(6).item())

t-statistic: 1.507157
p-value 0.08775


In [8]:
if p_value <= alpha:
    print("Reject Null hypothesis")
else:
    print("Failed to reject Null hypothesis")

Failed to reject Null hypothesis


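The sample mean IQ of 103.5 is **NOT** significantly greater than 100 at $\alpha = 0.05$ (p-value ≈ 0.088), so we fail to reject $H_0$: the data do not provide sufficient evidence that the pill raises the average IQ.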

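In [9]:
# Standard error of the mean, using the sample standard deviation (ddof=1).
sample_se = np.std(iq_scores, ddof=1) / np.sqrt(n)
sample_se.round(6).item()

2.322253

In [10]:
# Right-tailed critical value from the T distribution with n - 1 degrees of freedom.
from scipy.stats import t as t_dist

t_critical = t_dist.ppf(1 - alpha, df=n - 1)
t_critical.round(6).item()

1.894579

In [11]:
# Smallest sample mean that would lead to rejecting H_0:
# x = mu + (t_critical * sample_se)
x = mu + (t_critical * sample_se)
x.round(1).item()

104.4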

### Example #2

Suppose we have IQ samples from two schools and want to test whether the mean IQ of their students differs.

- Use $\alpha$ = 0.05

In [12]:
iq_df = pd.read_csv("../0_data/01_students/iq_two_schools.csv")
iq_df.head(3)

Unnamed: 0,School,iq
0,school_1,91
1,school_1,95
2,school_1,110


In [13]:
iq_df["School"].value_counts()

School
school_1    26
school_2    24
Name: count, dtype: int64

In [14]:
mask = iq_df["School"] == "school_1"
sc1_sample = iq_df[mask]["iq"].values.tolist()
sc1_sample[:5]

[91, 95, 110, 112, 115]

In [15]:
mask = iq_df["School"] == "school_2"
sc2_sample = iq_df[mask]["iq"].values.tolist()
sc2_sample[:5]

[112, 115, 95, 92, 91]

#### Solution

In [16]:
# H_0: mu_1 = mu_2  # Mean IQ of students in both schools is the same.
# H_a: mu_1 != mu_2  # Mean IQ of students in the two schools is different.

In [17]:
alpha = 0.05

In [18]:
# Since H_a has != symbol, perform two-tailed test.

In [19]:
# Compute test statistic and p-value.
t_stat, p_value = ttest_ind(a=sc1_sample, b=sc2_sample, alternative="two-sided")
print("t-statistic:", t_stat.round(6).item())
print("p-value", p_value.round(6).item())

t-statistic: -2.405647
p-value 0.020046


In [20]:
# Compare p-value with alpha.
if p_value <= alpha:
    print("Reject Null hypothesis")
else:
    print("Failed to reject Null hypothesis")

Reject Null hypothesis


### Example #3

In [21]:
ps_df = pd.read_csv("../0_data/01_students/problem_solving.csv")
ps_df.head(3)

Unnamed: 0,id,test_1,test_2
0,0,40,38
1,1,49,44
2,2,65,69


In [22]:
row_count, _ = ps_df.shape
print("Sample size:", row_count)

Sample size: 137


In [23]:
sample_before = ps_df["test_1"].values.tolist()
sample_before[:5]

[40, 49, 65, 59, 44]

In [24]:
sample_after = ps_df["test_2"].values.tolist()
sample_after[:5]

[38, 44, 69, 63, 43]

#### Solution

Perform a two-tailed test to see if there is any difference in test scores.

In [25]:
# H_0: mu_before = mu_after  # Problem Solving has no effect on scores.
# H_a: mu_before != mu_after  # Problem Solving has some effect on scores.

alpha = 0.05

# Since H_a has != symbol, perform two-tailed test.

# Compute test statistic and p-value.
t_stat, p_value = ttest_rel(a=sample_before, b=sample_after, alternative="two-sided")
print("t-statistic:", t_stat.round(6).item())
print("p-value", p_value.item())

# Compare p-value with alpha.
if p_value <= alpha:
    print("Reject Null hypothesis")
else:
    print("Failed to reject Null hypothesis")

t-statistic: -5.502886
p-value 1.7958403537923126e-07
Reject Null hypothesis


Perform a right-tailed test to see if there is any improvement in test scores.

In [26]:
# H_0: mu_after <= mu_before  # Problem Solving has not improved scores.
# H_a: mu_after > mu_before  # Problem Solving has improved scores.

alpha = 0.05

# Since H_a has > symbol, perform right-tailed test.

# Compute test statistic and p-value.
t_stat, p_value = ttest_rel(a=sample_after, b=sample_before, alternative="greater")
print("t-statistic:", t_stat.round(6).item())
print("p-value", p_value.item())

# Compare p-value with alpha.
if p_value <= alpha:
    print("Reject Null hypothesis")
else:
    print("Failed to reject Null hypothesis")

t-statistic: 5.502886
p-value 8.979201768961563e-08
Reject Null hypothesis
