# Student's t-tests
1. One-sample t-test
2. Two-sample t-test
   1. Unpaired (independent) t-test
   2. Paired (related/dependent) t-test

## One sample t-test

Tests whether the mean of a single sample differs significantly from a known (hypothesized) value.

**Assumptions**
- Observations in the sample are independent and identically distributed.
- Observations in the sample are normally distributed.
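
The normality assumption can be checked before running the test. The following is a minimal sketch using the Shapiro-Wilk test from `scipy.stats`; the two samples are synthetic and purely illustrative (they are not the Titanic data), and the 0.05 threshold is an assumed convention.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical samples: one roughly normal, one strongly right-skewed
normal_sample = rng.normal(loc=30, scale=14, size=200)
skewed_sample = rng.exponential(scale=30, size=200)

for name, sample in [('normal', normal_sample), ('skewed', skewed_sample)]:
    # H0 of the Shapiro-Wilk test: the sample comes from a normal distribution
    stat, p = shapiro(sample)
    verdict = 'no evidence against normality' if p > 0.05 else 'normality is doubtful'
    print('%s: W=%.3f, p=%.3f -> %s' % (name, stat, p, verdict))
```

A small p-value here suggests that the t-test result should be read with caution, especially for small samples.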


**Interpretation**

> **H0:** the sample mean is equal to the known value.\
> **H1:** the sample mean is not equal to the known value.

In [2]:
# One sample t-test
import pandas as pd
import seaborn as sns
from scipy.stats import ttest_1samp

# Load dataset
df = sns.load_dataset('titanic')

In [3]:
df.head()

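| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |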


In [12]:
# Subsetting a dataset
df1 = df[['age', 'sex', 'fare']]

In [14]:
df1.head()

| | age | sex | fare |
|---|---|---|---|
| 0 | 22.0 | male | 7.25 |
| 1 | 38.0 | female | 71.2833 |
| 2 | 26.0 | female | 7.925 |
| 3 | 35.0 | female | 53.1 |
| 4 | 35.0 | male | 8.05 |


In [15]:
df1.describe()

| | age | fare |
|---|---|---|
| count | 714.0 | 891.0 |
| mean | 29.699118 | 32.204208 |
| std | 14.526497 | 49.693429 |
| min | 0.42 | 0.0 |
| 25% | 20.125 | 7.9104 |
| 50% | 28.0 | 14.4542 |
| 75% | 38.0 | 31.0 |
| max | 80.0 | 512.3292 |


In [16]:
# Compare the mean fare with a hypothesized value of 50
stat, p = ttest_1samp(df1['fare'], 50)

print('stat=%.3f, p=%.3f' % (stat, p))

# Interpret the p-value at the 5% significance level
if p > 0.05:
    print('Fail to reject H0: mean fare is not significantly different from 50')
else:
    print('Reject H0: mean fare is significantly different from 50')

stat=-10.689, p=0.000
Reject H0: mean fare is significantly different from 50
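
The reported statistic can be reproduced by hand from the `fare` column of the `df1.describe()` output using the one-sample formula $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$. The sketch below uses only the summary values shown earlier; the variable names are illustrative, and it computes the statistic only (the p-value additionally requires the t distribution with $n - 1$ degrees of freedom).

```python
import math

# Summary statistics for fare, copied from df1.describe()
n = 891
sample_mean = 32.204208
sample_std = 49.693429   # pandas reports the sample standard deviation (ddof=1)
mu0 = 50                 # hypothesized mean used in the test

# Standard error of the mean, then the t statistic
standard_error = sample_std / math.sqrt(n)
t_stat = (sample_mean - mu0) / standard_error
dof = n - 1

print('t=%.2f with %d degrees of freedom' % (t_stat, dof))
```

The result is approximately -10.69, in line with the `stat` value printed by `ttest_1samp`.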


## Two sample t-test
### Independent Student's t-test
Tests whether the means of two independent samples are significantly different.

**Assumptions**
- Observations in each sample are independent and identically distributed.
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

**Interpretation**

> **H0:** the means of the samples are equal.\
> **H1:** the means of the samples are unequal.
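
The equal-variance assumption listed above can be checked, and relaxed when it does not hold. The following is a minimal sketch on synthetic data (the groups and their parameters are assumptions for illustration only): Levene's test checks whether the variances are equal, and `ttest_ind(..., equal_var=False)` runs Welch's t-test, which does not pool the variances.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Hypothetical groups with different spreads and sizes
group_a = rng.normal(loc=25, scale=10, size=300)
group_b = rng.normal(loc=25, scale=30, size=150)

# Levene's test: H0 is that the groups have equal variances
lev_stat, lev_p = levene(group_a, group_b)
print('Levene:  stat=%.3f, p=%.3f' % (lev_stat, lev_p))

# Student's t-test (pooled variance) vs. Welch's t-test (separate variances)
student_stat, student_p = ttest_ind(group_a, group_b, equal_var=True)
welch_stat, welch_p = ttest_ind(group_a, group_b, equal_var=False)
print('Student: stat=%.3f, p=%.3f' % (student_stat, student_p))
print('Welch:   stat=%.3f, p=%.3f' % (welch_stat, welch_p))
```

When Levene's test rejects equal variances, the Welch result is usually the safer one to report.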

In [22]:
# Compare the mean fare of male vs. female passengers

# Split the dataset by sex
df_male = df1.loc[df1['sex'] == 'male']
df_female = df1.loc[df1['sex'] == 'female']

# Import the independent two-sample t-test
from scipy.stats import ttest_ind

# fare is used because it has no missing values; age contains NaNs and would need dropna() first
stat, p = ttest_ind(df_male['fare'], df_female['fare'])

print('stat=%.3f, p=%.3f' % (stat, p))

# Interpret the p-value at the 5% significance level
if p > 0.05:
    print('Fail to reject H0: no significant difference in mean fare')
else:
    print('Reject H0: mean fare differs significantly between the groups')

stat=-5.529, p=0.000
Reject H0: mean fare differs significantly between the groups


In [20]:
df_male.describe()

| | age | fare |
|---|---|---|
| count | 453.0 | 577.0 |
| mean | 30.726645 | 25.523893 |
| std | 14.678201 | 43.138263 |
| min | 0.42 | 0.0 |
| 25% | 21.0 | 7.8958 |
| 50% | 29.0 | 10.5 |
| 75% | 39.0 | 26.55 |
| max | 80.0 | 512.3292 |


In [21]:
df_female.describe()

| | age | fare |
|---|---|---|
| count | 261.0 | 314.0 |
| mean | 27.915709 | 44.479818 |
| std | 14.110146 | 57.997698 |
| min | 0.75 | 6.75 |
| 25% | 18.0 | 12.071875 |
| 50% | 27.0 | 23.0 |
| 75% | 37.0 | 55.0 |
| max | 63.0 | 512.3292 |


### Paired Student's t-test
Tests whether the means of two paired samples are significantly different.

**Assumptions**

- Pairs of observations are independent of one another.
- The differences between paired observations are approximately normally distributed.
- Observations across the two samples are paired (each value in one sample has a matching value in the other).

**Interpretation**

> **H0:** the means of the samples are equal.\
> **H1:** the means of the samples are unequal.
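
A paired t-test is equivalent to a one-sample t-test on the pairwise differences against zero, which is why the pairing matters. The sketch below illustrates this on synthetic before/after measurements; the data and variable names are hypothetical and are not taken from the Titanic dataset.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Hypothetical paired data: the same 30 subjects measured before and after
before = rng.normal(loc=70, scale=8, size=30)
after = before + rng.normal(loc=-1.5, scale=3, size=30)

# Paired test on the two samples
paired_stat, paired_p = ttest_rel(before, after)

# One-sample test on the differences, with a hypothesized mean of 0
diff_stat, diff_p = ttest_1samp(before - after, 0)

print('paired:      stat=%.3f, p=%.3f' % (paired_stat, paired_p))
print('differences: stat=%.3f, p=%.3f' % (diff_stat, diff_p))
```

Both lines print the same statistic and p-value, since the two calls compute the same quantity.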

In [23]:
df.head()

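| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |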


In [32]:
# Select only the male passengers
df_male = df.loc[df['sex'] == 'male']
df_male.head()


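| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |
| 5 | 0 | 3 | male | NaN | 0 | 0 | 8.4583 | Q | Third | man | True | NaN | Queenstown | no | True |
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
| 7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.075 | S | Third | child | False | NaN | Southampton | no | False |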


In [35]:
# Split the male passengers by class
df_male_first = df_male.loc[df_male['class']== 'First']
df_male_second = df_male.loc[df_male['class']== 'Second']
df_male_third = df_male.loc[df_male['class']== 'Third']

In [41]:
# Check our data (a notebook cell displays only the last expression, here third class)
df_male_first.describe()
df_male_second.describe()
df_male_third.describe()

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 347.0 | 347.0 | 253.0 | 347.0 | 347.0 | 347.0 |
| mean | 0.135447 | 3.0 | 26.507589 | 0.498559 | 0.224784 | 12.661633 |
| std | 0.342694 | 0.0 | 12.159514 | 1.288846 | 0.623404 | 11.681696 |
| min | 0.0 | 3.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 3.0 | 20.0 | 0.0 | 0.0 | 7.75 |
| 50% | 0.0 | 3.0 | 25.0 | 0.0 | 0.0 | 7.925 |
| 75% | 0.0 | 3.0 | 33.0 | 0.0 | 0.0 | 10.0083 |
| max | 1.0 | 3.0 | 74.0 | 8.0 | 5.0 | 69.55 |


In [47]:
# Draw 100 rows from each class so that the samples have equal size
df_first = df_male_first.sample(n=100)
df_second = df_male_second.sample(n=100)
df_third = df_male_third.sample(n=100)

# Confirm the sample sizes
print('The number of instances in first class are: ', df_first.shape)
print('The number of instances in second class are: ', df_second.shape)

The number of instances in first class are:  (100, 15)
The number of instances in second class are:  (100, 15)


In [49]:
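# Import the paired (related-samples) t-test
from scipy.stats import ttest_rel

# Compare first class and second class
# Both samples must have the same number of rows

stat, p = ttest_rel(df_first['age'], df_second['age'])

print('stat=%.3f, p=%.3f' % (stat, p))

# Interpret the p-value at the 5% significance level
if p > 0.05:
    print('Fail to reject H0: no significant difference in mean age')
else:
    print('Reject H0: mean age differs significantly between the classes')

stat=nan, p=nan
Reject H0: mean age differs significantly between the classes

**Note:** `age` contains missing values, so `ttest_rel` returns `nan` for both the statistic and the p-value. Because `nan > 0.05` evaluates to `False`, the `else` branch runs and the printed verdict is not meaningful. The two random samples are also not paired observations of the same passengers, so an independent t-test is the more appropriate choice for this comparison.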
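
The `stat=nan, p=nan` result above arises when either sample contains missing values. The following is a minimal sketch of one way to handle this, using small hypothetical arrays rather than the Titanic data: drop every pair in which a value is missing, then check for `nan` before interpreting the p-value.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired measurements with missing values
first = np.array([30.0, 41.0, np.nan, 52.0, 38.0, 45.0, 29.0, 60.0])
second = np.array([28.0, 35.0, 33.0, np.nan, 36.0, 40.0, 31.0, 51.0])

# With the default nan_policy='propagate', any NaN makes the result NaN
raw_stat, raw_p = ttest_rel(first, second)
print('raw:   stat=%.3f, p=%.3f' % (raw_stat, raw_p))
print('nan > 0.05 is', np.nan > 0.05)  # comparisons with NaN are always False

# Keep only the pairs in which both values are present
mask = ~np.isnan(first) & ~np.isnan(second)
clean_stat, clean_p = ttest_rel(first[mask], second[mask])
print('clean: stat=%.3f, p=%.3f' % (clean_stat, clean_p))

# Guard against NaN before interpreting the p-value
if np.isnan(clean_p):
    print('Test could not be computed')
elif clean_p > 0.05:
    print('Fail to reject H0')
else:
    print('Reject H0')
```

With pandas data, calling `dropna()` on the relevant columns before sampling serves the same purpose.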
