# Student t-tests
1) One sample t-test
2) Two sample t-test
   - Unpaired or independent t-test
   - Paired or dependent t-test

## 1) One sample Student t-test
Compare the mean of a sample with a known standard value.

Assumptions

- Observations in the sample are independent and identically distributed.
- Observations in the sample are normally distributed.
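
The normality assumption can be checked before running the test. The following is a minimal sketch using the Shapiro-Wilk test from SciPy; the `sample` array is synthetic data generated only for illustration and is not part of the Titanic dataset.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic sample (assumption: illustration only, not the Titanic ages)
rng = np.random.default_rng(0)
sample = rng.normal(loc=30, scale=14, size=200)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
stat, p = shapiro(sample)
print("stat=%.3f, p=%.3f" % (stat, p))
if p > 0.05:
    print("No evidence against normality")
else:
    print("The sample does not look normally distributed")
```

A small p-value here suggests the normality assumption may not hold; it says nothing about the mean itself.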

Interpretation:

H0: The mean of the sample is equal to the known value.

H1: The mean of the sample is not equal to the known value.
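
To make the hypotheses concrete, the one-sample statistic is t = (sample mean - known value) / (sample standard deviation / sqrt(n)). The sketch below computes it by hand and compares it with `scipy.stats.ttest_1samp`; the `sample` array and `known_value` are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Synthetic sample and a hypothetical known value (assumptions for illustration)
rng = np.random.default_rng(1)
sample = rng.normal(loc=30, scale=14, size=100)
known_value = 45

# t = (mean - known value) / (sample std / sqrt(n)), using ddof=1 for the sample std
n = sample.size
t_manual = (sample.mean() - known_value) / (sample.std(ddof=1) / np.sqrt(n))

# The same statistic from SciPy, together with the two-sided p-value
t_scipy, p = ttest_1samp(sample, known_value)

print("manual t=%.3f, scipy t=%.3f, p=%.3g" % (t_manual, t_scipy, p))
```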

In [74]:
# import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import joblib
import scipy as sp

In [75]:
# Load Dataset

ship = sns.load_dataset("titanic")
ship.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [76]:
# Choose the columns

ship1 = ship[["sex","age"]]
ship1.head()

Unnamed: 0,sex,age
0,male,22.0
1,female,38.0
2,female,26.0
3,female,35.0
4,male,35.0


In [77]:
# describe the data

ship1.describe()

Unnamed: 0,age
count,714.0
mean,29.699118
std,14.526497
min,0.42
25%,20.125
50%,28.0
75%,38.0
max,80.0


In [78]:
# Compare the mean age with a known value (45)
from scipy.stats import ttest_1samp

# Note: age contains missing values, so the result below is NaN
stat, p = ttest_1samp(ship1["age"], 45)
print("stat=%.3f,p=%.3f" % (stat, p))
# A NaN p-value fails the `p > 0.05` check, so the else branch runs
if p > 0.05:
    print("Probably the same mean")
else:
    print("Probably a different mean")

stat=nan,p=nan
Probably a different mean
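
The `nan` result above is what SciPy returns when the input contains missing values, because `nan_policy` defaults to `"propagate"`. A minimal sketch of two ways around this, using a small toy `age` series (hypothetical values, not the Titanic column):

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp

# Toy ages with missing values (assumption: illustration only)
age = pd.Series([22.0, 38.0, np.nan, 35.0, 35.0, np.nan, 54.0, 2.0])

# Default behaviour: NaN in the input propagates to the result
stat_nan, p_nan = ttest_1samp(age.to_numpy(), 45)

# Option 1: drop missing values before testing
stat_drop, p_drop = ttest_1samp(age.dropna().to_numpy(), 45)

# Option 2: let SciPy ignore missing values
stat_omit, p_omit = ttest_1samp(age.to_numpy(), 45, nan_policy="omit")

print("with NaN: stat=%s" % stat_nan)
print("dropna:   stat=%.3f, p=%.3f" % (stat_drop, p_drop))
print("omit:     stat=%.3f, p=%.3f" % (stat_omit, p_omit))
```

Both options should give the same statistic; which one to use is mostly a matter of whether the cleaned series is needed elsewhere.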


## 2) Two sample t-test
## Unpaired / Independent Student t-test

Assumptions

- Observations in each sample are independent and identically distributed.
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation:

H0: The means of the two samples are equal.

H1: The means of the two samples are not equal.
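
The equal-variance assumption can be examined with Levene's test, and `ttest_ind` accepts `equal_var=False` to run Welch's t-test when that assumption is doubtful. A minimal sketch on two synthetic groups (`group_a` and `group_b` are hypothetical names and values):

```python
import numpy as np
from scipy.stats import levene, ttest_ind

# Two synthetic groups (assumption: illustration only)
rng = np.random.default_rng(2)
group_a = rng.normal(loc=30, scale=14, size=150)
group_b = rng.normal(loc=28, scale=14, size=120)

# Levene's test: H0 is that both groups have equal variance
lev_stat, lev_p = levene(group_a, group_b)
print("levene: stat=%.3f, p=%.3f" % (lev_stat, lev_p))

# Student's t-test (assumes equal variances)
stat_student, p_student = ttest_ind(group_a, group_b)

# Welch's t-test (does not assume equal variances)
stat_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print("student: stat=%.3f, p=%.3f" % (stat_student, p_student))
print("welch:   stat=%.3f, p=%.3f" % (stat_welch, p_welch))
```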

In [79]:
# We will compare the age of males and females

# import library
from scipy.stats import ttest_ind

# splitting datasets
ship1_male = ship1.loc[ship1["sex"]=="male"]
ship1_female = ship1.loc[ship1["sex"]=="female"]

# Unpaired (independent) two-sample t-test; keep the statistic and p-value
# Note: age contains missing values, so the result below is NaN
stat, p = ttest_ind(ship1_male["age"], ship1_female["age"])

print("stat=", stat)
print("p=", p)
if p > 0.05:
    print("Probably the same mean")
else:
    print("Probably a different mean")

stat= nan
p= nan
Probably a different mean
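
Before comparing two groups it helps to count the missing values in each group and drop them, otherwise `ttest_ind` returns `nan` as above. A minimal sketch on a toy frame (the `toy` DataFrame and its values are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Toy data with missing ages in both groups (assumption: illustration only)
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female", "male", "female"],
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0],
})

# Count missing ages per group
missing = toy["age"].isna().groupby(toy["sex"]).sum()
print(missing)

# Split by group and drop missing ages before testing
male_age = toy.loc[toy["sex"] == "male", "age"].dropna()
female_age = toy.loc[toy["sex"] == "female", "age"].dropna()

stat, p = ttest_ind(male_age.to_numpy(), female_age.to_numpy())
print("stat=%.3f, p=%.3f" % (stat, p))
```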


In [80]:
ship1_male.describe()

Unnamed: 0,age
count,453.0
mean,30.726645
std,14.678201
min,0.42
25%,21.0
50%,29.0
75%,39.0
max,80.0


In [81]:
ship1_female.describe()

Unnamed: 0,age
count,261.0
mean,27.915709
std,14.110146
min,0.75
25%,18.0
50%,27.0
75%,37.0
max,63.0


## Paired / Dependent Student t-test


Assumptions

- Observations in each sample are independent and identically distributed.
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
- Observations across the samples are paired.

Interpretation:

H0: The means of the two paired samples are equal.

H1: The means of the two paired samples are not equal.
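
A paired t-test is equivalent to a one-sample t-test on the pair-wise differences against zero, which is why the pairing matters. A minimal sketch on synthetic paired measurements (`before` and `after` are hypothetical names and values):

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Synthetic paired measurements (assumption: illustration only)
rng = np.random.default_rng(3)
before = rng.normal(loc=30, scale=5, size=50)
after = before + rng.normal(loc=1, scale=2, size=50)

# Paired t-test on the two samples
stat_rel, p_rel = ttest_rel(before, after)

# One-sample t-test on the differences against a mean of zero
stat_diff, p_diff = ttest_1samp(before - after, 0)

print("paired:      stat=%.3f, p=%.3g" % (stat_rel, p_rel))
print("differences: stat=%.3f, p=%.3g" % (stat_diff, p_diff))
```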

In [82]:
# Compare the ages of male passengers across classes

# Select the male passengers and split them by class
ship_male = ship.loc[ship["sex"]=="male"]
ship_male_first = ship_male.loc[ship_male['class']=='First']
ship_male_second = ship_male.loc[ship_male['class']=='Second']
ship_male_third = ship_male.loc[ship_male['class']=='Third']

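We cannot apply the paired t-test to this data directly because the First, Second and Third class groups have different numbers of rows, so we first make them equal. Note that random sampling only equalizes the lengths; it does not create genuinely paired observations.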

In [83]:
# Sample 100 rows from each class so that all groups have the same length

ship_ist = ship_male_first.sample(n=100)
ship_2nd = ship_male_second.sample(n=100)
ship_3rd = ship_male_third.sample(n=100)

print("The number of instances and labels in ship_male_first are",ship_ist.shape)
print("The number of instances and labels in ship_male_second are",ship_2nd.shape)
print("The number of instances and labels in ship_male_third are",ship_3rd.shape)

The number of instances and labels in ship_male_first are (100, 15)
The number of instances and labels in ship_male_second are (100, 15)
The number of instances and labels in ship_male_third are (100, 15)


In [84]:
from scipy.stats import ttest_rel

# Apply the paired test to compare class 1 and class 3
# Note: age contains missing values, so the result below is NaN
stat, p = ttest_rel(ship_ist['age'], ship_3rd['age'])
print("stat=", stat)
print("p=", p)
if p > 0.05:
    print("Probably the same mean")
else:
    print("Probably a different mean")

stat= nan
p= nan
Probably a different mean
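
For a paired test, missing values need to be dropped pair-wise so that the two samples stay aligned and equal in length. A minimal sketch with toy values (the `pairs` DataFrame and its column names are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

# Toy paired values with gaps in both columns (assumption: illustration only)
pairs = pd.DataFrame({
    "first": [40.0, np.nan, 52.0, 36.0, 61.0, 45.0],
    "third": [22.0, 30.0, np.nan, 19.0, 28.0, 33.0],
})

# Keep only the rows where both members of the pair are present
complete = pairs.dropna()
print("pairs kept:", len(complete), "of", len(pairs))

stat, p = ttest_rel(complete["first"].to_numpy(), complete["third"].to_numpy())
print("stat=%.3f, p=%.3f" % (stat, p))
```

Dropping missing values from each column separately would shift the rows against each other and could leave the two samples with different lengths.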

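The `nan` values appear because the `age` column contains missing values (`ship1.describe()` reports only 714 non-missing ages), not because the data is not normally distributed. SciPy's t-test functions propagate missing values by default, so drop them first with `dropna()` or pass `nan_policy="omit"`.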