## Student's t-tests
1. One-sample t-test
2. Two-sample t-test
   1. Unpaired (independent) t-test
   2. Paired (dependent/related) t-test

### 1. One sample t-test
Tests whether the mean of a single sample differs from a known (hypothesized) value.

**Assumptions**
1. Observations in the sample are independent and identically distributed (iid).
2. Observations in the sample are normally distributed.

**Interpretation**

H0: The sample mean is equal to the known value. \
H1: The sample mean is not equal to the known value.
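
As a minimal sketch of what this test computes: the statistic is t = (x̄ − μ0) / (s / √n), evaluated against a t distribution with n − 1 degrees of freedom. The snippet below works it out by hand for a small sample and compares the result with `scipy.stats.ttest_1samp`. The sample values and the hypothesized mean `mu0` are illustrative assumptions, not results from the dataset.

```python
import numpy as np
from scipy import stats

# Illustrative sample and hypothesized mean (assumptions, for demonstration only)
sample = np.array([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0])
mu0 = 30.0

n = sample.size
mean = sample.mean()
sd = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t statistic: distance of the sample mean from mu0 in standard-error units
t_manual = (mean - mu0) / (sd / np.sqrt(n))
# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print('manual: stat=%.3f, p=%.3f' % (t_manual, p_manual))
print('scipy:  stat=%.3f, p=%.3f' % (t_scipy, p_scipy))
```

Both lines should print the same values, since `ttest_1samp` performs a two-sided test by default.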

### One-sample t-test

In [1]:

# import libraries

import seaborn as sns
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp

In [2]:
# load data set

boat = sns.load_dataset('titanic')
boat.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
# Drop rows with missing values in age or fare
boat.dropna(subset=['age','fare'], axis=0, inplace=True)
boat.isna().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           530
embark_town      2
alive            0
alone            0
dtype: int64

In [4]:
# taking subsets

boat_subset = boat[['sex', 'age']]
boat_subset.head()

Unnamed: 0,sex,age
0,male,22.0
1,female,38.0
2,female,26.0
3,female,35.0
4,male,35.0


In [5]:
boat_subset.describe()

Unnamed: 0,age
count,714.0
mean,29.699118
std,14.526497
min,0.42
25%,20.125
50%,28.0
75%,38.0
max,80.0


In [6]:
# Compare the mean age with a known (hypothesized) value of 45

from scipy.stats import ttest_1samp
stat, p = ttest_1samp(boat_subset['age'], 45)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
  print('No significant difference from 45 (fail to reject H0)')
else:
  print('Mean age differs significantly from 45 (reject H0)')

stat=-28.145, p=0.000
Mean age differs significantly from 45 (reject H0)


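The result above should be treated with caution: the test assumes that age is normally distributed, and that assumption has not been checked. The standard deviation is also large relative to the mean (about 14.5 against 29.7). Check the normality of the data first, then apply the test.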
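
One way to check the normality assumption before running a t-test is the Shapiro-Wilk test from `scipy.stats`. The sketch below uses synthetic data as a stand-in for the age column (an assumption, so that the snippet runs without the dataset); in the notebook you would pass `boat_subset['age']` instead.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic stand-in for the age column (assumption: not the real Titanic data)
rng = np.random.default_rng(0)
ages = rng.normal(loc=30, scale=14, size=200)

# H0 of the Shapiro-Wilk test: the data come from a normal distribution
w_stat, p_norm = shapiro(ages)
print('W=%.3f, p=%.3f' % (w_stat, p_norm))
if p_norm > 0.05:
    print('No evidence against normality (fail to reject H0)')
else:
    print('Data deviate from normality (reject H0)')
```

With large samples this test flags even small departures from normality, so a histogram or Q-Q plot is a useful companion check.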

### 2. Two sample t-test
**Independent/unpaired student's t-test**

**Assumptions**
1. Observations in each sample are independent and identically distributed (iid).
2. Observations in each sample are normally distributed.
3. Observations in each sample have the same variance.

**Interpretation**

**H0:** The means of the two samples are equal. \
**H1:** The means of the two samples are not equal.

> In a two-sample t-test we compare two different samples, e.g. the age or fare of male and female passengers.

In [7]:
# compare the age of male vs. female passengers

# splitting data sets
df_male = boat_subset.loc[boat_subset['sex']=='male']
df_female = boat_subset.loc[boat_subset['sex']=='female']

In [8]:
# import library

from scipy.stats import ttest_ind
stat, p = ttest_ind(df_male['age'], df_female['age'])
print('stat=%.3f, p=%.3f' % (stat, p))

# Print a plain-language interpretation of the p-value
if p > 0.05:
  print('No significant difference in means (fail to reject H0)')
else:
  print('Means differ significantly (reject H0)')

stat=2.499, p=0.013
Means differ significantly (reject H0)


> The result above shows that the mean ages of male and female passengers differ significantly (p = 0.013). The summary statistics show the same difference.

In [9]:
# let's look at the summary statistics (EDA)

df_male.describe()

Unnamed: 0,age
count,453.0
mean,30.726645
std,14.678201
min,0.42
25%,21.0
50%,29.0
75%,39.0
max,80.0


In [10]:
df_female.describe()

Unnamed: 0,age
count,261.0
mean,27.915709
std,14.110146
min,0.75
25%,18.0
50%,27.0
75%,37.0
max,63.0
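
The summary statistics above are enough to reproduce the test without the raw data. The sketch below passes the means, standard deviations, and counts from the two `describe()` outputs to `scipy.stats.ttest_ind_from_stats`. It also runs Welch's variant (`equal_var=False`), which is offered here as an assumption-light alternative when the equal-variance assumption is in doubt; it is not part of the original analysis.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the describe() outputs above
male_mean, male_std, male_n = 30.726645, 14.678201, 453
female_mean, female_std, female_n = 27.915709, 14.110146, 261

# Pooled-variance (Student's) t-test, the default behaviour of ttest_ind
stat_pooled, p_pooled = ttest_ind_from_stats(
    male_mean, male_std, male_n,
    female_mean, female_std, female_n,
    equal_var=True,
)

# Welch's t-test, which does not assume equal variances
stat_welch, p_welch = ttest_ind_from_stats(
    male_mean, male_std, male_n,
    female_mean, female_std, female_n,
    equal_var=False,
)

print('pooled: stat=%.3f, p=%.3f' % (stat_pooled, p_pooled))
print('welch:  stat=%.3f, p=%.3f' % (stat_welch, p_welch))
```

The pooled line should match the `stat=2.499, p=0.013` reported by `ttest_ind` earlier, because `describe()` reports the sample standard deviation that this function expects.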


### 2. Two sample t-test
**Paired/dependent/related Student's t-test**\
Tests whether the means of two paired samples are significantly different.

**Assumptions**
1. Pairs of observations are independent of one another and identically distributed (iid).
2. The differences between paired observations are normally distributed.
3. Observations across the two samples are paired.

**Interpretation**

**H0:** The means of samples are equal.\
**H1:** The means of samples are unequal.
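
To make "paired" concrete, here is a minimal sketch with hypothetical before/after measurements taken on the same eight subjects (the numbers are invented for illustration). It also shows that the paired test is equivalent to a one-sample t-test on the per-subject differences against zero.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Hypothetical before/after measurements on the same 8 subjects (invented values)
before = np.array([72.0, 80.5, 68.2, 90.1, 77.3, 85.0, 69.9, 74.4])
after = np.array([70.1, 79.0, 68.9, 86.5, 75.0, 83.2, 70.3, 72.8])

# Paired t-test: element i of `before` is matched with element i of `after`
stat_rel, p_rel = ttest_rel(before, after)

# Equivalent formulation: one-sample t-test on the differences against 0
stat_diff, p_diff = ttest_1samp(before - after, 0.0)

print('paired:      stat=%.3f, p=%.3f' % (stat_rel, p_rel))
print('differences: stat=%.3f, p=%.3f' % (stat_diff, p_diff))
```

Because the pairing is positional, shuffling one of the arrays changes the result, which is why the test only makes sense for genuinely matched observations.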

In [11]:
# For the paired t-test example we compare the ages of male passengers in first class and second class.
# Note: a paired test strictly needs matched observations; these two groups are used only to demonstrate the function.
df_male.head()

Unnamed: 0,sex,age
0,male,22.0
4,male,35.0
6,male,54.0
7,male,2.0
12,male,20.0


In [12]:
# The subset above only has the sex and age columns, but we need the class column to compare, so let's use the main dataframe

df_male = boat.loc[boat['sex']=='male']
df_male.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
12,0,3,male,20.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
13,0,3,male,39.0,1,5,31.275,S,Third,man,True,,Southampton,no,False
16,0,3,male,2.0,4,1,29.125,Q,Third,child,False,,Queenstown,no,False
20,0,2,male,35.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
21,1,2,male,34.0,0,0,13.0,S,Second,man,True,D,Southampton,yes,True
23,1,1,male,28.0,0,0,35.5,S,First,man,True,A,Southampton,yes,True


In [13]:
# split male passengers into one dataframe per class

df_male_first = df_male.loc[df_male['class']=='First']
df_male_second = df_male.loc[df_male['class']=='Second']
df_male_third = df_male.loc[df_male['class']=='Third']

In [14]:
df_male_first.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
23,1,1,male,28.0,0,0,35.5,S,First,man,True,A,Southampton,yes,True
27,0,1,male,19.0,3,2,263.0,S,First,man,True,C,Southampton,no,False
30,0,1,male,40.0,0,0,27.7208,C,First,man,True,,Cherbourg,no,True
34,0,1,male,28.0,1,0,82.1708,C,First,man,True,,Cherbourg,no,False


In [15]:
df_male_second.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
20,0,2,male,35.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
21,1,2,male,34.0,0,0,13.0,S,Second,man,True,D,Southampton,yes,True
33,0,2,male,66.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
70,0,2,male,32.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
72,0,2,male,21.0,0,0,73.5,S,Second,man,True,,Southampton,no,True


In [16]:
df_male_third.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
12,0,3,male,20.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
13,0,3,male,39.0,1,5,31.275,S,Third,man,True,,Southampton,no,False


In [17]:
# import library

from scipy.stats import ttest_rel
# Apply test to compare class First and Third

stat, p = ttest_rel(df_male_first['age'], df_male_third['age'])
print('stat=%.3f, p=%.3f' % (stat, p))

# Print a plain-language interpretation of the p-value
if p > 0.05:
  print('No significant difference in means (fail to reject H0)')
else:
  print('Means differ significantly (reject H0)')

ValueError: unequal length arrays

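> The code above raises an error because `ttest_rel` requires both samples to have the same number of rows/instances: every observation in one sample must be paired with an observation in the other.
> Now we'll take samples of first and second class with an equal number of rows to compare. These rows are different passengers and therefore not truly paired, so the result only demonstrates how the function is called.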

In [None]:
print("Total number of instances in First Class Are:" , df_male_first.shape)
print("Total number of instances in Second Class Are:" , df_male_second.shape)

Total number of instances in First Class Are: (101, 15)
Total number of instances in Second Class Are: (99, 15)


In [None]:
# df_male_second has only 99 rows, so take a random sample of 99 rows from each class
df_1st =  df_male_first.sample(n=99)
df_2nd =  df_male_second.sample(n=99)

In [None]:
# Now let's run the test again

# import library

from scipy.stats import ttest_rel
# Apply test to compare class First and Second

stat, p = ttest_rel(df_1st['age'], df_2nd['age'])
print('stat=%.3f, p=%.3f' % (stat, p))

# Print a plain-language interpretation of the p-value
if p > 0.05:
  print('No significant difference in means (fail to reject H0)')
else:
  print('Means differ significantly (reject H0)')

stat=4.741, p=0.000
Means differ significantly (reject H0)
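
First- and second-class passengers are different people, so their ages are not naturally paired. One alternative worth considering is the independent test, which accepts samples of different lengths and so needs no subsampling. The sketch below uses synthetic ages (the means and spread are arbitrary assumptions, chosen only so that the snippet runs on its own); in the notebook you would pass `df_male_first['age']` and `df_male_second['age']`.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic ages standing in for the two classes (assumption: not the real data).
# The group sizes 101 and 99 match the shapes printed above.
rng = np.random.default_rng(42)
ages_first = rng.normal(loc=41, scale=15, size=101)
ages_second = rng.normal(loc=31, scale=15, size=99)

# ttest_ind accepts samples of different lengths, so no subsampling is needed
stat_ind, p_ind = ttest_ind(ages_first, ages_second)
print('stat=%.3f, p=%.3f' % (stat_ind, p_ind))

if p_ind > 0.05:
    print('No significant difference in means (fail to reject H0)')
else:
    print('Means differ significantly (reject H0)')
```

Unlike the random `sample(n=99)` approach, this uses every available row and gives the same answer on every run for the same data.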
