## Hypothesis testing in Python

In this workbook we explore basic hypothesis tests for comparing means and proportions between two or more groups.

We will use hotel bookings data from five different cities for this exercise. The data is taken from [Agoda](https://www.agoda.com/). See the data dictionary in workbook 1.

**Part 1 : Comparing two groups**

In this part we will compare two samples and test whether their means or proportions are significantly different using standard t-tests.

**Part 2 : Comparing more than two groups**

In this part we will compare more than two samples and test if their means or proportions are significantly different using ANOVA.
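
As a preview of Part 2, the sketch below runs a one-way ANOVA with `scipy.stats.f_oneway` on three synthetic groups. The group means, spreads and sizes are made up for illustration and are not taken from the hotel bookings data.

```python
import numpy as np
from scipy import stats

# synthetic, illustrative data (not the hotel bookings data)
rng = np.random.default_rng(0)
group_1 = rng.normal(loc=100, scale=10, size=50)
group_2 = rng.normal(loc=100, scale=10, size=50)
group_3 = rng.normal(loc=120, scale=10, size=50)

# H0: all group means are equal; H1: at least one group mean differs
f_stat, p_value = stats.f_oneway(group_1, group_2, group_3)

print('F-statistic :', f_stat)
print('p-value :', p_value)
```

A small p-value here only says that at least one group mean differs; it does not say which one.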

In [1]:
##################################
# Import libraries and data
##################################

# libraries
import os
import time
import numpy as np
import pandas as pd
import scipy as sp
import statsmodels.api as sm
from statsmodels.formula.api import ols
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set(style='darkgrid', font_scale=1.5)

# data
city_a = pd.read_csv('./data/city_a.csv')
city_a['city_name'] = 'city_a'
city_b = pd.read_csv('./data/city_b.csv')
city_b['city_name'] = 'city_b'
city_c = pd.read_csv('./data/city_c.csv')
city_c['city_name'] = 'city_c'
city_d = pd.read_csv('./data/city_d.csv')
city_d['city_name'] = 'city_d'
city_e = pd.read_csv('./data/city_e.csv')
city_e['city_name'] = 'city_e'

# DataFrame.append was removed in pandas 2.0, so stack the cities with pd.concat
all_city = pd.concat([city_a, city_b, city_c, city_d, city_e]).reset_index(drop=True)

# convert dates to correct format
all_city['booking_date'] = pd.to_datetime(all_city['booking_date'],dayfirst=True)
all_city['checkin_date'] = pd.to_datetime(all_city['checkin_date'],dayfirst=True)
all_city['checkout_date'] = pd.to_datetime(all_city['checkout_date'],dayfirst=True)

#rename accommodation name
all_city = all_city.rename(columns={'accommodation_type_name':'acc_name'})

# add some derived variables
all_city['bc_gap'] = (all_city['checkin_date'] - all_city['booking_date']).dt.days
all_city['stay_dur'] = (all_city['checkout_date'] - all_city['checkin_date']).dt.days
all_city['month'] = all_city['checkin_date'].dt.month

all_city.head()

Unnamed: 0,book_id,ADR_USD,hotel_id,city_id,star_rating,acc_name,chain_hotel,booking_date,checkin_date,checkout_date,checkin_day,city_name,bc_gap,stay_dur,month
0,1,71.06,297388,9395,2.5,Hotel,non-chain,2016-08-02,2016-10-01,2016-10-02,weekend,city_a,60,1,10
1,2,76.56,298322,9395,3.0,Hotel,non-chain,2016-08-02,2016-10-01,2016-10-02,weekend,city_a,60,1,10
2,3,153.88,2313076,9395,5.0,Hotel,chain,2016-08-02,2016-10-01,2016-10-02,weekend,city_a,60,1,10
3,4,126.6,2240838,9395,3.5,Hotel,non-chain,2016-08-04,2016-10-02,2016-10-03,weekend,city_a,59,1,10
4,5,115.08,2240838,9395,3.5,Hotel,non-chain,2016-08-04,2016-10-02,2016-10-03,weekend,city_a,59,1,10


### Part 1 : Comparing two samples

 - comparing mean of a sample with population mean - 1 sample test
 - comparing means of two independent samples with each other
 - comparing means of two paired samples with each other
 - comparing proportions of two samples with each other
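
One common choice for the last item, comparing proportions, is a two-proportion z-test. The sketch below implements the pooled version by hand using only the standard library. The function name `two_prop_ztest` and the counts are hypothetical and are not taken from the bookings data.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ztest(success_1, n_1, success_2, n_2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p_1 = success_1 / n_1
    p_2 = success_2 / n_2
    # pooled proportion under the null hypothesis
    p_pool = (success_1 + success_2) / (n_1 + n_2)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z_stat = (p_1 - p_2) / std_err
    # two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value

# hypothetical counts, e.g. chain-hotel bookings out of all bookings in two cities
z_stat, p_value = two_prop_ztest(success_1=120, n_1=400, success_2=90, n_2=400)
print('z-statistic :', z_stat)
print('p-value :', p_value)
```

The normal approximation behind this test is only reasonable when both samples contain a fair number of successes and failures.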

In [43]:
###########################################################
# 1. One-sample test : comparing sample mean to population
# we will compare the mean ADR of one of the cities to the global
# ADR mean and see if it is significantly different or not
###########################################################

def one_samp_ttest(data, mu, conf):
    # run the test once and reuse the result
    t_stat, p_value = stats.ttest_1samp(data, mu)

    print('***************************')
    print('Sample Mean :', np.round(np.mean(data),2), 'Population mean :', np.round(mu,2))
    print('T-statistics : ', t_stat)
    print('p-value : ', p_value)

    # reject the null hypothesis when the p-value is below alpha = 1 - conf;
    # failing to reject means no significant difference was detected,
    # which is weaker than proving the two means are equal
    if p_value < (1-conf):
        print('Null hypothesis rejected, the sample mean is not the same as the population mean!')
    else:
        print('Null hypothesis cannot be rejected, the sample mean is the same as the population mean!')
    print('\n')
        
        
pop_mean = np.mean(all_city['ADR_USD'])

print('Test the population mean to itself:')
one_samp_ttest(all_city['ADR_USD'], pop_mean, 0.95)

print('Test the population mean to City A mean:')
one_samp_ttest(all_city[all_city['city_name']=='city_a']['ADR_USD'], pop_mean, 0.95)

print('Test the population mean to City B mean:')
one_samp_ttest(all_city[all_city['city_name']=='city_b']['ADR_USD'], pop_mean, 0.95)

Test the population mean to itself:
***************************
Sample Mean : 148.09 Population mean : 148.09
T-statistics :  -1.069887195320358e-12
p-value :  0.9999999999991463
Null hypothesis cannot be rejected, the sample mean is the same as the population mean!


Test the population mean to City A mean:
***************************
Sample Mean : 100.51 Population mean : 148.09
T-statistics :  -106.94002311853069
p-value :  0.0
Null hypothesis rejected, the sample mean is not the same as the population mean!


Test the population mean to City B mean:
***************************
Sample Mean : 118.32 Population mean : 148.09
T-statistics :  -15.588383416613846
p-value :  1.6235661509189316e-53
Null hypothesis rejected, the sample mean is not the same as the population mean!
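
To make the T-statistics above less of a black box: the one-sample statistic is `t = (sample_mean - mu) / (s / sqrt(n))`, where `s` is the sample standard deviation and `n` the sample size. A minimal sketch with the standard library follows. The helper name `one_samp_t_stat` and the numbers are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

def one_samp_t_stat(sample, mu):
    """t = (sample mean - mu) / (sample std / sqrt(n))."""
    n = len(sample)
    std_err = stdev(sample) / sqrt(n)   # stdev uses the n - 1 denominator
    return (mean(sample) - mu) / std_err

# illustrative values, not taken from the bookings data
rates = [95.0, 102.5, 110.0, 98.0, 105.5]

print(one_samp_t_stat(rates, mu=100.0))
# testing a sample against its own mean gives a numerator of zero
print(one_samp_t_stat(rates, mu=mean(rates)))
```

This is why the first test above, which compares the full data set with its own mean, reports a T-statistic of essentially zero and a p-value of essentially 1.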




In [47]:
###########################################################
# 2. Two-sample test : comparing means of two samples
# we will compare the mean ADR of two different cities
# and see if they are significantly different or not
###########################################################

def two_samp_ttest(data1, data2, conf, pair=0):
    print('***************************')
    print('Sample 1 Mean :', np.round(np.mean(data1),2), 'Sample 2 mean :', np.round(np.mean(data2),2))

    # pair == 0 : independent samples, otherwise paired samples
    # (a paired test needs two samples of equal length)
    if pair == 0:
        t_stat, p_value = stats.ttest_ind(data1, data2)
    else:
        t_stat, p_value = stats.ttest_rel(data1, data2)

    print('T-statistics : ', t_stat)
    print('p-value : ', p_value)

    # reject the null hypothesis when the p-value is below alpha = 1 - conf
    if p_value < (1-conf):
        print('Null hypothesis REJECTED, the two sample means are not the same!')
    else:
        print('Null hypothesis NOT REJECTED, the two sample means are the same!')

    print('\n')

     
print('Test the population mean to itself:')
two_samp_ttest(all_city['ADR_USD'], all_city['ADR_USD'], 0.95)

# note: city A is a subset of all_city, so these two samples are not
# independent; treat this comparison as illustrative only
print('Test the population mean to City A mean:')
two_samp_ttest(all_city[all_city['city_name']=='city_a']['ADR_USD'], all_city['ADR_USD'], 0.95)

print('Test the City A mean to City B mean:')
two_samp_ttest(all_city[all_city['city_name']=='city_a']['ADR_USD'],
               all_city[all_city['city_name']=='city_b']['ADR_USD'], 0.95)

Test the population mean to itself:
***************************
Sample 1 Mean : 148.09 Sample 2 mean : 148.09
T-statistics :  0.0
p-value :  1.0
Null hypothesis NOT REJECTED, the two sample means are the same!


Test the population mean to City A mean:
***************************
Sample 1 Mean : 100.51 Sample 2 mean : 148.09
T-statistics :  -49.90025333439419
p-value :  0.0
Null hypothesis REJECTED, the two sample means are not the same!


Test the City A mean to City B mean:
***************************
Sample 1 Mean : 100.51 Sample 2 mean : 118.32
T-statistics :  -13.6519632610871
p-value :  2.7064790045250752e-42
Null hypothesis REJECTED, the two sample means are not the same!
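
By default `stats.ttest_ind` assumes the two groups have equal variances. When that assumption is doubtful, passing `equal_var=False` requests Welch's t-test instead. The sketch below compares the two on synthetic samples with deliberately unequal spreads and sizes; the numbers are illustrative and unrelated to the bookings data.

```python
import numpy as np
from scipy import stats

# synthetic samples with different spreads and sizes (illustrative only)
rng = np.random.default_rng(1)
sample_1 = rng.normal(loc=100, scale=10, size=200)
sample_2 = rng.normal(loc=105, scale=40, size=50)

pooled = stats.ttest_ind(sample_1, sample_2)                  # equal variances assumed
welch = stats.ttest_ind(sample_1, sample_2, equal_var=False)  # Welch's t-test

print('Pooled : t =', pooled.statistic, ', p =', pooled.pvalue)
print('Welch  : t =', welch.statistic, ', p =', welch.pvalue)
```

Whether the city samples in this workbook need the Welch variant depends on how different their ADR variances are, which is worth checking before relying on the pooled test.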


