# Hypothesis Testing

## ANOVA: Analysis of Variance (F test)

- The F test checks whether two or more samples have the same mean.

    Example: A bank launches a marketing campaign for a credit card product, and the marketing team wants to measure its impact. They assume that credit card sales and usage have improved since the campaign.
    To validate this assumption, we can compare a sample of credit card sales from before the campaign with a sample from during or after it.

### Hypothesis Testing

- Null Hypothesis (H0) = the means of all samples are equal
- Alternate Hypothesis (Ha) = at least one sample mean differs from the others

The F test returns an F statistic and a p-value.

Choose a confidence level of 95%, which gives significance level (alpha) = 1 - 0.95 = 0.05

 - if pvalue > alpha = we fail to reject the Null Hypothesis
 - if pvalue < alpha = we reject the Null Hypothesis
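The decision rule above can be sketched as a small helper. The function name `decide` and the two sample p-values are illustrative assumptions, not part of the notebook.

```python
def decide(pvalue, alpha=0.05):
    """Apply the rule above: reject H0 only when the p-value is below alpha."""
    if pvalue < alpha:
        return "reject H0"
    return "fail to reject H0"

# hypothetical p-values, for illustration only
print(decide(0.20))   # fail to reject H0
print(decide(0.003))  # reject H0
```

Failing to reject H0 does not prove that the means are equal; it only says the sample gives no evidence against it at the chosen alpha.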

In [1]:
import pandas as pd
df = pd.read_csv("datasets-1/Bank_churn_modelling.csv")
df.shape

(10000, 14)

In [2]:
df.head(2)

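|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |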


In [6]:
df.isnull().sum()

RowNumber           0
CustomerId          0
Surname             0
CreditScore         0
Geography           0
Gender              0
Age                 0
Tenure             15
Balance             0
NumOfProducts       0
HasCrCard           0
IsActiveMember      0
EstimatedSalary    10
Exited              0
dtype: int64

In [7]:
# drop the rows having missing values
df = df.dropna()

In [9]:
# Age of customers who did not exit (Exited == 0) and who exited (Exited == 1)
ageNE = df['Age'][df.Exited==0]
ageYE = df['Age'][df.Exited==1]

In [10]:
from scipy import stats
# one-way ANOVA: do the two groups have the same mean Age?
anova = stats.f_oneway(ageNE.values,ageYE.values)
print(anova)

F_onewayResult(statistic=878.3586817494576, pvalue=4.472631388964089e-185)
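To make the F statistic reported by `stats.f_oneway` less of a black box, here is a minimal sketch of the textbook one-way ANOVA computation in plain Python. The helper name `f_statistic` and the toy data are assumptions made for illustration; they are not taken from the bank dataset.

```python
from statistics import mean

def f_statistic(*groups):
    """One-way ANOVA F statistic: between-group mean square over within-group mean square."""
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares: spread of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# toy data, made up for illustration
group_a = [1, 2, 3]
group_b = [2, 3, 4]
print(f_statistic(group_a, group_b))  # 1.5
```

On the same inputs, `stats.f_oneway(group_a, group_b).statistic` should give the same value. The p-value additionally requires the F distribution with (k - 1, N - k) degrees of freedom, which is what SciPy evaluates.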

In [11]:
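# numeric features to test against the binary target
xd = df[['Age','EstimatedSalary','Balance','Tenure','CreditScore']]
y = df['Exited']

from sklearn.feature_selection import f_classif
# f_classif computes the ANOVA F value of each feature, grouping rows by y
fscore, pvalue = f_classif(xd,y)


for col, p in zip(xd.columns, pvalue):
    print(col, " : ", p)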

Age  :  4.472631388982903e-185
EstimatedSalary  :  0.20424577910568992
Balance  :  4.401968313402577e-33
Tenure  :  0.15316470497628182
CreditScore  :  0.008029088334075214
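Reading these p-values against alpha = 0.05 can be sketched as below. The values are copied (rounded) from the output above; the variable names are illustrative.

```python
alpha = 0.05

# p-values copied from the f_classif output above, rounded
pvalues = {
    "Age": 4.47e-185,
    "EstimatedSalary": 0.204,
    "Balance": 4.40e-33,
    "Tenure": 0.153,
    "CreditScore": 0.00803,
}

for name, p in pvalues.items():
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name:16s} {verdict}")

# features whose mean differs between exited and non-exited customers
significant = [name for name, p in pvalues.items() if p < alpha]
```

Under this rule, the mean of Age, Balance and CreditScore differs between customers who exited and those who did not, while EstimatedSalary and Tenure show no significant difference.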


## Chi Square Test
- Used to check whether a categorical attribute has the same distribution of categories across two or more groups.

For example: A bank launches a new insurance product and wants to know whether, among all customers who made an inquiry, gender played an important role in the decision to buy.
- Statistical definition: test whether the ratio of male to female customers is the same among customers who purchased the insurance product and customers who did not.

### Hypothesis Testing

- Null Hypothesis (H0) = the distribution of categories is the same in all groups
- Alternate Hypothesis (Ha) = the distribution of categories differs in at least one group

The chi-square test returns a chi-square statistic and a p-value.

Choose a confidence level of 95%, which gives significance level (alpha) = 1 - 0.95 = 0.05

 - if pvalue > alpha = we fail to reject the Null Hypothesis
 - if pvalue < alpha = we reject the Null Hypothesis

In [12]:
df.head(2)

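|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |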


In [13]:
# .copy() returns an independent DataFrame, so the assignment below
# does not raise pandas' SettingWithCopyWarning
xd = df[['HasCrCard','Gender','IsActiveMember']].copy()
y = df['Exited']

print(xd['Gender'].value_counts())
# encode Gender as 0 = Female, 1 = Male
xd['Gender'] = xd['Gender'].apply(lambda x:0 if x=="Female" else 1)
print(xd['Gender'].value_counts())

Gender
Male      5444
Female    4532
Name: count, dtype: int64
Gender
1    5444
0    4532
Name: count, dtype: int64

In [14]:
from sklearn.feature_selection import chi2

# chi2 expects non-negative feature values (here, 0/1 indicators)
chiscore,pvalue = chi2(xd,y)

for col, p in zip(xd.columns, pvalue):
    print(col, " : ", p)

HasCrCard  :  0.6514238750012205
Gender  :  1.1331858984946494e-12
IsActiveMember  :  1.3401066663184691e-27


In [15]:
print(xd['Gender'].value_counts(normalize=True))

Gender
1    0.54571
0    0.45429
Name: proportion, dtype: float64


In [18]:
# for exited: observed 
genYE = xd['Gender'][y==1]
print(genYE.count())
genYE.value_counts(normalize=True)

2031


Gender
0    0.558346
1    0.441654
Name: proportion, dtype: float64

In [19]:
# for NOT exited: observed 
genNE = xd['Gender'][y==0]
print(genNE.count())
genNE.value_counts(normalize=True)

7945


Gender
1    0.57231
0    0.42769
Name: proportion, dtype: float64
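The observed Gender proportions for the two groups can be fed into a classical contingency-table chi-square test. The sketch below is written in plain Python and makes one assumption: the counts are reconstructed from the printed group sizes and proportions (2031 × 0.558346 ≈ 1134 females among exited customers, 7945 × 0.57231 ≈ 4547 males among the rest). The helper name `chi_square` is illustrative.

```python
import math

# Counts reconstructed from the printed group sizes and proportions (an assumption).
# Rows: exited, not exited. Columns: Female (0), Male (1).
observed = [
    [1134, 897],
    [3398, 4547],
]

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # expected count if Gender and Exited were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

stat = chi_square(observed)
# A 2x2 table has 1 degree of freedom; in that case the upper-tail
# probability of the chi-square distribution equals erfc(sqrt(x / 2)).
pvalue = math.erfc(math.sqrt(stat / 2))
print(stat > 3.841, pvalue < 0.05)  # True True
```

Here 3.841 is the critical value of the chi-square distribution with 1 degree of freedom at alpha = 0.05. Note that `sklearn.feature_selection.chi2` treats each feature as a non-negative count instead of building a full contingency table, so its statistic and p-value differ from this one, although the conclusion for Gender is the same. `scipy.stats.chi2_contingency` applies a continuity correction to 2x2 tables by default, so it would return a slightly smaller statistic.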

In [21]:
xd['HasCrCard'].value_counts(normalize=True)

HasCrCard
1    0.705694
0    0.294306
Name: proportion, dtype: float64

In [20]:
# for HasCrCard
hccYE = xd['HasCrCard'][y==1]
hccYE.value_counts(normalize=True)

HasCrCard
1    0.698178
0    0.301822
Name: proportion, dtype: float64

In [22]:
hccNE = xd['HasCrCard'][y==0]
hccNE.value_counts(normalize=True)


HasCrCard
1    0.707615
0    0.292385
Name: proportion, dtype: float64

In [23]:
xd['IsActiveMember'].value_counts(normalize=True)

IsActiveMember
1    0.515136
0    0.484864
Name: proportion, dtype: float64

In [24]:
# for IsActiveMember
iamYE = xd['IsActiveMember'][y==1]
iamYE.value_counts(normalize=True)

IsActiveMember
0    0.639586
1    0.360414
Name: proportion, dtype: float64

In [25]:
# for IsActiveMember
iamNE = xd['IsActiveMember'][y==0]
iamNE.value_counts(normalize=True)

IsActiveMember
1    0.554688
0    0.445312
Name: proportion, dtype: float64