# Hypothesis testing

In statistics, exploring the data visually is not enough; we also need to run statistical tests to confirm or reject our hypotheses. Here we focus on non-parametric tests: the chi-square test of independence for the categorical variables and the Kruskal-Wallis test for the numerical ones, each tested against our target variable, `Churn`.

# Preparing the environment

In [30]:
import pandas as pd
import sys
sys.path.append('../ecommerce_customer_churn_prevention')
from utils import paths, hypothesis_testing

# Importing the data

In [31]:
df = pd.read_csv(paths.data_interim_dir('df_etl_processed.csv'))
df.head()

Unnamed: 0,CustomerID,Churn,Tenure,PreferredLoginDevice,CityTier,WarehouseToHome,PreferredPaymentMode,Gender,HourSpendOnApp,NumberOfDeviceRegistered,PreferedOrderCat,SatisfactionScore,MaritalStatus,NumberOfAddress,Complain,OrderAmountHikeFromlastYear,CouponUsed,OrderCount,DaySinceLastOrder,CashbackAmount
0,50001,1,4.0,Mobile Phone,3,6.0,Debit Card,Female,3.0,3,Laptop & Accessory,2,Single,9,1,11.0,1.0,1.0,5.0,159.93
1,50002,1,,Mobile Phone,1,8.0,UPI,Male,3.0,4,Mobile Phone,3,Single,7,1,15.0,0.0,1.0,0.0,120.9
2,50003,1,,Mobile Phone,1,30.0,Debit Card,Male,2.0,4,Mobile Phone,3,Single,6,1,14.0,0.0,1.0,3.0,120.28
3,50004,1,0.0,Mobile Phone,3,15.0,Debit Card,Male,2.0,4,Laptop & Accessory,5,Single,8,0,23.0,0.0,1.0,3.0,134.07
4,50005,1,0.0,Mobile Phone,1,12.0,Credit Card,Male,,3,Mobile Phone,5,Single,3,0,11.0,1.0,1.0,3.0,129.6


In [32]:
# Cast the features that the data dictionary defines as categorical

cat_features = ['PreferredLoginDevice', 'CityTier', 'PreferredPaymentMode', 'Gender', 'PreferedOrderCat', 'MaritalStatus', 'Complain']
# Every other column is numerical, except the target and the customer identifier
excluded = set(cat_features) | {'Churn', 'CustomerID'}
num_features = [col for col in df.columns if col not in excluded]

df[cat_features] = df[cat_features].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   CustomerID                   5630 non-null   int64   
 1   Churn                        5630 non-null   int64   
 2   Tenure                       5366 non-null   float64 
 3   PreferredLoginDevice         5630 non-null   category
 4   CityTier                     5630 non-null   category
 5   WarehouseToHome              5379 non-null   float64 
 6   PreferredPaymentMode         5630 non-null   category
 7   Gender                       5630 non-null   category
 8   HourSpendOnApp               5375 non-null   float64 
 9   NumberOfDeviceRegistered     5630 non-null   int64   
 10  PreferedOrderCat             5630 non-null   category
 11  SatisfactionScore            5630 non-null   int64   
 12  MaritalStatus                5630 non-null   category
 13  Num

# Applying hypothesis testing

In [33]:
# The test helpers append each column name to one of these two lists
sig_cols = []
no_sig_cols = []
for col in cat_features:
    hypothesis_testing.chi2_independence_test(df, 'Churn', col, sig_col=sig_cols, no_sig_col=no_sig_cols)

--------------------------------------------------------------------------------
                 test    lambda       chi2  dof      pval    cramer     power
0             pearson  1.000000  14.401253  1.0  0.000148  0.050576  0.966742
1        cressie-read  0.666667  14.278441  1.0  0.000158  0.050360  0.965523
2      log-likelihood  0.000000  14.046652  1.0  0.000178  0.049950  0.963106
3       freeman-tukey -0.500000  13.884409  1.0  0.000194  0.049660  0.961320
4  mod-log-likelihood -1.000000  13.731654  1.0  0.000211  0.049386  0.959565
5              neyman -2.000000  13.453089  1.0  0.000245  0.048883  0.956171
Reject null hypothesis: There is a statistically significant difference between PreferredLoginDevice and Churn
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
                 test    lambda       chi2  dof          pval    cramer  \
0             pearson  1.
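
As a sanity check on the first table above, the effect size and p-value in the Pearson row can be recomputed by hand from the reported statistic. This is a minimal sketch, not the code inside `hypothesis_testing`: it assumes the contingency table is 2 x 2 (which `dof = 1` implies) and that all 5630 rows were used. The closed-form p-value only holds for one degree of freedom.

```python
import math

# Values copied from the Pearson row reported for PreferredLoginDevice
chi2 = 14.401253
n_obs = 5630            # rows in the DataFrame, per df.info()
n_rows, n_cols = 2, 2   # assumption: dof = 1 implies a 2 x 2 table

# Cramér's V = sqrt(chi2 / (n * min(r - 1, c - 1)))
cramers_v = math.sqrt(chi2 / (n_obs * min(n_rows - 1, n_cols - 1)))

# For dof = 1 the chi-square survival function reduces to
# P(X > x) = erfc(sqrt(x / 2)); this shortcut is not valid for dof > 1.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"Cramér's V: {cramers_v:.6f}")
print(f"p-value:    {p_value:.6f}")
```

Both numbers should agree with the `cramer` and `pval` columns to the printed precision. Note that a Cramér's V of about 0.05 is a very small effect: with 5630 observations, even a weak association yields a small p-value.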

In [35]:
for col in num_features:
    hypothesis_testing.kruskal_wallis_test(df, 'Churn', col, sig_col=sig_cols, no_sig_col=no_sig_cols)

--------------------------------------------------------------------------------
        Source  ddof1           H          p-unc
Kruskal  Churn      1  878.092864  5.678503e-193
Reject null hypothesis: There is a statistically significant difference in Tenure between Churn.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
        Source  ddof1          H         p-unc
Kruskal  Churn      1  35.406192  2.676347e-09
Reject null hypothesis: There is a statistically significant difference in WarehouseToHome between Churn.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
        Source  ddof1         H    p-unc
Kruskal  Churn      1  1.460878  0.22679
Failed to reject null hypothesis: There is no statistically significant difference in HourSpendOnApp between Churn
---
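
To make the `H` column above less of a black box, the Kruskal-Wallis statistic can be sketched in plain Python. The function names and the toy samples below are hypothetical and only illustrate the rank-based computation; the helper in `hypothesis_testing` is not shown in this notebook and may work differently. The p-value shortcut assumes two groups (`ddof1 = 1`), as is the case for `Churn`.

```python
import math
from collections import Counter


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    counts = Counter(value for group in groups for value in group)
    n_total = sum(counts.values())

    # Average rank per distinct value (tied values share the mean of their ranks)
    avg_rank = {}
    position = 0
    for value, count in sorted(counts.items()):
        avg_rank[value] = position + (count + 1) / 2
        position += count

    # H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
    rank_term = sum(
        sum(avg_rank[v] for v in group) ** 2 / len(group) for group in groups
    )
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N)
    ties = sum(c ** 3 - c for c in counts.values())
    return h / (1 - ties / (n_total ** 3 - n_total))


def p_value_two_groups(h):
    """Chi-square survival function for 1 degree of freedom (two groups only)."""
    return math.erfc(math.sqrt(h / 2))


# Hypothetical toy samples, not taken from the dataset
churned = [1, 2, 3, 4]
retained = [5, 6, 7, 8, 9]
h_toy = kruskal_wallis_h([churned, retained])
print(f"H = {h_toy:.4f}, p = {p_value_two_groups(h_toy):.4f}")

# Recomputing the p-value from the H reported for HourSpendOnApp
print(f"p = {p_value_two_groups(1.460878):.5f}")
```

The last line should reproduce the `p-unc` value reported for `HourSpendOnApp`, which is why the null hypothesis is not rejected for that column.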

In [36]:
print(f"Significant columns: {sig_cols}")
print(f"No significant columns: {no_sig_cols}")

Significant columns: ['PreferredLoginDevice', 'CityTier', 'PreferredPaymentMode', 'Gender', 'PreferedOrderCat', 'MaritalStatus', 'Complain', 'Tenure', 'WarehouseToHome', 'NumberOfDeviceRegistered', 'SatisfactionScore', 'NumberOfAddress', 'OrderCount', 'DaySinceLastOrder', 'CashbackAmount']
No significant columns: ['HourSpendOnApp', 'OrderAmountHikeFromlastYear', 'CouponUsed']
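
The two lists are filled as a side effect of the test helpers. The sketch below shows one way that bookkeeping could work; `record_result` is a hypothetical function and the `alpha = 0.05` threshold is an assumption, since the notebook does not show the helper code or state the significance level. The p-values are the ones reported in the Kruskal-Wallis output.

```python
def record_result(col, p_value, sig_col, no_sig_col, alpha=0.05):
    """Append col to sig_col if p_value < alpha, otherwise to no_sig_col."""
    is_significant = p_value < alpha
    (sig_col if is_significant else no_sig_col).append(col)
    return is_significant


# p-values copied from the Kruskal-Wallis output
reported = {
    'Tenure': 5.678503e-193,
    'WarehouseToHome': 2.676347e-09,
    'HourSpendOnApp': 0.22679,
}

demo_sig, demo_no_sig = [], []
for col, p_value in reported.items():
    record_result(col, p_value, demo_sig, demo_no_sig)

print(f"Significant columns: {demo_sig}")
print(f"No significant columns: {demo_no_sig}")
```

Because the lists are mutated in place, the same pair can be passed to both the chi-square and the Kruskal-Wallis loops, which is how a single summary covers categorical and numerical features.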
