# **Hypothesis Testing**

# 1. Importing and inspecting the data

In [85]:
import pandas as pd
import numpy as np
from datetime import date
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_parquet('data/interim/data_outliers_cleared.parquet')
# Print a summary of the cleaned data (df.info() prints directly and returns None)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2235 entries, 0 to 2239
Data columns (total 46 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   ID                       2235 non-null   int64         
 1   Year_Birth               2235 non-null   int64         
 2   Income                   2235 non-null   float64       
 3   Kidhome                  2235 non-null   int64         
 4   Teenhome                 2235 non-null   int64         
 5   Dt_Customer              2235 non-null   datetime64[ns]
 6   Recency                  2235 non-null   int64         
 7   MntWines                 2235 non-null   int64         
 8   MntFruits                2235 non-null   int64         
 9   MntMeatProducts          2235 non-null   int64         
 10  MntFishProducts          2235 non-null   int64         
 11  MntSweetProducts         2235 non-null   int64         
 12  MntGoldProds             2235 non-null 

**According to the output of the cell above, the data was loaded correctly.**

# 2. Are older individuals less likely to prefer online shopping?
- Customers are divided into two age groups: those younger than 55 are "young", and those aged 55 or older are "old".
- The ratio of in-store purchases to online purchases measures a customer's tendency toward in-store shopping.
- Null hypothesis (H0): the average ratio for old people ≤ the average ratio for young people
- Alternative hypothesis (H1): the average ratio for old people > the average ratio for young people
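To make the test concrete, the following is a minimal sketch of a one-sided two-sample z-test written with the standard library only. The helper `two_sample_ztest` and the toy data are hypothetical and exist only for illustration. The sketch assumes a pooled variance estimate, so its numbers may not match `statsmodels` exactly in every configuration.

```python
from math import sqrt
from statistics import NormalDist, fmean


def two_sample_ztest(x1, x2, alternative="two-sided"):
    """Two-sample z-test for a difference in means (pooled variance).

    Hypothetical helper for illustration; it is not part of statsmodels.
    """
    n1, n2 = len(x1), len(x2)
    m1, m2 = fmean(x1), fmean(x2)
    # Pooled variance: summed squared deviations over (n1 + n2 - 2).
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    var_pooled = (ss1 + ss2) / (n1 + n2 - 2)
    std_err = sqrt(var_pooled * (1 / n1 + 1 / n2))
    z = (m1 - m2) / std_err

    cdf = NormalDist().cdf
    if alternative == "larger":       # H1: mean(x1) > mean(x2)
        p = 1 - cdf(z)
    elif alternative == "smaller":    # H1: mean(x1) < mean(x2)
        p = cdf(z)
    elif alternative == "two-sided":  # H1: mean(x1) != mean(x2)
        p = 2 * (1 - cdf(abs(z)))
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return z, p


# Toy ratios, made up for illustration only.
old = [1.0, 2.0, 3.0, 4.0]
young = [2.0, 3.0, 4.0, 5.0]
z, p = two_sample_ztest(old, young, alternative="larger")
print(f"z = {z:.4f}, p = {p:.4f}")
```

With `alternative="larger"`, a negative z-statistic always gives a p-value above 0.5, because the data point in the opposite direction from H1.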

In [86]:
import numpy as np
from statsmodels.stats.weightstats import ztest
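# Null hypothesis (H0): mu_old_people <= mu_young_people
# Alternative hypothesis (H1): mu_old_people > mu_young_people
# Create the ratio; the small constant avoids division by zero.
# Note: when NumWebPurchases is 0 the ratio becomes very large but stays finite,
# so the infinite/NaN filter below does not remove those rows.
df['instore_online_ratio'] = df['NumStorePurchases'] / (df['NumWebPurchases'] + 1e-10)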

# Remove infinite and NaN values
df_clean = df[~np.isinf(df['instore_online_ratio']) & ~df['instore_online_ratio'].isna()]

young_people = df_clean[df_clean['Age'] < 55]
old_people = df_clean[df_clean['Age'] >= 55]

# Check if we have enough data
if young_people.shape[0] > 1 and old_people.shape[0] > 1:
    try:
        zstat, pvalue = ztest(
            old_people['instore_online_ratio'].values, 
            young_people['instore_online_ratio'].values,
            alternative = 'larger'
        )
        print(f"Z-statistic: {zstat}")
        print(f"P-value: {pvalue}")

    except Exception as e:
        print(f"Error in z-test: {e}")
else:
    print("Not enough data for z-test")


Z-statistic: -1.015336371046151
P-value: 0.8450272437194916


**Results from above cell:**
- The p-value (0.845) is well above the typical significance threshold of 0.05, so we fail to reject H0: there is no statistically significant evidence that people aged 55+ have a higher in-store/online purchase ratio than people under 55.
- Age doesn't appear to be a strong predictor of whether someone prefers in-store vs online shopping (at least based on this ratio).
- This test gives no indication that the two age groups differ in their shopping channel preferences.
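One caveat about the ratio used above: with the `1e-10` offset, a customer with zero web purchases gets an extremely large but finite ratio, which can dominate the group mean. The sketch below illustrates the effect on made-up customers and compares it with a bounded alternative, the share of purchases made in store. Both helper functions and the data are hypothetical, and the share is only one possible choice, not the notebook's method.

```python
def store_to_web_ratio(store, web, eps=1e-10):
    """Ratio as computed in the notebook: store / (web + eps)."""
    return store / (web + eps)


def store_share(store, web):
    """Assumed alternative: fraction of purchases made in store (0 to 1).

    Returns None when the customer made no purchases in either channel.
    """
    total = store + web
    return store / total if total > 0 else None


# Made-up customers: (store purchases, web purchases)
customers = [(4, 2), (3, 0), (0, 5)]
for store, web in customers:
    ratio = store_to_web_ratio(store, web)
    share = store_share(store, web)
    print(f"store={store} web={web} ratio={ratio:.3g} share={share}")
```

The second customer shows the problem: the ratio is on the order of 10^10, while the share stays at 1.0.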

# 3. Do customers with children prefer online shopping due to time constraints?
- Customers are divided into two groups based on their number of children: customers with at least one child are "customers with children"; all others are "customers without children".
- The ratio of in-store purchases to online purchases measures a customer's tendency toward in-store shopping, so a lower ratio indicates a stronger tendency toward online shopping.
- Null hypothesis (H0): the average ratio for customers_with_child ≥ the average ratio for customers_no_child
- Alternative hypothesis (H1): the average ratio for customers_with_child < the average ratio for customers_no_child

In [87]:
# Split the dataset from the cell above (df_clean, which holds the in-store/online ratio) into two groups: customers_with_child and customers_no_child
customers_with_child = df_clean[df_clean['Number_children'] > 0]
customers_no_child = df_clean[df_clean['Number_children'] == 0]

# Null hypothesis (H0): mu_customers_with_child >= mu_customers_no_child
# Alternative hypothesis (H1): mu_customers_with_child < mu_customers_no_child

# Check if we have enough data
if customers_no_child.shape[0] > 1 and customers_with_child.shape[0] > 1:
    try:
        zstat, pvalue = ztest(
            customers_with_child['instore_online_ratio'].values, 
            customers_no_child['instore_online_ratio'].values,
            alternative = 'smaller'
        )
        print(f"Z-statistic: {zstat}")
        # With alternative='smaller', ztest already returns the lower-tail
        # p-value, so it is reported as-is (no 1 - pvalue adjustment).
        print(f"P-value: {pvalue:.4f}")
    except Exception as e:
        print(f"Error in z-test: {e}")
else:
    print("Not enough data for z-test")


Z-statistic: -0.12916411919703996
P-value: 0.4486


**Results from above cell:**
- The p-value is higher than the typical significance threshold of 0.05, so we fail to reject H0: there is no statistically significant evidence that customers with children have a lower in-store/online purchase ratio than customers without children.
- Having children doesn't appear to be a strong predictor of whether someone prefers in-store vs online shopping (at least based on this ratio).
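The direction of a one-sided test determines which tail of the normal distribution gives the p-value. As a small check, the sketch below converts the z-statistic reported above into both one-sided p-values using only the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

z = -0.12916411919703996  # z-statistic reported in the cell above

p_smaller = NormalDist().cdf(z)   # lower tail, for H1: mu_with_child < mu_no_child
p_larger = 1 - p_smaller          # upper tail, for H1: mu_with_child > mu_no_child
p_two_sided = 2 * min(p_smaller, p_larger)

print(f"lower-tail p-value: {p_smaller:.4f}")
print(f"upper-tail p-value: {p_larger:.4f}")
print(f"two-sided p-value:  {p_two_sided:.4f}")
```

The lower-tail value is about 0.449 and the upper-tail value about 0.551. Both are far above 0.05, so the conclusion does not depend on which of the two is reported.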

# 4. Is there cannibalization between physical stores and online channels? 
- Two columns from the dataset are used to answer this question: `NumWebPurchases` and `NumStorePurchases`
- We say there is cannibalization between physical stores and online channels if the average number of in-store purchases is lower than the average number of online purchases.
- Null hypothesis (H0): the average number of in-store purchases ≥ the average number of online purchases
- Alternative hypothesis (H1): the average number of in-store purchases < the average number of online purchases


In [88]:

# Check if we have enough data
if df['NumWebPurchases'].shape[0] > 1 and df['NumStorePurchases'].shape[0] > 1:
    try:
        zstat, pvalue = ztest(
            df['NumStorePurchases'].values, 
            df['NumWebPurchases'].values,
            alternative='smaller'
        )
        print(f"Z-statistic: {zstat}")
        print(f"P-value: {pvalue:.6f}")

    except Exception as e:
        print(f"Error in z-test: {e}")
else:
    print("Not enough data for z-test")

Z-statistic: 18.850084158703154
P-value: 1.000000


**Results from above cell:**
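- The p-value (≈ 1.0) is far above the typical significance threshold of 0.05, so we fail to reject H0: there is no statistically significant evidence of cannibalization between physical stores and online channels. The large positive z-statistic (18.85) indicates that, on average, customers make more in-store purchases than online purchases.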
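Because `NumStorePurchases` and `NumWebPurchases` are recorded for the same customers, a paired comparison of per-customer differences is a possible alternative to treating the two columns as independent samples. The sketch below shows the idea with a hypothetical helper and made-up data; it is not the test used in this notebook, and the normal approximation assumes a large sample.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev


def paired_ztest_smaller(a, b):
    """Paired z-test with H1: mean(a - b) < 0.

    Hypothetical helper for illustration; uses a normal approximation.
    """
    diffs = [x - y for x, y in zip(a, b)]
    std_err = stdev(diffs) / sqrt(len(diffs))
    z = fmean(diffs) / std_err
    return z, NormalDist().cdf(z)  # lower-tail p-value


# Made-up purchase counts for six customers.
store = [5, 3, 8, 6, 4, 7]
web = [4, 2, 6, 5, 5, 3]
z, p = paired_ztest_smaller(store, web)
print(f"z = {z:.4f}, p = {p:.4f}")
```

A positive z-statistic here means store purchases exceed web purchases on average, so the lower-tail p-value is large and H0 is not rejected.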

# 5. Does the United States significantly outperform other countries in purchases?
- Customers are split by the `Country_US` indicator: rows with `Country_US == 1` form the US group, and rows with `Country_US == 0` (Country_CA, Country_GER, Country_IND, Country_ME, Country_SA, and Country_SP) form the other-countries group, stored in `df_Other_counteies`.
- The `total_purchases` column is compared between the two groups.
- Null hypothesis (H0): the average number of purchases in the US ≤ the average number of purchases in other countries
- Alternative hypothesis (H1): the average number of purchases in the US > the average number of purchases in other countries

In [89]:
from statsmodels.stats.weightstats import ztest
import pandas as pd

# Create copies to avoid warnings
df_Other_counteies = df[df['Country_US'] == 0].copy()
df_US = df[df['Country_US'] == 1].copy()

# Convert to numeric and clean data
df_US['total_purchases'] = pd.to_numeric(df_US['total_purchases'], errors='coerce')
df_Other_counteies['total_purchases'] = pd.to_numeric(df_Other_counteies['total_purchases'], errors='coerce')

df_US_clean = df_US['total_purchases'].dropna()
df_Other_clean = df_Other_counteies['total_purchases'].dropna()

# Perform z-test
if len(df_Other_clean) > 1 and len(df_US_clean) > 1:
    try:
        zstat, pvalue = ztest(df_US_clean, df_Other_clean, alternative='larger')
        print(f"Z-statistic: {zstat:.4f}")
        print(f"P-value: {pvalue:.6f}")
        
        # Statistical significance interpretation
        alpha = 0.05
        if pvalue < alpha:
            print("Statistically significant difference (p < 0.05)")
        else:
            print("No statistically significant difference (p ≥ 0.05)")
            
    except Exception as e:
        print(f"Error in z-test: {e}")
else:
    print("Not enough data for z-test")

Z-statistic: 1.4399
P-value: 0.074955
No statistically significant difference (p ≥ 0.05)


**Results from above cell:**
- P value is higher than the typical significance threshold of 0.05. It means that there is no statistically significant difference in outperforming purchase in the United States compared to 