# A/B Hypothesis Testing

The analysis rejects or fails to reject each of the following null hypotheses:
1. There are no risk differences across provinces.
2. There are no risk differences between zip codes.
3. There is no significant margin (profit) difference between zip codes.
4. There is no significant risk difference between women and men.

To evaluate these null hypotheses with A/B hypothesis testing on the insurance dataset, I use the SciPy Python library and follow these steps: first, choose the Key Performance Indicators (KPIs) that measure risk and margin (profit); second, create groups based on the feature being tested; third, perform the statistical test; finally, interpret the p-value obtained from the test.

### 1. Choose the Key Performance Indicator (KPI)

The KPIs for the analysis will be:

- Risk: Claim frequency or claim amount (TotalClaims).
- Profit Margin: Calculated as (TotalPremium - TotalClaims) / TotalPremium
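The profit margin formula above can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: the function name `profit_margin` and the toy data are assumptions, and rows with a zero premium are mapped to NaN because the ratio is undefined there.

```python
import numpy as np
import pandas as pd


def profit_margin(df: pd.DataFrame) -> pd.Series:
    """Return (TotalPremium - TotalClaims) / TotalPremium for each row.

    Rows with a zero premium yield NaN instead of an infinite value.
    """
    premium = df["TotalPremium"].astype(float)
    claims = df["TotalClaims"].astype(float)
    # Replace zero premiums with NaN so the ratio is undefined, not infinite.
    safe_premium = premium.where(premium != 0, np.nan)
    return (premium - claims) / safe_premium


# Made-up rows for illustration only (not the insurance data).
toy = pd.DataFrame({
    "TotalPremium": [100.0, 200.0, 0.0],
    "TotalClaims": [20.0, 250.0, 0.0],
})
margins = profit_margin(toy)
print(margins.tolist())
```

A negative margin, as in the second toy row, means the claims exceeded the premium collected.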

### 2. Data Segmentation

Create groups based on the feature being tested:

- Control Group (Group A): Plans without the feature.
- Test Group (Group B): Plans with the feature.
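One way to build the two groups is to split the rows by whether a column's value belongs to a chosen set. The helper below is a hypothetical sketch: `split_groups`, the placeholder labels `P1` to `P3`, and the toy frame are not part of the original notebook.

```python
import pandas as pd


def split_groups(df, column, group_a_values):
    """Split df into (group_a, group_b) by membership of `column` in group_a_values."""
    mask = df[column].isin(group_a_values)
    return df[mask], df[~mask]


# Made-up rows with placeholder province labels, for illustration only.
toy = pd.DataFrame({
    "Province": ["P1", "P2", "P3", "P1"],
    "TotalPremium": [10.0, 20.0, 30.0, 40.0],
})
group_a, group_b = split_groups(toy, "Province", ["P1"])
print(len(group_a), len(group_b))
```

Because the two groups come from one boolean mask and its negation, every row lands in exactly one group.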

In [1]:
# Import required libraries for the analysis
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind

In [2]:
# Load the cleaned data
def loding_data(path):
    """Read the CSV at `path` into a DataFrame; return None if loading fails."""
    try:
        return pd.read_csv(path, low_memory=False)
    except Exception as e:
        print(f"Error on loading data: {e}")
        return None

In [3]:
path = "../data/clean_acis_data.csv"
clean_acis_df = loding_data(path)

### 3. Statistical Testing

In [14]:
# Chi-squared test of independence between two categorical columns
def perform_chi_squared_test(df, column1, column2, nullhypo, althypo):
    """
    Runs a chi-squared test of independence on two columns of a dataframe
    and prints the hypothesis supported by the result.

    Parameters:
    - df: The DataFrame to operate on.
    - column1: The first column to cross-tabulate.
    - column2: The second column to cross-tabulate.
    - nullhypo: The null hypothesis
    - althypo: The alternative hypothesis
    
    return: None"""
    
    # Contingency table of column1 against column2
    contingency_table = pd.crosstab(df[column1], df[column2])

    # Significance level
    a = 0.05

    # Chi-squared test (returns the statistic, p-value, degrees of freedom, expected counts)
    chi2, p, dof, expected = chi2_contingency(contingency_table)

    print(f"Chi-squared test for {column1} vs {column2}")
    print(f"Chi-Squared value: {chi2}")
    print(f'p-value: {p}')

    result = althypo if p < a else nullhypo
    print(result)
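A chi-squared test of independence applies to two categorical columns. The sketch below shows the underlying SciPy call on made-up data; the `Region` and `HasClaim` columns are hypothetical and do not come from the insurance dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up records: whether a policy had a claim, by a hypothetical region label.
toy = pd.DataFrame({
    "Region": ["A"] * 50 + ["B"] * 50,
    "HasClaim": [True] * 10 + [False] * 40 + [True] * 25 + [False] * 25,
})

# Cross-tabulate the two categorical columns into a 2x2 table of counts.
table = pd.crosstab(toy["Region"], toy["HasClaim"])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected counts under independence.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4f}")
```

For a 2x2 table, SciPy applies Yates' continuity correction by default, so the statistic is slightly smaller than the uncorrected value.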

In [4]:
# Function to perform Welch's two-sample t-test (equal_var=False)
def perform_ttest(groupA, groupB, kpi):
    t_stat, p_val = ttest_ind(groupA[kpi], groupB[kpi], equal_var=False)
    return t_stat, p_val
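With `equal_var=False`, `ttest_ind` performs Welch's t-test, which does not assume the two groups share a variance. For reference, with sample means $\bar{x}_A, \bar{x}_B$, sample variances $s_A^2, s_B^2$, and sizes $n_A, n_B$ (symbols introduced here for illustration), the statistic and its approximate degrees of freedom are:

```latex
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}
     {\dfrac{\left(s_A^2 / n_A\right)^2}{n_A - 1} + \dfrac{\left(s_B^2 / n_B\right)^2}{n_B - 1}}
```

The sign of $t$ follows the order of the arguments: it is positive when the first group's mean is larger.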

Null Hypothesis 1: There are no risk differences across provinces.

- KPI: TotalPremium.

In [23]:
# Split provinces into two groups
col_list = list(clean_acis_df["Province"].unique())
provinces_A = clean_acis_df[clean_acis_df['Province'].isin(col_list[:5])]
provinces_B = clean_acis_df[clean_acis_df['Province'].isin(col_list[5:])]

ttest_value, p = perform_ttest(provinces_A, provinces_B, "TotalPremium")
print(f"T-test for 'Provinces' vs TotalPremium")
print(f"T-Test value: {ttest_value:.2f}")
print(f'p-value: {p:.2f}') 



T-test for 'Provinces' vs TotalPremium
T-Test value: -13.63
p-value: 0.00


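The p-value rounds to 0.00 at two decimal places, so it is below 0.005, and we reject the null hypothesis at the 5% significance level: mean TotalPremium differs significantly between the two province groups. Note that the test compares the first five provinces in the data against the remaining ones, and that it uses TotalPremium as the measure of risk.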
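A p-value printed with `:.2f` shows as 0.00 whenever it is below 0.005, which hides its magnitude. A small formatting helper avoids this; `format_p_value` and its threshold are hypothetical, and the tiny value passed below is made up, not the notebook's result.

```python
def format_p_value(p, threshold=0.001):
    """Format a p-value so that very small values are not displayed as 0.00."""
    if p < threshold:
        return f"< {threshold}"
    return f"{p:.3f}"


print(format_p_value(3.2e-42))  # prints "< 0.001"
print(format_p_value(0.17))     # prints "0.170"
```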

Null Hypothesis 2: There are no risk differences between zip codes

- KPI: TotalPremium (numerical)

In [28]:

# Split postal codes into two groups
col_list = list(clean_acis_df["PostalCode"].unique())
postalcode_A = clean_acis_df[clean_acis_df['PostalCode'].isin(col_list[:425])]
postalcode_B = clean_acis_df[clean_acis_df['PostalCode'].isin(col_list[425:])]

ttest_value, p = perform_ttest(postalcode_A, postalcode_B, "TotalPremium")
print(f"T-test for 'PostalCode' vs TotalPremium")
print(f"T-Test value: {ttest_value:.2f}")
print(f'p-value: {p:.2f}') 

T-test for 'PostalCode' vs TotalPremium
T-Test value: 5.34
p-value: 0.00


TotalPremium between postal codes:

- T-Test Value (5.34): The statistic is positive, so the mean TotalPremium of the first group (postalcode_A) is higher than that of the second group (postalcode_B).
- p-value (0.00): The p-value rounds to zero at two decimal places. If the null hypothesis were true, a difference at least this large would be very unlikely.

Since the p-value is below 0.05, we reject the null hypothesis and conclude that there is a statistically significant difference in mean TotalPremium between the two postal code groups.
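The sign of the t statistic depends only on the order in which the groups are passed to the test. The sketch below illustrates this on synthetic samples (not the insurance data), where group A is deliberately drawn with the larger mean.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic samples for illustration: group A has the larger true mean.
group_a = rng.normal(loc=110.0, scale=15.0, size=500)
group_b = rng.normal(loc=100.0, scale=15.0, size=500)

# Welch's t-test in both argument orders.
t_ab, p_ab = ttest_ind(group_a, group_b, equal_var=False)
t_ba, p_ba = ttest_ind(group_b, group_a, equal_var=False)

# Swapping the groups flips the sign of t but leaves the p-value unchanged.
print(t_ab > 0, t_ba < 0)
```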

Null Hypothesis 3: There is no significant margin (profit) difference between zip codes
- KPI: Total Premium - Total Claims (numerical) 


In [36]:
# Calculate margins
clean_acis_df['Margin'] = clean_acis_df['TotalPremium'] - clean_acis_df['TotalClaims']

col_list = list(clean_acis_df["PostalCode"].unique())
postalcode_A = clean_acis_df[clean_acis_df['PostalCode'].isin(col_list[:425])]
postalcode_B = clean_acis_df[clean_acis_df['PostalCode'].isin(col_list[425:])]

ttest_value, p = perform_ttest(postalcode_A, postalcode_B, "Margin")
print(f"T-test for 'PostalCode' vs Margin")
print(f"T-Test value: {ttest_value:.2f}")
print(f'p-value: {p:.2f}') 

T-test for 'PostalCode' vs Margin
T-Test value: 5.34
p-value: 0.00


Margins between postal codes:

- T-Test Value (5.34): This positive value, the same as the one reported for TotalPremium above, indicates that the mean margin of the first group of postal codes (postalcode_A) is higher than that of the second (postalcode_B).
- p-value (0.00): The p-value rounds to zero at two decimal places. If the null hypothesis were true, a difference at least this large would be very unlikely.

Since the p-value is below 0.05, we reject the null hypothesis and conclude that there is a statistically significant difference in mean margin between the two postal code groups.

Null Hypothesis 4: There is no significant risk difference between women and men
- KPI: Total Premium (numerical)

In [38]:
males = clean_acis_df[clean_acis_df['Gender'] == 'Male']
females = clean_acis_df[clean_acis_df['Gender'] == 'Female']

ttest_value, p = perform_ttest(males, females, "TotalPremium")
print(f"T-test for 'Gender' vs TotalPremium")
print(f"T-Test value: {ttest_value:.2f}")
print(f'p-value: {p:.2f}') 

T-test for 'Gender' vs TotalPremium
T-Test value: 1.38
p-value: 0.17


TotalPremium between genders:
- T-Test Value (1.38): The absolute value is small, indicating that the mean TotalPremium of the two genders is close relative to the variation in the data.
- p-value (0.17): This value is greater than the commonly used significance level of 0.05.

Based on the p-value, we fail to reject the null hypothesis. In other words, there's not enough evidence to conclude that there's a statistically significant difference in TotalPremium between genders. This suggests that, on average, men and women might pay similar total premiums.
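The decision rule used throughout can be written as a short helper and applied to the four p-values reported above. This is a sketch: `decide` is a hypothetical name, and the inputs are the two-decimal values printed in the outputs, not the exact p-values.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision at significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# p-values as printed above, rounded to two decimals.
reported = {
    "Province vs TotalPremium": 0.00,
    "PostalCode vs TotalPremium": 0.00,
    "PostalCode vs Margin": 0.00,
    "Gender vs TotalPremium": 0.17,
}
decisions = {name: decide(p) for name, p in reported.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

Failing to reject is not the same as showing that the groups are equal; it only means the data do not provide enough evidence of a difference at the chosen level.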