### A/B Hypothesis Testing

In [17]:
import os
import sys
import numpy as np
import pandas as pd
import scipy.stats as stats  # probability distributions and statistical functions
# sns.set(style="darkgrid") # set the background for the graphs
# Get the current working directory
current_dir = os.getcwd()

# Append the parent directory to sys.path
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
from scripts.AB_haypothesis_tester import ABHypothesisTester

In [3]:
# Read the pipe-delimited text file MachineLearningRating_v3.txt
file_path = '../data/MachineLearningRating_v3.txt'
df = pd.read_csv(file_path, delimiter='|')

In [4]:
df.head(5)

| | UnderwrittenCoverID | PolicyID | TransactionMonth | IsVATRegistered | Citizenship | LegalType | Title | Language | Bank | AccountType | ... | ExcessSelected | CoverCategory | CoverType | CoverGroup | Section | Product | StatutoryClass | StatutoryRiskType | TotalPremium | TotalClaims |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 145249 | 12827 | 2015-03-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Mobility - Windscreen | Windscreen | Windscreen | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 21.929825 | 0.0 |
| 1 | 145249 | 12827 | 2015-05-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Mobility - Windscreen | Windscreen | Windscreen | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 21.929825 | 0.0 |
| 2 | 145249 | 12827 | 2015-07-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Mobility - Windscreen | Windscreen | Windscreen | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 0.0 | 0.0 |
| 3 | 145255 | 12827 | 2015-05-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Mobility - Metered Taxis - R2000 | Own damage | Own Damage | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 512.84807 | 0.0 |
| 4 | 145255 | 12827 | 2015-07-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Mobility - Metered Taxis - R2000 | Own damage | Own Damage | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 0.0 | 0.0 |


In [5]:
ab_tester = ABHypothesisTester(df)

## Hypothesis Tests
### 1. No Risk Differences Across Provinces:

- **Null Hypothesis (H0)**: There are no significant risk differences between provinces.
- **Alternative Hypothesis (H1)**: There are significant risk differences between provinces.
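
The tests in this notebook use `TotalClaims` as the risk metric with `test_type='t'`, so the hypotheses can be read as statements about group means. The internals of `hypothesis_test` are not shown here; assuming it runs a two-sided Welch t-test (an assumption, it may equally use the pooled-variance form), the hypotheses and test statistic would be:

$$
H_0: \mu_A = \mu_B \qquad\qquad H_1: \mu_A \neq \mu_B
$$

$$
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}
$$

where $\bar{x}$, $s^2$ and $n$ are the sample mean, sample variance and size of `TotalClaims` in the control group (A) and the test group (B).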

In [6]:
# Define the provinces for each group
control_provinces = ['Gauteng', 'Western Cape', 'KwaZulu-Natal']
test_provinces = ['Eastern Cape', 'Mpumalanga', 'Limpopo', 'North West', 'Free State', 'Northern Cape']

In [7]:
# 1. Risk Differences Across Provinces
group_A, group_B = ab_tester.create_groups(df, 'Province', control_provinces, test_provinces)
p_value_provinces = ab_tester.hypothesis_test(group_A, group_B, 'TotalClaims', test_type='t')

In [28]:
# Check the p-value for provinces
print("p_value_provinces= ",p_value_provinces)

p_value_provinces=  9.13223703854227e-10


In [9]:
# Reporting for Risk Differences Across Provinces
ab_tester.report_results(p_value_provinces, "Risk Differences Across Provinces")

Risk Differences Across Provinces: p-value = 0.0000 -> Reject the null hypothesis
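
The fixed four-decimal format above rounds 9.13e-10 to `0.0000`, which can be misread as an exact zero. Below is a minimal sketch of a reporting helper that keeps small p-values visible. The function `format_result` and its `alpha` parameter are hypothetical and are not part of `ABHypothesisTester`; the 0.05 threshold is the one referred to later in this notebook.

```python
def format_result(p_value, test_name, alpha=0.05):
    """Return a one-line summary with the decision at significance level alpha."""
    decision = (
        "Reject the null hypothesis"
        if p_value < alpha
        else "Fail to reject the null hypothesis"
    )
    # Scientific notation keeps very small p-values from rounding to 0.0000.
    return f"{test_name}: p-value = {p_value:.3e} -> {decision}"


line = format_result(9.13223703854227e-10, "Risk Differences Across Provinces")
print(line)
# Risk Differences Across Provinces: p-value = 9.132e-10 -> Reject the null hypothesis
```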


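#### Risk Differences Across Provinces (p-value ≈ 9.13e-10):
- **Conclusion**: Reject the null hypothesis.
- **Observation**: Mean `TotalClaims` differs significantly between the control group (Gauteng, Western Cape, KwaZulu-Natal) and the test group (the six provinces in `test_provinces`). The test compares two groups of provinces rather than each province individually, so it shows that location is associated with risk but not which provinces drive the difference. The p-value also says nothing about the size of the difference, so the group means should be inspected before treating location as a key factor in the risk profile of insurance clients.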
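
Because the `ABHypothesisTester` source is not included in this notebook, here is a rough sketch of what the grouping and testing steps might look like. The helpers `split_groups` and `welch_t_pvalue` are hypothetical stand-ins for `create_groups` and `hypothesis_test`, the choice of Welch's t-test (`equal_var=False`) is an assumption, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats


def split_groups(frame, column, control_values, test_values):
    """Return (control, test) subsets of frame by membership in column."""
    control = frame[frame[column].isin(control_values)]
    test = frame[frame[column].isin(test_values)]
    return control, test


def welch_t_pvalue(group_a, group_b, metric):
    """Two-sided Welch t-test on the mean of metric; returns the p-value."""
    result = stats.ttest_ind(
        group_a[metric].dropna(),
        group_b[metric].dropna(),
        equal_var=False,  # do not assume equal variances in the two groups
    )
    return result.pvalue


# Synthetic claims, for illustration only (not the notebook's dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Province": ["Gauteng"] * 200 + ["Limpopo"] * 200,
    "TotalClaims": np.concatenate([
        rng.exponential(100.0, 200),
        rng.exponential(60.0, 200),
    ]),
})

group_a, group_b = split_groups(toy, "Province", ["Gauteng"], ["Limpopo"])
p = welch_t_pvalue(group_a, group_b, "TotalClaims")
print(group_a["TotalClaims"].mean(), group_b["TotalClaims"].mean(), p)
```

Printing the group means next to the p-value shows the direction and size of the difference, which the p-value alone does not convey.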

--------------------------------------------------------------------------------------------------------------------------------

### 2. No Risk Differences Between Zip Codes:

- **Null Hypothesis (H0)**: There are no significant risk differences between zip codes.
- **Alternative Hypothesis (H1)**: There are significant risk differences between zip codes.

In [10]:
# 2. Risk Differences Between Zip Codes
group_A, group_B = ab_tester.create_zipcode_groups(df)
p_value_zipcode = ab_tester.hypothesis_test(group_A, group_B, 'TotalClaims', test_type='t')

In [12]:
# Reporting for Risk Differences Across Zip Codes
ab_tester.report_results(p_value_zipcode, "Risk Differences Between Zip Codes")

Risk Differences Between Zip Codes: p-value = 0.2776 -> Fail to reject the null hypothesis


#### Risk Differences Between Zip Codes (p-value = 0.2776):

- **Conclusion**: Fail to reject the null hypothesis.
- **Observation**: The test finds no statistically significant difference in mean `TotalClaims` between the two zip code groups. This is consistent with a similar risk profile across zip codes, but failing to reject H0 is not proof that no difference exists. On this evidence, zip code alone does not appear to be a determining factor for risk differentiation.

--------------------------------------------------------------------------------------------------------------------------------

### 3. No Significant Margin (Profit) Differences Between Zip Codes:

- **Null Hypothesis (H0)**: There are no significant margin differences between zip codes.
- **Alternative Hypothesis (H1)**: There are significant margin differences between zip codes.

In [18]:
# Calculate Profit Margin
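df['ProfitMargin'] = (df['TotalPremium'] - df['TotalClaims']) / df['TotalPremium']

# Rows with TotalPremium == 0 produce inf, -inf or NaN; treat them all as missing.
# Assign the result back instead of calling replace(..., inplace=True) on the column,
# which is not guaranteed to modify df.
df['ProfitMargin'] = df['ProfitMargin'].replace([np.inf, -np.inf], np.nan)
df.dropna(subset=['ProfitMargin'], inplace=True)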
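
The `df.head(5)` output above already contains rows where `TotalPremium` is 0.0, so the division is undefined for part of the data. A small sketch with made-up values shows which results the cleanup step has to catch:

```python
import numpy as np
import pandas as pd

# Made-up values covering the cases that matter for the margin calculation.
toy = pd.DataFrame({
    "TotalPremium": [100.0, 0.0, 0.0, 50.0],
    "TotalClaims": [20.0, 0.0, 5.0, 75.0],
})

margin = (toy["TotalPremium"] - toy["TotalClaims"]) / toy["TotalPremium"]
# Row 1 is 0/0 (NaN) and row 2 is -5/0 (-inf); pandas does not raise here.
print(margin.tolist())

cleaned = margin.replace([np.inf, -np.inf], np.nan).dropna()
print(cleaned.tolist())  # [0.8, -0.5]
```

Note that dropping these rows removes every zero-premium record from the margin test, which can shrink the groups considerably.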

In [22]:
# 3. Margin (Profit) Differences Between Zip Codes
group_A, group_B = ab_tester.create_zipcode_groups(df)
p_value_margin = ab_tester.hypothesis_test(group_A, group_B, 'ProfitMargin', test_type='t')

ValueError: No data; `observed` has size 0.

In [21]:
# Reporting for Profit Margin Differences Across Zip Codes
ab_tester.report_results(p_value_margin, "Margin Differences Between Zip Codes")

Margin Differences Between Zip Codes: p-value = 0.0880 -> Fail to reject the null hypothesis


#### Margin Differences Between Zip Codes (p-value = 0.0880):

- **Conclusion**: Fail to reject the null hypothesis.
- **Observation**: The difference in mean profit margin between the zip code groups is not statistically significant at the 0.05 level. The p-value of 0.0880 is fairly close to that threshold, so the result is better read as inconclusive than as strong evidence that profitability is identical across zip codes. For this dataset, there is no demonstrated margin difference between zip codes.

--------------------------------------------------------------------------------------------------------------------------------

### 4. No Significant Risk Differences Between Women and Men:

- **Null Hypothesis (H0)**: There are no significant risk differences between women and men.
- **Alternative Hypothesis (H1)**: There are significant risk differences between women and men.

In [23]:
# Data Cleaning
df['Gender'] = df['Gender'].replace({'Not specified': 'Unknown'})
df = df.dropna(subset=['Gender'])

In [24]:
# 4. Risk Differences Between Women and Men
group_A, group_B = ab_tester.create_gender_groups(df)
# group_A, group_B = ab_tester.create_groups(df, 'Gender', 'Female', 'Male')
p_value_gender = ab_tester.hypothesis_test(group_A, group_B, 'TotalClaims', test_type='t')

ValueError: No data; `observed` has size 0.
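
The error above reports that an input to the test had size 0; the notebook does not show which group was empty or why. A minimal guard, assuming the groups are DataFrame subsets, is to check the sample sizes before calling the test so the failure names the empty group. The helper `checked_ttest` is hypothetical and is not part of `ABHypothesisTester`; the data is made up.

```python
import pandas as pd
from scipy import stats


def checked_ttest(group_a, group_b, metric):
    """Welch t-test p-value, with an explicit error when a group is too small."""
    a = group_a[metric].dropna()
    b = group_b[metric].dropna()
    if len(a) < 2 or len(b) < 2:
        raise ValueError(
            f"Not enough data for '{metric}': "
            f"group A has {len(a)} rows, group B has {len(b)} rows"
        )
    return stats.ttest_ind(a, b, equal_var=False).pvalue


# Made-up frame in which no row has Gender == 'Female'.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Unknown"],
    "TotalClaims": [0.0, 120.0, 30.0, 10.0],
})
females = toy[toy["Gender"] == "Female"]
males = toy[toy["Gender"] == "Male"]

message = ""
try:
    checked_ttest(females, males, "TotalClaims")
except ValueError as err:
    message = str(err)
    print(message)
```

Running `df['Gender'].value_counts(dropna=False)` before grouping is a quick way to confirm that both categories are actually present after cleaning.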

In [None]:
ab_tester.report_results(p_value_gender, "Risk Differences Between Genders")

Risk Differences Between Genders: p-value = 0.8041 -> Fail to reject the null hypothesis


#### Risk Differences Between Genders (p-value = 0.8041):

- **Conclusion**: Fail to reject the null hypothesis.
- **Observation**: There is no statistically significant difference in mean `TotalClaims` between women and men. In this insurance dataset, the two groups show similar risk profiles, and gender does not appear to be a key differentiator in risk assessment.

-------------------------------------------------------------------------------------------------------------------------------------------

### General Insight:
- The analysis indicates that geographic location at the province level is associated with significant differences in risk, whereas the more granular zip code level is not. There are also no significant differences in risk between genders or in profit margin between zip codes, which suggests a fairly homogeneous risk and margin landscape along those dimensions. These conclusions rest on p-values alone and should be read together with the size of the observed differences.
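
The four results reported above can be collected in one table for a side-by-side view. This is a small sketch using the p-values printed in this notebook and the 0.05 threshold mentioned earlier; the column names are illustrative.

```python
import pandas as pd

ALPHA = 0.05  # significance threshold referred to in this notebook

results = pd.DataFrame({
    "test": [
        "Risk across provinces",
        "Risk between zip codes",
        "Margin between zip codes",
        "Risk between genders",
    ],
    # p-values as reported in the sections above
    "p_value": [9.13223703854227e-10, 0.2776, 0.0880, 0.8041],
})
results["reject_h0"] = results["p_value"] < ALPHA

print(results.to_string(index=False))
```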