### Add scripts path to the notebook

In [1]:
import sys
import os

current_dir = os.getcwd()
print(current_dir)

# Get the parent directory
parent_dir = os.path.dirname(current_dir)

scripts_path = os.path.join(parent_dir, 'scripts')

# Insert the path to the parent directory
sys.path.insert(0, parent_dir)

# Insert the path to the Scripts directory
sys.path.insert(0, scripts_path)

# Add the parent directory to the Python path
sys.path.append(os.path.abspath(os.path.join('..')))

d:\KifiyaAIM-Course\Week - 3\ACIS-Car-Insurance-Claim-Analysis\notebooks


### Import Statements

In [2]:
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from scripts.ab_testing import annova_test, test_hypothesis, ab_test
from scripts.utils import count_group_contribution

### Load the data

In [3]:
PATH_TO_DATA = "../data/MachineLearningRating_v3.txt"

In [4]:
data = pd.read_csv(filepath_or_buffer=PATH_TO_DATA, delimiter='|', low_memory=False)

### Test 1

- Null Hypothesis = There are no risk differences across provinces
- Alternate Hypothesis = There are risk differences across provinces

1) Define the KPIs that indicate risk for a given data point 

    Of the available columns, TotalClaims and TotalPremium are the most direct indicators of risk. I use the claim amount as the risk metric: a higher total claim amount implies more accidents and larger payouts, and therefore a higher-risk policyholder.

In [5]:
data[['TotalClaims', 'Province']].isna().mean()

TotalClaims    0.0
Province       0.0
dtype: float64

Neither column of interest (TotalClaims or Province) has missing values, so let us proceed to the next tasks.

In [6]:
data['Province'].nunique()

9

2) Run the ANOVA test

In [7]:
# Calculate the p-value with the custom annova_test helper (one-way ANOVA)
f_statistics, p_value = annova_test(dependent_col='TotalClaims', independent_col='Province', data=data)
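
The `annova_test` helper lives in `scripts/ab_testing.py` and its body is not shown in this notebook. The following is a minimal sketch of what such a helper could look like, assuming it wraps `scipy.stats.f_oneway`; the name `anova_sketch` and the toy data are hypothetical and only illustrate the technique.

```python
import pandas as pd
from scipy import stats


def anova_sketch(dependent_col: str, independent_col: str, data: pd.DataFrame):
    """Hypothetical stand-in for annova_test: one-way ANOVA across groups."""
    # Collect the dependent values of each group as a separate sample
    samples = [
        group[dependent_col].dropna().to_numpy()
        for _, group in data.groupby(independent_col)
    ]
    # f_oneway returns the F statistic and the p-value
    f_statistic, p_value = stats.f_oneway(*samples)
    return f_statistic, p_value


# Toy data, invented purely for illustration
toy = pd.DataFrame({
    "Province": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "TotalClaims": [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0, 10.0, 11.0, 12.0, 13.0],
})
f_stat, p = anova_sketch("TotalClaims", "Province", toy)
print(f_stat, p)
```

A one-way ANOVA only says whether at least one group mean differs from the others; it does not say which one.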

3) Reject or fail to reject the null hypothesis based on the p-value

In [8]:
test_hypothesis(null_hypothesis="There are no risk differences across provinces", p_value=p_value)

Rejected the null hypothesis: There are no risk differences across provinces 
p_value: 1.6782057588675903e-07
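
The decision rule inside `test_hypothesis` is not shown here. A minimal sketch of the usual rule, assuming a significance level of 0.05 (an assumption, since the notebook does not state the threshold); the name `decide` and the `alpha` parameter are hypothetical:

```python
def decide(null_hypothesis: str, p_value: float, alpha: float = 0.05) -> str:
    """Hypothetical stand-in for test_hypothesis; alpha=0.05 is an assumption."""
    if p_value < alpha:
        # Evidence against the null at the chosen significance level
        verdict = "Rejected the null hypothesis"
    else:
        # Not enough evidence; this does not prove the null is true
        verdict = "Failed to reject the null hypothesis"
    return f"{verdict}: {null_hypothesis}\np_value: {p_value}"


message = decide("There are no risk differences across provinces", 1.6782057588675903e-07)
print(message)
```

A large p-value means the data does not contradict the null hypothesis, which is weaker than showing the null hypothesis holds.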


### Test 2

- Null Hypothesis = There are no risk differences between zip codes
- Alternate Hypothesis = There are risk differences between zip codes

1) Check whether any postal codes are missing

In [9]:
data['PostalCode'].isna().mean()

np.float64(0.0)

2) Run the ANOVA test rather than an A/B test, because there are more than two postal codes

In [10]:
# Calculate the p-value with the custom annova_test helper (one-way ANOVA)
f_statistics, p_value = annova_test(dependent_col='TotalClaims', independent_col='PostalCode', data=data)

3) Reject or fail to reject the null hypothesis based on the p-value

In [11]:
test_hypothesis(null_hypothesis="There are no risk differences between zip codes", p_value=p_value)

Accepted the null hypothesis: There are no risk differences between zip codes 
p_value: 0.8906511279164051


### Test 3

- Null Hypothesis = There is no significant margin (profit) difference between zip codes
- Alternate Hypothesis = There is a significant margin (profit) difference between zip codes

1) First calculate the margin (profit)

In [12]:
data['Profit'] = data['TotalPremium'] - data['TotalClaims']

2) Run the ANOVA test

In [13]:
# Calculate the p-value with the custom annova_test helper (one-way ANOVA)
f_statistics, p_value = annova_test(dependent_col='Profit', independent_col='PostalCode', data=data)

3) Reject or fail to reject the null hypothesis based on the p-value

In [14]:
test_hypothesis(null_hypothesis="There are no significant margin (profit) difference between zip codes", p_value=p_value)

Accepted the null hypothesis: There are no significant margin (profit) difference between zip codes 
p_value: 0.9976859758015036


### Test 4

- Null Hypothesis = There is no significant risk difference between Women and Men
- Alternate Hypothesis = There is a significant risk difference between Women and Men

1) Check for missing gender data values

In [15]:
data['Gender'].isna().mean() * 100

np.float64(0.9535065563574769)

Check the available values in the Gender column

In [16]:
data['Gender'].unique()

array(['Not specified', 'Male', 'Female', nan], dtype=object)

About 0.95% of the gender data is missing, so I assign those rows the existing value 'Not specified'.

In [17]:
# fill the missing gender values with Not Specified
data['Gender'] = data['Gender'].fillna(value='Not specified')

2) Since there are 3 values for the Gender column, let me check the amount of data for each possible value

In [18]:
count_group_contribution(data=data, grouping_col='Gender')

Female: 0.68% of the data
Male: 4.28% of the data
Not specified: 95.04% of the data
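
The `count_group_contribution` helper comes from `scripts/utils.py` and is not shown in this notebook. A minimal sketch of how such percentages could be computed, assuming it relies on `Series.value_counts(normalize=True)`; the name `group_shares` and the toy data are hypothetical:

```python
import pandas as pd


def group_shares(data: pd.DataFrame, grouping_col: str) -> pd.Series:
    """Hypothetical stand-in: percentage of rows that fall in each group."""
    # normalize=True turns counts into fractions of the total row count
    shares = data[grouping_col].value_counts(normalize=True).sort_index() * 100
    return shares.round(2)


# Toy data, invented purely for illustration
toy = pd.DataFrame({"Gender": ["Male"] * 3 + ["Female"] + ["Not specified"] * 6})
shares = group_shares(toy, "Gender")
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}% of the data")
```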


The data is heavily imbalanced with respect to gender. First, rows with a Gender value of 'Not specified' account for about 95% of the data, while Female and Male combined account for just under 5%. Second, even when comparing only Male and Female, the large difference in sample size between the two groups makes the result hard to interpret.

Filter the data to keep only the rows where Gender is 'Male' or 'Female'.

3) Run the A/B test

In [24]:
# just keep the datapoints that are either male or female
male_female_df = data[data['Gender'] != 'Not specified']

# see the new data distribution
count_group_contribution(data=male_female_df, grouping_col='Gender')


Female: 13.63% of the data
Male: 86.37% of the data


In [None]:
# calculate the p_value using my custom function that uses ab test
t_stat, p_value = ab_test(dependent_col='TotalClaims', independent_col='Gender', data=male_female_df)
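
The body of `ab_test` is not shown in this notebook. A minimal sketch of a two-group comparison, assuming it is a two-sample t-test; Welch's variant (`equal_var=False`) is chosen here because the Male and Female groups differ greatly in size, which is my assumption and not necessarily what the helper does. The name `two_group_t_test` and the toy data are hypothetical:

```python
import pandas as pd
from scipy import stats


def two_group_t_test(dependent_col: str, independent_col: str, data: pd.DataFrame):
    """Hypothetical stand-in for ab_test: Welch's t-test between two groups."""
    groups = [
        group[dependent_col].dropna().to_numpy()
        for _, group in data.groupby(independent_col)
    ]
    if len(groups) != 2:
        raise ValueError("expected exactly two groups")
    # equal_var=False selects Welch's t-test, which does not assume equal variances
    t_stat, p_value = stats.ttest_ind(groups[0], groups[1], equal_var=False)
    return t_stat, p_value


# Toy data, invented purely for illustration
toy = pd.DataFrame({
    "Gender": ["Female"] * 3 + ["Male"] * 8,
    "TotalClaims": [0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 4.0, 0.0, 6.0, 0.0, 0.0],
})
t_value, p = two_group_t_test("TotalClaims", "Gender", toy)
print(t_value, p)
```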

4) Reject or fail to reject the null hypothesis

In [22]:
test_hypothesis(null_hypothesis="There are not significant risk difference between Women and Men", p_value=p_value)

Accepted the null hypothesis: There are not significant risk difference between Women and Men 
p_value: 0.8041073961270343


5) I decided to run an ANOVA test on the entire Gender column, including 'Not specified', to see whether the result changes

In [25]:
# Calculate the p-value with the custom annova_test helper (one-way ANOVA)
f_statistics, p_value = annova_test(dependent_col='TotalClaims', independent_col='Gender', data=data)

In [26]:
test_hypothesis(null_hypothesis="There are not significant risk difference between Women and Men", p_value=p_value)

Rejected the null hypothesis: There are not significant risk difference between Women and Men 
p_value: 0.010025171532279099


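So when I include the rows with a Gender value of 'Not specified', the test rejects the null hypothesis. Note that a three-group ANOVA only indicates that at least one group mean differs. Since the Male versus Female comparison above was not significant, the difference may be driven by the 'Not specified' group rather than by a difference between women and men.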
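
To illustrate why a three-group result can differ from a two-group one, here is a small sketch on made-up numbers (not the insurance data): two groups are identical and a third is shifted. It uses `scipy.stats.f_oneway` and `scipy.stats.ttest_ind`; the variable names and values are hypothetical.

```python
from scipy import stats

# Made-up samples: the first two groups are identical, the third is shifted
female = [1.0, 2.0, 3.0, 4.0, 5.0]
male = [1.0, 2.0, 3.0, 4.0, 5.0]
not_specified = [11.0, 12.0, 13.0, 14.0, 15.0]

# ANOVA over all three groups detects that some group mean differs
_, p_three = stats.f_oneway(female, male, not_specified)

# A direct comparison of the two identical groups finds no difference
_, p_two = stats.ttest_ind(female, male, equal_var=False)

print(f"three groups: p = {p_three:.3g}")
print(f"female vs male: p = {p_two:.3g}")
```

In this toy case the ANOVA is significant only because of the third group, so a significant ANOVA over Gender does not by itself establish a Male versus Female difference.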