### Import Libraries

Import the necessary libraries for data manipulation and statistical hypothesis testing.

In [None]:
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind, f_oneway

### Load Cleaned Data

Load the cleaned insurance dataset from a Parquet file into a pandas DataFrame.

In [2]:
df = pd.read_parquet('../../data/clean_data.parquet')

### Feature Engineering and Stratified Sampling

Create a binary `HasClaim` column, select the columns relevant to the tests, convert the categorical columns to the `category` dtype, and draw a 30% sample from the full DataFrame, stratified by transaction month to reduce seasonality bias.

In [7]:
df['HasClaim'] = (df['TotalClaims'] > 0).astype(int)

# Copy the selection so the dtype conversions below modify an independent
# DataFrame rather than a slice of df (avoids SettingWithCopyWarning)
df_filtered = df[['Province', 'PostalCode', 'Gender', 'TotalClaims', 'TotalPremium', 'HasClaim']].copy()

# Convert categorical columns to category dtype
df_filtered['Province'] = df_filtered['Province'].astype('category')
df_filtered['Gender'] = df_filtered['Gender'].astype('category')

# Stratify sample by month to neutralize seasonality bias
df_sample = df.groupby('TransactionMonth').sample(frac=0.3, random_state=42)
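
As a rough illustration of what the per-month sampling does, the sketch below applies the same `groupby(...).sample(frac=0.3)` call to a small made-up DataFrame. The `toy` data and its values are assumptions for demonstration only and are not part of the insurance dataset.

```python
import pandas as pd

# Toy data (hypothetical): 100 rows in one month, 200 in another
toy = pd.DataFrame({
    'TransactionMonth': ['2013-10-01'] * 100 + ['2013-11-01'] * 200,
    'TotalClaims': range(300),
})

# Sample 30% of the rows within each month
toy_sample = toy.groupby('TransactionMonth').sample(frac=0.3, random_state=42)

# Each month contributes 30% of its own rows, so the monthly mix is preserved
month_counts = toy_sample['TransactionMonth'].value_counts().sort_index()
print(month_counts)
```

Because the fraction is applied inside each group, a month with twice as many rows still contributes twice as many sampled rows.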


### Preview Sampled Data

Display the first few rows of the sampled DataFrame to verify the sampling and preprocessing steps.

In [37]:
df_sample.head()

Unnamed: 0.1,Unnamed: 0,UnderwrittenCoverID,PolicyID,TransactionMonth,IsVATRegistered,Citizenship,LegalType,Title,Language,Bank,...,CoverType,CoverGroup,Section,Product,StatutoryClass,StatutoryRiskType,TotalPremium,TotalClaims,Z_Score,HasClaim
974709,987392,179,15,2013-10-01,False,ZA,Private company,Ms,English,Standard Bank,...,Third Party,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,41.51811,0.0,-0.088745,0
942158,954451,1191,157,2013-10-01,False,ZA,Private company,Mr,English,Nedbank,...,Third Party,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,10.316126,0.0,-0.224244,0
944478,956801,183,15,2013-10-01,False,ZA,Private company,Ms,English,Standard Bank,...,Keys and Alarms,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,1.107131,0.0,-0.264235,0
985811,998554,1197,157,2013-10-01,False,ZA,Private company,Mr,English,Nedbank,...,Keys and Alarms,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,0.412651,0.0,-0.267251,0
956890,969357,185,15,2013-10-01,False,ZA,Private company,Ms,English,Standard Bank,...,Signage and Vehicle Wraps,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,0.553594,0.0,-0.266639,0


### Chi-Squared Test: Province

Create a contingency table for province and claim occurrence, perform a chi-squared test, and interpret the result to determine if risk differs by province.

In [25]:
# Create a contingency table for chi-squared test
prov_contingency_table = pd.crosstab(df_sample['Province'], df_sample['HasClaim'])

# Perform the chi-squared test
prov_chi2, prov_p, _, _ = chi2_contingency(prov_contingency_table)

# Interpret the results
if prov_p < 0.05:
    print("Reject H₀: Provinces show significant risk differences.")
else:
    print("Fail to reject H₀: No significant risk differences among provinces.")
print(f"Chi-squared Statistic: {prov_chi2}")
print(f"P-value: {prov_p}")

Reject H₀: Provinces show significant risk differences.
Chi-squared Statistic: 57.78390174610344
P-value: 1.2658574266104923e-09


### Show Province Contingency Table

Display the contingency table used for the province chi-squared test.

In [36]:
prov_contingency_table

Province,HasClaim=0,HasClaim=1
Eastern Cape,8872,14
Free State,2396,5
Gauteng,116407,396
KwaZulu-Natal,48778,93
Limpopo,7394,16
Mpumalanga,15888,40
North West,42948,102
Northern Cape,1890,4
Western Cape,50872,89
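
To see which provinces drive the significant result, one option is to turn the counts above into claim frequencies. The sketch below re-enters the displayed table by hand (the variable names `counts` and `claim_rate` are illustrative, not from the notebook) and reruns the test on it.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the contingency table shown above
counts = pd.DataFrame(
    {
        0: [8872, 2396, 116407, 48778, 7394, 15888, 42948, 1890, 50872],
        1: [14, 5, 396, 93, 16, 40, 102, 4, 89],
    },
    index=[
        'Eastern Cape', 'Free State', 'Gauteng', 'KwaZulu-Natal', 'Limpopo',
        'Mpumalanga', 'North West', 'Northern Cape', 'Western Cape',
    ],
)

# Share of sampled policies with at least one claim, per province
claim_rate = (counts[1] / counts.sum(axis=1)).sort_values(ascending=False)
print((claim_rate * 100).round(3))

# Rerun the chi-squared test on the hand-entered counts
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.2e}")
```

In these counts Gauteng has the highest claim frequency and Eastern Cape the lowest, which suggests where the departure from independence comes from.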


### Chi-Squared Test: Zip Code

Create a contingency table for postal code and claim occurrence, perform a chi-squared test, and interpret the result to determine if risk differs by zip code.

In [27]:
# Create a contingency table for chi-squared test
zip_contingency_table = pd.crosstab(df_sample['PostalCode'], df_sample['HasClaim'])

# Perform the chi-squared test
zip_chi2, zip_p, _, _ = chi2_contingency(zip_contingency_table)

# Interpret the results
if zip_p < 0.05:
    print("Reject H₀: Zip codes show significant risk differences.")
else:
    print("Fail to reject H₀: No significant risk differences among zip codes.")
print(f"Chi-squared Statistic: {zip_chi2}")
print(f"P-value: {zip_p}")

Fail to reject H₀: No significant risk differences among zip codes.
Chi-squared Statistic: 877.2255064053513
P-value: 0.4439998280075464


### Show Zip Code Contingency Table

Display the contingency table used for the zip code chi-squared test.

In [35]:
zip_contingency_table

PostalCode,HasClaim=0,HasClaim=1
1,1595,3
2,434,2
4,24,0
5,121,1
6,131,1
...,...,...
9781,198,1
9830,17,0
9868,31,0
9869,412,1
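
Many postal codes in this table have very few claims, and the chi-squared approximation is usually considered unreliable when expected cell counts fall below about 5. The sketch below shows one way to check this, using only the first five rows displayed above as a stand-in; on the real data the same check would be run on `zip_contingency_table`. The names `shown` and `share_small` are illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# First five rows of the postal code table shown above (illustration only)
shown = pd.DataFrame(
    {0: [1595, 434, 24, 121, 131], 1: [3, 2, 0, 1, 1]},
    index=[1, 2, 4, 5, 6],
)

# chi2_contingency also returns the expected counts under independence
_, _, _, expected = chi2_contingency(shown)

# Fraction of cells whose expected count is below the common threshold of 5
share_small = (expected < 5).mean()
print(f"Cells with expected count < 5: {share_small:.0%}")
```

If a large share of cells falls below the threshold, grouping postal codes into broader regions before testing may give a more trustworthy result.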


### Welch’s t-test: Gender

Split the data by gender, perform Welch’s t-test to compare claim amounts between males and females, and interpret the result.

In [30]:
# Split data into male & female groups
male_claims = df_sample[df_sample['Gender'] == 'Male']['TotalClaims']
female_claims = df_sample[df_sample['Gender'] == 'Female']['TotalClaims']

# Perform Welch's t-test (handles unequal variances)
t_stat, gender_p = ttest_ind(male_claims, female_claims, equal_var=False)

# Interpret the results
if gender_p < 0.05:
    print("Reject H₀: Gender significantly affects claim amounts.")
else:
    print("Fail to reject H₀: No evidence of gender impact on claims.")
print(f"T-statistic: {t_stat}")
print(f"P-value: {gender_p}")


Fail to reject H₀: No evidence of gender impact on claims.
T-statistic: 0.800475929148579
P-value: 0.42346922108852514
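
For readers who want to see what `equal_var=False` changes, the sketch below computes the Welch statistic by hand on synthetic data and compares it with SciPy's result. The arrays `a` and `b` are randomly generated assumptions, not claim data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups (hypothetical) with different sizes and variances
rng = np.random.default_rng(0)
a = rng.normal(10.0, 2.0, size=200)
b = rng.normal(10.5, 5.0, size=80)

# Welch's t: difference in means over the unpooled standard error
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

# SciPy's Welch test should give the same statistic
t_scipy, p_scipy = ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy, p_scipy)
```

Each group keeps its own variance estimate in the standard error, which is why the test does not assume that male and female claim amounts are equally dispersed.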


### ANOVA: Vehicle Type

Group the data by vehicle type, filter out groups with insufficient data, perform a one-way ANOVA to test for differences in claim amounts across vehicle types, and interpret the result.

In [31]:
# Group by VehicleType
groups = [
    group['TotalClaims']
    for _, group in df_sample.groupby('VehicleType')
    if len(group) >= 2
]

# Perform ANOVA test if enough groups remain
if len(groups) >= 2:
    f_stat, vehicle_p = f_oneway(*groups)
    if vehicle_p < 0.05:
        print("Reject H₀: Certain vehicle types have significantly different claim amounts.")
    else:
        print("Fail to reject H₀: No significant impact.")
    print(f"F-statistic: {f_stat}")
    print(f"P-value: {vehicle_p}")
else:
    print("Not enough groups with sufficient data for ANOVA.")


Fail to reject H₀: No significant impact.
F-statistic: 0.6078443336309342
P-value: 0.6569647496172851
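
The F-statistic reported by `f_oneway` is the ratio of between-group to within-group variance. The sketch below reproduces it by hand on synthetic groups to make that concrete; the group means and sizes are made-up assumptions, and the variable is named `toy_groups` to avoid confusion with the notebook's `groups`.

```python
import numpy as np
from scipy.stats import f_oneway

# Three synthetic groups (hypothetical means and sizes)
rng = np.random.default_rng(1)
toy_groups = [rng.normal(mu, 1.0, size=n) for mu, n in [(0.0, 30), (0.2, 40), (0.5, 25)]]

all_values = np.concatenate(toy_groups)
grand_mean = all_values.mean()
k, n_total = len(toy_groups), all_values.size

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in toy_groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in toy_groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

f_scipy, p_scipy = f_oneway(*toy_groups)
print(f_manual, f_scipy, p_scipy)
```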


### Save Hypothesis Test Results

Summarize the results of all hypothesis tests in a DataFrame and save them as a CSV report.

In [34]:
results = pd.DataFrame({
    'Hypothesis': ['Province Risk Differences', 'Zip Code Risk Differences', 'Gender Risk Differences', 'Vehicle Type Differences'],
    'P-Value': [prov_p, zip_p, gender_p, vehicle_p],
    'Significance': ['Reject H₀' if p < 0.05 else 'Fail to Reject H₀' for p in [prov_p, zip_p, gender_p, vehicle_p]]
})

# Save results
results.to_csv("../../reports/hypothesis_results.csv", index=False)
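
As a quick sanity check on the saved report, the sketch below rebuilds the summary from the p-values printed in the cells above, writes it to a temporary directory (an assumption, chosen so the real report is not overwritten), and reads it back. The names `reported_p`, `results_check`, and `reloaded` are illustrative.

```python
import os
import tempfile

import pandas as pd

# P-values as printed in the test outputs above
reported_p = {
    'Province Risk Differences': 1.2658574266104923e-09,
    'Zip Code Risk Differences': 0.4439998280075464,
    'Gender Risk Differences': 0.42346922108852514,
    'Vehicle Type Differences': 0.6569647496172851,
}

results_check = pd.DataFrame({
    'Hypothesis': list(reported_p.keys()),
    'P-Value': list(reported_p.values()),
})
results_check['Significance'] = [
    'Reject H₀' if p < 0.05 else 'Fail to Reject H₀'
    for p in results_check['P-Value']
]

# Write to a temporary location and read it back to confirm the round trip
out_path = os.path.join(tempfile.mkdtemp(), 'hypothesis_results.csv')
results_check.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(reloaded)
```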
