# 02 - Hypothesis Testing

This notebook performs A/B hypothesis testing to validate or reject key hypotheses about risk drivers and profit margins, as outlined in the project brief.

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## 1. Data Loading and Preprocessing

Load the dataset and perform any necessary preprocessing steps, such as converting data types or handling missing values, specifically for the variables relevant to hypothesis testing.

In [4]:
data_path = '../data/insurance_claims.txt'
df = pd.read_csv(data_path, sep='|')

# Convert TotalPremium and TotalClaims to numeric; unparseable values become NaN
df['TotalPremium'] = pd.to_numeric(df['TotalPremium'], errors='coerce')
df['TotalClaims'] = pd.to_numeric(df['TotalClaims'], errors='coerce')

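# Define risk and margin metrics

# Claim Frequency indicator: 1 if the record has a positive claim amount, else 0
# (records whose TotalClaims is NaN also get 0, because NaN > 0 is False)
df['HasClaim'] = (df['TotalClaims'] > 0).astype(int)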

# Claim Severity: average amount of a claim, given a claim occurred
df_claims = df[df['HasClaim'] == 1].copy()

# Margin: (TotalPremium - TotalClaims)
df['Margin'] = df['TotalPremium'] - df['TotalClaims']

df.head()

| Unnamed: 0 | UnderwrittenCoverID | PolicyID | TransactionMonth | IsVATRegistered | Citizenship | LegalType | Title | Language | Bank | AccountType | ... | CoverType | CoverGroup | Section | Product | StatutoryClass | StatutoryRiskType | TotalPremium | TotalClaims | HasClaim | Margin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 145249 | 12827 | 2015-03-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Windscreen | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 21.929825 | 0.0 | 0 | 21.929825 |
| 1 | 145249 | 12827 | 2015-05-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Windscreen | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 21.929825 | 0.0 | 0 | 21.929825 |
| 2 | 145249 | 12827 | 2015-07-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Windscreen | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 0.0 | 0.0 | 0 | 0.0 |
| 3 | 145255 | 12827 | 2015-05-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Own Damage | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 512.84807 | 0.0 | 0 | 512.84807 |
| 4 | 145255 | 12827 | 2015-07-01 00:00:00 | True | | Close Corporation | Mr | English | First National Bank | Current account | ... | Own Damage | Comprehensive - Taxi | Motor Comprehensive | Mobility Metered Taxis: Monthly | Commercial | IFRS Constant | 0.0 | 0.0 | 0 | 0.0 |
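
To make the three metrics defined in the cell above concrete, here is a small worked example. The `toy` frame and its values are hypothetical and only mirror the column names used in this notebook.

```python
import pandas as pd

# Toy records (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "TotalPremium": [100.0, 100.0, 100.0, 100.0],
    "TotalClaims": [0.0, 250.0, 0.0, 50.0],
})

# Same definitions as in the preprocessing cell
toy["HasClaim"] = (toy["TotalClaims"] > 0).astype(int)
toy["Margin"] = toy["TotalPremium"] - toy["TotalClaims"]

# Claim frequency: share of records with at least one claim
claim_frequency = toy["HasClaim"].mean()
# Claim severity: mean claim amount, given that a claim occurred
claim_severity = toy.loc[toy["HasClaim"] == 1, "TotalClaims"].mean()
# Total margin: premiums collected minus claims paid
total_margin = toy["Margin"].sum()

print(claim_frequency, claim_severity, total_margin)
```

Two of the four records carry a claim, so frequency is 0.5; the two claims average 150.0; and 400.0 in premiums minus 300.0 in claims leaves a total margin of 100.0.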


## 2. Hypothesis Testing

### Null Hypothesis 1: There are no risk differences across provinces

**Metrics to test:** Claim Frequency, Claim Severity
**Test:** Chi-squared test for claim frequency, ANOVA for claim severity (if normally distributed) or Kruskal-Wallis (non-parametric).

In [5]:
# H0: No risk differences across provinces (Claim Frequency)
contingency_table_province_claim_freq = pd.crosstab(df['Province'], df['HasClaim'])
chi2, p_province_claim_freq, dof, expected = stats.chi2_contingency(contingency_table_province_claim_freq)
print(f'Chi-squared test for Claim Frequency across Provinces: p-value = {p_province_claim_freq:.4f}')

# H0: No risk differences across provinces (Claim Severity)
# Filter out provinces with very few claims to avoid issues with ANOVA
provinces_with_claims = df_claims['Province'].value_counts()
provinces_to_test = provinces_with_claims[provinces_with_claims > 30].index # Threshold for meaningful analysis

if len(provinces_to_test) > 1:
    province_groups = [df_claims[df_claims['Province'] == p]['TotalClaims'].dropna() for p in provinces_to_test]
    # Check for variance homogeneity (Levene's test) if planning ANOVA
    # stat, p_levene = stats.levene(*province_groups)
    # print(f'Levene test for variance homogeneity: p-value = {p_levene:.4f}')

    # Perform ANOVA (assuming approximate normality or large sample sizes per group)
    f_stat_province_claim_sev, p_province_claim_sev = stats.f_oneway(*province_groups)
    print(f'ANOVA for Claim Severity across Provinces (filtered): p-value = {p_province_claim_sev:.4f}')
else:
    print("Not enough provinces with sufficient claims data for ANOVA.")


Chi-squared test for Claim Frequency across Provinces: p-value = 0.0000
ANOVA for Claim Severity across Provinces (filtered): p-value = 0.0000
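
The test plan for this hypothesis names Kruskal-Wallis as the non-parametric alternative, and the cell leaves Levene's test commented out. A minimal sketch of both on synthetic data follows. The three groups, their lognormal parameters, and the seed are assumptions for illustration, not values from the dataset; in the notebook, `province_groups` would take the place of `groups`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed "claim amounts" for three hypothetical provinces.
# The third group is shifted upward on the log scale.
groups = [
    rng.lognormal(mean=8.0, sigma=1.0, size=200),
    rng.lognormal(mean=8.0, sigma=1.0, size=200),
    rng.lognormal(mean=9.0, sigma=1.0, size=200),
]

# Levene's test: H0 is that all groups share the same variance.
# center="median" is the variant that is more robust to skewed data.
levene_stat, p_levene = stats.levene(*groups, center="median")

# Kruskal-Wallis: rank-based alternative to one-way ANOVA that does not
# assume normally distributed groups.
h_stat, p_kruskal = stats.kruskal(*groups)

print(f"Levene test for variance homogeneity: p-value = {p_levene:.4f}")
print(f"Kruskal-Wallis for severity: p-value = {p_kruskal:.4f}")
```

Claim amounts are typically heavily right-skewed, so reporting the Kruskal-Wallis result next to the ANOVA result is a cheap robustness check.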


### Null Hypothesis 2: There are no risk differences between zip codes

**Metrics to test:** Claim Frequency, Claim Severity
**Test:** Chi-squared for frequency, ANOVA/Kruskal-Wallis for severity (likely need to sample or group due to high cardinality of zip codes).

In [6]:
# H0: No risk differences between zip codes (Claim Frequency and Severity)
# Due to the high cardinality of PostalCode, direct chi-squared/ANOVA may be computationally intensive or yield unreliable results.
# Consider grouping zip codes or sampling for this analysis, or focusing on top N zip codes.
# For demonstration, let's select a few top zip codes with sufficient data.
top_zipcodes = df['PostalCode'].value_counts().nlargest(10).index
df_top_zips = df[df['PostalCode'].isin(top_zipcodes)].copy()
df_claims_top_zips = df_claims[df_claims['PostalCode'].isin(top_zipcodes)].copy()

if not df_top_zips.empty:
    contingency_table_zip_claim_freq = pd.crosstab(df_top_zips['PostalCode'], df_top_zips['HasClaim'])
    if contingency_table_zip_claim_freq.shape[0] > 1 and contingency_table_zip_claim_freq.shape[1] > 1:
        chi2_zip_claim_freq, p_zip_claim_freq, dof_zip, expected_zip = stats.chi2_contingency(contingency_table_zip_claim_freq)
        print(f'Chi-squared test for Claim Frequency across Top 10 Zip Codes: p-value = {p_zip_claim_freq:.4f}')
    else:
        print("Not enough variation in top zip codes for chi-squared test on claim frequency.")

    if not df_claims_top_zips.empty:
        zipcode_groups_severity = [df_claims_top_zips[df_claims_top_zips['PostalCode'] == z]['TotalClaims'].dropna() for z in top_zipcodes]
        # Filter out empty lists or lists with too few samples for ANOVA
        zipcode_groups_severity = [g for g in zipcode_groups_severity if len(g) > 1]

        if len(zipcode_groups_severity) > 1:
            f_stat_zip_claim_sev, p_zip_claim_sev = stats.f_oneway(*zipcode_groups_severity)
            print(f'ANOVA for Claim Severity across Top 10 Zip Codes: p-value = {p_zip_claim_sev:.4f}')
        else:
            print("Not enough zip codes with sufficient claims data for ANOVA.")
    else:
        print("No claims data available for selected top zip codes for severity analysis.")
else:
    print("No data for selected top zip codes.")


Chi-squared test for Claim Frequency across Top 10 Zip Codes: p-value = 0.0000
ANOVA for Claim Severity across Top 10 Zip Codes: p-value = 0.0000
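
The comments in the cell above suggest grouping zip codes as an alternative to keeping only the top 10. One possible sketch is to collapse low-volume postal codes into a single bucket before building the contingency table. The toy data, the `ZipGroup` column, and the `min_records` threshold are all assumptions made for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: two high-volume and two low-volume postal codes
toy = pd.DataFrame({
    "PostalCode": [2000] * 60 + [8000] * 50 + [1234] * 20 + [5678] * 15,
    "HasClaim": (
        [1] * 6 + [0] * 54      # postal code 2000
        + [1] * 15 + [0] * 35   # postal code 8000
        + [1] * 3 + [0] * 17    # postal code 1234
        + [1] * 2 + [0] * 13    # postal code 5678
    ),
})

min_records = 30  # assumed threshold, mirroring the province filter
counts = toy["PostalCode"].value_counts()
rare = counts[counts < min_records].index

# Keep the label of high-volume postal codes; replace rare ones with "Other"
toy["ZipGroup"] = toy["PostalCode"].astype(str).where(
    ~toy["PostalCode"].isin(rare), "Other"
)

table = pd.crosstab(toy["ZipGroup"], toy["HasClaim"])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"Chi-squared test on grouped zip codes: p-value = {p_value:.4f}")
```

Unlike the top-10 filter, this keeps every record in the test, at the cost of treating all low-volume areas as one group.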


### Null Hypothesis 3: There are no significant margin (profit) differences between zip codes

**Metrics to test:** Margin
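**Test:** ANOVA/Kruskal-Wallis on margin. The cell below computes margin per record in the dataset rather than aggregating it per policy, and, as for Hypothesis 2, restricts the test to the 10 zip codes with the most records.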

In [7]:
# H0: No significant margin (profit) differences between zip codes
if not df_top_zips.empty:
    zipcode_groups_margin = [df_top_zips[df_top_zips['PostalCode'] == z]['Margin'].dropna() for z in top_zipcodes]
    zipcode_groups_margin = [g for g in zipcode_groups_margin if len(g) > 1]

    if len(zipcode_groups_margin) > 1:
        f_stat_zip_margin, p_zip_margin = stats.f_oneway(*zipcode_groups_margin)
        print(f'ANOVA for Margin across Top 10 Zip Codes: p-value = {p_zip_margin:.4f}')
    else:
        print("Not enough zip codes with sufficient margin data for ANOVA.")
else:
    print("No data for selected top zip codes.")


ANOVA for Margin across Top 10 Zip Codes: p-value = 0.3964
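
The ANOVA above is run on record-level margins. If margin per policy is wanted instead, the records can be aggregated first. A minimal sketch on hypothetical data follows; it assumes each `PolicyID` maps to a single `PostalCode`, which has not been verified for the real dataset.

```python
import pandas as pd

# Hypothetical monthly records for three policies
toy = pd.DataFrame({
    "PolicyID": [1, 1, 1, 2, 2, 3],
    "PostalCode": [2000, 2000, 2000, 2000, 2000, 8000],
    "TotalPremium": [20.0, 20.0, 20.0, 50.0, 50.0, 30.0],
    "TotalClaims": [0.0, 0.0, 100.0, 0.0, 0.0, 0.0],
})

# One row per policy: sum premiums and claims over all of its records
per_policy = toy.groupby(["PolicyID", "PostalCode"], as_index=False)[
    ["TotalPremium", "TotalClaims"]
].sum()

# Policy-level margin: total premium minus total claims
per_policy["Margin"] = per_policy["TotalPremium"] - per_policy["TotalClaims"]

print(per_policy)
```

The resulting `per_policy['Margin']` values could then be split by postal code and passed to `stats.f_oneway` in the same way as the record-level margins.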


### Null Hypothesis 4: There are no significant risk differences between Women and Men

**Metrics to test:** Claim Frequency, Claim Severity
**Test:** Chi-squared for frequency, independent two-sample t-test (Welch's variant, which does not assume equal variances) for severity.

In [8]:
# H0: No significant risk difference between Women and Men (Claim Frequency)
# Filter out 'Not specified' gender for this analysis
df_gender = df[df['Gender'].isin(['Male', 'Female'])].copy()

if not df_gender.empty and df_gender['Gender'].nunique() == 2:
    contingency_table_gender_claim_freq = pd.crosstab(df_gender['Gender'], df_gender['HasClaim'])
    if contingency_table_gender_claim_freq.shape[0] == 2 and contingency_table_gender_claim_freq.shape[1] == 2:
        chi2_gender_claim_freq, p_gender_claim_freq, dof_gender, expected_gender = stats.chi2_contingency(contingency_table_gender_claim_freq)
        print(f'Chi-squared test for Claim Frequency across Gender: p-value = {p_gender_claim_freq:.4f}')
    else:
        print("Not enough data for both gender and claim status for chi-squared test.")
else:
    print("Gender data not suitable for analysis (e.g., missing 'Male' or 'Female' categories).")

# H0: No significant risk difference between Women and Men (Claim Severity)
df_claims_gender = df_claims[df_claims['Gender'].isin(['Male', 'Female'])].copy()

if not df_claims_gender.empty and df_claims_gender['Gender'].nunique() == 2:
    male_claims = df_claims_gender[df_claims_gender['Gender'] == 'Male']['TotalClaims'].dropna()
    female_claims = df_claims_gender[df_claims_gender['Gender'] == 'Female']['TotalClaims'].dropna()

    if len(male_claims) > 1 and len(female_claims) > 1:
        # Welch's t-test (equal_var=False): independent samples, no equal-variance assumption.
        # Still relies on approximately normal group means, which large samples help with.
        t_stat_gender_claim_sev, p_gender_claim_sev = stats.ttest_ind(male_claims, female_claims, equal_var=False)
        print(f'T-test for Claim Severity across Gender: p-value = {p_gender_claim_sev:.4f}')
    else:
        print("Not enough claims data for both male and female for t-test.")
else:
    print("Gender claims data not suitable for analysis.")


Chi-squared test for Claim Frequency across Gender: p-value = 0.9515
T-test for Claim Severity across Gender: p-value = 0.5680
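
The severity comparison above uses `equal_var=False`. The sketch below shows how Levene's test can be used to check the equal-variance assumption and how the two variants of `stats.ttest_ind` are called. The two groups are synthetic, with parameters chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic claim amounts for two hypothetical groups:
# same mean, very different spread
group_a = rng.normal(loc=1000.0, scale=100.0, size=300)
group_b = rng.normal(loc=1000.0, scale=400.0, size=300)

# Levene's test: a small p-value is evidence against equal variances
levene_stat, p_levene = stats.levene(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Student's t-test: pools the two variances into one estimate
t_student, p_student = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"Levene: p-value = {p_levene:.4f}")
print(f"Welch t-test: p-value = {p_welch:.4f}")
print(f"Student t-test: p-value = {p_student:.4f}")
```

When the variances clearly differ, as they do here by construction, Welch's variant is the safer default. That is the choice already made in the cell above.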


## 3. Analysis and Reporting

Based on the statistical tests performed:

### Null Hypothesis 1: There are no risk differences across provinces
*   **Result:** Rejected for both Claim Frequency (p < 0.0001) and Claim Severity (p < 0.0001; severity was tested only on provinces with more than 30 claims).
*   **Interpretation:** Both the likelihood of a claim and the average claim amount differ significantly across provinces. This indicates that province is a relevant risk factor for car insurance and supports considering province-specific premium adjustments. Statistical significance does not by itself show how large the differences are, so the size of the provincial differences should be reviewed before repricing.

### Null Hypothesis 2: There are no risk differences between zip codes
*   **Result:** Rejected for both Claim Frequency (p < 0.0001) and Claim Severity (p < 0.0001) for the top 10 zip codes.
*   **Interpretation:** As with provinces, claim frequency and severity vary significantly among the 10 postal codes with the most records. This supports using granular location data for risk assessment and potentially for more targeted pricing, although the result covers only these high-volume postal codes.

### Null Hypothesis 3: There are no significant margin (profit) differences between zip codes
*   **Result:** Failed to reject (p = 0.3964) for Margin across the top 10 zip codes.
*   **Interpretation:** Although claim risk varies by zip code, the test found no statistically detectable difference in mean margin across these high-volume areas. This is consistent with existing premiums broadly offsetting the varying risk levels in these postal codes, but failing to reject the null hypothesis does not prove that margins are equal.

### Null Hypothesis 4: There are no significant risk differences between Women and Men
*   **Result:** Failed to reject for both Claim Frequency (p = 0.9515) and Claim Severity (p = 0.5680).
*   **Interpretation:** Based on this historical data, gender does not emerge as a statistically significant differentiator for either the likelihood or the average cost of claims. This finding could inform discussions on more equitable pricing strategies, potentially focusing solely on other significant risk factors identified.
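
The reject / fail-to-reject decisions above can be reproduced mechanically from the printed p-values. The sketch below assumes a significance level of 0.05, which the notebook does not state explicitly; the dictionary keys are labels invented here for readability.

```python
ALPHA = 0.05  # assumed significance level; not stated in the notebook

# p-values as printed by the cells above (0.0000 means below 0.00005)
reported = {
    "H1 province / claim frequency": 0.0000,
    "H1 province / claim severity": 0.0000,
    "H2 zip code / claim frequency": 0.0000,
    "H2 zip code / claim severity": 0.0000,
    "H3 zip code / margin": 0.3964,
    "H4 gender / claim frequency": 0.9515,
    "H4 gender / claim severity": 0.5680,
}


def decide(p_value, alpha=ALPHA):
    """Return the test decision for a p-value at the given significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


decisions = {name: decide(p) for name, p in reported.items()}

for name, decision in decisions.items():
    print(f"{name}: p = {reported[name]:.4f} -> {decision}")
```

With seven tests run on the same data, a stricter threshold such as a Bonferroni-adjusted `ALPHA / len(reported)` would leave every decision listed here unchanged.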
