# Task 3: Hypothesis Testing

This notebook performs A/B testing to assess whether significant risk differences exist across customer segments such as provinces, postal codes, and genders.

**Goals**:
- Validate or reject predefined hypotheses using statistical tests.
- Focus on metrics: Claim Frequency, Loss Ratio, Margin, Claim Severity.
- Use t-tests, one-way ANOVA, Mann–Whitney U, and chi-squared tests as appropriate.

### Load Data

In [None]:
import pandas as pd
import numpy as np

# Load the raw pipe-delimited data (swap in the cleaned EDA output if one has been saved)
df = pd.read_csv("../data/raw/MachineLearningRating_v3.txt", sep='|', parse_dates=['TransactionMonth'])

# Derived metrics
df['LossRatio'] = np.where(df['TotalPremium'] > 0, df['TotalClaims'] / df['TotalPremium'], np.nan)
df['Margin'] = df['TotalPremium'] - df['TotalClaims']
df['HasClaim'] = (df['TotalClaims'] > 0).astype(int)

# View sample
df[['Province', 'Gender', 'LossRatio', 'Margin', 'HasClaim']].head()

## Step 1: Hypothesis H1 – Risk Differences Across Provinces

- **Null Hypothesis (H₀)**: No difference in average Loss Ratio across provinces.
- **Metric**: LossRatio (continuous).
- **Test**: One-way ANOVA when more than two provinces are compared; a two-sample t-test when there are exactly two.

In [None]:
# Group counts
province_counts = df['Province'].value_counts()
print(province_counts)

# Filter provinces with at least 30 policies
valid_provinces = province_counts[province_counts >= 30].index.tolist()
df_province = df[df['Province'].isin(valid_provinces)]

print("Filtered provinces:", valid_provinces)

### Visualize Group Differences

In [None]:
import matplotlib.pyplot as plt

# Mean LossRatio per province
loss_by_province = df_province.groupby('Province')['LossRatio'].mean().sort_values(ascending=False)

# Bar plot
plt.figure(figsize=(8, 4))
loss_by_province.plot(kind='bar', color='coral', edgecolor='black')
plt.title('Mean Loss Ratio by Province')
plt.ylabel('Mean Loss Ratio')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### Run ANOVA Test

In [None]:
from scipy.stats import f_oneway

# Group LossRatio arrays
groups = [group['LossRatio'].dropna() for name, group in df_province.groupby('Province') if len(group) >= 30]

# Run ANOVA (1-way)
stat, p_value = f_oneway(*groups)
print(f"ANOVA result: F-stat = {stat:.3f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("✅ Reject null hypothesis: significant difference in LossRatio across provinces.")
else:
    print("❌ Fail to reject null hypothesis: no significant difference found.")
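
One-way ANOVA assumes roughly normal values within each group, and loss ratios are often right-skewed. The sketch below shows one way to check that assumption and fall back to a rank-based test. It runs on synthetic data; the province labels and exponential distributions are assumptions made for illustration, not values from this dataset.

```python
import numpy as np
from scipy.stats import shapiro, kruskal

rng = np.random.default_rng(42)

# Synthetic, right-skewed stand-ins for per-province LossRatio values
# (hypothetical data, not drawn from the notebook's dataset).
synthetic_groups = {
    "Province A": rng.exponential(scale=0.3, size=200),
    "Province B": rng.exponential(scale=0.3, size=200),
    "Province C": rng.exponential(scale=0.6, size=200),
}

# Shapiro-Wilk per group: a small p-value suggests the group is not normal.
normality_p = {name: shapiro(values).pvalue for name, values in synthetic_groups.items()}
for name, p in normality_p.items():
    print(f"{name}: Shapiro-Wilk p = {p:.4g}")

# Kruskal-Wallis compares groups using ranks, so it does not assume normality.
kw_stat, kw_p = kruskal(*synthetic_groups.values())
print(f"Kruskal-Wallis: H = {kw_stat:.3f}, p = {kw_p:.4g}")
```

If the real per-province groups fail the normality check, the Kruskal-Wallis p-value is arguably the safer one to report alongside the ANOVA result.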

## Step 2: Hypothesis H4 – Risk Differences by Gender

### H4a – Claim Frequency
- **Null Hypothesis**: Claim frequency is equal for males and females.
- **Metric**: HasClaim (binary)
- **Test**: Chi-squared test

### H4b – Claim Severity
- **Null Hypothesis**: Claim severity is equal for males and females.
- **Metric**: TotalClaims (for HasClaim == 1)
- **Test**: Mann–Whitney U test (non-parametric)

### Claim Frequency by Gender (Chi-squared)

In [None]:
from scipy.stats import chi2_contingency

# Build contingency table
contingency = pd.crosstab(df['Gender'], df['HasClaim'])
print("Contingency Table:\n", contingency)

# Chi-squared test
stat, p, dof, expected = chi2_contingency(contingency)

print(f"\nChi-squared test:\nChi² = {stat:.3f}, p = {p:.4f}")
if p < 0.05:
    print("✅ Reject H₀: Claim frequency differs by gender.")
else:
    print("❌ Fail to reject H₀: No significant difference in claim frequency between genders.")
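
With many policies, a chi-squared test can flag very small differences as significant, so an effect size helps put the p-value in context. The sketch below computes per-group claim rates and Cramér's V from a contingency table. The counts are hypothetical and only illustrate the calculation.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts, for illustration only (not the notebook's data).
# Columns: 0 = no claim, 1 = at least one claim.
table = pd.DataFrame(
    {0: [9500, 9400], 1: [500, 600]},
    index=["Female", "Male"],
)

# For 2x2 tables chi2_contingency applies Yates' continuity correction by default.
chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1.
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Share of policies with a claim in each group.
claim_rates = table[1] / table.sum(axis=1)
print(claim_rates)
print(f"Chi² = {chi2:.3f}, p = {p:.4f}, Cramér's V = {cramers_v:.4f}")
```

A significant p-value paired with a V close to zero would suggest a real but practically small difference in claim frequency.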

### Claim Severity by Gender (Mann–Whitney)

In [None]:
from scipy.stats import mannwhitneyu

# Filter to only those who had claims
claimants = df[df['HasClaim'] == 1]

# Check sample sizes
print(claimants['Gender'].value_counts())

# Group values
male_claims = claimants[claimants['Gender'] == 'Male']['TotalClaims'].dropna()
female_claims = claimants[claimants['Gender'] == 'Female']['TotalClaims'].dropna()

# Mann–Whitney test
stat, p_value = mannwhitneyu(male_claims, female_claims, alternative='two-sided')

print(f"\nMann–Whitney U test on Claim Severity:\nU = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("✅ Reject H₀: Claim severity differs between genders.")
else:
    print("❌ Fail to reject H₀: No significant difference in claim severity between genders.")
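
The U statistic is hard to interpret on its own. One common companion is the rank-biserial correlation, derived from U and the two sample sizes. The sketch below uses synthetic lognormal claim amounts (an assumption for illustration) and assumes SciPy 1.7 or later, where `mannwhitneyu` returns the U statistic of the first sample.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical skewed claim amounts (lognormal), not the notebook's data.
male_amounts = rng.lognormal(mean=9.0, sigma=1.0, size=300)
female_amounts = rng.lognormal(mean=9.0, sigma=1.0, size=250)

u_stat, p_value = mannwhitneyu(male_amounts, female_amounts, alternative="two-sided")

n1, n2 = len(male_amounts), len(female_amounts)

# U / (n1 * n2) estimates P(male claim > female claim), counting ties as half.
prob_superiority = u_stat / (n1 * n2)

# Rank-biserial correlation rescales that probability to the range [-1, 1].
rank_biserial = 2 * prob_superiority - 1

print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
print(f"P(male > female) = {prob_superiority:.3f}, rank-biserial r = {rank_biserial:.3f}")
```

Values of r near 0 indicate that neither group tends to have larger claims, which is the pattern a non-significant result would be expected to show.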


### H4a – Claim Frequency (Chi-squared)

- **Metric**: HasClaim
- **Test**: Chi-squared test
- **p-value**: 0.0266
- ✅ **Conclusion**: Reject H₀ → Male and female claim frequencies differ.

---

### H4b – Claim Severity (Mann–Whitney U)

- **Metric**: TotalClaims (HasClaim == 1)
- **Test**: Mann–Whitney U
- **p-value**: 0.2235
- ❌ **Conclusion**: Fail to reject H₀ → No strong evidence of difference in claim severity between genders.

---

**Business Implication**: Gender appears to be associated with claim frequency, but not with claim amount. Pricing strategies can account for this only where doing so complies with regulatory fairness requirements.


## Step 3: Hypothesis H2 – Loss Ratio Differences by Zip Code

- **Null Hypothesis**: Loss Ratio is equal across zip codes.
- **Metric**: LossRatio
- **Test**: ANOVA or pairwise t-tests (if narrowed to top N zip codes)

In [None]:
# Zip code distribution
zip_counts = df['PostalCode'].value_counts()
top_zips = zip_counts[zip_counts >= 30].index.tolist()[:5]  # Top 5 zip codes with ≥30 entries

# Filter for top zip codes only
df_zip = df[df['PostalCode'].isin(top_zips)]

print("Top zip codes used:", top_zips)

### Visualize Loss Ratio By Zip Code

In [None]:
import matplotlib.pyplot as plt

loss_by_zip = df_zip.groupby('PostalCode')['LossRatio'].mean().sort_values(ascending=False)

plt.figure(figsize=(8, 4))
loss_by_zip.plot(kind='bar', color='mediumseagreen', edgecolor='black')
plt.title('Mean Loss Ratio by Zip Code')
plt.ylabel('Mean Loss Ratio')
plt.xlabel('Zip Code')
plt.tight_layout()
plt.show()

### ANOVA Test

In [None]:
from scipy.stats import f_oneway

# LossRatio values grouped by zip
groups = [g['LossRatio'].dropna() for _, g in df_zip.groupby('PostalCode')]

stat, p_value = f_oneway(*groups)
print(f"ANOVA result: F = {stat:.3f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("✅ Reject H₀ (H2): Significant differences in Loss Ratio between zip codes.")
else:
    print("❌ Fail to reject H₀ (H2): No significant difference found.")

### H2 – Loss Ratio by Zip Code (ANOVA)

- **Metric**: LossRatio
- **Groups**: Top 5 zip codes with ≥30 records
- **Test**: One-way ANOVA
- **p-value**: 0.0088
- ✅ **Conclusion**: Reject H₀ → Loss Ratio varies significantly by zip code.

**Implication**: Zip-level segmentation may help fine-tune pricing/risk adjustment.
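
A significant ANOVA only says that at least one zip code differs; it does not say which. The pairwise t-tests mentioned in the test plan can be sketched as below, using Welch's t-test with a Bonferroni correction. The postal codes and loss ratios are synthetic placeholders, so the printed values say nothing about the real data.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Hypothetical postal codes and loss ratios, for illustration only.
demo = pd.DataFrame({
    "PostalCode": np.repeat([1000, 2000, 3000], 100),
    "LossRatio": np.concatenate([
        rng.normal(0.30, 0.10, 100),
        rng.normal(0.30, 0.10, 100),
        rng.normal(0.60, 0.10, 100),
    ]),
})

samples = {code: g["LossRatio"].dropna() for code, g in demo.groupby("PostalCode")}
pairs = list(combinations(samples, 2))

rows = []
for a, b in pairs:
    # Welch's t-test: does not assume equal variances between the two groups.
    stat, p = ttest_ind(samples[a], samples[b], equal_var=False)
    rows.append({
        "zip_a": a,
        "zip_b": b,
        "t": stat,
        "p_raw": p,
        # Bonferroni: multiply by the number of comparisons, capped at 1.
        "p_bonferroni": min(p * len(pairs), 1.0),
    })

pairwise = pd.DataFrame(rows)
print(pairwise)
```

On the real data, `demo` would be replaced by `df_zip`; with five zip codes this yields ten comparisons.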

## Step 4: Hypothesis H3 – Margin Differences by Zip Code

- **Null Hypothesis**: Average profit margin is the same across zip codes.
- **Metric**: Margin (TotalPremium - TotalClaims)
- **Test**: One-way ANOVA (same filtered zip codes as in H2)

### Margin ANOVA

In [None]:
# Margin grouped by zip
groups_margin = [g['Margin'].dropna() for _, g in df_zip.groupby('PostalCode')]

# Run ANOVA
stat, p_value = f_oneway(*groups_margin)
print(f"ANOVA result (Margin): F = {stat:.3f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("✅ Reject H₀ (H3): Margin differs significantly by zip code.")
else:
    print("❌ Fail to reject H₀ (H3): No significant margin difference across zip codes.")

### H3 – Margin by Zip Code (ANOVA)

- **Metric**: Margin (TotalPremium – TotalClaims)
- **Groups**: Top 5 zip codes with ≥30 records
- **Test**: One-way ANOVA
- **p-value**: 0.0469
- ✅ **Conclusion**: Reject H₀ → Margin differs significantly across zip codes (p is just below the 0.05 threshold, so the evidence is weak).
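
Because several hypotheses are tested on the same data, some p-values near 0.05 may not survive a multiple-comparison correction. The sketch below applies the Holm step-down adjustment to the four p-values reported in this notebook. Treating them as one family of tests is an assumption, and `holm_adjust` is a hypothetical helper written for this example.

```python
# p-values as reported in this notebook's result summaries.
reported = {
    "H4a claim frequency by gender": 0.0266,
    "H4b claim severity by gender": 0.2235,
    "H2 loss ratio by zip code": 0.0088,
    "H3 margin by zip code": 0.0469,
}


def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, keyed like the input dict."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    adjusted = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min((m - rank) * p, 1.0)
        # Enforce monotonicity: adjusted values never decrease down the ranking.
        running_max = max(running_max, candidate)
        adjusted[name] = running_max
    return adjusted


adjusted = holm_adjust(reported)
for name, p_adj in adjusted.items():
    print(f"{name}: raw p = {reported[name]:.4f}, Holm-adjusted p = {p_adj:.4f}")
```

Under this assumption only H2 stays below 0.05 after adjustment (about 0.035), while H4a and H3 rise to roughly 0.08 and 0.09.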

## Hypothesis Tester Class

In [None]:
from scipy.stats import ttest_ind, mannwhitneyu, chi2_contingency, f_oneway

class HypothesisTester:
    def __init__(self, df):
        self.df = df.copy()

    def chi_squared_test(self, group_col, binary_col):
        """
        Performs Chi-squared test for independence.
        Used for: HasClaim ~ Gender (or categorical groups).
        """
        contingency = pd.crosstab(self.df[group_col], self.df[binary_col])
        stat, p, dof, expected = chi2_contingency(contingency)
        return {
            "test": "Chi-squared",
            "stat": stat,
            "p_value": p,
            "contingency": contingency
        }

    def mann_whitney_test(self, group_col, value_col, group1, group2):
        """
        Mann–Whitney U test for skewed continuous data between two groups.
        """
        g1 = self.df[self.df[group_col] == group1][value_col].dropna()
        g2 = self.df[self.df[group_col] == group2][value_col].dropna()
        stat, p = mannwhitneyu(g1, g2, alternative='two-sided')
        return {
            "test": "Mann–Whitney U",
            "group1_median": g1.median(),
            "group2_median": g2.median(),
            "stat": stat,
            "p_value": p
        }

    def t_test(self, group_col, value_col, group1, group2):
        """
        Welch’s t-test for difference in means.
        """
        g1 = self.df[self.df[group_col] == group1][value_col].dropna()
        g2 = self.df[self.df[group_col] == group2][value_col].dropna()
        stat, p = ttest_ind(g1, g2, equal_var=False)
        return {
            "test": "Welch’s t-test",
            "group1_mean": g1.mean(),
            "group2_mean": g2.mean(),
            "stat": stat,
            "p_value": p
        }

    def anova_test(self, group_col, value_col, min_group_size=30, top_n=None):
        """
        ANOVA for >2 groups, optionally top_n most frequent.
        """
        group_counts = self.df[group_col].value_counts()
        valid_groups = group_counts[group_counts >= min_group_size].index

        if top_n:
            valid_groups = valid_groups[:top_n]

        df_filtered = self.df[self.df[group_col].isin(valid_groups)]

        groups = [g[value_col].dropna() for _, g in df_filtered.groupby(group_col)]
        stat, p = f_oneway(*groups)
        return {
            "test": "ANOVA",
            "groups_tested": list(valid_groups),
            "stat": stat,
            "p_value": p
        }


### Testing

In [None]:
tester = HypothesisTester(df)

# H4a: Gender vs HasClaim
result_h4a = tester.chi_squared_test('Gender', 'HasClaim')
print(result_h4a)

# H4b: Gender vs Claim Severity
claimants = df[df['HasClaim'] == 1]
tester_claims = HypothesisTester(claimants)
result_h4b = tester_claims.mann_whitney_test('Gender', 'TotalClaims', 'Male', 'Female')
print(result_h4b)

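# H1: Province vs LossRatio
result_h1 = tester.anova_test('Province', 'LossRatio', min_group_size=30)
print(result_h1)

# H2: PostalCode vs LossRatio (top 5 zip codes with at least 30 records)
result_h2 = tester.anova_test('PostalCode', 'LossRatio', min_group_size=30, top_n=5)
print(result_h2)

# H3: PostalCode vs Margin (same zip codes as H2)
result_h3 = tester.anova_test('PostalCode', 'Margin', min_group_size=30, top_n=5)
print(result_h3)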
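
The result dictionaries print as raw key/value pairs. A small formatter can turn them into the same reject / fail-to-reject wording used in the summaries. `summarize` is a hypothetical helper, not part of `HypothesisTester`, and it only assumes the `test` and `p_value` keys that every method above returns.

```python
def summarize(result, alpha=0.05):
    """Format a result dict with 'test' and 'p_value' keys as a one-line decision."""
    decision = "reject H0" if result["p_value"] < alpha else "fail to reject H0"
    return f"{result['test']}: p = {result['p_value']:.4f} -> {decision}"


# Example dicts shaped like the tester's output, using p-values reported in this notebook.
print(summarize({"test": "ANOVA", "p_value": 0.0088}))
print(summarize({"test": "Mann–Whitney U", "p_value": 0.2235}))
```

Passing a stricter `alpha`, for example 0.01, changes the decision without rerunning the test.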