# Data Segmentation and Statistical Testing

## Data Segmentation

In [2]:
import pandas as pd

# Load the data
file_path = '../data/MachineLearningRating_v3.txt'  
data = pd.read_csv(file_path, delimiter='|')  

print(data.head())


   UnderwrittenCoverID  PolicyID     TransactionMonth  IsVATRegistered  \
0               145249     12827  2015-03-01 00:00:00             True   
1               145249     12827  2015-05-01 00:00:00             True   
2               145249     12827  2015-07-01 00:00:00             True   
3               145255     12827  2015-05-01 00:00:00             True   
4               145255     12827  2015-07-01 00:00:00             True   

  Citizenship          LegalType Title Language                 Bank  \
0              Close Corporation    Mr  English  First National Bank   
1              Close Corporation    Mr  English  First National Bank   
2              Close Corporation    Mr  English  First National Bank   
3              Close Corporation    Mr  English  First National Bank   
4              Close Corporation    Mr  English  First National Bank   

       AccountType  ...                    ExcessSelected CoverCategory  \
0  Current account  ...             Mobility - 

In [3]:
print(data.columns.tolist())


['UnderwrittenCoverID', 'PolicyID', 'TransactionMonth', 'IsVATRegistered', 'Citizenship', 'LegalType', 'Title', 'Language', 'Bank', 'AccountType', 'MaritalStatus', 'Gender', 'Country', 'Province', 'PostalCode', 'MainCrestaZone', 'SubCrestaZone', 'ItemType', 'mmcode', 'VehicleType', 'RegistrationYear', 'make', 'Model', 'Cylinders', 'cubiccapacity', 'kilowatts', 'bodytype', 'NumberOfDoors', 'VehicleIntroDate', 'CustomValueEstimate', 'AlarmImmobiliser', 'TrackingDevice', 'CapitalOutstanding', 'NewVehicle', 'WrittenOff', 'Rebuilt', 'Converted', 'CrossBorder', 'NumberOfVehiclesInFleet', 'SumInsured', 'TermFrequency', 'CalculatedPremiumPerTerm', 'ExcessSelected', 'CoverCategory', 'CoverType', 'CoverGroup', 'Section', 'Product', 'StatutoryClass', 'StatutoryRiskType', 'TotalPremium', 'TotalClaims']


In [4]:
# Get the two most common provinces
top_provinces = data['Province'].value_counts().index[:2]

group_prov_1 = data[data['Province'] == top_provinces[0]]
group_prov_2 = data[data['Province'] == top_provinces[1]]


In [5]:
# Get top 2 zip codes with the most records
top_postal_codes = data['PostalCode'].value_counts().index[:2]

group_zip_1 = data[data['PostalCode'] == top_postal_codes[0]]
group_zip_2 = data[data['PostalCode'] == top_postal_codes[1]]


In [6]:
group_female = data[data['Gender'].str.lower() == 'female']
group_male = data[data['Gender'].str.lower() == 'male']


## Statistical Testing

🎯 Hypothesis 1 (H₀-1):
"There are no risk differences across provinces."

We'll test this using two metrics:

Claim Frequency → Chi-squared test

Claim Severity → T-test

✅ Chi-Squared Test for Claim Frequency across Provinces
💡 Goal:
See if there's a statistically significant difference in the proportion of policies with at least one claim between the top two provinces.

📊 What we’re testing:
Group 1: Province A (e.g., Gauteng)

Group 2: Province B (e.g., Western Cape)

Metric: Proportion of rows where TotalClaims > 0

🧪 Test: Chi-Squared Test for Independence
📌 Why?
Because we’re comparing frequencies of a binary outcome (HasClaim: Yes/No) across two groups (Province A vs. Province B).
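
To make the mechanics concrete, here is a minimal standard-library sketch of the chi-squared statistic for a 2×2 table. The counts are made up for illustration and are not from the dataset, and the function name `chi2_2x2` is hypothetical. The sketch applies the Yates continuity correction, which `scipy.stats.chi2_contingency` uses by default when there is one degree of freedom, and relies on the fact that with one degree of freedom the p-value equals `erfc(sqrt(chi2 / 2))`.

```python
import math


def chi2_2x2(table):
    """Yates-corrected chi-squared statistic and p-value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and outcome were independent
            expected = row_totals[i] * col_totals[j] / n
            # Yates correction: shrink |O - E| by up to 0.5 before squaring
            diff = max(abs(observed - expected) - 0.5, 0.0)
            chi2 += diff ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Hypothetical counts: [rows with a claim, rows without a claim] per group
example_table = [[30, 970], [60, 940]]
chi2_example, p_example = chi2_2x2(example_table)
print(f"Chi-squared: {chi2_example:.2f}")
print(f"p-value: {p_example:.4f}")
```

The larger the gap between observed and expected counts relative to the expected counts, the larger the statistic and the smaller the p-value.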

In [7]:
from scipy.stats import chi2_contingency

# Step 1: Create the binary HasClaim column
data['HasClaim'] = data['TotalClaims'] > 0

# Step 2: Get the top 2 provinces
top_provinces = data['Province'].value_counts().index[:2]
prov1 = top_provinces[0]
prov2 = top_provinces[1]

# Step 3: Subset data
group_prov1 = data[data['Province'] == prov1]
group_prov2 = data[data['Province'] == prov2]

# Step 4: Build a contingency table
contingency = [
    [
        group_prov1['HasClaim'].sum(),                      # rows with a claim in prov1
        len(group_prov1) - group_prov1['HasClaim'].sum()    # rows without a claim in prov1
    ],
    [
        group_prov2['HasClaim'].sum(),                      # rows with a claim in prov2
        len(group_prov2) - group_prov2['HasClaim'].sum()    # rows without a claim in prov2
    ]
]

# Step 5: Perform chi-squared test
chi2, p, dof, expected = chi2_contingency(contingency)

print(f"Chi-squared: {chi2:.2f}")
print(f"p-value: {p:.4f}")


Chi-squared: 56.09
p-value: 0.0000


📣 Conclusion (Business Interpretation):
We reject the null hypothesis of equal claim frequency for the two largest provinces (p < 0.0001).
Claim frequency differs significantly between them, although the test covers only these two provinces and says nothing about the others.
This suggests geographical risk segmentation is warranted: provincial location should be considered in pricing or underwriting strategies.
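
A significant chi-squared result says the two claim rates differ, but not by how much or in which direction, and with a large number of rows even a small gap can be significant. A minimal sketch of that follow-up check, using made-up counts rather than the dataset's (the helper name `claim_rates` is hypothetical):

```python
def claim_rates(table):
    """Claim rate per group from rows of [with_claim, without_claim]."""
    return [with_claim / (with_claim + without_claim)
            for with_claim, without_claim in table]


# Hypothetical counts, not taken from the dataset
example_counts = [[30, 970], [60, 940]]
rate_1, rate_2 = claim_rates(example_counts)

print(f"Claim rate, group 1: {rate_1:.2%}")
print(f"Claim rate, group 2: {rate_2:.2%}")
print(f"Absolute difference: {rate_2 - rate_1:.2%}")
print(f"Relative risk (group 2 vs group 1): {rate_2 / rate_1:.2f}")
```

Reporting the rates alongside the p-value makes it easier to judge whether the difference is large enough to matter for pricing.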

✅ T-Test for Claim Severity Across Provinces
💡 Goal:
We want to see whether the average claim amount, given that a claim occurred, is significantly different between the top two provinces.

📊 What we’re testing:
Claim Severity = TotalClaims per policy (but only where TotalClaims > 0)

Group A: Province 1 (e.g., Gauteng)

Group B: Province 2 (e.g., Western Cape)

🧪 Test: Independent Two-Sample T-Test
📌 Why?
We’re comparing the means of a continuous variable (TotalClaims) between two independent groups, only for records where a claim occurred.
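
For intuition, here is a minimal standard-library sketch of the statistic behind Welch's t-test, the unequal-variance form of the two-sample t-test. The claim amounts are made up and the function name `welch_t` is hypothetical. The sketch returns the t-statistic and the Welch-Satterthwaite degrees of freedom; it does not compute a p-value, which needs the t distribution from a library such as SciPy.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    # Difference in means, scaled by the combined standard error
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical claim amounts for two groups (not from the dataset)
claims_a = [1200.0, 3500.0, 800.0, 15000.0, 2200.0]
claims_b = [2500.0, 4100.0, 30000.0, 1800.0, 9000.0, 5200.0]
t_example, df_example = welch_t(claims_a, claims_b)
print(f"T-statistic: {t_example:.2f}, degrees of freedom: {df_example:.1f}")
```

The sign of the statistic follows the order of the arguments: it is negative when the first sample's mean is lower than the second's.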

In [8]:
from scipy.stats import ttest_ind

# Step 1: Filter for policies that had claims
claims_data = data[data['TotalClaims'] > 0]

# Step 2: Get claim values by province
claims_prov1 = claims_data[claims_data['Province'] == prov1]['TotalClaims']
claims_prov2 = claims_data[claims_data['Province'] == prov2]['TotalClaims']

# Step 3: Perform independent t-test
t_stat, p_value = ttest_ind(claims_prov1, claims_prov2, equal_var=False)  # Welch's t-test

print(f"T-statistic: {t_stat:.2f}")
print(f"p-value: {p_value:.4f}")


T-statistic: -2.17
p-value: 0.0306


🎯 Decision:
Since p < 0.05, we reject the null hypothesis.

📣 Business Interpretation:
We reject the null hypothesis that claim severity is the same in the top two provinces (p = 0.0306).
Among rows with a claim, the average claim amount differs significantly between them; the negative t-statistic means the first province's average is lower than the second's.
In other words, these two provinces differ both in how often claims occur and in how large they are.

🎯 Hypothesis 2 (H₀-2):
"There are no risk differences between zip codes."

We'll use the same two metrics:

Claim Frequency → Chi-squared test

Claim Severity → T-test

✅ Chi-Squared Test for Claim Frequency Across Zip Codes
💡 Goal:
Determine if the proportion of policies with at least one claim differs significantly between the two most common postal codes.

In [9]:
# Step 1: Get top 2 postal codes
top_postal_codes = data['PostalCode'].value_counts().index[:2]
zip1 = top_postal_codes[0]
zip2 = top_postal_codes[1]

# Step 2: Subset data
group_zip1 = data[data['PostalCode'] == zip1]
group_zip2 = data[data['PostalCode'] == zip2]

# Step 3: Build contingency table for claim frequency
contingency_zip = [
    [
        group_zip1['HasClaim'].sum(),
        len(group_zip1) - group_zip1['HasClaim'].sum()
    ],
    [
        group_zip2['HasClaim'].sum(),
        len(group_zip2) - group_zip2['HasClaim'].sum()
    ]
]

# Step 4: Perform chi-squared test
chi2_zip, p_zip, dof_zip, expected_zip = chi2_contingency(contingency_zip)

print(f"Chi-squared: {chi2_zip:.2f}")
print(f"p-value: {p_zip:.4f}")


Chi-squared: 3.60
p-value: 0.0579


🎯 Decision:
Since p-value ≥ 0.05, we fail to reject the null hypothesis.

📣 Business Interpretation:
We do not have sufficient evidence to say that claim frequency differs between the top two zip codes (p = 0.0579).
This does not prove the rates are equal: the p-value is only slightly above 0.05, so the result is best read as inconclusive for these two zip codes.



✅ T-Test for Claim Severity Across Zip Codes
💡 Goal:
Check whether the average claim amount differs significantly between the top two zip codes (only among those who actually filed a claim).



In [10]:
# Step 1: Filter to policies with claims
claims_data_zip = data[data['TotalClaims'] > 0]

# Step 2: Subset data by top 2 postal codes
claims_zip1 = claims_data_zip[claims_data_zip['PostalCode'] == zip1]['TotalClaims']
claims_zip2 = claims_data_zip[claims_data_zip['PostalCode'] == zip2]['TotalClaims']

# Step 3: Perform Welch's t-test (does not assume equal variances)
t_stat_zip, p_val_zip = ttest_ind(claims_zip1, claims_zip2, equal_var=False)

print(f"T-statistic: {t_stat_zip:.2f}")
print(f"p-value: {p_val_zip:.4f}")


T-statistic: 0.39
p-value: 0.7002


🎯 Decision:
Since p-value ≥ 0.05, we fail to reject the null hypothesis.

📣 Business Interpretation:
There is no significant difference in the average claim amount between the top two zip codes (p = 0.7002).
Among customers who file claims, the data show no evidence that claim costs differ between these zip codes.

💼 Strategic Insight:
For the top two zip codes, zip code alone does not appear to be a strong differentiator for either claim frequency or severity.
Zip code segmentation may therefore have limited impact on risk-adjusted pricing, unless zip codes are further grouped by other factors. Other zip codes were not tested here.

✅ Test for Margin Differences Between Zip Codes
🧮 Metric:
Margin = TotalPremium - TotalClaims

This tells us how profitable or unprofitable a customer or group is.
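
A small worked example of the margin arithmetic, using made-up premium and claim values rather than rows from the dataset:

```python
# Hypothetical (TotalPremium, TotalClaims) pairs, one per row
example_rows = [(250.0, 0.0), (310.0, 0.0), (180.0, 2500.0), (420.0, 0.0)]

# Margin = TotalPremium - TotalClaims, computed row by row
margins = [premium - claims for premium, claims in example_rows]
mean_margin = sum(margins) / len(margins)

print("Margins:", margins)
print(f"Mean margin: {mean_margin:.2f}")
```

In this toy example a single claim is enough to turn the average margin negative, which is why the mean margin of a group can be sensitive to a few large claims.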

🎯 Hypothesis 3 (H₀-3):
"There are no significant margin (profit) differences between zip codes."

🧪 Test: Independent T-Test
We'll test if average margin differs significantly between the top 2 zip codes (same ones as before).

In [11]:
# Step 1: Create margin column
data['Margin'] = data['TotalPremium'] - data['TotalClaims']

# Step 2: Subset margin by top 2 zip codes
margin_zip1 = data[data['PostalCode'] == zip1]['Margin']
margin_zip2 = data[data['PostalCode'] == zip2]['Margin']

# Step 3: Perform Welch's t-test
t_stat_margin, p_val_margin = ttest_ind(margin_zip1, margin_zip2, equal_var=False)

print(f"T-statistic: {t_stat_margin:.2f}")
print(f"p-value: {p_val_margin:.4f}")


T-statistic: 1.16
p-value: 0.2445


🎯 Decision:
Since p-value ≥ 0.05, we fail to reject the null hypothesis.

📣 Business Interpretation:
There is no significant difference in average profit margin between the top two zip codes (p = 0.2445).
From a financial performance perspective, the data show no evidence that one of these zip codes is more profitable than the other.

✅ Risk Differences Between Women and Men
🎯 Hypothesis 4 (H₀-4):
"There are no significant risk differences between Women and Men."

We’ll evaluate this using two metrics again:

Claim Frequency (Chi-squared test)

Claim Severity (T-test, among claimants only)

✅ Chi-Squared Test for Claim Frequency by Gender
💡 Goal:
See if the proportion of claimants differs significantly between male and female customers.

In [12]:
# Step 1: Keep only rows whose Gender is exactly 'Male' or 'Female'
gender_data = data[data['Gender'].isin(['Male', 'Female'])]

# Step 2: Build contingency table
contingency_gender = pd.crosstab(gender_data['Gender'], gender_data['HasClaim'])

# Step 3: Perform chi-squared test
chi2_gender, p_gender, _, _ = chi2_contingency(contingency_gender)

print(f"Chi-squared: {chi2_gender:.2f}")
print(f"p-value: {p_gender:.4f}")


Chi-squared: 0.00
p-value: 0.9515


🎯 Decision:
Since p-value = 0.9515 ≥ 0.05, we fail to reject the null hypothesis.

📣 Business Interpretation:
There is no significant difference in the proportion of claimants between women and men.
This means men and women file claims at roughly the same rate in this dataset.
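
For reference, `pd.crosstab` in the cell above counts the rows for each (Gender, HasClaim) combination. A minimal standard-library sketch of the same tabulation, using made-up records rather than the dataset:

```python
from collections import Counter

# Hypothetical (Gender, HasClaim) pairs, not taken from the dataset
records = [
    ("Male", False), ("Female", True), ("Male", True), ("Female", False),
    ("Male", False), ("Female", False), ("Male", False),
]

# Count each (Gender, HasClaim) combination
pair_counts = Counter(records)

# One row per gender (sorted), one column per HasClaim value (False, True)
genders = sorted({gender for gender, _ in records})
gender_table = [
    [pair_counts[(gender, has_claim)] for has_claim in (False, True)]
    for gender in genders
]

for gender, row in zip(genders, gender_table):
    print(f"{gender:<8} no claim: {row[0]}  claim: {row[1]}")
```

The resulting 2×2 table of counts is the kind of input a chi-squared test of independence expects.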



✅ T-Test for Claim Severity Between Women and Men
💡 Goal:
Check if the average claim amount differs significantly between men and women (only among those who filed claims).

In [13]:
# Step 1: Filter policies with claims and valid gender
claims_gender = data[(data['TotalClaims'] > 0) & (data['Gender'].isin(['Male', 'Female']))]

# Step 2: Separate claim amounts by gender
claims_male = claims_gender[claims_gender['Gender'] == 'Male']['TotalClaims']
claims_female = claims_gender[claims_gender['Gender'] == 'Female']['TotalClaims']

# Step 3: Perform Welch's t-test (does not assume equal variances)
t_stat_gender, p_val_gender = ttest_ind(claims_male, claims_female, equal_var=False)

print(f"T-statistic: {t_stat_gender:.2f}")
print(f"p-value: {p_val_gender:.4f}")


T-statistic: -0.58
p-value: 0.5680


🎯 Decision:
Since p-value ≥ 0.05, we fail to reject the null hypothesis.

📣 Business Interpretation:
There is no significant difference in average claim amount between men and women who file claims.
This suggests claim severity is similar across genders in this dataset.
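
As a recap, the decision rule used throughout (reject H₀ when p < 0.05) can be applied in one place to the p-values printed above. This is only a sketch: the dictionary labels are shorthand introduced here, and the first p-value was printed as 0.0000, so it is entered as an upper bound of 0.0001.

```python
ALPHA = 0.05

# p-values as printed by the cells above; the first one printed as 0.0000,
# so 0.0001 is used here as an upper bound rather than the exact value
reported_p_values = {
    "Province, claim frequency": 0.0001,
    "Province, claim severity": 0.0306,
    "Zip code, claim frequency": 0.0579,
    "Zip code, claim severity": 0.7002,
    "Zip code, margin": 0.2445,
    "Gender, claim frequency": 0.9515,
    "Gender, claim severity": 0.5680,
}

decisions = {
    name: ("reject H0" if p < ALPHA else "fail to reject H0")
    for name, p in reported_p_values.items()
}

for name, decision in decisions.items():
    print(f"{name:<28} p = {reported_p_values[name]:.4f} -> {decision}")
```

Only the two province tests fall below the 0.05 threshold; the zip code and gender tests do not.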