In [1]:
# Add the src directory to the Python path
import sys
import pandas as pd
sys.path.append('../src')
# Import necessary modules
from hypothesis_testing import HypothesisTester

In [2]:

# Load the processed data
cleaned_data = pd.read_csv('../data/Processed/cleaned_data.csv', low_memory=False)


In [3]:
tester = HypothesisTester(cleaned_data)

### Risk Differences Across Provinces
#### Analysis Objective
This test examines whether there are significant differences in risk levels (measured by Total Claims) across provinces. The goal is to understand how risk varies regionally, which can inform province-specific policies or risk management strategies.
#### Hypotheses
- Null Hypothesis (H₀): No risk differences across provinces.
- Alternative Hypothesis (H₁): Risk differences exist across provinces.

In [8]:

# 1. Test risk differences across provinces
# test_risk_by_group prints its own report and returns None,
# so the return value is not printed again.
tester.test_risk_by_group('Province', 'TotalClaims')

Test: ANOVA
Statistic: 5.849413762407606
P-value: 1.6782057588675903e-07
Message: Reject Null Hypothesis: There are significant differences.


#### Results

- Test Type: ANOVA (Analysis of Variance)
  - Compares the mean Total Claims across multiple provinces.
- F-Statistic: 5.849
  - The ratio of between-province to within-province variance in Total Claims; a value well above 1 suggests that the province means differ.
- p-Value: 1.6782e-07
  - This value is far below the common significance level of 0.05, so the observed differences are statistically significant.
- Decision: Reject the null hypothesis (H₀).
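
To make the F-statistic above concrete, here is a minimal, dependency-free sketch of the one-way ANOVA computation on toy data. The function name `one_way_f` and the sample values are illustrative assumptions; the implementation of `HypothesisTester` is not shown in this notebook and may differ.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy claim amounts for three hypothetical provinces.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_f(toy_groups)
print(f"F = {f_stat:.3f} with ({df_b}, {df_w}) degrees of freedom")
```

An F value near 1 means the group means vary about as much as chance alone would produce; larger values point to real differences between groups.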
#### Conclusion
There is sufficient evidence to conclude that mean Total Claims differs across provinces. Differences of this size would be very unlikely under random variation alone, although the test does not show which provinces differ or how large the differences are.
#### Implications
1. Targeted Risk Strategy: Risk management or pricing strategies may need to be adjusted by province.
2. Further Exploration: Identify specific provinces driving the differences and tailor policies accordingly.
3. Potential Policy Adjustments: Review how premiums and policies can be aligned with the provincial risk profiles.
#### Next Steps
- Visualize Total Claims by province to pinpoint specific regional disparities.
- Investigate other factors contributing to the risk differences between provinces.
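
As a possible starting point for the first step, the sketch below summarizes claims per province with pandas. The `toy` frame and its values are invented for illustration; only the column names `Province` and `TotalClaims` come from this notebook.

```python
import pandas as pd

# Toy stand-in for cleaned_data; the values are made up.
toy = pd.DataFrame({
    "Province": ["A", "A", "B", "B", "B", "C"],
    "TotalClaims": [0.0, 100.0, 0.0, 0.0, 30.0, 500.0],
})

# Count, mean, and sum of claims per province, highest mean first.
province_summary = (
    toy.groupby("Province")["TotalClaims"]
    .agg(policies="count", mean_claims="mean", total_claims="sum")
    .sort_values("mean_claims", ascending=False)
)
print(province_summary)
```

Running the same aggregation on `cleaned_data` and plotting the `mean_claims` column as a bar chart would be one way to see which provinces stand out. The `policies` column matters here, because a high mean based on very few rows is weak evidence.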

## Risk Differences Between Zip Codes

### Analysis Objective
This test evaluates whether there are significant differences in **risk levels** (measured by `Total Claims`) between zip codes. The goal is to determine if zip code-specific risk management or pricing strategies are warranted.

### Hypotheses
- **Null Hypothesis (H₀):** No risk differences between zip codes.
- **Alternative Hypothesis (H₁):** Risk differences exist between zip codes.

In [9]:
# 2. Test risk differences between zip codes
# test_risk_by_group prints its own report and returns None,
# so the return value is not printed again.
tester.test_risk_by_group('PostalCode', 'TotalClaims')

Test: ANOVA
Statistic: 0.9419762214391849
P-value: 0.8906511279164051
Message: Accept Null Hypothesis: There are no significant differences.


### Results
- **Test Type:** ANOVA (Analysis of Variance)  
  - Compares the mean `Total Claims` across multiple zip codes.
- **F-Statistic:** 0.942  
  - A value close to 1 means that `Total Claims` varies about as much between zip codes as within them.
- **p-Value:** 0.891  
  - This p-value is much greater than the common significance level of 0.05, so the observed differences are consistent with random variation.
- **Decision:** Fail to reject the null hypothesis (H₀).

### Conclusion
There is **insufficient evidence** to reject the null hypothesis: the test detected no statistically significant risk differences between zip codes. The variations in `Total Claims` across zip codes are consistent with random fluctuation, although this result does not prove that no systematic differences exist.
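
The decision rule behind this conclusion can be sketched as a small helper. The name `describe_decision` and the default `alpha` of 0.05 are assumptions for illustration; they are not part of `HypothesisTester`.

```python
def describe_decision(p_value, alpha=0.05):
    """Return a conventional decision string for a significance test."""
    if p_value < alpha:
        return f"Reject H0 at alpha={alpha} (p={p_value:.3g})"
    # A large p-value does not prove H0; the data simply do not contradict it.
    return f"Fail to reject H0 at alpha={alpha} (p={p_value:.3g})"

# P-values copied from the zip code and province outputs in this notebook.
print(describe_decision(0.8906511279164051))
print(describe_decision(1.6782057588675903e-07))
```

The wording "fail to reject" is deliberate: the test can show that the data are incompatible with H₀, but it cannot confirm that H₀ is true.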

### Implications
1. **Uniform Risk Strategy:** Since the test found no significant differences between zip codes, zip code-specific risk management or pricing strategies may not be necessary.
2. **Resource Allocation:** The same risk management approach can be applied uniformly across zip codes, as there is no compelling evidence to suggest otherwise.
3. **Exploring Other Factors:** Further analysis could explore whether other variables, such as demographics or policy types, influence risk within zip codes.


## Profit Margin Differences by Zip Codes

### Analysis Objective
This test assesses whether there are significant differences in **profit margins** across zip codes, using `Total Premium` as a proxy. Premium is revenue rather than margin, so a fuller margin measure would also account for `Total Claims`. The objective is to determine if zip code-specific pricing or strategies are necessary for profitability.
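
For comparison, a margin column that accounts for claims could be derived as sketched below. The definition (premium minus claims per row), the column name `Margin`, and the toy values are assumptions for illustration; the analysis in this notebook tests `TotalPremium` directly.

```python
import pandas as pd

# Toy stand-in for cleaned_data; the values are made up.
toy = pd.DataFrame({
    "PostalCode": [1001, 1001, 2002, 2002],
    "TotalPremium": [100.0, 200.0, 150.0, 50.0],
    "TotalClaims": [0.0, 50.0, 300.0, 0.0],
})

# Assumed definition: margin is premium collected minus claims paid, per row.
toy["Margin"] = toy["TotalPremium"] - toy["TotalClaims"]
mean_margin_by_zip = toy.groupby("PostalCode")["Margin"].mean()
print(mean_margin_by_zip)
```

If `test_risk_by_group` accepts any numeric column, which its use with both `TotalClaims` and `TotalPremium` suggests, a column like this could be passed in place of `TotalPremium`.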

### Hypotheses
- **Null Hypothesis (H₀):** No profit margin differences between zip codes.
- **Alternative Hypothesis (H₁):** Profit margin differences exist between zip codes.


In [10]:
# 3. Test profit margin differences by zip codes
# test_risk_by_group prints its own report and returns None,
# so the return value is not printed again.
tester.test_risk_by_group('PostalCode', 'TotalPremium')

Test: ANOVA
Statistic: 10.81111575835253
P-value: 0.0
Message: Reject Null Hypothesis: There are significant differences.


### Results
- **Test Type:** ANOVA (Analysis of Variance)  
  - Compares the mean `Total Premium` across multiple zip codes.
- **F-Statistic:** 10.811  
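  - The variation in mean `Total Premium` between zip codes is roughly eleven times the variation within zip codes.
- **p-Value:** 0.0  
  - The value prints as 0.0 because it is too small to represent in floating point; it should be read as effectively zero (p < 0.001), not as exactly zero. It is well below the typical significance level of 0.05, so the observed differences are statistically significant.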
- **Decision:** Reject the null hypothesis (H₀).
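
A p-value printed as `0.0` is a floating-point underflow, not an exact zero. One way to report such values is sketched below; the helper name `format_p_value` and the 0.001 threshold for switching to scientific notation are illustrative assumptions.

```python
import sys

def format_p_value(p_value):
    """Format a p-value, reporting an upper bound when it has underflowed."""
    smallest = sys.float_info.min  # smallest positive normalized float
    if p_value < smallest:
        return f"< {smallest:.1e}"
    if p_value < 0.001:
        return f"{p_value:.2e}"
    return f"{p_value:.3f}"

# P-values copied from the outputs in this notebook.
print(format_p_value(0.0))
print(format_p_value(1.6782057588675903e-07))
print(format_p_value(0.023230509106449998))
```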

### Conclusion
There is **sufficient evidence** to reject the null hypothesis, indicating that mean `Total Premium`, the proxy used here for profit margin, differs significantly between zip codes. This suggests that pricing strategies may need to be adjusted depending on the zip code.

### Implications
1. **Targeted Pricing Strategies:** Given the significant differences in profit margins across zip codes, it may be necessary to implement differentiated pricing strategies based on zip code.
2. **Tailored Risk Management:** Zip codes with higher profit margins may warrant more focused risk management efforts to maintain profitability.
3. **Further Analysis:** Consider conducting further analyses to identify which factors (e.g., demographics, claims history) contribute to the differences in `Total Premium`.

### Next Steps
- Investigate which zip codes exhibit the highest and lowest profit margins.
- Explore the underlying factors driving the differences in profit margins, such as customer demographics or policy types.
- Develop customized pricing strategies based on zip code-specific risk and profit profiles.

## Risk Differences Between Genders

### Analysis Objective
This test examines whether there are significant differences in **risk levels** (measured by `Total Claims`) between genders. The aim is to understand if gender-based risk differences exist, which could influence gender-specific insurance policies or risk management approaches.

### Hypotheses
- **Null Hypothesis (H₀):** No risk differences between genders.
- **Alternative Hypothesis (H₁):** Risk differences exist between genders.


In [11]:
# 4. Test risk differences between genders
# test_risk_by_group prints its own report and returns None,
# so the return value is not printed again.
tester.test_risk_by_group('Gender', 'TotalClaims')

Test: ANOVA
Statistic: 3.1698468738351573
P-value: 0.023230509106449998
Message: Reject Null Hypothesis: There are significant differences.


### Results
- **Test Type:** ANOVA (Analysis of Variance)  
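  - Compares the mean `Total Claims` across the categories of the `Gender` column. With only two groups, an F value of 3.17 would give a p-value of about 0.075, so the reported p-value implies that the column holds more than two categories (for example, an unspecified value alongside male and female).
- **F-Statistic:** 3.17  
  - The between-group mean square is about three times the within-group mean square.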
- **p-Value:** 0.023  
  - This p-value is below the standard significance level of 0.05, indicating that the observed differences are statistically significant.
- **Decision:** Reject the null hypothesis (H₀).

### Conclusion
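There is **sufficient evidence** to reject the null hypothesis at the 0.05 level, indicating that significant risk differences exist between genders. The evidence is weaker than for provinces (p ≈ 0.023 versus p ≈ 1.7e-07), and the test does not show which gender categories differ. Gender-based considerations may therefore be relevant in risk management and pricing strategies, but the result warrants follow-up before acting on it.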
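
Statistical significance does not measure how large a difference is, especially in large datasets. A rough effect size (eta squared) can be recovered from the F-statistic and its degrees of freedom, as sketched below. The degrees of freedom used here are placeholders, because the notebook does not report the number of groups or rows.

```python
def eta_squared(f_stat, df_between, df_within):
    """Share of total variance explained by group membership in one-way ANOVA."""
    # From F = (SS_between / df_between) / (SS_within / df_within).
    explained = f_stat * df_between
    return explained / (explained + df_within)

# F is copied from the gender output above; both degrees of freedom are
# hypothetical placeholders and should be replaced with the real values.
hypothetical_df_between = 3
hypothetical_df_within = 100_000
effect = eta_squared(3.1698468738351573, hypothetical_df_between, hypothetical_df_within)
print(f"eta squared = {effect:.6f}")
```

A value close to zero would mean that gender explains very little of the variation in `Total Claims`, even though the test is significant.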

### Implications
1. **Gender-Specific Risk Management:** The observed differences in risk levels between gender categories may require differentiated risk management or pricing strategies, once it is clear which categories differ.
2. **Further Exploration:** Investigate additional factors that might contribute to gender-based risk differences, such as age or health status.
3. **Policy Adjustments:** Consider revisiting existing policies and adjusting them based on gender-specific risk insights.
