# Statistical Modeling for AlphaCare Insurance Solutions (ACIS)

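This notebook fits regression models (Linear Regression, Random Forest, and XGBoost) to the insurance claim data to predict `TotalPremium`, then examines which features drive the predictions using Random Forest feature importances and SHAP values.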

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os
import shap

# Define the path to the src directory
src_dir = os.path.abspath(os.path.join(os.getcwd(), '..', 'src'))
sys.path.insert(0, src_dir)

# Drop any cached copies so that edits to the modules in src/ are picked up on re-import
for module_name in ('data_loader', 'statistical_modeling'):
    sys.modules.pop(module_name, None)

from data_loader import DataLoader
from statistical_modeling import StatisticalModeling

## Load and Prepare Data

In [3]:
data_loader = DataLoader('../resources/Data/machineLearning.txt')
data = data_loader.load_data()
print(data.head())
data.info()  # info() prints its summary directly and returns None, so no print() is needed

  self.data = pd.read_csv(self.file_path, sep='|')


   UnderwrittenCoverID  PolicyID     TransactionMonth  IsVATRegistered  \
0               145249     12827  2015-03-01 00:00:00             True   
1               145249     12827  2015-05-01 00:00:00             True   
2               145249     12827  2015-07-01 00:00:00             True   
3               145255     12827  2015-05-01 00:00:00             True   
4               145255     12827  2015-07-01 00:00:00             True   

  Citizenship          LegalType Title Language                 Bank  \
0              Close Corporation    Mr  English  First National Bank   
1              Close Corporation    Mr  English  First National Bank   
2              Close Corporation    Mr  English  First National Bank   
3              Close Corporation    Mr  English  First National Bank   
4              Close Corporation    Mr  English  First National Bank   

       AccountType  ...                    ExcessSelected CoverCategory  \
0  Current account  ...             Mobility - 

In [6]:
modeling = StatisticalModeling(data)
modeling.prepare_data(target='TotalPremium')

ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
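
The `ValueError` above means that the data reaching `train_test_split` had zero rows. The body of `prepare_data` is not shown here, so the cause is an assumption: one common culprit is a blanket `dropna()` on a frame in which at least one column is missing in every row. The sketch below uses a small made-up frame and a hypothetical helper, `missing_report`, to show how this can be checked before splitting. The column values are illustrative only and are not taken from the ACIS data.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame with made-up values; one column is missing in every row.
toy = pd.DataFrame({
    "TotalPremium": [21.9, 21.9, 0.0, 512.8],
    "SumInsured": [0.01, 0.01, 0.01, 119300.0],
    "EmptyColumn": [np.nan, np.nan, np.nan, np.nan],
})

report = missing_report(toy)
print(report)

# A blanket dropna() discards every row, which would lead to n_samples=0.
rows_after_blanket_drop = len(toy.dropna())

# Dropping mostly-missing columns first (threshold of 0.5 is an arbitrary choice)
# keeps the rows that are otherwise complete.
mostly_missing = report[report > 0.5].index
cleaned = toy.drop(columns=mostly_missing).dropna()
rows_after_column_drop = len(cleaned)

print(rows_after_blanket_drop, rows_after_column_drop)
```

Running `missing_report(data)` on the loaded frame before calling `prepare_data` would show whether this explanation applies here.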

## Build and Evaluate Models

In [None]:
modeling.build_models()
results = modeling.evaluate_models()

for model, metrics in results.items():
    print(f"{model}:")
    print(f"  MSE: {metrics['MSE']:.2f}")
    print(f"  R2: {metrics['R2']:.2f}")
    print()
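
The loop above reports MSE and R2 for each model. Assuming these follow the standard definitions (the same ones used by scikit-learn's `mean_squared_error` and `r2_score`), the sketch below computes both by hand on made-up numbers. The helper names `mse` and `r2` are hypothetical and are not part of `StatisticalModeling`.

```python
import numpy as np


def mse(y_true, y_pred) -> float:
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up premiums and predictions, for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

print(f"MSE: {mse(y_true, y_pred):.2f}")
print(f"R2: {r2(y_true, y_pred):.3f}")
```

MSE is expressed in squared units of the target, so it is only comparable across models fitted to the same target. R2 is unitless, equals 1 for a perfect fit, and can be negative when a model does worse than predicting the mean.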

## Feature Importance Analysis

In [None]:
importances = modeling.feature_importance()
plt.figure(figsize=(12, 8))
importances.head(20).plot(kind='bar')
plt.title('Top 20 Feature Importances')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
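
The cell above calls `.head(20)` and `.plot(kind='bar')` on the result of `modeling.feature_importance()`, which suggests a pandas object sorted by importance. The implementation is not shown, so the following is only a sketch of one way such a result could be produced with scikit-learn's `RandomForestRegressor`. The synthetic data and the column names `signal`, `noise_a`, and `noise_b` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends only on the "signal" column.
rng = np.random.default_rng(0)
n_rows = 300
X = pd.DataFrame({
    "signal": rng.normal(size=n_rows),
    "noise_a": rng.normal(size=n_rows),
    "noise_b": rng.normal(size=n_rows),
})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=n_rows)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances, labeled by feature name and sorted largest first.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1 and describe how much each feature contributed to the splits in the fitted trees. They do not indicate the direction of a feature's effect on the prediction.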

## SHAP Analysis

In [None]:
shap_values, feature_names = modeling.shap_analysis()
shap.summary_plot(shap_values, modeling.X_test, feature_names=feature_names, max_display=20)

## Observations and Conclusions

1. Model Performance:
   - [Interpret the results of each model]
   - [Compare the performance of Linear Regression, Random Forest, and XGBoost]
   - [Discuss which model performs best and why]

2. Feature Importance:
   - [Discuss the top features identified by the Random Forest model]
   - [Explain how these features might influence TotalPremium]

3. SHAP Analysis:
   - [Interpret the SHAP summary plot]
   - [Discuss how different features impact the model's predictions]
   - [Compare SHAP results with feature importance from Random Forest]

4. Implications for ACIS:
   - [Discuss how these insights can be used to optimize pricing strategies]
   - [Suggest potential areas for product development or risk management]
   - [Recommend ways to leverage the most important features in marketing or underwriting]

5. Limitations and Future Work:
   - [Discuss any limitations of the current analysis]
   - [Suggest potential improvements or additional analyses]
   - [Recommend any additional data that could enhance the models]

Overall, this statistical modeling exercise is intended to show which factors drive premium pricing, so that ACIS can make data-driven decisions in its insurance operations.