# Task 4 · Model Training & Interpretability  
**Dataset:** `data/cleaned/cleaned_data.csv`  
**Focus:**  
1. Claim-Severity Regression (predict `TotalClaims` where claims > 0).  
2. Model comparison: Linear Regression, Random Forest, XGBoost.  
3. Model interpretability with **LIME**.  
4. Business take-aways for risk-based premium setting.  
*All preprocessing / feature-engineering code comes from `src/task_4`.*  
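
The severity target in item 1 restricts training to rows with a positive claim. The internals of `prepare_claim_severity_data` are not shown in this notebook, so the following is only a minimal sketch of that idea under stated assumptions: the helper name `split_claim_severity`, the 80/20 split, and the toy values are all hypothetical.

```python
import pandas as pd

def split_claim_severity(df, target="TotalClaims", test_frac=0.2, seed=42):
    """Keep rows with a positive claim, then make a random train/test split.

    Hypothetical helper for illustration; not the project's actual function.
    """
    claims = df[df[target] > 0]                      # severity is only defined where a claim occurred
    test = claims.sample(frac=test_frac, random_state=seed)
    train = claims.drop(index=test.index)            # remaining rows form the training set
    X_train, y_train = train.drop(columns=[target]), train[target]
    X_test, y_test = test.drop(columns=[target]), test[target]
    return X_train, X_test, y_train, y_test

# Toy data (made-up values): 10 rows, 5 of them with a positive claim
toy = pd.DataFrame({
    "TotalPremium": [21.9, 21.9, 0.0, 512.8, 0.0, 90.0, 45.0, 12.5, 60.0, 33.0],
    "TotalClaims":  [0.0, 6140.35, 0.0, 0.0, 6140.35, 0.0, 1200.0, 0.0, 350.0, 80.0],
})
X_train, X_test, y_train, y_test = split_claim_severity(toy)
print(len(X_train), len(X_test))
```

Filtering before splitting keeps zero-claim rows out of both sets, so the reported RMSE and R² describe severity given that a claim occurred.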

In [18]:
import sys
sys.path.append("../../")

In [19]:
# Core
import pandas as pd
import numpy as np

# ML & metrics
from sklearn.metrics import mean_squared_error, r2_score

# Local modules
from src.task_4.data_processing import prepare_claim_severity_data, prepare_claim_probability_data

from src.task_4.model_training import train_and_compare_models, evaluate_model
from src.task_4.interpretability import explain_model_with_lime, show_lime_explanation
from src.task_4.feature_engineering import add_features

DATA_PATH = "../../data/cleaned/cleaned_data.csv"
RANDOM_STATE = 42

In [20]:
df_clean = pd.read_csv(DATA_PATH)
print("Shape:", df_clean.shape)
display(df_clean.head())
df_clean.describe(include="all").T.head(10)

Shape: (569760, 48)


Unnamed: 0,UnderwrittenCoverID,PolicyID,TransactionMonth,IsVATRegistered,Citizenship,LegalType,Title,Language,Bank,AccountType,...,CalculatedPremiumPerTerm,ExcessSelected,CoverCategory,CoverType,Product,StatutoryClass,StatutoryRiskType,TotalPremium,TotalClaims,claim_indicator
0,145249.0,12827,2015-03-01 00:00:00,True,,close corporation,mr,english,first national bank,current account,...,25.0,mobility - windscreen,windscreen,windscreen,mobility metered taxis: monthly,commercial,ifrs constant,21.929825,0.0,False
1,145249.0,12827,2015-05-01 00:00:00,True,,close corporation,mr,english,first national bank,current account,...,25.0,mobility - windscreen,windscreen,windscreen,mobility metered taxis: monthly,commercial,ifrs constant,21.929825,6140.350877,True
2,145249.0,12827,2015-07-01 00:00:00,True,,close corporation,mr,english,first national bank,current account,...,25.0,mobility - windscreen,windscreen,windscreen,mobility metered taxis: monthly,commercial,ifrs constant,0.0,0.0,False
3,145255.0,12827,2015-05-01 00:00:00,True,,close corporation,mr,english,first national bank,current account,...,584.6468,mobility - metered taxis - r2000,own damage,own damage,mobility metered taxis: monthly,commercial,ifrs constant,512.84807,0.0,False
4,145255.0,12827,2015-07-01 00:00:00,True,,close corporation,mr,english,first national bank,current account,...,584.6468,mobility - metered taxis - r2000,own damage,own damage,mobility metered taxis: monthly,commercial,ifrs constant,0.0,6140.350877,True


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
UnderwrittenCoverID,569760.0,,,,114768.440433,59657.961505,13797.0,74389.0,112079.0,145019.0,262572.41
PolicyID,569760.0,,,,8869.084099,5075.150739,369.0,5570.0,7703.0,12032.0,22193.0
TransactionMonth,569760.0,22.0,2015-08-01 00:00:00,67553.0,,,,,,,
IsVATRegistered,569760.0,2.0,False,568959.0,,,,,,,
Citizenship,0.0,,,,,,,,,,
LegalType,569760.0,2.0,individual,568299.0,,,,,,,
Title,569760.0,3.0,mr,562668.0,,,,,,,
Language,569760.0,1.0,english,569760.0,,,,,,,
Bank,569760.0,8.0,first national bank,257306.0,,,,,,,
AccountType,569760.0,3.0,current account,323270.0,,,,,,,


In [21]:
df = add_features(df_clean)
print("Shape after feature engineering:", df.shape)

Shape after feature engineering: (569760, 54)


In [22]:
df.isna().sum().sort_values(ascending=True).tail(20)

Rebuilt                          0
Converted                        0
SumInsured                       0
TermFrequency                    0
CalculatedPremiumPerTerm         0
ExcessSelected                   0
CoverCategory                    0
CoverType                        0
Product                          0
StatutoryClass                   0
StatutoryRiskType                0
TotalPremium                     0
TotalClaims                      0
claim_indicator                  0
IsHighValue                      0
VehicleAge                       0
IsNew                            0
PowerPerCylinder                 0
MonthlyPremium                   0
Citizenship                 569760
dtype: int64

In [23]:
# Citizenship is missing for every row (569760 NaNs), so it carries no information
df = df.drop(columns=['Citizenship', 'ClaimRatio'])

In [24]:
df.isna().sum().sum()

np.int64(0)

In [25]:
X_train_reg, X_test_reg, y_train_reg, y_test_reg = prepare_claim_severity_data(df)
print(f"Train set: {X_train_reg.shape}  |  Test set: {X_test_reg.shape}")

  parsed = pd.to_datetime(series, errors='coerce')

Train set: (92238, 50)  |  Test set: (23060, 50)


In [26]:
df.isna().sum().sort_values(ascending=False).head(20)

UnderwrittenCoverID    0
PolicyID               0
IsVATRegistered        0
LegalType              0
Title                  0
Language               0
Bank                   0
AccountType            0
MaritalStatus          0
Gender                 0
Country                0
Province               0
PostalCode             0
SubCrestaZone          0
ItemType               0
mmcode                 0
VehicleType            0
RegistrationYear       0
make                   0
Model                  0
dtype: int64

In [27]:
X_train_reg.isna().sum().sum()

np.int64(0)

In [28]:
df.describe(include="all")

Unnamed: 0,UnderwrittenCoverID,PolicyID,IsVATRegistered,LegalType,Title,Language,Bank,AccountType,MaritalStatus,Gender,...,claim_indicator,VehicleAge,IsNew,PowerPerCylinder,IsHighValue,MonthlyPremium,TransactionMonth_year,TransactionMonth_month,VehicleIntroDate_year,VehicleIntroDate_month
count,569760.0,569760.0,569760,569760,569760,569760,569760,569760,569760,569760,...,569760,569760.0,569760.0,569760.0,569760.0,569760.0,569760.0,569760.0,569760.0,569760.0
unique,,,2,2,3,1,8,3,2,2,...,2,,,,,,,,,
top,,,False,individual,mr,english,first national bank,current account,not specified,Female,...,False,,,,,,,,,
freq,,,568959,568299,562668,569760,257306,323270,569584,563581,...,454462,,,,,,,,,
mean,114768.440433,8869.084099,,,,,,,,,...,,14.854853,0.0,24.231296,0.119454,114.426005,2014.806062,5.827273,2007.548117,7.374884
std,59657.961505,5075.150739,,,,,,,,,...,,3.287042,0.0,4.390139,0.324322,212.714929,0.396312,3.002611,5.957131,3.279792
min,13797.0,369.0,,,,,,,,,...,,10.0,0.0,15.333333,0.0,0.9292,2013.0,1.0,1991.0,1.0
25%,74389.0,5570.0,,,,,,,,,...,,12.0,0.0,18.75,0.0,3.2417,2015.0,3.0,2007.0,4.0
50%,112079.0,7703.0,,,,,,,,,...,,14.0,0.0,27.75,0.0,8.36325,2015.0,6.0,2010.0,8.0
75%,145019.0,12032.0,,,,,,,,,...,,17.0,0.0,27.75,0.0,90.0,2015.0,8.0,2012.0,11.0


In [29]:
assert len(X_train_reg) == len(y_train_reg), "X_train and y_train length mismatch"
assert len(X_test_reg) == len(y_test_reg), "X_test and y_test length mismatch"
print(len(X_train_reg), len(y_train_reg), len(X_test_reg), len(y_test_reg))


92238 92238 23060 23060


In [30]:
best_model, best_name, all_scores = train_and_compare_models(X_train_reg, y_train_reg, X_test_reg, y_test_reg)
print("\nAll model scores:", all_scores)
print(f"\n🟢  Best model = {best_name}")

Linear Regression Results → RMSE: 4347.74, R²: 0.0438
Random Forest Results → RMSE: 4837.29, R²: -0.1837
XGBoost Results → RMSE: 4757.96, R²: -0.1452

All model scores: {'LinearRegression': (np.float64(4347.739916554806), 0.04377097808964758), 'RandomForest': (np.float64(4837.287837817768), -0.18369187971065437), 'XGBoost': (np.float64(4757.959883084042), -0.14518686832165772)}

🟢  Best model = LinearRegression
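
The negative R² values above are easier to read with the definitions in hand: R² compares a model's squared error with that of always predicting the mean of the test target, so it drops below zero whenever the model does worse than that constant baseline. The sketch below computes both metrics from scratch on made-up numbers; the function names `rmse` and `r2` are illustrative and are not the project's `evaluate_model`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)            # model's squared error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # squared error of the mean baseline
    return float(1.0 - ss_res / ss_tot)

# Toy values for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
mean_baseline = np.full_like(y_true, y_true.mean())   # always predict the mean
poor_model = np.array([400.0, 300.0, 200.0, 100.0])   # systematically wrong ordering

print("baseline R²:", r2(y_true, mean_baseline))
print("poor model R²:", r2(y_true, poor_model))
print("poor model RMSE:", rmse(y_true, poor_model))
```

Read this way, the Random Forest and XGBoost scores indicate test predictions worse than a constant mean prediction, which usually points to overfitting or a target that the available features barely explain.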


# Model Interpretability

## Feature Importance


In [33]:
if hasattr(best_model, "feature_importances_"):
    importances = pd.Series(best_model.feature_importances_,
                            index=X_train_reg.columns).sort_values(ascending=False)
    display(importances.head(10).to_frame("Importance"))
elif best_name == "LinearRegression":
    coefs = pd.Series(best_model.coef_, index=X_train_reg.columns)
    display(coefs.sort_values(key=abs, ascending=False).head(10).to_frame("Coefficient"))
else:
    print("Model has no built-in feature_importances_.")

Unnamed: 0,Coefficient
Cylinders,-902.206323
IsVATRegistered,-659.133239
LegalType,-336.25762
AlarmImmobiliser,274.009953
Converted,-269.20271
NewVehicle,219.571274
TransactionMonth_year,-183.414234
PowerPerCylinder,-156.805334
Title,150.068979
VehicleType,-137.916064


## LIME

In [34]:
# Explain the first row of X_test_reg
lime_exp = explain_model_with_lime(
    best_model,
    X_train_reg,
    X_test_reg,
    feature_names=list(X_train_reg.columns),
    instance_idx=0,
    mode="regression"
)
show_lime_explanation(lime_exp)

Feature contributions to prediction:
Cylinders <= 4.00: 4785.4689
IsVATRegistered <= 0.00: 1206.5559
18.75 < PowerPerCylinder <= 27.75: -985.4877
75.00 < kilowatts <= 111.00: 768.3439
Converted <= 0.00: -576.0644
0.00 < TotalPremium <= 2.36: -397.2949
make <= 31.00: -385.9235
mmcode > 60058418.00: -377.3096
14.00 < CoverCategory <= 18.00: -323.6194
13.00 < CoverType <= 16.00: 271.2910
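
The contributions above come from a local surrogate: LIME perturbs the chosen row, queries the model on the perturbed samples, and fits a proximity-weighted linear model whose coefficients serve as the explanation. The sketch below illustrates that idea with NumPy only. It is a simplified stand-in, not the `lime` library and not the project's `explain_model_with_lime`; in particular it skips the binning of features that produces conditions such as `Cylinders <= 4.00`. The names `local_surrogate` and `black_box` and the kernel width are assumptions.

```python
import numpy as np

def local_surrogate(predict_fn, instance, scale, n_samples=500, kernel_width=0.75, seed=42):
    """Fit a proximity-weighted linear model around one instance.

    Returns (intercept, slopes); slopes are local per-feature effects.
    """
    rng = np.random.default_rng(seed)
    instance = np.asarray(instance, dtype=float)
    scale = np.asarray(scale, dtype=float)

    # 1. Perturb the instance with Gaussian noise scaled per feature
    noise = rng.normal(size=(n_samples, instance.size))
    samples = instance + noise * scale

    # 2. Query the black-box model on the perturbed samples
    preds = np.asarray(predict_fn(samples), dtype=float)

    # 3. Weight samples by closeness to the instance (exponential kernel)
    dist_sq = (noise ** 2).sum(axis=1)
    weights = np.exp(-dist_sq / (kernel_width ** 2 * instance.size))

    # 4. Weighted least squares on centred features, with an intercept column
    X = np.column_stack([np.ones(n_samples), samples - instance])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], preds * sw, rcond=None)
    return float(coef[0]), coef[1:]

# Toy black box: a known linear function, so the surrogate should recover it
def black_box(X):
    return 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

intercept, slopes = local_surrogate(black_box, instance=[1.0, 2.0], scale=[1.0, 1.0])
print("local prediction:", round(intercept, 4))
print("local slopes:", np.round(slopes, 4))
```

For a linear model such as the one selected here, the local slopes match the global coefficients, so LIME adds the most information when the underlying model is non-linear.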




## Claim Probability Classification

In [35]:
X_train_clf, X_test_clf, y_train_clf, y_test_clf = prepare_claim_probability_data(df)
print("Classifier train / test shapes:", X_train_clf.shape, X_test_clf.shape)

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def eval_clf(name, model, Xt, yt):
    pred = model.predict(Xt)
    return {
        "Acc": accuracy_score(yt, pred),
        "Prec": precision_score(yt, pred),
        "Rec": recall_score(yt, pred),
        "F1": f1_score(yt, pred)
    }

clf_models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=400, random_state=RANDOM_STATE, n_jobs=-1),
    "XGB": XGBClassifier(n_estimators=500, learning_rate=0.05,
                         max_depth=6, random_state=RANDOM_STATE, n_jobs=-1)
}
clf_scores = {}
for name, mdl in clf_models.items():
    mdl.fit(X_train_clf, y_train_clf)
    clf_scores[name] = eval_clf(name, mdl, X_test_clf, y_test_clf)

pd.DataFrame(clf_scores).T

Classifier train / test shapes: (455808, 53) (113952, 53)


Unnamed: 0,Acc,Prec,Rec,F1
LogReg,1.0,1.0,1.0,1.0
RF,1.0,1.0,1.0,1.0
XGB,0.998728,1.0,0.993712,0.996846
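
Scores this close to 1.0 on more than 100,000 test rows often indicate target leakage, for example a feature such as `TotalClaims` that is positive exactly when a claim occurred. Whether that is the case here depends on what `prepare_claim_probability_data` keeps, which this notebook does not show. The sketch below is one quick way to check; the helper `flag_leaky_columns`, its threshold, and the toy data are hypothetical.

```python
import pandas as pd

def flag_leaky_columns(X, y, threshold=0.999):
    """Return numeric columns where the rule `value > 0` reproduces y almost exactly."""
    y = pd.Series(y).astype(bool).reset_index(drop=True)
    flagged = {}
    for col in X.select_dtypes(include="number").columns:
        rule = X[col].reset_index(drop=True) > 0      # trivial one-feature classifier
        agreement = float((rule == y).mean())         # share of rows where rule equals target
        if agreement >= threshold:
            flagged[col] = agreement
    return flagged

# Toy data (made-up values) where TotalClaims trivially determines the target
toy_X = pd.DataFrame({
    "TotalClaims":  [0.0, 6140.35, 0.0, 6140.35],
    "TotalPremium": [21.93, 21.93, 0.0, 0.0],
})
toy_y = [False, True, False, True]
print(flag_leaky_columns(toy_X, toy_y))
```

If a column is flagged on the real training data, it would need to be dropped before the classifier scores can be trusted for pricing.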


## 🔍 Key Pricing Insights & Actions

| Insight                                    | Evidence (LIME contribution / linear coefficient) | Action                                     |
|--------------------------------------------|---------------------------------------------------|--------------------------------------------|
| ≤4 Cylinders → Higher Severity             | +4785 / -902                                      | Apply loading to ≤4-cylinder vehicles      |
| Not VAT Registered → Higher Risk           | +1206 / -659                                      | Increase base rate for non-VAT customers   |
| Converted Vehicles → Higher Risk           | -576 / -269                                       | Raise premiums for modified/converted cars |
| Alarm/Immobiliser → Lower Risk             | N/A / +274                                        | Offer discounts for vehicles with alarms   |
| New Vehicles → Lower Risk                  | N/A / +220                                        | Discount premiums for new cars             |
| Moderate Power/Cylinder → Lower Risk       | -985 / -157                                       | Consider targeted discounts                |
| Low-Premium Segment → Possibly Underpriced | -397 / N/A                                        | Review pricing in low-premium segments     |

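> **Linear Regression** was the best severity model on the test set (RMSE 4347.74, R² 0.0438); Random Forest and XGBoost both scored a negative R², so the tree-based models did not improve on it.  
> Pairing the severity model with a claim-probability classifier enables risk-based pricing: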

\[
\text{Premium} = \left[\Pr(\text{Claim}) \times \widehat{\text{Severity}}\right] + \text{Expenses} + \text{Margin}
\]
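
As a worked example of the formula above, the sketch below plugs in illustrative numbers. The function name `risk_based_premium` and all input values are made up for demonstration; they are not outputs of the models in this notebook.

```python
def risk_based_premium(p_claim, expected_severity, expenses, margin):
    """Premium = Pr(Claim) * predicted severity + expenses + margin."""
    if not 0.0 <= p_claim <= 1.0:
        raise ValueError("p_claim must be a probability in [0, 1]")
    expected_loss = p_claim * expected_severity   # pure risk premium
    return expected_loss + expenses + margin

# Illustrative inputs: 2% claim probability, 5000 predicted severity,
# 30 for expenses and 20 for profit margin
premium = risk_based_premium(p_claim=0.02, expected_severity=5000.0,
                             expenses=30.0, margin=20.0)
print(round(premium, 2))
```

Keeping the expected-loss term separate from expenses and margin makes it clear which part of the premium responds to the model outputs and which part is a business decision.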
