
# Lab 4: Model Retraining Strategies
**Objective**: Learn strategies for retraining models in response to detected drift or performance degradation.

This lab covers when and how to retrain a model, how to build a retraining pipeline, and how to assess whether retraining actually improved it.

## Step 1: Criteria for Model Retraining
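- Population Stability Index (PSI) > 0.2 for one or more features (a common rule-of-thumb threshold for significant drift)
- Drop in model performance (e.g., F1-Score falls below a chosen threshold)
- Time-based schedule (e.g., monthly retraining)
- Significant increase in prediction errors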
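
The PSI criterion above can be made concrete with a small helper. The sketch below is one common formulation and is not code from this lab: it bins a numeric feature by quantiles of a reference sample and sums `(a - e) * ln(a / e)` over the bins, where `e` and `a` are the reference and new bin proportions. The function name `population_stability_index`, the 10-bin default, and the epsilon clipping are assumptions chosen for illustration, and the data is synthetic.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI of one numeric feature between a reference and a newer sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior cut points from quantiles of the reference sample;
    # np.unique guards against duplicate cuts when values are tied.
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    cuts = np.unique(np.quantile(expected, quantiles))

    # Assign every value to a bin index in 0..len(cuts)
    n = len(cuts) + 1
    exp_idx = np.searchsorted(cuts, expected, side="right")
    act_idx = np.searchsorted(cuts, actual, side="right")
    exp_pct = np.bincount(exp_idx, minlength=n) / len(expected)
    act_pct = np.bincount(act_idx, minlength=n) / len(actual)

    # Clip so that empty bins do not produce log(0) or division by zero
    exp_pct = np.clip(exp_pct, eps, None)
    act_pct = np.clip(act_pct, eps, None)

    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Synthetic illustration (not the insurance dataset)
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)
same_dist = rng.normal(0.0, 1.0, 5000)   # drawn from the same distribution
shifted = rng.normal(1.0, 1.0, 5000)     # mean shifted by one standard deviation

psi_same = population_stability_index(reference, same_dist)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI, same distribution: {psi_same:.3f}")
print(f"PSI, shifted mean:      {psi_shifted:.3f}")
```

For the lab's data, the same function could be applied per numeric column, for example to `X_train['Vessel_Age']` against the corresponding column of a newer batch. Categorical features would need bins based on category frequencies instead of quantiles.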


In [2]:
import pandas as pd

# Load dataset
url = "https://raw.githubusercontent.com/anshupandey/MSA-analytics/refs/heads/main/datasets/Ocean_Hull_Insurance_datasetv2.csv"
df = pd.read_csv(url)
df.head()

| Unnamed: 0 | Vessel_ID | Vessel_Type | Operating_Zone | Vessel_Age | Flag_State | Inspection_Status | Weather_Risk | Piracy_Risk | Claim_Occurred | Claim_Amount | Premium | Sum_Insured | Daily_Return | Initial_Investment | Final_Value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Container Ship | Strait of Malacca | 19 | Marshall Islands | Overdue | Moderate | Moderate | 0 | 0 | 159500 | 14050000 | 0.000997 | 95700.0 | 104676.840752 |
| 1 | 2 | Container Ship | Strait of Malacca | 40 | Marshall Islands | Up-to-date | High | High | 0 | 0 | 170000 | 13000000 | 0.000362 | 102000.0 | 105374.760051 |
| 2 | 3 | Tanker | Indian Ocean | 26 | Greece | Delayed | Low | Low | 0 | 0 | 193000 | 18700000 | 0.001148 | 115800.0 | 128393.183954 |
| 3 | 4 | Bulk Carrier | South China Sea | 7 | Greece | Up-to-date | Moderate | Low | 0 | 0 | 123500 | 11650000 | 0.002023 | 74100.0 | 88881.524555 |
| 4 | 5 | Offshore Support Vessel | Mediterranean | 18 | Liberia | Delayed | Low | Low | 0 | 0 | 109000 | 7100000 | 0.000266 | 65400.0 | 66983.429972 |


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

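# Separate features from the target. Claim_Amount is only known once a claim
# has occurred, and the investment columns are not used as predictors here.
# Note: the identifier column Vessel_ID stays in X and is scaled as a numeric
# feature; consider dropping it in practice.
X = df.drop(['Claim_Occurred', 'Claim_Amount', 'Daily_Return', 'Initial_Investment', 'Final_Value'], axis=1)
y = df['Claim_Occurred']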

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

pipeline.fit(X_train, y_train)

In [4]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = pipeline.predict(X_test)
print("Before Retraining:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))

Before Retraining:
Accuracy: 0.9555555555555556
Precision: 1.0
Recall: 0.8620689655172413
F1-Score: 0.9259259259259259
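
With a baseline score recorded, the Step 1 criteria can be turned into an explicit check. The following is a minimal sketch under stated assumptions, not part of the original lab: the function name `should_retrain`, its parameters, and the default thresholds (a 0.05 drop in F1, PSI above 0.2, a 30-day schedule) are hypothetical, and the input values in the usage lines are made up for illustration, apart from the baseline F1 rounded from the output above.

```python
from datetime import date, timedelta

def should_retrain(current_f1, baseline_f1, max_psi, last_trained, today,
                   max_f1_drop=0.05, psi_threshold=0.2, max_age_days=30):
    """Return the retraining criteria that currently fire (empty list = keep the model)."""
    reasons = []
    # Drift trigger: the largest per-feature PSI exceeds the threshold
    if max_psi > psi_threshold:
        reasons.append("feature drift (PSI)")
    # Performance trigger: F1 has fallen too far below the recorded baseline
    if baseline_f1 - current_f1 > max_f1_drop:
        reasons.append("performance drop (F1)")
    # Schedule trigger: the model is older than the allowed age
    if today - last_trained > timedelta(days=max_age_days):
        reasons.append("scheduled refresh")
    return reasons

# Hypothetical monitoring snapshot
reasons = should_retrain(
    current_f1=0.84,
    baseline_f1=0.926,            # rounded baseline F1 from the evaluation above
    max_psi=0.12,
    last_trained=date(2024, 1, 1),
    today=date(2024, 1, 20),
)
print(reasons if reasons else "No retraining needed")
```

Returning the list of reasons, instead of a single boolean, keeps a record of why a retraining run was started.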


## Step 2: Retraining with Updated Dataset
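We simulate new data arriving over time and use it for retraining. Here the held-out test set stands in for the new batch. Because the refit model is then scored on the same rows it was trained on, the "After Retraining" figures are in-sample scores and are not directly comparable to the held-out "Before Retraining" scores.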

In [6]:
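# Simulate a new batch of data (the held-out test set stands in for it here)
X_new = X_test.copy()
y_new = y_test.copy()

# Retrain the model. fit() starts from scratch, so the refit pipeline
# learns only from X_new and no longer reflects X_train.
pipeline.fit(X_new, y_new)

# Evaluate on the same rows used for retraining (in-sample scores)
y_pred_new = pipeline.predict(X_new)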
print("After Retraining:")
print("Accuracy:", accuracy_score(y_new, y_pred_new))
print("Precision:", precision_score(y_new, y_pred_new))
print("Recall:", recall_score(y_new, y_pred_new))
print("F1-Score:", f1_score(y_new, y_pred_new))

After Retraining:
Accuracy: 0.9666666666666667
Precision: 1.0
Recall: 0.896551724137931
F1-Score: 0.9454545454545454
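
To judge whether retraining helps, the old and the retrained model need to be scored on data that neither has seen. The sketch below shows one way this could look and is not the lab's own method: the new batch is split into a part used for retraining and a part held out for comparison, a copy of the model is refit on old plus new data, and the better scorer is kept. The "champion" and "challenger" names, the 50/50 split, and the synthetic data from `make_classification` are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the original data and a newly arrived batch
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_old, X_new, y_old, y_new = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Champion: the model currently in use, trained on the old data only
champion = LogisticRegression(max_iter=1000).fit(X_old, y_old)

# Split the new batch: one half for retraining, one half held out
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_new, y_new, test_size=0.5, random_state=42, stratify=y_new)

# Challenger: an unfitted copy of the champion, refit on old + new data.
# clone() leaves the champion itself untouched.
challenger = clone(champion).fit(
    np.vstack([X_old, X_fit]), np.concatenate([y_old, y_fit]))

# Score both models on rows that neither was trained on
f1_champion = f1_score(y_hold, champion.predict(X_hold))
f1_challenger = f1_score(y_hold, challenger.predict(X_hold))
print(f"Champion F1 on holdout:   {f1_champion:.3f}")
print(f"Challenger F1 on holdout: {f1_challenger:.3f}")

# Promote the challenger only if it is at least as good
deployed = challenger if f1_challenger >= f1_champion else champion
```

In the lab's setting, `clone(pipeline)` could play the role of the challenger, since `clone` also works on a full `Pipeline`. Whether to retrain on old plus new data or on recent data only depends on how much the underlying distribution has shifted.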
