# Lab 3: Stability Monitoring with PSI
**Objective**: Implement the Population Stability Index (PSI) to monitor feature distribution changes over time.

This lab helps us understand how certain input features in our Ocean Hull Insurance dataset may have drifted over time.
Drift in feature distributions can lead to degraded model performance.

In [13]:
import pandas as pd
import numpy as np

# Load dataset
url = "https://raw.githubusercontent.com/anshupandey/MSA-analytics/refs/heads/main/datasets/Ocean_Hull_Insurance_datasetv2.csv"
df = pd.read_csv(url)

# Convert object columns to categorical
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')

# Split by row order (70/30): the first 70% is the reference (train) slice, the last 30% is the current slice
split_idx = int(len(df) * 0.7)
df_train = df.iloc[:split_idx].copy() # reference data
df_current = df.iloc[split_idx:].copy() # current data

df_train.shape, df_current.shape


((210, 15), (90, 15))

### PSI Calculation Logic
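PSI compares the share of observations that falls into each bucket in a reference period with the share in a current period. For numeric features the buckets are bins; for categorical features they are the categories themselves. Each bucket contributes (expected share - actual share) * ln(expected share / actual share), and PSI is the sum of these terms.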
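To make the formula concrete, here is a small worked example. The three bucket shares are hypothetical values chosen for illustration; they are not taken from the Ocean Hull Insurance dataset.

```python
import numpy as np

# Hypothetical bucket shares for illustration only (not from the dataset)
expected_share = np.array([0.5, 0.3, 0.2])  # reference period
actual_share = np.array([0.4, 0.3, 0.3])    # current period

# One PSI term per bucket: (expected - actual) * ln(expected / actual)
terms = (expected_share - actual_share) * np.log(expected_share / actual_share)
psi = terms.sum()

print(terms.round(4))
print(round(psi, 4))
```

The middle bucket contributes nothing because its share did not change, and the total comes to about 0.063. Each term is non-negative, so shares moving in either direction increase PSI.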

In [14]:
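def calculate_psi(expected, actual, buckets=10):
    """Return the PSI between a reference series (expected) and a current series (actual)."""
    floor = 0.0001  # stand-in share for empty buckets, avoids log(0) and division by zero

    def scale_range(series, buckets):
        # Assign each value to a quantile bucket based on its rank within its own series
        return pd.qcut(series.rank(method='first'), buckets, labels=False, duplicates='drop')

    if expected.dtype.name == 'category' or expected.dtype == 'object':
        # Categorical feature: compare the share of rows in each category
        expected_dist = expected.value_counts(normalize=True)
        actual_dist = actual.value_counts(normalize=True)
        all_categories = set(expected_dist.index).union(actual_dist.index)
        psi_val = 0
        for cat in all_categories:
            # value_counts on a categorical column reports unobserved categories as 0,
            # so apply the floor explicitly instead of relying on the .get default
            e_perc = max(expected_dist.get(cat, 0), floor)
            a_perc = max(actual_dist.get(cat, 0), floor)
            psi_val += (e_perc - a_perc) * np.log(e_perc / a_perc)  # PSI term for this category
    else:
        # Numeric feature: compare the share of rows in each bucket.
        # Caveat: each series is bucketed by its own ranks, so every bucket holds about
        # 1/buckets of the rows in both series and the result is ~0 regardless of drift.
        expected_bins = scale_range(expected, buckets)
        actual_bins = scale_range(actual, buckets)
        expected_perc = pd.Series(expected_bins).value_counts(normalize=True)
        actual_perc = pd.Series(actual_bins).value_counts(normalize=True)
        psi_val = 0
        for b in range(buckets):
            e_perc = expected_perc.get(b, floor)
            a_perc = actual_perc.get(b, floor)
            psi_val += (e_perc - a_perc) * np.log(e_perc / a_perc)  # PSI term for this bucket
    return psi_val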
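
A common way to bin numeric features for PSI is to derive the bucket edges from the reference data only and reuse those edges for the current data. The sketch below is one such variant, not the function that produced the results in this lab. The name `psi_with_reference_bins` and the `eps` parameter are assumptions introduced here, and the demo data is synthetic.

```python
import numpy as np

def psi_with_reference_bins(expected, actual, buckets=10, eps=1e-4):
    """PSI for numeric data, with quantile edges taken from `expected` only."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    expected = expected[~np.isnan(expected)]
    actual = actual[~np.isnan(actual)]

    # Quantile edges of the reference data; drop duplicates caused by ties
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, buckets + 1)))
    # Keep only the inner edges so the first and last buckets are open-ended
    inner = edges[1:-1]

    # Bucket index for every value, using the same edges for both series
    e_idx = np.searchsorted(inner, expected, side="right")
    a_idx = np.searchsorted(inner, actual, side="right")
    n_bins = len(inner) + 1
    e_share = np.bincount(e_idx, minlength=n_bins) / len(expected)
    a_share = np.bincount(a_idx, minlength=n_bins) / len(actual)

    # Floor empty buckets to avoid log(0) and division by zero
    e_share = np.clip(e_share, eps, None)
    a_share = np.clip(a_share, eps, None)
    return float(np.sum((e_share - a_share) * np.log(e_share / a_share)))

# Synthetic demo data (assumption, not the lab dataset)
rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 2000)
same_dist = rng.normal(0, 1, 2000)
shifted = rng.normal(1, 1, 2000)

print(psi_with_reference_bins(reference, same_dist))  # small
print(psi_with_reference_bins(reference, shifted))    # much larger
```

Because both series share the same edges, a shift in the current values changes how many rows land in each bucket, which is what PSI is meant to pick up.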


### Compute PSI for All Features

In [15]:
psi_results = []
for col in df.columns:
    if col == 'Claim_Occurred':
        continue
    psi_val = calculate_psi(df_train[col], df_current[col])
    psi_results.append({'Feature': col, 'PSI': psi_val})

psi_df = pd.DataFrame(psi_results).sort_values(by='PSI', ascending=False)
psi_df


| | Feature | PSI |
|---|---|---|
| 2 | Operating_Zone | 0.208578 |
| 1 | Vessel_Type | 0.086394 |
| 4 | Flag_State | 0.04307 |
| 6 | Weather_Risk | 0.008464 |
| 5 | Inspection_Status | 0.007128 |
| 7 | Piracy_Risk | 0.003743 |
| 3 | Vessel_Age | 0.0 |
| 0 | Vessel_ID | 0.0 |
| 8 | Claim_Amount | 0.0 |
| 9 | Premium | 0.0 |


### Interpretation Strategy
- PSI < 0.1: Stable
- 0.1 <= PSI < 0.25: Moderate Shift – **monitor these features**
- PSI >= 0.25: Significant Shift – **investigate & retrain model if needed**

From the results:
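- **Operating_Zone** (PSI ≈ 0.21) shows **moderate drift**.
- **Vessel_Type** (PSI ≈ 0.09) has the next largest value but stays below the 0.1 threshold, so it still counts as stable.
- **Flag_State**, **Weather_Risk**, **Inspection_Status** and **Piracy_Risk** are stable.
- **Vessel_Age**, **Vessel_ID**, **Claim_Amount** and **Premium** report exactly 0.0. These go through the numeric branch of `calculate_psi`, which buckets each slice by its own rank quantiles, so every bucket holds the same share in both slices. Read this 0.0 as a limitation of the binning, not as evidence of stability.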
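
The thresholds above can be turned into a small helper for labelling results. This is a minimal sketch; `classify_psi` is a name introduced here, and the two example values are copied from the results table.

```python
def classify_psi(psi_value):
    """Map a PSI value to the interpretation bands used in this lab."""
    if psi_value < 0.1:
        return "Stable"
    if psi_value < 0.25:
        return "Moderate Shift"
    return "Significant Shift"

# Two values from the results table
for feature, psi_value in [("Operating_Zone", 0.208578), ("Vessel_Type", 0.086394)]:
    print(feature, classify_psi(psi_value))
```

With a DataFrame such as `psi_df`, the same function could be applied column-wise, for example `psi_df['PSI'].apply(classify_psi)`.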

In [18]:
# Simulate drift in Weather_Risk and Vessel_Age
df_current.loc[df_current.sample(frac=0.4).index, 'Weather_Risk'] = 'High'  # set Weather_Risk to 'High' for a random 40% of rows
df_current['Vessel_Age'] += np.random.randint(-20, 50, size=df_current.shape[0])  # add a random integer from -20 to 49 to each vessel age
df_current['Vessel_Age'] = df_current['Vessel_Age'].clip(lower=5, upper=40)  # clip ages to the range 5 to 40


In [22]:
psi_results = []
for col in df.columns:
    if col == 'Claim_Occurred':
        continue
    psi_val = calculate_psi(df_train[col], df_current[col], buckets=10)
    psi_results.append({'Feature': col, 'PSI': psi_val})

psi_df = pd.DataFrame(psi_results).sort_values(by='PSI', ascending=False)
psi_df


| | Feature | PSI |
|---|---|---|
| 6 | Weather_Risk | 1.099915 |
| 2 | Operating_Zone | 0.208578 |
| 1 | Vessel_Type | 0.086394 |
| 4 | Flag_State | 0.04307 |
| 5 | Inspection_Status | 0.007128 |
| 7 | Piracy_Risk | 0.003743 |
| 3 | Vessel_Age | 0.0 |
| 0 | Vessel_ID | 0.0 |
| 8 | Claim_Amount | 0.0 |
| 9 | Premium | 0.0 |
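
Weather_Risk now lands well above the 0.25 threshold, but Vessel_Age still reports 0.0 even though its values were shifted and clipped. The sketch below suggests why, using synthetic ages rather than the lab dataset: bucketing by rank within each series gives every bucket the same share, whatever the underlying values are. The helper name `rank_bin_shares` is an assumption introduced here.

```python
import numpy as np
import pandas as pd

def rank_bin_shares(series, buckets=10):
    """Share of rows per bucket when bucketing by rank within the series itself."""
    bins = pd.qcut(series.rank(method='first'), buckets, labels=False, duplicates='drop')
    return bins.value_counts(normalize=True).sort_index()

# Synthetic data (assumption): two clearly different age distributions
rng = np.random.default_rng(42)
reference_age = pd.Series(rng.integers(5, 41, size=210))  # ages from 5 to 40
shifted_age = pd.Series(rng.integers(25, 41, size=90))    # ages from 25 to 40

ref_shares = rank_bin_shares(reference_age)
cur_shares = rank_bin_shares(shifted_age)

print(reference_age.mean(), shifted_age.mean())  # the means differ clearly
print(ref_shares.values)  # every bucket holds the same share
print(cur_shares.values)  # and so does every bucket here
```

Since the bucket shares match in both series, every PSI term is zero. Detecting a numeric shift requires buckets defined on the reference data and then applied unchanged to the current data.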
