# PSI (Population Stability Index) Monitoring Visualization

This notebook demonstrates how to read the PSI monitoring results generated by the Airflow DAG and visualize them to track model stability over time.

In [2]:
import os
import glob
import json
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

## 1. Load and Parse PSI Data

First, we'll locate and read all the PSI monitoring JSON files from the `datamart/gold/psi_monitoring` directory. We'll parse these files and load them into a pandas DataFrame for easier analysis and visualization.
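For orientation, here is a minimal sketch of the file shape the loader expects. The layout is inferred from the keys the parsing code reads (`status`, `month`, `overall`, `psi`), not from the DAG itself, so treat it as an assumption. The sample values mirror the first row of the loaded data.

```python
import json

# Hypothetical file content: the key names come from the loader's
# data.get(...) calls; the actual files may carry additional fields.
sample_text = """
{
  "status": "OK",
  "month": "2024-03",
  "overall": 0.016966,
  "psi": {"prediction": 0.016966}
}
"""

data = json.loads(sample_text)

# Flatten one file into one row: the per-feature PSI values in "psi"
# become columns next to the month and the overall score.
record = {"month": data.get("month"), "overall_psi": data.get("overall")}
record.update(data.get("psi", {}))

print(record)
# {'month': '2024-03', 'overall_psi': 0.016966, 'prediction': 0.016966}
```

Each key inside `psi` becomes its own DataFrame column, which is why the later cells treat every column other than `month`, `month_dt`, and `overall_psi` as a feature.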

In [3]:
# Define the path to the PSI monitoring data
PSI_DATA_DIR = 'datamart/gold/psi_monitoring'

# Find all PSI JSON files
psi_files = sorted(glob.glob(os.path.join(PSI_DATA_DIR, 'psi_*.json')))

if not psi_files:
    print(f"No PSI files found in '{PSI_DATA_DIR}'.")
    print("Please run the Airflow DAG to generate PSI monitoring data first.")
else:
    print(f"Found {len(psi_files)} PSI files.")

# Read and parse all PSI JSON files into a list of records
records = []
for file_path in psi_files:
    try:
        with open(file_path, 'r') as f:
            data = json.load(f)
            if data.get('status') == 'OK':
                month = data.get('month')
                overall_psi = data.get('overall')
                psi_details = data.get('psi', {})
                
                record = {'month': month, 'overall_psi': overall_psi}
                record.update(psi_details)
                records.append(record)
    except (OSError, json.JSONDecodeError) as e:
        print(f"Could not process file {os.path.basename(file_path)}: {e}")

# Create a DataFrame
if records:
    psi_df = pd.DataFrame(records)
    psi_df = psi_df.sort_values('month').reset_index(drop=True)
    
    # Convert month to datetime for better plotting
    psi_df['month_dt'] = pd.to_datetime(psi_df['month'], format='%Y-%m')
    
    print("PSI Data loaded successfully:")
    display(psi_df.head())
else:
    print("No valid PSI data could be loaded.")
    psi_df = pd.DataFrame() # Create empty dataframe to avoid errors in later cells

Found 10 PSI files.
PSI Data loaded successfully:

|   | month | overall_psi | prediction | month_dt |
|---|---|---|---|---|
| 0 | 2024-03 | 0.016966 | 0.016966 | 2024-03-01 |
| 1 | 2024-04 | 0.015263 | 0.015263 | 2024-04-01 |
| 2 | 2024-05 | 0.028406 | 0.028406 | 2024-05-01 |
| 3 | 2024-06 | 0.02246 | 0.02246 | 2024-06-01 |
| 4 | 2024-07 | 0.046465 | 0.046465 | 2024-07-01 |

## 2. Visualize Overall PSI Trend

This chart shows the trend of the overall PSI score over time. A rising trend indicates that the model's prediction distribution is drifting away from the baseline, which might signal a need for retraining.

**Thresholds:**
- **PSI < 0.1**: No significant shift. The model is stable.
- **0.1 <= PSI < 0.25**: Moderate shift. Requires monitoring.
- **PSI >= 0.25**: Significant shift. Investigation and potential model retraining are recommended.
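
To make the thresholds concrete, the sketch below computes PSI with the standard formula, the sum over bins of `(actual - expected) * ln(actual / expected)`, and maps the result to the three bands. This is not the DAG's implementation, which the notebook does not show. The function names, the pre-binned input format, and the `eps` floor for empty bins are all assumptions made for illustration.

```python
import math


def psi(expected, actual, eps=1e-6):
    """PSI between two binned distributions given as lists of bin proportions.

    `eps` (an assumed convention) floors empty bins so the logarithm is defined.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same number of bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def classify_psi(value):
    """Map a PSI value to the bands listed above."""
    if value < 0.1:
        return "stable"
    if value < 0.25:
        return "moderate shift"
    return "significant shift"


# Hypothetical two-bin example: baseline 50/50, current month 60/40.
shifted = psi([0.5, 0.5], [0.6, 0.4])
print(round(shifted, 4), classify_psi(shifted))
```

Every term in the sum is non-negative, so PSI is zero only when the two distributions match bin for bin and grows as they diverge.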


In [4]:
if not psi_df.empty:
    fig = px.line(
        psi_df, 
        x='month_dt', 
        y='overall_psi', 
        title='Overall PSI Trend Over Time',
        labels={'month_dt': 'Month', 'overall_psi': 'Overall PSI'},
        markers=True
    )

    # Add threshold lines for reference
    fig.add_hline(y=0.1, line_dash="dash", line_color="orange", annotation_text="Moderate Shift Threshold (0.1)")
    fig.add_hline(y=0.25, line_dash="dash", line_color="red", annotation_text="Significant Shift Threshold (0.25)")

    fig.update_layout(xaxis_title="Month", yaxis_title="Population Stability Index (PSI)")
    fig.show()
else:
    print("PSI DataFrame is empty. Cannot generate plot.")

## 3. Visualize Feature PSI for a Specific Month

This bar chart allows you to inspect the PSI values for individual features for a selected month. This helps identify which specific features are contributing most to the model drift.

In [5]:
if not psi_df.empty:
    # --- Select a month to analyze ---
    # By default, we'll analyze the latest month available.
    latest_month = psi_df['month'].iloc[-1]
    print(f"Analyzing features for the latest available month: {latest_month}")
    
    feature_cols = [col for col in psi_df.columns if col not in ['month', 'month_dt', 'overall_psi']]
    
    # Extract feature PSI values for the selected month
    feature_psi = psi_df[psi_df['month'] == latest_month][feature_cols].T.reset_index()
    feature_psi.columns = ['Feature', 'PSI']
    feature_psi = feature_psi.sort_values('PSI', ascending=False)

    # Plot
    fig = px.bar(
        feature_psi,
        x='Feature',
        y='PSI',
        title=f'Feature PSI for {latest_month}',
        labels={'Feature': 'Feature Name', 'PSI': 'Population Stability Index (PSI)'}
    )
    
    # Add threshold lines
    fig.add_hline(y=0.1, line_dash="dash", line_color="orange", annotation_text="Moderate Shift")
    fig.add_hline(y=0.25, line_dash="dash", line_color="red", annotation_text="Significant Shift")
    
    fig.show()
else:
    print("PSI DataFrame is empty. Cannot generate plot.")

Analyzing features for the latest available month: 2024-12


## 4. Feature PSI Heatmap

The heatmap shows the PSI of every feature across all months in a single view. This makes it easy to spot features that are consistently unstable and to identify periods when many features drifted at once.
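
One way to build such a heatmap is sketched below, assuming the `records` list from the loading cell (one dict per month holding `month`, `overall_psi`, and one key per feature). The helper names `build_psi_matrix` and `plot_psi_heatmap` are hypothetical, and the color scale and the `zmax` of 0.25 (the significant-shift threshold) are illustrative choices, not the author's settings.

```python
def build_psi_matrix(records):
    """Return (months, features, z), where z[i][j] is the PSI of features[i] in months[j]."""
    non_feature_keys = {"month", "month_dt", "overall_psi"}
    ordered = sorted(records, key=lambda r: r["month"])
    months = [r["month"] for r in ordered]
    features = sorted({k for r in ordered for k in r if k not in non_feature_keys})
    # A missing feature/month pair becomes None, which plotly renders as a gap.
    z = [[r.get(f) for r in ordered] for f in features]
    return months, features, z


def plot_psi_heatmap(records):
    """Build a plotly heatmap figure with months on the x-axis and features on the y-axis."""
    import plotly.graph_objects as go  # imported lazily; the helper above needs no third-party packages

    months, features, z = build_psi_matrix(records)
    fig = go.Figure(
        go.Heatmap(
            z=z,
            x=months,
            y=features,
            zmin=0,
            zmax=0.25,  # assumed cap: values at or above the significant-shift threshold saturate
            colorscale="YlOrRd",
            colorbar=dict(title="PSI"),
        )
    )
    fig.update_layout(
        title="Feature PSI Heatmap",
        xaxis_title="Month",
        yaxis_title="Feature",
    )
    return fig


# Sample rows taken from the loaded data shown earlier, deliberately out of order.
sample_records = [
    {"month": "2024-04", "overall_psi": 0.015263, "prediction": 0.015263},
    {"month": "2024-03", "overall_psi": 0.016966, "prediction": 0.016966},
]
months, features, z = build_psi_matrix(sample_records)
print(months, features, z)
```

In the notebook, `plot_psi_heatmap(records).show()` would render the figure. Capping the color range at 0.25 keeps a single extreme value from washing out the differences among stable features.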