## Layer 5: Early-Warning Indicators and ML Reinforcement

This notebook formalises early-warning indicators for Aadhaar identity stress and reinforces them with an unsupervised anomaly-detection model (Isolation Forest).  
The objective is to prioritise regions requiring attention while preserving transparency and governance relevance.


In [None]:
# Import required libraries for data manipulation
import pandas as pd
import numpy as np

# Load analysis-ready dataset and ensure correct date format
df = pd.read_csv('identity_metrics.csv')
df['month'] = pd.to_datetime(df['month'])

In [None]:
df[['demo_update_ratio', 'bio_update_ratio']].describe()

statistic,demo_update_ratio,bio_update_ratio
count,5062.0,5062.0
mean,11.412227,13.406626
std,41.50029,35.789282
min,0.0,0.0
25%,3.216551,3.603244
50%,5.976599,6.351876
75%,10.736147,12.186472
max,1599.297872,1101.255319


### Step 1: Rule-Based Early-Warning Indicators

Explainable thresholds are set at the 90th percentile of each update ratio to identify district-month records with unusually high Aadhaar update pressure.  
Because each threshold is a single, inspectable number, the rule stays interpretable and avoids black-box decision making.
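
The percentile rule can be illustrated on a small toy series. This is a minimal sketch, assuming a hypothetical ten-row frame named `toy` with made-up ratios; only the `quantile(0.90)` call and the strict `>` comparison mirror the notebook.

```python
import pandas as pd

# Hypothetical toy ratios for ten district-month records (not real data)
toy = pd.DataFrame(
    {"demo_update_ratio": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]}
)

# 90th percentile using pandas' default linear interpolation
threshold = toy["demo_update_ratio"].quantile(0.90)

# Strict inequality: only values above the threshold are flagged
toy["demo_update_warning"] = toy["demo_update_ratio"] > threshold

print(f"threshold: {threshold:.2f}")
print(f"share flagged: {toy['demo_update_warning'].mean():.2f}")
```

With a strict comparison against the 90th percentile, at most about one record in ten can be flagged per indicator, which is what keeps each rule selective by construction.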


In [None]:
# Define percentile-based thresholds for high update pressure
demo_warning_threshold = df['demo_update_ratio'].quantile(0.90)
bio_warning_threshold = df['bio_update_ratio'].quantile(0.90)

In [None]:
# Create binary warning flags based on update pressure thresholds
df['demo_update_warning'] = df['demo_update_ratio'] > demo_warning_threshold
df['bio_update_warning'] = df['bio_update_ratio'] > bio_warning_threshold

In [None]:
# Create a composite identity stress indicator
df['identity_stress_flag'] = (
    df['demo_update_warning'] |
    df['bio_update_warning']
)

### Step 2: Indicator Validation

The distribution of rule-based stress flags is examined to ensure selectivity.  
A well-designed early-warning system should flag a minority of regions without over-triggering alerts.
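
Because the composite flag is the logical OR of two indicators that each fire on roughly 10% of records, its share is bounded between about 10% (full overlap) and 20% (no overlap). The sketch below illustrates this with inclusion-exclusion on synthetic data; the frame `toy`, the lognormal generator, and all variable names are assumptions made for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, positively correlated ratios (illustrative only, not the real dataset)
base = rng.lognormal(mean=2.0, sigma=1.0, size=5000)
toy = pd.DataFrame({
    "demo_update_ratio": base * rng.lognormal(mean=0.0, sigma=0.5, size=5000),
    "bio_update_ratio": base * rng.lognormal(mean=0.0, sigma=0.5, size=5000),
})

demo_flag = toy["demo_update_ratio"] > toy["demo_update_ratio"].quantile(0.90)
bio_flag = toy["bio_update_ratio"] > toy["bio_update_ratio"].quantile(0.90)

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
share_either = (demo_flag | bio_flag).mean()
share_both = (demo_flag & bio_flag).mean()

print(f"flagged by either: {share_either:.3f}, flagged by both: {share_both:.3f}")
```

Read this way, a composite share of about 16% would imply that roughly 4% of records exceed both thresholds, assuming each individual flag fires on close to 10% of records.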


In [None]:
# Inspect distribution of stress flags for sanity check
df['identity_stress_flag'].value_counts(normalize=True)

identity_stress_flag,proportion
False,0.839392
True,0.160608


In [None]:
df[df['identity_stress_flag']].head(10)

index,month,state,district,age_0_5,age_5_17,age_18_greater,demo_age_5_17,demo_age_17_,bio_age_5_17,bio_age_17_,total_enrolments,total_demo_updates,total_bio_updates,demo_update_ratio,bio_update_ratio,demo_update_warning,bio_update_warning,identity_stress_flag
1,2025-03-01,Assam,Baksa,11,14,13,397.0,4438.0,1333.0,527.0,38,4835.0,1860.0,127.236842,48.947368,True,True,True
2,2025-03-01,Assam,Barpeta,24,34,10,821.0,9181.0,8751.0,2053.0,68,10002.0,10804.0,147.088235,158.882353,True,True,True
3,2025-03-01,Assam,Bongaigaon,25,33,31,699.0,8681.0,3011.0,1170.0,89,9380.0,4181.0,105.393258,46.977528,True,True,True
4,2025-03-01,Assam,Chirang,12,29,15,179.0,1711.0,1611.0,679.0,56,1890.0,2290.0,33.75,40.892857,True,True,True
5,2025-03-01,Assam,Dhemaji,20,46,19,534.0,5979.0,1622.0,834.0,85,6513.0,2456.0,76.623529,28.894118,True,True,True
6,2025-03-01,Assam,Dhubri,89,229,98,1070.0,10824.0,6765.0,1655.0,416,11894.0,8420.0,28.591346,20.240385,True,False,True
7,2025-03-01,Assam,Dibrugarh,10,12,14,338.0,5542.0,1444.0,2315.0,36,5880.0,3759.0,163.333333,104.416667,True,True,True
8,2025-03-01,Assam,Hojai,27,42,20,319.0,3439.0,1585.0,1239.0,89,3758.0,2824.0,42.224719,31.730337,True,True,True
9,2025-03-01,Assam,Kamrup,13,20,13,319.0,5222.0,2602.0,1197.0,46,5541.0,3799.0,120.456522,82.586957,True,True,True
11,2025-03-01,Assam,Lakhimpur,25,34,42,577.0,9766.0,1470.0,1353.0,101,10343.0,2823.0,102.405941,27.950495,True,True,True


In [None]:
# Import scikit-learn components used for anomaly detection in Step 4
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

### Step 3: Feature Engineering for Anomaly Detection

This step prepares four features for the anomaly model: the two update ratios and their month-on-month changes.  
Update ratios capture the intensity of Aadhaar stress, while month-on-month changes capture acceleration or persistence.
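
A grouped difference is one way to compute such month-on-month changes. The following is a minimal sketch on a hypothetical two-district panel (the frame `toy` and its values are made up); it shows that the first record of each district has no previous month and therefore receives `NaN`.

```python
import pandas as pd

# Hypothetical two-district panel (values are made up for illustration)
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B"],
    "district": ["X", "X", "X", "Y", "Y"],
    "month": pd.to_datetime(
        ["2025-03-01", "2025-04-01", "2025-05-01", "2025-03-01", "2025-04-01"]
    ),
    "demo_update_ratio": [10.0, 14.0, 11.0, 5.0, 5.5],
})

# Sort first so that diff() compares each row with the previous month
toy = toy.sort_values(["state", "district", "month"])

# diff() within each district: current ratio minus the previous row's ratio
toy["demo_ratio_mom_change"] = (
    toy.groupby(["state", "district"])["demo_update_ratio"].diff()
)

print(toy[["district", "month", "demo_update_ratio", "demo_ratio_mom_change"]])
```

Note that `diff()` compares consecutive rows, not consecutive calendar months, so a district with a missing month would yield a change spanning more than one month.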


In [None]:
# Sort data to enable correct temporal computations
df = df.sort_values(by=['state', 'district', 'month'])

In [None]:
# Compute month-on-month change in update ratios
df['demo_ratio_mom_change'] = (
    df.groupby(['state', 'district'])['demo_update_ratio']
      .diff()
)

df['bio_ratio_mom_change'] = (
    df.groupby(['state', 'district'])['bio_update_ratio']
      .diff()
)

In [None]:
df[['demo_ratio_mom_change', 'bio_ratio_mom_change']].describe()

statistic,demo_ratio_mom_change,bio_ratio_mom_change
count,3992.0,3992.0
mean,-0.730504,-3.204615
std,45.295035,37.940197
min,-1595.047144,-1086.043613
25%,0.0,-1.000005
50%,1.515495,0.790791
75%,4.656999,3.502323
max,546.0,184.0


In [None]:
# Select features for anomaly detection
features = df[
    [
        'demo_update_ratio',
        'bio_update_ratio',
        'demo_ratio_mom_change',
        'bio_ratio_mom_change'
    ]
].copy()

In [None]:
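# Replace missing values so the model can train; the NaNs come from .diff(),
# which has no previous month for each district's first record (0 = 'no change')
features = features.fillna(0)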

In [None]:
features.isna().sum()

feature,missing_count
demo_update_ratio,0
bio_update_ratio,0
demo_ratio_mom_change,0
bio_ratio_mom_change,0


### Step 4: Isolation Forest for Anomaly Ranking

An Isolation Forest model is used to rank regions by unusual Aadhaar update behaviour.  
The model is applied in an unsupervised manner to prioritise regions rather than to predict failures.
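
To illustrate ranking with an Isolation Forest, the sketch below fits the model on synthetic two-dimensional data containing one injected outlier. The data, the array `X`, and the `scores` dictionary are assumptions for illustration. It also checks that `contamination` affects only the decision threshold used by `predict()` and `decision_function()`, not the raw values from `score_samples()` that drive a ranking.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data: a dense cluster plus one obvious outlier (illustrative only)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[12.0, 12.0]]])

scores = {}
for contamination in (0.05, 0.10):
    model = IsolationForest(
        n_estimators=200, contamination=contamination, random_state=42
    )
    model.fit(X)
    # score_samples() is lower for abnormal points, so negate it:
    # larger values then mean "more anomalous"
    scores[contamination] = -model.score_samples(X)

# Compare raw scores across the two contamination settings
same_scores = np.allclose(scores[0.05], scores[0.10])

# Locate the record with the largest anomaly score
top_index = int(np.argmax(scores[0.10]))

print(f"scores identical across contamination values: {same_scores}")
print(f"index of most anomalous row: {top_index} (outlier was row {len(X) - 1})")
```

Since only the ordering of scores matters for prioritisation, the choice of `contamination` is relevant only if hard anomaly labels are later needed.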


In [None]:
# Scale features prior to model training (StandardScaler is imported above)
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

In [None]:
# Train Isolation Forest model for anomaly ranking (IsolationForest is imported above)
iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.10,
    random_state=42
)

iso_forest.fit(features_scaled)

In [None]:
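# Generate anomaly scores from the trained model; score_samples() returns lower
# values for more abnormal points, so negate it to make higher = more anomalous
df['anomaly_score'] = -iso_forest.score_samples(features_scaled)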

### Step 5: Anomaly Scoring and Ranking

Anomaly scores are converted into severity ranks to enable prioritisation.  
Rank 1 marks the most anomalous record, so records near the top of the ranking exhibit more unusual update behaviour relative to the system baseline.
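
The ranking convention can be seen on a handful of scores. This is a small sketch with hypothetical values in a frame named `toy`, showing how `ascending=False` and `method='dense'` behave when two scores tie.

```python
import pandas as pd

# Hypothetical anomaly scores, including one tie
toy = pd.DataFrame({"anomaly_score": [0.81, 0.55, 0.81, 0.40]})

# ascending=False gives rank 1 to the largest score;
# method='dense' gives tied scores the same rank and leaves no gaps afterwards
toy["anomaly_rank"] = toy["anomaly_score"].rank(ascending=False, method="dense")

print(toy)
```

Dense ranking keeps the rank values contiguous, which makes cut-offs such as "top 50 ranks" easy to read, although tied records mean such a cut-off can select more than 50 rows.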


In [None]:
# Assign anomaly severity ranks (1 = highest anomaly score; tied scores share a rank)
df['anomaly_rank'] = df['anomaly_score'].rank(
    ascending=False,
    method='dense'
)

In [None]:
# Inspect top-ranked anomalous records
df.sort_values('anomaly_score', ascending=False).head(10)

index,month,state,district,age_0_5,age_5_17,age_18_greater,demo_age_5_17,demo_age_17_,bio_age_5_17,bio_age_17_,...,total_bio_updates,demo_update_ratio,bio_update_ratio,demo_update_warning,bio_update_warning,identity_stress_flag,demo_ratio_mom_change,bio_ratio_mom_change,anomaly_score,anomaly_rank
43,2025-03-01,Maharashtra,Thane,19,13,15,6005.0,69162.0,13203.0,38556.0,...,51759.0,1599.297872,1101.255319,True,True,True,,,0.811536,1.0
26,2025-03-01,Delhi,North West Delhi,35,15,10,6373.0,62825.0,15709.0,29195.0,...,44904.0,1153.3,748.4,True,True,True,,,0.80989,2.0
127,2025-04-01,Delhi,North West Delhi,903,349,54,9362.0,53618.0,27815.0,37256.0,...,65071.0,48.223583,49.824655,True,True,True,-1105.076417,-698.575345,0.806633,3.0
223,2025-04-01,Maharashtra,Thane,2371,913,150,1303.0,13294.0,16028.0,36209.0,...,52237.0,4.250728,15.211706,False,False,False,-1595.047144,-1086.043613,0.80456,4.0
113,2025-04-01,Bihar,Muzaffarpur,1159,865,65,0.0,0.0,18807.0,16225.0,...,35032.0,0.0,16.769746,False,False,False,-638.837838,-413.122146,0.799187,5.0
303,2025-04-01,Uttar Pradesh,Gorakhpur,378,727,86,0.0,0.0,9404.0,12727.0,...,22131.0,0.0,18.581864,False,False,False,-813.5,-339.051469,0.796334,6.0
114,2025-04-01,Bihar,Patna,723,2303,373,0.0,0.0,17785.0,15295.0,...,33080.0,0.0,9.732274,False,False,False,-584.725,-366.392726,0.795769,7.0
19,2025-03-01,Bihar,Muzaffarpur,24,40,10,4456.0,42818.0,13651.0,18161.0,...,31812.0,638.837838,429.891892,True,True,True,,,0.79319,8.0
60,2025-03-01,Uttar Pradesh,Gorakhpur,13,36,11,4752.0,44058.0,8891.0,12567.0,...,21458.0,813.5,357.633333,True,True,True,,,0.792118,9.0
21,2025-03-01,Bihar,Patna,12,52,16,4807.0,41971.0,11916.0,18174.0,...,30090.0,584.725,376.125,True,True,True,,,0.791047,10.0


### Step 6: Validation of ML Reinforcement

Correlation analysis is used to assess alignment between rule-based indicators and ML anomaly scores.  
A positive but imperfect correlation (about 0.63 here, which is moderate rather than strong) indicates that machine learning broadly reinforces the explainable logic while adding prioritisation nuance.
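
A Pearson correlation between a boolean flag and a continuous score is the point-biserial correlation, which amounts to a scaled difference between the mean score of flagged and unflagged records. The sketch below checks that equivalence on hypothetical values; the frame `toy` and its numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical flags and scores (made up for illustration)
toy = pd.DataFrame({
    "identity_stress_flag": [False, False, False, True, True, False],
    "anomaly_score": [0.40, 0.45, 0.42, 0.70, 0.65, 0.60],
})

# Cast the boolean flag to 0/1 explicitly before computing Pearson's r
flag = toy["identity_stress_flag"].astype(float)
pearson = flag.corr(toy["anomaly_score"])

# Point-biserial form: difference in group means, scaled by the population
# standard deviation of the scores and by sqrt(p * (1 - p))
mean_flagged = toy.loc[toy["identity_stress_flag"], "anomaly_score"].mean()
mean_unflagged = toy.loc[~toy["identity_stress_flag"], "anomaly_score"].mean()
p = flag.mean()
point_biserial = (
    (mean_flagged - mean_unflagged)
    / toy["anomaly_score"].std(ddof=0)
    * np.sqrt(p * (1 - p))
)

print(f"pearson: {pearson:.4f}, point-biserial: {point_biserial:.4f}")
```

Interpreted this way, the coefficient summarises how much higher the anomaly scores of flagged records are on average, so comparing the two group means directly can be a useful companion check.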


In [None]:
# Validate alignment between rule-based indicators and ML scores
df[['identity_stress_flag', 'anomaly_score']].corr()

variable,identity_stress_flag,anomaly_score
identity_stress_flag,1.0,0.625277
anomaly_score,0.625277,1.0
