## Layer 4: Signal Extraction and Analytical Insights

This notebook focuses on extracting meaningful patterns, variations, and early stress signals from the prepared Aadhaar datasets.  
The objective is to identify geographic, demographic, and temporal behaviours that can inform proactive decision-making.


In [None]:
# Import required libraries for data analysis
import pandas as pd
import numpy as np

In [None]:
# Load the analysis-ready dataset (the month column is parsed to datetime in a later cell)
identity_metrics = pd.read_csv('identity_metrics.csv')

In [None]:
identity_metrics.head()

Unnamed: 0,month,state,district,age_0_5,age_5_17,age_18_greater,demo_age_5_17,demo_age_17_,bio_age_5_17,bio_age_17_,total_enrolments,total_demo_updates,total_bio_updates,demo_update_ratio,bio_update_ratio
0,2025-03-01,Andhra Pradesh,Spsr Nellore,43,44,29,0.0,0.0,0.0,0.0,116,0.0,0.0,0.0,0.0
1,2025-03-01,Assam,Baksa,11,14,13,397.0,4438.0,1333.0,527.0,38,4835.0,1860.0,127.236842,48.947368
2,2025-03-01,Assam,Barpeta,24,34,10,821.0,9181.0,8751.0,2053.0,68,10002.0,10804.0,147.088235,158.882353
3,2025-03-01,Assam,Bongaigaon,25,33,31,699.0,8681.0,3011.0,1170.0,89,9380.0,4181.0,105.393258,46.977528
4,2025-03-01,Assam,Chirang,12,29,15,179.0,1711.0,1611.0,679.0,56,1890.0,2290.0,33.75,40.892857
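
Before building on the derived columns, it can help to confirm that they match their apparent definitions. The sketch below retypes the five rows displayed above into an illustrative `sample` frame (it is not read from the CSV) and checks two assumed relationships: that `total_enrolments` is the sum of the three age buckets, and that `demo_update_ratio` is demographic updates divided by enrolments.

```python
import numpy as np
import pandas as pd

# Illustrative copy of the five rows displayed above (not loaded from the file).
sample = pd.DataFrame({
    "district": ["Spsr Nellore", "Baksa", "Barpeta", "Bongaigaon", "Chirang"],
    "age_0_5": [43, 11, 24, 25, 12],
    "age_5_17": [44, 14, 34, 33, 29],
    "age_18_greater": [29, 13, 10, 31, 15],
    "total_enrolments": [116, 38, 68, 89, 56],
    "total_demo_updates": [0.0, 4835.0, 10002.0, 9380.0, 1890.0],
    "demo_update_ratio": [0.0, 127.236842, 147.088235, 105.393258, 33.75],
})

# Assumed definition 1: enrolments are the sum of the three age buckets.
age_sum = sample[["age_0_5", "age_5_17", "age_18_greater"]].sum(axis=1)
enrolments_ok = bool((age_sum == sample["total_enrolments"]).all())

# Assumed definition 2: ratio = demographic updates per new enrolment.
recomputed = sample["total_demo_updates"] / sample["total_enrolments"]
ratio_ok = bool(np.allclose(recomputed, sample["demo_update_ratio"], atol=1e-6))

print(enrolments_ok, ratio_ok)
```

Both checks pass on these rows, which suggests the ratios are counts of updates per new enrolment rather than percentages. Running the same checks on the full `identity_metrics` frame would be needed to confirm this for every row.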


In [None]:
# Parse the month column to datetime, then confirm time coverage
identity_metrics['month'] = pd.to_datetime(identity_metrics['month'])
identity_metrics['month'].nunique(), identity_metrics['month'].sort_values().unique()

(9,
 <DatetimeArray>
 ['2025-03-01 00:00:00', '2025-04-01 00:00:00', '2025-05-01 00:00:00',
  '2025-06-01 00:00:00', '2025-07-01 00:00:00', '2025-09-01 00:00:00',
  '2025-10-01 00:00:00', '2025-11-01 00:00:00', '2025-12-01 00:00:00']
 Length: 9, dtype: datetime64[ns])
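
The output lists nine months between March and December 2025, so the series is not continuous. A minimal sketch of a gap check follows; the `observed` index is typed in from the output above, and `expected` and `missing_months` are illustrative names.

```python
import pandas as pd

# Months reported by the coverage check above.
observed = pd.to_datetime([
    "2025-03-01", "2025-04-01", "2025-05-01", "2025-06-01", "2025-07-01",
    "2025-09-01", "2025-10-01", "2025-11-01", "2025-12-01",
])

# Every month start between the first and last observed month.
expected = pd.date_range(observed.min(), observed.max(), freq="MS")

# Months a continuous series would contain but the data does not.
missing_months = expected.difference(observed)
print(list(missing_months.strftime("%Y-%m")))
```

August 2025 is the only missing month. This matters for Step 5, where a difference between consecutive rows spans two months across the July to September gap.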

### Step 1: National-Level Baseline Analysis

A national baseline is established by summing enrolments and updates across all states and districts for each month, then computing update ratios (updates per new enrolment) from those totals.  
This baseline represents the system-wide “normal” against which regional variations are assessed.


In [None]:
# Aggregate to NATIONAL level
# Now we collapse all districts and states to get a national monthly picture.
national_monthly = (
    identity_metrics
    .groupby('month', as_index=False)
    .agg({
        'total_enrolments': 'sum',
        'total_demo_updates': 'sum',
        'total_bio_updates': 'sum'
    })
)

In [None]:
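# Compute national-level ratios (updates per new enrolment, not percentages)
# These represent system-wide pressure; a month with zero enrolments yields NaN rather than inf.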
national_monthly['national_demo_update_ratio'] = (
    national_monthly['total_demo_updates'] /
    national_monthly['total_enrolments'].replace(0, np.nan)
)

national_monthly['national_bio_update_ratio'] = (
    national_monthly['total_bio_updates'] /
    national_monthly['total_enrolments'].replace(0, np.nan)
)

In [None]:
# Inspect the national baseline
national_monthly

Unnamed: 0,month,total_enrolments,total_demo_updates,total_bio_updates,national_demo_update_ratio,national_bio_update_ratio
0,2025-03-01,16582,1443338.0,932719.0,87.042456,56.248884
1,2025-04-01,257438,919969.0,3756011.0,3.573556,14.589963
2,2025-05-01,183616,727750.0,2726929.0,3.963435,14.85126
3,2025-06-01,215734,526541.0,2433057.0,2.440695,11.278041
4,2025-07-01,616868,1204489.0,4813025.0,1.952588,7.802358
5,2025-09-01,1475879,7323818.0,6654815.0,4.962343,4.509052
6,2025-10-01,817920,5010573.0,4582491.0,6.125994,5.602615
7,2025-11-01,1092007,9386791.0,7285335.0,8.595907,6.671509
8,2025-12-01,759658,9434367.0,8703825.0,12.419229,11.457557


In [None]:
national_monthly.describe()

Unnamed: 0,month,total_enrolments,total_demo_updates,total_bio_updates,national_demo_update_ratio,national_bio_update_ratio
count,9,9.0,9.0,9.0,9.0,9.0
mean,2025-07-15 00:00:00,603966.9,3997515.0,4654245.0,14.564023,14.779027
min,2025-03-01 00:00:00,16582.0,526541.0,932719.0,1.952588,4.509052
25%,2025-05-01 00:00:00,215734.0,919969.0,2726929.0,3.573556,6.671509
50%,2025-07-01 00:00:00,616868.0,1443338.0,4582491.0,4.962343,11.278041
75%,2025-10-01 00:00:00,817920.0,7323818.0,6654815.0,8.595907,14.589963
max,2025-12-01 00:00:00,1475879.0,9434367.0,8703825.0,87.042456,56.248884
std,,482327.5,3828011.0,2516425.0,27.376816,15.996733
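
The summary shows a mean demographic ratio of about 14.56 against a median of about 4.96, because March 2025 (87.04) sits far above every other month. The sketch below uses the nine ratios typed in from the Step 1 table to compare the mean with two more outlier-resistant alternatives. Using the median, or excluding March, is an assumption offered for comparison, not a choice the notebook makes.

```python
import pandas as pd

# National demographic update ratios copied from the Step 1 table.
national_demo_ratio = pd.Series(
    [87.042456, 3.573556, 3.963435, 2.440695, 1.952588,
     4.962343, 6.125994, 8.595907, 12.419229],
    index=pd.to_datetime([
        "2025-03-01", "2025-04-01", "2025-05-01", "2025-06-01", "2025-07-01",
        "2025-09-01", "2025-10-01", "2025-11-01", "2025-12-01",
    ]),
)

mean_all = national_demo_ratio.mean()              # pulled upward by March
median_all = national_demo_ratio.median()          # robust to a single extreme month
mean_excl_march = national_demo_ratio.iloc[1:].mean()

print(round(mean_all, 3), round(median_all, 3), round(mean_excl_march, 3))
```

Any later step that thresholds on the all-month mean inherits the March effect, so it is worth deciding explicitly which summary serves as the baseline.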


### Step 2: State-Level Variation from National Baseline

State-level ratios, computed from counts summed over each state's districts, are compared against the national baseline for the same month to identify states experiencing higher or lower Aadhaar update pressure.  
This highlights the uneven distribution of system stress across regions.


In [None]:
# Step 2.1: Aggregate metrics at STATE × MONTH level
# We first collapse districts into states.
state_monthly = (
    identity_metrics
    .groupby(['month', 'state'], as_index=False)
    .agg({
        'total_enrolments': 'sum',
        'total_demo_updates': 'sum',
        'total_bio_updates': 'sum'
    })
)

In [None]:
# Step 2.2: Compute state-level ratios
state_monthly['state_demo_update_ratio'] = (
    state_monthly['total_demo_updates'] /
    state_monthly['total_enrolments'].replace(0, np.nan)
)

state_monthly['state_bio_update_ratio'] = (
    state_monthly['total_bio_updates'] /
    state_monthly['total_enrolments'].replace(0, np.nan)
)

In [None]:
# Step 2.3: Join national baseline for comparison
# We now attach the national ratios so each state can be compared to “normal”.
state_vs_national = state_monthly.merge(
    national_monthly[['month', 'national_demo_update_ratio', 'national_bio_update_ratio']],
    on='month',
    how='left'
)

In [None]:
# Step 2.4: Compute deviation from national baseline
# These deviations are key analytical signals.
state_vs_national['demo_ratio_diff_from_national'] = (
    state_vs_national['state_demo_update_ratio'] -
    state_vs_national['national_demo_update_ratio']
)

state_vs_national['bio_ratio_diff_from_national'] = (
    state_vs_national['state_bio_update_ratio'] -
    state_vs_national['national_bio_update_ratio']
)

In [None]:
# Step 2.5: Inspect results
state_vs_national.head()

Unnamed: 0,month,state,total_enrolments,total_demo_updates,total_bio_updates,state_demo_update_ratio,state_bio_update_ratio,national_demo_update_ratio,national_bio_update_ratio,demo_ratio_diff_from_national,bio_ratio_diff_from_national
0,2025-03-01,Andhra Pradesh,116,0.0,0.0,0.0,0.0,87.042456,56.248884,-87.042456,-56.248884
1,2025-03-01,Assam,2492,124858.0,67560.0,50.103531,27.110754,87.042456,56.248884,-36.938924,-29.13813
2,2025-03-01,Bihar,2352,220996.0,153583.0,93.960884,65.298895,87.042456,56.248884,6.918429,9.05001
3,2025-03-01,Chhattisgarh,260,24888.0,15613.0,95.723077,60.05,87.042456,56.248884,8.680621,3.801116
4,2025-03-01,Delhi,368,118682.0,69183.0,322.505435,187.997283,87.042456,56.248884,235.462979,131.748398


In [None]:
state_vs_national.describe()

Unnamed: 0,month,total_enrolments,total_demo_updates,total_bio_updates,state_demo_update_ratio,state_bio_update_ratio,national_demo_update_ratio,national_bio_update_ratio,demo_ratio_diff_from_national,bio_ratio_diff_from_national
count,323,323.0,323.0,323.0,323.0,323.0,323.0,323.0,323.0,323.0
mean,2025-08-16 03:47:22.105263104,16828.798762,111385.9,129684.9,14.892666,18.314291,10.773567,11.327839,4.119099,6.986452
min,2025-03-01 00:00:00,1.0,0.0,0.0,0.0,0.0,1.952588,4.509052,-87.042456,-56.248884
25%,2025-06-01 00:00:00,243.0,238.0,2205.5,2.698003,4.150348,3.573556,5.602615,-2.178884,-2.881483
50%,2025-09-01 00:00:00,2186.0,10046.0,25211.0,5.945535,8.409091,6.125994,7.802358,-0.429675,0.383095
75%,2025-11-01 00:00:00,18970.5,118971.5,165989.0,11.353837,15.849323,8.595907,11.457557,3.013702,6.115944
max,2025-12-01 00:00:00,261079.0,1926128.0,1217196.0,483.490909,295.883333,87.042456,56.248884,396.448453,239.634449
std,,32015.406648,250230.8,220219.9,42.027827,35.998452,18.848141,11.468983,33.172547,31.637681
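
The national ratio itself ranges from roughly 1.95 to 87.04 across months, so the same absolute difference carries a different meaning in different months. One possible complement is a relative measure: the state ratio expressed as a multiple of the national ratio. The sketch below uses the five March 2025 rows shown in Step 2.5; the column name `demo_ratio_vs_national` is hypothetical and does not appear elsewhere in the notebook.

```python
import pandas as pd

# March 2025 rows typed in from the Step 2.5 output.
rows = pd.DataFrame({
    "state": ["Andhra Pradesh", "Assam", "Bihar", "Chhattisgarh", "Delhi"],
    "state_demo_update_ratio": [0.0, 50.103531, 93.960884, 95.723077, 322.505435],
    "national_demo_update_ratio": [87.042456] * 5,
})

# Hypothetical relative signal: 1.0 means "same as national", 2.0 means "double".
rows["demo_ratio_vs_national"] = (
    rows["state_demo_update_ratio"] / rows["national_demo_update_ratio"]
)

print(rows[["state", "demo_ratio_vs_national"]].round(2))
```

On this scale Delhi sits at roughly 3.7 times the national ratio in March, a figure that can be compared directly with multiples from other months.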


### Step 3: District-Level Deviation within States

District-level update ratios are compared against the aggregate ratio of their own state for the same month.  
This step uncovers localized hotspots of Aadhaar stress that may be masked by state-level aggregation.


In [None]:
# Step 3.1: Preparing district-level view

# We already have district-level data in identity_metrics.
# We will now compare districts against their state context.

In [None]:
# Step 3.2: Compute STATE-level reference ratios

# This helps us answer: “Is this district high compared to its own state?”

state_reference = (
    state_monthly[['month', 'state', 'state_demo_update_ratio', 'state_bio_update_ratio']]
)

In [None]:
# Step 3.3: Join state reference to district data
district_vs_state = identity_metrics.merge(
    state_reference,
    on=['month', 'state'],
    how='left'
)

In [None]:
# Step 3.4: Compute district deviation from state average

# These are local stress indicators.
district_vs_state['demo_ratio_diff_from_state'] = (
    district_vs_state['demo_update_ratio'] -
    district_vs_state['state_demo_update_ratio']
)

district_vs_state['bio_ratio_diff_from_state'] = (
    district_vs_state['bio_update_ratio'] -
    district_vs_state['state_bio_update_ratio']
)

In [None]:
# Step 3.5: Inspect extreme districts
district_vs_state.sort_values(
    by='demo_ratio_diff_from_state',
    ascending=False
).head(10)

Unnamed: 0,month,state,district,age_0_5,age_5_17,age_18_greater,demo_age_5_17,demo_age_17_,bio_age_5_17,bio_age_17_,total_enrolments,total_demo_updates,total_bio_updates,demo_update_ratio,bio_update_ratio,state_demo_update_ratio,state_bio_update_ratio,demo_ratio_diff_from_state,bio_ratio_diff_from_state
43,2025-03-01,Maharashtra,Thane,19,13,15,6005.0,69162.0,13203.0,38556.0,47,75167.0,51759.0,1599.297872,1101.255319,241.663043,230.326087,1357.634829,870.929232
26,2025-03-01,Delhi,North West Delhi,35,15,10,6373.0,62825.0,15709.0,29195.0,60,69198.0,44904.0,1153.3,748.4,322.505435,187.997283,830.794565,560.402717
60,2025-03-01,Uttar Pradesh,Gorakhpur,13,36,11,4752.0,44058.0,8891.0,12567.0,60,48810.0,21458.0,813.5,357.633333,106.053773,56.352494,707.446227,301.280839
4800,2025-12-01,Rajasthan,Jalore,1,0,0,38.0,619.0,3.0,14.0,1,657.0,17.0,657.0,17.0,8.491356,10.51531,648.508644,6.48469
19,2025-03-01,Bihar,Muzaffarpur,24,40,10,4456.0,42818.0,13651.0,18161.0,74,47274.0,31812.0,638.837838,429.891892,93.960884,65.298895,544.876953,364.592997
56,2025-03-01,Uttar Pradesh,Bulandshahr,19,32,10,3371.0,34208.0,15326.0,4600.0,61,37579.0,19926.0,616.04918,326.655738,106.053773,56.352494,509.995407,270.303244
17,2025-03-01,Bihar,Bhagalpur,13,40,18,4949.0,37353.0,13344.0,11954.0,71,42302.0,25298.0,595.802817,356.309859,93.960884,65.298895,501.841933,291.010965
21,2025-03-01,Bihar,Patna,12,52,16,4807.0,41971.0,11916.0,18174.0,80,46778.0,30090.0,584.725,376.125,93.960884,65.298895,490.764116,310.826105
4780,2025-12-01,Rajasthan,Beawar,1,0,0,19.0,430.0,2.0,6.0,1,449.0,8.0,449.0,8.0,8.491356,10.51531,440.508644,-2.51531
33,2025-03-01,Gujarat,Vadodara,36,43,10,3492.0,27140.0,7894.0,17180.0,89,30632.0,25074.0,344.179775,281.730337,41.173979,41.260968,303.005796,240.469369
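
Two of the ten rows above (Jalore and Beawar) have a single enrolment, so their ratios are driven almost entirely by the tiny denominator. A minimal sketch of a minimum-volume filter follows. The cut-off `MIN_ENROLMENTS = 30` is a hypothetical value chosen only for illustration, and `top_rows` retypes four of the rows shown above.

```python
import pandas as pd

MIN_ENROLMENTS = 30  # hypothetical cut-off, to be tuned against the real data

# Four rows typed in from the Step 3.5 output.
top_rows = pd.DataFrame({
    "district": ["Thane", "Jalore", "Patna", "Beawar"],
    "total_enrolments": [47, 1, 80, 1],
    "demo_ratio_diff_from_state": [1357.634829, 648.508644, 490.764116, 440.508644],
})

# Drop rows whose ratio rests on too few enrolments to be stable.
reliable = top_rows[top_rows["total_enrolments"] >= MIN_ENROLMENTS]
ranked = reliable.sort_values("demo_ratio_diff_from_state", ascending=False)

print(ranked["district"].tolist())
```

With this threshold only Thane and Patna remain. Whether 30 is an appropriate floor depends on how enrolment counts are distributed in the full dataset.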


### Step 4: Age-Group Pressure Analysis

Update ratios are analysed for the two age bands available in the update data (5-17 and 17+) to understand lifecycle-driven identity changes.  
This helps identify whether Aadhaar update pressure is concentrated among school-age children or adults.


In [None]:
# Step 4.1: Create age-group–specific ratios

# We already have age-wise counts.
# Now we normalise them by total enrolments.

# Age-group specific demographic update ratios
identity_metrics['demo_ratio_5_17'] = (
    identity_metrics['demo_age_5_17'] /
    identity_metrics['total_enrolments'].replace(0, np.nan)
)

identity_metrics['demo_ratio_17_plus'] = (
    identity_metrics['demo_age_17_'] /
    identity_metrics['total_enrolments'].replace(0, np.nan)
)

# Age-group specific biometric update ratios
identity_metrics['bio_ratio_5_17'] = (
    identity_metrics['bio_age_5_17'] /
    identity_metrics['total_enrolments'].replace(0, np.nan)
)

identity_metrics['bio_ratio_17_plus'] = (
    identity_metrics['bio_age_17_'] /
    identity_metrics['total_enrolments'].replace(0, np.nan)
)

In [None]:
# Step 4.2: Inspect distributions
identity_metrics[
    [
        'demo_ratio_5_17',
        'demo_ratio_17_plus',
        'bio_ratio_5_17',
        'bio_ratio_17_plus'
    ]
].describe()

# What to observe:

# a) Which ratios are generally higher?

# b) Are adult updates dominating?

# c) Is child/youth update pressure meaningful?

# Just observing, not concluding yet.

Unnamed: 0,demo_ratio_5_17,demo_ratio_17_plus,bio_ratio_5_17,bio_ratio_17_plus
count,5062.0,5062.0,5062.0,5062.0
mean,1.086289,10.325937,6.294472,7.112154
std,3.678662,38.020996,15.061222,22.157131
min,0.0,0.0,0.0,0.0
25%,0.253549,2.799561,1.426078,1.712365
50%,0.552403,5.296239,2.908058,3.05709
75%,1.065329,9.670406,6.213452,5.952542
max,127.765957,1471.531915,394.985507,820.340426


In [None]:
# Step 4.3: Compare age-group pressure at NATIONAL level
# We now aggregate age-group behaviour nationally.
national_age_pressure = (
    identity_metrics
    .groupby('month', as_index=False)
    .agg({
        'demo_age_5_17': 'sum',
        'demo_age_17_': 'sum',
        'bio_age_5_17': 'sum',
        'bio_age_17_': 'sum',
        'total_enrolments': 'sum'
    })
)

In [None]:
# Compute national age-group ratios
national_age_pressure['demo_ratio_5_17'] = (
    national_age_pressure['demo_age_5_17'] /
    national_age_pressure['total_enrolments'].replace(0, np.nan)
)

national_age_pressure['demo_ratio_17_plus'] = (
    national_age_pressure['demo_age_17_'] /
    national_age_pressure['total_enrolments'].replace(0, np.nan)
)

national_age_pressure['bio_ratio_5_17'] = (
    national_age_pressure['bio_age_5_17'] /
    national_age_pressure['total_enrolments'].replace(0, np.nan)
)

national_age_pressure['bio_ratio_17_plus'] = (
    national_age_pressure['bio_age_17_'] /
    national_age_pressure['total_enrolments'].replace(0, np.nan)
)

In [None]:
# Step 4.4: Inspect national age-group behaviour
national_age_pressure

Unnamed: 0,month,demo_age_5_17,demo_age_17_,bio_age_5_17,bio_age_17_,total_enrolments,demo_ratio_5_17,demo_ratio_17_plus,bio_ratio_5_17,bio_ratio_17_plus
0,2025-03-01,120920.0,1322418.0,452336.0,480383.0,16582,7.292245,79.750211,27.278736,28.970148
1,2025-04-01,115419.0,804550.0,1929583.0,1826428.0,257438,0.448337,3.125218,7.495331,7.094632
2,2025-05-01,85045.0,642705.0,1317249.0,1409680.0,183616,0.463168,3.500267,7.173934,7.677327
3,2025-06-01,57923.0,468618.0,1167558.0,1265499.0,215734,0.268493,2.172203,5.412026,5.866016
4,2025-07-01,155005.0,1049484.0,2302232.0,2510793.0,616868,0.251277,1.70131,3.732131,4.070227
5,2025-09-01,770768.0,6553050.0,3610474.0,3044341.0,1475879,0.522243,4.4401,2.446321,2.062731
6,2025-10-01,479206.0,4531367.0,2215322.0,2367169.0,817920,0.585884,5.54011,2.708482,2.894133
7,2025-11-01,821216.0,8565575.0,3608835.0,3676500.0,1092007,0.752024,7.843883,3.304773,3.366737
8,2025-12-01,926010.0,8508357.0,4624042.0,4079783.0,759658,1.218983,11.200247,6.087005,5.370552
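
Dividing by total enrolments shows the level of pressure per age band. The question "are adult updates dominating?" can also be read from each band's share of all updates. The sketch below computes that share for three of the months shown above; the `*_share_17_plus` column names are hypothetical.

```python
import pandas as pd

# Three months typed in from the Step 4.4 output.
national_age = pd.DataFrame({
    "month": pd.to_datetime(["2025-03-01", "2025-09-01", "2025-12-01"]),
    "demo_age_5_17": [120920.0, 770768.0, 926010.0],
    "demo_age_17_": [1322418.0, 6553050.0, 8508357.0],
    "bio_age_5_17": [452336.0, 3610474.0, 4624042.0],
    "bio_age_17_": [480383.0, 3044341.0, 4079783.0],
})

demo_total = national_age["demo_age_5_17"] + national_age["demo_age_17_"]
bio_total = national_age["bio_age_5_17"] + national_age["bio_age_17_"]

# Share of each update type that comes from the 17+ band.
national_age["demo_share_17_plus"] = national_age["demo_age_17_"] / demo_total
national_age["bio_share_17_plus"] = national_age["bio_age_17_"] / bio_total

print(national_age[["month", "demo_share_17_plus", "bio_share_17_plus"]].round(3))
```

In these three months the 17+ band accounts for roughly nine in ten demographic updates, while biometric updates are split close to evenly between the two bands.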


### Step 5: Temporal Behaviour and Month-on-Month Change

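Month-on-month changes in update ratios are computed to capture acceleration or easing of Aadhaar update pressure over time.  
This distinguishes persistent stress from one-time fluctuations. Because August 2025 is absent from the data, the change recorded for September 2025 is measured against July and spans two months.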


In [None]:
# Step 5.1: Sort data correctly
identity_metrics = identity_metrics.sort_values(
    by=['state', 'district', 'month']
)

# Sorting by month within each district ensures that:
# a) rows are in time order
# b) .diff() compares each month with the previous observed month for the same district

## Computing Month-on-Month (MoM) change in ratios
**Why**

The sign and size of the MoM change show whether pressure in a district is:

a) accelerating (large positive change)

b) stabilising (change near zero)

c) declining (negative change)

In [None]:
# Step 5.2: Compute Month-on-Month (MoM) change in ratios

identity_metrics['demo_ratio_mom_change'] = (
    identity_metrics
    .groupby(['state', 'district'])['demo_update_ratio']
    .diff()
)

identity_metrics['bio_ratio_mom_change'] = (
    identity_metrics
    .groupby(['state', 'district'])['bio_update_ratio']
    .diff()
)

In [None]:
# Step 5.3: Inspect MoM behaviour
identity_metrics[
    ['demo_ratio_mom_change', 'bio_ratio_mom_change']
].describe()

# We are looking for:

# a) mostly small changes (normal)
# b) some large positive values (potential concern)
# c) some negative values (cool-down)

Unnamed: 0,demo_ratio_mom_change,bio_ratio_mom_change
count,3992.0,3992.0
mean,-0.730504,-3.204615
std,45.295035,37.940197
min,-1595.047144,-1086.043613
25%,0.0,-1.000005
50%,1.515495,0.790791
75%,4.656999,3.502323
max,546.0,184.0
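
A grouped `.diff()` subtracts the previous observed row, whatever the time gap. Where a district skips a month, the result is not a true month-on-month change. The sketch below shows one way to keep only changes between adjacent calendar months. The `toy` frame, its values, and the column `demo_ratio_mom_change_strict` are hypothetical.

```python
import pandas as pd

# Toy series for one hypothetical district; August 2025 is deliberately absent.
toy = pd.DataFrame({
    "state": ["S1"] * 4,
    "district": ["A"] * 4,
    "month": pd.to_datetime(["2025-06-01", "2025-07-01", "2025-09-01", "2025-10-01"]),
    "demo_update_ratio": [10.0, 12.0, 20.0, 17.0],
}).sort_values(["state", "district", "month"])

keys = [toy["state"], toy["district"]]

# Months elapsed since the previous observation of the same district.
month_number = toy["month"].dt.year * 12 + toy["month"].dt.month
months_elapsed = month_number.groupby(keys).diff()

# Plain difference between consecutive rows, as in Step 5.2.
raw_change = toy.groupby(["state", "district"])["demo_update_ratio"].diff()

# Keep the change only when the previous row is exactly one month earlier.
toy["demo_ratio_mom_change_strict"] = raw_change.where(months_elapsed == 1)

print(toy[["month", "demo_update_ratio", "demo_ratio_mom_change_strict"]])
```

Here the July to September jump of 8.0 is masked as missing, while the June to July and September to October changes are kept. An alternative would be to divide the change by the number of elapsed months; which treatment is appropriate depends on how the gap should be interpreted.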


### Step 6: Identification of Persistent High-Pressure Regions

Regions that remain above baseline thresholds across multiple months are identified as persistent stress candidates.  
These regions are strong candidates for early-warning and targeted intervention.


In [None]:
# Step 6.1: Flag district-months above the national baseline
# A row qualifies if EITHER ratio exceeds the mean of the nine national monthly ratios.
# This is one fixed threshold for all months, and the March 2025 outlier pulls it upward.

high_pressure_candidates = identity_metrics[
    (identity_metrics['demo_update_ratio'] > national_monthly['national_demo_update_ratio'].mean()) |
    (identity_metrics['bio_update_ratio'] > national_monthly['national_bio_update_ratio'].mean())
]

In [None]:
# Count the months in which each district exceeds the baseline, most persistent first
persistent_pressure = (
    high_pressure_candidates
    .groupby(['state', 'district'])
    .size()
    .reset_index(name='months_above_baseline')
    .sort_values(by='months_above_baseline', ascending=False)
)

In [None]:
# Step 6.2: Inspect top persistent candidates
persistent_pressure.head(10)

# These districts are candidates for:

# a) early-warning monitoring

# b) input to ML ranking

# c) follow-up by UIDAI

Unnamed: 0,state,district,months_above_baseline
334,Maharashtra,Beed,8
371,Maharashtra,Satara,8
75,Assam,Bongaigaon,7
358,Maharashtra,Nanded,7
699,West Bengal,Jalpaiguri,7
332,Maharashtra,Amravati,6
376,Maharashtra,Washim,6
373,Maharashtra,Solapur,6
370,Maharashtra,Sangli,6
339,Maharashtra,Chandrapur,6
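
Step 6.1 compares every district-month with a single threshold, the mean of the national monthly ratios. An alternative is to compare each row with the national ratio of its own month. The sketch below shows that variant on toy district values; `toy`, `above_same_month`, and the district names are hypothetical, while the two national ratios come from the Step 1 table.

```python
import pandas as pd

# Toy district-month rows; ratios are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "month": pd.to_datetime(["2025-03-01", "2025-04-01"] * 2),
    "state": ["S1", "S1", "S2", "S2"],
    "district": ["A", "A", "B", "B"],
    "demo_update_ratio": [100.0, 5.0, 20.0, 2.0],
})

# National ratios for the same two months, from the Step 1 table.
national = pd.DataFrame({
    "month": pd.to_datetime(["2025-03-01", "2025-04-01"]),
    "national_demo_update_ratio": [87.042456, 3.573556],
})

# Attach the same-month baseline and flag rows that exceed it.
merged = toy.merge(national, on="month", how="left")
merged["above_same_month"] = (
    merged["demo_update_ratio"] > merged["national_demo_update_ratio"]
)

# Count flagged months per district.
months_above = (
    merged.groupby(["state", "district"])["above_same_month"]
    .sum()
    .reset_index(name="months_above_baseline")
)
print(months_above)
```

District A is above the same-month baseline in both months and district B in neither. Against the fixed all-month mean of about 14.56, each district would be counted once, so the two approaches can rank districts differently.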


In [None]:
persistent_pressure.to_csv('persistent_pressure.csv', index=False)

# Persistent high-pressure districts were identified during exploratory analysis.
# For ML reinforcement, we used the full identity_metrics dataset to avoid pre-filtering bias.

## What Steps 5 and 6 have achieved

We now have:

a) Temporal signals (MoM changes)

b) Distinction between:

  1) random spikes

  2) persistent stress

c) A shortlist of candidate regions for:

  1) early warning

  2) ML reinforcement

  3) policy discussion

## Layer 4: Complete

We now have:

a) Patterns (national & state)

b) Variations (state & district)

c) Lifecycle signals (age groups)

d) Temporal behaviour (MoM)

e) Early-signal candidates (persistent pressure)