# Notebook 03: Social Normalisation of Aadhaar Update Activity

## Objective
This notebook transforms aggregated Aadhaar activity into **normalised pressure indicators**
by separating routine (expected) updates from sensitive updates.

The goal is to:
- Control for district size and routine lifecycle activity
- Highlight temporal deviations within districts
- Produce defensible, aggregate indicators (not migration rates)

Outputs from this notebook are used in Notebook 04 and the dashboard.

In [2]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

In [3]:
bio_df  = pd.read_csv("../data/processed/bio_dm.csv")
demo_df = pd.read_csv("../data/processed/demo_dm.csv")
enr_df  = pd.read_csv("../data/processed/enrol_df_month.csv")

bio_df.shape, demo_df.shape, enr_df.shape

((11559, 5), (11711, 5), (10961, 6))

In [4]:
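# Relabel the age-band columns by role: the 5-17 band is treated as
# expected (routine) activity, the 17+ band as sensitive activity.
bio_df = bio_df.rename(columns={
    "bio_age_5_17": "biometric_expected",
    "bio_age_17_":  "biometric_sensitive"
})

demo_df = demo_df.rename(columns={
    "demo_age_5_17": "demographic_expected",
    "demo_age_17_":  "demographic_sensitive"
})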

### Why Normalisation?

Raw counts vary due to population size, enrolment maturity, and routine behaviour.
Normalisation helps isolate **deviations from a district’s own baseline**.

- Expected activity ≈ routine lifecycle/administrative behaviour
- Sensitive activity ≈ potentially stress-related deviations

We compute ratios and then standardise **within districts** over time.

In [5]:
# Add-one smoothing: keeps the ratio finite when the expected count is zero
# and damps ratios that rest on very small expected counts
EPS = 1.0

bio_df["biometric_pressure_ratio"] = (
    bio_df["biometric_sensitive"] / (bio_df["biometric_expected"] + EPS)
)

demo_df["demographic_pressure_ratio"] = (
    demo_df["demographic_sensitive"] / (demo_df["demographic_expected"] + EPS)
)
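
The effect of the additive constant is easiest to see on a few made-up counts. The sketch below uses hypothetical numbers (not taken from the dataset) and illustrative column names; it only shows that the smoothed ratio stays finite when the expected count is zero and changes very little once counts are large.

```python
import pandas as pd

EPS = 1.0  # same constant as in the cell above

# Hypothetical counts for illustration only (not from the Aadhaar data)
toy = pd.DataFrame({
    "sensitive": [10, 10, 1000],
    "expected":  [0, 4, 400],
})

# Unsmoothed ratio: infinite when the expected count is zero
toy["raw_ratio"] = toy["sensitive"] / toy["expected"]

# Smoothed ratio: finite everywhere, close to the raw ratio for large counts
toy["smoothed_ratio"] = toy["sensitive"] / (toy["expected"] + EPS)

print(toy)
```

With a constant of 1.0, the smoothing is noticeable for small districts (2.5 becomes 2.0 in the second row) and nearly invisible for large ones.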

In [6]:
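def zscore_within_group(s):
    # Standardise against the group's own mean and population std (ddof=0);
    # the 1e-6 term avoids division by zero when a district's ratio is constant.
    return (s - s.mean()) / (s.std(ddof=0) + 1e-6)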

bio_df["biometric_pressure_index"] = (
    bio_df.groupby(["state", "district"])["biometric_pressure_ratio"]
          .transform(zscore_within_group)
)

demo_df["demographic_pressure_index"] = (
    demo_df.groupby(["state", "district"])["demographic_pressure_ratio"]
           .transform(zscore_within_group)
)

### Interpreting the Indices

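- Values > 0: the pressure ratio is above that district's own average over the period
- Values < 0: the pressure ratio is below that district's own average (routine or below-baseline activity)
- Indices are **relative within districts**; the same value in two districts does not imply the same absolute level of activity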

These indicators support exploratory policy analysis without individual inference.
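
A small illustration of the "relative within districts" point, using two hypothetical districts whose raw ratios differ by a factor of ten. The data and the column names `ratio` and `pressure_index` are invented for this sketch.

```python
import pandas as pd

def zscore_within_group(s):
    # Same definition as in the standardisation cell above
    return (s - s.mean()) / (s.std(ddof=0) + 1e-6)

# Hypothetical ratios: district B runs ten times higher than district A
toy = pd.DataFrame({
    "district": ["A"] * 3 + ["B"] * 3,
    "ratio":    [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

toy["pressure_index"] = (
    toy.groupby("district")["ratio"].transform(zscore_within_group)
)

print(toy)
print(toy.groupby("district")["pressure_index"].mean())
```

Both districts end up with an index centred on zero, and their highest months receive almost the same index value even though the underlying ratios are 3.0 and 30.0. The index says how unusual a month is for that district, not how large the activity is.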

In [7]:
bio_out = bio_df[[
    "state", "district", "month",
    "biometric_pressure_index"
]].copy()

demo_out = demo_df[[
    "state", "district", "month",
    "demographic_pressure_index"
]].copy()

In [10]:
bio_out.to_csv("../data/processed/biometric_pressure_index.csv", index=False)
demo_out.to_csv("../data/processed/demographic_pressure_index.csv", index=False)

This notebook produced socially normalised pressure indicators that control
for routine Aadhaar update behaviour.

The next notebook focuses on:
- Temporal patterns and comparisons
- Geographic heterogeneity
- Evidence-based visualisations

Proceed to Notebook 04.

In [8]:
# Merge biometric and demographic indices
combined_df = bio_out.merge(
    demo_out,
    on=["state", "district", "month"],
    how="inner"
)

# Composite pressure index (equal weighting)
combined_df["composite_pressure_index"] = (
    0.5 * combined_df["biometric_pressure_index"] +
    0.5 * combined_df["demographic_pressure_index"]
)
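
Because the biometric and demographic frames have different row counts (11,559 vs 11,711), the inner join silently drops any state, district, month key that appears in only one of them. A hedged sketch of how that loss could be counted, using pandas' `indicator=True` option on toy frames; `merge_coverage` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def merge_coverage(left, right, keys):
    """Count key combinations present in both frames, left only, or right only."""
    merged = left[keys].merge(right[keys], on=keys, how="outer", indicator=True)
    return merged["_merge"].value_counts()

keys = ["state", "district", "month"]

# Toy frames that overlap on two of their three months
left = pd.DataFrame({
    "state": ["S"] * 3,
    "district": ["D"] * 3,
    "month": ["2025-01", "2025-02", "2025-03"],
})
right = pd.DataFrame({
    "state": ["S"] * 3,
    "district": ["D"] * 3,
    "month": ["2025-02", "2025-03", "2025-04"],
})

coverage = merge_coverage(left, right, keys)
print(coverage)
```

Applied to `bio_out` and `demo_out`, the `left_only` and `right_only` counts would show how many district-months are excluded from the composite index.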

In [17]:
combined_df["composite_pressure_index_capped"] = combined_df[
    "composite_pressure_index"
].clip(-3, 3)
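
The alerting cells later in this notebook read a `rolling_pressure_3m` column, but the cell that creates it is not shown. One plausible construction, offered as an assumption rather than the notebook's actual code, is a three-row rolling mean of the composite index within each district, left as NaN until three months are available. `add_rolling_pressure` is a hypothetical helper.

```python
import pandas as pd

def add_rolling_pressure(df, window=3):
    """Add a within-district rolling mean of the composite index.

    Assumes `month` is a sortable 'YYYY-MM' string and that each district
    has one row per consecutive month (the window counts rows, not dates).
    """
    out = df.sort_values(["state", "district", "month"]).copy()
    out["rolling_pressure_3m"] = (
        out.groupby(["state", "district"])["composite_pressure_index"]
           .transform(lambda s: s.rolling(window, min_periods=window).mean())
    )
    return out

# Toy data: one district, four months
toy = pd.DataFrame({
    "state": ["S"] * 4,
    "district": ["D"] * 4,
    "month": ["2025-01", "2025-02", "2025-03", "2025-04"],
    "composite_pressure_index": [0.0, 1.0, 2.0, 3.0],
})

rolled = add_rolling_pressure(toy)
print(rolled)
```

Under this assumption the first two months of every district have no rolling value, which is consistent with a block of rows ending up without an alert level.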

In [18]:
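# Interquartile range of the composite index per district.
# NOTE: assigning .values is positional, so district_stats must already be
# in the same sorted (state, district) order as the groupby result.
grouped = combined_df.groupby(["state", "district"])["composite_pressure_index"]

district_stats["volatility_iqr"] = (
    grouped.quantile(0.75) - grouped.quantile(0.25)
).values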
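
`district_stats` is used above and in the classification cell, but its construction is not shown. The sketch below is one way it could be built, assuming `mean_pressure` is the per-district mean and `volatility` the per-district standard deviation of the composite index; both definitions are inferred from the column names, not confirmed by the notebook. Computing the IQR in the same aggregation keeps every statistic aligned by key instead of by row position. `build_district_stats` and `iqr` are hypothetical helpers.

```python
import pandas as pd

def iqr(s):
    """Interquartile range of a series."""
    return s.quantile(0.75) - s.quantile(0.25)

def build_district_stats(combined):
    """Per-district summary of the composite index (assumed definitions)."""
    grouped = combined.groupby(["state", "district"])["composite_pressure_index"]
    return grouped.agg(
        mean_pressure="mean",   # assumption: average composite index
        volatility="std",       # assumption: sample standard deviation
        volatility_iqr=iqr,
    ).reset_index()

# Toy data: one variable district and one constant district
toy = pd.DataFrame({
    "state": ["S"] * 8,
    "district": ["D1"] * 4 + ["D2"] * 4,
    "composite_pressure_index": [0.0, 1.0, 2.0, 3.0, -1.0, -1.0, -1.0, -1.0],
})

stats = build_district_stats(toy)
print(stats)
```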

In [10]:
def classify_district(row):
    if row["mean_pressure"] < 0 and row["volatility"] < 0.5:
        return "Stable"
    elif row["mean_pressure"] < 0 and row["volatility"] >= 0.5:
        return "Volatile"
    elif row["mean_pressure"] >= 0 and row["volatility"] < 0.5:
        return "Sustained Stress"
    else:
        return "Shock-Prone"

district_stats["risk_category"] = district_stats.apply(classify_district, axis=1)

In [20]:
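def alert_level(x):
    # Map a 3-month rolling pressure value to an alert band.
    # Missing values (not enough history for the rolling window) get the
    # label that appears in the recorded value counts further down.
    if pd.isna(x):
        return "Insufficient Data"
    elif x < 0.5:
        return "Normal"
    elif x < 1.0:
        return "Watch"
    elif x < 1.5:
        return "Elevated"
    else:
        return "Critical"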

combined_df["alert_level"] = combined_df[
    "rolling_pressure_3m"
].apply(alert_level)

In [21]:
combined_df["alert_flag"] = combined_df["alert_level"].isin(
    ["Elevated", "Critical"]
)
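
The alert thresholds can also be written as bins rather than a chain of comparisons. This is a sketch of an equivalent vectorised form using `pd.cut` with `right=False`, so that each band includes its lower edge and matches the `<` comparisons in `alert_level`. `alert_levels_binned` and its `missing_label` parameter are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

def alert_levels_binned(values, missing_label):
    """Band a series of rolling pressure values; NaN gets `missing_label`."""
    bins = [-np.inf, 0.5, 1.0, 1.5, np.inf]
    labels = ["Normal", "Watch", "Elevated", "Critical"]
    # right=False gives half-open bins [low, high), e.g. 0.5 falls in "Watch"
    binned = pd.cut(values, bins=bins, labels=labels, right=False)
    # Convert to plain strings so a label outside the categories can be filled in
    return binned.astype(object).fillna(missing_label)

# Toy values, including both threshold edges and a missing entry
toy = pd.Series([0.2, 0.5, 1.2, 1.5, np.nan])
levels = alert_levels_binned(toy, missing_label="Insufficient Data")
print(levels.tolist())
```

Keeping the thresholds in a single list makes them easier to review and adjust than four separate branches.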

In [14]:
combined_df.to_csv(
    "../data/processed/composite_pressure_index.csv",
    index=False
)

district_stats.to_csv(
    "../data/processed/district_risk_typology.csv",
    index=False
)

In [15]:
combined_df["alert_level"].value_counts()

alert_level
Normal               9182
Watch                1092
Insufficient Data     989
Elevated              140
Critical                9
Name: count, dtype: int64

In [16]:
combined_df[combined_df["alert_flag"]].head()

| | state | district | month | biometric_pressure_index | demographic_pressure_index | composite_pressure_index | rolling_pressure_3m | alert_flag | alert_level |
|---|---|---|---|---|---|---|---|---|---|
| 600 | Arunachal Pradesh | Anjaw | 2025-02 | -0.653746 | 1.913577 | 0.629916 | 1.264411 | True | Elevated |
| 775 | Arunachal Pradesh | Pakke Kessang | 2025-08 | -0.18537 | 0.218218 | 0.016424 | 1.214082 | True | Elevated |
| 816 | Arunachal Pradesh | Tawang | 2025-02 | 1.271669 | 1.241497 | 1.256583 | 1.473817 | True | Elevated |
| 1486 | Bihar | Jamui | 2025-05 | 1.17503 | 1.173823 | 1.174426 | 1.27013 | True | Elevated |
| 1893 | Chhattisgarh | Baloda Bazar | 2025-02 | -0.24449 | 1.862631 | 0.80907 | 1.619977 | True | Critical |


In [19]:
combined_df["rolling_window_status"] = np.where(
    combined_df["rolling_pressure_3m"].isna(),
    "Insufficient history",
    "Sufficient history"
)