# Notebook 03: Social Normalisation of Aadhaar Update Activity

## Objective
This notebook transforms aggregated Aadhaar activity into **normalised pressure indicators**
by separating routine (expected) updates from sensitive updates.

The goal is to:
- Control for district size and routine lifecycle activity
- Highlight temporal deviations within districts
- Produce defensible, aggregate indicators (not migration rates)

Outputs from this notebook are used in Notebook 04 and the dashboard.

In [2]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

In [3]:
bio_df  = pd.read_csv("../data/processed/bio_dm.csv")
demo_df = pd.read_csv("../data/processed/demo_dm.csv")
enr_df  = pd.read_csv("../data/processed/enrol_df_month.csv")

bio_df.shape, demo_df.shape, enr_df.shape

((11559, 5), (11711, 5), (10961, 6))

In [4]:
bio_df = bio_df.rename(columns={
    "bio_age_5_17": "biometric_expected",
    "bio_age_17_":  "biometric_sensitive"
})

demo_df = demo_df.rename(columns={
    "demo_age_5_17": "demographic_expected",
    "demo_age_17_":  "demographic_sensitive"
})

### Why Normalisation?

Raw counts vary due to population size, enrolment maturity, and routine behaviour.
Normalisation helps isolate **deviations from a district’s own baseline**.

- Expected activity ≈ routine lifecycle/administrative behaviour
- Sensitive activity ≈ potentially stress-related deviations

We compute ratios and then standardise **within districts** over time.

In [5]:
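# Add a constant to the denominator to avoid division by zero.
# EPS = 1.0 is not negligible: it also damps ratios where expected counts are small.
EPS = 1.0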

bio_df["biometric_pressure_ratio"] = (
    bio_df["biometric_sensitive"] / (bio_df["biometric_expected"] + EPS)
)

demo_df["demographic_pressure_ratio"] = (
    demo_df["demographic_sensitive"] / (demo_df["demographic_expected"] + EPS)
)
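
To see how the additive constant behaves, the sketch below applies the same formula to a few made-up counts. The `toy` frame and its numbers are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

EPS = 1.0  # same constant as in the cell above

# Hypothetical counts, chosen only to illustrate the effect of EPS
toy = pd.DataFrame({
    "sensitive": [0, 5, 50, 500],
    "expected":  [0, 0, 10, 1000],
})

# Without smoothing, zero denominators give NaN (0/0) or inf (5/0)
toy["raw_ratio"] = toy["sensitive"] / toy["expected"]

# With smoothing, every row gets a finite ratio
toy["smoothed_ratio"] = toy["sensitive"] / (toy["expected"] + EPS)
```

With `EPS = 1.0`, the ratio for 50 sensitive and 10 expected updates drops from 5.0 to about 4.55, while the ratio for 500 and 1000 barely moves (0.5 to about 0.4995). Smoothing therefore affects low-volume districts and months the most.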

In [6]:
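def zscore_within_group(s):
    # Population z-score (ddof=0); the 1e-6 term avoids division by zero
    # for districts whose ratio never varies
    return (s - s.mean()) / (s.std(ddof=0) + 1e-6)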

bio_df["biometric_pressure_index"] = (
    bio_df.groupby(["state", "district"])["biometric_pressure_ratio"]
          .transform(zscore_within_group)
)

demo_df["demographic_pressure_index"] = (
    demo_df.groupby(["state", "district"])["demographic_pressure_ratio"]
           .transform(zscore_within_group)
)

### Interpreting the Indices

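- Values > 0: the sensitive-to-expected ratio is above that district's own average
- Values < 0: the ratio is below that district's own average
- Indices are z-scores **within districts**: a value of 1 means one standard deviation above that district's mean, so magnitudes are not comparable across districts as absolute levels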

These indicators support exploratory policy analysis without individual inference.
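
A small illustration of why the indices are relative: in the sketch below, the hypothetical district B has ratios exactly ten times those of district A, yet both receive (almost) identical indices. The `toy` frame is an assumption made up for this example.

```python
import pandas as pd

def zscore_within_group(s):
    # Same definition as in the notebook: population z-score with a guard term
    return (s - s.mean()) / (s.std(ddof=0) + 1e-6)

# Hypothetical ratios: district B is district A scaled by 10
toy = pd.DataFrame({
    "state": ["S"] * 6,
    "district": ["A"] * 3 + ["B"] * 3,
    "ratio": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Standardise each district against its own mean and spread
toy["pressure_index"] = (
    toy.groupby(["state", "district"])["ratio"]
       .transform(zscore_within_group)
)
```

Both districts end up with indices of roughly -1.22, 0 and 1.22. The index says how unusual a month is for that district, not how large the underlying ratio is.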

In [7]:
bio_out = bio_df[[
    "state", "district", "month",
    "biometric_pressure_index"
]].copy()

demo_out = demo_df[[
    "state", "district", "month",
    "demographic_pressure_index"
]].copy()

In [8]:
bio_out.to_csv("../data/processed/biometric_pressure_index.csv", index=False)
demo_out.to_csv("../data/processed/demographic_pressure_index.csv", index=False)

This notebook produced socially normalised pressure indicators that control
for routine Aadhaar update behaviour.

The next notebook focuses on:
- Temporal patterns and comparisons
- Geographic heterogeneity
- Evidence-based visualisations

Proceed to Notebook 04.

In [9]:
# Merge biometric and demographic indices
combined_df = bio_out.merge(
    demo_out,
    on=["state", "district", "month"],
    how="inner"
)

# Composite pressure index (equal weighting)
combined_df["composite_pressure_index"] = (
    0.5 * combined_df["biometric_pressure_index"] +
    0.5 * combined_df["demographic_pressure_index"]
)
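
Because the merge is an inner join, district-months present in only one of the two inputs are silently dropped. One way to check how many rows are affected is an outer merge with `indicator=True`. The sketch below uses two hypothetical miniature frames (`bio_toy`, `demo_toy`) in place of `bio_out` and `demo_out`.

```python
import pandas as pd

# Hypothetical miniature versions of bio_out and demo_out
bio_toy = pd.DataFrame({
    "state": ["S", "S", "S"],
    "district": ["A", "A", "B"],
    "month": ["2025-01", "2025-02", "2025-01"],
    "biometric_pressure_index": [0.5, -0.5, 1.0],
})
demo_toy = pd.DataFrame({
    "state": ["S", "S", "S"],
    "district": ["A", "B", "C"],
    "month": ["2025-01", "2025-01", "2025-01"],
    "demographic_pressure_index": [0.2, 0.4, -0.1],
})

keys = ["state", "district", "month"]

# The _merge column labels each row as both, left_only or right_only
audit = bio_toy.merge(demo_toy, on=keys, how="outer", indicator=True)
merge_counts = audit["_merge"].value_counts()
```

Here two rows match on all keys, one exists only in the biometric frame and one only in the demographic frame. Only the `both` rows would survive the inner join.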

In [10]:
combined_df["composite_pressure_index_capped"] = combined_df[
    "composite_pressure_index"
].clip(-3, 3)

In [11]:
district_stats = (
    combined_df
    .groupby(["state", "district"])
    .agg(
        mean_pressure=("composite_pressure_index", "mean"),
        std_pressure=("composite_pressure_index", "std"),
        observations=("composite_pressure_index", "count")
    )
)

In [12]:
# Interquartile range of the composite index within each district
grouped_index = combined_df.groupby(["state", "district"])["composite_pressure_index"]

volatility_iqr = grouped_index.quantile(0.75) - grouped_index.quantile(0.25)

district_stats = district_stats.join(
    volatility_iqr.rename("volatility_iqr")
)

In [13]:
def classify_district(row):
    if row["mean_pressure"] < 0 and row["volatility_iqr"] < 0.5:
        return "Baseline Stable"

    elif row["mean_pressure"] < 0 and row["volatility_iqr"] >= 0.5:
        return "Operational Anomaly"

    elif row["mean_pressure"] >= 0 and row["volatility_iqr"] < 0.5:
        return "Structural Stress"

    else:
        return "Shock-Driven Anomaly"

In [14]:
district_stats["risk_category"] = district_stats.apply(
    classify_district, axis=1
)
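
The same four-way typology can also be written without a row-wise `apply`. The sketch below uses `np.select` on a hypothetical `toy_stats` frame that borrows the column names of `district_stats`. It is an alternative formulation, not the code used to produce the saved outputs.

```python
import numpy as np
import pandas as pd

# Hypothetical district summary with the same column names as district_stats
toy_stats = pd.DataFrame({
    "mean_pressure":  [-0.2, -0.1, 0.3, 0.4],
    "volatility_iqr": [0.3, 0.9, 0.2, 1.1],
})

low_mean = toy_stats["mean_pressure"] < 0
low_vol = toy_stats["volatility_iqr"] < 0.5

# Conditions are checked in order; rows matching none get the default label
toy_stats["risk_category"] = np.select(
    [low_mean & low_vol, low_mean & ~low_vol, ~low_mean & low_vol],
    ["Baseline Stable", "Operational Anomaly", "Structural Stress"],
    default="Shock-Driven Anomaly",
)
```

One difference to keep in mind: rows with missing `mean_pressure` or `volatility_iqr` can be labelled differently by this version than by `classify_district`, because `~` turns a failed comparison into `True`. Filter or fill missing values first if that matters.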

In [20]:
def alert_level(x):
    if pd.isna(x):
        return "No Data"
    elif x < 0.5:
        return "Normal"
    elif x < 1.0:
        return "Watch"
    elif x < 1.5:
        return "Elevated"
    else:
        return "Critical"

In [21]:
combined_df = combined_df.sort_values(
    ["state", "district", "month"]
)

combined_df["rolling_pressure_3m"] = (
    combined_df
    .groupby(["state", "district"])["composite_pressure_index"]
    .rolling(window=3, min_periods=2)
    .mean()
    .reset_index(level=[0,1], drop=True)
)
combined_df["alert_level"] = combined_df[
    "rolling_pressure_3m"
].apply(alert_level)
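
As a cross-check on the thresholds, the alert bands can also be expressed as bins with `pd.cut`. The sketch below is an alternative formulation on hypothetical rolling values, with the boundaries copied from `alert_level`.

```python
import numpy as np
import pandas as pd

# Hypothetical rolling values, including a missing one and the exact boundaries
rolling = pd.Series([np.nan, 0.2, 0.5, 0.99, 1.0, 1.5, 2.3])

# right=False makes bins left-closed, [a, b), matching the `x < bound` tests
levels = pd.cut(
    rolling,
    bins=[-np.inf, 0.5, 1.0, 1.5, np.inf],
    labels=["Normal", "Watch", "Elevated", "Critical"],
    right=False,
)

# Missing rolling values fall outside every bin; label them explicitly
levels = levels.cat.add_categories(["No Data"]).fillna("No Data")
```

A value of exactly 0.5 lands in "Watch", 1.0 in "Elevated" and 1.5 in "Critical", the same as in the function-based version.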

In [22]:
combined_df["alert_flag"] = combined_df["alert_level"].isin(
    ["Elevated", "Critical"]
)

In [18]:
combined_df.to_csv(
    "../data/processed/composite_pressure_index.csv",
    index=False
)

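# state and district live in the index after groupby; move them back to
# columns so that index=False does not drop them from the file
district_stats.reset_index().to_csv(
    "../data/processed/district_risk_typology.csv",
    index=False
)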

In [23]:
combined_df["alert_level"].value_counts()

alert_level
Normal      9182
Watch       1092
No Data      989
Elevated     140
Critical       9
Name: count, dtype: int64

In [24]:
combined_df[combined_df["alert_flag"]].head()

Unnamed: 0,state,district,month,biometric_pressure_index,demographic_pressure_index,composite_pressure_index,composite_pressure_index_capped,rolling_pressure_3m,alert_level,alert_flag
25,Andaman And Nicobar Islands,Nicobar,2025-02,1.917576,0.85101,1.384293,1.384293,1.125582,Elevated,True
26,Andaman And Nicobar Islands,Nicobar,2025-03,-0.034015,2.470178,1.218082,1.218082,1.156415,Elevated,True
202,Andhra Pradesh,East Godavari,2025-11,0.950912,1.161536,1.056224,1.056224,1.005382,Elevated,True
263,Andhra Pradesh,K.V.Rangareddy,2025-12,1.874042,0.452648,1.163345,1.163345,1.109881,Elevated,True
513,Andhra Pradesh,Sri Sathya Sai,2025-11,0.976851,2.589437,1.783144,1.783144,1.074589,Elevated,True


In [25]:
combined_df["rolling_window_status"] = np.where(
    combined_df["rolling_pressure_3m"].isna(),
    "Insufficient history",
    "Sufficient history"
)

In [26]:
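def map_anomaly_label(row):
    # Label a row from its early_warning / sustained_alert flags;
    # a flag that is absent from the row is treated as 0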
    if row.get("early_warning", 0) == 1 and row.get("sustained_alert", 0) == 1:
        return "Structural Anomaly"
    elif row.get("sustained_alert", 0) == 1:
        return "Chronic Stress"
    elif row.get("early_warning", 0) == 1:
        return "Transient Spike"
    else:
        return "No Anomaly"

In [27]:
combined_df["anomaly_label"] = combined_df.apply(map_anomaly_label, axis=1)

In [29]:
# Named-aggregation spec (column, function) for the volatility measure
volatility = ("composite_pressure_index", "std")

In [30]:
def classify_district(row):
    if row["mean_pressure"] < 0 and row["volatility"] < 0.5:
        return "Stable"
    elif row["mean_pressure"] < 0 and row["volatility"] >= 0.5:
        return "Volatile"
    elif row["mean_pressure"] >= 0 and row["volatility"] < 0.5:
        return "Sustained Stress"
    else:
        return "Shock-Prone"

In [32]:
district_stats = (
    combined_df
    .groupby(["state", "district"])
    .agg(
        mean_pressure=("composite_pressure_index", "mean"),
        volatility=("composite_pressure_index", "std"),
    )
    .reset_index()   # keep state and district as columns so to_csv(index=False) retains them
)

district_stats["risk_category"] = district_stats.apply(classify_district, axis=1)

district_stats.to_csv(
    "district_risk_typology.csv",
    index=False
)
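
The reason `reset_index()` matters here can be shown on a tiny example. After a `groupby(...).agg(...)`, the group keys sit in the index, and `to_csv(index=False)` leaves the index out of the file. The `toy` frame below is a hypothetical stand-in for `combined_df`.

```python
import pandas as pd

# Hypothetical stand-in for combined_df
toy = pd.DataFrame({
    "state": ["S", "S", "S"],
    "district": ["A", "A", "B"],
    "composite_pressure_index": [0.5, 1.5, -1.0],
})

stats = toy.groupby(["state", "district"]).agg(
    mean_pressure=("composite_pressure_index", "mean")
)

# to_csv() without a path returns the CSV text, so the header can be inspected
header_without_reset = stats.to_csv(index=False).splitlines()[0]

# After reset_index() the keys are ordinary columns and are written out
header_with_reset = stats.reset_index().to_csv(index=False).splitlines()[0]
```

Without the reset, the header contains only `mean_pressure`, so the rows can no longer be tied to a district. With it, the header is `state,district,mean_pressure`.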

In [33]:
combined_df[["state", "district", "month", "rolling_pressure_3m", "alert_level"]].head()

Unnamed: 0,state,district,month,rolling_pressure_3m,alert_level
0,Andaman & Nicobar Islands,Andamans,2025-01,,No Data
1,Andaman & Nicobar Islands,Andamans,2025-02,0.747283,Watch
2,Andaman & Nicobar Islands,Andamans,2025-03,0.377334,Normal
3,Andaman & Nicobar Islands,Andamans,2025-04,0.197008,Normal
4,Andaman & Nicobar Islands,Andamans,2025-05,0.053481,Normal


In [35]:
import pandas as pd

df = pd.read_csv("../data/processed/composite_pressure_index.csv")
df["month"] = pd.to_datetime(df["month"])

df = df.sort_values(["state", "district", "month"])

# Rolling 3-month mean (already in the file; recomputed here after re-sorting)
df["rolling_pressure_3m"] = (
    df.groupby(["state", "district"])["composite_pressure_index"]
      .rolling(window=3, min_periods=2)
      .mean()
      .reset_index(level=[0,1], drop=True)
)

# alert level
df["alert_level"] = df["rolling_pressure_3m"].apply(alert_level)

# Binary alert flag, stored as 0/1 here (it was boolean in the earlier cell)
df["alert_flag"] = df["alert_level"].isin(["Elevated", "Critical"]).astype(int)

In [36]:
df["early_warning"] = 0
df["sustained_alert"] = 0

for (state, district), g in df.groupby(["state", "district"]):
    threshold = g["composite_pressure_index"].quantile(0.75)

    ew = (g["composite_pressure_index"] > threshold).astype(int)
    sa = ew.rolling(window=2, min_periods=2).sum().ge(2).astype(int)

    df.loc[g.index, "early_warning"] = ew.values
    df.loc[g.index, "sustained_alert"] = sa.values
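
The loop above can also be written with `groupby(...).transform`, which avoids assigning back group by group. The sketch below is one possible vectorised formulation on a hypothetical single-district frame. It assumes rows are already sorted by month within each district.

```python
import pandas as pd

# Hypothetical data: one district, eight consecutive months
toy = pd.DataFrame({
    "state": ["S"] * 8,
    "district": ["A"] * 8,
    "composite_pressure_index": [0.1, 0.2, 1.6, 1.7, 0.0, 0.3, 0.1, 0.2],
})

keys = ["state", "district"]

# Each district's own 75th percentile, broadcast back to its rows
threshold = (
    toy.groupby(keys)["composite_pressure_index"]
       .transform(lambda s: s.quantile(0.75))
)
toy["early_warning"] = (toy["composite_pressure_index"] > threshold).astype(int)

# 1 where the current and the previous month are both early warnings
toy["sustained_alert"] = (
    toy.groupby(keys)["early_warning"]
       .transform(lambda s: s.rolling(window=2, min_periods=2).sum().ge(2).astype(int))
)
```

In this example the third and fourth months exceed the district threshold, so `early_warning` is 1 for both, and `sustained_alert` is 1 only for the fourth month, the first point at which two consecutive warnings have been observed. Because the threshold is each district's own upper quartile, roughly a quarter of every district's months are flagged by construction.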

In [37]:
df.to_csv("../data/processed/composite_pressure_index.csv", index=False)