## 2. Datasets Used

### 2.1 Aadhaar Enrolment Dataset
- **Columns:** `date`, `state`, `district`, `pincode`, `age_0_5`, `age_5_17`, `age_18_greater`
- **Records:** ~1M rows spanning enrolment activity across India

### 2.2 Aadhaar Demographic Update Dataset
- **Columns:** `date`, `state`, `district`, `pincode`, `demo_age_5_17`, `demo_age_17_`
- **Records:** ~2M rows capturing non-biometric profile changes

### 2.3 Aadhaar Biometric Update Dataset
- **Columns:** `date`, `state`, `district`, `pincode`, `bio_age_5_17`, `bio_age_17_`
- **Records:** ~1.8M rows tracking biometric updates
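
Every later cell indexes these columns by name, so a quick schema check after loading can surface a renamed or missing column early. The sketch below is one way to do that. `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration, and the toy frame stands in for a loaded dataset.

```python
import pandas as pd

# Column lists copied from the dataset descriptions above.
EXPECTED_COLUMNS = {
    "enrolment": ["date", "state", "district", "pincode",
                  "age_0_5", "age_5_17", "age_18_greater"],
    "demographic": ["date", "state", "district", "pincode",
                    "demo_age_5_17", "demo_age_17_"],
    "biometric": ["date", "state", "district", "pincode",
                  "bio_age_5_17", "bio_age_17_"],
}

def missing_columns(df: pd.DataFrame, dataset: str) -> list[str]:
    """Return the expected columns that are absent from `df` (hypothetical helper)."""
    return [c for c in EXPECTED_COLUMNS[dataset] if c not in df.columns]

# Toy frame that lacks the `bio_age_17_` column.
toy_bio = pd.DataFrame(columns=["date", "state", "district", "pincode", "bio_age_5_17"])
print(missing_columns(toy_bio, "biometric"))
```

An empty list means the frame carries every column the analysis relies on.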

In [1]:
# Environment Setup
import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LinearRegression

plt.style.use("seaborn-v0_8")
sns.set(rc={"figure.figsize": (12, 6)})
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

BASE_DIR = Path(r"c:/Users/msi/Desktop/uidai")
enrol_dir = BASE_DIR / "api_data_aadhar_enrolment" / "api_data_aadhar_enrolment"
demo_dir = BASE_DIR / "api_data_aadhar_demographic" / "api_data_aadhar_demographic"
bio_dir = BASE_DIR / "api_data_aadhar_biometric" / "api_data_aadhar_biometric"
print("Directories configured.")

Directories configured.


In [None]:
# Data Loading Functions
def load_and_concat_csvs(directory: Path, prefix: str) -> pd.DataFrame:
    """Read every `<prefix>_*.csv` file in `directory` and stack them into one DataFrame."""
    csv_files = sorted(directory.glob(f"{prefix}_*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {directory} with prefix {prefix}")
    frames = [pd.read_csv(file_path) for file_path in csv_files]
    return pd.concat(frames, ignore_index=True)

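def parse_common_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Parse dates and normalise the location columns shared by all three datasets."""
    df = df.copy()
    # Dates are expected as DD-MM-YYYY; values that fail to parse become NaT.
    df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y", errors="coerce")
    for column in ["state", "district"]:
        if column in df.columns:
            df[column] = df[column].astype(str).str.strip()
    if "pincode" in df.columns:
        # Store pincodes as 6-character strings so they behave as labels, not numbers.
        df["pincode"] = df["pincode"].astype(str).str.zfill(6)
    return df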

print("Data loading functions defined.")

In [None]:
# Load all three datasets
print("Loading enrolment dataset...")
enrol_raw = load_and_concat_csvs(enrol_dir, "api_data_aadhar_enrolment")
enrol = parse_common_fields(enrol_raw)

print("Loading demographic updates dataset...")
demo_raw = load_and_concat_csvs(demo_dir, "api_data_aadhar_demographic")
demo = parse_common_fields(demo_raw)

print("Loading biometric updates dataset...")
bio_raw = load_and_concat_csvs(bio_dir, "api_data_aadhar_biometric")
bio = parse_common_fields(bio_raw)

print(f"Enrolment: {len(enrol)} rows")
print(f"Demographic: {len(demo)} rows")
print(f"Biometric: {len(bio)} rows")

In [None]:
# Data Quality Checks
print("\n=== ENROLMENT DATASET ===")
print(f"Date range: {enrol['date'].min()} to {enrol['date'].max()}")
print(f"States: {enrol['state'].nunique()}, Districts: {enrol['district'].nunique()}, Pincodes: {enrol['pincode'].nunique()}")
print(f"Duplicates: {enrol[['date', 'state', 'district', 'pincode']].duplicated().mean():.2%}")

print("\n=== DEMOGRAPHIC UPDATES DATASET ===")
print(f"Date range: {demo['date'].min()} to {demo['date'].max()}")
print(f"States: {demo['state'].nunique()}, Districts: {demo['district'].nunique()}, Pincodes: {demo['pincode'].nunique()}")
print(f"Duplicates: {demo[['date', 'state', 'district', 'pincode']].duplicated().mean():.2%}")

print("\n=== BIOMETRIC UPDATES DATASET ===")
print(f"Date range: {bio['date'].min()} to {bio['date'].max()}")
print(f"States: {bio['state'].nunique()}, Districts: {bio['district'].nunique()}, Pincodes: {bio['pincode'].nunique()}")
print(f"Duplicates: {bio[['date', 'state', 'district', 'pincode']].duplicated().mean():.2%}")

In [None]:
# Aggregate Enrolments by State
enrol["total_enrol"] = enrol["age_0_5"] + enrol["age_5_17"] + enrol["age_18_greater"]
enrol_state_total = enrol.groupby("state", as_index=False)[
    ["age_0_5", "age_5_17", "age_18_greater", "total_enrol"]
].sum()

# Age-share within each state
for column in ["age_0_5", "age_5_17", "age_18_greater"]:
    enrol_state_total[f"share_{column}"] = enrol_state_total[column] / enrol_state_total["total_enrol"]

print("Top 10 States by Enrolments:")
print(enrol_state_total.sort_values("total_enrol", ascending=False).head(10)[["state", "total_enrol", "share_age_0_5", "share_age_5_17", "share_age_18_greater"]])

In [None]:
# Aggregate Demographic and Biometric Updates by State
demo["total_demo_updates"] = demo["demo_age_5_17"] + demo["demo_age_17_"]
demo_state_total = demo.groupby("state", as_index=False)[["demo_age_5_17", "demo_age_17_", "total_demo_updates"]].sum()

bio["total_bio_updates"] = bio["bio_age_5_17"] + bio["bio_age_17_"]
bio_state_total = bio.groupby("state", as_index=False)[["bio_age_5_17", "bio_age_17_", "total_bio_updates"]].sum()

# Merge into unified panel
state_panel = enrol_state_total.merge(
    demo_state_total[["state", "total_demo_updates"]],
    on="state",
    how="left",
).merge(
    bio_state_total[["state", "total_bio_updates"]],
    on="state",
    how="left",
)

# Fill missing values
state_panel[["total_demo_updates", "total_bio_updates"]] = state_panel[["total_demo_updates", "total_bio_updates"]].fillna(0)

# Compute intensity ratios (updates per 1,000 enrolments).
# The +1 in the denominator avoids division by zero for states with no enrolments.
state_panel["demo_updates_per_1000_enrol"] = 1000 * state_panel["total_demo_updates"] / (state_panel["total_enrol"] + 1)
state_panel["bio_updates_per_1000_enrol"] = 1000 * state_panel["total_bio_updates"] / (state_panel["total_enrol"] + 1)

print("Top 10 States by Demographic Update Intensity:")
print(state_panel.sort_values("demo_updates_per_1000_enrol", ascending=False).head(10)[["state", "total_demo_updates", "demo_updates_per_1000_enrol"]])

In [None]:
# National-Level Time Series
enrol_nat_daily = enrol.groupby("date", as_index=False)["total_enrol"].sum()
demo_nat_daily = demo.groupby("date", as_index=False)["total_demo_updates"].sum()
bio_nat_daily = bio.groupby("date", as_index=False)["total_bio_updates"].sum()

merged_nat_daily = enrol_nat_daily.merge(demo_nat_daily, on="date", how="outer").merge(
    bio_nat_daily, on="date", how="outer"
).fillna(0).sort_values("date")

fig = px.line(
    merged_nat_daily,
    x="date",
    y=["total_enrol", "total_demo_updates", "total_bio_updates"],
    title="National Daily Aadhaar Enrolments vs Updates",
    labels={"value": "Count", "date": "Date", "variable": "Metric"},
)
fig.show()

In [None]:
# Top States by Total Enrolments
top_states_by_enrol = state_panel.sort_values("total_enrol", ascending=False).head(15)

fig = px.bar(
    top_states_by_enrol,
    x="state",
    y="total_enrol",
    title="Top 15 States by Total Aadhaar Enrolments",
)
fig.update_layout(xaxis_tickangle=-45)
fig.show()

In [None]:
# Age-Profile of Enrolments
age_profile = enrol_state_total.melt(
    id_vars=["state"],
    value_vars=["age_0_5", "age_5_17", "age_18_greater"],
    var_name="age_group",
    value_name="enrolments",
)

fig = px.bar(
    age_profile,
    x="state",
    y="enrolments",
    color="age_group",
    title="Age-wise Aadhaar Enrolments by State",
)
fig.update_layout(xaxis_tickangle=-60)
fig.show()

In [None]:
# Demographic vs Biometric Update Mix
update_mix = state_panel.copy()
update_mix["demo_share_of_updates"] = update_mix["total_demo_updates"] / (
    update_mix["total_demo_updates"] + update_mix["total_bio_updates"] + 1e-9
)

fig = px.scatter(
    update_mix,
    x="demo_updates_per_1000_enrol",
    y="bio_updates_per_1000_enrol",
    text="state",
    color="demo_share_of_updates",
    color_continuous_scale="Viridis",
    title="State-wise Mix and Intensity of Demographic vs Biometric Updates",
)
fig.update_traces(textposition="top center")
fig.show()

In [None]:
# Anomaly Detection
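def detect_anomalies(series: pd.Series, window: int = 7, z_threshold: float = 3.0) -> pd.Series:
    # The baseline uses only the previous `window` days (shift(1)). If the current day
    # were part of its own window, |z| could never exceed (window - 1) / sqrt(window),
    # about 2.27 for window=7, so a threshold of 3.0 would flag nothing.
    baseline = series.shift(1).rolling(window=window, min_periods=window)
    z_scores = (series - baseline.mean()) / (baseline.std() + 1e-9)
    return z_scores.abs() > z_threshold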

merged_nat_daily["enrol_anomaly"] = detect_anomalies(
    merged_nat_daily["total_enrol"], window=7, z_threshold=3.0
)

anomalous_days = merged_nat_daily[merged_nat_daily["enrol_anomaly"]]
print(f"Anomalous days detected: {len(anomalous_days)}")
print(anomalous_days[["date", "total_enrol"]].head(10))

In [None]:
# Visualize Anomalies
fig = px.line(
    merged_nat_daily,
    x="date",
    y="total_enrol",
    title="National Daily Enrolments with Anomaly Markers",
)

if len(anomalous_days) > 0:
    fig.add_scatter(
        x=anomalous_days["date"],
        y=anomalous_days["total_enrol"],
        mode="markers",
        marker=dict(color="red", size=10),
        name="Anomalies",
    )

fig.show()

In [None]:
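# Simple Forecasting Model: a straight-line trend fitted to daily national enrolments.
# It ignores weekly and seasonal patterns, so treat the forecast as a rough baseline.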
enrol_ts = enrol_nat_daily.sort_values("date").copy()
enrol_ts["t"] = (enrol_ts["date"] - enrol_ts["date"].min()).dt.days

X = enrol_ts[["t"]].values
y = enrol_ts["total_enrol"].values

model = LinearRegression()
model.fit(X, y)

# Forecast for next 14 days
last_t = enrol_ts["t"].max()
future_t = np.arange(last_t + 1, last_t + 15)
future_dates = enrol_ts["date"].max() + pd.to_timedelta(future_t - last_t, unit="D")
future_pred = model.predict(future_t.reshape(-1, 1))

forecast_df = pd.DataFrame({"date": future_dates, "predicted_enrol": future_pred})

fig = px.line(
    enrol_ts,
    x="date",
    y="total_enrol",
    title="Observed vs Forecasted National Enrolments",
)

fig.add_scatter(
    x=forecast_df["date"],
    y=forecast_df["predicted_enrol"],
    mode="lines+markers",
    name="Forecast",
    line=dict(color="red", dash="dash"),
)

fig.show()

## Key Findings & Insights

### 1. Enrolment Intensity and Age Structure
- Uttar Pradesh, Bihar, and Karnataka drive the largest absolute enrolment volumes.
- States differ in age profile: a higher share of child enrolments (ages 0–17) in some regions may reflect recent coverage expansion or school-integration initiatives.

### 2. Update Behaviour and Lifecycle
- Demographic updates per 1,000 enrolments vary widely across states, indicating differences in:
  - Population mobility (migration, urbanisation).
  - Mobile number seeding and KYC uptake.
- High biometric update intensity may reflect either a young population or quality problems in the initial biometric capture.
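
The intensity metric is simply updates per 1,000 enrolments. The worked example below uses made-up numbers (the state names and counts are illustrative, not taken from the datasets) to show how a state with fewer updates in absolute terms can still have the higher intensity. The notebook adds 1 to the denominator to avoid division by zero; the sketch omits it to keep the arithmetic exact.

```python
import pandas as pd

# Illustrative counts only; not drawn from the Aadhaar datasets.
toy = pd.DataFrame({
    "state": ["State A", "State B"],
    "total_enrol": [50_000, 8_000],
    "total_demo_updates": [100_000, 40_000],
})

# Updates per 1,000 enrolments: 1000 * updates / enrolments.
toy["demo_updates_per_1000_enrol"] = 1000 * toy["total_demo_updates"] / toy["total_enrol"]
print(toy)
```

Here State A works out to 2,000 updates per 1,000 enrolments and State B to 5,000, even though State B recorded fewer updates overall.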

### 3. Temporal Dynamics
- The national time series shows distinct peaks in enrolments and updates, likely corresponding to:
  - Targeted campaigns or scheme deadlines.
  - Seasonal patterns (e.g., post-festival surges, school enrolment cycles).
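
One way to probe for such cycles is to aggregate the daily series by day of week and by calendar month. The sketch below runs on a synthetic series; `daily` is a hypothetical stand-in for the national daily totals, and the weekday/weekend split is invented purely to make the pattern visible.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for the national totals (illustrative only).
dates = pd.date_range("2025-01-01", periods=56, freq="D")
counts = np.where(dates.dayofweek < 5, 1000, 200)  # weekdays busier than weekends
daily = pd.DataFrame({"date": dates, "total_enrol": counts})

# Mean volume by day of week (0 = Monday) exposes a weekly cycle.
by_weekday = daily.groupby(daily["date"].dt.dayofweek)["total_enrol"].mean()

# Monthly totals expose slower seasonal swings.
by_month = daily.groupby(daily["date"].dt.to_period("M"))["total_enrol"].sum()

print(by_weekday)
print(by_month)
```

On real data, a flat day-of-week profile combined with uneven monthly totals would point towards campaign or seasonal effects rather than a weekly operating rhythm.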

### 4. Anomalies and Risk Indicators
- Z-score-based detection flags unusual spikes in activity that may indicate:
  - Operational bottlenecks (bursty patterns due to limited capacity).
  - Potential data-quality or misuse patterns requiring audit.
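
A rolling z-score can be formulated in several ways. The sketch below is one possible version in which each day is compared against the mean and standard deviation of the preceding days only. `trailing_zscore_flags` is a hypothetical name, and the series is synthetic.

```python
import pandas as pd

def trailing_zscore_flags(series: pd.Series, window: int = 7, z_threshold: float = 3.0) -> pd.Series:
    """Flag points that deviate strongly from the mean of the preceding `window` points."""
    # shift(1) keeps the current point out of its own baseline.
    baseline = series.shift(1).rolling(window=window, min_periods=window)
    z = (series - baseline.mean()) / (baseline.std() + 1e-9)
    # The first `window` points have no full baseline; NaN compares as False.
    return z.abs() > z_threshold

# Synthetic series: steady volume with small noise and one spike at index 10.
values = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 500, 101, 99, 100]
flags = trailing_zscore_flags(pd.Series(values, dtype=float))
print(flags[flags].index.tolist())
```

Excluding the current point from the baseline matters for short windows: when a value sits inside its own 7-sample window, its z-score (with the sample standard deviation) cannot exceed 6/√7 ≈ 2.27, so a threshold of 3.0 could never be reached.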

### 5. Proposed Solution Frameworks

**A. Inclusion and Outreach Targeting**
- Use age-profile indicators to identify states/districts with low child or adult coverage.
- Launch targeted drives: school-based (for 5–17) and mobile-based (for remote areas).
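
As a rough illustration of the age-profile indicator, the sketch below computes the child share of enrolments per state and flags states under a cut-off. The data, the `share_child` column, and the 30% threshold are all hypothetical.

```python
import pandas as pd

# Illustrative state totals; not drawn from the Aadhaar datasets.
toy = pd.DataFrame({
    "state": ["State A", "State B", "State C"],
    "age_0_5": [600, 100, 300],
    "age_5_17": [300, 100, 300],
    "age_18_greater": [100, 800, 400],
})

age_cols = ["age_0_5", "age_5_17", "age_18_greater"]
toy["total_enrol"] = toy[age_cols].sum(axis=1)
toy["share_child"] = (toy["age_0_5"] + toy["age_5_17"]) / toy["total_enrol"]

# Hypothetical cut-off: flag states whose child share of enrolments is below 30%.
CHILD_SHARE_THRESHOLD = 0.30
low_child = toy.loc[toy["share_child"] < CHILD_SHARE_THRESHOLD, "state"].tolist()
print(low_child)
```

A low share of child enrolments is only a proxy for low coverage. Measuring coverage directly would need population denominators, which are not among the columns of these datasets.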

**B. Capacity Planning and Resource Allocation**
- Use state- and district-level demand forecasts to allocate kits, staff, and operating hours.
- Prioritise extra capacity in high-demand regions with high update-to-enrolment ratios.
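
A state-level demand forecast could start from the same straight-line trend used for the national series, fitted once per state. The sketch below uses `np.polyfit` rather than scikit-learn's `LinearRegression` to stay dependency-light; `forecast_linear` and the synthetic `panel` are hypothetical.

```python
import numpy as np
import pandas as pd

def forecast_linear(daily: pd.DataFrame, value_col: str, horizon: int = 14) -> pd.DataFrame:
    """Fit a straight line to one entity's daily counts and extend it `horizon` days."""
    daily = daily.sort_values("date")
    t = (daily["date"] - daily["date"].min()).dt.days.to_numpy()
    slope, intercept = np.polyfit(t, daily[value_col].to_numpy(dtype=float), deg=1)
    future_t = np.arange(t.max() + 1, t.max() + 1 + horizon)
    future_dates = daily["date"].min() + pd.to_timedelta(future_t, unit="D")
    return pd.DataFrame({"date": future_dates, "predicted": slope * future_t + intercept})

# Synthetic per-state daily counts (illustrative only).
dates = pd.date_range("2025-01-01", periods=10, freq="D")
panel = pd.concat([
    pd.DataFrame({"state": "State A", "date": dates, "total_enrol": 100 + 10 * np.arange(10)}),
    pd.DataFrame({"state": "State B", "date": dates, "total_enrol": 50 + 0 * np.arange(10)}),
], ignore_index=True)

# One forecast frame per state.
forecasts = {
    state: forecast_linear(group, "total_enrol", horizon=3)
    for state, group in panel.groupby("state")
}
print(forecasts["State A"])
```

A linear trend ignores weekly and seasonal cycles, so for real allocation decisions it is best read as a baseline that a richer model should beat.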

**C. Risk-Based Supervision**
- Maintain anomaly scores for states and operators.
- Apply additional verification, training, or audits for high-risk entities.
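
One simple anomaly score per entity is the share of its observed days that were flagged. The sketch below assumes a frame with one row per state and day and a boolean `is_anomaly` column; both the data and the 50% review cut-off are hypothetical.

```python
import pandas as pd

# Illustrative daily flags per state; in practice these would come from an anomaly detector.
flags = pd.DataFrame({
    "state": ["State A"] * 4 + ["State B"] * 4,
    "is_anomaly": [False, False, True, False, True, True, False, True],
})

# Anomaly score = share of observed days flagged as anomalous.
anomaly_rate = flags.groupby("state")["is_anomaly"].mean().sort_values(ascending=False)

# Hypothetical review rule: audit entities whose rate exceeds 50%.
to_review = anomaly_rate[anomaly_rate > 0.5].index.tolist()
print(anomaly_rate)
print(to_review)
```

The same grouping would work at operator level if an operator identifier were available; the datasets described here stop at pincode.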

**D. Dashboard and Monitoring**
- Build a multi-level KPI dashboard covering enrolments by age, update intensities, anomaly rates, and demand forecasts.
- Integrate decision rules (thresholds for alerts) to make analytics immediately actionable.
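
Decision rules of this kind can be kept as plain data so that thresholds are easy to review and change. The sketch below is a minimal illustration; `AlertRule`, `evaluate`, and every threshold value are hypothetical and not derived from the analysis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertRule:
    metric: str       # KPI name the rule watches
    threshold: float  # alert when the KPI exceeds this value
    message: str      # text shown on the dashboard

# Hypothetical rules and thresholds, for illustration only.
RULES = [
    AlertRule("anomaly_rate", 0.10, "Anomaly rate above 10%: schedule an audit"),
    AlertRule("demo_updates_per_1000_enrol", 5000.0, "High update load: review capacity"),
]

def evaluate(kpis: dict[str, float], rules: list[AlertRule]) -> list[str]:
    """Return the message of every rule whose metric exceeds its threshold."""
    return [r.message for r in rules if kpis.get(r.metric, 0.0) > r.threshold]

alerts = evaluate({"anomaly_rate": 0.25, "demo_updates_per_1000_enrol": 1200.0}, RULES)
print(alerts)
```

Keeping rules separate from the evaluation logic means a threshold can be tuned without touching the dashboard code.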