{
    "cells": [
        {
            "cell_type": "markdown",
            "id": "#VSC-3809d8c5",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "## 2. Datasets Used",
                "",
                "The following datasets from the UIDAI hackathon bundle are used. Each is split into multiple CSV partitions; we will programmatically load and concatenate all parts for each dataset.",
                "",
                "### 2.1 Aadhaar Enrolment Dataset",
                "",
                "**Files (example names):**",
                "- `api_data_aadhar_enrolment_0_500000.csv`",
                "- `api_data_aadhar_enrolment_500000_1000000.csv`",
                "- `api_data_aadhar_enrolment_1000000_1006029.csv`",
                "",
                "**Columns:**",
                "- `date` – Date of enrolment activity (string, format like `DD-MM-YYYY`).",
                "- `state` – State/UT name.",
                "- `district` – District name.",
                "- `pincode` – 6-digit postal pincode.",
                "- `age_0_5` – Number of enrolments for children aged 0–5 on that date and location.",
                "- `age_5_17` – Number of enrolments for children/adolescents aged 5–17.",
                "- `age_18_greater` – Number of enrolments for adults aged 18 and above.",
                "",
                "Together, these columns provide **age-structured enrolment volume** per pincode per day.",
                "",
                "### 2.2 Aadhaar Demographic Update Dataset",
                "",
                "**Files (example names):**",
                "- `api_data_aadhar_demographic_0_500000.csv`",
                "- `api_data_aadhar_demographic_500000_1000000.csv`",
                "- `api_data_aadhar_demographic_1000000_1500000.csv`",
                "- `api_data_aadhar_demographic_1500000_2000000.csv`",
                "- `api_data_aadhar_demographic_2000000_2071700.csv`",
                "",
                "**Columns:**",
                "- `date` – Date of demographic update activity.",
                "- `state` – State/UT name.",
                "- `district` – District name.",
                "- `pincode` – Postal pincode.",
                "- `demo_age_5_17` – Number of **demographic updates** for age group 5–17 (e.g., name, address, mobile number, etc.).",
                "- `demo_age_17_` – Number of **demographic updates** for age 17 and above (column name truncated, but interpreted as 17+).",
                "",
                "This dataset captures **non-biometric profile changes**, which are often proxies for migration, household mobility, and KYC-related updates.",
                "",
                "### 2.3 Aadhaar Biometric Update Dataset",
                "",
                "**Files (example names):**",
                "- `api_data_aadhar_biometric_0_500000.csv`",
                "- `api_data_aadhar_biometric_500000_1000000.csv`",
                "- `api_data_aadhar_biometric_1000000_1500000.csv`",
                "- `api_data_aadhar_biometric_1500000_1861108.csv`",
                "",
                "**Columns:**",
                "- `date` – Date of biometric update activity.",
                "- `state` – State/UT name.",
                "- `district` – District name.",
                "- `pincode` – Postal pincode.",
                "- `bio_age_5_17` – Number of **biometric updates** for age group 5–17.",
                "- `bio_age_17_` – Number of **biometric updates** for age 17 and above.",
                "",
                "Biometric updates can reflect both **quality refresh needs** (e.g., children whose biometrics change as they grow) and **operational issues** (poor initial capture)."
            ]
        },
        {
            "cell_type": "markdown",
            "id": "#VSC-75965f30",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "## 3. Methodology Overview",
                "",
                "The methodology is organised into distinct stages, making the analysis reproducible and extensible:",
                "",
                "1. **Data Ingestion and Schema Harmonisation**",
                "   - Read all CSV partitions for each dataset and vertically concatenate them.",
                "   - Parse `date` into a proper datetime field.",
                "   - Ensure consistent casing and trimming of `state`, `district`, and datatypes for `pincode`.",
                "   - Retain a common key: `(date, state, district, pincode)`.",
                "",
                "2. **Data Cleaning and Quality Checks**",
                "   - Check for missing, invalid, or duplicate keys.",
                "   - Validate non-negativity and reasonableness of counts.",
                "   - Identify potential data-entry anomalies (e.g., very large counts on a single day at a single pincode).",
                "",
                "3. **Feature Engineering and Derived Metrics**",
                "   - Aggregate data to multiple levels: pincode, district, state, and national.",
                "   - Compute age-wise proportions and totals:",
                "     - Enrolment profile: child vs adult share by state.",
                "     - Update profile: ratio of demographic to biometric updates by age group.",
                "   - Derive **intensity metrics**, e.g., updates per 1,000 enrolments (proxy, since individual-level link is absent).",
                "   - Construct **time-series indicators**: 7-day rolling averages, seasonal patterns.",
                "",
                "4. **Exploratory Data Analysis (EDA) and Visualisation**",
                "   - Time series of daily enrolments and updates at national and state levels.",
                "   - State-wise rankings and distributions of enrolment and update intensities.",
                "   - Age-group patterns across states and over time.",
                "",
                "5. **Anomaly Detection and Risk Indicators**",
                "   - Spot unusually high spikes in enrolment or updates using z-score based methods on residuals.",
                "   - Compute anomaly indicators at pincode/state level and examine their geographical and temporal clustering.",
                "",
                "6. **Predictive / Forward-Looking Indicators (Lightweight Models)**",
                "   - Use recent historical activity to build simple forecasting models of daily enrolment/update volumes (per state).",
                "   - Use regression-based scoring to identify locations likely to see high update intensity relative to enrolments.",
                "",
                "7. **Synthesis into Policy and Operational Insights**",
                "   - Map analytical findings to operational levers: capacity planning, outreach targeting, risk-based supervision.",
                "   - Illustrate how derived indicators can feed into dashboards and decision rules."
            ]
        },
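        {
            "cell_type": "markdown",
            "id": "#VSC-a1f4c2e7",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "The time-series and intensity indicators from stage 3 can be sketched on toy data as shown in the next cell. This is a minimal illustration, not the notebook's pipeline: the frame `toy_daily` and all of its values are invented, and a trailing 7-day window with `min_periods=7` is an assumption about how the rolling average would be defined.",
                "",
                "With `min_periods=7` the first six days have no average, which avoids reporting a \"7-day\" figure computed from fewer than seven observations."
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-b7d09e31",
            "metadata": {
                "language": "python"
            },
            "source": [
                "import pandas as pd",
                "",
                "# Hypothetical toy data: ten consecutive days of national totals (values are invented).",
                "toy_daily = pd.DataFrame(",
                "    {",
                "        \"date\": pd.date_range(\"2025-01-01\", periods=10, freq=\"D\"),",
                "        \"total_enrol\": [100, 120, 110, 130, 90, 95, 125, 140, 135, 150],",
                "        \"total_demo_updates\": [50, 60, 55, 65, 45, 48, 62, 70, 68, 75],",
                "    }",
                ")",
                "",
                "# Trailing 7-day rolling average; stays NaN until a full window is available.",
                "toy_daily[\"enrol_7d_avg\"] = toy_daily[\"total_enrol\"].rolling(window=7, min_periods=7).mean()",
                "",
                "# Intensity proxy: demographic updates per 1,000 enrolments over the whole toy period.",
                "toy_intensity = 1000 * toy_daily[\"total_demo_updates\"].sum() / toy_daily[\"total_enrol\"].sum()",
                "",
                "toy_daily.tail(4), toy_intensity"
            ]
        },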
        {
            "cell_type": "code",
            "id": "#VSC-0a5e6df8",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 4. Environment Setup and Library Imports",
                "",
                "import os",
                "from pathlib import Path",
                "",
                "import numpy as np",
                "import pandas as pd",
                "import matplotlib.pyplot as plt",
                "import seaborn as sns",
                "import plotly.express as px",
                "",
                "plt.style.use(\"seaborn-v0_8\")",
                "sns.set(rc={\"figure.figsize\": (12, 6)})",
                "",
                "# Configure pandas display",
                "pd.set_option(\"display.max_columns\", 50)",
                "pd.set_option(\"display.width\", 120)",
                "",
                "# Base directory (adjust if running in a different folder)",
                "BASE_DIR = Path(r\"c:/Users/msi/Desktop/uidai\")",
                "",
                "enrol_dir = BASE_DIR / \"api_data_aadhar_enrolment\" / \"api_data_aadhar_enrolment\"",
                "demo_dir = BASE_DIR / \"api_data_aadhar_demographic\" / \"api_data_aadhar_demographic\"",
                "bio_dir = BASE_DIR / \"api_data_aadhar_biometric\" / \"api_data_aadhar_biometric\"",
                "",
                "enrol_dir, demo_dir, bio_dir"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-826fad03",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 5. Helper Functions for Loading and Cleaning",
                "",
                "def load_and_concat_csvs(directory: Path, prefix: str) -> pd.DataFrame:",
                "    \"\"\"Load all CSV files in a directory whose names start with a given prefix and concatenate them.",
                "",
                "    Parameters",
                "    ----------",
                "    directory : Path",
                "        Directory containing the CSV partitions.",
                "    prefix : str",
                "        File name prefix, e.g. \"api_data_aadhar_enrolment\".",
                "",
                "    Returns",
                "source",
                "",
                "def missing_summary(df: pd.DataFrame, name: str) -> pd.Series:",
                "    summary = df.isna().mean().sort_values(ascending=False)",
                "    print(f\"\\nMissing-value fraction for {name} (top 10):\")",
                "    print(summary.head(10))",
                "    return summary",
                "",
                "miss_enrol = missing_summary(enrol, \"enrolment\")",
                "miss_demo = missing_summary(demo, \"demographic updates\")",
                "miss_bio = missing_summary(bio, \"biometric updates\")",
                "",
                "miss_enrol, miss_demo, miss_bio"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-f2557693",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 6. Load Enrolment, Demographic, and Biometric Datasets",
                "",
                "enrol_raw = load_and_concat_csvs(enrol_dir, \"api_data_aadhar_enrolment\")",
                "demo_raw = load_and_concat_csvs(demo_dir, \"api_data_aadhar_demographic\")",
                "bio_raw = load_and_concat_csvs(bio_dir, \"api_data_aadhar_biometric\")",
                "",
                "enrol = parse_common_fields(enrol_raw)",
                "demo = parse_common_fields(demo_raw)",
                "bio = parse_common_fields(bio_raw)",
                "",
                "enrol.head(), demo.head(), bio.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "#VSC-ea44e74c",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "## 7. Data Cleaning and Basic Quality Checks",
                "",
                "We now perform basic validation and cleaning steps to ensure that the analyses rest on a consistent and robust base:",
                "- Check for missing dates, states, districts, and pincodes.",
                "- Verify that all counts are non-negative.",
                "- Identify duplicated keys `(date, state, district, pincode)` per dataset.",
                "- Summarise the overall time coverage and geographic coverage."
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-c157c80f",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 7.1 Missing Values Summary",
                "",
                "def missing_summary(df: pd.DataFrame, name: str) -> pd.Series:",
                "    summary = df.isna().mean().sort_values(ascending=False)",
                "    print(f\\nMissing-value fraction for {name} (top 10):",
                ",",
                ",",
                ",",
                "enrolment\")",
                "miss_demo = missing_summary(demo, \"demographic updates\")",
                "miss_bio = missing_summary(bio, \"biometric updates\")",
                "",
                "miss_enrol, miss_demo, miss_bio"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-c507295e",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 7.2 Non-Negativity and Basic Distribution Checks for Count Columns",
                "",
                "enrol_count_cols = [\"age_0_5\", \"age_5_17\", \"age_18_greater\"]",
                "demo_count_cols = [\"demo_age_5_17\", \"demo_age_17_\"]",
                "bio_count_cols = [\"bio_age_5_17\", \"bio_age_17_\"]",
                "",
                "def check_non_negative(df: pd.DataFrame, columns: list, name: str) -> None:",
                "    for column in columns:",
                "        if column in df.columns:",
                "            negative_count = (df[column] < 0).sum()",
                "            print(f\"{name} - {column}: {negative_count} rows with negative values\")",
                "",
                "check_non_negative(enrol, enrol_count_cols, \"Enrolment\")",
                "check_non_negative(demo, demo_count_cols, \"Demographic updates\")",
                "check_non_negative(bio, bio_count_cols, \"Biometric updates\")"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-9cf120eb",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 7.3 Duplication Checks",
                "",
                "def key_duplication_rate(df: pd.DataFrame, name: str) -> None:",
                "    key_columns = [\"date\", \"state\", \"district\", \"pincode\"]",
                "    df_key = df[key_columns]",
                "    duplicated_fraction = df_key.duplicated().mean()",
                "    print(f\"{name}: {duplicated_fraction:.4%} of key rows are duplicated\")",
                "",
                "key_duplication_rate(enrol, \"Enrolment\")",
                "key_duplication_rate(demo, \"Demographic updates\")",
                "key_duplication_rate(bio, \"Biometric updates\")"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-62233e7d",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 7.4 Time Coverage and Geographic Coverage",
                "",
                "def coverage_summary(df: pd.DataFrame, name: str) -> None:",
                "    print(f\"\\n{name} coverage summary:\")",
                "    print(\"Date range:\", df[\"date\"].min(), \"to\", df[\"date\"].max())",
                "    print(\"# States:\", df[\"state\"].nunique())",
                "    print(\"# Districts:\", df[\"district\"].nunique())",
                "    print(\"# Pincodes:\", df[\"pincode\"].nunique())",
                "    print(\"# Rows:\", len(df))",
                "",
                "coverage_summary(enrol, \"Enrolment\")",
                "coverage_summary(demo, \"Demographic updates\")",
                "coverage_summary(bio, \"Biometric updates\")"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "#VSC-72b3df7b",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "### Interpretation Notes (for the Report)",
                "",
                "In the final PDF report, this section should summarise:",
                "- Overall temporal span of the data and whether it covers continuous days or selective sampling.",
                "- Breadth of geographic coverage and any states/districts with especially sparse representation.",
                "- Any unexpected missingness or negative counts and how they were handled (e.g., filtered out, treated as zero)."
            ]
        },
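        {
            "cell_type": "markdown",
            "id": "#VSC-c3e85f12",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "One way to check the first point (continuous days versus selective sampling) is to compare the observed dates with a complete daily calendar over the same span. The next cell is a minimal sketch on invented data: `missing_calendar_days` and `toy_dates` are hypothetical names, not part of the notebook's helpers.",
                "",
                "Applied to a real frame, the same call would be `missing_calendar_days(enrol[\"date\"])`, assuming `date` has already been parsed to datetime."
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-d92b6a04",
            "metadata": {
                "language": "python"
            },
            "source": [
                "import pandas as pd",
                "",
                "",
                "def missing_calendar_days(dates: pd.Series) -> pd.DatetimeIndex:",
                "    \"\"\"Return calendar days between the first and last observed date that have no rows.\"\"\"",
                "    observed = pd.DatetimeIndex(dates.dropna().unique()).normalize()",
                "    full_range = pd.date_range(observed.min(), observed.max(), freq=\"D\")",
                "    return full_range.difference(observed)",
                "",
                "",
                "# Hypothetical toy input: 3 and 4 January have no rows, 5 January appears twice.",
                "toy_dates = pd.Series(pd.to_datetime([\"2025-01-01\", \"2025-01-02\", \"2025-01-05\", \"2025-01-05\"]))",
                "gaps = missing_calendar_days(toy_dates)",
                "gaps"
            ]
        },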
        {
            "cell_type": "code",
            "id": "#VSC-f3a144b4",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 8. Aggregate Metrics and Unified Panel",
                "",
                "        ",
                "# 8.1 Aggregate Enrolments by Key Levels",
                "enrol[\"total_enrol\"] = enrol[\"age_0_5\"] + enrol[\"age_5_17\"] + enrol[\"age_18_greater\"]",
                "",
                "enrol_state_daily = enrol.groupby([\"date\", \"state\"], as_index=False)[",
                "    [\"age_0_5\", \"age_5_17\", \"age_18_greater\", \"total_enrol\"]",
                "]",
                "",
                "enrol_state_total = enrol_state_daily.groupby(\"state\", as_index=False)[",
                "    [\"age_0_5\", \"age_5_17\", \"age_18_greater\", \"total_enrol\"]",
                "].sum()",
                "",
                "# Age-share within each state",
                "for column in [\"age_0_5\", \"age_5_17\", \"age_18_greater\"]:",
                "    enrol_state_total[f\"share_{column}\"] = enrol_state_total[column] / enrol_state_total[\"total_enrol\"]",
                "",
                "enrol_state_total.head()"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-a3d38ce6",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 8.2 Aggregate Demographic and Biometric Updates by State and Date",
                "",
                "demo[\"total_demo_updates\"] = demo[\"demo_age_5_17\"] + demo[\"demo_age_17_\"]",
                "bio[\"total_bio_updates\"] = bio[\"bio_age_5_17\"] + bio[\"bio_age_17_\"]",
                "",
                "demo_state_daily = demo.groupby([\"date\", \"state\"], as_index=False)[",
                "    [\"demo_age_5_17\", \"demo_age_17_\", \"total_demo_updates\"]",
                "].sum()",
                "bio_state_daily = bio.groupby([\"date\", \"state\"], as_index=False)[",
                "    [\"bio_age_5_17\", \"bio_age_17_\", \"total_bio_updates\"]",
                "].sum()",
                "",
                "demo_state_total = demo_state_daily.groupby(\"state\", as_index=False).sum(numeric_only=True)",
                "bio_state_total = bio_state_daily.groupby(\"state\", as_index=False).sum(numeric_only=True)",
                "",
                "demo_state_total.head(), bio_state_total.head()"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-d5f11008",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 8.3 Merge into a Unified State-Level Panel and Derive Intensity Ratios",
                "",
                "state_panel = enrol_state_total.merge(",
                "    demo_state_total[[\"state\", \"demo_age_5_17\", \"demo_age_17_\", \"total_demo_updates\"]],",
                "    on=\"state\",",
                "    how=\"left\",",
                ").merge(",
                "    bio_state_total[[\"state\", \"bio_age_5_17\", \"bio_age_17_\", \"total_bio_updates\"]],",
                "    on=\"state\",",
                "    how=\"left\",",
                ")",
                "",
                "# Replace missing update counts with zero where states appear only in enrolment",
                "for column in [",
                "    \"demo_age_5_17\",",
                "    \"demo_age_17_\",",
                "    \"total_demo_updates\",",
                "    \"bio_age_5_17\",",
                "    \"bio_age_17_\",",
                "    \"total_bio_updates\",",
                "]:",
                "    if column in state_panel.columns:",
                "        state_panel[column] = state_panel[column].fillna(0)",
                "",
                "# Ratios: updates per enrolment (proxy intensity measures)",
                "state_panel[\"demo_updates_per_1000_enrol\"] = 1000 * state_panel[\"total_demo_updates\"] / state_panel[\"total_enrol\"]",
                "state_panel[\"bio_updates_per_1000_enrol\"] = 1000 * state_panel[\"total_bio_updates\"] / state_panel[\"total_enrol\"]",
                "",
                "state_panel.sort_values(\"total_enrol\", ascending=False).head(10)"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "#VSC-64203663",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "## 9. Data Analysis and Visualisations",
                "",
                "In this section, we generate key visualisations and interpret them to extract insights about:",
                "- The **distribution and intensity** of Aadhaar enrolments across states and age groups.",
                "- The **relative burden of demographic vs biometric updates**, indicating stability of records and operational quality.",
                "- **Temporal trends and anomalies** in enrolment and updates.",
                "- **Derived indicators** that can inform policy and operational decisions."
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-f66d6a34",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 9.1 National-Level Daily Time Series of Enrolments and Updates",
                "",
                "enrol_nat_daily = enrol.groupby(\"date\", as_index=False)[\"total_enrol\"].sum()",
                "demo_nat_daily = demo.groupby(\"date\", as_index=False)[\"total_demo_updates\"].sum()",
                "bio_nat_daily = bio.groupby(\"date\", as_index=False)[\"total_bio_updates\"].sum()",
                "",
                "merged_nat_daily = enrol_nat_daily.merge(demo_nat_daily, on=\"date\", how=\"outer\").merge(",
                "    bio_nat_daily, on=\"date\", how=\"outer\"",
                ").fillna(0).sort_values(\"date\")",
                "",
                "fig = px.line(",
                "    merged_nat_daily,",
                "    x=\"date\",",
                "    y=[\"total_enrol\", \"total_demo_updates\", \"total_bio_updates\"],",
                "    labels={\"value\": \"Count\", \"date\": \"Date\", \"variable\": \"Metric\"},",
                "    title=\"National Daily Aadhaar Enrolments vs Demographic and Biometric Updates\",",
                ")",
                "fig.show()"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-8b22c6a9",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 9.2 Top States by Total Enrolments and Update Intensity",
                "",
                "top_states_by_enrol = state_panel.sort_values(\"total_enrol\", ascending=False).head(15)",
                "",
                "fig = px.bar(",
                "    top_states_by_enrol,",
                "    x=\"state\",",
                "    y=\"total_enrol\",",
                "    title=\"Top 15 States by Total Aadhaar Enrolments\",",
                "    labels={\"total_enrol\": \"Total Enrolments\", \"state\": \"State\"},",
                ")",
                "fig.update_layout(xaxis_tickangle=-45)",
                "fig.show()",
                "",
                "top_states_by_demo_intensity = state_panel.sort_values(",
                "    \"demo_updates_per_1000_enrol\", ascending=False",
                ").head(15)",
                "",
                "fig = px.bar(",
                "    top_states_by_demo_intensity,",
                "    x=\"state\",",
                "    y=\"demo_updates_per_1000_enrol\",",
                "    title=\"Top 15 States by Demographic Update Intensity (per 1,000 Enrolments)\",",
                "    labels={\"demo_updates_per_1000_enrol\": \"Demographic Updates per 1,000 Enrolments\"},",
                ")",
                "fig.update_layout(xaxis_tickangle=-45)",
                "fig.show()",
                "",
                "top_states_by_bio_intensity = state_panel.sort_values(",
                "    \"bio_updates_per_1000_enrol\", ascending=False",
                ").head(15)",
                "",
                "fig = px.bar(",
                "    top_states_by_bio_intensity,",
                "    x=\"state\",",
                "    y=\"bio_updates_per_1000_enrol\",",
                "    title=\"Top 15 States by Biometric Update Intensity (per 1,000 Enrolments)\",",
                "    labels={\"bio_updates_per_1000_enrol\": \"Biometric Updates per 1,000 Enrolments\"},",
                ")",
                "fig.update_layout(xaxis_tickangle=-45)",
                "fig.show()"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-d8676058",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 9.3 Age-Profile of Enrolments by State",
                "",
                "age_profile = enrol_state_total.melt(",
                "    id_vars=[\"state\"],",
                "    value_vars=[\"age_0_5\", \"age_5_17\", \"age_18_greater\"],",
                "    var_name=\"age_group\",",
                "    value_name=\"enrolments\",",
                ")",
                "",
                "fig = px.bar(",
                "    age_profile,",
                "    x=\"state\",",
                "        ",
                "    y=\"enrolments\",",
                "    color=\"age_group\",",
                "    title=\"Age-wise Aadhaar Enrolments by State\",",
                ")",
                "fig.update_layout(xaxis_tickangle=-60)",
                "fig.show()",
                "",
                "# Normalised shares for better cross-state comparison",
                "age_share_long = state_panel.melt(",
                "    id_vars=[\"state\"],",
                "    value_vars=[\"share_age_0_5\", \"share_age_5_17\", \"share_age_18_greater\"],",
                "    var_name=\"age_group_share\",",
                "    value_name=\"share\",",
                ")",
                "",
                "fig = px.bar(",
                "    age_share_long,",
                "    x=\"state\",",
                "    y=\"share\",",
                "    color=\"age_group_share\",",
                "    title=\"Age-share of Enrolments by State\",",
                ")",
                "fig.update_layout(xaxis_tickangle=-60)",
                "fig.show()"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-4b818f89",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 9.4 Daily State-Level Time-Series for Selected States (Example)",
                "",
                "focus_states = [",
                "    \"Uttar Pradesh\",",
                "    \"Bihar\",",
                "    \"Karnataka\",",
                "    \"Maharashtra\",",
                "]",
                "",
                "enrol_focus = enrol_state_daily[enrol_state_daily[\"state\"].isin(focus_states)]",
                "",
                "fig = px.line(",
                "    enrol_focus,",
                "    x=\"date\",",
                "    y=\"total_enrol\",",
                "    color=\"state\",",
                "    title=\"Daily Aadhaar Enrolments for Selected States\",",
                ")",
                "fig.show()"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-cc28bfa2",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 9.5 Demographic vs Biometric Update Mix by State",
                "",
                "update_mix = state_panel.copy()",
                "update_mix[\"demo_share_of_updates\"] = update_mix[\"total_demo_updates\"] / (",
                "    update_mix[\"total_demo_updates\"] + update_mix[\"total_bio_updates\"] + 1e-9",
                ")",
                "",
                "fig = px.scatter(",
                "    update_mix,",
                "    x=\"demo_updates_per_1000_enrol\",",
                "    y=\"bio_updates_per_1000_enrol\",",
                "    text=\"state\",",
                "    color=\"demo_share_of_updates\",",
                "    color_continuous_scale=\"Viridis\",",
                "    labels={",
                "        \"demo_updates_per_1000_enrol\": \"Demographic Updates per 1,000 Enrolments\",",
                "        \"bio_updates_per_1000_enrol\": \"Biometric Updates per 1,000 Enrolments\",",
                "    },",
                "    title=\"State-wise Mix and Intensity of Demographic vs Biometric Updates\",",
                ")",
                "fig.update_traces(textposition=\"top center\")",
                "fig.show()"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "#VSC-2b5060df",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "### Interpretation Notes for Visualisations (for the Report)",
                "",
                "In the PDF report, this section should translate the above plots into **clear textual insights**, for example:",
                "- Which states drive the highest **absolute enrolments**, and which have disproportionately high shares of **child enrolment** (0–5 and 5–17).",
                "- States with **high demographic update intensity** per enrolment could indicate:",
                "  - High population mobility (e.g., migration, urbanisation), or",
                "  - Greater uptake of mobile seeding and KYC processes.",
                "- States with **high biometric update intensity** might indicate:",
                "  - Younger populations (biometrics change more over time), or",
                "  - Operational issues in initial biometric capture quality.",
                "- Comparison of national time series can identify **periods of concentrated campaigns**, policy changes, or seasonal patterns (e.g., post-festival surges)."
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-24bbeddb",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 10. Anomaly Detection: Simple Z-score Based Spikes in Daily Activity",
                "",
                "def detect_anomalies(series: pd.Series, window: int = 7, z_threshold: float = 3.0) -> pd.Series:",
                "    \"\"\"Detect anomalies using rolling mean and standard deviation.",
                "",
                "    Returns a boolean Series indicating anomalous points.",
                "    \"\"\"",
                "    rolling_mean = series.rolling(window=window, min_periods=window).mean()",
                "    rolling_std = series.rolling(window=window, min_periods=window).std()",
                "    z_scores = (series - rolling_mean) / (rolling_std + 1e-9)",
                "    anomalies = z_scores.abs() > z_threshold",
                "    return anomalies",
                "",
                "# Example: National enrolment anomalies",
                "merged_nat_daily[\"enrol_anomaly\"] = detect_anomalies(",
                "    merged_nat_daily[\"total_enrol\"], window=7, z_threshold=2.5",
                ")",
                "",
                "anomalous_days = merged_nat_daily[merged_nat_daily[\"enrol_anomaly\"]]",
                "anomalous_days.head(10)"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-0064a237",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 10.1 Visualising Anomalies",
                "",
                "fig = px.line(",
                "    merged_nat_daily,",
                "    x=\"date\",",
                "    y=\"total_enrol\",",
                "    title=\"National Daily Enrolments with Anomaly Markers\",",
                ")",
                "",
                "fig.add_scatter(",
                "    x=anomalous_days[\"date\"],",
                "    y=anomalous_days[\"total_enrol\"],",
                "    mode=\"markers\",",
                "    marker=dict(color=\"red\", size=10),",
                "    name=\"Anomalies\",",
                ")",
                "",
                "fig.show()"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-cc118064",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 10.2 State-Level Anomaly Rates (Example)",
                "",
                "def state_anomaly_rate(enrol_state_daily_df: pd.DataFrame) -> pd.DataFrame:",
                "    records = []",
                "    for state_name, group in enrol_state_daily_df.groupby(\"state\"):",
                "        group_sorted = group.sort_values(\"date\")",
                "        anomalies = detect_anomalies(group_sorted[\"total_enrol\"], window=7, z_threshold=2.5)",
                "        anomaly_rate = anomalies.mean()",
                "        records.append({\"state\": state_name, \"enrol_anomaly_rate\": anomaly_rate})",
                "    return pd.DataFrame(records)",
                "",
                "state_anomaly_df = state_anomaly_rate(enrol_state_daily)",
                "state_anomaly_df.sort_values(\"enrol_anomaly_rate\", ascending=False).head(10)"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "#VSC-db7854de",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "### Interpretation Notes for Anomalies and Risk Indicators",
                "",
                "For the written report:",
                "- Highlight **dates with unusually high enrolment spikes** and cross-reference them with known campaigns, policy announcements, or deadlines where possible.",
                "- Identify states with **consistently high anomaly rates**, which may signal:",
                "  - Operational bottlenecks (e.g., bursty enrolment patterns due to limited centre availability), or",
                "  - Potential data-quality or misuse patterns that require deeper audit.",
                "- Propose an operational rule: e.g., \"If anomaly rate exceeds X% over the last Y days, trigger a targeted review of that state’s centres or operators.\""
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-8c9f4dc3",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 11. Lightweight Predictive Indicators (Forecasting Demand)",
                "",
                "from sklearn.linear_model import LinearRegression",
                "",
                "def add_time_index(df: pd.DataFrame) -> pd.DataFrame:",
                "    df_sorted = df.sort_values(\"date\").copy()",
                "    df_sorted[\"t\"] = (df_sorted[\"date\"] - df_sorted[\"date\"].min()).dt.days",
                "    return df_sorted",
                "",
                "# Example: Simple trend model for national enrolments",
                "enrol_ts = add_time_index(enrol_nat_daily)",
                "",
                "X = enrol_ts[[\"t\"]].values",
                "y = enrol_ts[\"total_enrol\"].values",
                "",
                "model = LinearRegression()",
                "model.fit(X, y)",
                "",
                "# Forecast for the next 14 days (as a simple illustrative predictive indicator)",
                "last_t = enrol_ts[\"t\"].max()",
                "future_t = np.arange(last_t + 1, last_t + 15)",
                "future_dates = enrol_ts[\"date\"].max() + pd.to_timedelta(future_t - last_t, unit=\"D\")",
                "future_pred = model.predict(future_t.reshape(-1, 1))",
                "",
                "forecast_df = pd.DataFrame({\"date\": future_dates, \"predicted_enrol\": future_pred})",
                "forecast_df.head()"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-501d36b3",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# 11.1 Visualising Trend and Forecast",
                "",
                "fig = px.line(",
                "    enrol_ts,",
                "    x=\"date\",",
                "    y=\"total_enrol\",",
                "    title=\"Observed vs Forecasted National Enrolments (Simple Trend Model)\",",
                ")",
                "",
                "fig.add_scatter(",
                "    x=forecast_df[\"date\"],",
                "    y=forecast_df[\"predicted_enrol\"],",
                "    mode=\"lines+markers\",",
                "    name=\"Forecast\",",
                "    line=dict(color=\"red\", dash=\"dash\"),",
                ")",
                "",
                "fig.show()"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "#VSC-b781029d",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "### Interpretation Notes for Predictive Indicators",
                "",
                "In the report, you can explain how even simple models can support **forward-looking planning**:",
                "- Use short-term forecasts of enrolment and update volumes to **allocate additional kits and staff** to high-demand states during expected peaks.",
                "- Extend the approach to state-wise or district-wise models where sufficient historical data is available.",
                "- Combine predicted demand with observed update-to-enrolment ratios to identify **regions where capacity for updates (especially biometric updates) needs to be strengthened.**"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "#VSC-0f5c5463",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "## 12. Synthesis: Key Insights and Solution Frameworks (Template for PDF)",
                "",
                "This final section is meant to be **directly used in the PDF report**. After running the notebook and inspecting the actual figures, you can fill in specific numbers and state names. Below is a suggested structure and example language you can adapt based on the outputs.",
                "",
                "### 12.1 Key Patterns and Trends",
                "",
                "1. **Enrolment Intensity and Age Structure**",
                "- Certain states (e.g., _State A, State B_) account for the largest share of total Aadhaar enrolments in the sample period.",
                "- States such as _State C_ exhibit a **higher share of child enrolments** (0–5 and 5–17), indicating either recent expansion of coverage among children or integration of Aadhaar enrolment with school/ICDS systems.",
                "",
                "2. **Update Behaviour and Lifecycle**",
                "- Demographic updates per 1,000 enrolments are highest in _State D_ and _State E_, consistent with higher migration and mobile number churn.",
                "- Biometric update intensity is particularly high in _State F_, potentially reflecting either a young population profile or initial capture quality issues.",
                "",
                "3. **Temporal Dynamics and Campaign Effects**",
                "- The national time series reveals **distinct peaks in enrolment and updates** around [dates/events], which likely correspond to targeted campaigns or scheme deadlines.",
                "- Outside these peaks, there is a relatively stable baseline, suggesting predictable workload for centres.",
                "",
                "4. **Anomalies and Risk Indicators**",
                "- Z-score based anomaly detection flags **unusually high spikes** in enrolment on [dates] in [states/districts].",
                "- Some states show **consistently elevated anomaly rates**, which may warrant further audit of centre-level behaviour and processes.",
                "",
                "### 12.2 Proposed Solution Frameworks",
                "",
                "1. **Inclusion and Outreach Targeting Framework**",
                "- Use **age-profile and enrolment intensity** indicators to select states/districts where child or adult coverage appears relatively low.",
                "- Launch targeted campaigns:",
                "  - **School-based drives** in regions with low 5–17 enrolment.",
                "  - **Women-focused drives** leveraging SHGs, Anganwadi centres, and health facilities if gender-disaggregated data is integrated later.",
                "",
                "2. **Capacity Planning and Resource Allocation Framework**",
                "- Use **state- and district-level demand forecasts** to adjust:",
                "  - Number of enrolment/update kits and staff per centre.",
                "  - Centre operating hours and appointment scheduling.",
                "- Prioritise extra capacity in locations with both high predicted demand and high update-to-enrolment ratios.",
                "",
                "3. **Risk-Based Supervision and Quality Assurance Framework**",
                "- Maintain an **anomaly score** for each state and (if data is available) each enrolment centre/operator.",
                "- For high-risk entities, apply:",
                "  - Additional checks on documentation or biometrics.",
                "  - Targeted refresher training on capture quality.",
                "  - Periodic audits.",
                "",
                "4. **Dashboard and Monitoring Framework**",
                "- Build a multi-level dashboard (national/state/district) with core KPIs:",
                "  - Total enrolments and age-wise shares.",
                "  - Demographic and biometric update intensities.",
                "  - Anomaly rates and recent flagged events.",
                "  - Short-term demand forecasts.",
                "- Integrate simple decision rules (e.g., thresholds for alerts) to make the analytics immediately actionable.",
                "",
                "---",
                "",
                "### 12.3 How to Use this Notebook for the Hackathon Submission",
                "",
                "- **Problem Statement and Approach:** Take content from Sections 1 and 3.",
                "- **Datasets Used:** Summarise Section 2 and mention all three datasets and key columns.",
                "- **Methodology:** Use the step-wise explanation from Sections 3, 7, 8, 10, and 11.",
                "- **Data Analysis and Visualisation:** Export crucial plots (Sections 9–11) as images and embed them into the PDF with concise interpretations.",
                "- Attach this notebook (or an HTML/PDF export of it) as the **code/technical appendix** to demonstrate reproducibility of your analysis."
            ]
        }
    ]
}

## 2. Datasets Used

The following datasets from the UIDAI hackathon bundle are used. Each is split into multiple CSV partitions; we will programmatically load and concatenate all parts for each dataset.

### 2.1 Aadhaar Enrolment Dataset

**Files (example names):**
- `api_data_aadhar_enrolment_0_500000.csv`
- `api_data_aadhar_enrolment_500000_1000000.csv`
- `api_data_aadhar_enrolment_1000000_1006029.csv`

**Columns:**
- `date` – Date of enrolment activity (string, format like `DD-MM-YYYY`).
- `state` – State/UT name.
- `district` – District name.
- `pincode` – 6-digit postal pincode.
- `age_0_5` – Number of enrolments for children aged 0–5 on that date and location.
- `age_5_17` – Number of enrolments for children/adolescents aged 5–17.
- `age_18_greater` – Number of enrolments for adults aged 18 and above.

Together, these columns provide **age-structured enrolment volume** per pincode per day.

### 2.2 Aadhaar Demographic Update Dataset

**Files (example names):**
- `api_data_aadhar_demographic_0_500000.csv`
- `api_data_aadhar_demographic_500000_1000000.csv`
- `api_data_aadhar_demographic_1000000_1500000.csv`
- `api_data_aadhar_demographic_1500000_2000000.csv`
- `api_data_aadhar_demographic_2000000_2071700.csv`

**Columns:**
- `date` – Date of demographic update activity.
- `state` – State/UT name.
- `district` – District name.
- `pincode` – Postal pincode.
- `demo_age_5_17` – Number of **demographic updates** for age group 5–17 (e.g., name, address, mobile number, etc.).
- `demo_age_17_` – Number of **demographic updates** for age 17 and above (column name truncated, but interpreted as 17+).

This dataset captures **non-biometric profile changes**, which are often proxies for migration, household mobility, and KYC-related updates.

### 2.3 Aadhaar Biometric Update Dataset

**Files (example names):**
- `api_data_aadhar_biometric_0_500000.csv`
- `api_data_aadhar_biometric_500000_1000000.csv`
- `api_data_aadhar_biometric_1000000_1500000.csv`
- `api_data_aadhar_biometric_1500000_1861108.csv`

**Columns:**
- `date` – Date of biometric update activity.
- `state` – State/UT name.
- `district` – District name.
- `pincode` – Postal pincode.
- `bio_age_5_17` – Number of **biometric updates** for age group 5–17.
- `bio_age_17_` – Number of **biometric updates** for age 17 and above.

Biometric updates can reflect both **quality refresh needs** (e.g., children whose biometrics change as they grow) and **operational issues** (poor initial capture).

## 3. Methodology Overview

The methodology is organised into distinct stages, making the analysis reproducible and extensible:

1. **Data Ingestion and Schema Harmonisation**
   - Read all CSV partitions for each dataset and vertically concatenate them.
   - Parse `date` into a proper datetime field.
   - Ensure consistent casing and trimming of `state` and `district`, and a consistent datatype for `pincode`.
   - Retain a common key: `(date, state, district, pincode)`.

2. **Data Cleaning and Quality Checks**
   - Check for missing, invalid, or duplicate keys.
   - Validate non-negativity and reasonableness of counts.
   - Identify potential data-entry anomalies (e.g., very large counts on a single day at a single pincode).

3. **Feature Engineering and Derived Metrics**
   - Aggregate data to multiple levels: pincode, district, state, and national.
   - Compute age-wise proportions and totals:
     - Enrolment profile: child vs adult share by state.
     - Update profile: ratio of demographic to biometric updates by age group.
   - Derive **intensity metrics**, e.g., updates per 1,000 enrolments (proxy, since individual-level link is absent).
   - Construct **time-series indicators**: 7-day rolling averages, seasonal patterns.

4. **Exploratory Data Analysis (EDA) and Visualisation**
   - Time series of daily enrolments and updates at national and state levels.
   - State-wise rankings and distributions of enrolment and update intensities.
   - Age-group patterns across states and over time.

5. **Anomaly Detection and Risk Indicators**
   - Spot unusually high spikes in enrolment or updates using z-score based methods on residuals.
   - Compute anomaly indicators at pincode/state level and examine their geographical and temporal clustering.

6. **Predictive / Forward-Looking Indicators (Lightweight Models)**
   - Use recent historical activity to build simple forecasting models of daily enrolment/update volumes (per state).
   - Use regression-based scoring to identify locations likely to see high update intensity relative to enrolments.

7. **Synthesis into Policy and Operational Insights**
   - Map analytical findings to operational levers: capacity planning, outreach targeting, risk-based supervision.
   - Illustrate how derived indicators can feed into dashboards and decision rules.
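
Stage 1 (schema harmonisation) can be sketched as follows. The notebook later calls a helper named `parse_common_fields`, but its body is not shown, so the version below is only one plausible implementation. It assumes dates arrive as `DD-MM-YYYY` strings, that title-casing is an acceptable normalisation for `state` and `district`, and that `pincode` is best kept as a string.

```python
import pandas as pd


def parse_common_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Harmonise the shared key columns (illustrative reconstruction, not the original body)."""
    out = df.copy()
    # Assumption: dates are DD-MM-YYYY strings; unparseable values become NaT.
    out["date"] = pd.to_datetime(out["date"], format="%d-%m-%Y", errors="coerce")
    # Trim whitespace and normalise casing so " bihar" and "Bihar" share one key.
    for column in ["state", "district"]:
        out[column] = out[column].astype("string").str.strip().str.title()
    # Keep pincodes as trimmed strings so they behave as identifiers, not numbers.
    out["pincode"] = out["pincode"].astype("string").str.strip()
    return out


# Tiny made-up frame to show the effect.
raw = pd.DataFrame(
    {
        "date": ["02-03-2025"],
        "state": [" uttar pradesh "],
        "district": ["Kanpur Nagar"],
        "pincode": [208001],
    }
)
parsed = parse_common_fields(raw)
print(parsed.dtypes)
```

Title-casing is a blunt tool (it turns "and" into "And", for instance), so a lookup table of canonical state names may be preferable if spelling variants turn up in the data.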

In [1]:
# 4. Environment Setup and Library Imports

import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

plt.style.use("seaborn-v0_8")
sns.set(rc={"figure.figsize": (12, 6)})

# Configure pandas display
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

# Base directory (adjust if running in a different folder)
BASE_DIR = Path(r"c:/Users/msi/Desktop/uidai")

enrol_dir = BASE_DIR / "api_data_aadhar_enrolment" / "api_data_aadhar_enrolment"
demo_dir = BASE_DIR / "api_data_aadhar_demographic" / "api_data_aadhar_demographic"
bio_dir = BASE_DIR / "api_data_aadhar_biometric" / "api_data_aadhar_biometric"

enrol_dir, demo_dir, bio_dir

(WindowsPath('c:/Users/msi/Desktop/uidai/api_data_aadhar_enrolment/api_data_aadhar_enrolment'),
 WindowsPath('c:/Users/msi/Desktop/uidai/api_data_aadhar_demographic/api_data_aadhar_demographic'),
 WindowsPath('c:/Users/msi/Desktop/uidai/api_data_aadhar_biometric/api_data_aadhar_biometric'))

In [None]:
# 5. Helper Functions for Loading and Cleaning

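def load_and_concat_csvs(directory: Path, prefix: str) -> pd.DataFrame:
    """Load all CSV files in a directory whose names start with a given prefix and concatenate them.

    Parameters
    ----------
    directory : Path
        Directory containing the CSV partitions.
    prefix : str
        File name prefix, e.g. "api_data_aadhar_enrolment".

    Returns
    -------
    pd.DataFrame
        All partitions stacked vertically with a fresh index.
    """
    # Body inferred from the docstring (assumption): glob the partitions, read each, stack them.
    csv_paths = sorted(directory.glob(f"{prefix}*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"No CSV files starting with '{prefix}' found in {directory}")
    frames = [pd.read_csv(path) for path in csv_paths]
    return pd.concat(frames, ignore_index=True)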

In [3]:
# 6. Load Enrolment, Demographic, and Biometric Datasets

enrol_raw = load_and_concat_csvs(enrol_dir, "api_data_aadhar_enrolment")
demo_raw = load_and_concat_csvs(demo_dir, "api_data_aadhar_demographic")
bio_raw = load_and_concat_csvs(bio_dir, "api_data_aadhar_biometric")

enrol = parse_common_fields(enrol_raw)
demo = parse_common_fields(demo_raw)
bio = parse_common_fields(bio_raw)

enrol.head(), demo.head(), bio.head()

(        date          state          district pincode  age_0_5  age_5_17  age_18_greater
 0 2025-03-02      Meghalaya  East Khasi Hills  793121       11        61              37
 1 2025-03-09      Karnataka   Bengaluru Urban  560043       14        33              39
 2 2025-03-09  Uttar Pradesh      Kanpur Nagar  208001       29        82              12
 3 2025-03-09  Uttar Pradesh           Aligarh  202133       62        29              15
 4 2025-03-09      Karnataka   Bengaluru Urban  560016       14        16              21,
         date           state    district pincode  demo_age_5_17  demo_age_17_
 0 2025-03-01   Uttar Pradesh   Gorakhpur  273213             49           529
 1 2025-03-01  Andhra Pradesh    Chittoor  517132             22           375
 2 2025-03-01         Gujarat      Rajkot  360006             65           765
 3 2025-03-01  Andhra Pradesh  Srikakulam  532484             24           314
 4 2025-03-01       Rajasthan     Udaipur  313801             45

## 7. Data Cleaning and Basic Quality Checks

We now perform basic validation and cleaning steps to ensure that the analyses rest on a consistent and robust base:
- Check for missing dates, states, districts, and pincodes.
- Verify that all counts are non-negative.
- Identify duplicated keys `(date, state, district, pincode)` per dataset.
- Summarise the overall time coverage and geographic coverage.

In [5]:
# 7.1 Missing Values Summary

def missing_summary(df: pd.DataFrame, name: str) -> pd.Series:
    summary = df.isna().mean().sort_values(ascending=False)
    print(f"\nMissing-value fraction for {name} (top 10):")
    print(summary.head(10))
    return summary

miss_enrol = missing_summary(enrol, "enrolment")
miss_demo = missing_summary(demo, "demographic updates")
miss_bio = missing_summary(bio, "biometric updates")

miss_enrol, miss_demo, miss_bio

In [6]:
# 7.2 Non-Negativity and Basic Distribution Checks for Count Columns

enrol_count_cols = ["age_0_5", "age_5_17", "age_18_greater"]
demo_count_cols = ["demo_age_5_17", "demo_age_17_"]
bio_count_cols = ["bio_age_5_17", "bio_age_17_"]

def check_non_negative(df: pd.DataFrame, columns: list, name: str) -> None:
    for column in columns:
        if column in df.columns:
            negative_count = (df[column] < 0).sum()
            print(f"{name} - {column}: {negative_count} rows with negative values")

check_non_negative(enrol, enrol_count_cols, "Enrolment")
check_non_negative(demo, demo_count_cols, "Demographic updates")
check_non_negative(bio, bio_count_cols, "Biometric updates")

Enrolment - age_0_5: 0 rows with negative values
Enrolment - age_5_17: 0 rows with negative values
Enrolment - age_18_greater: 0 rows with negative values
Demographic updates - demo_age_5_17: 0 rows with negative values
Demographic updates - demo_age_17_: 0 rows with negative values
Biometric updates - bio_age_5_17: 0 rows with negative values
Biometric updates - bio_age_17_: 0 rows with negative values


In [7]:
# 7.3 Duplication Checks

def key_duplication_rate(df: pd.DataFrame, name: str) -> None:
    key_columns = ["date", "state", "district", "pincode"]
    df_key = df[key_columns]
    duplicated_fraction = df_key.duplicated().mean()
    print(f"{name}: {duplicated_fraction:.4%} of key rows are duplicated")

key_duplication_rate(enrol, "Enrolment")
key_duplication_rate(demo, "Demographic updates")
key_duplication_rate(bio, "Biometric updates")

Enrolment: 2.2819% of key rows are duplicated
Demographic updates: 22.8605% of key rows are duplicated
Biometric updates: 5.0989% of key rows are duplicated
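
The duplication rates above raise the question of how repeated keys should be handled. The source does not say whether rows sharing a key are partial counts to be added together or exact repeats to be dropped, so the sketch below only illustrates the first interpretation. `collapse_duplicate_keys` and `KEY_COLUMNS` are hypothetical names, not part of the notebook.

```python
import pandas as pd

KEY_COLUMNS = ["date", "state", "district", "pincode"]


def collapse_duplicate_keys(df: pd.DataFrame, count_columns: list) -> pd.DataFrame:
    """Sum the count columns across rows that share the same key (assumes partial counts)."""
    return df.groupby(KEY_COLUMNS, as_index=False, dropna=False)[count_columns].sum()


# Tiny made-up example: the first two rows share a key.
example = pd.DataFrame(
    {
        "date": ["2025-03-01", "2025-03-01", "2025-03-02"],
        "state": ["Gujarat", "Gujarat", "Gujarat"],
        "district": ["Rajkot", "Rajkot", "Rajkot"],
        "pincode": ["360006", "360006", "360006"],
        "demo_age_5_17": [1, 2, 5],
        "demo_age_17_": [10, 20, 50],
    }
)
collapsed = collapse_duplicate_keys(example, ["demo_age_5_17", "demo_age_17_"])
print(collapsed)
```

Summing keeps every state-level and national total unchanged, because the later `groupby(...).sum()` aggregations add the same rows anyway. If the duplicates turn out to be exact repeats, `df.drop_duplicates()` would be the alternative, and it would lower those totals.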


In [None]:
# 7.4 Time Coverage and Geographic Coverage

def coverage_summary(df: pd.DataFrame, name: str) -> None:
    print(f"\n{name} coverage summary:")
    print("Date range:", df["date"].min(), "to", df["date"].max())
    print("# States:", df["state"].nunique())
    print("# Districts:", df["district"].nunique())
    print("# Pincodes:", df["pincode"].nunique())
    print("# Rows:", len(df))

coverage_summary(enrol, "Enrolment")
coverage_summary(demo, "Demographic updates")
coverage_summary(bio, "Biometric updates")


### Interpretation Notes (for the Report)

In the final PDF report, this section should summarise:
- Overall temporal span of the data and whether it covers continuous days or selective sampling.
- Breadth of geographic coverage and any states/districts with especially sparse representation.
- Any unexpected missingness or negative counts and how they were handled (e.g., filtered out, treated as zero).
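
To answer the first point (continuous days versus selective sampling), one option is to list the calendar days that never appear in a dataset. This is a minimal sketch, assuming `date` has already been parsed to datetime; `missing_dates` is an illustrative helper name.

```python
import pandas as pd


def missing_dates(df: pd.DataFrame, date_column: str = "date") -> pd.DatetimeIndex:
    """Return calendar days between the first and last observation that have no rows."""
    observed = pd.DatetimeIndex(df[date_column].dropna().unique())
    full_range = pd.date_range(observed.min(), observed.max(), freq="D")
    return full_range.difference(observed)


# Made-up example with one gap on 3 March.
example = pd.DataFrame(
    {"date": pd.to_datetime(["2025-03-01", "2025-03-02", "2025-03-04", "2025-03-04"])}
)
gaps = missing_dates(example)
print(f"{len(gaps)} missing day(s):", list(gaps.strftime("%Y-%m-%d")))
```

Applied to `enrol`, `demo`, and `bio` in turn, an empty result would indicate continuous daily coverage, while long or regular gaps would suggest selective sampling.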

In [9]:
# 8. Aggregate Metrics and Unified Panel

# 8.1 Aggregate Enrolments by Key Levels
enrol["total_enrol"] = enrol["age_0_5"] + enrol["age_5_17"] + enrol["age_18_greater"]

# Sum counts per (date, state); without .sum() this stays a GroupBy object
enrol_state_daily = enrol.groupby(["date", "state"], as_index=False)[
    ["age_0_5", "age_5_17", "age_18_greater", "total_enrol"]
].sum()

enrol_state_total = enrol_state_daily.groupby("state", as_index=False)[
    ["age_0_5", "age_5_17", "age_18_greater", "total_enrol"]
].sum()

# Age-share within each state
for column in ["age_0_5", "age_5_17", "age_18_greater"]:
    enrol_state_total[f"share_{column}"] = enrol_state_total[column] / enrol_state_total["total_enrol"]

enrol_state_total.head()

In [None]:
# 8.2 Aggregate Demographic and Biometric Updates by State and Date

demo["total_demo_updates"] = demo["demo_age_5_17"] + demo["demo_age_17_"]
bio["total_bio_updates"] = bio["bio_age_5_17"] + bio["bio_age_17_"]

demo_state_daily = demo.groupby(["date", "state"], as_index=False)[
    ["demo_age_5_17", "demo_age_17_", "total_demo_updates"]
].sum()
bio_state_daily = bio.groupby(["date", "state"], as_index=False)[
    ["bio_age_5_17", "bio_age_17_", "total_bio_updates"]
].sum()

demo_state_total = demo_state_daily.groupby("state", as_index=False).sum(numeric_only=True)
bio_state_total = bio_state_daily.groupby("state", as_index=False).sum(numeric_only=True)

demo_state_total.head(), bio_state_total.head()

In [None]:
# 8.3 Merge into a Unified State-Level Panel and Derive Intensity Ratios

state_panel = enrol_state_total.merge(
    demo_state_total[["state", "demo_age_5_17", "demo_age_17_", "total_demo_updates"]],
    on="state",
    how="left",
).merge(
    bio_state_total[["state", "bio_age_5_17", "bio_age_17_", "total_bio_updates"]],
    on="state",
    how="left",
)

# Replace missing update counts with zero where states appear only in enrolment
for column in [
    "demo_age_5_17",
    "demo_age_17_",
    "total_demo_updates",
    "bio_age_5_17",
    "bio_age_17_",
    "total_bio_updates",
]:
    if column in state_panel.columns:
        state_panel[column] = state_panel[column].fillna(0)

# Ratios: updates per enrolment (proxy intensity measures)
state_panel["demo_updates_per_1000_enrol"] = 1000 * state_panel["total_demo_updates"] / state_panel["total_enrol"]
state_panel["bio_updates_per_1000_enrol"] = 1000 * state_panel["total_bio_updates"] / state_panel["total_enrol"]

state_panel.sort_values("total_enrol", ascending=False).head(10)

## 9. Data Analysis and Visualisations

In this section, we generate key visualisations and interpret them to extract insights about:
- The **distribution and intensity** of Aadhaar enrolments across states and age groups.
- The **relative burden of demographic vs biometric updates**, indicating stability of records and operational quality.
- **Temporal trends and anomalies** in enrolment and updates.
- **Derived indicators** that can inform policy and operational decisions.

In [None]:
# 9.1 National-Level Daily Time Series of Enrolments and Updates

enrol_nat_daily = enrol.groupby("date", as_index=False)["total_enrol"].sum()
demo_nat_daily = demo.groupby("date", as_index=False)["total_demo_updates"].sum()
bio_nat_daily = bio.groupby("date", as_index=False)["total_bio_updates"].sum()

merged_nat_daily = enrol_nat_daily.merge(demo_nat_daily, on="date", how="outer").merge(
    bio_nat_daily, on="date", how="outer"
).fillna(0).sort_values("date")

fig = px.line(
    merged_nat_daily,
    x="date",
    y=["total_enrol", "total_demo_updates", "total_bio_updates"],
    labels={"value": "Count", "date": "Date", "variable": "Metric"},
    title="National Daily Aadhaar Enrolments vs Demographic and Biometric Updates",
)
fig.show()

In [None]:
# 9.2 Top States by Total Enrolments and Update Intensity

top_states_by_enrol = state_panel.sort_values("total_enrol", ascending=False).head(15)

fig = px.bar(
    top_states_by_enrol,
    x="state",
    y="total_enrol",
    title="Top 15 States by Total Aadhaar Enrolments",
    labels={"total_enrol": "Total Enrolments", "state": "State"},
)
fig.update_layout(xaxis_tickangle=-45)
fig.show()

top_states_by_demo_intensity = state_panel.sort_values(
    "demo_updates_per_1000_enrol", ascending=False
).head(15)

fig = px.bar(
    top_states_by_demo_intensity,
    x="state",
    y="demo_updates_per_1000_enrol",
    title="Top 15 States by Demographic Update Intensity (per 1,000 Enrolments)",
    labels={"demo_updates_per_1000_enrol": "Demographic Updates per 1,000 Enrolments"},
)
fig.update_layout(xaxis_tickangle=-45)
fig.show()

top_states_by_bio_intensity = state_panel.sort_values(
    "bio_updates_per_1000_enrol", ascending=False
).head(15)

fig = px.bar(
    top_states_by_bio_intensity,
    x="state",
    y="bio_updates_per_1000_enrol",
    title="Top 15 States by Biometric Update Intensity (per 1,000 Enrolments)",
    labels={"bio_updates_per_1000_enrol": "Biometric Updates per 1,000 Enrolments"},
)
fig.update_layout(xaxis_tickangle=-45)
fig.show()

In [None]:
# 9.3 Age-Profile of Enrolments by State

age_profile = enrol_state_total.melt(
    id_vars=["state"],
    value_vars=["age_0_5", "age_5_17", "age_18_greater"],
    var_name="age_group",
    value_name="enrolments",
)

fig = px.bar(
    age_profile,
    x="state",
    y="enrolments",
    color="age_group",
    title="Age-wise Aadhaar Enrolments by State",
)
fig.update_layout(xaxis_tickangle=-60)
fig.show()

# Normalised shares for better cross-state comparison
age_share_long = state_panel.melt(
    id_vars=["state"],
    value_vars=["share_age_0_5", "share_age_5_17", "share_age_18_greater"],
    var_name="age_group_share",
    value_name="share",
)

fig = px.bar(
    age_share_long,
    x="state",
    y="share",
    color="age_group_share",
    title="Age-share of Enrolments by State",
)
fig.update_layout(xaxis_tickangle=-60)
fig.show()

In [None]:
# 9.4 Daily State-Level Time-Series for Selected States (Example)

focus_states = [
    "Uttar Pradesh",
    "Bihar",
    "Karnataka",
    "Maharashtra",
]

enrol_focus = enrol_state_daily[enrol_state_daily["state"].isin(focus_states)]

fig = px.line(
    enrol_focus,
    x="date",
    y="total_enrol",
    color="state",
    title="Daily Aadhaar Enrolments for Selected States",
)
fig.show()

In [None]:
# 9.5 Demographic vs Biometric Update Mix by State

update_mix = state_panel.copy()
update_mix["demo_share_of_updates"] = update_mix["total_demo_updates"] / (
    update_mix["total_demo_updates"] + update_mix["total_bio_updates"] + 1e-9
)

fig = px.scatter(
    update_mix,
    x="demo_updates_per_1000_enrol",
    y="bio_updates_per_1000_enrol",
    text="state",
    color="demo_share_of_updates",
    color_continuous_scale="Viridis",
    labels={
        "demo_updates_per_1000_enrol": "Demographic Updates per 1,000 Enrolments",
        "bio_updates_per_1000_enrol": "Biometric Updates per 1,000 Enrolments",
    },
    title="State-wise Mix and Intensity of Demographic vs Biometric Updates",
)
fig.update_traces(textposition="top center")
fig.show()

### Interpretation Notes for Visualisations (for the Report)

In the PDF report, this section should translate the above plots into **clear textual insights**, for example:
- Which states drive the highest **absolute enrolments**, and which have disproportionately high shares of **child enrolment** (0–5 and 5–17).
- States with **high demographic update intensity** per enrolment could indicate:
  - High population mobility (e.g., migration, urbanisation), or
  - Greater uptake of mobile seeding and KYC processes.
- States with **high biometric update intensity** might indicate:
  - Younger populations (biometrics change more over time), or
  - Operational issues in initial biometric capture quality.
- Comparison of national time series can identify **periods of concentrated campaigns**, policy changes, or seasonal patterns (e.g., post-festival surges).
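
For the seasonal-pattern point, a day-of-week profile is one simple way to separate a recurring weekly rhythm from campaign-driven peaks. The sketch below assumes a daily frame with a datetime `date` column, such as `merged_nat_daily`; `weekday_profile` and `WEEKDAY_ORDER` are illustrative names, and the example data is made up.

```python
import pandas as pd

WEEKDAY_ORDER = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]


def weekday_profile(daily: pd.DataFrame, value_column: str, date_column: str = "date") -> pd.Series:
    """Mean of `value_column` for each day of the week, in calendar order."""
    weekday = daily[date_column].dt.day_name()
    return daily.groupby(weekday)[value_column].mean().reindex(WEEKDAY_ORDER)


# Two made-up weeks in which Mondays are busier than other days.
dates = pd.date_range("2025-03-03", periods=14, freq="D")
example = pd.DataFrame(
    {"date": dates, "total_enrol": [10 if d.dayofweek == 0 else 1 for d in dates]}
)
profile = weekday_profile(example, "total_enrol")
print(profile)
```

If the profile is strongly uneven, spikes that merely follow the weekly pattern should be read differently from spikes that break it.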

In [None]:
# 10. Anomaly Detection: Simple Z-score Based Spikes in Daily Activity

def detect_anomalies(series: pd.Series, window: int = 7, z_threshold: float = 2.5) -> pd.Series:
    """Detect anomalies using a trailing rolling mean and standard deviation.

    The baseline uses the `window` observations before each point. Including the
    point itself would cap |z| at (window - 1) / sqrt(window), about 2.27 for
    window=7, so no point could exceed a threshold of 2.5.

    Returns a boolean Series indicating anomalous points.
    """
    baseline = series.shift(1).rolling(window=window, min_periods=window)
    rolling_mean = baseline.mean()
    rolling_std = baseline.std()
    z_scores = (series - rolling_mean) / (rolling_std + 1e-9)
    anomalies = z_scores.abs() > z_threshold
    return anomalies

# Example: National enrolment anomalies
merged_nat_daily["enrol_anomaly"] = detect_anomalies(
    merged_nat_daily["total_enrol"], window=7, z_threshold=2.5
)

anomalous_days = merged_nat_daily[merged_nat_daily["enrol_anomaly"]]
anomalous_days.head(10)

In [None]:
# 10.1 Visualising Anomalies

fig = px.line(
    merged_nat_daily,
    x="date",
    y="total_enrol",
    title="National Daily Enrolments with Anomaly Markers",
)

fig.add_scatter(
    x=anomalous_days["date"],
    y=anomalous_days["total_enrol"],
    mode="markers",
    marker=dict(color="red", size=10),
    name="Anomalies",
)

fig.show()

In [None]:
# 10.2 State-Level Anomaly Rates (Example)

def state_anomaly_rate(enrol_state_daily_df: pd.DataFrame) -> pd.DataFrame:
    """Compute, for each state, the share of days flagged as anomalous."""
    records = []
    for state_name, group in enrol_state_daily_df.groupby("state"):
        # Sort chronologically so the rolling window runs over consecutive rows.
        group_sorted = group.sort_values("date")
        anomalies = detect_anomalies(group_sorted["total_enrol"], window=7, z_threshold=2.5)
        # Mean of a boolean Series is the fraction of True values.
        anomaly_rate = anomalies.mean()
        records.append({"state": state_name, "enrol_anomaly_rate": anomaly_rate})
    return pd.DataFrame(records)

state_anomaly_df = state_anomaly_rate(enrol_state_daily)
state_anomaly_df.sort_values("enrol_anomaly_rate", ascending=False).head(10)

### Interpretation Notes for Anomalies and Risk Indicators

For the written report:
- Highlight **dates with unusually high enrolment spikes** and cross-reference them with known campaigns, policy announcements, or deadlines where possible.
- Identify states with **consistently high anomaly rates**, which may signal:
  - Operational bottlenecks (e.g., bursty enrolment patterns due to limited centre availability), or
  - Potential data-quality or misuse patterns that require deeper audit.
- Propose an operational rule: e.g., "If anomaly rate exceeds X% over the last Y days, trigger a targeted review of that state’s centres or operators."
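
The operational rule in the last bullet could be prototyped along these lines. This is a minimal sketch, not a calibrated policy: the function name `states_needing_review`, the input column `is_anomaly`, and the default values for X (`rate_threshold`) and Y (`lookback_days`) are all assumptions chosen for illustration.

```python
import pandas as pd


def states_needing_review(
    flags: pd.DataFrame, lookback_days: int = 30, rate_threshold: float = 0.10
) -> pd.DataFrame:
    """Return states whose recent anomaly rate exceeds `rate_threshold`.

    `flags` is assumed to have columns `state`, `date` (datetime) and
    `is_anomaly` (bool), with one row per state per day.
    """
    # Keep only the most recent `lookback_days` days of data.
    cutoff = flags["date"].max() - pd.Timedelta(days=lookback_days)
    recent = flags[flags["date"] > cutoff]
    rates = (
        recent.groupby("state")["is_anomaly"]
        .mean()
        .rename("recent_anomaly_rate")
        .reset_index()
    )
    return rates[rates["recent_anomaly_rate"] > rate_threshold].reset_index(drop=True)


# Toy example: State A is flagged on 2 of the last 10 days, State B on none.
dates = pd.date_range("2025-01-01", periods=10, freq="D")
toy_flags = pd.DataFrame(
    {
        "state": ["State A"] * 10 + ["State B"] * 10,
        "date": list(dates) * 2,
        "is_anomaly": [False] * 8 + [True] * 2 + [False] * 10,
    }
)
review = states_needing_review(toy_flags, lookback_days=10, rate_threshold=0.10)
print(review)
```

In practice the threshold would need tuning against the false-alarm rate that reviewers can realistically handle.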

In [None]:
# 11. Lightweight Predictive Indicators (Forecasting Demand)

from sklearn.linear_model import LinearRegression

def add_time_index(df: pd.DataFrame) -> pd.DataFrame:
    """Add column `t`: days elapsed since the earliest date (expects a datetime `date` column)."""
    df_sorted = df.sort_values("date").copy()
    df_sorted["t"] = (df_sorted["date"] - df_sorted["date"].min()).dt.days
    return df_sorted

# Example: Simple trend model for national enrolments
enrol_ts = add_time_index(enrol_nat_daily)

X = enrol_ts[["t"]].values
y = enrol_ts["total_enrol"].values

model = LinearRegression()
model.fit(X, y)

# Forecast for the next 14 days (as a simple illustrative predictive indicator)
last_t = enrol_ts["t"].max()
future_t = np.arange(last_t + 1, last_t + 15)
future_dates = enrol_ts["date"].max() + pd.to_timedelta(future_t - last_t, unit="D")
future_pred = model.predict(future_t.reshape(-1, 1))

forecast_df = pd.DataFrame({"date": future_dates, "predicted_enrol": future_pred})
forecast_df.head()

In [None]:
# 11.1 Visualising Trend and Forecast

fig = px.line(
    enrol_ts,
    x="date",
    y="total_enrol",
    title="Observed vs Forecasted National Enrolments (Simple Trend Model)",
)

fig.add_scatter(
    x=forecast_df["date"],
    y=forecast_df["predicted_enrol"],
    mode="lines+markers",
    name="Forecast",
    line=dict(color="red", dash="dash"),
)

fig.show()

### Interpretation Notes for Predictive Indicators

In the report, you can explain how even simple models can support **forward-looking planning**:
- Use short-term forecasts of enrolment and update volumes to **allocate additional kits and staff** to high-demand states during expected peaks.
- Extend the approach to state-wise or district-wise models where sufficient historical data is available.
- Combine predicted demand with observed update-to-enrolment ratios to identify **regions where capacity for updates (especially biometric updates) needs to be strengthened.**
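
One way to extend the trend model to the state level, as the second bullet suggests, is sketched below. It fits a separate straight line per state with `numpy.polyfit` rather than the `LinearRegression` model used above, skips states with too little history, and floors negative predictions at zero. The function name `forecast_by_state`, the `min_history` cut-off, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def forecast_by_state(
    df: pd.DataFrame, horizon: int = 14, min_history: int = 30
) -> pd.DataFrame:
    """Fit a linear trend per state and forecast `horizon` days ahead.

    `df` is assumed to have columns `state`, `date` (datetime) and `total_enrol`.
    """
    frames = []
    for state_name, group in df.groupby("state"):
        group = group.sort_values("date")
        if len(group) < min_history:
            continue  # too little data for a meaningful trend
        t = (group["date"] - group["date"].min()).dt.days.to_numpy()
        slope, intercept = np.polyfit(t, group["total_enrol"].to_numpy(), deg=1)
        future_t = np.arange(t.max() + 1, t.max() + horizon + 1)
        frames.append(
            pd.DataFrame(
                {
                    "state": state_name,
                    "date": group["date"].max()
                    + pd.to_timedelta(future_t - t.max(), unit="D"),
                    # Counts cannot be negative, so floor the extrapolation at zero.
                    "predicted_enrol": np.clip(slope * future_t + intercept, 0, None),
                }
            )
        )
    if not frames:
        return pd.DataFrame(columns=["state", "date", "predicted_enrol"])
    return pd.concat(frames, ignore_index=True)


# Toy data: State A has 40 days on an exact line, State B only 5 days.
dates_a = pd.date_range("2025-01-01", periods=40, freq="D")
dates_b = pd.date_range("2025-01-01", periods=5, freq="D")
toy = pd.DataFrame(
    {
        "state": ["State A"] * 40 + ["State B"] * 5,
        "date": list(dates_a) + list(dates_b),
        "total_enrol": [100 + 2 * i for i in range(40)] + [50] * 5,
    }
)
state_forecast = forecast_by_state(toy, horizon=14, min_history=30)
print(state_forecast.head())
```

A straight-line trend ignores weekly seasonality and campaign effects, so these forecasts are best read as a coarse planning signal rather than a precise estimate.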

## 12. Synthesis: Key Insights and Solution Frameworks (Template for PDF)

This final section is meant to be **directly used in the PDF report**. After running the notebook and inspecting the actual figures, you can fill in specific numbers and state names. Below is a suggested structure and example language you can adapt based on the outputs.

### 12.1 Key Patterns and Trends

1. **Enrolment Intensity and Age Structure**
- Certain states (e.g., _State A, State B_) account for the largest share of total Aadhaar enrolments in the sample period.
- States such as _State C_ exhibit a **higher share of child enrolments** (0–5 and 5–17), indicating either recent expansion of coverage among children or integration of Aadhaar enrolment with school/ICDS systems.

2. **Update Behaviour and Lifecycle**
- Demographic updates per 1,000 enrolments are highest in _State D_ and _State E_, which would be consistent with higher migration and mobile number churn, although this data alone cannot confirm the cause.
- Biometric update intensity is particularly high in _State F_, potentially reflecting either a young population profile or initial capture quality issues.

3. **Temporal Dynamics and Campaign Effects**
- The national time series reveals **distinct peaks in enrolment and updates** around [dates/events], which likely correspond to targeted campaigns or scheme deadlines.
- Outside these peaks, there is a relatively stable baseline, suggesting predictable workload for centres.

4. **Anomalies and Risk Indicators**
- Z-score based anomaly detection flags **unusually large deviations** (spikes as well as sudden drops) in enrolment on [dates] in [states/districts].
- Some states show **consistently elevated anomaly rates**, which may warrant further audit of centre-level behaviour and processes.

### 12.2 Proposed Solution Frameworks

1. **Inclusion and Outreach Targeting Framework**
- Use **age-profile and enrolment intensity** indicators to select states/districts where child or adult coverage appears relatively low.
- Launch targeted campaigns:
  - **School-based drives** in regions with low 5–17 enrolment.
  - **Women-focused drives** leveraging SHGs, Anganwadi centres, and health facilities if gender-disaggregated data is integrated later.

2. **Capacity Planning and Resource Allocation Framework**
- Use **state- and district-level demand forecasts** to adjust:
  - Number of enrolment/update kits and staff per centre.
  - Centre operating hours and appointment scheduling.
- Prioritise extra capacity in locations with both high predicted demand and high update-to-enrolment ratios.
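
The prioritisation in the last bullet can be sketched as a simple rank-based score. Everything here is illustrative: the column names `predicted_enrol` and `updates_per_1000_enrol`, the equal weighting of the two percentile ranks, and the toy values are assumptions, not outputs of this notebook.

```python
import pandas as pd

# Toy district-level indicators; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "district": ["D1", "D2", "D3", "D4"],
        "predicted_enrol": [5000, 1200, 4800, 900],
        "updates_per_1000_enrol": [350, 400, 120, 90],
    }
)

# Percentile ranks put both indicators on a common 0-1 scale.
toy["demand_rank"] = toy["predicted_enrol"].rank(pct=True)
toy["update_rank"] = toy["updates_per_1000_enrol"].rank(pct=True)

# Equal-weight average of the two ranks.
toy["priority_score"] = (toy["demand_rank"] + toy["update_rank"]) / 2

priority = toy.sort_values("priority_score", ascending=False).reset_index(drop=True)
print(priority[["district", "priority_score"]])
```

Ranks are used instead of raw values so that a single very large district does not dominate the score; different weights could be substituted if one indicator matters more operationally.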

3. **Risk-Based Supervision and Quality Assurance Framework**
- Maintain an **anomaly score** for each state and (if data is available) each enrolment centre/operator.
- For high-risk entities, apply:
  - Additional checks on documentation or biometrics.
  - Targeted refresher training on capture quality.
  - Periodic audits.

4. **Dashboard and Monitoring Framework**
- Build a multi-level dashboard (national/state/district) with core KPIs:
  - Total enrolments and age-wise shares.
  - Demographic and biometric update intensities.
  - Anomaly rates and recent flagged events.
  - Short-term demand forecasts.
- Integrate simple decision rules (e.g., thresholds for alerts) to make the analytics immediately actionable.
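
A minimal sketch of such decision rules, assuming a KPI table with one row per state. The threshold values in `ALERT_THRESHOLDS`, the helper `build_alerts`, and the toy data are placeholders to be replaced once real figures are available.

```python
import pandas as pd

# Hypothetical thresholds: an alert fires when the KPI exceeds the value.
ALERT_THRESHOLDS = {
    "enrol_anomaly_rate": 0.10,
    "bio_updates_per_1000_enrol": 500.0,
}


def build_alerts(kpis: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Return one row per (state, KPI) pair that breaches its threshold."""
    # Reshape to long form so every KPI can be compared with its own threshold.
    tidy = kpis.melt(id_vars="state", value_vars=list(thresholds), var_name="kpi")
    tidy["threshold"] = tidy["kpi"].map(thresholds)
    return tidy[tidy["value"] > tidy["threshold"]].reset_index(drop=True)


# Toy KPI table; the values are made up for illustration only.
toy_kpis = pd.DataFrame(
    {
        "state": ["State A", "State B"],
        "enrol_anomaly_rate": [0.15, 0.02],
        "bio_updates_per_1000_enrol": [300.0, 650.0],
    }
)
alerts = build_alerts(toy_kpis, ALERT_THRESHOLDS)
print(alerts)
```

Keeping the thresholds in a single dictionary makes it easy to review and adjust the rules without touching the alerting logic.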

---

### 12.3 How to Use this Notebook for the Hackathon Submission

- **Problem Statement and Approach:** Take content from Sections 1 and 3.
- **Datasets Used:** Summarise Section 2 and mention all three datasets and key columns.
- **Methodology:** Use the step-wise explanation from Sections 3, 7, 8, 10, and 11.
- **Data Analysis and Visualisation:** Export crucial plots (Sections 9–11) as images and embed them into the PDF with concise interpretations.
- Attach this notebook (or an HTML/PDF export of it) as the **code/technical appendix** to demonstrate reproducibility of your analysis.