# Exploratory Analysis of Aadhaar Enrolment Data

## Objective
This notebook performs an exploratory analysis of Aadhaar enrolment data released by the
Unique Identification Authority of India (UIDAI). The goal is to identify national-level and
state-level enrolment trends that can support evidence-based administrative and policy
decision-making.

---

## Scope of Analysis
The analysis covers the following components:

- Data loading directly from UIDAI-provided ZIP files
- Data cleaning and standardization
- Feature engineering for temporal and enrolment metrics
- Univariate analysis of national enrolment growth over time
- Bivariate analysis comparing enrolment volumes across states
- Visual representation of trends and disparities
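
The first two components (loading from ZIP files and cleaning) can be sketched as follows. This is a minimal illustration, not the project's actual `src/` code: the helper names `load_zip_csvs` and `standardize_and_dedupe`, the assumption that the archive contains CSV files, and the toy column names are all hypothetical.

```python
import io
import zipfile

import pandas as pd


def load_zip_csvs(zip_source) -> pd.DataFrame:
    """Read every CSV member of a ZIP archive and stack them into one DataFrame."""
    frames = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                with archive.open(name) as handle:
                    frames.append(pd.read_csv(handle))
    if not frames:
        raise ValueError("ZIP archive contains no CSV files")
    return pd.concat(frames, ignore_index=True)


def standardize_and_dedupe(df: pd.DataFrame) -> pd.DataFrame:
    """Convert column names to snake_case, then drop exact duplicate rows."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )
    return out.drop_duplicates().reset_index(drop=True)


# Tiny in-memory archive with made-up rows, so the sketch runs without the
# real UIDAI file. Column names here are illustrative assumptions.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "part1.csv",
        "Date,State Name,Age 0-5\n2025-03-01,Kerala,10\n2025-03-01,Kerala,10\n",
    )
    archive.writestr("part2.csv", "Date,State Name,Age 0-5\n2025-03-02,Goa,4\n")
buffer.seek(0)

demo_df = standardize_and_dedupe(load_zip_csvs(buffer))
print(demo_df)
```

Reading members straight from the archive avoids extracting files to disk, and standardizing names before deduplication keeps later column lookups (such as `date` or `state`) predictable.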

---

## Methodology Overview

1. **Environment Setup**
   - Python path is configured to allow importing modular project code (`src/` directory)

2. **Data Ingestion**
   - Aadhaar enrolment data is loaded directly from compressed ZIP files

3. **Data Cleaning**
   - Column names are standardized
   - Duplicate records are removed

4. **Feature Engineering**
   - Year and month are extracted from date fields
   - Total enrolments are computed from age-group columns

5. **Analytical Techniques**
   - Year-wise aggregation to observe national trends
   - State-wise aggregation to identify regional disparities

6. **Visualisation**
   - Line chart for national enrolment growth
   - Bar chart for top 10 states by enrolment volume
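
Steps 4 and 5 above can be sketched on a toy DataFrame. The functions below are hypothetical stand-ins for the helpers in `src/` (whose real signatures are not shown in this notebook), and the age-group column names are assumptions made for illustration only.

```python
import pandas as pd


def add_year_month(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Parse the date column and derive integer year and month columns."""
    out = df.copy()
    parsed = pd.to_datetime(out[date_col], errors="coerce")  # bad dates become NaT
    out["year"] = parsed.dt.year
    out["month"] = parsed.dt.month
    return out


def add_total(df: pd.DataFrame, age_cols: list) -> pd.DataFrame:
    """Sum the given age-group columns row-wise into total_enrolments."""
    out = df.copy()
    out["total_enrolments"] = out[age_cols].sum(axis=1)
    return out


def aggregate_by(df: pd.DataFrame, key: str, value: str = "total_enrolments") -> pd.DataFrame:
    """Total the value column for each distinct key (for example year or state)."""
    return df.groupby(key, as_index=False)[value].sum()


# Made-up rows; the age-group column names are assumed, not taken from UIDAI data.
demo = pd.DataFrame(
    {
        "date": ["2024-01-15", "2024-06-20", "2025-02-05"],
        "state": ["Kerala", "Goa", "Kerala"],
        "age_0_5": [10, 4, 6],
        "age_5_17": [20, 1, 3],
        "age_18_greater": [5, 0, 1],
    }
)
demo = add_year_month(demo, "date")
demo = add_total(demo, ["age_0_5", "age_5_17", "age_18_greater"])

yearly_demo = aggregate_by(demo, "year")
statewise_demo = aggregate_by(demo, "state")
print(yearly_demo)
print(statewise_demo)
```

Passing the age-group columns explicitly, rather than summing every numeric column, keeps derived fields such as `year` and `month` out of the total.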

---

## Expected Outputs

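Running the analysis cell below generates the following outputs, provided the dataset
contains the date and state columns the cell looks for (otherwise the corresponding
step is skipped with a warning):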

- A preview of the cleaned enrolment dataset
- A table showing year-wise total enrolments
- A line chart depicting national enrolment growth
- A table of the top 10 states by enrolment
- A bar chart comparing enrolment volumes across these states

These outputs are directly referenced in the final project report (PDF).

---

## Reproducibility Note
All computations in this notebook call reusable, modular Python code stored in the
`src/` directory. Keeping the logic there means results can be regenerated from the raw
data and each processing step can be audited in one place.


In [None]:
# ============================================================
# FULL EXPLORATORY ANALYSIS (SINGLE CELL)
# ============================================================

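# ---------- 1. FIX PROJECT PATH ----------
# Assumes the notebook runs from a subdirectory of the project root,
# so that ".." is the directory containing src/.
import sys
import os
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)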

# ---------- 2. CORE IMPORTS ----------
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# apply tight layout automatically so titles and tick labels are not clipped
plt.rcParams["figure.autolayout"] = True
sns.set(style="whitegrid")

# ---------- 3. SAFE DISPLAY IMPORT ----------
try:
    from IPython.display import display
except ImportError:
    def display(x): 
        print(x)

# ---------- 4. IMPORT PROJECT MODULES ----------
from src.data_loader import load_uidai_zip
from src.data_cleaning import clean_dataframe
from src.feature_engineering import add_time_features, add_total_enrolments
from src.analysis import yearly_aggregation, statewise_aggregation

# ---------- 5. LOAD DATA ----------
print("Loading Aadhaar enrolment data...")

enrol_df = load_uidai_zip("../data/raw/api_data_aadhar_enrolment.zip")
enrol_df = clean_dataframe(enrol_df)

# ---------- 6. HANDLE DATE COLUMN SAFELY ----------
possible_date_cols = ["date", "enrolment_date", "created_date"]
date_col = next((c for c in possible_date_cols if c in enrol_df.columns), None)

if date_col:
    enrol_df = add_time_features(enrol_df, date_col)
else:
    print("⚠️ No date column found — yearly trend may be skipped")

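# ---------- 7. HANDLE TOTAL ENROLMENTS SAFELY ----------
enrol_df = add_total_enrolments(enrol_df)

if "total_enrolments" not in enrol_df.columns:
    # Fallback: sum numeric columns, excluding the derived time features so
    # that year and month values are not counted as enrolments.
    # "pincode" is an assumed identifier column; adjust to the actual schema.
    non_count_cols = {"year", "month", "pincode"}
    numeric_cols = [
        c for c in enrol_df.select_dtypes("number").columns
        if c not in non_count_cols
    ]
    if numeric_cols:
        enrol_df["total_enrolments"] = enrol_df[numeric_cols].sum(axis=1)
        print(f"⚠️ total_enrolments auto-derived from: {numeric_cols}")
    else:
        raise RuntimeError("No numeric data available for analysis")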

print("Data loaded successfully")
display(enrol_df.head())

# ---------- 8. YEARLY ANALYSIS (SAFE) ----------
if "year" in enrol_df.columns:
    yearly = yearly_aggregation(enrol_df, "total_enrolments")
else:
    yearly = None

if yearly is not None and not yearly.empty:
    print("\nYear-wise enrolment totals:")
    display(yearly)

    plt.figure(figsize=(10,5))
    plt.plot(yearly["year"], yearly["total_enrolments"], marker="o")
    plt.title("Year-wise Aadhaar Enrolment Growth")
    plt.xlabel("Year")
    plt.ylabel("Total Enrolments")
    plt.show()
else:
    print("⚠️ Year-wise analysis skipped (year column missing or aggregation empty)")

# ---------- 9. STATE-WISE ANALYSIS (SAFE) ----------
possible_state_cols = ["state", "state_name", "resident_state"]
state_col = next((c for c in possible_state_cols if c in enrol_df.columns), None)

if state_col:
    enrol_df["state"] = enrol_df[state_col]
    statewise = statewise_aggregation(enrol_df, "total_enrolments")
else:
    statewise = None

if statewise is not None and not statewise.empty:
    statewise_top10 = (
        statewise
        .sort_values("total_enrolments", ascending=False)
        .head(10)
    )

    print("\nTop 10 states by enrolment:")
    display(statewise_top10)

    plt.figure(figsize=(10,5))
    sns.barplot(
        data=statewise_top10,
        x="state",
        y="total_enrolments"
    )
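    plt.title("Top 10 States by Aadhaar Enrolment")
    plt.xlabel("State")
    plt.ylabel("Total Enrolments")
    # right-align rotated labels so each sits under its own bar
    plt.xticks(rotation=45, ha="right")
    plt.show()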
else:
    print("⚠️ State-wise analysis skipped (state column missing or aggregation empty)")

print("\n✅ Exploratory analysis completed successfully.")
