# Notebook 01: Data Understanding  
## UIDAI Aadhaar Enrolment and Update Datasets

### Objective
This notebook documents the initial structural understanding of the Aadhaar
enrolment, demographic update, and biometric update datasets provided by UIDAI.

The goal is to:
- Understand dataset structure and columns
- Identify time and geographic dimensions
- Assess suitability for aggregated societal signal analysis
- Clearly state data limitations

No analytical conclusions are drawn in this notebook.

In [6]:
import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 50)


In [7]:
from pathlib import Path

# Raw data directory (absolute path on the author's machine; adjust for your environment)
DATA_DIR = Path("/Users/dhruvgourisaria/uidai-aadhaar-enrolment-update-analysis/data/raw")

enrolment_df = pd.read_csv(DATA_DIR / "aadhaar_enrolment_master.csv")
demographic_df = pd.read_csv(DATA_DIR / "aadhaar_demographic_master.csv")
biometric_df = pd.read_csv(DATA_DIR / "aadhaar_biometric_master.csv")

enrolment_df.shape, demographic_df.shape, biometric_df.shape

((1006029, 7), (2071700, 6), (1861108, 6))

At a high level, the datasets represent aggregated counts of Aadhaar-related
activities across geography and time.

- Enrolment data captures new Aadhaar registrations
- Demographic update data captures updates to resident information
- Biometric update data captures biometric re-capture activity

All datasets are aggregated and do not contain individual-level identifiers.
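
A quick structural summary (dtype, missing values, distinct values per column) can support this kind of first-pass inspection. The sketch below is illustrative only: `summarise_structure` is a hypothetical helper, and the toy frame uses made-up values with a placeholder `count` column rather than the real UIDAI data.

```python
import pandas as pd


def summarise_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and distinct count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),  # NaN is excluded from the distinct count
        }
    )


# Toy frame with made-up values; "count" is a placeholder column name.
toy_df = pd.DataFrame(
    {
        "date": ["01-03-2025", "01-03-2025", "02-03-2025"],
        "state": ["State A", "State A", "State B"],
        "district": ["District X", "District X", "District Y"],
        "pincode": [100001, 100001, 200002],
        "count": [4, None, 7],
    }
)

structure = summarise_structure(toy_df)
print(structure)
```

The same helper could be called on `enrolment_df`, `demographic_df`, and `biometric_df` to compare the three files side by side.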

In [8]:
enrolment_df.columns

Index(['date', 'state', 'district', 'pincode', 'age_0_5', 'age_5_17',
       'age_18_greater'],
      dtype='object')

### Aadhaar Enrolment Dataset

This dataset records counts of new Aadhaar enrolments.

Fields in this dataset:
- `date`: time field
- `state`, `district`, `pincode`: geographic identifiers
- `age_0_5`, `age_5_17`, `age_18_greater`: enrolment counts by age band

Policy relevance:
- Indicates coverage expansion
- Helps identify regions with continuing enrolment demand
- Useful as contextual information for inclusion analysis
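
Because the enrolment columns are split by age band, a single enrolment figure per row has to be derived. A minimal sketch, assuming each age column holds a count and that the bands do not overlap (neither is confirmed here); `total_enrolment` is a hypothetical column name and the values are made up:

```python
import pandas as pd

# Age-band columns as they appear in enrolment_df.columns.
AGE_COLS = ["age_0_5", "age_5_17", "age_18_greater"]

# Toy rows with made-up counts, standing in for enrolment_df.
toy_enrolment = pd.DataFrame(
    {
        "district": ["District X", "District Y"],
        "age_0_5": [3, 0],
        "age_5_17": [2, 5],
        "age_18_greater": [1, 4],
    }
)

# Assumption: the row total is the sum of the three age bands.
toy_enrolment["total_enrolment"] = toy_enrolment[AGE_COLS].sum(axis=1)
print(toy_enrolment)
```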

In [9]:
demographic_df.columns

Index(['date', 'state', 'district', 'pincode', 'demo_age_5_17',
       'demo_age_17_'],
      dtype='object')

### Aadhaar Demographic Update Dataset

This dataset captures aggregated counts of demographic update requests.

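Update categories may include:
- Name updates
- Address updates
- Mobile number updates

The columns in this file (`demo_age_5_17`, `demo_age_17_`) break counts down by age band only, not by update category.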

Policy relevance:
- Different update types may reflect different underlying processes
- Some updates are socially expected life events
- Others may reflect mobility or access-related changes

At this stage, no assumptions are made about causality.

In [10]:
biometric_df.columns

Index(['date', 'state', 'district', 'pincode', 'bio_age_5_17', 'bio_age_17_'], dtype='object')

### Aadhaar Biometric Update Dataset

This dataset records aggregated biometric update activity, such as
fingerprint or iris re-capture.

Policy relevance:
- Indicates system maintenance demand
- Reflects service load at enrolment/update centres
- Useful for operational planning at aggregate level
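
The three column listings above can be compared directly to see which fields the datasets share. The sketch below copies the column names from the outputs shown earlier; treating the shared columns as candidate keys for later alignment is an assumption, not something verified in this notebook.

```python
# Column names copied from the .columns outputs above.
enrolment_cols = ["date", "state", "district", "pincode",
                  "age_0_5", "age_5_17", "age_18_greater"]
demographic_cols = ["date", "state", "district", "pincode",
                    "demo_age_5_17", "demo_age_17_"]
biometric_cols = ["date", "state", "district", "pincode",
                  "bio_age_5_17", "bio_age_17_"]

# Columns present in all three datasets, in enrolment column order.
shared_cols = [
    col for col in enrolment_cols
    if col in demographic_cols and col in biometric_cols
]

# Columns that appear in only one dataset (the measure columns).
enrolment_only = [col for col in enrolment_cols if col not in shared_cols]

print(shared_cols)
print(enrolment_only)
```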

### Time Dimensions

Across all three datasets, time is represented by a single `date` column.

These fields allow:
- Temporal trend analysis
- Seasonal pattern identification
- Comparison of activity across periods

All future analysis will use aggregated time units only.
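
As an illustration of moving from the `date` column to an aggregated time unit, the sketch below derives a monthly period and sums a count column. The day-month-year string format is an assumption (the raw format is not shown in this notebook), `month` is a hypothetical column name, and the values are made up.

```python
import pandas as pd

# Toy rows with made-up values; the date format is an assumption.
toy_dates = pd.DataFrame(
    {
        "date": ["01-03-2025", "15-03-2025", "02-04-2025"],
        "age_0_5": [3, 4, 5],
    }
)

# Parse the strings; unparseable values become NaT instead of raising.
toy_dates["date"] = pd.to_datetime(
    toy_dates["date"], format="%d-%m-%Y", errors="coerce"
)

# Collapse each date to its calendar month.
toy_dates["month"] = toy_dates["date"].dt.to_period("M")

# Sum counts within each month.
monthly = toy_dates.groupby("month", as_index=False)["age_0_5"].sum()
print(monthly)
```

If the real files use a different date format, only the `format` argument would need to change.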

### Geographic Dimensions

Each dataset carries three geographic fields:
- `state`
- `district`
- `pincode`

Analysis focuses primarily on the state and district levels.

District-level aggregation enables:
- Sub-state policy analysis
- Regional heterogeneity assessment
- Localised service planning
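
Since the raw rows also carry a `pincode` column, reaching district level means summing over the pincodes within each district. A minimal sketch on made-up values, assuming `state` and `district` names are already consistent across rows (which the real data may not guarantee before cleaning):

```python
import pandas as pd

# Toy pincode-level rows with made-up values.
toy_geo = pd.DataFrame(
    {
        "state": ["State A", "State A", "State A", "State B"],
        "district": ["District X", "District X", "District Y", "District Z"],
        "pincode": [100001, 100002, 100101, 200001],
        "bio_age_5_17": [2, 3, 4, 1],
        "bio_age_17_": [10, 5, 6, 8],
    }
)

# Group on state and district together, since district names
# are not guaranteed to be unique across states.
district_level = (
    toy_geo
    .groupby(["state", "district"], as_index=False)[["bio_age_5_17", "bio_age_17_"]]
    .sum()
)
print(district_level)
```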

### Data Limitations

Key limitations of the datasets include:
- Data is aggregated and not individual-level
- Age is available only as coarse bands; other demographic attributes such as gender are not available
- Update counts do not indicate reasons for updates
- The data reflects administrative transactions, not direct social outcomes

As a result, all analysis must remain cautious and non-inferential at the individual level.

This notebook establishes a structural understanding of the UIDAI datasets.

The next step focuses on:
- Cleaning and standardising data
- Aggregating to district × month level
- Preparing datasets for analytical use

These steps are covered in Notebook 02.