# Aadhaar Enrolment Analysis

## Importing Libraries

In [1]:
import pandas as pd 
import warnings


## Loading Dataset

In [2]:
file_1 = pd.read_csv('api_data_aadhar_enrolment_0_500000.csv')
file_2 = pd.read_csv('api_data_aadhar_enrolment_500000_1000000.csv')
file_3 = pd.read_csv('api_data_aadhar_enrolment_1000000_1006029.csv')

In [3]:

print(file_1.columns.equals(file_2.columns))
print(file_2.columns.equals(file_3.columns))


True
True


In [4]:
# identify column names present in only one of the two files (symmetric difference)
set(file_1.columns) ^ set(file_2.columns)


set()

In [14]:
# concatenate the three CSV partitions row-wise and rebuild the index
aadhar_raw = pd.concat(
    [file_1, file_2, file_3],
    axis=0,
    ignore_index=True
)


In [15]:
aadhar_raw.shape

(1006029, 7)

### Note:
The Aadhaar enrolment dataset was provided as three CSV partitions split by row range, due to size constraints. After confirming that all three files share identical columns, they were concatenated vertically with a fresh index to reconstruct the complete dataset of 1,006,029 rows and 7 columns for analysis.
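The schema check and concatenation described in this note can be combined into a single guard. The following is a minimal sketch, not the notebook's own code: the helper name `concat_partitions` and the tiny in-memory frames are hypothetical stand-ins for the CSV partitions.

```python
import pandas as pd

def concat_partitions(frames):
    """Concatenate partitions after checking they all match the first frame's columns.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    reference = frames[0].columns
    for i, frame in enumerate(frames[1:], start=1):
        # Index.equals is True only when names and their order both match
        if not frame.columns.equals(reference):
            # names that appear in only one of the two partitions
            only_in_one = sorted(set(reference) ^ set(frame.columns))
            raise ValueError(
                f"partition {i} differs from partition 0; "
                f"names found in only one of them: {only_in_one}"
            )
    # stack rows and replace the per-file indexes with one continuous index
    return pd.concat(frames, axis=0, ignore_index=True)

# Tiny stand-ins for the real partitions (values are made up)
part_a = pd.DataFrame({"state": ["Punjab"], "pincode": [144041]})
part_b = pd.DataFrame({"state": ["Punjab"], "pincode": [144101]})
combined = concat_partitions([part_a, part_b])
```

Failing early on a column mismatch avoids `pd.concat` silently filling absent columns with `NaN`, which is its default behaviour when frames have different columns.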

## Data Validation & Cleaning

In [16]:
aadhar_raw.duplicated().sum()

np.int64(22957)

In [17]:
total_rows = len(aadhar_raw)
dup_rows = aadhar_raw.duplicated().sum()

dup_rows, total_rows, dup_rows / total_rows


(np.int64(22957), 1006029, np.float64(0.022819421706531322))

In [18]:
dups = aadhar_raw[aadhar_raw.duplicated(keep=False)]
dups.head(10)


| | date | state | district | pincode | age_0_5 | age_5_17 | age_18_greater |
|---|---|---|---|---|---|---|---|
| 359389 | 13-10-2025 | Punjab | Jalandhar | 144041 | 2 | 1 | 0 |
| 359390 | 13-10-2025 | Punjab | Jalandhar | 144101 | 1 | 0 | 0 |
| 359391 | 13-10-2025 | Punjab | Jalandhar | 144102 | 2 | 0 | 0 |
| 359392 | 13-10-2025 | Punjab | Jalandhar | 144418 | 1 | 0 | 0 |
| 359393 | 13-10-2025 | Punjab | Jalandhar | 144419 | 1 | 0 | 0 |
| 359394 | 13-10-2025 | Punjab | Jalandhar | 144702 | 1 | 1 | 0 |
| 359395 | 13-10-2025 | Punjab | Jalandhar | 144801 | 0 | 1 | 0 |
| 359396 | 13-10-2025 | Punjab | Kapurthala | 144401 | 5 | 1 | 1 |
| 359397 | 13-10-2025 | Punjab | Kapurthala | 144601 | 4 | 2 | 2 |
| 359398 | 13-10-2025 | Punjab | Kapurthala | 144804 | 2 | 0 | 0 |


In [19]:
dups.groupby(dups.columns.tolist()).size().value_counts().head()


2    22957
Name: count, dtype: int64

In [20]:
# dropping duplicates
aadhar_clean = aadhar_raw.drop_duplicates()
aadhar_clean.shape

(983072, 7)

In [21]:
aadhar_clean.duplicated().sum()

np.int64(0)

### Note:
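After loading, the dataset was validated for schema consistency, data types, and exact duplicate records. 22,957 rows (about 2.28% of 1,006,029) were exact copies of another row, with each duplicated record appearing exactly twice. These duplicates, likely arising from file-level overlaps, were removed to prevent artificial inflation of enrolment counts, leaving 983,072 rows and ensuring analytical accuracy.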
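The note attributes the duplicates to overlaps between files. One way to test that attribution is to tag each partition with its origin before concatenating, then count how many distinct partitions each duplicated record appears in. This is a rough sketch under assumptions: the helper `duplicate_sources`, the `source` column name, and the sample values are all hypothetical and do not come from the notebook.

```python
import pandas as pd

def duplicate_sources(frames, labels):
    """For each duplicated record, count the distinct partitions it occurs in.

    Hypothetical helper; assumes no input frame already has a 'source' column.
    """
    tagged = pd.concat(
        [frame.assign(source=label) for frame, label in zip(frames, labels)],
        ignore_index=True,
    )
    value_cols = [c for c in tagged.columns if c != "source"]
    # keep=False marks every copy of a duplicated record, not only the later ones
    dups = tagged[tagged.duplicated(subset=value_cols, keep=False)]
    # note: groupby drops groups whose keys contain NaN by default
    return dups.groupby(value_cols)["source"].nunique()

# Made-up partitions: (2, 6) repeats inside part_a, (1, 5) repeats across files
part_a = pd.DataFrame({"pincode": [1, 2, 2], "age_0_5": [5, 6, 6]})
part_b = pd.DataFrame({"pincode": [1, 3], "age_0_5": [5, 7]})

spans = duplicate_sources([part_a, part_b], ["file_1", "file_2"])
cross_file = int((spans > 1).sum())    # records duplicated across partitions
within_file = int((spans == 1).sum())  # records duplicated inside one partition
```

If `within_file` dominates on the real data, the duplicates would originate inside individual files instead of at partition boundaries, and the explanation in the note would need revisiting.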