# Enrollment EDA (Universal 10-Step Framework)
This notebook has been reorganized to follow the phase-wise EDA framework, from Step 1 through Step 10.

How to use:
1. Run cells top-to-bottom.
2. Do not skip Steps 1-4.
3. Use advanced cells after core checks pass.


## STEP 1: Understand Business Context
**Business meaning**
- One row represents enrollment counts for one location and date.
- Business event: Aadhaar enrollment activity captured by age bucket.
- Measured values: `age_0_5`, `age_5_17`, `age_18_greater`.
- Decisions supported: trend tracking, quality governance, regional planning.

**Beginner goal**
- Understand what this table means before coding.

**Advanced goal**
- Translate business meaning into target grain and quality contracts.
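
One way to make the advanced goal concrete is to write the grain and the quality rules down as a small contract that code can check. The sketch below is only an illustration: `CONTRACT`, `check_contract`, and the toy rows are hypothetical and are not part of the notebook; the column names are the ones listed above.

```python
import pandas as pd

# Hypothetical contract: the structure and rule names are illustrative assumptions.
CONTRACT = {
    "grain": ["date", "state", "district", "pincode"],
    "non_negative": ["age_0_5", "age_5_17", "age_18_greater"],
}

def check_contract(df: pd.DataFrame, contract: dict) -> dict:
    """Return a violation count for each rule in the contract."""
    required = contract["grain"] + contract["non_negative"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        # Without the required columns the other rules cannot be evaluated.
        return {"missing_columns": missing}
    return {
        "missing_columns": [],
        "grain_duplicates": int(df.duplicated(subset=contract["grain"]).sum()),
        "negative_rows": int((df[contract["non_negative"]] < 0).any(axis=1).sum()),
    }

# Toy rows (not real data): the second row repeats the first key,
# and the third row carries a negative count.
toy = pd.DataFrame({
    "date": ["01-01-2025", "01-01-2025", "02-01-2025"],
    "state": ["A", "A", "B"],
    "district": ["X", "X", "Y"],
    "pincode": [111111, 111111, 222222],
    "age_0_5": [1, 1, -2],
    "age_5_17": [0, 0, 3],
    "age_18_greater": [0, 0, 0],
})
contract_report = check_contract(toy, CONTRACT)
print(contract_report)
```

Keeping the rules in one dictionary means the same contract could be re-run after every cleaning step, which is the design choice this sketch is meant to show.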


In [173]:
import pandas as pd

enrolment1 = pd.read_csv(r'C:\Users\Atul bhardwaj\OneDrive\Desktop\coding 2 year\IdentityLakehouse\data\api_data_aadhar_enrolment\api_data_aadhar_enrolment\api_data_aadhar_enrolment_0_500000.csv')
enrolment2 = pd.read_csv(r'C:\Users\Atul bhardwaj\OneDrive\Desktop\coding 2 year\IdentityLakehouse\data\api_data_aadhar_enrolment\api_data_aadhar_enrolment\api_data_aadhar_enrolment_500000_1000000.csv')
enrolment3 = pd.read_csv(r'C:\Users\Atul bhardwaj\OneDrive\Desktop\coding 2 year\IdentityLakehouse\data\api_data_aadhar_enrolment\api_data_aadhar_enrolment\api_data_aadhar_enrolment_1000000_1006029.csv')

print(enrolment1.head())
print(enrolment2.head())
print(enrolment3.head())


         date          state          district  pincode  age_0_5  age_5_17  \
0  02-03-2025      Meghalaya  East Khasi Hills   793121       11        61   
1  09-03-2025      Karnataka   Bengaluru Urban   560043       14        33   
2  09-03-2025  Uttar Pradesh      Kanpur Nagar   208001       29        82   
3  09-03-2025  Uttar Pradesh           Aligarh   202133       62        29   
4  09-03-2025      Karnataka   Bengaluru Urban   560016       14        16   

   age_18_greater  
0              37  
1              39  
2              12  
3              15  
4              21  
         date           state  district  pincode  age_0_5  age_5_17  \
0  26-10-2025  Andhra Pradesh  Nalgonda   508004        0         1   
1  26-10-2025  Andhra Pradesh  Nalgonda   508238        1         0   
2  26-10-2025  Andhra Pradesh  Nalgonda   508278        1         0   
3  26-10-2025  Andhra Pradesh   Nandyal   518432        0         1   
4  26-10-2025  Andhra Pradesh   Nandyal   518543        

In [174]:
# Aim: Set project paths and load Enrollment base table.
# Expected Output: data_aadhar_enrollment_full loaded with valid shape and columns.
# What You Get: A stable base dataframe for all EDA steps.
# Data Engineer Learning: Always isolate path logic to make notebooks portable.

from pathlib import Path
import pandas as pd

enrollment_path = Path(r"C:\Users\Atul bhardwaj\OneDrive\Desktop\coding 2 year\IdentityLakehouse\scripts\EDA\panda_eda\data\data_aadhar_enrollment_full.csv")

if not enrollment_path.exists():
    raise FileNotFoundError(f"Missing enrollment file: {enrollment_path}")

data_aadhar_enrollment_full = pd.read_csv(enrollment_path)

print("Enrollment path:", enrollment_path)
print("Rows, Cols:", data_aadhar_enrollment_full.shape)



Enrollment path: C:\Users\Atul bhardwaj\OneDrive\Desktop\coding 2 year\IdentityLakehouse\scripts\EDA\panda_eda\data\data_aadhar_enrollment_full.csv
Rows, Cols: (1006029, 7)
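
The cells above use absolute Windows paths, which ties the notebook to one machine. A possible way to isolate the path logic, as the cell comment recommends, is to pass a data folder and a file pattern to a small loader. This is a sketch under assumptions: `load_csv_parts` is a hypothetical helper, the demo files are throwaway data, and nothing here confirms how `data_aadhar_enrollment_full.csv` was actually produced.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_csv_parts(data_dir: Path, pattern: str) -> pd.DataFrame:
    """Read every CSV in data_dir that matches pattern and stack the rows."""
    paths = sorted(data_dir.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files matching {pattern!r} in {data_dir}")
    return pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

# Demo on throwaway files; in the notebook, data_dir would be the project data folder.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    pd.DataFrame({"pincode": [111111], "age_0_5": [1]}).to_csv(tmp_dir / "part_0.csv", index=False)
    pd.DataFrame({"pincode": [222222], "age_0_5": [2]}).to_csv(tmp_dir / "part_1.csv", index=False)
    combined = load_csv_parts(tmp_dir, "part_*.csv")

print(combined.shape)
```

With a helper like this, only the value of `data_dir` changes between machines, and the three part files could be loaded with a single pattern instead of three hard-coded calls.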


In [175]:
# Aim: Record Step-1 business context in a machine-readable table.
# Expected Output: One small table summarizing business definition.
# What You Get: Documentation artifact you can show in interviews/reviews.
# Data Engineer Learning: Good EDA includes explicit semantic documentation.

business_context = pd.DataFrame([
    {'field': 'row_definition', 'value': 'Enrollment counts for one date-state-district-pincode record'},
    {'field': 'business_event', 'value': 'Aadhaar enrollment activity'},
    {'field': 'measures', 'value': 'age_0_5, age_5_17, age_18_greater'},
    {'field': 'decision_support', 'value': 'Trend monitoring, data quality controls, geographic planning'},
])

display(business_context)


Unnamed: 0,field,value
0,row_definition,Enrollment counts for one date-state-district-...
1,business_event,Aadhaar enrollment activity
2,measures,"age_0_5, age_5_17, age_18_greater"
3,decision_support,"Trend monitoring, data quality controls, geogr..."


## STEP 2: Structural Profiling
Check structure before transformations:
- row count, column count
- data types
- sample records
- basic stats
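
The checks in this list can also be expressed as a comparison against an expected schema, so that drift shows up as a short report. The sketch below is illustrative only: `EXPECTED_SCHEMA`, `schema_report`, and the toy frame are hypothetical names and data, and only a few columns are covered.

```python
import pandas as pd

# Hypothetical expected schema for a few columns (dtype names as strings).
EXPECTED_SCHEMA = {"pincode": "int64", "age_0_5": "int64", "age_5_17": "int64"}

def schema_report(df: pd.DataFrame, expected: dict) -> dict:
    """Compare a frame's columns and dtypes against an expected schema."""
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "dtype_mismatch": {
            col: (expected[col], actual[col])
            for col in expected
            if col in actual and actual[col] != expected[col]
        },
    }

# Toy frame (not real data): one column missing, one extra, one drifted dtype.
toy = pd.DataFrame({
    "pincode": pd.Series([110001], dtype="int64"),
    "age_0_5": pd.Series([1.5], dtype="float64"),
    "notes": ["x"],
})
schema_result = schema_report(toy, EXPECTED_SCHEMA)
print(schema_result)
```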


### Integrated 63-Step Cells For This Section
Included steps: 1, 2, 3, 4, 5, 6, 7


#### Integrated Step 3 (from eroll.ipynb)


#### Integrated Step 4 (from eroll.ipynb)


In [176]:
# Aim: Preview the first rows.
# Expected Output: The first 10 rows of the raw table.
# What You Get: A visual check of column order and value formats.
# Data Engineer Learning: Sample rows reveal format problems (such as dd-mm-yyyy date strings) early.

display(data_aadhar_enrollment_full.head(10))


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
0,02-03-2025,Meghalaya,East Khasi Hills,793121,11,61,37
1,09-03-2025,Karnataka,Bengaluru Urban,560043,14,33,39
2,09-03-2025,Uttar Pradesh,Kanpur Nagar,208001,29,82,12
3,09-03-2025,Uttar Pradesh,Aligarh,202133,62,29,15
4,09-03-2025,Karnataka,Bengaluru Urban,560016,14,16,21
5,09-03-2025,Bihar,Sitamarhi,843331,20,49,12
6,09-03-2025,Bihar,Sitamarhi,843330,23,24,42
7,09-03-2025,Uttar Pradesh,Bahraich,271865,26,60,14
8,09-03-2025,Uttar Pradesh,Firozabad,283204,28,26,10
9,09-03-2025,Bihar,Purbi Champaran,845418,30,48,10


#### Integrated Step 5 (from eroll.ipynb)


In [177]:
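# Aim: Inspect summary statistics for all columns.
# Expected Output: describe(include='all') table with counts, unique values, and numeric spread.
# What You Get: Cardinality of the text columns and range of the numeric columns in one view.
# Data Engineer Learning: include='all' also covers object columns, which the default describe() skips on a mixed frame.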

display(data_aadhar_enrollment_full.describe(include='all'))


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
count,1006029,1006029,1006029,1006029.0,1006029.0,1006029.0,1006029.0
unique,92,55,985,,,,
top,15-12-2025,Uttar Pradesh,Pune,,,,
freq,19426,110369,6663,,,,
mean,,,,518641.5,3.525709,1.710074,0.1673441
std,,,,205636.0,17.53851,14.36963,3.220525
min,,,,100000.0,0.0,0.0,0.0
25%,,,,363641.0,1.0,0.0,0.0
50%,,,,517417.0,2.0,0.0,0.0
75%,,,,700104.0,3.0,1.0,0.0


#### Integrated Step 6 (from eroll.ipynb)


In [178]:
# Aim: Inspect table size.
# Expected Output: The (rows, columns) tuple.
# What You Get: A baseline row count for later before/after comparisons.
# Data Engineer Learning: Record the raw shape before any filtering or deduplication.

print('shape', data_aadhar_enrollment_full.shape)


shape (1006029, 7)


#### Integrated Step 7 (from eroll.ipynb)


In [179]:
# Aim: Count unique raw state labels and list them.
# Expected Output: The distinct-state count and a sorted list of up to 100 labels.
# What You Get: A first look at spelling variants and invalid state values.
# Data Engineer Learning: Listing raw labels exposes naming issues that a count alone hides.

print('unique_states', data_aadhar_enrollment_full['state'].nunique())
state_labels = sorted(data_aadhar_enrollment_full['state'].dropna().astype(str).unique())
display(pd.Series(state_labels).head(100))


unique_states 55


0                                           100000
1                        Andaman & Nicobar Islands
2                      Andaman and Nicobar Islands
3                                   Andhra Pradesh
4                                Arunachal Pradesh
5                                            Assam
6                                            Bihar
7                                       Chandigarh
8                                     Chhattisgarh
9                             Dadra & Nagar Haveli
10                          Dadra and Nagar Haveli
11        Dadra and Nagar Haveli and Daman and Diu
12                                     Daman & Diu
13                                   Daman and Diu
14                                           Delhi
15                                             Goa
16                                         Gujarat
17                                         Haryana
18                                Himachal Pradesh
19                             

In [180]:
# Aim: Inspect table structure and schema health.
# Expected Output: shape, columns list, dtypes summary, sample rows.
# What You Get: Structural baseline to detect schema drift.
# Data Engineer Learning: Structure-first profiling prevents downstream surprises.

print('Shape:', data_aadhar_enrollment_full.shape)
print('\nColumns:')
print(list(data_aadhar_enrollment_full.columns))
print('\nInfo:')
print(data_aadhar_enrollment_full.info())
print('\nSample rows:')
display(data_aadhar_enrollment_full.head(10))


Shape: (1006029, 7)

Columns:
['date', 'state', 'district', 'pincode', 'age_0_5', 'age_5_17', 'age_18_greater']

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1006029 entries, 0 to 1006028
Data columns (total 7 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   date            1006029 non-null  object
 1   state           1006029 non-null  object
 2   district        1006029 non-null  object
 3   pincode         1006029 non-null  int64 
 4   age_0_5         1006029 non-null  int64 
 5   age_5_17        1006029 non-null  int64 
 6   age_18_greater  1006029 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 53.7+ MB
None

Sample rows:


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
0,02-03-2025,Meghalaya,East Khasi Hills,793121,11,61,37
1,09-03-2025,Karnataka,Bengaluru Urban,560043,14,33,39
2,09-03-2025,Uttar Pradesh,Kanpur Nagar,208001,29,82,12
3,09-03-2025,Uttar Pradesh,Aligarh,202133,62,29,15
4,09-03-2025,Karnataka,Bengaluru Urban,560016,14,16,21
5,09-03-2025,Bihar,Sitamarhi,843331,20,49,12
6,09-03-2025,Bihar,Sitamarhi,843330,23,24,42
7,09-03-2025,Uttar Pradesh,Bahraich,271865,26,60,14
8,09-03-2025,Uttar Pradesh,Firozabad,283204,28,26,10
9,09-03-2025,Bihar,Purbi Champaran,845418,30,48,10


In [181]:
# Aim: Generate a summary profile for quick sanity checks.
# Expected Output: describe(include='all') summary covering numeric and object columns.
# What You Get: Range, spread, and count overview.
# Data Engineer Learning: Numeric profiling quickly exposes impossible values.

display(data_aadhar_enrollment_full.describe(include='all'))


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
count,1006029,1006029,1006029,1006029.0,1006029.0,1006029.0,1006029.0
unique,92,55,985,,,,
top,15-12-2025,Uttar Pradesh,Pune,,,,
freq,19426,110369,6663,,,,
mean,,,,518641.5,3.525709,1.710074,0.1673441
std,,,,205636.0,17.53851,14.36963,3.220525
min,,,,100000.0,0.0,0.0,0.0
25%,,,,363641.0,1.0,0.0,0.0
50%,,,,517417.0,2.0,0.0,0.0
75%,,,,700104.0,3.0,1.0,0.0


## STEP 3: Grain Identification (Most Important)
Natural key hypothesis:
- `(date, state, district, pincode)`

If the grain is unclear, every later aggregation and model built on this table is unreliable.


### Integrated 63-Step Cells For This Section
Included steps: 8, 9, 10


#### Integrated Step 8 (from eroll.ipynb)


In [182]:
# Aim: Count duplicates on the candidate natural key.
# Expected Output: Number of rows that repeat an earlier (date, state, district, pincode) key.
# What You Get: A first test of the grain hypothesis.
# Data Engineer Learning: A nonzero count means the key is not unique in the raw data.

natural_key = ['date', 'state', 'district', 'pincode']
print('dup_key', int(data_aadhar_enrollment_full.duplicated(subset=natural_key).sum()))


dup_key 22957


#### Integrated Step 9 (from eroll.ipynb)


In [183]:
# Aim: View duplicate groups in detail.
# Expected Output: Count of all rows in duplicated key groups and a 30-row preview.
# What You Get: Duplicate rows placed next to each other for root-cause review.
# Data Engineer Learning: keep=False marks every member of a duplicate group, not only the repeats.

dup_examples = data_aadhar_enrollment_full[
    data_aadhar_enrollment_full.duplicated(subset=natural_key, keep=False)
].sort_values(natural_key)
print('dup_group_rows', len(dup_examples))
display(dup_examples.head(30))


dup_group_rows 45914


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
588088,02-11-2025,Assam,Dibrugarh,786007,2,0,0
590380,02-11-2025,Assam,Dibrugarh,786007,2,0,0
588089,02-11-2025,Assam,Dibrugarh,786008,3,0,0
590381,02-11-2025,Assam,Dibrugarh,786008,3,0,0
588090,02-11-2025,Assam,Dibrugarh,786012,1,0,0
590382,02-11-2025,Assam,Dibrugarh,786012,1,0,0
588091,02-11-2025,Assam,Dibrugarh,786184,5,0,0
590383,02-11-2025,Assam,Dibrugarh,786184,5,0,0
588092,02-11-2025,Assam,Dibrugarh,786610,3,2,0
590384,02-11-2025,Assam,Dibrugarh,786610,3,2,0


#### Integrated Step 10 (from eroll.ipynb)


In [184]:
# Aim: Count exact full-row duplicates.
# Expected Output: Number of rows identical to an earlier row in every column.
# What You Get: A comparison point for the key-based duplicate count.
# Data Engineer Learning: If this equals the key-duplicate count, every key duplicate is an exact copy of an earlier row.

dup_full = int(data_aadhar_enrollment_full.duplicated().sum())
print('dup_full', dup_full)


dup_full 22957


In [185]:
# Aim: Validate natural grain and key uniqueness.
# Expected Output: duplicate count at candidate natural key.
# What You Get: Evidence for or against (date,state,district,pincode) as the row grain; a nonzero count means the raw table is not yet unique at that grain.
# Data Engineer Learning: Grain must be validated before any aggregation/modeling.

natural_key = ['date', 'state', 'district', 'pincode']

dup_grain_count = int(data_aadhar_enrollment_full.duplicated(subset=natural_key).sum())
print('Natural Key:', natural_key)
print('Duplicate rows at natural key:', dup_grain_count)


Natural Key: ['date', 'state', 'district', 'pincode']
Duplicate rows at natural key: 22957


In [186]:
# Aim: Show duplicate key examples for root-cause review.
# Expected Output: duplicate sample rows sorted by natural key.
# What You Get: Concrete records to investigate ingestion or source duplication.
# Data Engineer Learning: Always inspect duplicate examples, not only counts.

dup_examples = data_aadhar_enrollment_full[
    data_aadhar_enrollment_full.duplicated(subset=natural_key, keep=False)
].sort_values(natural_key)

print('Duplicate sample size:', len(dup_examples))
display(dup_examples.head(30))


Duplicate sample size: 45914


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater
588088,02-11-2025,Assam,Dibrugarh,786007,2,0,0
590380,02-11-2025,Assam,Dibrugarh,786007,2,0,0
588089,02-11-2025,Assam,Dibrugarh,786008,3,0,0
590381,02-11-2025,Assam,Dibrugarh,786008,3,0,0
588090,02-11-2025,Assam,Dibrugarh,786012,1,0,0
590382,02-11-2025,Assam,Dibrugarh,786012,1,0,0
588091,02-11-2025,Assam,Dibrugarh,786184,5,0,0
590383,02-11-2025,Assam,Dibrugarh,786184,5,0,0
588092,02-11-2025,Assam,Dibrugarh,786610,3,2,0
590384,02-11-2025,Assam,Dibrugarh,786610,3,2,0
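
Before dropping key duplicates it helps to know whether the repeated keys carry identical measures or conflicting ones, because `keep='first'` is only harmless in the first case. The sketch below shows one way to split the two cases; `classify_key_duplicates` and the toy rows are hypothetical and not part of the notebook.

```python
import pandas as pd

def classify_key_duplicates(df: pd.DataFrame, key: list) -> dict:
    """Split rows in duplicated key groups into exact repeats and conflicting rows."""
    in_dup_key = df.duplicated(subset=key, keep=False)  # every member of a repeated key
    in_dup_row = df.duplicated(keep=False)              # every member of a repeated full row
    return {
        "rows_in_duplicate_keys": int(in_dup_key.sum()),
        "exact_repeat_rows": int((in_dup_key & in_dup_row).sum()),
        "conflicting_rows": int((in_dup_key & ~in_dup_row).sum()),
    }

# Toy rows (not real data): key (d1, A, X, 1) repeats exactly,
# key (d2, B, Y, 2) repeats with different counts, key (d3, C, Z, 3) is unique.
toy = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2", "d3"],
    "state": ["A", "A", "B", "B", "C"],
    "district": ["X", "X", "Y", "Y", "Z"],
    "pincode": [1, 1, 2, 2, 3],
    "age_0_5": [2, 2, 5, 7, 1],
})
dup_classes = classify_key_duplicates(toy, ["date", "state", "district", "pincode"])
print(dup_classes)
```

If `conflicting_rows` were nonzero on the real table, keeping the first row would silently discard a different measurement, so an aggregation rule would need to be chosen instead.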


## STEP 4: Data Quality Assessment
### 4.1 Null Analysis
### 4.2 Duplicate Analysis
### 4.3 Range Validation
### 4.4 Format Issues
### 4.5 Outlier Detection


### Integrated 63-Step Cells For This Section
Included steps: 11, 12, 13, 14, 15, 16, 17, 41, 42, 43, 44, 45, 46, 47, 53, 54, 55, 56, 57, 58, 59


#### Integrated Step 11 (from eroll.ipynb)


In [187]:
# Aim: Drop key-based duplicates, keeping the first row per key.
# Expected Output: Row count after deduplication.
# What You Get: data_aadhar_enrollment_dedup, unique on the natural key.
# Data Engineer Learning: Write the result to a new frame so the raw table stays available for audits.

data_aadhar_enrollment_dedup = (
    data_aadhar_enrollment_full
    .drop_duplicates(subset=natural_key, keep='first')
    .reset_index(drop=True)
)
print('rows_after_dedup', len(data_aadhar_enrollment_dedup))


rows_after_dedup 983072


#### Integrated Step 12 (from eroll.ipynb)


In [188]:
# Aim: Re-check key duplicates after the drop.
# Expected Output: dup_key_after equal to 0.
# What You Get: Confirmation that the natural key is unique in the deduplicated table.
# Data Engineer Learning: Verify each cleaning step with the same check that found the problem.

print('dup_key_after', int(data_aadhar_enrollment_dedup.duplicated(subset=natural_key).sum()))


dup_key_after 0


#### Integrated Step 13 (from eroll.ipynb)


In [189]:
# Aim: Document the before/after impact of deduplication.
# Expected Output: A small table of row counts and duplicate counts.
# What You Get: An audit record of how many rows the drop removed.
# Data Engineer Learning: Row-count reconciliation makes cleaning steps traceable.

dup_impact = pd.DataFrame([
    {'metric': 'rows_before', 'value': len(data_aadhar_enrollment_full)},
    {'metric': 'rows_after', 'value': len(data_aadhar_enrollment_dedup)},
    {'metric': 'full_dup_before', 'value': dup_full},
    {'metric': 'key_dup_after', 'value': int(data_aadhar_enrollment_dedup.duplicated(subset=natural_key).sum())},
])
display(dup_impact)


Unnamed: 0,metric,value
0,rows_before,1006029
1,rows_after,983072
2,full_dup_before,22957
3,key_dup_after,0


#### Integrated Step 14 (from eroll.ipynb)


In [190]:
# Aim: Check the date range.
# Expected Output: Minimum and maximum parsed dates.
# What You Get: The time span covered by the data.
# Data Engineer Learning: dayfirst=True matches the dd-mm-yyyy strings seen in the raw data.

tmp_date = pd.to_datetime(data_aadhar_enrollment_dedup['date'], dayfirst=True, errors='coerce')
print(tmp_date.min(), tmp_date.max())


2025-03-02 00:00:00 2025-12-31 00:00:00


#### Integrated Step 15 (from eroll.ipynb)


In [191]:
# Aim: Check for unparseable dates.
# Expected Output: Count of dates that become NaT after parsing.
# What You Get: Evidence that every date string is valid.
# Data Engineer Learning: errors='coerce' turns bad values into NaT so they can be counted instead of raising.

print('null_dates', int(pd.to_datetime(data_aadhar_enrollment_dedup['date'], dayfirst=True, errors='coerce').isna().sum()))


null_dates 0


#### Integrated Step 16 (from eroll.ipynb)


In [192]:
# Aim: Re-check unique raw state labels after deduplication.
# Expected Output: Distinct-state count on the deduplicated table.
# What You Get: Confirmation that deduplication did not remove any state label.
# Data Engineer Learning: Recount key dimensions after each row-dropping step.

print('unique_states_after_dedup', data_aadhar_enrollment_dedup['state'].nunique())


unique_states_after_dedup 55


#### Integrated Step 17 (from eroll.ipynb)


In [193]:
# Aim: List raw state labels.
# Expected Output: Sorted list of distinct state labels.
# What You Get: Input for building a canonical state mapping.
# Data Engineer Learning: Look for spelling variants ('&' vs 'and') and invalid values (such as '100000').

display(pd.Series(sorted(data_aadhar_enrollment_dedup['state'].dropna().astype(str).unique())))


0                                           100000
1                        Andaman & Nicobar Islands
2                      Andaman and Nicobar Islands
3                                   Andhra Pradesh
4                                Arunachal Pradesh
5                                            Assam
6                                            Bihar
7                                       Chandigarh
8                                     Chhattisgarh
9                             Dadra & Nagar Haveli
10                          Dadra and Nagar Haveli
11        Dadra and Nagar Haveli and Daman and Diu
12                                     Daman & Diu
13                                   Daman and Diu
14                                           Delhi
15                                             Goa
16                                         Gujarat
17                                         Haryana
18                                Himachal Pradesh
19                             

#### Integrated Step 41 (from eroll.ipynb)


In [194]:
# Aim: Create the eda_enroll working copy.
# Expected Output: Row count of the copy.
# What You Get: A frame that can be modified without touching the deduplicated table.
# Data Engineer Learning: Use .copy() so later column assignments do not affect the source frame.

eda_enroll = data_aadhar_enrollment_dedup.copy()
print('eda_enroll_rows', len(eda_enroll))


eda_enroll_rows 983072


#### Integrated Step 42 (from eroll.ipynb)


In [195]:
# Aim: Convert date to datetime in eda_enroll.
# Expected Output: Count of null dates after parsing (0 if every value parses).
# What You Get: A datetime64 date column ready for time-based grouping.
# Data Engineer Learning: Count NaT values right after parsing so coercion never hides bad dates.

eda_enroll['date'] = pd.to_datetime(eda_enroll['date'], dayfirst=True, errors='coerce')
print('null_date_after_parse', int(eda_enroll['date'].isna().sum()))


null_date_after_parse 0


#### Integrated Step 43 (from eroll.ipynb)


In [196]:
# Aim: Create total_enrollment as the sum of the three age buckets.
# Expected Output: A confirmation message once the column exists.
# What You Get: A single measure for trend and outlier analysis.
# Data Engineer Learning: fillna(0) keeps a missing bucket from turning the whole total into NaN.

eda_enroll['total_enrollment'] = (
    eda_enroll['age_0_5'].fillna(0)
    + eda_enroll['age_5_17'].fillna(0)
    + eda_enroll['age_18_greater'].fillna(0)
)
print('total_enrollment_created')


total_enrollment_created


#### Integrated Step 44 (from eroll.ipynb)


#### Integrated Step 45 (from eroll.ipynb)


#### Integrated Step 46 (from eroll.ipynb)


In [197]:
# Aim: Summarize missing values by column.
# Expected Output: One row per column with its missing_count, largest first.
# What You Get: A ranking of columns by missingness.
# Data Engineer Learning: Sort by missing count so the worst columns are reviewed first.

missing_summary = (
    eda_enroll.isna().sum()
    .reset_index(name='missing_count')
    .rename(columns={'index': 'column'})
    .sort_values('missing_count', ascending=False)
)
display(missing_summary)


Unnamed: 0,column,missing_count
0,date,0
1,state,0
2,district,0
3,pincode,0
4,age_0_5,0
5,age_5_17,0
6,age_18_greater,0
7,total_enrollment,0


#### Integrated Step 47 (from eroll.ipynb)


In [198]:
# Aim: Check for negative values in the numeric measure columns.
# Expected Output: Count of rows with at least one negative measure.
# What You Get: Range validation for the enrollment counts.
# Data Engineer Learning: Counts can never be negative, so any hit is a data error.

print('negative_rows', int((eda_enroll[['age_0_5','age_5_17','age_18_greater','total_enrollment']] < 0).any(axis=1).sum()))


negative_rows 0


#### Integrated Step 53 (from eroll.ipynb)


In [199]:
# Aim: Count nulls per column on the deduplicated dataframe.
# Expected Output: Null count for each of the seven source columns.
# What You Get: A missingness baseline before derived columns are added.
# Data Engineer Learning: Check nulls on the source frame as well as on the working copy.

display(data_aadhar_enrollment_dedup.isnull().sum())


date              0
state             0
district          0
pincode           0
age_0_5           0
age_5_17          0
age_18_greater    0
dtype: int64

#### Integrated Step 54 (from eroll.ipynb)


In [200]:
# Aim: Count nulls per column on eda_enroll.
# Expected Output: Null count for each column, including total_enrollment.
# What You Get: Confirmation that parsing and derivation introduced no nulls.
# Data Engineer Learning: Re-run null checks after every transformation that can create NaN or NaT.

display(eda_enroll.isnull().sum())


date                0
state               0
district            0
pincode             0
age_0_5             0
age_5_17            0
age_18_greater      0
total_enrollment    0
dtype: int64

#### Integrated Step 55 (from eroll.ipynb)


In [201]:
# Aim: Compute the missing percentage by column.
# Expected Output: Percentage of null values per column, largest first.
# What You Get: A relative view of missingness that is comparable across tables.
# Data Engineer Learning: Percentages put absolute null counts in context of table size.

display((eda_enroll.isnull().sum()/len(eda_enroll)*100).sort_values(ascending=False))


date                0.0
state               0.0
district            0.0
pincode             0.0
age_0_5             0.0
age_5_17            0.0
age_18_greater      0.0
total_enrollment    0.0
dtype: float64

#### Integrated Step 56 (from eroll.ipynb)


In [202]:
# Aim: Count empty or whitespace-only strings per column.
# Expected Output: Empty-string count for each column, largest first.
# What You Get: Detection of blanks that isnull() does not report.
# Data Engineer Learning: An empty string is not a null, so it needs its own check.

empty_counts = eda_enroll.apply(lambda x: (x.astype(str).str.strip() == '').sum())
display(empty_counts.sort_values(ascending=False))


date                0
state               0
district            0
pincode             0
age_0_5             0
age_5_17            0
age_18_greater      0
total_enrollment    0
dtype: int64

#### Integrated Step 57 (from eroll.ipynb)


In [203]:
# Aim: Combine null and empty-string counts per column.
# Expected Output: Total of nulls plus blank strings for each column, largest first.
# What You Get: One number per column for all forms of missing data checked so far.
# Data Engineer Learning: Report nulls and blanks together so neither form of missingness is overlooked.

combined_missing = eda_enroll.apply(
    lambda x: x.isnull().sum() + (x.astype(str).str.strip() == '').sum()
)
display(combined_missing.sort_values(ascending=False))


date                0
state               0
district            0
pincode             0
age_0_5             0
age_5_17            0
age_18_greater      0
total_enrollment    0
dtype: int64

#### Integrated Step 58 (from eroll.ipynb)


In [204]:
# Aim: Detect fully empty rows.
# Expected Output: Count of rows where every column is null.
# What You Get: A check for blank records, which often come from malformed file lines.
# Data Engineer Learning: Fully empty rows should be dropped before any per-column analysis.

print('fully_empty_rows', int(eda_enroll.isnull().all(axis=1).sum()))


fully_empty_rows 0


#### Integrated Step 59 (from eroll.ipynb)


In [205]:
# Aim: Build a column profile table.
# Expected Output: One row per column with dtype, null count, null percentage, and unique count.
# What You Get: A compact profile that can be saved or compared across runs.
# Data Engineer Learning: Unique counts help separate identifiers, dimensions, and measures.

cols = eda_enroll.columns
profile = pd.DataFrame({
    'column': cols,
    'dtype': [str(eda_enroll[c].dtype) for c in cols],
    'null_count': [int(eda_enroll[c].isna().sum()) for c in cols],
    'null_pct': [float(eda_enroll[c].isna().mean() * 100) for c in cols],
    'unique_count': [int(eda_enroll[c].nunique(dropna=True)) for c in cols],
})
display(profile.sort_values(['null_pct', 'unique_count'], ascending=[False, False]))


Unnamed: 0,column,dtype,null_count,null_pct,unique_count
3,pincode,int64,0,0.0,19463
7,total_enrollment,int64,0,0.0,1028
2,district,object,0,0.0,985
4,age_0_5,int64,0,0.0,671
5,age_5_17,int64,0,0.0,624
6,age_18_greater,int64,0,0.0,199
0,date,datetime64[ns],0,0.0,92
1,state,object,0,0.0,55


In [206]:
# Aim: Compute null profile (count + percent) for all columns.
# Expected Output: Null summary table.
# What You Get: Missingness ranking for quality prioritization.
# Data Engineer Learning: Missingness must be measured both absolute and relative.

null_count = data_aadhar_enrollment_full.isnull().sum()
null_pct = (null_count / len(data_aadhar_enrollment_full)) * 100
null_profile = pd.DataFrame({'null_count': null_count, 'null_pct': null_pct}).sort_values('null_count', ascending=False)

display(null_profile)


Unnamed: 0,null_count,null_pct
date,0,0.0
state,0,0.0
district,0,0.0
pincode,0,0.0
age_0_5,0,0.0
age_5_17,0,0.0
age_18_greater,0,0.0


In [207]:
# Aim: Evaluate duplicates (full-row and grain-level) and prepare deduplicated table.
# Expected Output: Before/after duplicate counts and new dataframe.
# What You Get: data_aadhar_enrollment_dedup for cleaner analytics.
# Data Engineer Learning: Keep raw and deduplicated versions separated for traceability.

dup_full = int(data_aadhar_enrollment_full.duplicated().sum())
dup_grain = int(data_aadhar_enrollment_full.duplicated(subset=natural_key).sum())

data_aadhar_enrollment_dedup = data_aadhar_enrollment_full.drop_duplicates(subset=natural_key, keep='first').reset_index(drop=True)

post_dup_grain = int(data_aadhar_enrollment_dedup.duplicated(subset=natural_key).sum())

dup_summary = pd.DataFrame([
    {'metric': 'full_row_duplicates_before', 'value': dup_full},
    {'metric': 'grain_duplicates_before', 'value': dup_grain},
    {'metric': 'grain_duplicates_after', 'value': post_dup_grain},
    {'metric': 'rows_before', 'value': len(data_aadhar_enrollment_full)},
    {'metric': 'rows_after', 'value': len(data_aadhar_enrollment_dedup)},
])

display(dup_summary)


Unnamed: 0,metric,value
0,full_row_duplicates_before,22957
1,grain_duplicates_before,22957
2,grain_duplicates_after,0
3,rows_before,1006029
4,rows_after,983072


In [208]:
# Aim: Validate numeric range constraints for measure columns.
# Expected Output: Count of negative rows and optional suspicious values.
# What You Get: Range validation status for age measure integrity.
# Data Engineer Learning: Contract checks should be explicit and measurable.

measure_cols = ['age_0_5', 'age_5_17', 'age_18_greater']
neg_mask = (data_aadhar_enrollment_dedup[measure_cols] < 0).any(axis=1)
neg_count = int(neg_mask.sum())
print('Rows with negative age counts:', neg_count)
if neg_count > 0:
    display(data_aadhar_enrollment_dedup.loc[neg_mask, natural_key + measure_cols].head(20))


Rows with negative age counts: 0


In [209]:
# Aim: Build total_enrollment and detect outliers using z-score.
# Expected Output: Outlier count and sample rows.
# What You Get: Early anomaly list for investigation.
# Data Engineer Learning: Outliers can indicate either data issues or real events.

import numpy as np  # required for np.nan in the zero-variance guard below

eda_df = data_aadhar_enrollment_dedup.copy()
eda_df['total_enrollment'] = eda_df['age_0_5'].fillna(0) + eda_df['age_5_17'].fillna(0) + eda_df['age_18_greater'].fillna(0)

mu = eda_df['total_enrollment'].mean()
sig = eda_df['total_enrollment'].std()
eda_df['total_z'] = (eda_df['total_enrollment'] - mu) / (sig if sig and sig != 0 else np.nan)

outliers = eda_df[eda_df['total_z'].abs() > 3]
print('Outlier rows (|z| > 3):', len(outliers))
display(outliers[natural_key + ['total_enrollment', 'total_z']].head(30))


Outlier rows (|z| > 3): 2941


Unnamed: 0,date,state,district,pincode,total_enrollment,total_z
0,02-03-2025,Meghalaya,East Khasi Hills,793121,109,3.243601
2,09-03-2025,Uttar Pradesh,Kanpur Nagar,208001,123,3.682025
3,09-03-2025,Uttar Pradesh,Aligarh,202133,106,3.149653
10,09-03-2025,Uttar Pradesh,Maharajganj,273164,114,3.400181
11,09-03-2025,Bihar,Sitamarhi,843317,145,4.370978
13,09-03-2025,Bihar,Sitamarhi,843324,269,8.254163
14,09-03-2025,Uttar Pradesh,Ghaziabad,201102,174,5.279142
15,09-03-2025,Haryana,Faridabad,121004,150,4.527558
17,09-03-2025,Bihar,Madhubani,847108,160,4.840718
23,09-03-2025,Uttar Pradesh,Gautam Buddha Nagar,201301,290,8.9118


### Added for Checklist Alignment: Step 4 Exact Duplicate Analysis
This cell adds the exact duplicate-inspection and drop pattern from your checklist.


In [210]:
# Aim: Run the checklist's key-based duplicate inspection and perform an explicit key-based drop.
# Expected Output: duplicate rows preview + before/after duplicate counts.
# What You Get: A direct checklist-style duplicate handling cell.
# Data Engineer Learning: Separate exact duplicates from business-key duplicates before deciding drop logic.
# Note: eda_enroll is a copy of the already deduplicated table, so zero duplicates are expected whenever it exists.

df = eda_enroll.copy() if 'eda_enroll' in globals() else data_aadhar_enrollment_full.copy()
dup_view = df[df.duplicated(subset=['date','state','district','pincode'], keep=False)].sort_values(by=['date','state'])
print('Duplicate groups (key-based):', len(dup_view))
display(dup_view.head(20))

before = int(df.duplicated(subset=['date','state','district','pincode']).sum())
df_dedup_check = df.drop_duplicates(subset=['date','state','district','pincode'], keep='first').copy()
after = int(df_dedup_check.duplicated(subset=['date','state','district','pincode']).sum())
print('Before key-duplicate count:', before)
print('After key-duplicate count :', after)


Duplicate groups (key-based): 0


Unnamed: 0,date,state,district,pincode,age_0_5,age_5_17,age_18_greater,total_enrollment


Before key-duplicate count: 0
After key-duplicate count : 0


## STEP 5: Domain Validation
Validate domain values, especially district/state naming quality.
This is where canonical cleaning is built.
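
The raw state listing earlier in the notebook contains both `&` and `and` spellings of the same name, plus the numeric label `100000`. A first pass at canonical cleaning could group labels by a normalized comparison key and flag labels that contain no letters. This is a sketch only: `label_key` and `is_suspect` are hypothetical helpers, and the sample labels are a handful copied from that listing.

```python
import re

import pandas as pd

def label_key(label) -> str:
    """Comparison key: '&' spelled as 'and', whitespace collapsed, case folded."""
    text = str(label).replace("&", " and ")
    return re.sub(r"\s+", " ", text).strip().casefold()

def is_suspect(label) -> bool:
    """Flag labels that contain no letters at all, such as a numeric code."""
    return not any(ch.isalpha() for ch in str(label))

# A few labels taken from the state listing shown earlier.
raw_states = pd.Series([
    "100000",
    "Andaman & Nicobar Islands",
    "Andaman and Nicobar Islands",
    "Daman & Diu",
    "Daman and Diu",
    "Delhi",
])

suspects = raw_states[raw_states.map(is_suspect)].tolist()

# Group raw labels by key and keep only keys with more than one spelling.
variant_groups = raw_states.groupby(raw_states.map(label_key)).apply(list)
variant_groups = variant_groups[variant_groups.map(len) > 1]

print(suspects)
print(variant_groups.to_dict())
```

The key only detects spelling variants. Deciding which spelling is canonical, or whether older territory names should map to a merged territory, remains a business decision that a lookup table would have to encode.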


### Integrated 63-Step Cells For This Section
Included steps: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34


#### Integrated Step 18 (from eroll.ipynb)


#### Integrated Step 19 (from eroll.ipynb)


#### Integrated Step 20 (from eroll.ipynb)


#### Integrated Step 21 (from eroll.ipynb)


#### Integrated Step 22 (from eroll.ipynb)


#### Integrated Step 23 (from eroll.ipynb)


#### Integrated Step 24 (from eroll.ipynb)


#### Integrated Step 25 (from eroll.ipynb)


#### Integrated Step 26 (from eroll.ipynb)


#### Integrated Step 27 (from eroll.ipynb)


#### Integrated Step 28 (from eroll.ipynb)


#### Integrated Step 29 (from eroll.ipynb)


#### Integrated Step 30 (from eroll.ipynb)


#### Integrated Step 31 (from eroll.ipynb)


#### Integrated Step 32 (from eroll.ipynb)


#### Integrated Step 33 (from eroll.ipynb)


#### Integrated Step 34 (from eroll.ipynb)


## STEP 6 - Column Classification
Classify columns into dimensions/measures/time for modeling readiness.
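
A minimal sketch of such a classification, assuming a dtype-based rule with manual overrides. The helper `classify_columns` and its `overrides` parameter are illustrative names, not part of the notebook. The override matters because `pincode` is stored as `int64` here, yet it is an identifier and not a quantity.

```python
import pandas as pd

def classify_columns(df, overrides=None):
    """Label each column as time, measure, or dimension (dtype rule plus overrides)."""
    overrides = overrides or {}
    rows = []
    for col in df.columns:
        if col in overrides:
            role = overrides[col]
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            role = 'time'
        elif pd.api.types.is_numeric_dtype(df[col]):
            role = 'measure'
        else:
            role = 'dimension'
        rows.append({'column': col, 'dtype': str(df[col].dtype), 'role': role})
    return pd.DataFrame(rows)

# One-row toy frame with the same column kinds as the enrollment table.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2025-03-09']),
    'state': ['Karnataka'],
    'pincode': [560043],
    'age_0_5': [14],
})
roles = classify_columns(toy, overrides={'pincode': 'dimension'})
print(roles)
```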


### Added for Checklist Alignment: Step 6 Data Type and Format Validation
This cell adds explicit date conversion, pincode length checks, and object-column inspection.


In [211]:
# Aim: Validate date datatype, pincode format length, and object columns.
# Expected Output: datetime conversion status, pincode length distribution, object column list.
# What You Get: Explicit format-validation evidence for key fields.
# Data Engineer Learning: Format checks are data contracts, not optional checks.

df = eda_enroll.copy() if 'eda_enroll' in globals() else data_aadhar_enrollment_full.copy()
df['date'] = pd.to_datetime(df['date'], errors='coerce', dayfirst=True)
print('Null dates after conversion:', int(df['date'].isna().sum()))

pincode_len = (
    df['pincode'].astype(str)
    .str.replace(r'\.0$', '', regex=True)
    .str.strip()
    .str.len()
    .value_counts(dropna=False)
    .sort_index()
)
print('Pincode length distribution:')
display(pincode_len.to_frame('count'))

obj_cols = df.select_dtypes(include=['object']).columns.tolist()
print('Object columns:', obj_cols)


Null dates after conversion: 0
Pincode length distribution:


pincode,count
6,983072


Object columns: ['state', 'district']


## STEP 7 - Relationship Feasibility
Feasibility only (no modeling implementation):
- Do Enrollment/Demographic/Biometric share keys?
- Can they connect at date-location grain?


In [212]:
# Aim: Load related tables and check shared schema feasibility.
# Expected Output: Basic table sizes and key-column existence check.
# What You Get: Early confirmation whether joins are feasible.
# Data Engineer Learning: Feasibility checks prevent expensive late-stage integration failures.

from pathlib import Path
import pandas as pd

demo_path = Path(r"C:\Users\Atul bhardwaj\OneDrive\Desktop\coding 2 year\IdentityLakehouse\scripts\EDA\panda_eda\data\data_aadhar_demographic_full.csv")
bio_path  = Path(r"C:\Users\Atul bhardwaj\OneDrive\Desktop\coding 2 year\IdentityLakehouse\scripts\EDA\panda_eda\data\data_aadhar_biometric_full.csv")

demo_df = pd.read_csv(demo_path)
bio_df = pd.read_csv(bio_path)

required_keys = ['date', 'state', 'district', 'pincode']
enroll_df = eda_df if 'eda_df' in globals() else data_aadhar_enrollment_full

feasibility = pd.DataFrame([
    {'table': 'enrollment',  'rows': len(enroll_df), 'has_all_keys': all(k in enroll_df.columns for k in required_keys)},
    {'table': 'demographic', 'rows': len(demo_df),   'has_all_keys': all(k in demo_df.columns for k in required_keys)},
    {'table': 'biometric',   'rows': len(bio_df),    'has_all_keys': all(k in bio_df.columns for k in required_keys)},
])

display(feasibility)


Unnamed: 0,table,rows,has_all_keys
0,enrollment,983072,True
1,demographic,2071700,True
2,biometric,1861108,True


In [213]:
# Aim: Evaluate key overlap ratios from Enrollment to Demographic/Biometric.
# Expected Output: Join coverage percentages.
# What You Get: Quantified relationship feasibility at key grain.
# Data Engineer Learning: Coverage metrics tell whether conformed dimensions are achievable.

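def prep_key(df):
    """Build normalized join keys (ISO date, lowercased names, text pincode)."""
    x = df.copy()
    x['date_key'] = pd.to_datetime(x['date'], dayfirst=True, errors='coerce').dt.strftime('%Y-%m-%d')
    x['state_key'] = x['state'].astype(str).str.strip().str.lower()
    x['district_key'] = x['district'].astype(str).str.strip().str.lower()
    x['pincode_key'] = x['pincode'].astype(str).str.replace(r'\.0$', '', regex=True).str.strip()
    # Note: astype(str) turns missing values into the text 'nan', so dropna() below
    # only removes rows whose date failed to parse.
    return x[['date_key', 'state_key', 'district_key', 'pincode_key']].dropna().drop_duplicates()

# Reuse enroll_df from the previous cell so the eda_df fallback also applies here.
enroll_keys = prep_key(enroll_df)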
demo_keys = prep_key(demo_df)
bio_keys = prep_key(bio_df)

enroll_not_demo = enroll_keys.merge(demo_keys, how='left', on=enroll_keys.columns.tolist(), indicator=True)
enroll_not_demo = enroll_not_demo[enroll_not_demo['_merge'] == 'left_only']

enroll_not_bio = enroll_keys.merge(bio_keys, how='left', on=enroll_keys.columns.tolist(), indicator=True)
enroll_not_bio = enroll_not_bio[enroll_not_bio['_merge'] == 'left_only']

coverage = pd.DataFrame([
    {'metric': 'enroll_key_count', 'value': len(enroll_keys)},
    {'metric': 'enroll_in_demo_pct', 'value': round(100 * (1 - len(enroll_not_demo)/max(len(enroll_keys),1)), 4)},
    {'metric': 'enroll_in_bio_pct', 'value': round(100 * (1 - len(enroll_not_bio)/max(len(enroll_keys),1)), 4)},
])

display(coverage)


Unnamed: 0,metric,value
0,enroll_key_count,982290.0
1,enroll_in_demo_pct,66.3304
2,enroll_in_bio_pct,73.8784
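
The enrollment table has 983072 rows with zero raw key duplicates, yet `prep_key` yields 982290 distinct keys. The gap of 782 suggests raw keys that differ only by case or surrounding whitespace and collapse once normalized. A hedged sketch for listing such collisions, using made-up toy rows:

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    'date': ['09-03-2025', '09-03-2025', '09-03-2025'],
    'state': ['West Bengal', 'West Bengal', 'Bihar'],
    'district': ['Hooghly', 'HOOGHLY', 'Sitamarhi'],
    'pincode': [712101, 712101, 843331],
})
key = ['date', 'state', 'district', 'pincode']

# Apply the same kind of normalization as prep_key to the name columns.
norm = toy.copy()
for col in ['state', 'district']:
    norm[col] = norm[col].astype(str).str.strip().str.lower()

raw_dups = int(toy.duplicated(subset=key).sum())
norm_dups = int(norm.duplicated(subset=key).sum())
# Original rows whose normalized key is shared with at least one other row.
collisions = toy[norm.duplicated(subset=key, keep=False)]

print('raw:', raw_dups, 'normalized:', norm_dups)
print(collisions)
```

If the real data shows the same pattern, the natural key is only unique before cleaning, and the dedup rule should run after names are canonicalized.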


## STEP 8 - Volume & Cardinality Analysis
Cardinality guides partitioning and indexing strategy.


### Integrated 63-Step Cells For This Section
Included steps: 35, 36, 37, 38, 48, 49, 50


#### Integrated Step 35 (from eroll.ipynb)


In [214]:
# Aim: Count unique pincodes in the deduplicated enrollment data.
# Expected Output: one line with the distinct pincode count.
# What You Get: The size of the pincode dimension.
# Data Engineer Learning: Distinct counts on location keys size dimension tables and inform partitioning.

print('full_unique_pincode', data_aadhar_enrollment_dedup['pincode'].nunique())


full_unique_pincode 19463


#### Integrated Step 36 (from eroll.ipynb)


In [215]:
# Aim: Count unique pincodes in the Uttar Pradesh subset.
# Expected Output: none while the print below stays commented out.
# What You Get: Once enabled, the distinct pincode count for Uttar Pradesh.
# Data Engineer Learning: Per-state distinct counts show how unevenly a dimension is spread.

# print('up_unique_pincode', data_aadhar_enrollment_full_uttar_pradesh['pincode'].nunique())


#### Integrated Step 37 (from eroll.ipynb)


#### Integrated Step 38 (from eroll.ipynb)


#### Integrated Step 48 (from eroll.ipynb)


#### Integrated Step 49 (from eroll.ipynb)


#### Integrated Step 50 (from eroll.ipynb)


In [216]:
# Aim: Monthly enrollment trend.
# Expected Output: one row per month with summed total_enrollment, sorted by month.
# What You Get: A month-level volume baseline.
# Data Engineer Learning: Monthly volume profiles expose load skew, which matters when choosing partitions.

monthly_trend = (
    eda_enroll.dropna(subset=['date'])
    .assign(month=lambda d: d['date'].dt.to_period('M').astype(str))
    .groupby('month', as_index=False)['total_enrollment']
    .sum()
    .sort_values('month')
)
display(monthly_trend)


Unnamed: 0,month,total_enrollment
0,2025-03,16582
1,2025-04,257438
2,2025-05,183616
3,2025-06,215734
4,2025-07,616868
5,2025-09,1475879
6,2025-10,779617
7,2025-11,1052584
8,2025-12,733442


In [217]:
# Aim: Run explicit Step 8 cardinality and distribution checks.
# Expected Output: unique state count, district-per-state table, unique pincode count.
# What You Get: clear dimensional cardinality evidence for schema design.
# Data Engineer Learning: cardinality drives partitioning, indexing, and dimension sizing.

df = eda_enroll.copy() if 'eda_enroll' in globals() else (eda_df.copy() if 'eda_df' in globals() else data_aadhar_enrollment_full.copy())

print('Unique states:', df['state'].nunique() if 'state' in df.columns else 'state column missing')
if all(c in df.columns for c in ['state', 'district']):
    state_district_cardinality = df.groupby('state')['district'].nunique().sort_values(ascending=False)
    display(state_district_cardinality.head(20).to_frame('district_nunique'))
else:
    print('state/district columns missing')

print('Unique pincodes:', df['pincode'].nunique() if 'pincode' in df.columns else 'pincode column missing')


Unique states: 55


state,district_nunique
Uttar Pradesh,89
Madhya Pradesh,61
West Bengal,58
Karnataka,56
Maharashtra,53
Bihar,48
Andhra Pradesh,47
Tamil Nadu,46
Odisha,45
Rajasthan,43


Unique pincodes: 19463


## STEP 9 - Data Consistency Across Tables
Check whether Enrollment contains keys/dimensions missing in Demographic/Biometric and vice versa.


### Integrated 63-Step Cells For This Section
Included steps: 39, 40, 51, 52


#### Integrated Step 39 (from eroll.ipynb)


#### Integrated Step 40 (from eroll.ipynb)


In [218]:
# Self-contained Step 39 + 40
df = eda_df if 'eda_df' in globals() else data_aadhar_enrollment_full

up_df = df[df['state'].astype(str).str.strip().str.lower().eq('uttar pradesh')].copy()

pin_district_count = (
    up_df.groupby('pincode', as_index=False)['district']
    .nunique()
    .rename(columns={'district': 'district_count'})
)

problem_pins = (
    pin_district_count[pin_district_count['district_count'] > 1]
    .sort_values('district_count', ascending=False)
)

print('UP rows:', len(up_df))
print('problem_pins:', len(problem_pins))
display(problem_pins.head(40))

UP rows: 108066
problem_pins: 404


Unnamed: 0,pincode,district_count
1065,244102,4
493,221306,3
495,221308,3
599,224132,3
103,203131,3
497,221310,3
106,203141,3
1496,274304,3
500,221314,3
108,203155,3


#### Integrated Step 51 (from eroll.ipynb)


#### Integrated Step 52 (from eroll.ipynb)


In [219]:
# Aim: Compute cross-table consistency mismatches at shared grain.
# Expected Output: mismatch counts and sample keys.
# What You Get: Data consistency diagnostics between tables.
# Data Engineer Learning: Cross-table consistency is mandatory before conformed modeling.

grain_cols = ['date_key', 'state_key', 'district_key', 'pincode_key']

demo_not_enroll = demo_keys.merge(enroll_keys, how='left', on=grain_cols, indicator=True)
demo_not_enroll = demo_not_enroll[demo_not_enroll['_merge'] == 'left_only'].drop(columns=['_merge'])

bio_not_enroll = bio_keys.merge(enroll_keys, how='left', on=grain_cols, indicator=True)
bio_not_enroll = bio_not_enroll[bio_not_enroll['_merge'] == 'left_only'].drop(columns=['_merge'])

consistency_summary = pd.DataFrame([
    {'metric': 'enroll_not_in_demo', 'value': len(enroll_not_demo)},
    {'metric': 'enroll_not_in_bio', 'value': len(enroll_not_bio)},
    {'metric': 'demo_not_in_enroll', 'value': len(demo_not_enroll)},
    {'metric': 'bio_not_in_enroll', 'value': len(bio_not_enroll)},
])

display(consistency_summary)
print('Sample enrollment keys missing in demographic:')
display(enroll_not_demo.head(20))
print('Sample enrollment keys missing in biometric:')
display(enroll_not_bio.head(20))


Unnamed: 0,metric,value
0,enroll_not_in_demo,330733
1,enroll_not_in_bio,256590
2,demo_not_in_enroll,944962
3,bio_not_in_enroll,1038552


Sample enrollment keys missing in demographic:


Unnamed: 0,date_key,state_key,district_key,pincode_key,_merge
0,2025-03-02,meghalaya,east khasi hills,793121,left_only
1,2025-03-09,karnataka,bengaluru urban,560043,left_only
2,2025-03-09,uttar pradesh,kanpur nagar,208001,left_only
3,2025-03-09,uttar pradesh,aligarh,202133,left_only
4,2025-03-09,karnataka,bengaluru urban,560016,left_only
5,2025-03-09,bihar,sitamarhi,843331,left_only
6,2025-03-09,bihar,sitamarhi,843330,left_only
7,2025-03-09,uttar pradesh,bahraich,271865,left_only
8,2025-03-09,uttar pradesh,firozabad,283204,left_only
9,2025-03-09,bihar,purbi champaran,845418,left_only


Sample enrollment keys missing in biometric:


Unnamed: 0,date_key,state_key,district_key,pincode_key,_merge
0,2025-03-02,meghalaya,east khasi hills,793121,left_only
1,2025-03-09,karnataka,bengaluru urban,560043,left_only
2,2025-03-09,uttar pradesh,kanpur nagar,208001,left_only
3,2025-03-09,uttar pradesh,aligarh,202133,left_only
4,2025-03-09,karnataka,bengaluru urban,560016,left_only
5,2025-03-09,bihar,sitamarhi,843331,left_only
6,2025-03-09,bihar,sitamarhi,843330,left_only
7,2025-03-09,uttar pradesh,bahraich,271865,left_only
8,2025-03-09,uttar pradesh,firozabad,283204,left_only
9,2025-03-09,bihar,purbi champaran,845418,left_only


In [220]:
# Aim: Save consistency report artifacts for auditability.
# Expected Output: CSV files in consistency_reports folder.
# What You Get: Reproducible artifacts for review and handoff.
# Data Engineer Learning: Persist intermediate diagnostics as evidence, not just notebook output.

report_dir = Path('scripts/EDA/consistency_reports')
report_dir.mkdir(parents=True, exist_ok=True)

consistency_summary.to_csv(report_dir / 'cross_table_coverage_summary.csv', index=False)
enroll_not_demo.to_csv(report_dir / 'enrollment_keys_missing_in_demographic.csv', index=False)
enroll_not_bio.to_csv(report_dir / 'enrollment_keys_missing_in_biometric.csv', index=False)
demo_not_enroll.to_csv(report_dir / 'demographic_keys_missing_in_enrollment.csv', index=False)
bio_not_enroll.to_csv(report_dir / 'biometric_keys_missing_in_enrollment.csv', index=False)

print('Saved reports to:', report_dir)


Saved reports to: scripts\EDA\consistency_reports


### Added for Checklist Alignment: Step 9 Cross-Column Consistency
This cell adds explicit pincode->district and district->state stability checks.


In [221]:
# Aim: Run explicit Step 9 cross-column consistency checks.
# Expected Output: pincode->district and district->state uniqueness distributions.
# What You Get: stability signal for geo relationships and join reliability.
# Data Engineer Learning: unstable many-to-many geo keys create referential-quality risks.

df = eda_enroll.copy() if 'eda_enroll' in globals() else (eda_df.copy() if 'eda_df' in globals() else data_aadhar_enrollment_full.copy())

if all(c in df.columns for c in ['pincode', 'district']):
    pin_to_district = df.groupby('pincode')['district'].nunique().sort_values(ascending=False)
    print('Pincodes mapped to >1 district:', int((pin_to_district > 1).sum()))
    display(pin_to_district.head(20).to_frame('district_nunique'))
else:
    print('pincode/district columns missing')

if all(c in df.columns for c in ['district', 'state']):
    district_to_state = df.groupby('district')['state'].nunique().sort_values(ascending=False)
    print('Districts mapped to >1 state:', int((district_to_state > 1).sum()))
    display(district_to_state.head(20).to_frame('state_nunique'))
else:
    print('district/state columns missing')


Pincodes mapped to >1 district: 6576


pincode,district_nunique
500090,7
500037,6
450661,6
713130,6
571442,6
500014,6
533464,6
831002,6
509371,6
509339,6


Districts mapped to >1 state: 78


district,state_nunique
Hooghly,4
Kargil,3
Daman,3
Diu,3
Doda,3
HOOGHLY,3
Bargarh,2
Aurangabad,2
Bijapur,2
Cuttack,2
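
One possible way to act on these conflicts, shown only as a sketch: pick the most frequent district per pincode as a candidate canonical value and measure how many rows already agree with it. This majority-vote rule is an assumption, not the notebook's method, and the toy district assignments below are invented. Part of the conflict count may also come from spelling variants such as `Hooghly` and `HOOGHLY`, so names should be normalized before counting.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    'pincode': [500090, 500090, 500090, 500037, 500037],
    'district': ['Hyderabad', 'Hyderabad', 'Rangareddy', 'Medchal', 'Medchal'],
})

# Row count per (pincode, district) pair, largest first within each pincode.
pair_counts = (
    toy.groupby(['pincode', 'district'], as_index=False)
    .size()
    .sort_values(['pincode', 'size'], ascending=[True, False])
)
# Most frequent district per pincode as the candidate canonical value.
dominant = pair_counts.drop_duplicates('pincode').set_index('pincode')['district']
# Share of rows that already agree with the dominant district.
agreement = (toy['district'] == toy['pincode'].map(dominant)).mean()

print(dominant)
print('agreement:', agreement)
```

A low agreement share for a pincode would argue for quarantine or a manual lookup instead of an automatic remap.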


## STEP 10 - Document Findings
Mandatory documentation fields:
- Grain
- Natural Key
- Identified issues
- Data inconsistencies
- Expected fixes
- Modeling direction


In [222]:
# Aim: Assemble the final findings table from measured metrics.
# Expected Output: eda_findings_doc displayed with the six mandatory sections (this cell writes no files).
# What You Get: A single table ready to export for portfolio/review.
# Data Engineer Learning: Good notebooks produce durable, shareable outputs.

import pandas as pd
import numpy as np

# Use available dataframe
df = eda_df if 'eda_df' in globals() else data_aadhar_enrollment_full

# Core metrics
dup_full = int(df.duplicated().sum())
dup_key = int(df.duplicated(subset=['date','state','district','pincode']).sum()) if all(c in df.columns for c in ['date','state','district','pincode']) else np.nan
null_total = int(df.isna().sum().sum())

if all(c in df.columns for c in ['age_0_5','age_5_17','age_18_greater']):
    neg_rows = int((df[['age_0_5','age_5_17','age_18_greater']] < 0).any(axis=1).sum())
else:
    neg_rows = np.nan

# Optional pincode conflict check
if all(c in df.columns for c in ['pincode','district']):
    pin_conflicts = int((df.groupby('pincode')['district'].nunique() > 1).sum())
else:
    pin_conflicts = np.nan

eda_findings_doc = pd.DataFrame([
    {'section': 'Grain', 'details': 'One row = enrollment counts for one date-state-district-pincode combination.'},
    {'section': 'Natural Key', 'details': '(date, state, district, pincode)'},
    {'section': 'Identified issues', 'details': f'Full duplicates={dup_full}; Key duplicates={dup_key}; Total nulls={null_total}; Negative-age rows={neg_rows}'},
    {'section': 'Data inconsistencies', 'details': f'Pincode mapped to multiple districts={pin_conflicts}'},
    {'section': 'Expected fixes', 'details': 'Deduplicate on natural key, enforce contracts, standardize names using alias mapping, quarantine invalid rows.'},
    {'section': 'Modeling direction', 'details': 'Use cleaned location/date grain, conformed dimensions, and publish only after quality checks pass.'},
])

display(eda_findings_doc)

Unnamed: 0,section,details
0,Grain,One row = enrollment counts for one date-state...
1,Natural Key,"(date, state, district, pincode)"
2,Identified issues,Full duplicates=0; Key duplicates=0; Total nul...
3,Data inconsistencies,Pincode mapped to multiple districts=6576
4,Expected fixes,"Deduplicate on natural key, enforce contracts,..."
5,Modeling direction,"Use cleaned location/date grain, conformed dim..."


## Advanced Track (After Step 10)
These advanced analyses are optional but recommended for production-readiness.


### Integrated 63-Step Cells For This Section
Included steps: 60, 61, 62, 63


#### Integrated Step 60 (from eroll.ipynb)


#### Integrated Step 61 (from eroll.ipynb)


#### Integrated Step 62 (from eroll.ipynb)


In [223]:
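# Aim: Check data contracts, distribution drift against the first month (PSI), and state-month anomaly classes.
# Expected Output: contract failure counts, one PSI value per later month, and any flagged state-month rows.
# What You Get: Evidence on format validity, distribution shift, and suspicious volume jumps.
# Data Engineer Learning: Contracts, drift, and anomaly rules are the checks that later become pipeline monitors.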

import pandas as pd
import numpy as np
from pathlib import Path

# Self-contained dataset bootstrap for Advanced block
if 'eda_enroll' in globals():
    pass
elif 'eda_df' in globals():
    eda_enroll = eda_df.copy()
elif 'data_aadhar_enrollment_full' in globals():
    eda_enroll = data_aadhar_enrollment_full.copy()
else:
    candidate_files = [
        Path('scripts/EDA/panda_eda/data/data_aadhar_enrollment_full.csv'),
        Path('scripts/EDA/panda_eda/eda_enrollment/data/data_aadhar_enrollment_full.csv'),
    ]
    src = next((f for f in candidate_files if f.exists()), None)
    if src is None:
        raise FileNotFoundError(f'Enrollment file not found in: {candidate_files}')
    eda_enroll = pd.read_csv(src)

if 'date' in eda_enroll.columns:
    eda_enroll['date'] = pd.to_datetime(eda_enroll['date'], errors='coerce', dayfirst=True)

for col in ['age_0_5', 'age_5_17', 'age_18_greater']:
    if col in eda_enroll.columns:
        eda_enroll[col] = pd.to_numeric(eda_enroll[col], errors='coerce')

if 'total_enrollment' not in eda_enroll.columns and all(c in eda_enroll.columns for c in ['age_0_5', 'age_5_17', 'age_18_greater']):
    eda_enroll['total_enrollment'] = (
        eda_enroll['age_0_5'].fillna(0)
        + eda_enroll['age_5_17'].fillna(0)
        + eda_enroll['age_18_greater'].fillna(0)
    )

strict_date_ok = pd.to_datetime(eda_enroll['date'], errors='coerce').notna() if 'date' in eda_enroll.columns else pd.Series([False] * len(eda_enroll))
pincode_ok = eda_enroll['pincode'].astype(str).str.replace(r'\.0$', '', regex=True).str.match(r'^\d{6}$', na=False) if 'pincode' in eda_enroll.columns else pd.Series([False] * len(eda_enroll))
non_negative_ok = (eda_enroll[['age_0_5', 'age_5_17', 'age_18_greater']] >= 0).all(axis=1) if all(c in eda_enroll.columns for c in ['age_0_5', 'age_5_17', 'age_18_greater']) else pd.Series([False] * len(eda_enroll))

contract = pd.DataFrame([
    {'check': 'valid_date', 'failed_rows': int((~strict_date_ok).sum())},
    {'check': 'valid_6_digit_pincode', 'failed_rows': int((~pincode_ok).sum())},
    {'check': 'non_negative_age', 'failed_rows': int((~non_negative_ok).sum())},
])

month_vals = {}
if 'date' in eda_enroll.columns and 'total_enrollment' in eda_enroll.columns:
    month_vals = {
        m: g['total_enrollment'].values
        for m, g in eda_enroll.dropna(subset=['date']).assign(month=lambda d: d['date'].dt.to_period('M').astype(str)).groupby('month')
    }

months = sorted(month_vals.keys())
drift_rows = []
if len(months) >= 2:
    base = month_vals[months[0]]
    # Decile edges depend on the baseline month only, so compute them once.
    edges = np.unique(np.quantile(base, np.linspace(0, 1, 11)))
    for m in months[1:]:
        cur = month_vals[m]
        if len(edges) >= 3:
            e, _ = np.histogram(base, bins=edges)
            # Caveat: np.histogram ignores values outside [edges[0], edges[-1]],
            # so current-month values beyond the baseline range are not counted.
            a, _ = np.histogram(cur, bins=edges)
            er = np.clip(e / max(e.sum(), 1), 1e-6, None)
            ar = np.clip(a / max(a.sum(), 1), 1e-6, None)
            psi = float(np.sum((ar - er) * np.log(ar / er)))
        else:
            psi = np.nan
        drift_rows.append({'month': m, 'psi_total_enrollment': psi})

drift = pd.DataFrame(drift_rows)

# Build state_month in-place to avoid run-order dependency
if 'date' in eda_enroll.columns and 'total_enrollment' in eda_enroll.columns and 'state' in eda_enroll.columns:
    state_month = (
        eda_enroll.dropna(subset=['date'])
        .assign(month=lambda d: d['date'].dt.to_period('M').astype(str))
        .groupby(['state', 'month'], as_index=False)['total_enrollment']
        .sum()
        .sort_values(['state', 'month'])
    )
    state_month['pct_change'] = state_month.groupby('state')['total_enrollment'].pct_change()
    state_month['z'] = (
        (state_month['total_enrollment'] - state_month.groupby('state')['total_enrollment'].transform('mean'))
        / state_month.groupby('state')['total_enrollment'].transform('std').replace(0, np.nan)
    )
    anom = state_month.copy()
    anom['anomaly_class'] = np.where(
        (anom['pct_change'].abs() > 0.6) & (anom['z'].abs() > 3),
        'likely_data_issue',
        'normal'
    )
else:
    anom = pd.DataFrame(columns=['state', 'month', 'total_enrollment', 'pct_change', 'z', 'anomaly_class'])

display(contract)
display(drift)
display(anom[anom['anomaly_class'] != 'normal'].head(40))


Unnamed: 0,check,failed_rows
0,valid_date,0
1,valid_6_digit_pincode,0
2,non_negative_age,0


Unnamed: 0,month,psi_total_enrollment
0,2025-04,1.135077
1,2025-05,0.976072
2,2025-06,1.215223
3,2025-07,1.534281
4,2025-09,3.903214
5,2025-10,2.320635
6,2025-11,2.409039
7,2025-12,2.201698
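
The drift table reports the Population Stability Index (PSI) of each month's row-level `total_enrollment` against the first month, using baseline decile bins. The sketch below restates that calculation as a reusable helper. The name `psi` and its parameters are illustrative, and clipping current values into the baseline range is an assumption that differs from the cell above, where out-of-range values are dropped by `np.histogram`.

```python
import numpy as np

def psi(base, cur, n_bins=10, eps=1e-6):
    """PSI of `cur` against `base`, binned on baseline quantiles."""
    base = np.asarray(base, dtype=float)
    cur = np.asarray(cur, dtype=float)
    edges = np.unique(np.quantile(base, np.linspace(0, 1, n_bins + 1)))
    if len(edges) < 3:
        return float('nan')  # too few distinct values to form bins
    # Assumption: fold out-of-range values into the outer bins instead of dropping them.
    cur = np.clip(cur, edges[0], edges[-1])
    e, _ = np.histogram(base, bins=edges)
    a, _ = np.histogram(cur, bins=edges)
    er = np.clip(e / e.sum(), eps, None)  # expected (baseline) shares
    ar = np.clip(a / a.sum(), eps, None)  # actual (current) shares
    return float(np.sum((ar - er) * np.log(ar / er)))

baseline = np.arange(1, 101)
print('identical:', psi(baseline, baseline))
print('shifted  :', psi(baseline, baseline + 50))
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. Every month in the table exceeds that, but the baseline month (2025-03) holds only 168 rows in this dataset, so the large values may mostly reflect a thin baseline.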


Unnamed: 0,state,month,total_enrollment,pct_change,z,anomaly_class
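
The empty anomaly table should be read with care. With a sample standard deviation (`ddof=1`, the pandas default), a single point among `n` values can reach at most `|z| = (n - 1) / sqrt(n)`. Each state has at most 9 monthly values here, so the bound is about 2.67 and the `z.abs() > 3` condition cannot fire. The sketch below checks the bound numerically; `max_abs_z` is an illustrative helper name.

```python
import numpy as np

def max_abs_z(n):
    """Largest |z| one point can reach among n values when std uses ddof=1."""
    return (n - 1) / np.sqrt(n)

# Extreme case for 9 months: one spike, everything else identical.
values = np.array([0.0] * 8 + [1_000_000.0])
z = (values - values.mean()) / values.std(ddof=1)

print('observed max |z|:', round(float(np.abs(z).max()), 4))
print('bound for n=9   :', round(float(max_abs_z(9)), 4))
print('smallest n with bound > 3:', next(n for n in range(2, 50) if max_abs_z(n) > 3))
```

For short histories, a lower z threshold or a median-based score would be needed before an empty result can count as evidence of no anomalies.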


#### Integrated Step 63 (from eroll.ipynb)


### Added for Checklist Alignment: Step 12 Trend and Time Analysis
This cell adds the exact monthly trend flow: create month then aggregate enrollment.


In [224]:
# Aim: Build a simple monthly enrollment trend table.
# Expected Output: month-wise total_enrollment series sorted by month.
# What You Get: Clear trend baseline for growth/drop interpretation.
# Data Engineer Learning: Time-bucketing is core for monitoring and anomaly review.

import pandas as pd

if 'eda_enroll' in globals():
    df = eda_enroll.copy()
elif 'eda_df' in globals():
    df = eda_df.copy()
elif 'data_aadhar_enrollment_full' in globals():
    df = data_aadhar_enrollment_full.copy()
else:
    raise NameError('No base dataframe found. Run Step 62 cell first.')

df['date'] = pd.to_datetime(df['date'], errors='coerce', dayfirst=True)
if 'total_enrollment' not in df.columns and all(c in df.columns for c in ['age_0_5', 'age_5_17', 'age_18_greater']):
    df['total_enrollment'] = df['age_0_5'].fillna(0) + df['age_5_17'].fillna(0) + df['age_18_greater'].fillna(0)

df['month'] = df['date'].dt.to_period('M').astype(str)
monthly = df.groupby('month', as_index=False)['total_enrollment'].sum().sort_values('month')
display(monthly.head(24))
print('Total months:', monthly['month'].nunique())


Unnamed: 0,month,total_enrollment
0,2025-03,16582
1,2025-04,257438
2,2025-05,183616
3,2025-06,215734
4,2025-07,616868
5,2025-09,1475879
6,2025-10,779617
7,2025-11,1052584
8,2025-12,733442


Total months: 9


In [225]:
# Overall Review Cell 1/15: prepare stable review dataframe
import pandas as pd
import numpy as np
from pathlib import Path

if 'eda_enroll' in globals():
    review_df = eda_enroll.copy()
elif 'eda_df' in globals():
    review_df = eda_df.copy()
elif 'data_aadhar_enrollment_full' in globals():
    review_df = data_aadhar_enrollment_full.copy()
else:
    candidate_files = [
        Path('scripts/EDA/panda_eda/data/data_aadhar_enrollment_full.csv'),
        Path('scripts/EDA/panda_eda/eda_enrollment/data/data_aadhar_enrollment_full.csv'),
    ]
    src = next((f for f in candidate_files if f.exists()), None)
    if src is None:
        raise FileNotFoundError(f'Enrollment file not found in: {candidate_files}')
    review_df = pd.read_csv(src)

if 'date' in review_df.columns:
    review_df['date'] = pd.to_datetime(review_df['date'], errors='coerce', dayfirst=True)

for c in ['age_0_5', 'age_5_17', 'age_18_greater']:
    if c in review_df.columns:
        review_df[c] = pd.to_numeric(review_df[c], errors='coerce')

if 'total_enrollment' not in review_df.columns and all(c in review_df.columns for c in ['age_0_5', 'age_5_17', 'age_18_greater']):
    review_df['total_enrollment'] = review_df['age_0_5'].fillna(0) + review_df['age_5_17'].fillna(0) + review_df['age_18_greater'].fillna(0)

# Ensure summary_metrics exists for export cell
summary_metrics = pd.DataFrame([
    {'metric': 'rows', 'value': int(len(review_df))},
    {'metric': 'columns', 'value': int(review_df.shape[1])},
    {'metric': 'null_cells_total', 'value': int(review_df.isna().sum().sum())},
    {'metric': 'full_row_duplicates', 'value': int(review_df.duplicated().sum())},
])

print('review_df shape:', review_df.shape)
print('columns:', len(review_df.columns))


review_df shape: (983072, 8)
columns: 8


In [226]:
# Overall Review Cell 3/15: schema and datatype review
schema_review = pd.DataFrame({
    'column': review_df.columns,
    'dtype': [str(review_df[c].dtype) for c in review_df.columns],
    'non_null_count': [int(review_df[c].notna().sum()) for c in review_df.columns],
    'null_count': [int(review_df[c].isna().sum()) for c in review_df.columns],
    'null_pct': [round(100 * review_df[c].isna().mean(), 4) for c in review_df.columns],
})
display(schema_review.sort_values(['null_pct', 'column'], ascending=[False, True]).head(25))


Unnamed: 0,column,dtype,non_null_count,null_count,null_pct
4,age_0_5,int64,983072,0,0.0
6,age_18_greater,int64,983072,0,0.0
5,age_5_17,int64,983072,0,0.0
0,date,datetime64[ns],983072,0,0.0
2,district,object,983072,0,0.0
3,pincode,int64,983072,0,0.0
1,state,object,983072,0,0.0
7,total_enrollment,int64,983072,0,0.0


In [227]:
# Overall Review Cell 4/15: key completeness and key-null checks
key_cols = ['date', 'state', 'district', 'pincode']
key_exists = [c for c in key_cols if c in review_df.columns]
key_nulls = {c: int(review_df[c].isna().sum()) for c in key_exists}
key_nulls_df = pd.DataFrame([
    {'key_column': k, 'null_count': v, 'null_pct': round(100 * v / max(len(review_df), 1), 4)}
    for k, v in key_nulls.items()
])
print('Available key columns:', key_exists)
display(key_nulls_df)


Available key columns: ['date', 'state', 'district', 'pincode']


Unnamed: 0,key_column,null_count,null_pct
0,date,0,0.0
1,state,0,0.0
2,district,0,0.0
3,pincode,0,0.0


In [228]:
# Overall Review Cell 5/15: duplicate audit
if all(c in review_df.columns for c in ['date', 'state', 'district', 'pincode']):
    dup_key = int(review_df.duplicated(subset=['date', 'state', 'district', 'pincode']).sum())
else:
    dup_key = np.nan

dup_full = int(review_df.duplicated().sum())

duplicate_audit = pd.DataFrame([
    {'metric': 'full_row_duplicates', 'value': dup_full},
    {'metric': 'key_duplicates_raw_key', 'value': dup_key},
])
display(duplicate_audit)


Unnamed: 0,metric,value
0,full_row_duplicates,0
1,key_duplicates_raw_key,0


In [229]:
# Overall Review Cell 6/15: missing and empty diagnostics
null_pct = (review_df.isna().mean() * 100).round(4)
empty_cnt = {}
for c in review_df.columns:
    if review_df[c].dtype == 'object':
        empty_cnt[c] = int(review_df[c].astype(str).str.strip().eq('').sum())
    else:
        empty_cnt[c] = 0

missing_diag = pd.DataFrame({
    'column': review_df.columns,
    'null_pct': [null_pct[c] for c in review_df.columns],
    'empty_string_count': [empty_cnt[c] for c in review_df.columns],
    'combined_missing_like': [int(review_df[c].isna().sum()) + int(empty_cnt[c]) for c in review_df.columns],
}).sort_values(['combined_missing_like', 'null_pct'], ascending=[False, False])

display(missing_diag.head(25))


Unnamed: 0,column,null_pct,empty_string_count,combined_missing_like
0,date,0.0,0,0
1,state,0.0,0,0
2,district,0.0,0,0
3,pincode,0.0,0,0
4,age_0_5,0.0,0,0
5,age_5_17,0.0,0,0
6,age_18_greater,0.0,0,0
7,total_enrollment,0.0,0,0


In [230]:
# Overall Review Cell 7/15: date coverage and monthly completeness
if 'date' in review_df.columns:
    min_date = review_df['date'].min()
    max_date = review_df['date'].max()
    review_df['month_review'] = review_df['date'].dt.to_period('M').astype(str)
    month_counts = review_df.groupby('month_review', as_index=False).size().rename(columns={'size':'row_count'})
    print('date_min:', min_date)
    print('date_max:', max_date)
    display(month_counts.tail(24))
else:
    print('date column not available for coverage check')


date_min: 2025-03-02 00:00:00
date_max: 2025-12-31 00:00:00


Unnamed: 0,month_review,row_count
0,2025-03,168
1,2025-04,847
2,2025-05,549
3,2025-06,582
4,2025-07,1184
5,2025-09,356059
6,2025-10,203488
7,2025-11,264183
8,2025-12,156012
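
The row counts above list months but do not flag gaps. A minimal sketch for detecting missing months, using the month labels from the output above and comparing them with a continuous monthly range:

```python
import pandas as pd

# Months observed in the output above.
observed = pd.PeriodIndex(
    ['2025-03', '2025-04', '2025-05', '2025-06', '2025-07',
     '2025-09', '2025-10', '2025-11', '2025-12'],
    freq='M',
)
# Every month between the first and last observed month.
expected = pd.period_range(observed.min(), observed.max(), freq='M')
missing = expected.difference(observed)

print('missing months:', [str(p) for p in missing])
```

In the notebook, `observed` could be built from `review_df['date'].dt.to_period('M').unique()` instead of a literal list.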


In [231]:
# Overall Review Cell 12/15: trend review with MoM growth
if 'date' in review_df.columns and 'total_enrollment' in review_df.columns:
    trend = (
        review_df.dropna(subset=['date'])
        .assign(month=lambda d: d['date'].dt.to_period('M').astype(str))
        .groupby('month', as_index=False)['total_enrollment']
        .sum()
        .sort_values('month')
    )
    trend['mom_growth_pct'] = trend['total_enrollment'].pct_change() * 100
    display(trend.tail(24))
else:
    print('date/total_enrollment not available for trend review')


Unnamed: 0,month,total_enrollment,mom_growth_pct
0,2025-03,16582,
1,2025-04,257438,1452.514775
2,2025-05,183616,-28.675642
3,2025-06,215734,17.49194
4,2025-07,616868,185.939166
5,2025-09,1475879,139.253617
6,2025-10,779617,-47.17609
7,2025-11,1052584,35.012961
8,2025-12,733442,-30.31986
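
Because 2025-08 is absent, the 139.25% shown for 2025-09 compares September with July, so it is a two-month change labeled as month-over-month. A hedged sketch of a gap-aware variant: reindex to a continuous monthly range first, so growth is only computed between adjacent calendar months. The four totals are copied from the table above.

```python
import pandas as pd

trend = pd.DataFrame({
    'month': ['2025-06', '2025-07', '2025-09', '2025-10'],
    'total_enrollment': [215734, 616868, 1475879, 779617],
})

# Index by calendar month, then insert the missing months as NaN.
s = trend.set_index(pd.PeriodIndex(trend['month'], freq='M'))['total_enrollment']
full = s.reindex(pd.period_range(s.index.min(), s.index.max(), freq='M'))

# Growth versus the previous calendar month; NaN wherever either side is missing.
mom = (full / full.shift(1) - 1) * 100
print(mom.round(3))
```

With this approach both 2025-08 and 2025-09 show no growth value, which makes the gap visible instead of hiding it inside a large percentage.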


In [232]:
# Overall Review Cell 13/15: risk register table
risk_rows = [
    {'risk_id':'R1', 'risk':'Duplicate business keys', 'severity':'High', 'check':'key duplicate count', 'status':'Open'},
    {'risk_id':'R2', 'risk':'Null key fields', 'severity':'High', 'check':'nulls in date/state/district/pincode', 'status':'Open'},
    {'risk_id':'R3', 'risk':'District/state naming inconsistency', 'severity':'Medium', 'check':'mapping coverage and mismatch rows', 'status':'Open'},
    {'risk_id':'R4', 'risk':'Pincode referential conflicts', 'severity':'Medium', 'check':'pincode->district/state nunique', 'status':'Open'},
    {'risk_id':'R5', 'risk':'Outlier spikes', 'severity':'Medium', 'check':'z-score and pct-change anomalies', 'status':'Open'},
]
final_eda_risk_register = pd.DataFrame(risk_rows)
display(final_eda_risk_register)


Unnamed: 0,risk_id,risk,severity,check,status
0,R1,Duplicate business keys,High,key duplicate count,Open
1,R2,Null key fields,High,nulls in date/state/district/pincode,Open
2,R3,District/state naming inconsistency,Medium,mapping coverage and mismatch rows,Open
3,R4,Pincode referential conflicts,Medium,pincode->district/state nunique,Open
4,R5,Outlier spikes,Medium,z-score and pct-change anomalies,Open


In [233]:
# Overall Review Cell 14/15: rule status table
rule_rows = [
    {'rule':'Natural key uniqueness', 'expected':'0 duplicates', 'current':'See duplicate audit cell', 'pass_flag':'Check'},
    {'rule':'Key null checks', 'expected':'0 null keys', 'current':'See key completeness cell', 'pass_flag':'Check'},
    {'rule':'No negative age counts', 'expected':'0 negative rows', 'current':'See negative checks', 'pass_flag':'Check'},
    {'rule':'Pincode stability', 'expected':'minimal conflicts', 'current':'See pincode consistency cell', 'pass_flag':'Check'},
    {'rule':'Trend stability', 'expected':'explainable spikes', 'current':'See trend/outlier cells', 'pass_flag':'Check'},
]
final_eda_rule_status = pd.DataFrame(rule_rows)
display(final_eda_rule_status)


Unnamed: 0,rule,expected,current,pass_flag
0,Natural key uniqueness,0 duplicates,See duplicate audit cell,Check
1,Key null checks,0 null keys,See key completeness cell,Check
2,No negative age counts,0 negative rows,See negative checks,Check
3,Pincode stability,minimal conflicts,See pincode consistency cell,Check
4,Trend stability,explainable spikes,See trend/outlier cells,Check
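
The `pass_flag` column above holds the placeholder `Check` for every rule. One way to fill the first four rules from data is sketched below. It assumes the natural key is `date`, `state`, `district`, `pincode` (the key fields named in risk R2), treats "minimal conflicts" as zero conflicts, and runs on a made-up toy frame rather than the real enrollment data. The helper names `flag` and `evaluate_rules` are hypothetical. Trend stability is left out because "explainable spikes" needs human judgment.

```python
import pandas as pd

KEY_COLS = ['date', 'state', 'district', 'pincode']  # assumed natural key
AGE_COLS = ['age_0_5', 'age_5_17', 'age_18_greater']


def flag(violations):
    """Return 'Pass' when a rule has zero violations, else 'Fail'."""
    return 'Pass' if violations == 0 else 'Fail'


def evaluate_rules(df):
    """Count violations per rule and return a rule status table."""
    dup_keys = int(df.duplicated(subset=KEY_COLS).sum())
    null_keys = int(df[KEY_COLS].isna().any(axis=1).sum())
    negative_rows = int((df[AGE_COLS] < 0).any(axis=1).sum())
    # A pincode conflicts when it maps to more than one state or district
    per_pincode = df.groupby('pincode')[['state', 'district']].nunique()
    pincode_conflicts = int((per_pincode > 1).any(axis=1).sum())

    checks = [
        ('Natural key uniqueness', '0 duplicates', dup_keys),
        ('Key null checks', '0 null keys', null_keys),
        ('No negative age counts', '0 negative rows', negative_rows),
        ('Pincode stability', '0 conflicts (assumed)', pincode_conflicts),
    ]
    return pd.DataFrame([
        {'rule': rule, 'expected': expected, 'current': count,
         'pass_flag': flag(count)}
        for rule, expected, count in checks
    ])


# Toy frame with made-up rows that deliberately violate every rule once
toy = pd.DataFrame({
    'date': ['02-03-2025', '09-03-2025', '09-03-2025', '09-03-2025', '10-03-2025'],
    'state': ['Meghalaya', 'Karnataka', 'Karnataka', 'Uttar Pradesh', 'Karnataka'],
    'district': ['East Khasi Hills', 'Bengaluru Urban', 'Bengaluru Urban',
                 None, 'Bengaluru Rural'],
    'pincode': [793121, 560043, 560043, 208001, 560043],
    'age_0_5': [11, 14, 5, 29, 3],
    'age_5_17': [61, 33, 2, -1, 4],
    'age_18_greater': [37, 39, 1, 12, 5],
})

toy_rule_status = evaluate_rules(toy)
print(toy_rule_status)
```

In the notebook, the same function could be called on `review_df`, provided that frame carries these column names.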


In [234]:
# Overall Review Cell 15/15: export final review artifacts
from pathlib import Path
import pandas as pd

report_dir = Path('scripts/EDA/panda_eda/consistency_reports')
report_dir.mkdir(parents=True, exist_ok=True)

# Rebuild summary_metrics if missing
if 'summary_metrics' not in globals() and 'review_df' in globals():
    summary_metrics = pd.DataFrame([
        {'metric': 'rows', 'value': int(len(review_df))},
        {'metric': 'columns', 'value': int(review_df.shape[1])},
        {'metric': 'null_cells_total', 'value': int(review_df.isna().sum().sum())},
        {'metric': 'full_row_duplicates', 'value': int(review_df.duplicated().sum())},
    ])

# Write each artifact only if its DataFrame exists, and record what was written
artifacts = {
    'summary_metrics': 'overall_review_summary_metrics.csv',
    'final_eda_risk_register': 'overall_review_risk_register.csv',
    'final_eda_rule_status': 'overall_review_rule_status.csv',
}
saved_paths = []
for var_name, file_name in artifacts.items():
    if var_name in globals():
        out_path = report_dir / file_name
        globals()[var_name].to_csv(out_path, index=False)
        saved_paths.append(out_path)

# Short markdown note that summarizes the review and its outputs
final_note = report_dir / 'overall_review_findings.md'
lines = [
    '# Overall Enrollment EDA Review',
    '',
    '- Scope: consolidated review across profiling, quality, consistency, and trend checks.',
    '- Output files: summary metrics, risk register, rule status.',
    '- Next: convert open checks into scheduled data quality pipeline rules.',
]
final_note.write_text('\n'.join(lines), encoding='utf-8')
saved_paths.append(final_note)

# Report only the files that were actually written
print('Saved review outputs in:', report_dir)
for path in saved_paths:
    print('-', path)


Saved review outputs in: scripts\EDA\panda_eda\consistency_reports
- scripts\EDA\panda_eda\consistency_reports\overall_review_summary_metrics.csv
- scripts\EDA\panda_eda\consistency_reports\overall_review_risk_register.csv
- scripts\EDA\panda_eda\consistency_reports\overall_review_rule_status.csv
- scripts\EDA\panda_eda\consistency_reports\overall_review_findings.md
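
A read-back check can confirm that the four files listed above exist and are non-empty before they are handed to anyone else. The sketch below is one possible approach: `verify_exports` is a hypothetical helper, and the demo runs in a temporary directory with a single file so that it does not touch the real report folder.

```python
from pathlib import Path
import tempfile

import pandas as pd

# File names taken from the export cell above
EXPECTED_FILES = [
    'overall_review_summary_metrics.csv',
    'overall_review_risk_register.csv',
    'overall_review_rule_status.csv',
    'overall_review_findings.md',
]


def verify_exports(report_dir, expected_files=EXPECTED_FILES):
    """Return one row per expected file with existence and size on disk."""
    records = []
    for name in expected_files:
        path = Path(report_dir) / name
        exists = path.is_file()
        records.append({
            'file': name,
            'exists': exists,
            'size_bytes': path.stat().st_size if exists else 0,
        })
    return pd.DataFrame(records)


# Demo: only the markdown note is written, so the three CSVs report as missing
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    note = demo_dir / 'overall_review_findings.md'
    note.write_text('# Overall Enrollment EDA Review', encoding='utf-8')
    demo_check = verify_exports(demo_dir)

print(demo_check)
```

In the notebook itself, `verify_exports(report_dir)` would run the same check against the real output folder.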
