# Data Preprocessing Pipeline

## 📚 Import Required Libraries

We'll start by importing the essential Python libraries needed for data manipulation and preprocessing.

In [29]:
import pandas as pd

# 📊 Loading Raw Data and Labels

## Data Sources

**Original Dataset**: [UCI Devanagari Handwritten Character Dataset](https://archive.ics.uci.edu/dataset/389/devanagari+handwritten+character+dataset)

### What We're Working With:

#### **Data File (`data.csv`)**
- Contains pixel values of all grayscaled images (32×32 = 1024 features per image)
- Each row represents one handwritten character image
- Pixel values are normalized and ready for neural network input
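
To make the 1024-feature layout concrete, here is a minimal sketch of turning one row back into a 32×32 grid. The row is synthetic (random values, not taken from `data.csv`), and the row-by-row pixel order is an assumption, since the source does not state how the images were flattened.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one row of data.csv, using the same column naming scheme.
pixel_columns = [f"pixel_{i:04d}" for i in range(1024)]
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 256, size=(1, 1024)), columns=pixel_columns)

# Assumption: pixels were flattened row by row, so reshape restores the 32x32 grid.
image = toy.loc[0, pixel_columns].to_numpy().reshape(32, 32)
print(image.shape)
```

Under that assumption, `image[r, c]` corresponds to column `pixel_{32 * r + c:04d}`.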

#### **Labels File (`labels.csv`)**  
- Contains the corresponding character labels in multiple formats
- Includes English transliterations, Devanagari characters, and phonetic representations

### Why This Preprocessing Matters:
Since this project builds neural networks from scratch using only NumPy (no deep learning frameworks such as PyTorch), having all pixel values in numerical form is essential. This notebook handles:
- Data cleaning and validation
- Label standardization and mapping
- Adding proper Devanagari character labels
- Preparing the dataset for machine learning

In [30]:
data = pd.read_csv('rawData/data.csv')
labels = pd.read_csv('rawData/labels.csv')

In [31]:
data.head()

Unnamed: 0,pixel_0000,pixel_0001,pixel_0002,pixel_0003,pixel_0004,pixel_0005,pixel_0006,pixel_0007,pixel_0008,pixel_0009,...,pixel_1015,pixel_1016,pixel_1017,pixel_1018,pixel_1019,pixel_1020,pixel_1021,pixel_1022,pixel_1023,character
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,character_01_ka
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,character_01_ka
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,character_01_ka
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,character_01_ka
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,character_01_ka


In [32]:
labels.head(20)

Unnamed: 0,Numerals,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,Class,Label,Devanagari Label,Phonetics
2,0,0,०,Śūn'ya
3,1,1,१,ēka
4,2,2,२,du'ī
5,3,3,३,tīna
6,4,4,४,cāra
7,5,5,५,pām̐ca
8,6,6,६,cha
9,7,7,७,sāta


# 🔄 Creating Working Copies

## Best Practice: Work with Copies

It is safer to run all preprocessing operations on copies of the original data. The raw DataFrames stay untouched in memory, so we can revert at any point without reloading the files.

In [33]:
dataCopy = data.copy()
labelsCopy = labels.copy()

# 🔍 Data Quality Assessment

## Checking for Missing Values

Before proceeding with any data manipulation, we need to identify and handle missing values. We have two main strategies:
1. **Remove rows** with missing values (if the loss is acceptable)
2. **Impute values** using mean, median, or other strategies
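
The two strategies can be compared on a small example. This is a sketch on a made-up DataFrame (the `label` and `score` columns are hypothetical, not part of the dataset); note that mean imputation only applies to numeric columns, so it would not help with text labels.

```python
import pandas as pd

# Hypothetical toy frame with one missing text value and one missing number.
toy = pd.DataFrame({"label": ["ka", None, "ga"], "score": [1.0, 2.0, None]})

# Count missing values per column before choosing a strategy.
missing_per_column = toy.isnull().sum()

# Strategy 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Strategy 2: fill the numeric column with its mean; the text column is left as is.
imputed = toy.fillna({"score": toy["score"].mean()})

print(missing_per_column)
print(dropped)
print(imputed)
```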

In [34]:
labelsCopy.isnull().sum()   # Check for missing values in labels

Numerals      7
Unnamed: 1    9
Unnamed: 2    9
Unnamed: 3    9
dtype: int64

# 🧹 Data Cleaning: Removing Missing Values

The labels table has at most 9 missing entries per column, so dropping the affected rows costs very little. We remove them to keep only complete label records.

In [35]:
labelsCopy = labelsCopy.dropna()    # Drop missing values from labels

# 📋 Label Structure Optimization

## Column Management and Header Setup

We'll perform several important operations:
1. **Remove unnecessary columns** (like "Numerals") that don't add value to our classification task
2. **Set proper headers** using the first row of data
3. **Remove duplicate entries** to ensure data uniqueness
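
The three steps can be traced on a tiny example. This is a sketch on a hypothetical frame that imitates a sheet whose header row is repeated inside the data; the repeated header is an assumption made for illustration, not something the source shows for `labels.csv`.

```python
import pandas as pd

# Hypothetical layout: the real header ("Label", ...) appears as data rows.
raw = pd.DataFrame({
    "Numerals": ["Class", "0", "Class", "1"],
    "Unnamed: 1": ["Label", "0", "Label", "ka"],
    "Unnamed: 2": ["Devanagari Label", "०", "Devanagari Label", "क"],
})

tidy = raw.drop(columns="Numerals")        # 1. remove the unneeded column
tidy.columns = tidy.iloc[0]                # 2. promote the first row to header
tidy = tidy.drop_duplicates()              # 3. repeated header rows collapse to one
tidy = tidy.drop(tidy.index[0]).reset_index(drop=True)  # drop the header row itself
tidy.columns.name = None                   # clear the leftover index name

print(tidy)
```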

In [36]:
labelsCopy = labelsCopy.drop(columns="Numerals")    # Drop the 'Numerals' column from labels

In [37]:
labelsCopy.columns = labelsCopy.iloc[0] # Set the first row as header

In [38]:
labelsCopy = labelsCopy.drop_duplicates()   # Remove duplicate rows
labelsCopy = labelsCopy.drop(labelsCopy.index[0])   # Remove the first row
labelsCopy = labelsCopy.reset_index(drop=True)   # Reset index

In [39]:
uniqueLabels = labelsCopy['Label'].unique()  # Get all the labels which are unique
print(f"Total unique labels: {uniqueLabels}")

Total unique labels: ['0' '1' '2' '3' '4' '5' '6' '7' '8' '9' 'a' 'aa' 'i' 'ee' 'u' 'oo' 'ae'
 'ai' 'o' 'au' 'an' 'ah' 'ka' 'kha' 'ga' 'gha' 'kna' 'cha' 'chha' 'ja'
 'jha' 'yna' 'ta' 'tha' 'da' 'dha' 'ana' 'taa' 'thaa' 'daa' 'dhaa' 'na'
 'pa' 'pha' 'ba' 'ma' 'ya' 'ra' 'la' 'va' 'motosaw' 'petchiryosaw'
 'patalosaw' 'ha' 'ksha' 'tra' 'gya']


# 🔤 Label Analysis and Validation

## Understanding Our Character Set

We examine all unique labels to confirm that the character set is complete. The list is ordered as 10 numerals, then 12 vowels, then the consonants, so the consonants start at index 22. The label table pairs each English transliteration with its Devanagari character, which is what we will later attach to the image data.

In [40]:
uniqueLabelsConsonants = uniqueLabels[22:]      # Only take Devanagari consonant alphabets
print(uniqueLabelsConsonants)
print(len(uniqueLabelsConsonants))

['ka' 'kha' 'ga' 'gha' 'kna' 'cha' 'chha' 'ja' 'jha' 'yna' 'ta' 'tha' 'da'
 'dha' 'ana' 'taa' 'thaa' 'daa' 'dhaa' 'na' 'pa' 'pha' 'ba' 'ma' 'ya' 'ra'
 'la' 'va' 'motosaw' 'petchiryosaw' 'patalosaw' 'ha' 'ksha' 'tra' 'gya']
35


# ⚠️ Data Inconsistency Detection

## Missing Character Investigation

**Expected**: 36 unique Devanagari consonant characters  
**Found**: 35 characters

This discrepancy indicates a data quality issue that needs investigation and correction.
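
One way to locate such a problem is to look for label values that occur more than once, since a repeated label shrinks the result of `unique()`. The sketch below uses a hypothetical five-row table; the rows are invented for illustration and do not reproduce the actual contents of `labels.csv`.

```python
import pandas as pd

# Hypothetical excerpt where two different characters share one label.
toy_labels = pd.DataFrame({
    "Label": ["pa", "pha", "ba", "ba", "ma"],
    "Devanagari Label": ["प", "फ", "ब", "भ", "म"],
})

# keep=False marks every occurrence of a repeated label, not just the later ones.
repeated = toy_labels[toy_labels["Label"].duplicated(keep=False)]
print(repeated)
```

Rows that share a label but differ in the other columns survive `drop_duplicates()`, which is why this check is needed separately.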

In [41]:
print(labelsCopy[22:])

1          Label Devanagari Label Phonetics
22            ka                क        ka
23           kha                ख       kha
24            ga                ग        ga
25           gha                घ       gha
26           kna                ङ        ṅa
27           cha                च        ca
28          chha                छ       cha
29            ja                ज        ja
30           jha                झ       jha
31           yna                ञ        ña
32            ta                ट        ṭa
33           tha                ठ       ṭha
34            da                ड        ḍa
35           dha                ढ       ḍha
36           ana                ण        ṇa
37           taa                त        ta
38          thaa                थ       tha
39           daa                द        da
40          dhaa                ध       dha
41            na                न        na
42            pa                प        pa
43           pha                

# 🔧 Data Correction

## Fixing Mislabeled Characters

On inspection, the row at index 45 (भ, "bha") carried a label that duplicated another entry, which is why `unique()` returned 35 consonants instead of 36. We set its label to "bha". Catching this kind of error matters because a wrong label would silently merge two classes.

In [42]:
labelsCopy.loc[45, "Label"] = "bha"     # Update label for index 45
uniqueLabels = labelsCopy['Label'].unique()
uniqueLabelsConsonants = uniqueLabels[22:]
print(uniqueLabelsConsonants)

['ka' 'kha' 'ga' 'gha' 'kna' 'cha' 'chha' 'ja' 'jha' 'yna' 'ta' 'tha' 'da'
 'dha' 'ana' 'taa' 'thaa' 'daa' 'dhaa' 'na' 'pa' 'pha' 'ba' 'bha' 'ma'
 'ya' 'ra' 'la' 'va' 'motosaw' 'petchiryosaw' 'patalosaw' 'ha' 'ksha'
 'tra' 'gya']


# ✅ Label Validation Complete

The label dataset is now clean and ready for integration with the main dataset.

# 🔗 Data Integration Preparation

## Label Standardization for Merging

The main dataset's "character" column stores names such as `character_01_ka`. We keep only the part after the last underscore (`ka`) so the values can be matched against the cleaned label table. This lets us attach the Devanagari character and phonetics to each image record.

In [43]:
dataCopy['character'] = dataCopy['character'].str.split('_').str[-1]    # Keep only the last part after '_'

In [44]:
uniqueLabels = dataCopy['character'].unique()
print(uniqueLabels[:36])

['ka' 'kha' 'ga' 'gha' 'kna' 'cha' 'chha' 'ja' 'jha' 'yna' 'taamatar'
 'thaa' 'daa' 'dhaa' 'adna' 'tabala' 'tha' 'da' 'dha' 'na' 'pa' 'pha' 'ba'
 'bha' 'ma' 'yaw' 'ra' 'la' 'waw' 'motosaw' 'petchiryakha' 'patalosaw'
 'ha' 'chhya' 'tra' 'gya']


# 🗺️ Label Mapping Strategy

## Creating Label Translation Dictionary

The data file contains 36 consonant names in the same order as the 36 entries in our cleaned label table, so we can pair them by position. The dictionary maps each old name to its standardized English transliteration. The merge later attaches the remaining fields:
- Devanagari characters
- Phonetic representations

In [45]:
labelDict = {key:value for key, value in zip(uniqueLabels[:36], uniqueLabelsConsonants)}    # Map labels from "labelsCopy" to labels from "dataCopy"
print(labelDict)

{'ka': 'ka', 'kha': 'kha', 'ga': 'ga', 'gha': 'gha', 'kna': 'kna', 'cha': 'cha', 'chha': 'chha', 'ja': 'ja', 'jha': 'jha', 'yna': 'yna', 'taamatar': 'ta', 'thaa': 'tha', 'daa': 'da', 'dhaa': 'dha', 'adna': 'ana', 'tabala': 'taa', 'tha': 'thaa', 'da': 'daa', 'dha': 'dhaa', 'na': 'na', 'pa': 'pa', 'pha': 'pha', 'ba': 'ba', 'bha': 'bha', 'ma': 'ma', 'yaw': 'ya', 'ra': 'ra', 'la': 'la', 'waw': 'va', 'motosaw': 'motosaw', 'petchiryakha': 'petchiryosaw', 'patalosaw': 'patalosaw', 'ha': 'ha', 'chhya': 'ksha', 'tra': 'tra', 'gya': 'gya'}


# 🔄 Label Replacement

## Applying the Label Mapping

Now we replace the old labels with the standardized ones. Labels without a dictionary entry, such as the digits, are left unchanged.

In [46]:
dataCopy["character"] = dataCopy["character"].replace(labelDict)    # Replace labels in "dataCopy" using the mapping dictionary

In [47]:
print(dataCopy["character"].unique())

['ka' 'kha' 'ga' 'gha' 'kna' 'cha' 'chha' 'ja' 'jha' 'yna' 'ta' 'tha' 'da'
 'dha' 'ana' 'taa' 'thaa' 'daa' 'dhaa' 'na' 'pa' 'pha' 'ba' 'bha' 'ma'
 'ya' 'ra' 'la' 'va' 'motosaw' 'petchiryosaw' 'patalosaw' 'ha' 'ksha'
 'tra' 'gya' '0' '1' '2' '3' '4' '5' '6' '7' '8' '9']
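
The mapping contains swapped pairs (for example `'tha'` becomes `'thaa'` while `'thaa'` becomes `'tha'`), so each value must be translated exactly once. As an alternative sketch, not the notebook's method, `Series.map` with a dictionary lookup makes that single pass explicit. The three-entry `mapping` and the `characters` series below are shortened, hypothetical stand-ins for `labelDict` and the real column.

```python
import pandas as pd

# Shortened stand-in for labelDict, including one swapped pair.
mapping = {"tha": "thaa", "thaa": "tha", "waw": "va"}
characters = pd.Series(["tha", "thaa", "waw", "0"])

# Each value is looked up once; values without an entry (e.g. digits) stay unchanged.
standardized = characters.map(lambda value: mapping.get(value, value))
print(standardized.tolist())
```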


# 📝 Preparing for Data Merge

## Column Name Standardization

Renaming columns to consistent names makes the merging process cleaner and more reliable.

In [48]:
dataCopy.rename(columns={"character": "Label"}, inplace=True)   # Rename the 'character' column to 'Label'

# 🔗 Dataset Integration

## Left Join Operation

We perform a left join where:
- **Left table**: Main dataset with pixel data
- **Right table**: Clean label dataset
- **Join key**: "Label" column
- **Result**: Matched labels get additional information; unmatched labels get NaN in the new columns (which we'll validate)
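
A left join can be made self-checking with two optional `merge` parameters: `validate="many_to_one"` raises an error if a label appears more than once in the right table, and `indicator=True` adds a `_merge` column that flags rows without a match. The sketch below uses small hypothetical frames (`images`, `lookup`), including a deliberately unknown label `"xx"`.

```python
import pandas as pd

# Hypothetical stand-ins for the pixel table and the cleaned label table.
images = pd.DataFrame({"pixel_0000": [0, 0, 0], "Label": ["ka", "ka", "xx"]})
lookup = pd.DataFrame({"Label": ["ka", "kha"], "Devanagari Label": ["क", "ख"]})

merged = images.merge(
    lookup,
    on="Label",
    how="left",
    validate="many_to_one",  # fail fast if lookup has duplicate labels
    indicator=True,          # adds a "_merge" column describing each row's origin
)

# Labels that found no partner in the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Label"].unique()
print(unmatched)
```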

In [49]:
dataCopy = dataCopy.merge(labelsCopy, on='Label', how='left')   # Left-join the label details onto the image rows via the "Label" column
dataCopy.head()

Unnamed: 0,pixel_0000,pixel_0001,pixel_0002,pixel_0003,pixel_0004,pixel_0005,pixel_0006,pixel_0007,pixel_0008,pixel_0009,...,pixel_1017,pixel_1018,pixel_1019,pixel_1020,pixel_1021,pixel_1022,pixel_1023,Label,Devanagari Label,Phonetics
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,ka,क,ka
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,ka,क,ka
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,ka,क,ka
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,ka,क,ka
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,ka,क,ka


In [50]:
dataCopy.isna().sum()   # Check for missing values in "dataCopy"

pixel_0000          0
pixel_0001          0
pixel_0002          0
pixel_0003          0
pixel_0004          0
                   ..
pixel_1022          0
pixel_1023          0
Label               0
Devanagari Label    0
Phonetics           0
Length: 1027, dtype: int64

# ✅ Data Integration Complete

## Final Data Quality Check

Excellent! The dataset is now clean with:
- **No missing values**
- **Correct character labels**
- **Complete Devanagari character information**

In [51]:
dataCopy["Devanagari Label"].unique()

array(['क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड',
       'ढ', 'ण', 'त', 'थ', 'द', 'ध', 'न', 'प', 'फ', 'ब', 'भ', 'म', 'य',
       'र', 'ल', 'व', 'श', 'ष', 'स', 'ह', 'क्ष', 'त्र', 'ज्ञ', '०', '१',
       '२', '३', '४', '५', '६', '७', '८', '९'], dtype=object)

In [52]:
print(dataCopy[["Devanagari Label", "Label", "Phonetics"]].value_counts())
print(len(dataCopy[["Devanagari Label", "Label", "Phonetics"]].value_counts()))

Devanagari Label  Label         Phonetics
क                 ka            ka           2000
स                 patalosaw     sa           2000
ब                 ba            ba           2000
भ                 bha           bha          2000
म                 ma            ma           2000
य                 ya            ya           2000
र                 ra            ra           2000
ल                 la            la           2000
व                 va            va           2000
श                 motosaw       śa           2000
ष                 petchiryosaw  ṣa           2000
ह                 ha            ha           2000
क्ष               ksha          kṣa          2000
०                 0             Śūn'ya       2000
१                 1             ēka          2000
२                 2             du'ī         2000
३                 3             tīna         2000
४                 4             cāra         2000
५                 5             pām̐ca       2000
६       

# 🎉 Preprocessing Success!

## Dataset Enhancement Complete

Perfect! Our final dataset now includes:
- ✅ **English transliteration labels** (ka, kha, ga, etc.)
- ✅ **Devanagari character labels** (क, ख, ग, etc.) 
- ✅ **Phonetic representations**
- ✅ **Zero missing values**
- ✅ **Data integrity maintained**

This multilingual labeling makes the dataset valuable for various applications including cross-script learning and educational tools.

# 📊 Train-Test Split Strategy

## Creating Balanced Datasets

Now we'll create training and testing sets using the common **80/20 split ratio**:

### Why Stratified Splitting Matters:
- **Each character has 2,000 samples**
- **Equal representation**: Every character gets exactly 1,600 training samples and 400 test samples
- **Prevents bias**: Ensures the model sees balanced examples of each character
- **Reliable evaluation**: Test set accurately represents the full character distribution

### Split Strategy:
- **Training Set**: 80% (1,600 samples × 46 classes = 73,600 total)
- **Test Set**: 20% (400 samples × 46 classes = 18,400 total)

The 46 classes are the 36 consonants plus the 10 numerals.
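
As a compact alternative sketch of a per-class 80/20 split, `DataFrameGroupBy.sample` (available in pandas 1.1 and later) draws the training rows per label, and the remaining rows form the test set. The toy frame below is hypothetical, with 10 rows per class instead of 2,000.

```python
import pandas as pd

# Hypothetical toy data: two classes with 10 rows each.
toy = pd.DataFrame({"Label": ["ka"] * 10 + ["kha"] * 10, "pixel_0000": range(20)})

# Draw 80% of each class for training; a fixed seed keeps the split reproducible.
train = toy.groupby("Label").sample(frac=0.8, random_state=42)

# Everything not drawn for training becomes the test set.
test = toy.drop(train.index)

print(train["Label"].value_counts())
print(test["Label"].value_counts())
```

Because the test set is built by dropping the training indices, the two sets cannot overlap as long as the original index is unique.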

In [53]:
trainList = []
testList = []

for label, group in dataCopy.groupby('Label'):
    # Shuffle the group to randomize
    group = group.sample(frac=1, random_state=42).reset_index(drop=True)
    
    # 80% for train (1600), 20% for test (400)
    trainSize = int(0.8 * len(group))
    trainGroup = group[:trainSize]
    testGroup = group[trainSize:]

    trainList.append(trainGroup)
    testList.append(testGroup)

# Combine into train and test DataFrames
trainData = pd.concat(trainList, ignore_index=True)
testData = pd.concat(testList, ignore_index=True)

# Verify the split
print(f"Train set size: {len(trainData)}")
print(f"Test set size: {len(testData)}")
print("Train label distribution:")
print(trainData['Label'].value_counts())
print("Test label distribution:")
print(testData['Label'].value_counts())

Train set size: 73600
Test set size: 18400
Train label distribution:
Label
0               1600
patalosaw       1600
ka              1600
kha             1600
kna             1600
ksha            1600
la              1600
ma              1600
motosaw         1600
na              1600
pa              1600
petchiryosaw    1600
1               1600
pha             1600
ra              1600
ta              1600
taa             1600
tha             1600
thaa            1600
tra             1600
va              1600
ya              1600
jha             1600
ja              1600
ha              1600
gya             1600
2               1600
3               1600
4               1600
5               1600
6               1600
7               1600
8               1600
9               1600
ana             1600
ba              1600
bha             1600
cha             1600
chha            1600
da              1600
daa             1600
dha             1600
dhaa            1600
ga              1600
g

# 🎯 Preprocessing Pipeline Complete!

## Summary of Accomplishments

Our comprehensive data preprocessing pipeline has successfully:

✅ **Data Quality**: Cleaned and validated all data  
✅ **Label Enhancement**: Added multilingual character information  
✅ **Balanced Splitting**: Created proper train/test datasets  
✅ **Format Standardization**: Prepared data for neural network training  
✅ **Reproducibility**: Used fixed random seeds for consistent results
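
For a NumPy-only network, the resulting table still has to be converted into arrays. The following is a minimal sketch of one way to do that, assuming the pixel columns share the `pixel_` prefix; the toy frame and the choice of one-hot targets are illustrative assumptions, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the final table: two pixel columns and a label.
toy = pd.DataFrame({
    "pixel_0000": [0, 255, 128],
    "pixel_0001": [10, 20, 30],
    "Label": ["ka", "kha", "ka"],
})

# Feature matrix: every column whose name contains "pixel_".
X = toy.filter(like="pixel_").to_numpy(dtype=np.float32)

# Integer class codes plus the class names they refer to.
codes, classes = pd.factorize(toy["Label"])

# One-hot targets: row i has a 1 in the column of its class code.
one_hot = np.eye(len(classes), dtype=np.float32)[codes]

print(X.shape, codes, list(classes))
```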

# 📤 Dataset Availability

## Open Access Dataset

The fully preprocessed dataset is freely available to the research and education community:

**📊 Kaggle Dataset**: [Devanagari MNIST - Preprocessed](https://www.kaggle.com/datasets/prabeshsagarbaral/mnistdevanagari)

### What You Get:
- **Ready-to-use CSV files** (trainDataMNIST.csv, testDataMNIST.csv)
- **Clean, balanced data** with proper train/test splits
- **Multilingual labels** (English, Devanagari, Phonetic)
- **92,000 total samples** across 46 classes (36 consonants and 10 numerals)
- **Educational documentation** and usage examples

**License**: Creative Commons Attribution 4.0 International (CC BY 4.0)