Let's import the required libraries.

In [29]:
import pandas as pd

Loading the "data" and "labels". 

Data Sources:

Images: https://archive.ics.uci.edu/dataset/389/devanagari+handwritten+character+dataset

The "data" file contains the pixel values of every image after conversion to grayscale. We do this because we will feed these pixel values to a neural network, which classifies each combination of pixels as a Devanagari character. Since this project does not use a CNN or libraries like PyTorch, having all values in numerical form helps. This notebook cleans the data and adds the correct labels to make it ready for the project.

In [30]:
data = pd.read_csv('rawData/data.csv')
labels = pd.read_csv('rawData/labels.csv')

In [31]:
data.head()

Unnamed: 0,pixel_0000,pixel_0001,pixel_0002,pixel_0003,pixel_0004,pixel_0005,pixel_0006,pixel_0007,pixel_0008,pixel_0009,...,pixel_1015,pixel_1016,pixel_1017,pixel_1018,pixel_1019,pixel_1020,pixel_1021,pixel_1022,pixel_1023,character
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,character_01_ka
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,character_01_ka
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,character_01_ka
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,character_01_ka
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,character_01_ka
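
The 1024 pixel columns (pixel_0000 to pixel_1023) are consistent with a flattened 32x32 image. The sketch below shows how one row could be turned back into a 2-D array. The row-major layout is an assumption, and `flat_pixels` is a hypothetical stand-in for a real row of the data.

```python
import numpy as np

# Hypothetical stand-in for one row of pixel values (pixel_0000 ... pixel_1023).
flat_pixels = np.arange(1024) % 256

# Assumption: the 1024 columns are a row-major flattening of a 32x32 image.
image = flat_pixels.reshape(32, 32)

print(image.shape)
```

If the assumption holds, `image[r, c]` corresponds to the column numbered `32 * r + c`.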


In [32]:
labels.head(20)

Unnamed: 0,Numerals,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,Class,Label,Devanagari Label,Phonetics
2,0,0,०,Śūn'ya
3,1,1,१,ēka
4,2,2,२,du'ī
5,3,3,३,tīna
6,4,4,४,cāra
7,5,5,५,pām̐ca
8,6,6,६,cha
9,7,7,७,sāta


It is always safer to do all preprocessing on a copy of the original data. Let's make an independent copy.

In [33]:
dataCopy = data.copy()
labelsCopy = labels.copy()

Now we check for NULL values. We can either remove them altogether or replace them with a new value.

In [34]:
labelsCopy.isnull().sum()   # Check for missing values in labels

Numerals      7
Unnamed: 1    9
Unnamed: 2    9
Unnamed: 3    9
dtype: int64

Since the rows with NULL values in "labelsCopy" can be safely removed, we drop them.

In [35]:
labelsCopy = labelsCopy.dropna()    # Drop missing values from labels
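
The two options for handling NULL values can be compared on a toy frame. This is only a sketch: the values and the "unknown" placeholder are made up for illustration and are not taken from the label file.

```python
import pandas as pd

# Toy frame with one completely empty row (hypothetical values).
toy = pd.DataFrame({
    "Label": ["ka", None, "kha"],
    "Phonetics": ["ka", None, "kha"],
})

dropped = toy.dropna()          # option 1: remove rows containing any missing value
filled = toy.fillna("unknown")  # option 2: keep the rows, substitute a placeholder

print(len(dropped), len(filled))
```

Dropping is the better fit here because the empty rows carry no label information at all.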

We won't need the "Numerals" column because it does not add much value, so we drop it. Then we rename the columns using the names found in the first row, and finally we drop the duplicate rows.

In [36]:
labelsCopy = labelsCopy.drop(columns="Numerals")    # Drop the 'Numerals' column from labels

In [37]:
labelsCopy.columns = labelsCopy.iloc[0] # Set the first row as header

In [38]:
labelsCopy = labelsCopy.drop_duplicates()   # Remove duplicate rows
labelsCopy = labelsCopy.drop(labelsCopy.index[0])   # Remove the first row
labelsCopy = labelsCopy.reset_index(drop=True)   # Reset index
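
The header promotion can be seen in isolation on a toy frame. This sketch uses hypothetical values. The last step, clearing the column index name, is an optional addition that the notebook does not perform; without it, the index label of the promoted row is kept as the name of the column index and appears in printed output.

```python
import pandas as pd

# Toy frame whose real header sits in the first data row (hypothetical values).
toy = pd.DataFrame([["Label", "Phonetics"], ["ka", "ka"], ["kha", "kha"]])

toy.columns = toy.iloc[0]         # promote the first row to column names
toy = toy.drop(toy.index[0])      # drop the now-redundant header row
toy = toy.reset_index(drop=True)  # renumber rows from 0
toy.columns.name = None           # optional: clear the leftover index name

print(toy)
```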

In [39]:
uniqueLabels = labelsCopy['Label'].unique()  # Get all the labels which are unique
print(f"Total unique labels: {uniqueLabels}")

Total unique labels: ['0' '1' '2' '3' '4' '5' '6' '7' '8' '9' 'a' 'aa' 'i' 'ee' 'u' 'oo' 'ae'
 'ai' 'o' 'au' 'an' 'ah' 'ka' 'kha' 'ga' 'gha' 'kna' 'cha' 'chha' 'ja'
 'jha' 'yna' 'ta' 'tha' 'da' 'dha' 'ana' 'taa' 'thaa' 'daa' 'dhaa' 'na'
 'pa' 'pha' 'ba' 'ma' 'ya' 'ra' 'la' 'va' 'motosaw' 'petchiryosaw'
 'patalosaw' 'ha' 'ksha' 'tra' 'gya']


We list all the labels because we want to add the additional label columns (Devanagari script and phonetics) to the dataset, which makes it more useful. Currently the dataset only contains labels in English. One way to do this is to match the English labels in both datasets and add the corresponding Devanagari labels.

In [40]:
uniqueLabelsConsonants = uniqueLabels[22:]      # Only take Devanagari consonant alphabets
print(uniqueLabelsConsonants)
print(len(uniqueLabelsConsonants))

['ka' 'kha' 'ga' 'gha' 'kna' 'cha' 'chha' 'ja' 'jha' 'yna' 'ta' 'tha' 'da'
 'dha' 'ana' 'taa' 'thaa' 'daa' 'dhaa' 'na' 'pa' 'pha' 'ba' 'ma' 'ya' 'ra'
 'la' 'va' 'motosaw' 'petchiryosaw' 'patalosaw' 'ha' 'ksha' 'tra' 'gya']
35


The dataset covers 36 Devanagari consonants, so there should be 36 unique labels here. We only get 35, so one is missing.
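
A missing unique label usually means that two rows share the same label. As a rough sketch, `duplicated` can point at the repeated rows directly. The toy frame below is illustrative only and is not the actual label table.

```python
import pandas as pd

# Toy label table in which two different characters share one label (illustrative).
toy = pd.DataFrame({
    "Label": ["pha", "ba", "ba", "ma"],
    "Devanagari Label": ["फ", "ब", "भ", "म"],
})

# keep=False marks every occurrence of a repeated label, not just the later ones.
repeated = toy[toy["Label"].duplicated(keep=False)]
print(repeated)
```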

In [41]:
print(labelsCopy[22:])

1          Label Devanagari Label Phonetics
22            ka                क        ka
23           kha                ख       kha
24            ga                ग        ga
25           gha                घ       gha
26           kna                ङ        ṅa
27           cha                च        ca
28          chha                छ       cha
29            ja                ज        ja
30           jha                झ       jha
31           yna                ञ        ña
32            ta                ट        ṭa
33           tha                ठ       ṭha
34            da                ड        ḍa
35           dha                ढ       ḍha
36           ana                ण        ṇa
37           taa                त        ta
38          thaa                थ       tha
39           daa                द        da
40          dhaa                ध       dha
41            na                न        na
42            pa                प        pa
43           pha                

Upon further inspection, we see that the letter "bha" is mislabeled in the "Label" column, so we fix that.

In [42]:
labelsCopy.loc[45, "Label"] = "bha"     # Update label for index 45
uniqueLabels = labelsCopy['Label'].unique()
uniqueLabelsConsonants = uniqueLabels[22:]
print(uniqueLabelsConsonants)

['ka' 'kha' 'ga' 'gha' 'kna' 'cha' 'chha' 'ja' 'jha' 'yna' 'ta' 'tha' 'da'
 'dha' 'ana' 'taa' 'thaa' 'daa' 'dhaa' 'na' 'pa' 'pha' 'ba' 'bha' 'ma'
 'ya' 'ra' 'la' 'va' 'motosaw' 'petchiryosaw' 'patalosaw' 'ha' 'ksha'
 'tra' 'gya']


The labels now look to be in good shape.

In the "character" column of "dataCopy", we need to remove the prefix before the actual label so that we can match the labels in "dataCopy" against those in "labelsCopy" and add the new Devanagari labels.

In [43]:
dataCopy['character'] = dataCopy['character'].str.split('_').str[-1]    # Keep only the last part after '_'
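
The effect of the split can be checked on a couple of sample strings. In this sketch, "character_01_ka" comes from the output above, while "character_02_kha" is a hypothetical value that follows the same pattern.

```python
import pandas as pd

# First value is from the data preview; the second is a hypothetical example.
raw = pd.Series(["character_01_ka", "character_02_kha"])

# Split on '_' and keep only the last piece of each string.
short = raw.str.split("_").str[-1]

print(short.tolist())
```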

In [44]:
uniqueLabels = dataCopy['character'].unique()
print(uniqueLabels[:36])

['ka' 'kha' 'ga' 'gha' 'kna' 'cha' 'chha' 'ja' 'jha' 'yna' 'taamatar'
 'thaa' 'daa' 'dhaa' 'adna' 'tabala' 'tha' 'da' 'dha' 'na' 'pa' 'pha' 'ba'
 'bha' 'ma' 'yaw' 'ra' 'la' 'waw' 'motosaw' 'petchiryakha' 'patalosaw'
 'ha' 'chhya' 'tra' 'gya']


We can see that the Devanagari consonants match in number and order, so we can now replace the "dataCopy" labels with the labels from "labelsCopy". This aligns the old labels so that the new label columns can be added. We do this by first creating a dictionary of key-value pairs and then using it for the replacement.

In [45]:
labelDict = {key: value for key, value in zip(uniqueLabels[:36], uniqueLabelsConsonants)}    # Map each "dataCopy" label (key) to the matching "labelsCopy" label (value)
print(labelDict)

{'ka': 'ka', 'kha': 'kha', 'ga': 'ga', 'gha': 'gha', 'kna': 'kna', 'cha': 'cha', 'chha': 'chha', 'ja': 'ja', 'jha': 'jha', 'yna': 'yna', 'taamatar': 'ta', 'thaa': 'tha', 'daa': 'da', 'dhaa': 'dha', 'adna': 'ana', 'tabala': 'taa', 'tha': 'thaa', 'da': 'daa', 'dha': 'dhaa', 'na': 'na', 'pa': 'pa', 'pha': 'pha', 'ba': 'ba', 'bha': 'bha', 'ma': 'ma', 'yaw': 'ya', 'ra': 'ra', 'la': 'la', 'waw': 'va', 'motosaw': 'motosaw', 'petchiryakha': 'petchiryosaw', 'patalosaw': 'patalosaw', 'ha': 'ha', 'chhya': 'ksha', 'tra': 'tra', 'gya': 'gya'}


Now we replace old labels with new labels. 

In [46]:
dataCopy["character"] = dataCopy["character"].replace(labelDict)    # Replace labels in "dataCopy" using the mapping dictionary

In [47]:
print(dataCopy["character"].unique())

['ka' 'kha' 'ga' 'gha' 'kna' 'cha' 'chha' 'ja' 'jha' 'yna' 'ta' 'tha' 'da'
 'dha' 'ana' 'taa' 'thaa' 'daa' 'dhaa' 'na' 'pa' 'pha' 'ba' 'bha' 'ma'
 'ya' 'ra' 'la' 'va' 'motosaw' 'petchiryosaw' 'patalosaw' 'ha' 'ksha'
 'tra' 'gya' '0' '1' '2' '3' '4' '5' '6' '7' '8' '9']
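
Some old and new labels swap places in the mapping (for example 'thaa' becomes 'tha' while 'tha' becomes 'thaa'). As an alternative sketch, `Series.map` looks each value up exactly once, which makes the one-pass behaviour explicit. The series below is a small illustrative subset, not the full column.

```python
import pandas as pd

old = pd.Series(["thaa", "tha", "ka", "7"])
mapping = {"thaa": "tha", "tha": "thaa", "ka": "ka"}  # subset of the real mapping

# map() looks up each value once, so swapped pairs cannot chain into each other.
# Values without a key (here the digit) become NaN and fall back to the original.
new = old.map(mapping).fillna(old)

print(new.tolist())
```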


Giving the columns we merge on the same name makes the join easier.

In [48]:
dataCopy.rename(columns={"character": "Label"}, inplace=True)   # Rename the 'character' column to 'Label'

We left join the tables, with "dataCopy" as the left table. Rows with a matching label receive the extra columns from "labelsCopy", and any row without a matching label gets NULL in those columns.

In [49]:
dataCopy = dataCopy.merge(labelsCopy, on='Label', how='left')   # Left join "labelsCopy" onto "dataCopy" on the "Label" column
dataCopy.head()

Unnamed: 0,pixel_0000,pixel_0001,pixel_0002,pixel_0003,pixel_0004,pixel_0005,pixel_0006,pixel_0007,pixel_0008,pixel_0009,...,pixel_1017,pixel_1018,pixel_1019,pixel_1020,pixel_1021,pixel_1022,pixel_1023,Label,Devanagari Label,Phonetics
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,ka,क,ka
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,ka,क,ka
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,ka,क,ka
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,ka,क,ka
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,ka,क,ka


In [50]:
dataCopy.isna().sum()   # Check for missing values in "dataCopy"

pixel_0000          0
pixel_0001          0
pixel_0002          0
pixel_0003          0
pixel_0004          0
                   ..
pixel_1022          0
pixel_1023          0
Label               0
Devanagari Label    0
Phonetics           0
Length: 1027, dtype: int64
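
The NULL check above confirms that every row found a match. As an optional extra safeguard, `merge` can also verify that each label appears only once in the lookup table and can record where each row came from. The sketch below uses toy frames; `pixels` and `lookup` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for the pixel table and the label lookup table.
pixels = pd.DataFrame({"pixel_0000": [0, 0, 0], "Label": ["ka", "ka", "kha"]})
lookup = pd.DataFrame({"Label": ["ka", "kha"], "Devanagari Label": ["क", "ख"]})

# validate raises an error if a label is repeated in the lookup table;
# indicator adds a "_merge" column describing the origin of each row.
merged = pixels.merge(lookup, on="Label", how="left",
                      validate="many_to_one", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

With `validate="many_to_one"`, a repeated label in the lookup table would raise an error instead of silently duplicating image rows.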

Finally, the data is clean: there are no NULL values and the labels are correct.

In [51]:
dataCopy["Devanagari Label"].unique()

array(['क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड',
       'ढ', 'ण', 'त', 'थ', 'द', 'ध', 'न', 'प', 'फ', 'ब', 'भ', 'म', 'य',
       'र', 'ल', 'व', 'श', 'ष', 'स', 'ह', 'क्ष', 'त्र', 'ज्ञ', '०', '१',
       '२', '३', '४', '५', '६', '७', '८', '९'], dtype=object)

In [52]:
print(dataCopy[["Devanagari Label", "Label", "Phonetics"]].value_counts())
print(len(dataCopy[["Devanagari Label", "Label", "Phonetics"]].value_counts()))

Devanagari Label  Label         Phonetics
क                 ka            ka           2000
स                 patalosaw     sa           2000
ब                 ba            ba           2000
भ                 bha           bha          2000
म                 ma            ma           2000
य                 ya            ya           2000
र                 ra            ra           2000
ल                 la            la           2000
व                 va            va           2000
श                 motosaw       śa           2000
ष                 petchiryosaw  ṣa           2000
ह                 ha            ha           2000
क्ष               ksha          kṣa          2000
०                 0             Śūn'ya       2000
१                 1             ēka          2000
२                 2             du'ī         2000
३                 3             tīna         2000
४                 4             cāra         2000
५                 5             pām̐ca       2000
६       

The correct English labels, Devanagari labels and phonetics have now been added to the main dataset, and there are still no NULL values.

This is perfect! 

Now, while we are at it, let's split this data into a train set and a test set. We use a common split ratio of 80% train and 20% test. We can see that every digit/letter has 2000 samples. To make sure that every digit/letter is equally represented in both sets, we apply the 80/20 split to each digit/letter separately.

In [53]:
trainList = []
testList = []

for label, group in dataCopy.groupby('Label'):
    # Shuffle the group to randomize
    group = group.sample(frac=1, random_state=42).reset_index(drop=True)
    
    # 80% for train (1600), 20% for test (400)
    trainSize = int(0.8 * len(group))
    trainGroup = group[:trainSize]
    testGroup = group[trainSize:]

    trainList.append(trainGroup)
    testList.append(testGroup)

# Combine into train and test DataFrames
trainData = pd.concat(trainList, ignore_index=True)
testData = pd.concat(testList, ignore_index=True)

# Verify the split
print(f"Train set size: {len(trainData)}")
print(f"Test set size: {len(testData)}")
print("Train label distribution:")
print(trainData['Label'].value_counts())
print("Test label distribution:")
print(testData['Label'].value_counts())

Train set size: 73600
Test set size: 18400
Train label distribution:
Label
0               1600
patalosaw       1600
ka              1600
kha             1600
kna             1600
ksha            1600
la              1600
ma              1600
motosaw         1600
na              1600
pa              1600
petchiryosaw    1600
1               1600
pha             1600
ra              1600
ta              1600
taa             1600
tha             1600
thaa            1600
tra             1600
va              1600
ya              1600
jha             1600
ja              1600
ha              1600
gya             1600
2               1600
3               1600
4               1600
5               1600
6               1600
7               1600
8               1600
9               1600
ana             1600
ba              1600
bha             1600
cha             1600
chha            1600
da              1600
daa             1600
dha             1600
dhaa            1600
ga              1600
g
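
The same per-label split can also be written more compactly with `groupby(...).sample`. This is an alternative sketch on toy data, not the code used to build the published dataset. It assumes a unique index and will not select exactly the same rows as the loop above.

```python
import pandas as pd

# Toy data: two labels with 10 samples each (hypothetical values).
toy = pd.DataFrame({
    "Label": ["ka"] * 10 + ["kha"] * 10,
    "pixel_0000": range(20),
})

# Draw 80% of the rows of every label for training...
train = toy.groupby("Label").sample(frac=0.8, random_state=42)
# ...and keep the remaining rows for testing (relies on a unique index).
test = toy.drop(train.index)

print(len(train), len(test))
```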

Finally, the real work with data is done. 

The preprocessed data is freely available on Kaggle.
Kaggle: https://www.kaggle.com/datasets/prabeshsagarbaral/mnistdevanagari 