# %% [markdown]
# My submission for the Bengali.AI challenge available on Kaggle.
# 
# My motivation for this project is to relearn a little of the language I grew up with but have since forgotten through lack of practice. Hopefully some of the components of a grapheme (i.e. a Bengali letter) will be familiar to me.
# 
# The goal of this Kaggle competition is to be able to identify the unique grapheme components of a given grapheme. A grapheme can have 3 components: the root, the vowel diacritic and the consonant diacritic. 

# %% [markdown]
# #### Load Libraries

# %% [code]
# Load libraries 

# Data manipulation 
import numpy as np
import pandas as pd

# Data viz
import matplotlib.pyplot as pt
import matplotlib.image as mpimg
import seaborn as sns

import PIL.Image as Image
import PIL.ImageDraw as ImageDraw
import PIL.ImageFont as ImageFont

import cv2 as cv2

%matplotlib inline

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

from tensorflow import keras
from keras.utils import to_categorical
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Model
from keras.layers import * 
    # will use Input, Conv2D, MaxPool2D, BatchNormalization, Dropout, Flatten, Dense
from keras.callbacks import LearningRateScheduler

# Ignore warnings (oops! my bad)
import warnings
warnings.filterwarnings('ignore')

# Miscellaneous/Ad-hoc
from timeit import default_timer as timer
import math
import gc 

print('Libraries successfully imported!')

# %% [markdown]
# #### Load Data

# %% [code]
# Load data from Kaggle

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
print('Data successfully imported!')

# %% [code]
# Read data into Pandas dataframes

train = pd.read_csv('/kaggle/input/bengaliai-cv19/train.csv')
test = pd.read_csv('/kaggle/input/bengaliai-cv19/test.csv')
class_map = pd.read_csv('/kaggle/input/bengaliai-cv19/class_map.csv')

# %% [markdown]
# ## Data Exploration

# %% [code]
train.shape

# %% [code]
test.shape

# %% [code]
train.columns

# %% [code]
test.columns

# %% [code]
train.head()

# %% [code]
test.head()

# %% [markdown]
# At a high level, the train dataset has a very large number of graphemes for training an ML model, but only a very small number to test it. Each row of the train dataset corresponds to one image of a grapheme, along with the grapheme_root, vowel_diacritic and consonant_diacritic that appear in it. The full grapheme character is also included. 
# 
# The test dataset has a different structure: each image of a grapheme corresponds to 3 rows, one each for the grapheme_root, vowel_diacritic and consonant_diacritic. 
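# 
# As a minimal sketch of the three-rows-per-image layout described above (the image IDs are made up for illustration and are not read from `test.csv`), each image expands into three rows whose `row_id` is the image ID followed by the component name:

# %% [code]
# Hypothetical image IDs, used only to illustrate the layout
demo_image_ids = ['Test_0', 'Test_1']
demo_components = ['consonant_diacritic', 'grapheme_root', 'vowel_diacritic']

# One (row_id, image_id, component) row per image and component
demo_rows = [(image_id + '_' + component, image_id, component)
             for image_id in demo_image_ids
             for component in demo_components]

for demo_row in demo_rows:
    print(demo_row)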

# %% [code]
# Find number of unique components 

num_roots = train.grapheme_root.nunique()
num_v_diacritics = train.vowel_diacritic.nunique()
num_c_diacritics = train.consonant_diacritic.nunique()

print('There are', num_roots, 'grapheme roots in the train dataset.')
print('There are', num_v_diacritics, 'vowel diacritics in the train dataset.')
print('There are', num_c_diacritics, 'consonant diacritics in the train dataset.')

# %% [markdown]
# #### Visualization:

# %% [code]
# Plot distribution of grapheme roots 

sns.set_style('darkgrid')
pt.figure(figsize=(20,10))
sns.countplot(x='grapheme_root', data=train, palette='mako_r')
pt.xticks(rotation=90)
pt.title('Distribution of Grapheme Roots')

# %% [markdown]
# In the train dataset, the number of samples per grapheme root varies enormously. This imbalance may lead to overfitting on some roots and underfitting on others, since the model sees far more examples of the frequent roots than of the rare ones.

# %% [code]
# Plot distribution of vowel diacritics

sns.set_style('darkgrid')
pt.figure(figsize=(20,10))
sns.countplot(x='vowel_diacritic', data=train, palette='mako_r')
pt.xticks(rotation=90)
pt.title('Distribution of Vowel Diacritics')

# %% [markdown]
# Again, it is evident that the data is not uniformly distributed and there might be over/underfitting.

# %% [code]
# Plot distribution of consonant diacritics

sns.set_style('darkgrid')
pt.figure(figsize=(20,10))
sns.countplot(x='consonant_diacritic', data=train, palette='mako_r')
pt.xticks(rotation=90)
pt.title('Distribution of Consonant Diacritics')

# %% [markdown]
# This distribution plot shows a non-uniform distribution, with type 0 being the most frequent. 
# 
# Note that the non-uniformity in the above plots may simply reflect that some components (of the 3) are used more frequently than others across graphemes in general.
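# 
# One rough way to put a number on such imbalance is the ratio between the most and least frequent class. The sketch below uses made-up labels rather than the competition data, and every `demo_` name is hypothetical:

# %% [code]
from collections import Counter

# Made-up labels standing in for a column such as train.consonant_diacritic
demo_class_labels = [0, 0, 0, 0, 0, 0, 1, 2, 2, 3]

demo_counts = Counter(demo_class_labels)
demo_most = max(demo_counts.values())
demo_least = min(demo_counts.values())

# A ratio of 1.0 would mean perfectly balanced classes
demo_imbalance_ratio = demo_most / demo_least
print('Class counts:', dict(demo_counts))
print('Imbalance ratio (most / least frequent):', demo_imbalance_ratio)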

# %% [code]
class_map.shape

# %% [code]
class_map.columns

# %% [code]
class_map.head()

# %% [code]
# Splitting the components into their own datasets

# .copy() so that the 'count' columns added later do not write into views of class_map
roots = class_map.loc[class_map.component_type == 'grapheme_root'].copy()
v_diacritics = class_map.loc[class_map.component_type == 'vowel_diacritic'].copy()
c_diacritics = class_map.loc[class_map.component_type == 'consonant_diacritic'].copy()

# %% [markdown]
# The class_map dataset contains information for identifying each component. It identifies the component type, a unique label and a visual representation. 
# 
# The following visualizations will help to gain a better understanding of graphemes. Only the top 10% of roots are considered, however all diacritics (both vowel and consonant) will be visualized. Note that the following visualizations will utilize external font data (Kalpurush Fonts) which is available on Kaggle as well. 

# %% [code] {"_kg_hide-output":true}
# Grapheme Roots viz 

# Get top 10% of roots
top10 = int(0.10 * len(roots))
roots['count'] = 0 

# Format data 
for label in roots.label.unique():
    label_count = len(train.loc[train.grapheme_root == label])
    roots.loc[label, 'count'] = label_count

roots_sorted = roots.sort_values('count', ascending=False)

# %% [code]
# Vowel Diacritics viz

eleven = 11
v_diacritics['count'] = 0
v_ind_start = 168

# Format data 
for label in v_diacritics.label.unique():
    label_count = len(train.loc[train.vowel_diacritic == label])
    v_diacritics.at[(v_ind_start + label), 'count'] = label_count

# %% [code]
# Consonant Diacritics viz

seven = 7
c_diacritics['count'] = 0
c_ind_start = 179

# Format data 
for label in c_diacritics.label.unique():
    label_count = len(train.loc[train.consonant_diacritic == label])
    c_diacritics.at[(c_ind_start + label), 'count'] = label_count

# %% [markdown]
# The top 10% most frequent grapheme roots, followed by all vowel and consonant diacritics, are:

# %% [code]
roots_sorted.head(top10)

# %% [code]
v_diacritics.head(eleven)

# %% [code]
c_diacritics.head(seven)

# %% [code]
# Clear memory

del test, class_map
del roots, v_diacritics, c_diacritics
gc.collect()

# %% [markdown]
# ## Building the Model

# %% [markdown]
# The CNN will be built using the functional API (the Model class) available in the Keras package. The Sequential model cannot be used because it does not support multiple outputs, which this problem requires.

# %% [code]
# Important variables 

# Compiling the model
optimizer = 'adam' # adaptive optimizer; a common default choice for CNNs
loss_func = 'categorical_crossentropy' # standard loss for one-hot targets with > 2 classes
metric = ['accuracy'] # Keras resolves this to 'categorical_accuracy' for this loss

# Adding layers
img_size = 64
num_channels = 1
ReLU = 'relu'
Softmax = 'softmax'
padding = 'SAME'
k3 = 3 # kernel_size = 3 x 3
k5 = 5 # kernel_size = 5 x 5
p = 2 # pool_size = 2 x 2
m = 0.15 # momentum
drop_rate = 0.3 # dropout
dense1 = 1024
dense2 = 512

# %% [code]
# Build and add layers

# Define input layer
input_shape = (img_size, img_size, num_channels)
first = Input(shape=input_shape)

input_layer = True

# Add layers
for i in [32, 64, 128, 256]:
    # i = number of filters
    if input_layer: # input_shape required if first layer
        model = Conv2D(filters=i, 
                         kernel_size=(k3,k3),
                         padding=padding,
                         activation=ReLU, 
                         input_shape=input_shape)(first)
    else: 
        model = Conv2D(filters=i,
                         kernel_size=(k3,k3),
                         padding=padding,
                         activation=ReLU)(model)
    
    # remaining layers 
    model = Conv2D(filters=i, 
                     kernel_size=(k3, k3), 
                     padding=padding, 
                     activation=ReLU)(model)
    
    model = Conv2D(filters=i, 
                     kernel_size=(k3, k3), 
                     padding=padding, 
                     activation=ReLU)(model)
    
    model = BatchNormalization(momentum=m)(model)
    
    model = MaxPool2D(pool_size=(p, p))(model)
    
    model = Conv2D(filters=i, 
                     kernel_size=(k5, k5), 
                     padding=padding, 
                     activation=ReLU)(model)
    
    if not input_layer:
        model = BatchNormalization(momentum=m)(model)
    
    model = Dropout(rate=drop_rate)(model)
    
    input_layer = False

## done

# More layers
model = Flatten()(model)

model = Dense(dense1,
              activation=ReLU)(model)

model = Dropout(rate=drop_rate)(model)

dense = Dense(dense2,
              activation=ReLU)(model)

# %% [code]
# Define output layers 

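# root output; expected to be auto-named 'dense_3' by Keras (the label dict used in training relies on this name)
root_output = Dense(num_roots, 
                    activation=Softmax)(dense)

# vowel diacritic output; expected to be auto-named 'dense_4'
v_diacritic_output = Dense(num_v_diacritics,
                           activation=Softmax)(dense)

# consonant diacritic output; expected to be auto-named 'dense_5'
c_diacritic_output = Dense(num_c_diacritics,
                           activation=Softmax)(dense)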

# %% [code]
# Declare model 

model = Model(inputs=first, 
              outputs=[root_output, 
                       v_diacritic_output, 
                       c_diacritic_output]) # multi-output

# %% [code]
# Compile model

model.compile(loss=loss_func, optimizer=optimizer, metrics=metric)

# %% [code]
# Visual of model layers 

model.summary()

# %% [markdown]
# Since training involves a large amount of computation, it is important that the learning rate is neither so low that training crawls nor so high that the loss function diverges. To keep the learning rate in a sensible range, a learning rate schedule will be used. This schedule controls the learning rate during training.

# %% [code]
# Create learning schedule

# Function to use as input for LearningRateScheduler, returns a float
def step_decay(epochs):
    
    # set variables
    initial_lrate = 1e-4
    drop = 0.65
    step_size = 10
    
    # calculate learning rate
    lrate = initial_lrate * math.pow(drop, math.floor((1+epochs)/step_size))
    
    return lrate

## done

# Get schedule
LRSched = LearningRateScheduler(step_decay)
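
# %% [markdown]
# To see what this schedule does in practice, the sketch below re-evaluates the same formula at a few epoch indices. `demo_step_decay` is a stand-alone copy written for illustration and is not used in training.

# %% [code]
import math

def demo_step_decay(epoch, initial_lrate=1e-4, drop=0.65, step_size=10):
    # same formula as step_decay: multiply by `drop` once per `step_size` epochs
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / step_size))

for demo_epoch in [0, 8, 9, 19, 29]:
    print('epoch', demo_epoch, '-> learning rate', demo_step_decay(demo_epoch))

# %% [markdown]
# Because of the `1 + epoch` term and Keras passing a zero-based epoch index, the rate first drops at epoch index 9 (the tenth epoch), then again at 19 and 29.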

# %% [markdown]
# ## Data Augmentation

# %% [markdown]
# Data augmentation is the process of adding more data points to the dataset in an attempt to diversify the original dataset. This allows better training of the model since there are more inputs to learn from. 
# 
# Since the model produces multiple outputs, data augmentation must be done in a manner to preserve this quality.

# %% [code]
# Function to restructure y label for input to a data augmentation generator

def restructure_y_labels(dense_dict):
    targets = None
    lengths = {}
    outputs = []
    
    for key, val in dense_dict.items():
        if targets is None:
            targets = val
        else: 
            targets = np.concatenate((targets, val), axis=1)
        
        lengths[key] = val.shape[1]
        outputs.append(key)
    
    return targets, lengths, outputs

## done
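
# %% [markdown]
# A small round trip may make the idea clearer: the per-output labels are glued together column-wise so that a single-output generator can carry them, then cut apart again using the recorded widths. The arrays and all `demo_` names below are invented for illustration.

# %% [code]
import numpy as np

# Two samples, with toy one-hot labels of width 3 and 2
demo_label_dict = {'out_a': np.array([[1, 0, 0], [0, 0, 1]]),
                   'out_b': np.array([[0, 1], [1, 0]])}

# Concatenate along the class axis and remember each block's width
demo_targets = np.concatenate(list(demo_label_dict.values()), axis=1)
demo_lengths = {key: val.shape[1] for key, val in demo_label_dict.items()}

# Slice the combined array back into one block per output
demo_split = {}
demo_ind = 0
for demo_key, demo_len in demo_lengths.items():
    demo_split[demo_key] = demo_targets[:, demo_ind:demo_ind + demo_len]
    demo_ind += demo_len

print(demo_targets.shape)  # (2, 5)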

# %% [markdown]
# The following code has been adapted from Kaushal Shah, who also participated in this challenge: https://tinyurl.com/tk9zhyg. Note that since the following class subclasses ImageDataGenerator, it accepts the same arguments as a single-output generator when performing data augmentation.

# %% [code]
# Code adapted from Kaushal Shah

# Define custom image generator
class MultiOutputImageDataGenerator(ImageDataGenerator):

    # Function to call customized flow method, with parent ImageDataGenerator
    def flow(self,
             x,
             y=None,
             batch_size=32, #default
             shuffle=True, #default
             sample_weight=None,
             subset=None):

        # restructure the target labels first
        targets, target_lengths, ordered_outputs = restructure_y_labels(y)

        # generate batches of augmented data using flow from parent class (i.e. super())
        for x_batch, y_batch in super().flow(x, targets, batch_size=batch_size, shuffle=shuffle):
            target_dict = {}
            ind = 0
            for output in ordered_outputs:
                l = target_lengths[output]
                target_dict[output] = y_batch[:, ind:(ind+l)]
                ind += l
            
            # yield batches 
            yield x_batch, target_dict

## done

# %% [code]
# Perform data augmentation to generate more data

# Create an instance of MultiOutputImageDataGenerator
datagen = MultiOutputImageDataGenerator(
    featurewise_center=False, 
    samplewise_center=False, 
    featurewise_std_normalization=False,
    samplewise_std_normalization=False, 
    zca_whitening=False,
    rotation_range=10, 
    width_shift_range=0.20, 
    height_shift_range=0.20,
    zoom_range=0.20,
    horizontal_flip=False, 
    vertical_flip=False)

## done

# %% [markdown]
# ## Training the CNN

# %% [markdown]
# After defining the basics of the CNN, the model must be fitted using the data. Note that the data is available through 4 parquet files, so training is done in 4 steps, i.e. 4 loop iterations.

# %% [code]
# Important variables

epochs = 32
batch_size = 64

w = 236
h = 137

s = 0.12 # test size

model_history = {} # will collect history for plotting

# %% [markdown]
# #### Process Original Train:

# %% [code]
# Processing original train dataset 

# Drop grapheme column; not needed for training 
to_drop = ['grapheme']
train = train.drop(to_drop, axis=1, inplace=False)

# Cast labels to unsigned 8-bit integers
cnames = ['grapheme_root', 'vowel_diacritic', 'consonant_diacritic']
train[cnames] = train[cnames].astype('uint8')

# %% [markdown]
# #### Resize Data:

# %% [markdown]
# The image data must be put into a format that a CNN can consume, and the images must be resized from their original 137 x 236 size to 64 x 64.

# %% [code]
# Function to resize images to indicated size for a specified dataframe

def resize_img(df, size=64, width=236, height=137):
    resized_img_dict = {}
    l = df.shape[0]
    for i in range(l):
        img = cv2.resize(df.loc[df.index[i]].values.reshape(height, width),
                         (size,size),
                         interpolation=cv2.INTER_AREA)
        resized_img_dict[df.index[i]] = img.reshape(-1)
    return pd.DataFrame(resized_img_dict).T

## done

# this function will be called during training 
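
# %% [markdown]
# The shape bookkeeping around this function can be sketched with NumPy alone. The pixel values below are random placeholders, and the resize step itself is left out (it needs OpenCV), so this only shows how a flat row becomes an image and how flattened 64 x 64 images become a batch in the layout Keras expects.

# %% [code]
import numpy as np

demo_height, demo_width, demo_size = 137, 236, 64

# One flattened grayscale image, as stored in a parquet row (137 * 236 = 32332 pixels)
demo_flat = np.random.randint(0, 256, size=demo_height * demo_width, dtype=np.uint8)
demo_img = demo_flat.reshape(demo_height, demo_width)

# Three already-resized images, flattened to 64 * 64 values each
demo_resized = np.random.randint(0, 256, size=(3, demo_size * demo_size), dtype=np.uint8)

# Scale to [0, 1] and reshape to (batch, height, width, channels)
demo_batch = (demo_resized / 255).reshape(-1, demo_size, demo_size, 1)

print(demo_img.shape, demo_batch.shape)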

# %% [markdown]
# #### Load Training Data:

# %% [code]
# Function to load and format raw parquet data (for training)

def format_parquet_data(i, train):
    start = timer()
    # read in data
    file_path = '/kaggle/input/bengaliai-cv19/train_image_data_' + str(i) + '.parquet'
    parq_raw = pd.read_parquet(file_path)
    
    # need to combine with train using foreign key 'image_id'
    train_ret = pd.merge(parq_raw, train, on='image_id')
    
    # drop 'image_id' column
    to_drop = ['image_id'] # not needed for training
    train_ret.drop(to_drop, axis=1, inplace=True)
    
    end = timer()
    print('Time to load train_' + str(i) + ': ', int((end-start)//60), 'm', 
          int(round((end-start)%60, 0)), 's')
    
    return train_ret

## done

# %% [markdown]
# #### Fit and Train Model:

# %% [markdown]
# As mentioned above, this will be done in 4 iterations for each of the 4 training sets.

# %% [code]
# Training the model

# Loop for training 
start = timer()
for i in range(4):
    
    start_train = timer()
    start_proc = timer()
    
    # get train 
    train_curr = format_parquet_data(i, train)
    
    # resize images 
    to_drop = ['grapheme_root', 'vowel_diacritic', 'consonant_diacritic']
    X = train_curr.drop(to_drop, axis=1, inplace=False)
    X = resize_img(X)
   
    # normalize images
    X = X/255
    
    # one-hot encode columns, returns target labels
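    # (num_classes is fixed so every parquet chunk yields the same output width)
    Y_roots = to_categorical(train_curr.grapheme_root, num_classes=num_roots)
    Y_v_diacritics = to_categorical(train_curr.vowel_diacritic, num_classes=num_v_diacritics)
    Y_c_diacritics = to_categorical(train_curr.consonant_diacritic, num_classes=num_c_diacritics)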
    
    del train_curr
    
    # Keras is batch first, use -1 for dynamic
    X = X.values.reshape(-1, img_size, img_size, num_channels)
    
    # split into train and test
    train_X, test_X, train_y_r, test_y_r, train_y_v, test_y_v, train_y_c, test_y_c = train_test_split(X, 
                                                                                                     Y_roots, 
                                                                                                     Y_v_diacritics, 
                                                                                                     Y_c_diacritics, 
                                                                                                     test_size=s, 
                                                                                                     random_state=42)
     
    del X
    del Y_roots, Y_v_diacritics, Y_c_diacritics
    
    end_proc = timer()
    print('Time to process train_' + str(i) + ': ', int((end_proc-start_proc)//60), 'm', 
          int(round((end_proc-start_proc)%60, 0)), 's')
    
    # fit the model
    start_fit = timer()
    datagen.fit(train_X)
    
    # target labels 
    labels = {'dense_3': train_y_r, 'dense_4': train_y_v, 'dense_5': train_y_c}
    
    history_curr = model.fit_generator(datagen.flow(x=train_X, y=labels, batch_size=batch_size),
                                       epochs=epochs, 
                                       validation_data=(test_X,[test_y_r, test_y_v, test_y_c]), 
                                       steps_per_epoch=train_X.shape[0]//batch_size, 
                                       callbacks=[LRSched])
    
    end_train = timer()
    
    print('Time to train model for train_' + str(i) + ': ', int((end_train-start_train)//60), 'm', 
          int(round((end_train-start_train)%60, 0)), 's')
    
    # add to history dictionary for model analysis
    model_history[i] = history_curr
    
    del train_X, test_X
    del train_y_r, test_y_r, train_y_v, test_y_v, train_y_c, test_y_c 
    
    gc.collect()
    
## done

end = timer()
print('Time to train model overall: ', int((end-start)//60), 'm', 
      int(round((end-start)%60, 0)), 's')

# %% [markdown]
# #### Model Analysis:

# %% [code]
# Function to plot model accuracy as each epoch is calculated

def plot_model_accuracy(model_history, epochs, i):
    
    # accept either a Keras History object or its plain history dict
    history = getattr(model_history, 'history', model_history)
    
    # setup plots
    sns.set_style('darkgrid')
    fig, ax = pt.subplots()
    xvalues = list(range(epochs))
    labels = ['Train Root Accuracy', 'Train Vowel Accuracy', 'Train Consonant Accuracy',
             'Test Root Accuracy', 'Test Vowel Accuracy', 'Test Consonant Accuracy']
    
    # add lines
    sns.lineplot(x=xvalues, y=history['dense_3_accuracy'], ax=ax)
    sns.lineplot(x=xvalues, y=history['dense_4_accuracy'], ax=ax)
    sns.lineplot(x=xvalues, y=history['dense_5_accuracy'], ax=ax)
    sns.lineplot(x=xvalues, y=history['val_dense_3_accuracy'], ax=ax)
    sns.lineplot(x=xvalues, y=history['val_dense_4_accuracy'], ax=ax)
    sns.lineplot(x=xvalues, y=history['val_dense_5_accuracy'], ax=ax)
    
    # add labels
    pt.title('Accuracy for Train_' + str(i))
    pt.xlabel('Epoch')
    pt.ylabel('Accuracy')
    fig.legend(labels=labels)
    
    pt.show()
    
## done

# %% [code]
# Plot accuracy

for i in range(4):
    plot_model_accuracy(model_history[i], epochs, i)

# %% [markdown]
# ## Submission

# %% [markdown]
# Make predictions and format them in the way Kaggle expects. 

# %% [code]
# Important variables: 

grapheme_comp = ['consonant_diacritic', 'grapheme_root', 'vowel_diacritic'] # in order of submission

row_id = [] # name of row in submission.csv
target = [] # prediction in submission.csv

# %% [code]
# Function to load and format raw parquet data (for testing)

def format_parquet_data_test(i):
    # read in data
    file_path = '/kaggle/input/bengaliai-cv19/test_image_data_' + str(i) + '.parquet'
    parq_raw = pd.read_parquet(file_path)
    
    # set index
    parq_raw.set_index('image_id', inplace=True)
    
    # process dataframe (resize_img is the resizing function defined earlier)
    test = resize_img(parq_raw)
    test = test/255
    test = test.values.reshape(-1, img_size, img_size, num_channels)

    return test

## done

# Define test datasets 
test_0 = format_parquet_data_test(0)
test_1 = format_parquet_data_test(1)
test_2 = format_parquet_data_test(2)
test_3 = format_parquet_data_test(3)

# %% [code]
# Create dictionary for storing intermediary predictions

pred_storage = {'grapheme_root': [],
                'vowel_diacritic': [],
                'consonant_diacritic': []}

# %% [code]
# Loop for testing and formatting for submission.csv

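for i, test in enumerate([test_0, test_1, test_2, test_3]):
    
    # make predictions
    predictions = model.predict(test) # returns 3 numpy arrays: [root, vowel, consonant]

    # the test arrays no longer carry their image ids, so re-read just that column
    # (rows are in the same order as in format_parquet_data_test)
    file_path = '/kaggle/input/bengaliai-cv19/test_image_data_' + str(i) + '.parquet'
    image_ids = pd.read_parquet(file_path, columns=['image_id'])['image_id'].values

    # position of each component in the model's output list
    pred_index = {'grapheme_root': 0, 'vowel_diacritic': 1, 'consonant_diacritic': 2}

    # add predictions to lists
    for j, image_id in enumerate(image_ids):
        for comp in grapheme_comp:
            
            # get name and prediction value
            # (image_id is assumed to already include the 'Test_' prefix, e.g. 'Test_0')
            name = str(image_id) + '_' + comp
            loc = predictions[pred_index[comp]][j]
            
            # append name and prediction value
            row_id.append(name)
            target.append(np.argmax(loc))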
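
# %% [markdown]
# The mapping from model outputs to submission rows can be checked on toy data. In the sketch below the probabilities and image IDs are invented, and the output order (root, vowel, consonant) is assumed to match the order used when the model was declared.

# %% [code]
import numpy as np

# Invented softmax outputs for two images, one array per model output
demo_predictions = [np.array([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]),  # root
                    np.array([[0.6, 0.4], [0.3, 0.7]]),            # vowel
                    np.array([[0.2, 0.8], [0.9, 0.1]])]            # consonant
demo_ids = ['Test_0', 'Test_1']

# Position of each component in the model's output list
demo_output_index = {'grapheme_root': 0, 'vowel_diacritic': 1, 'consonant_diacritic': 2}
demo_order = ['consonant_diacritic', 'grapheme_root', 'vowel_diacritic']

demo_row_id, demo_target = [], []
for demo_j, demo_id in enumerate(demo_ids):
    for demo_comp in demo_order:
        # pick the probability vector for this image and component
        demo_probs = demo_predictions[demo_output_index[demo_comp]][demo_j]
        demo_row_id.append(demo_id + '_' + demo_comp)
        demo_target.append(int(np.argmax(demo_probs)))

print(list(zip(demo_row_id, demo_target)))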

# %% [code]
# Create dataframe and submission.csv

# Format into dataframe
submission = pd.DataFrame(data={'row_id': row_id,
                                'target':target},
                          columns = ['row_id','target'])

# Ensure correct formatting
submission.head(10)

# Write to submission.csv
submission.to_csv('submission.csv',index=False)

# %% [markdown]
# Submit predictions!

# %% [code]
class_map = pd.read_csv("../input/bengaliai-cv19/class_map.csv")
sample_submission = pd.read_csv("../input/bengaliai-cv19/sample_submission.csv")
test = pd.read_csv("../input/bengaliai-cv19/test.csv")
train = pd.read_csv("../input/bengaliai-cv19/train.csv")