<h1><center><font size="6">Tensorflow/Keras/GPU for Chinese MNIST Prediction</font></center></h1>


# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Prepare the data analysis</a>   
- <a href='#3'>Data exploration</a>   
- <a href='#4'>Characters classification</a>       
- <a href='#5'>Conclusions</a>       


# <a id='1'>Introduction</a>  


The objective of the Kernel is to take us through the steps of a machine learning analysis.   


We will use a dataset of annotated images of Chinese numbers, handwritten by 100 volunteers. Each volunteer provided 10 samples, and each sample contains a complete set of 15 Chinese characters for numbers.

The Chinese characters are the following:
* 零 - for 0  
* 一 - for 1
* 二 - for 2  
* 三 - for 3  
* 四 - for 4  
* 五 - for 5  
* 六 - for 6  
* 七 - for 7  
* 八 - for 8  
* 九 - for 9  
* 十 - for 10
* 百 - for 100
* 千 - for 1000
* 万 - for 10,000
* 亿 - for 100,000,000



We start by preparing the analysis (loading the libraries and the data) and continue with an Exploratory Data Analysis (EDA).

We then follow with feature engineering and preparation for creating a model. The dataset is split into training, validation and test sets. 

We run a model using Tensorflow through the Keras interface, with GPU acceleration, using Dropout layers, a variable learning rate and early stopping, and we save the best model based on validation accuracy.

At the end, we use the best model to predict for the test set.

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='2'>Prepare the data analysis</a>   


Before starting the analysis, we need to make a few preparations: load the packages, then load and inspect the data.



# <a id='21'>Load packages</a>

We load the packages used for the analysis.


In [144]:
import pandas as pd
import numpy as np
import sys
import os
import random
from pathlib import Path
import imageio
import skimage
import skimage.io
import skimage.transform
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import scipy
from sklearn.model_selection import train_test_split
from sklearn import metrics
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten, MaxPool2D, Dropout, BatchNormalization,LeakyReLU
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping, ReduceLROnPlateau, LearningRateScheduler
# from keras.utils import to_categorical
# import tensorflow_addons as tfa
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import PIL
import PIL.Image
import tensorflow as tf
import pathlib

In [145]:
data_dir = 'C://Users//dscshap3808//Documents//my_scripts_new//my_docs//chines//data//data//'
data_dir1 = pathlib.Path(data_dir)

In [146]:
# pic = list(data_dir1.glob('*.jpg'));pic
# PIL.Image.open(str(pic[0]))

In [147]:
data_df = pd.read_csv('C://Users//dscshap3808//Documents//my_scripts_new//my_docs//chines//chinese_mnist.csv')

We also set a number of parameters for the data and model.

In [148]:
# IMAGE_PATH = '..//input//chinese-mnist//data//data//'
IMAGE_WIDTH = 64
IMAGE_HEIGHT = 64
IMAGE_CHANNELS = 1
RANDOM_STATE = 42
TEST_SIZE = 0.2
VAL_SIZE = 0.2
CONV_2D_DIM_1 = 16
CONV_2D_DIM_2 = 16
CONV_2D_DIM_3 = 32
CONV_2D_DIM_4 = 64
MAX_POOL_DIM = 2
KERNEL_SIZE = 3
BATCH_SIZE = 32
NO_EPOCHS = 5 
DROPOUT_RATIO = 0.5
PATIENCE = 5
VERBOSE = 1

<a href="#0"><font size="1">Go to top</font></a>  


# <a id='22'>Load the data</a>  

The root directory contains a dataset file and a folder with images. The dataset file was already loaded into `data_df` above.

Let's glimpse the data. First, let's check the number of rows and columns.

In [149]:
data_df.shape

(15000, 4)

There are 15000 rows and 4 columns (the unnamed first column in the output below is the row index). Let's look at the data.

In [150]:
data_df.sample(100).head()

Unnamed: 0,suite_id,sample_id,code,value
835,84,5,10,9
2154,22,4,12,100
2305,36,5,12,100
802,81,2,10,9
5496,53,6,15,100000000


In [152]:
characters = ['零','一','二','三','四','五','六','七','八','九','十','百','千','万','亿']
dicts = pd.DataFrame(characters, np.append(np.arange(0,10), [10, 100, 1000, 10000, 100000000])).reset_index()
dicts.columns =  ['value', 'characters']
data_df = data_df.merge(dicts, on = 'value', how = 'left')


In [153]:
data_df[data_df['value'] == 1000 ]

Unnamed: 0,suite_id,sample_id,code,value,characters
3000,1,1,13,1000,千
3001,1,10,13,1000,千
3002,1,2,13,1000,千
3003,1,3,13,1000,千
3004,1,4,13,1000,千
...,...,...,...,...,...
3995,99,5,13,1000,千
3996,99,6,13,1000,千
3997,99,7,13,1000,千
3998,99,8,13,1000,千


The data contains the following values:  

* suite_id - each suite corresponds to a set of handwritten samples by one volunteer;  
* sample_id - each sample contains a complete set of 15 characters for Chinese numbers;
* code - for each Chinese character we are using a code, with values from 1 to 15;
* value - this is the actual numerical value associated with the Chinese character for number;  
* characters - the Chinese character (added above by merging on `value`);  

We index the files in the dataset by forming a file name from suite_id, sample_id and code. The pattern for a file name is as follows:

> "input_{suite_id}_{sample_id}_{code}.jpg"

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='3'>Data exploration</a>  



Let's start by checking if there are missing data, unlabeled data or data that is inconsistently labeled. 


## <a id='31'>Check for missing data</a>  

Let's create a function that checks for missing data in the dataset.

In [154]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data(data_df)

Unnamed: 0,Total,Percent
suite_id,0,0.0
sample_id,0,0.0
code,0,0.0
value,0,0.0
characters,0,0.0


There is no missing (null) data in the dataset. Some of the data labels might still be misspelled or inconsistent; we will check this when we analyze each data feature.
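A null check does not reveal labels that are present but inconsistent. The following is a minimal sketch of one such check, assuming that each `code` should always be paired with the same `value`. The helper `inconsistent_codes` and the toy rows are hypothetical and only stand in for `data_df`.

```python
import pandas as pd

# Toy rows standing in for data_df; the last row is deliberately inconsistent.
toy_df = pd.DataFrame({
    "code":  [10, 10, 13, 13, 13],
    "value": [9, 9, 1000, 1000, 100],
})

def inconsistent_codes(df):
    """Return the codes that are paired with more than one distinct value."""
    values_per_code = df.groupby("code")["value"].nunique()
    return sorted(values_per_code[values_per_code > 1].index.tolist())

print(inconsistent_codes(toy_df))  # code 13 appears with two different values
```

On a consistently labeled dataset the function returns an empty list.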

<a href="#0"><font size="1">Go to top</font></a>  

## <a id='32'>Explore image data</a>  

Let's also check the image data. First, we check how many images are stored in the image folder.

In [155]:
image_files =  os.listdir(data_dir)
print("Number of image files: {}".format(len(image_files)))

Number of image files: 15000


Let's also check that each line in the dataset has a corresponding image in the image list.  
First, we will have to compose the name of the file from the indexes.

In [156]:
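def create_file_name(x):
    # Use column labels rather than positions, so the function does not
    # depend on the column order of the DataFrame.
    file_name = f"input_{x['suite_id']}_{x['sample_id']}_{x['code']}.jpg"
    return file_name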

In [157]:
data_df["file"] = data_df.apply(create_file_name, axis=1)

In [158]:
file_names = data_df['file']
print("Matching image names: {}".format(len(set(file_names).intersection(image_files))))

Matching image names: 15000
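As a complement to composing file names, the pattern can also be inverted to recover the three indexes from a file name. This is a small sketch under the assumption that all three parts are plain integers; the names `compose_file_name` and `parse_file_name` are hypothetical helpers, not part of the notebook.

```python
import re

# Pattern taken from the text: input_{suite_id}_{sample_id}_{code}.jpg
FILE_PATTERN = re.compile(r"^input_(\d+)_(\d+)_(\d+)\.jpg$")

def compose_file_name(suite_id, sample_id, code):
    """Build the image file name from the three indexes."""
    return f"input_{suite_id}_{sample_id}_{code}.jpg"

def parse_file_name(file_name):
    """Recover (suite_id, sample_id, code) from an image file name."""
    match = FILE_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name!r}")
    suite_id, sample_id, code = (int(group) for group in match.groups())
    return suite_id, sample_id, code

name = compose_file_name(1, 10, 13)
print(name)
print(parse_file_name(name))
```

Parsing the names found on disk is one way to spot stray files that do not follow the pattern.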


Let's also check the image sizes.

In [159]:
IMAGE_PATH = data_dir
def read_image_sizes(file_name):
    image = skimage.io.imread(IMAGE_PATH + file_name)
    return list(image.shape)

In [160]:
m = np.stack(data_df['file'].apply(read_image_sizes))
df = pd.DataFrame(m,columns=['w','h'])
data_df = pd.concat([data_df,df],axis=1, sort=False)

In [161]:
data_df.head()

Unnamed: 0,suite_id,sample_id,code,value,characters,file,w,h
0,1,1,10,9,九,input_1_1_10.jpg,64,64
1,1,10,10,9,九,input_1_10_10.jpg,64,64
2,1,2,10,9,九,input_1_2_10.jpg,64,64
3,1,3,10,9,九,input_1_3_10.jpg,64,64
4,1,4,10,9,九,input_1_4_10.jpg,64,64
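One detail about the column names: a grayscale image read into a NumPy array has shape (rows, columns), that is (height, width). For the square 64 x 64 images here the order makes no difference, but the sketch below, using a made-up non-square array, shows which value comes first.

```python
import numpy as np

# A made-up grayscale image with 48 rows and 64 columns.
image = np.zeros((48, 64), dtype=np.uint8)

# The first axis is the height (rows), the second is the width (columns).
height, width = image.shape
print(height, width)
```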


## <a id='33'>Suites</a>  

Let's check the suites of the images. For this, we count the distinct `suite_id` values and list the distinct `sample_id` values.

In [162]:
print(f"Number of suites: {data_df.suite_id.nunique()}")
print(f"Samples: {data_df.sample_id.unique()}")

Number of suites: 100
Samples: [ 1 10  2  3  4  5  6  7  8  9]


We have 100 suites, each with 10 samples, and each sample contains 15 characters. This means a total of 100 x 10 x 15 = 15K images with Chinese characters.
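The count can be double-checked by grouping. The sketch below builds a synthetic index with the same layout as the dataset (no images are read) and verifies that every (suite, sample) pair holds 15 characters. The name `index_df` is an assumption made for illustration.

```python
import itertools
import pandas as pd

N_SUITES, N_SAMPLES, N_CODES = 100, 10, 15

# Synthetic index: one row per (suite_id, sample_id, code) combination.
rows = list(itertools.product(range(1, N_SUITES + 1),
                              range(1, N_SAMPLES + 1),
                              range(1, N_CODES + 1)))
index_df = pd.DataFrame(rows, columns=["suite_id", "sample_id", "code"])

# Number of images in each (suite, sample) pair.
images_per_sample = index_df.groupby(["suite_id", "sample_id"]).size()

print(len(index_df))
print(len(images_per_sample))
print((images_per_sample == N_CODES).all())
```

The same `groupby` applied to `data_df` would flag any sample with missing or duplicated characters.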

# <a id='4'>Characters classification</a>

Our objective is to use the images that we investigated until now to correctly identify the Chinese numbers (characters).   

We have a single dataset, which we will have to split into **train** and **test**. The **train** set will be used for training a model and the **test** set will be used for testing the model accuracy against new, fresh data, not used in training.



## <a id='40'>Split the data</a>  

First, we split the whole dataset into train and test. We will use **random_state** to ensure reproducibility of results. We also use **stratify** to ensure balanced train/validation/test sets with respect to the labels. 

The train-test split is **80%** for training set and **20%** for test set.


In [163]:
train_df, test_df = train_test_split(
    data_df, test_size=TEST_SIZE, 
    random_state=RANDOM_STATE, 
    stratify=data_df["code"].values)

Next, we further split the **train** set into **train** and **validation**. The validation set lets us measure not only how well the model fits the training data (how well it `learns` the training data) but also how well it generalizes, so that we can understand not only the bias but also the variance of the model.  

The train-validation split is **80%** for training set and **20%** for validation set.

In [164]:
train_df, val_df = train_test_split(
    train_df, test_size=VAL_SIZE, 
    random_state=RANDOM_STATE,
     stratify=train_df["code"].values)

Let's check the shape of the three datasets.

In [165]:
print("Train set rows: {}".format(train_df.shape[0]))
print("Test  set rows: {}".format(test_df.shape[0]))
print("Val   set rows: {}".format(val_df.shape[0]))

Train set rows: 9600
Test  set rows: 3000
Val   set rows: 2400
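The row counts follow directly from the two 20% splits. The arithmetic below is a sketch that assumes the 15 classes are perfectly balanced (1,000 images each), which is what stratification preserves.

```python
TOTAL_IMAGES = 15000
N_CLASSES = 15
TEST_SIZE = 0.2
VAL_SIZE = 0.2

# First split: 20% of everything goes to the test set.
n_test = round(TOTAL_IMAGES * TEST_SIZE)
# Second split: 20% of the remaining rows go to the validation set.
n_val = round((TOTAL_IMAGES - n_test) * VAL_SIZE)
n_train = TOTAL_IMAGES - n_test - n_val

print(n_train, n_val, n_test)

# With stratification, each class keeps the same proportions.
per_class = (n_train // N_CLASSES, n_val // N_CLASSES, n_test // N_CLASSES)
print(per_class)

# Share of the full dataset used by each subset.
print(n_train / TOTAL_IMAGES, n_val / TOTAL_IMAGES, n_test / TOTAL_IMAGES)
```

The validation set is therefore 16% of the full dataset, not 20%, because the second split is applied to the training portion only.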


We are now ready to start building our first model.

## <a id='41'>Build the model</a>    


This is the next step in creating our predictive model.  

Let's define a few auxiliary functions that we will need for creating our models.

A function for reading images from the image files and scaling all images to 64 x 64 x 1 (channels), as set by `IMAGE_WIDTH` and `IMAGE_HEIGHT`.

In [61]:
def read_image(file_name):
    image = skimage.io.imread(IMAGE_PATH + file_name)
    image = skimage.transform.resize(image, (IMAGE_WIDTH, IMAGE_HEIGHT, 1), mode='reflect')
    return image[:,:,:]

A function to create the dummy variables corresponding to the categorical target variable.

In [66]:
def categories_encoder(dataset, var='value'):
    X = np.stack(dataset['file'].apply(read_image))
    y = pd.get_dummies(dataset[var], drop_first=False)
    return X, y
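It helps to know how the dummy columns are ordered, because the class indexes produced later by `argmax` refer to column positions. The sketch below uses a handful of made-up labels in place of `dataset['value']` and shows that `pd.get_dummies` sorts the columns by value, so an index can be mapped back to the original label.

```python
import numpy as np
import pandas as pd

# A few labels in arbitrary order, standing in for dataset['value'].
values = pd.Series([100, 0, 9, 100000000, 10, 0])

one_hot = pd.get_dummies(values, drop_first=False)
print(list(one_hot.columns))   # columns are sorted by value

# argmax gives the column position of each row's label ...
class_index = np.argmax(one_hot.values, axis=1)
# ... and the column labels map that position back to the value.
decoded = one_hot.columns[class_index]

print(class_index.tolist())
print(decoded.tolist())
```

Under this ordering, and assuming all 15 values are present, class index 0 corresponds to the value 0 and class index 14 to 100000000.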

Let's now populate the train, val and test sets with the image data and create the dummy variables corresponding to the categorical target variable, in our case `value`.

In [67]:
X_train, y_train = categories_encoder(train_df)
X_val, y_val = categories_encoder(val_df)
X_test, y_test = categories_encoder(test_df)

Now we are ready to start creating our model.  

In [72]:
# IMAGE_PATH = '..//input//chinese-mnist//data//data//'
IMAGE_WIDTH = 64
IMAGE_HEIGHT = 64
IMAGE_CHANNELS = 1
RANDOM_STATE = 42
TEST_SIZE = 0.2
VAL_SIZE = 0.2
CONV_2D_DIM_1 = 16
CONV_2D_DIM_2 = 16
CONV_2D_DIM_3 = 32
CONV_2D_DIM_4 = 64
MAX_POOL_DIM = 2
KERNEL_SIZE = 3
BATCH_SIZE = 32
NO_EPOCHS = 5 
DROPOUT_RATIO = 0.2
PATIENCE = 5
VERBOSE = 1

In [73]:
model=Sequential()
model.add(Conv2D(CONV_2D_DIM_1, kernel_size=KERNEL_SIZE, \
    input_shape=(IMAGE_WIDTH, IMAGE_HEIGHT,IMAGE_CHANNELS), 
    activation='relu', padding='same'))
model.add(Conv2D(CONV_2D_DIM_2, kernel_size=KERNEL_SIZE, \
    activation='relu', padding='same'))
    
model.add(MaxPool2D(MAX_POOL_DIM))
model.add(Dropout(DROPOUT_RATIO))
model.add(Conv2D(CONV_2D_DIM_2, kernel_size=KERNEL_SIZE, activation='relu', padding='same'))
model.add(Conv2D(CONV_2D_DIM_2, kernel_size=KERNEL_SIZE, activation='relu', padding='same'))
model.add(Dropout(DROPOUT_RATIO))
model.add(Flatten())
model.add(Dense(y_train.columns.size, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [74]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 64, 64, 16)        160       
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 64, 64, 16)        2320      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 32, 32, 16)        0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 32, 32, 16)        0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 32, 32, 16)        2320      
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 32, 32, 16)        2320      
_________________________________________________________________
dropout_3 (Dropout)          (None, 32, 32, 16)       

We are using the predefined number of epochs for this experiment (`NO_EPOCHS` = 5).

The model uses 2 Convolutional layers, followed by a MaxPool and a Dropout, then another 2 Convolutional layers and a Dropout. A Flatten and a Dense layer follow. We compile the model with the **Adam** optimizer and use a categorical crossentropy loss function. The metric used is accuracy.
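The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for a convolutional layer with bias and for a fully connected layer. The Dense figure is computed from the formula, assuming 15 output classes and a 32 x 32 x 16 input to Flatten; it is not taken from the summary output.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights of a square-kernel Conv2D layer plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights of a Dense layer plus one bias per unit."""
    return (inputs + 1) * units

first_conv = conv2d_params(3, 1, 16)    # 160, first Conv2D (1 input channel)
other_conv = conv2d_params(3, 16, 16)   # 2320, each later Conv2D (16 channels in)
flattened = 32 * 32 * 16                # size of the Flatten output
dense = dense_params(flattened, 15)     # 245775, under the assumptions above

print(first_conv, other_conv, flattened, dense)
```

Almost all of the parameters sit in the final Dense layer, since Flatten feeds it every activation of the last convolutional block.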

We also use a learning rate scheduler, so that the learning rate varies with the epoch number. 
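To get a feel for such a schedule, the sketch below evaluates an exponential decay of the form `1e-3 * 0.99 ** (epoch + NO_EPOCHS)`, the expression used by the notebook's `annealer`, for epochs counted from 0. The function name `learning_rate` is an assumption made for illustration.

```python
NO_EPOCHS = 5

def learning_rate(epoch, base_lr=1e-3, decay=0.99, offset=NO_EPOCHS):
    """Learning rate for a given epoch index (0-based)."""
    return base_lr * decay ** (epoch + offset)

# Print the learning rate used in each of the 5 epochs.
for epoch in range(NO_EPOCHS):
    print(epoch, f"{learning_rate(epoch):.6f}")
```

With a decay factor of 0.99 the rate drops by only 1% per epoch, so over 5 epochs it stays close to its starting value.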

At each training epoch, early stopping checks the monitored quantity and, based on its evolution, decides whether to stop the training or continue, with a predefined `patience` factor: we only stop if the monitored quantity has not improved for a certain number of epochs (we set the patience to 5). Note that in the code below `EarlyStopping` monitors the training loss (`monitor='loss'`), while the checkpoint monitors the validation accuracy: whenever the validation accuracy improves, we save the current model weights. We will then load the best weights and use them for prediction on the test set.
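The patience rule can be illustrated with a simplified sketch. The function `stopping_epoch` is hypothetical and is not the Keras implementation; it assumes that lower values are better and that any decrease counts as an improvement.

```python
def stopping_epoch(losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# The loss stops improving after epoch 3, so training stops 5 epochs later.
long_run = [0.9, 0.5, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45]
print(stopping_epoch(long_run, patience=5))

# With only 5 epochs, a patience of 5 can never be exhausted.
short_run = [0.9, 0.5, 0.4, 0.41, 0.42]
print(stopping_epoch(short_run, patience=5))
```

Under this rule, a run limited to 5 epochs with a patience of 5 always trains to the end, because the first epoch always counts as an improvement.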

In [75]:
annealer = LearningRateScheduler(lambda x: 1e-3 * 0.99 ** (x+NO_EPOCHS))
earlystopper = EarlyStopping(monitor='loss', patience=PATIENCE, verbose=VERBOSE)
checkpointer = ModelCheckpoint('best_model.h5',
                                monitor='val_accuracy',
                                verbose=VERBOSE,
                                save_best_only=True,
                                save_weights_only=True)

In [78]:
%%time
train_model  = model.fit(X_train, y_train,
                  batch_size=BATCH_SIZE,
                  epochs=NO_EPOCHS,
                  verbose=1,
                  validation_data=(X_val, y_val),
                  callbacks=[earlystopper, checkpointer, annealer])

Epoch 1/5

Epoch 00001: val_accuracy improved from -inf to 0.77083, saving model to best_model.h5
Epoch 2/5

Epoch 00002: val_accuracy improved from 0.77083 to 0.85292, saving model to best_model.h5
Epoch 3/5

Epoch 00003: val_accuracy improved from 0.85292 to 0.89375, saving model to best_model.h5
Epoch 4/5

Epoch 00004: val_accuracy improved from 0.89375 to 0.91375, saving model to best_model.h5
Epoch 5/5

Epoch 00005: val_accuracy improved from 0.91375 to 0.92458, saving model to best_model.h5
Wall time: 6min 20s


<a href="#0"><font size="1">Go to top</font></a>  


## <a id='42'>Model evaluation</a> 


Let's start by plotting the accuracy and the loss for the train and validation sets. 
We define a function to visualize these values.

In [79]:
def create_trace(x,y,ylabel,color):
        trace = go.Scatter(
            x = x,y = y,
            name=ylabel,
            marker=dict(color=color),
            mode = "markers+lines",
            text=x
        )
        return trace
    
def plot_accuracy_and_loss(train_model):
    hist = train_model.history
    acc = hist['accuracy']
    val_acc = hist['val_accuracy']
    loss = hist['loss']
    val_loss = hist['val_loss']
    epochs = list(range(1,len(acc)+1))
    #define the traces
    trace_ta = create_trace(epochs,acc,"Training accuracy", "Green")
    trace_va = create_trace(epochs,val_acc,"Validation accuracy", "Red")
    trace_tl = create_trace(epochs,loss,"Training loss", "Blue")
    trace_vl = create_trace(epochs,val_loss,"Validation loss", "Magenta")
    fig = tools.make_subplots(rows=1,cols=2, subplot_titles=('Training and validation accuracy',
                                                             'Training and validation loss'))
    #add traces to the figure
    fig.append_trace(trace_ta,1,1)
    fig.append_trace(trace_va,1,1)
    fig.append_trace(trace_tl,1,2)
    fig.append_trace(trace_vl,1,2)
    #set the layout for the figure
    fig['layout']['xaxis'].update(title = 'Epoch')
    fig['layout']['xaxis2'].update(title = 'Epoch')
    fig['layout']['yaxis'].update(title = 'Accuracy', range=[0,1])
    fig['layout']['yaxis2'].update(title = 'Loss', range=[0,1])
    #plot
    iplot(fig, filename='accuracy-loss')

plot_accuracy_and_loss(train_model)

<a href="#0"><font size="1">Go to top</font></a>  


## <a id='43'>Prediction of test set</a> 


Let's continue by evaluating the **loss** and **accuracy** on the **test** set.

### Predict using last epoch model

In [80]:
score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.28350144624710083
Test accuracy: 0.9276666641235352


Let's check also the test accuracy per class.

In [88]:
def test_accuracy_report(model):
    predicted = model.predict(X_test)
    test_predicted = np.argmax(predicted, axis=1)
    test_truth = np.argmax(y_test.values, axis=1)
    print(metrics.classification_report(test_truth, test_predicted)) 
    test_res = model.evaluate(X_test, y_test.values, verbose=0)
    print('Loss function: %s, accuracy:' % test_res[0], test_res[1])

In [89]:
test_accuracy_report(model)


              precision    recall  f1-score   support

           0       0.98      0.99      0.99       200
           1       0.94      0.96      0.95       200
           2       0.88      0.87      0.87       200
           3       0.93      0.88      0.90       200
           4       0.91      0.98      0.94       200
           5       0.97      0.96      0.96       200
           6       0.94      0.94      0.94       200
           7       0.97      0.95      0.96       200
           8       1.00      0.99      0.99       200
           9       0.86      0.84      0.85       200
          10       0.95      0.90      0.92       200
          11       0.97      0.84      0.90       200
          12       0.90      0.89      0.89       200
          13       0.91      0.96      0.94       200
          14       0.83      0.95      0.89       200

    accuracy                           0.93      3000
   macro avg       0.93      0.93      0.93      3000
weighted avg       0.93   
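To make the columns of the report concrete, the sketch below computes precision, recall and F1 per class from a confusion matrix, using a tiny made-up example with 3 classes. The function `per_class_metrics` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Return (precision, recall, f1) arrays, one entry per class."""
    confusion = np.zeros((n_classes, n_classes), dtype=int)
    for truth, pred in zip(y_true, y_pred):
        confusion[truth, pred] += 1      # rows: truth, columns: prediction

    true_pos = np.diag(confusion).astype(float)
    predicted = confusion.sum(axis=0)    # how often each class was predicted
    actual = confusion.sum(axis=1)       # how often each class occurs (support)

    zeros = lambda: np.zeros(n_classes)  # fallback where a denominator is 0
    precision = np.divide(true_pos, predicted, out=zeros(), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=zeros(), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros(), where=denom > 0)
    return precision, recall, f1

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = per_class_metrics(y_true, y_pred, n_classes=3)
print(precision, recall, f1)
```

Precision answers "of the images predicted as this class, how many were right", while recall answers "of the images of this class, how many were found".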

### Predict using best model

In [30]:
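# Note: this is an alias, not a copy; loading the weights also changes `model`.
model_optimal = model
model_optimal.load_weights('best_model.h5')
# The score below is computed on the test set (X_test, y_test),
# even though the printed label says "validation".
score = model_optimal.evaluate(X_test, y_test, verbose=0)
print(f'Best validation loss: {score[0]}, accuracy: {score[1]}')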

test_accuracy_report(model_optimal)

Best validation loss: 0.15444649755954742, accuracy: 0.9710000157356262
              precision    recall  f1-score   support

           一       1.00      0.97      0.98       200
           七       0.98      0.94      0.96       200
           万       0.97      0.97      0.97       200
           三       0.96      0.98      0.97       200
           九       0.93      0.93      0.93       200
           二       0.96      0.96      0.96       200
           五       1.00      1.00      1.00       200
           亿       0.91      0.98      0.94       200
           八       1.00      0.99      0.99       200
           六       0.98      0.99      0.99       200
           十       0.96      0.94      0.95       200
           千       0.94      0.96      0.95       200
           四       1.00      0.98      0.99       200
           百       0.98      0.95      0.97       200
           零       0.99      1.00      0.99       200

    accuracy                           0.97      3000
   macro

# <a id='5'>Conclusions</a>  
 
 
Training uses 64% of the total data (9,600 / 15,000 images), validation 16% (2,400 / 15,000) and test 20% (3,000 / 15,000).

Tensorflow/Keras with GPU, with 2 sets of Convolutional layers, MaxPool and Dropout, together with a variable learning rate, saving of the best model, early stopping, and use of the best model for testing, resulted in 97% accuracy on the test set.
