This is the main file for the project. It contains the baseline model and the final model I built on cropped tiles from F.G.'s handwritten journal images. I select the 7 most frequent classes and build the models to predict only those, because each of them has at least 300 labeled examples.



In [0]:
import os
import numpy as np
import keras
from keras.models import model_from_json
from skimage.transform import resize
from skimage.io import imread
from keras.utils.np_utils import to_categorical
from random import seed, sample
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from collections import Counter
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from pprint import pprint
from sklearn.metrics import accuracy_score

ld.tar contains the tarred labeled_data directory from https://github.com/Vinaymy/wtf/tree/master/data/labeled_data

In [0]:
!tar -xvf ld.tar

Load up the cropped labeled images and map them to class labels. 

In [0]:
rootDir = './labeled_data'
n_images = 10000
images = np.zeros((n_images,) + (20, 20))
minimum_number_of_examples = 300 # Chosen after inspecting the frequency distribution of the labeled classes.
# I decided that classes with fewer than 300 examples have too few to try to predict.
# Note: the loop below also uses this value as a cap, keeping at most 300 tiles per class.

labels = []
lab = {
    '0': 1, '1': 2, '2': 3, '3':4, '4':5, '5':6, '6':7, '7':8, '8':9, '9':10,
    'A': 11, 'B': 12, 'C': 13, 'D': 14, 'E': 15, 'F': 16, 'G': 17, 'H': 18, 'I': 19, 'J': 20, 'K': 21, 'L': 22, 'M': 23, 
    'N': 24, 'O': 25, 'P': 26, 'Q': 27, 'R': 28, 'S': 29, 'T': 30, 'U': 31, 'V': 32, 'W': 33, 'X': 34, 'Y': 35, 'Z': 36,
    'a': 11, 'b': 12, 'c': 13, 'd': 14, 'e': 15, 'f': 16, 'g': 17, 'h': 18, 'i': 19, 'j': 20, 'k': 21, 'l': 22, 'm': 23, 
    'n': 24, 'o': 25, 'p': 26, 'q': 27, 'r': 28, 's': 29, 't': 30, 'u': 31, 'v': 32, 'w': 33, 'x': 34, 'y': 35, 'z': 36
}
j = 0
ct = {}
for dirName, subdirList, fileList in os.walk(rootDir):
    i = 0
    for fname in fileList:
        if 'png' not in fname or fname[:3] !='IMG':
            continue
        full_path = os.path.join(dirName, fname)
        
        if dirName[-1] in ['E', 'e', 'A', 'a', 'T', 't', 'S', 's', 'D', 'd', 'N', 'n', 'O', 'o'] \
        and dirName[-2] == '/' and ct.get(dirName[-1].lower(), 0) < minimum_number_of_examples:
            ct[dirName[-1].lower()] = ct.get(dirName[-1].lower(), 0) + 1
        else:
            continue
        
        # Resize to 20 X 20
        images[j] = resize(imread(full_path, as_grey=True), (20, 20)) 
        labels.append(lab[dirName[-1]])
        j += 1
        i += 1
images = images[:len(labels)]
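
The selection rule in the loop above (keep only the seven target letters, fold upper and lower case together, and stop once a class has 300 tiles) can be isolated into a small function. This is a minimal sketch with hypothetical names (`select_capped`, `TARGET_LETTERS`) and a made-up input list, not the code that produced the results in this file.

```python
from collections import Counter

# Assumption: the seven target letters, matching the check in the loop above.
TARGET_LETTERS = set("eatsdno")

def select_capped(dir_letters, cap=300):
    """Return indices of items to keep, at most `cap` per case-folded letter.

    `dir_letters` has one entry per tile: the last character of the
    directory the tile came from (hypothetical input format).
    """
    counts = Counter()
    kept = []
    for idx, letter in enumerate(dir_letters):
        key = letter.lower()
        if key not in TARGET_LETTERS or counts[key] >= cap:
            continue
        counts[key] += 1
        kept.append(idx)
    return kept, counts

# Toy input: 'E' and 'e' share one budget; 'x' is not a target letter.
kept, counts = select_capped(["E", "e", "x", "e", "A"], cap=2)
print(kept)          # [0, 1, 4]
print(dict(counts))  # {'e': 2, 'a': 1}
```

Because the cap is applied in directory-walk order, which tiles are kept depends on the order in which `os.walk` visits the files.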

In the next part, I inspect the class counts of the selected labels.

In [0]:
count = Counter(labels)

In [67]:
count.most_common() # 7 labels, balanced since I picked 300 of each

[(29, 300), (15, 300), (14, 300), (30, 300), (24, 300), (25, 300), (11, 300)]

In [0]:
len(labels), images[0], labels[0]

In [69]:
np.unique(labels)

array([11, 14, 15, 24, 25, 29, 30])

In [0]:
uniques, ids = np.unique(labels, return_inverse=True)
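
As a small illustration of what `return_inverse=True` gives back, here is a sketch on a toy label list (the five values are made up, drawn from the label values printed above).

```python
import numpy as np

# Toy labels drawn from the seven label values printed above.
labels = [29, 11, 30, 11, 15]

uniques, ids = np.unique(labels, return_inverse=True)
print(uniques)  # [11 15 29 30]
print(ids)      # [2 0 3 0 1]

# Indexing the sorted unique values with `ids` recovers the original labels.
assert np.array_equal(uniques[ids], labels)
```

Note that the indices depend on which label values are present in the input: in this toy list 29 maps to index 2, while in the full 7-class label set it maps to index 5.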

Perform an 80%/20% split of the examples, giving 1680 training and 420 test examples. Because the split is not stratified, each character has about 240 examples in the training set and about 60 in the test set, not exactly.

In [71]:
k = sample(range(len(images)), len(images))
im_shuf = images[k]
labels_shuf = np.array(labels)[k]
    
ocr = {
    'images': im_shuf,
    'data': im_shuf.reshape((im_shuf.shape[0], -1)),
    'target': labels_shuf
}
x_train, x_test, y_train, y_test = train_test_split(im_shuf, labels_shuf, random_state=2, train_size=0.8)



In [72]:
y_train.shape

(1680,)
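
If exactly 240 training and 60 test examples per character are wanted, one option is the `stratify` argument of `train_test_split`. The sketch below uses placeholder arrays (all-zero tiles and a synthetic label vector are assumptions) to show the effect on the per-class counts; it is not the split used for the results in this file.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Stand-in data: 7 classes with 300 examples each, mirroring the counts above.
class_values = [11, 14, 15, 24, 25, 29, 30]
y = np.repeat(class_values, 300)
X = np.zeros((len(y), 20, 20))  # placeholder tiles, contents irrelevant here

# stratify=y keeps the class proportions identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=2, stratify=y
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print(train_counts)  # each class appears 240 times
print(test_counts)   # each class appears 60 times
```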

Build a baseline model and record its accuracy on the test set for reference. I build a linear SVM with a grid search over the hyperparameter C. The baseline reaches an accuracy of 0.23 on the test set, compared with a chance level of about 0.14 (1/7) for 7 balanced classes.

In [73]:
base_model = LinearSVC()
param_grid = {'C':  list(np.arange(0.1,1.5,0.1))}

gs = GridSearchCV(base_model, param_grid, n_jobs=-1, cv=3, verbose=4)
x_train_resh = x_train.reshape((x_train.shape[0], -1))
x_test_resh = x_test.reshape((x_test.shape[0], -1))
gs.fit(x_train_resh, y_train)
pprint(sorted(gs.grid_scores_, key=lambda x: -x.mean_validation_score))

y_pred = gs.predict(x_test_resh)
print ('Test set shape: ', x_test_resh.shape)
print ('Target shape: ', y_test.shape)
print ('Accuracy on train set: ', accuracy_score(y_train, gs.predict(x_train_resh)))
print ('Accuracy on test set: ', accuracy_score(y_test, y_pred))

Fitting 3 folds for each of 14 candidates, totalling 42 fits
[CV] C=0.1 ...........................................................
[CV] C=0.1 ...........................................................
[CV] .................................. C=0.1, score=0.225577 -   6.3s
[CV] C=0.1 ...........................................................
[CV] .................................. C=0.1, score=0.210714 -   7.0s
[CV] C=0.2 ...........................................................
[CV] .................................. C=0.1, score=0.222621 -   8.0s
[CV] C=0.2 ...........................................................
[CV] .................................. C=0.2, score=0.211368 -  10.2s
[CV] C=0.2 ...........................................................
[CV] .................................. C=0.2, score=0.207143 -  10.2s
[CV] C=0.30000000000000004 ...........................................
[CV] .................................. C=0.2, score=0.211849 -  10.4s
[CV] C=0.3000000

[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:  1.7min


[CV] .................................. C=0.8, score=0.206039 -   9.9s
[CV] C=0.8 ...........................................................
[CV] .................................. C=0.8, score=0.196429 -   9.7s
[CV] C=0.9 ...........................................................
[CV] .................................. C=0.8, score=0.197487 -   9.8s
[CV] C=0.9 ...........................................................
[CV] .................................. C=0.9, score=0.204263 -   9.9s
[CV] C=0.9 ...........................................................
[CV] .................................. C=0.9, score=0.192857 -  10.0s
[CV] C=1.0 ...........................................................
[CV] .................................. C=0.9, score=0.201077 -   9.9s
[CV] C=1.0 ...........................................................
[CV] .................................. C=1.0, score=0.211368 -   9.8s
[CV] C=1.0 ...........................................................
[CV] .

[Parallel(n_jobs=-1)]: Done  42 out of  42 | elapsed:  3.4min finished


[mean: 0.21964, std: 0.00642, params: {'C': 0.1},
 mean: 0.21012, std: 0.00211, params: {'C': 0.2},
 mean: 0.20655, std: 0.00537, params: {'C': 0.30000000000000004},
 mean: 0.20536, std: 0.00428, params: {'C': 0.4},
 mean: 0.20417, std: 0.00337, params: {'C': 1.1},
 mean: 0.20357, std: 0.00554, params: {'C': 1.0},
 mean: 0.20119, std: 0.00247, params: {'C': 0.5},
 mean: 0.20000, std: 0.00508, params: {'C': 0.6},
 mean: 0.20000, std: 0.00430, params: {'C': 0.8},
 mean: 0.19940, std: 0.00481, params: {'C': 0.9},
 mean: 0.19643, std: 0.00614, params: {'C': 0.7000000000000001},
 mean: 0.19345, std: 0.00641, params: {'C': 1.2000000000000002},
 mean: 0.19345, std: 0.00627, params: {'C': 1.3000000000000003},
 mean: 0.19286, std: 0.01577, params: {'C': 1.4000000000000001}]
Test set shape:  (420, 400)
Target shape:  (420,)
Accuracy on train set:  0.4720238095238095
Accuracy on test set:  0.23333333333333334
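
The `grid_scores_` attribute used above was removed in scikit-learn 0.20; newer releases expose the same information through `cv_results_`. The following sketch prints a similar sorted summary from `cv_results_`. It runs on synthetic data from `make_classification` (an assumption, chosen only so the example is self-contained), so its numbers have nothing to do with the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption, not the journal tiles).
X, y = make_classification(n_samples=210, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

gs = GridSearchCV(LinearSVC(), {'C': [0.1, 0.5, 1.0]}, cv=3)
gs.fit(X, y)

# cv_results_ is a dict of arrays with one entry per candidate.
res = gs.cv_results_
order = np.argsort(-res['mean_test_score'])  # best mean score first
for i in order:
    print('mean: %.5f, std: %.5f, params: %s'
          % (res['mean_test_score'][i], res['std_test_score'][i], res['params'][i]))
```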


Next I build a CNN for the final model

In [0]:
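# Note: this is one more output unit than the 7 classes; the extra unit is never used as a target.
num_classes = len(uniques)+1
batch_size = 128

# Add a trailing channel axis, as Conv2D expects: (n, 20, 20) -> (n, 20, 20, 1)
X_train = x_train.reshape(x_train.shape[0], 20, 20 , 1).astype('float32')
X_test = x_test.reshape(x_test.shape[0], 20, 20 , 1).astype('float32')

# Map label values (11, 14, ...) to class indices 0..6. Both splits use the same
# sorted `uniques` from the full label set, so the indices agree across train and test.
ids_tr = np.searchsorted(uniques, y_train)
ids_te = np.searchsorted(uniques, y_test)

Y_train = to_categorical(ids_tr, num_classes)
Y_test = to_categorical(ids_te, num_classes)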

Specify the model architecture, then train and test it. I get a training accuracy of 0.941 and a test accuracy of 0.807. The 70 epochs and batch size of 128 were arrived at through trial and error.

In [75]:
model = Sequential()

model.add(Conv2D(32, kernel_size=(5, 5), activation='relu', input_shape=(20, 20, 1)))

model.add(Conv2D(64, (5, 5), activation='relu'))

model.add(Conv2D(128, (5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(num_classes, activation='softmax'))

model.summary()

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=batch_size,
          epochs=70,
          verbose=1)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_13 (Conv2D)           (None, 16, 16, 32)        832       
_________________________________________________________________
conv2d_14 (Conv2D)           (None, 12, 12, 64)        51264     
_________________________________________________________________
conv2d_15 (Conv2D)           (None, 8, 8, 128)         204928    
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 4, 4, 128)         0         
_________________________________________________________________
dropout_9 (Dropout)          (None, 4, 4, 128)         0         
_________________________________________________________________
flatten_5 (Flatten)          (None, 2048)              0         
_________________________________________________________________
dense_9 (Dense)              (None, 128)               262272    
__________



Epoch 1/70
...
Epoch 70/70


<keras.callbacks.History at 0x7f4ad7e29860>
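
The output shapes and parameter counts in the layer summary above can be checked by hand. The sketch below assumes the Keras defaults used in this model (unpadded "valid" convolutions with stride 1, and 2x2 pooling with stride 2); the helper names `conv_out` and `conv_params` are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output width of a 'valid' (unpadded) convolution or pooling step."""
    return (size - kernel) // stride + 1

def conv_params(kernel, in_ch, out_ch):
    """Kernel weights plus one bias per output channel."""
    return (kernel * kernel * in_ch + 1) * out_ch

size = 20
for in_ch, out_ch in [(1, 32), (32, 64), (64, 128)]:
    size = conv_out(size, 5)
    print(size, out_ch, conv_params(5, in_ch, out_ch))
# 16 32 832
# 12 64 51264
# 8 128 204928

size = conv_out(size, 2, stride=2)   # 2x2 max pooling
flat = size * size * 128             # length of the flattened vector
dense_params = flat * 128 + 128      # first Dense layer: weights + biases
print(size, flat, dense_params)      # 4 2048 262272
```

Each 5x5 convolution trims 4 pixels from the width and height, so three of them reduce the 20x20 tile to 8x8 before pooling.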

In [76]:
scores = model.evaluate(X_test, Y_test, verbose=1)
print(scores)  # [test loss, test accuracy]

[0.8265708412442888, 0.8071428571428572]
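
The two numbers above are the test loss and the test accuracy, in that order. To read individual predictions as letters, the class index has to be mapped back through the sorted label values and then through the `lab` numbering, where 'A' is 11. A minimal sketch, using made-up probability rows in place of real `model.predict` output and a hypothetical helper `decode`:

```python
import numpy as np

# Sorted label values printed earlier in this file; in `lab`, 'A' maps to 11.
uniques = np.array([11, 14, 15, 24, 25, 29, 30])

def decode(probs):
    """Map rows of class probabilities to letters (hypothetical helper)."""
    idx = np.argmax(probs, axis=1)   # class index per row
    label_values = uniques[idx]      # back to 11, 14, ...
    return [chr(ord('A') + int(v) - 11) for v in label_values]

# Made-up probability rows, 8 columns to match num_classes in this file.
probs = np.array([
    [0.7, 0.1, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 0.8, 0.0],
])
print(decode(probs))  # ['A', 'T']
```

Since the model has 8 output units for 7 classes, a row whose largest value falls in the last column would raise an `IndexError` here; that unit is never a training target, so this should be rare in practice.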
