# TrainScannerNetwork.ipynb

## Author: Daniel Mallia
## Date Begun: 1/17/2020

**This Jupyter Notebook walks through training a Keras network for Optical Character Recognition (OCR) on the Chars74K dataset, for use with a document scanner app. See the following link for the dataset:**

http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/

In [1]:
# Imports
from keras import layers
from keras import models
from keras.preprocessing.image import ImageDataGenerator
import numpy as np
from numpy.random import default_rng
import os
import shutil
import math

Using TensorFlow backend.


## Organize Data:

For now, training is done only on the "Fnt" folder, which contains computer font characters in 4 different variations. This data comes as a single folder with one directory per class, so it must first be split into train/test file sets; the validation split is then taken from the training data by the generator.

How to handle this was inspired in part by:
https://stackoverflow.com/questions/46717742/split-data-directory-into-training-and-test-directory-with-sub-directory-structu

and 

https://stackoverflow.com/questions/8505651/non-repetitive-random-number-in-numpy

In [3]:
# trainDataLocation = '/Users/danielmallia/Documents/TTP/IndependentStudy/Capstone/Data/English/Fnt/'
# testDataLocation = '/Users/danielmallia/Documents/TTP/IndependentStudy/Capstone/Data/English/FntTest/'

# # For each sample folder
# for folder in os.listdir(trainDataLocation):
#     if(folder == '.DS_Store'): # Ignore .DS_Store files
#         continue
#     else: 
#         # Make a matching test folder
#         testPath = testDataLocation + folder + '/'
#         os.mkdir(testPath)
        
#         currentFolderPath = trainDataLocation + folder + '/'
#         FILE_NAME_LIST = os.listdir(currentFolderPath)
#         numberOfFiles = len(FILE_NAME_LIST)
#         numRange = np.arange(0, numberOfFiles)
#         numOfSelections = math.floor(numberOfFiles * .2)
        
#         # Randomly select indices for choosing approximately 20% of the samples
#         randomNumGen = default_rng()
#         testSelections = randomNumGen.choice(numRange, size=numOfSelections, replace=False)
        
#         # For each index selected
#         for index in testSelections:
#             filename = FILE_NAME_LIST[index]
#             shutil.move(currentFolderPath + filename, testPath + filename)
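
The same per-class split can be sketched as a reusable function. This is a minimal sketch, not the notebook's own cell: `split_class_folders`, its arguments, and the `seed` parameter are hypothetical. It also uses the standard library's `random.sample` in place of numpy's `default_rng`, so it runs without extra dependencies. Filtering on directories skips stray files such as `.DS_Store`.

```python
import random
import shutil
import tempfile
from pathlib import Path


def split_class_folders(train_root, test_root, test_fraction=0.2, seed=None):
    """Move a random test_fraction of each class folder's files into test_root.

    Hypothetical helper: the name, arguments, and seed are illustrative.
    Returns a dict of {class folder name: number of files moved}.
    """
    rng = random.Random(seed)
    train_root, test_root = Path(train_root), Path(test_root)
    moved = {}
    # Only directories are treated as classes, so .DS_Store is ignored
    for class_dir in sorted(p for p in train_root.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        n_test = int(len(files) * test_fraction)  # rounds down, like math.floor
        target = test_root / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        # random.sample draws without replacement, so no file is picked twice
        for f in rng.sample(files, n_test):
            shutil.move(str(f), str(target / f.name))
        moved[class_dir.name] = n_test
    return moved


# Self-check on throwaway folders rather than the real dataset
with tempfile.TemporaryDirectory() as tmp:
    train_dir, test_dir = Path(tmp) / "Fnt", Path(tmp) / "FntTest"
    for name in ("Sample001", "Sample002"):
        (train_dir / name).mkdir(parents=True)
        for i in range(10):
            (train_dir / name / f"img{i}.png").touch()
    demo_counts = split_class_folders(train_dir, test_dir, seed=0)
    demo_remaining = len(list((train_dir / "Sample001").iterdir()))
```

Passing a fixed seed makes the split repeatable, which helps if the test set ever has to be regenerated.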

## Data Preprocessing:

**Not doing data augmentation: with a single generator providing both subsets, any augmentation would also be applied to the validation data.
This is a known issue with Keras, see:**

https://stackoverflow.com/questions/53037510/can-flow-from-directory-get-train-and-validation-data-from-the-same-directory-in

This is useful for understanding how to do the train/validation split: https://stackoverflow.com/questions/42443936/keras-split-train-test-set-when-using-imagedatagenerator

Useful for understanding the matter of class labels with Keras: https://medium.com/difference-engine-ai/keras-a-thing-you-should-know-about-keras-if-you-plan-to-train-a-deep-learning-model-on-a-large-fdd63ce66bd2

In [2]:
trainDataLocation = '/Users/danielmallia/Documents/TTP/IndependentStudy/Capstone/Data/English/Fnt/'
testDataLocation = '/Users/danielmallia/Documents/TTP/IndependentStudy/Capstone/Data/English/FntTest/'

imageSize = (128, 128) # Chars74K "Fnt" image size (height, width)

# Initialize generators - just appropriate scaling and validation split
trainDataGen = ImageDataGenerator(rescale=1./255, validation_split=.2)
testDataGen = ImageDataGenerator(rescale=1./255)

# Flow from directories

trainGenerator = trainDataGen.flow_from_directory(trainDataLocation,
                             target_size=imageSize,
                             class_mode="categorical",
                             subset="training")

validationGenerator = trainDataGen.flow_from_directory(trainDataLocation,
                                  target_size=imageSize,
                                  class_mode="categorical",
                                  subset="validation")

testGenerator = testDataGen.flow_from_directory(testDataLocation,
                           target_size=imageSize,
                           class_mode="categorical")

Found 40362 images belonging to 62 classes.
Found 10044 images belonging to 62 classes.
Found 12586 images belonging to 62 classes.
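
The three counts above can be sanity-checked with a little arithmetic. This is a rough sketch under two assumptions that the notebook does not state: every class started with 1016 images (inferred from the totals, since 40362 + 10044 + 12586 = 62992 = 62 × 1016), and Keras rounds the validation share down within each class folder. The function name is hypothetical.

```python
def keras_style_split(n_files, validation_split):
    """Return (n_training, n_validation) for one class folder.

    Assumes the validation share is rounded down per class.
    """
    n_val = int(n_files * validation_split)
    return n_files - n_val, n_val


n_classes = 62
per_class_total = 1016  # assumption: inferred from the totals, not stated in the notebook

# 20% of each class was moved to the test folder, rounded down
per_class_test = int(per_class_total * 0.2)
per_class_remaining = per_class_total - per_class_test

# The generator then splits what remains into training and validation
per_class_train, per_class_val = keras_style_split(per_class_remaining, 0.2)

totals = (
    per_class_train * n_classes,
    per_class_val * n_classes,
    per_class_test * n_classes,
)
print(totals)
```

Because both roundings happen per class, the validation set ends up slightly under 20% of the training folder.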


In [3]:
classLabels = []

for i in range(0, 10):
    classLabels.append(str(i))
for i in range(ord('A'), ord('Z') + 1):
    classLabels.append(chr(i))
for i in range(ord('a'), ord('z') + 1):
    classLabels.append(chr(i))
    
print(classLabels)
print(len(classLabels))

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
62
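
The same 62 labels can also be built from the standard library's `string` constants. This is just an alternative sketch; `class_labels` is a hypothetical name chosen to avoid clashing with `classLabels` above.

```python
import string

# Digits, then uppercase, then lowercase: the same order as the loops above
class_labels = list(string.digits + string.ascii_uppercase + string.ascii_lowercase)

print(len(class_labels))
```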


In [4]:
# Generate mapping dictionary
#print(trainGenerator.class_indices)

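mapping = {}
# class_indices maps each folder name to its integer class index,
# so use that index directly rather than relying on iteration order
for folder, index in trainGenerator.class_indices.items():
    mapping[folder] = classLabels[index]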
    
print(mapping)

{'Sample001': '0', 'Sample002': '1', 'Sample003': '2', 'Sample004': '3', 'Sample005': '4', 'Sample006': '5', 'Sample007': '6', 'Sample008': '7', 'Sample009': '8', 'Sample010': '9', 'Sample011': 'A', 'Sample012': 'B', 'Sample013': 'C', 'Sample014': 'D', 'Sample015': 'E', 'Sample016': 'F', 'Sample017': 'G', 'Sample018': 'H', 'Sample019': 'I', 'Sample020': 'J', 'Sample021': 'K', 'Sample022': 'L', 'Sample023': 'M', 'Sample024': 'N', 'Sample025': 'O', 'Sample026': 'P', 'Sample027': 'Q', 'Sample028': 'R', 'Sample029': 'S', 'Sample030': 'T', 'Sample031': 'U', 'Sample032': 'V', 'Sample033': 'W', 'Sample034': 'X', 'Sample035': 'Y', 'Sample036': 'Z', 'Sample037': 'a', 'Sample038': 'b', 'Sample039': 'c', 'Sample040': 'd', 'Sample041': 'e', 'Sample042': 'f', 'Sample043': 'g', 'Sample044': 'h', 'Sample045': 'i', 'Sample046': 'j', 'Sample047': 'k', 'Sample048': 'l', 'Sample049': 'm', 'Sample050': 'n', 'Sample051': 'o', 'Sample052': 'p', 'Sample053': 'q', 'Sample054': 'r', 'Sample055': 's', 'Sample05
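
For the scanner app, the direction that matters is class index to character, since the network outputs one probability per class index. The sketch below is a minimal illustration: `build_index_to_char` and `decode_prediction` are hypothetical helpers, the three `class_indices` entries are a small subset copied from the mapping printed above, and the probability vector is a made-up placeholder rather than real model output.

```python
import string


def build_index_to_char(class_indices, class_labels):
    """Map each integer class index to its character label."""
    return {index: class_labels[index] for index in class_indices.values()}


def decode_prediction(probabilities, index_to_char):
    """Return the character whose class index has the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index_to_char[best]


class_labels = list(string.digits + string.ascii_uppercase + string.ascii_lowercase)

# Subset of folder -> index pairs, consistent with the mapping printed above
class_indices = {"Sample001": 0, "Sample011": 10, "Sample037": 36}
index_to_char = build_index_to_char(class_indices, class_labels)

# Placeholder softmax output with all of its mass on class index 36
fake_probabilities = [0.0] * 62
fake_probabilities[36] = 1.0
decoded = decode_prediction(fake_probabilities, index_to_char)
print(decoded)
```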

## Model:

In [5]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3,3), activation='relu', input_shape=(128, 128, 3))) # height, width, channels
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Conv2D(32, (3,3), activation='relu'))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Conv2D(64, (3,3), activation='relu'))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Conv2D(64, (3,3), activation='relu'))
model.add(layers.MaxPooling2D((2,2)))
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(62, activation='softmax'))

In [6]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 126, 126, 32)      896       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 63, 63, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 61, 61, 32)        9248      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 30, 30, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 28, 28, 64)        18496     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 64)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 12, 12, 64)       
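
The shapes and parameter counts visible in the summary can be reproduced by hand. This sketch assumes the Keras defaults used in the model cell: 'valid' padding and stride 1 for `Conv2D`, and a stride equal to the pool size for `MaxPooling2D`. The helper names are hypothetical.

```python
def conv2d_valid(size, kernel=3):
    """Output height/width of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def max_pool(size, pool=2):
    """Output height/width of max pooling; leftover pixels are dropped."""
    return size // pool


def conv2d_params(kernel, in_channels, filters):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size, channels = 128, 3  # input_shape=(128, 128, 3)
trace = []
for filters in (32, 32, 64, 64):
    params = conv2d_params(3, channels, filters)
    size = conv2d_valid(size)
    trace.append(("conv", size, filters, params))
    size = max_pool(size)
    trace.append(("pool", size, filters, 0))
    channels = filters

# Number of values handed to the first Dense layer after Flatten
flattened = size * size * channels

for row in trace:
    print(row)
```

The odd sizes (63 and 61) are why the second pooling layer yields 30 rather than 30.5: pooling discards the final row and column.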

In [7]:
numOfTrainingFiles = len(trainGenerator.filepaths)
numOfValidationFiles = len(validationGenerator.filepaths)
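batchSize = trainGenerator.batch_size # No batch_size was passed to flow_from_directory, so this is the default (32)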

trainSteps = math.ceil(numOfTrainingFiles / batchSize)
validationSteps = math.ceil(numOfValidationFiles / batchSize)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit_generator(
        trainGenerator,
        steps_per_epoch=trainSteps,
        epochs=2,
        validation_data=validationGenerator,
        validation_steps=validationSteps)

Epoch 1/2
Epoch 2/2


<keras.callbacks.callbacks.History at 0x637dbf850>
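
The step counts only cover a full pass over the data when they are computed from the batch size the generators actually yield. `flow_from_directory` defaults to `batch_size=32` when none is passed, as in the generator cell above. The sketch below uses the image counts reported earlier to show how much the assumed batch size changes the number of steps; `steps_for` is a hypothetical helper.

```python
import math


def steps_for(n_samples, batch_size):
    """Batches needed to see every sample once (the last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


n_train, n_val = 40362, 10044  # counts reported by the generators

steps_at_32 = steps_for(n_train, 32)
steps_at_64 = steps_for(n_train, 64)
val_steps_at_32 = steps_for(n_val, 32)

print(steps_at_32, steps_at_64, val_steps_at_32)
```

With batches of 32, a step count derived from 64 covers only about half of the training images per epoch.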

## Test:

In [None]:
model.evaluate_generator(
        testGenerator)