# TrainScannerNetwork.ipynb

## Author: Daniel Mallia
## Date Begun: 1/17/2020

**This Jupyter Notebook walks through training a Keras network for Optical Character Recognition (OCR) on the Chars74K dataset, for use in a document scanner app. See the following link for the dataset:**

http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/

In [2]:
# Imports
from keras import layers
from keras import models
from keras.preprocessing.image import ImageDataGenerator
import numpy as np
from numpy.random import default_rng
import os
import shutil
import math

Using TensorFlow backend.


## Organize Data:

For now, training uses only the "Fnt" folder, which contains computer font characters in 4 different variations. The data comes as a single folder with one directory per class, so we must first split the files into train and test sets; the validation split is then taken from the training data generator.

The approach to the split is based in part on: 
https://stackoverflow.com/questions/46717742/split-data-directory-into-training-and-test-directory-with-sub-directory-structu

and 

https://stackoverflow.com/questions/8505651/non-repetitive-random-number-in-numpy

In [3]:
# trainDataLocation = '/Users/danielmallia/Documents/TTP/IndependentStudy/Capstone/Data/English/Fnt/'
# testDataLocation = '/Users/danielmallia/Documents/TTP/IndependentStudy/Capstone/Data/English/FntTest/'

# # For each sample folder
# for folder in os.listdir(trainDataLocation):
#     if(folder == '.DS_Store'): # Ignore .DS_Store files
#         continue
#     else: 
#         # Make a matching test folder
#         testPath = testDataLocation + folder + '/'
#         os.mkdir(testPath)
        
#         currentFolderPath = trainDataLocation + folder + '/'
#         FILE_NAME_LIST = os.listdir(currentFolderPath)
#         numberOfFiles = len(FILE_NAME_LIST)
#         numRange = np.arange(0, numberOfFiles)
#         numOfSelections = math.floor(numberOfFiles * .2)
        
#         # Randomly select indices for choosing approximately 20% of the samples
#         randomNumGen = default_rng()
#         testSelections = randomNumGen.choice(numRange, size=numOfSelections, replace=False)
        
#         # For each index selected
#         for index in testSelections:
#             filename = FILE_NAME_LIST[index]
#             shutil.move(currentFolderPath + filename, testPath + filename)
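The one-time split in the cell above can also be written as a reusable function. The following is a minimal sketch, not the code that was actually run: the function name `split_train_test` and its parameters are hypothetical, and it swaps NumPy's `default_rng().choice` for the standard library's `random.sample` so that it has no third-party dependencies. It also skips every hidden file (not only `.DS_Store`) and tolerates test folders that already exist.

```python
import math
import random
import shutil
from pathlib import Path


def split_train_test(train_dir, test_dir, test_fraction=0.2, seed=None):
    """Move roughly `test_fraction` of each class folder into `test_dir`.

    Hypothetical helper: mirrors the commented-out cell, but uses
    random.sample instead of numpy's default_rng().choice.
    Returns a dict mapping class folder name -> number of files moved.
    """
    rng = random.Random(seed)  # a seed makes the split repeatable
    train_dir, test_dir = Path(train_dir), Path(test_dir)
    moved = {}

    for class_dir in sorted(p for p in train_dir.iterdir() if p.is_dir()):
        # Ignore hidden files such as .DS_Store
        files = sorted(f for f in class_dir.iterdir()
                       if f.is_file() and not f.name.startswith('.'))
        n_test = math.floor(len(files) * test_fraction)

        # Make a matching test folder (no error if it already exists)
        target = test_dir / class_dir.name
        target.mkdir(parents=True, exist_ok=True)

        # Sample without replacement, then move the chosen files
        for f in rng.sample(files, n_test):
            shutil.move(str(f), str(target / f.name))
        moved[class_dir.name] = n_test

    return moved
```

Because the files are moved rather than copied, running this more than once keeps shrinking the training folders, which is presumably why the original cell is commented out after its first run.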

## Data Preprocessing:

**Not doing data augmentation for now: with a single `ImageDataGenerator` that also provides the validation split, any augmentation would be applied to the validation data as well.
This is a known issue with Keras, see:**

https://stackoverflow.com/questions/53037510/can-flow-from-directory-get-train-and-validation-data-from-the-same-directory-in
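One commonly suggested workaround is to build two `ImageDataGenerator` instances over the same directory with the same `validation_split`: one with augmentation for the training subset, one with rescaling only for the validation subset. The sketch below is not used in this notebook. The helper names and the augmentation values are illustrative assumptions, and it relies on the assumption that `flow_from_directory` assigns files to subsets deterministically from the file ordering, so both generators see the same partition.

```python
# Illustrative augmentation settings (assumed values, not tuned for Chars74K)
AUGMENTATION = dict(rotation_range=10,
                    width_shift_range=0.1,
                    height_shift_range=0.1)

# Settings both generators must share so the subsets line up
SHARED = dict(rescale=1. / 255, validation_split=0.2)


def generator_kwargs():
    """Return (training kwargs, validation kwargs) for ImageDataGenerator."""
    train_kwargs = dict(SHARED, **AUGMENTATION)  # rescale + augmentation
    val_kwargs = dict(SHARED)                    # rescale only
    return train_kwargs, val_kwargs


def make_generators(data_dir, target_size=(128, 128), classes=None):
    """Hypothetical helper: augmented training flow, plain validation flow."""
    # Imported here so generator_kwargs() can be inspected without Keras
    from keras.preprocessing.image import ImageDataGenerator

    train_kwargs, val_kwargs = generator_kwargs()
    train_flow = ImageDataGenerator(**train_kwargs).flow_from_directory(
        data_dir, target_size=target_size, classes=classes,
        class_mode="categorical", subset="training")
    val_flow = ImageDataGenerator(**val_kwargs).flow_from_directory(
        data_dir, target_size=target_size, classes=classes,
        class_mode="categorical", subset="validation")
    return train_flow, val_flow
```

Both flows read from the same directory; only the transformations applied to each subset differ.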

This is useful for understanding how to do the train/validation split: https://stackoverflow.com/questions/42443936/keras-split-train-test-set-when-using-imagedatagenerator

Useful for understanding the matter of class labels with Keras: https://medium.com/difference-engine-ai/keras-a-thing-you-should-know-about-keras-if-you-plan-to-train-a-deep-learning-model-on-a-large-fdd63ce66bd2

In [8]:
classLabels = []

for i in range(0, 10):
    classLabels.append(str(i))
for i in range(ord('A'), ord('Z') + 1):
    classLabels.append(chr(i))
for i in range(ord('a'), ord('z') + 1):
    classLabels.append(chr(i))
    
print(classLabels)
print(len(classLabels))

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
62
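The same 62 labels can be built more compactly from the `string` module, and the list order matters later: when a list is passed as `classes` to `flow_from_directory`, the position of each label in the list is expected to become its class index. The sketch below shows the compact construction and a hypothetical `decode_prediction` helper (not part of this notebook) that maps a predicted index back to its character.

```python
import string

# Digits, then uppercase, then lowercase: same order as the loop-based cell
class_labels = list(string.digits
                    + string.ascii_uppercase
                    + string.ascii_lowercase)

# Label -> index mapping implied by the list order
class_indices = {label: index for index, label in enumerate(class_labels)}


def decode_prediction(index):
    """Hypothetical helper: map a predicted class index to its character."""
    return class_labels[index]


print(len(class_labels))      # 62
print(class_indices['A'])     # 10
print(decode_prediction(36))  # a
```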


In [None]:
trainDataLocation = '/Users/danielmallia/Documents/TTP/IndependentStudy/Capstone/Data/English/Fnt/'
testDataLocation = '/Users/danielmallia/Documents/TTP/IndependentStudy/Capstone/Data/English/FntTest/'

imageSize = (128, 128) # Chars 74k image size

# Initialize generators - just appropriate scaling and validation split
trainDataGen = ImageDataGenerator(rescale=1./255, validation_split=.2)
testDataGen = ImageDataGenerator(rescale=1./255)

# Flow from directories
# Note: the keyword is class_mode (with an underscore), not classmode

trainGenerator = trainDataGen.flow_from_directory(trainDataLocation,
                                                  target_size=imageSize,
                                                  classes=classLabels,
                                                  class_mode="categorical",
                                                  subset="training")

validationGenerator = trainDataGen.flow_from_directory(trainDataLocation,
                                                       target_size=imageSize,
                                                       classes=classLabels,
                                                       class_mode="categorical",
                                                       subset="validation")

testGenerator = testDataGen.flow_from_directory(testDataLocation,
                                                target_size=imageSize,
                                                classes=classLabels,
                                                class_mode="categorical")
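As a sanity check on the generators above, the expected number of images per class in each subset can be worked out by hand. The sketch below rests on two assumptions that are not confirmed in this notebook: that each class starts with 1,016 images, and that Keras assigns `int(n * validation_split)` files of each class to the validation subset. The function name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_per_class, test_fraction=0.2, validation_split=0.2):
    """Return (train, validation, test) image counts for one class."""
    # Test files were moved out first, using math.floor as in the split cell
    n_test = math.floor(n_per_class * test_fraction)
    n_remaining = n_per_class - n_test

    # Assumed Keras behaviour: validation takes int(n * validation_split)
    n_validation = int(n_remaining * validation_split)
    n_train = n_remaining - n_validation
    return n_train, n_validation, n_test


print(split_sizes(1016))  # (651, 162, 203)
```

If the counts reported by `flow_from_directory` differ from 62 times these numbers, the folder contents or the class names passed in `classes` are worth checking first.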

## Model:

In [None]:
model = models.Sequential()