## Table of contents:

1. [Importing Libraries](#Libraries)
2. [Loading Data](#Data)
3. [Data Understanding](#Exploration)
4. [Exploratory Data Analysis](#EDA)
5. [Data Preprocessing](#Clean)
6. [Model Architecture](#Modelling)
7. [Model Training](#Training)
8. [Model Testing](#Validation)
9. [Model Tuning](#Tuning)
10. [Model Prediction](#Prediction)

<a name="Libraries"></a>
<h1> 1. Importing Libraries</h1>

In [None]:
# Import relevant libraries and Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import cv2
import random
import glob
from tensorflow import keras
import string
import tensorflow as tf
import pytesseract
from PIL import Image
from tensorflow.keras.optimizers import Adam
from distance import levenshtein
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout

<h3> 1.1. Set Seed for Reproducibility</h3>


In [None]:
SEED = 2022
tf.keras.utils.set_random_seed(SEED)  # Seeds Python's random module, NumPy and TensorFlow

<a name="Data"></a>
<h1> 2. Loading Data</h1>

* EMNIST
* IAM

>**2.1. EMNIST**
>>The EMNIST dataset contains handwritten characters derived from NIST Special Database 19. The ByClass split used here includes a total of 814,255 characters in 62 unbalanced classes: the 10 digits, the 26 uppercase letters and the 26 lowercase letters. Each character is represented as a 28x28 pixel grayscale image.
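
The ByClass files loaded in this notebook use 62 integer labels. As a quick illustration, the sketch below maps a label to its character. It assumes the usual ByClass ordering (digits first, then uppercase, then lowercase letters), which should be checked against the mapping file shipped with the dataset. `EMNIST_BYCLASS_CHARS` and `index_to_char` are hypothetical names, not part of the notebook.

```python
import string

# Assumed ByClass ordering: 0-9 digits, 10-35 uppercase, 36-61 lowercase
EMNIST_BYCLASS_CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase

def index_to_char(label):
    """Return the character for an integer ByClass label (hypothetical helper)."""
    return EMNIST_BYCLASS_CHARS[int(label)]

print(len(EMNIST_BYCLASS_CHARS))  # 62
print(index_to_char(0), index_to_char(10), index_to_char(36), index_to_char(61))  # 0 A a z
```

A lookup like this is handy later for turning `np.argmax` predictions back into readable text.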

In [None]:
# Functions to extract EMNIST images and labels
import gzip
def extract_labels(filename):
    with gzip.open(filename, 'rb') as f:
        magic = int.from_bytes(f.read(4), byteorder='big')
        if magic != 2049:
            raise ValueError("Invalid magic number for label file: expected 2049, but got {}".format(magic))
        num_labels = int.from_bytes(f.read(4), byteorder='big')
        buf = f.read(num_labels)
        labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)
        return labels

def extract_data(filename):
    with gzip.open(filename, 'rb') as f:
        # Read magic number and number of images
        magic = int.from_bytes(f.read(4), byteorder='big')
        if magic != 2051:
            raise ValueError("Invalid magic number for image file: expected 2051, but got {}".format(magic))
        num_images = int.from_bytes(f.read(4), byteorder='big', signed=False)

        # Read number of rows and columns
        rows = int.from_bytes(f.read(4), byteorder='big')
        cols = int.from_bytes(f.read(4), byteorder='big')

        # Read image data
        buf = f.read(num_images * rows * cols)
        data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
        data = data.reshape(num_images, rows, cols, 1)

        return data


# Load EMNIST dataset
emnistDir = '/home/munyao/Desktop/flat_iron_school/Moringa/phase_5/gzip/'
emnist_train_images = extract_data(emnistDir + 'emnist-byclass-train-images-idx3-ubyte.gz')
emnist_train_labels = extract_labels(emnistDir + 'emnist-byclass-train-labels-idx1-ubyte.gz')
emnist_test_images = extract_data(emnistDir + 'emnist-byclass-test-images-idx3-ubyte.gz')
emnist_test_labels = extract_labels(emnistDir + 'emnist-byclass-test-labels-idx1-ubyte.gz')

>**2.2. IAM**
>>The IAM dataset contains handwritten text samples from forms, letters, and other documents. It includes over 1,000 pages of handwritten text and a total of 13,353 lines of text, with a vocabulary of 6,877 words. The dataset includes both isolated word images and full text lines, with varying degrees of difficulty and quality. The text is represented as ASCII strings and the images are grayscale.


In [None]:
# Extract the paths to the image files and their corresponding labels for the IAM forms dataset
iam_path = '/home/munyao/Desktop/flat_iron_school/Moringa/phase_5/IAM/forms'
image_paths = []
labels = []
for subdir, _, files in os.walk(iam_path):
    for file in files:
        if file.endswith('.png'):
            image_paths.append(os.path.join(subdir, file))
            labels.append(file.split('-')[1])


image_data = np.array(image_paths)
iam_labels = np.array(labels)  
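
Before training, it may help to check what the label expression in the loop above actually yields. The sketch below uses a made-up file name in the `a01-000u.png` style (an assumption for illustration; substitute a real entry from `image_paths`). `label_from_filename` is a hypothetical helper that takes the same dash-separated field but strips the file extension first.

```python
import os

def label_from_filename(filename):
    """Second dash-separated field of the file name, without the extension (hypothetical helper)."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    return stem.split('-')[1]

example = 'a01-000u.png'  # assumed naming pattern, for illustration only
print(example.split('-')[1])         # what the loop stores as the label: 000u.png
print(label_from_filename(example))  # same field without the suffix: 000u
```

Whichever form is used, the labels are strings at this point and need to be mapped to integer class indices before one-hot encoding.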

<a name="Exploration"></a>
<h1> 3. Data Understanding</h1>

<a name="Clean"></a>
<h1> 4. Data Preprocessing</h1>

* Normalize the pixel values of the images to be between 0 and 1.
* Reshape the images to be 28x28 grayscale images with a single channel.
* Convert the labels to one-hot encoding with the specified number of classes.
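
The three steps can be tried on a toy batch first. This is a minimal NumPy-only sketch with made-up data (`toy_images`, `toy_labels`) and an assumed `num_classes` of 4; it mirrors the steps listed above but is not the notebook's own preprocessing function.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(5, 28 * 28), dtype=np.uint8)  # 5 flattened fake images
toy_labels = np.array([0, 3, 1, 2, 3])  # fake 0-based class indices
num_classes = 4                         # assumed for this toy example

# 1. Normalize the pixel values to [0, 1]
x = toy_images.astype('float32') / 255.0
# 2. Reshape to (num_images, 28, 28, 1)
x = x.reshape(-1, 28, 28, 1)
# 3. One-hot encode by indexing rows of an identity matrix
y = np.eye(num_classes, dtype='float32')[toy_labels]

print(x.shape, y.shape)  # (5, 28, 28, 1) (5, 4)
print(y[1])              # [0. 0. 0. 1.]
```

Note that one-hot encoding expects integer labels that start at 0; labels that start at 1 would need to be shifted first.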

In [None]:
# Function to preprocess the train and test sets
def preprocess_data(train_images, train_labels, test_images, test_labels, num_classes=None):
    # Normalize the pixel values to be between 0 and 1
    train_images = train_images.astype('float32') / 255.0
    test_images = test_images.astype('float32') / 255.0

    # Reshape the images to be 28x28 grayscale images
    train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)
    test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)

    # Convert the labels to one-hot encoding (labels are expected to be 0-based class indices)
    train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=num_classes)
    test_labels = tf.keras.utils.to_categorical(test_labels, num_classes=num_classes)

    return train_images, train_labels, test_images, test_labels


<h3> 4.1. IAM Dataset Preprocessing</h3>

In [None]:
# Load images from image_paths
image_data = []
for path in image_paths:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)  # Convert image to grayscale
    img = cv2.resize(img, (28, 28))  # Resize the image to 28x28
    image_data.append(img)

# Convert the list of images to a numpy array
image_data = np.array(image_data)

# Add a channel dimension to the images
image_data = np.expand_dims(image_data, axis=-1)

# Convert the labels to a numpy array
labels = np.array(iam_labels)

# Ensure the images have shape (num_images, 28, 28, 1)
image_data = image_data.reshape(-1, 28, 28, 1)

In [None]:
# Split the data into training and testing sets
iam_train_images, iam_test_images, iam_train_labels, iam_test_labels = train_test_split(image_data, labels, test_size=0.2, random_state=SEED)

# Print the shapes of the resulting arrays
print("iam train image shape:", iam_train_images.shape)
print("iam test image shape:", iam_test_images.shape)
print("iam train label shape:", iam_train_labels.shape)
print("iam test label shape:", iam_test_labels.shape)

In [None]:
# # Reshape the images to 1D arrays
# X_train = iam_train_images.reshape(iam_train_images.shape[0], -1)
# X_test = iam_test_images.reshape(iam_test_images.shape[0], -1)

# # Convert the labels to one-hot encoding
# y_train = pd.get_dummies(iam_train_labels)
# y_test = pd.get_dummies(iam_test_labels)

# # Save the data to CSV files
# pd.DataFrame(X_train).to_csv('iam_train_images.csv', index=False)
# pd.DataFrame(y_train).to_csv('iam_train_labels.csv', index=False)
# pd.DataFrame(X_test).to_csv('iam_test_images.csv', index=False)
# pd.DataFrame(y_test).to_csv('iam_test_labels.csv', index=False)

<h3> 4.2. EMNIST Dataset Preprocessing</h3>

In [None]:
# Preprocess the EMNIST ByClass dataset
# extract_data already returns arrays of shape (num_images, 28, 28, 1),
# so no extra channel axis is added here
emnist_train_images = emnist_train_images.astype('float32') / 255.0
emnist_test_images = emnist_test_images.astype('float32') / 255.0

# Convert the EMNIST ByClass labels to one-hot encoding
emnist_train_labels = np.eye(62)[emnist_train_labels]
emnist_test_labels = np.eye(62)[emnist_test_labels]

print(f'EMNIST train image shape:{emnist_train_images.shape}')
print(f'EMNIST train label shape:{emnist_train_labels.shape}')
print(f'EMNIST test image shape:{emnist_test_images.shape}')
print(f'EMNIST test label shape:{emnist_test_labels.shape}')


<a name="Modelling"></a>
<h1>5. Model</h1>

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np


<h2>5.1. Model Architecture</h2>

* Simple Convolutional Neural Network (CNN) with two convolutional layers and two fully connected layers. 
* We compile the model using categorical cross-entropy loss and the Adam optimizer. 
* We normalize the pixel values and reshape the images to be 28x28 grayscale images. 

In [None]:
# Define the model architecture
model = Sequential([
    Conv2D(32, (3,3), padding='same', activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2,2)),
    Conv2D(64, (3,3), padding='same', activation='relu'),
    MaxPooling2D((2,2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(62, activation='softmax')
])

# Compile the model
opt = tf.keras.optimizers.legacy.Adam(learning_rate=0.001)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

# Summary
model.summary()
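
As a sanity check on `model.summary()`, the output shapes and parameter counts of this architecture can be worked out by hand. The sketch below is plain arithmetic under the layer settings above ('same' padding, 2x2 pooling); the helper names `conv2d_params` and `dense_params` are hypothetical.

```python
def conv2d_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units

side = 28
conv1 = conv2d_params(3, 1, 32)    # 'same' padding keeps the 28x28 size
side //= 2                         # 2x2 max pooling -> 14x14
conv2 = conv2d_params(3, 32, 64)   # still 14x14
side //= 2                         # 2x2 max pooling -> 7x7
flat = side * side * 64            # Flatten -> 3136 features
dense1 = dense_params(flat, 128)
dense2 = dense_params(128, 62)
total = conv1 + conv2 + dense1 + dense2

print(flat, conv1, conv2, dense1, dense2, total)  # 3136 320 18496 401536 7998 428350
```

Most of the parameters sit in the first dense layer, which is why the pooling layers before `Flatten` matter for model size.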

<a name="Training"></a>
<h2>5.2. Fit</h2>

In [None]:
# The EMNIST images were already normalized to [0, 1] and the labels one-hot encoded
# in section 4.2, so those steps are not repeated here.

# Make sure the images have shape (num_images, 28, 28, 1)
train_images = emnist_train_images.reshape(-1, 28, 28, 1)
test_images = emnist_test_images.reshape(-1, 28, 28, 1)

# Reuse the one-hot labels of shape (num_images, 62)
train_labels = emnist_train_labels
test_labels = emnist_test_labels

# Train the model
es = EarlyStopping(monitor='val_loss', patience=3)
penPalHistory = model.fit(train_images, train_labels, validation_split=0.1, epochs=32, callbacks=[es])


>Training starts at a loss of 0.9920 and an accuracy of 0.7032. As the number of epochs increases, the loss decreases and the accuracy improves on both the training and validation sets. After the 20th epoch, the training loss is 0.4337 with an accuracy of 0.8468, and the validation loss is 0.4095 with an accuracy of 0.8546.
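
For reference, the stopping rule used during training can be sketched in a few lines. This is a simplified model of `EarlyStopping(monitor='val_loss', patience=3)` with default settings, not Keras's actual implementation. The function name `early_stopping_epoch` is hypothetical and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float('inf')
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: all epochs run

made_up_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
print(early_stopping_epoch(made_up_losses))  # 6
```

In this toy run the best loss is reached at epoch 3, and training stops three epochs later even though a lower loss would have followed.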

<h5>Save model</h5>

In [None]:
# Save the trained model to disk
model.save('emnistModel.h5')


<a name="Validation"></a>
<h2>5.2. Model Validation</h2>

In [None]:
# Load the trained model (saved above as 'emnistModel.h5')
model = keras.models.load_model('emnistModel.h5')


In [None]:
# Evaluate the model on the test data
test_loss, test_acc = model.evaluate(test_images, test_labels)

# Print the test accuracy
print('Test accuracy:', test_acc)


<a name="Tuning"></a>
<h2>5.3. Model Tuning</h2>

>**Transfer Learning**

In [None]:
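# Load the model previously trained on EMNIST (saved above as 'emnistModel.h5')
iam_model = keras.models.load_model('emnistModel.h5')

# Replace the last layer
num_classes = 3
iam_model.pop()  # Sequential.pop() removes the old 62-way output layer; layers.pop() only edits a copy of the list
iam_model.add(Dense(num_classes, activation='softmax', name='new_dense'))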


# Freeze pretrained layers
for layer in iam_model.layers[:-1]:
    layer.trainable = False

# Compile
opt = tf.keras.optimizers.legacy.Adam(learning_rate=0.0001)
iam_model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

# Summary
iam_model.summary()

>**Fit**

In [None]:
# Normalize the pixel values to be between 0 and 1
train_images = iam_train_images.astype('float32') / 255.0
test_images = iam_test_images.astype('float32') / 255.0

# Reshape the normalized images to be 28x28 grayscale images
train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)
test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)

# Map the string labels to integer class indices, then one-hot encode them.
# The number of distinct labels must match the model's output size (num_classes).
label_to_index = {label: i for i, label in enumerate(np.unique(iam_labels))}
train_labels = tf.keras.utils.to_categorical([label_to_index[l] for l in iam_train_labels], num_classes=num_classes)
test_labels = tf.keras.utils.to_categorical([label_to_index[l] for l in iam_test_labels], num_classes=num_classes)

# Train the model on the preprocessed IAM data
es = EarlyStopping(monitor='val_loss', patience=3)
iam_model.fit(train_images, train_labels, validation_split=0.1, epochs=20, callbacks=[es])


>**Save**

In [None]:
# Save the trained model to disk
iam_model.save('penPalModel.h5')


<a name="Prediction"></a>
<h1>6. Model Evaluation</h1>

In [None]:
# The test images must be normalized exactly once; that is done in the
# preprocessing cells, so they are not divided by 255 again here.

# Reshape the test images to a 4D tensor with shape (num_samples, 28, 28, 1)
test_images = np.reshape(test_images, (test_images.shape[0], 28, 28, 1))

# Predict class probabilities for the test dataset
predictions = model.predict(test_images)

# Convert the predicted probabilities to predicted labels
predicted_labels = np.argmax(predictions, axis=1)

# Convert the one-hot test labels back to a 1D array of class indices
test_labels_flat = np.argmax(test_labels, axis=1)

# Calculate the Character Error Rate (CER) and Word Error Rate (WER).
# Each sample is a single character, so CER reduces to the misclassification
# rate and WER is reported as the same value.
cer = np.sum(np.not_equal(predicted_labels, test_labels_flat)) / len(test_labels_flat)
wer = cer


<h2> 6.1. Metrics</h2>

1. Character Error Rate (CER)
>It is the Levenshtein distance between the prediction and the ground truth, that is, the minimum number of character substitutions (Sc), insertions (Ic) and deletions (Dc) needed to transform one string into the other, divided by the total number of characters in the ground truth (Nc): CER = (Sc + Ic + Dc) / Nc.

2. Word Error Rate (WER)
>It is the minimum number of word substitutions (Sw), insertions (Iw) and deletions (Dw) required to transform one string into the other, divided by the total number of words in the ground truth (Nw): WER = (Sw + Iw + Dw) / Nw.
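
Both metrics can be sketched with one edit-distance routine. The example below is a minimal pure-Python version written for illustration. The notebook imports `levenshtein` from the `distance` package, which could be swapped in for the hypothetical `edit_distance` helper. The sample strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character edits divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(cer("handwriting", "handwritting"))                 # one inserted character out of 11
print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```

Unlike a per-character misclassification rate, these definitions also penalize inserted and deleted characters, which matters once whole words or lines are recognized.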

<h3>6.1.1. Evaluate the Model</h3>

In [None]:
# Print the CER and WER
print("Character Error Rate: {:.2%}".format(cer))
print("Word Error Rate: {:.2%}".format(wer))


>penPal performs reasonably well for handwriting recognition. With a Character Error Rate of 14.5%, the model makes an error on about 1 in 7 characters.

<h1>7. Database to Store the Digitized Notes</h1>

In [None]:
# Import relevant libraries
import sqlite3
conn = sqlite3.connect('digitized_notes.db')
cursor = conn.cursor()
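
One way the connection above might be used is sketched below. The table name `notes` and its columns are assumptions made for illustration, and an in-memory database stands in for `digitized_notes.db` so the example leaves no file behind.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # swap in 'digitized_notes.db' to persist
cursor = conn.cursor()

# Hypothetical schema: one row per digitized note
cursor.execute("""
    CREATE TABLE IF NOT EXISTS notes (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        image_path TEXT NOT NULL,
        recognized_text TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# A parameterized insert avoids quoting problems and SQL injection
cursor.execute(
    "INSERT INTO notes (image_path, recognized_text) VALUES (?, ?)",
    ("scans/page_001.png", "hello world"),
)
conn.commit()

rows = cursor.execute("SELECT id, image_path, recognized_text FROM notes").fetchall()
print(rows)  # [(1, 'scans/page_001.png', 'hello world')]
conn.close()
```

Storing the image path next to the recognized text keeps each prediction traceable to its source scan.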
