The goal of this project was to see whether I could build an effective optical character recognition (OCR) model to offer the Windsor Art & Heritage Center of Colorado for its large collection of historical city documents. I went with TensorFlow and Keras because they were ranked highest among free tools that excel at handwriting OCR. Through a partnership with the Windsor Art & Heritage Center, I was given hundreds of thousands of pages with no transcriptions. These books are over a hundred years old, are really starting to show their age, and are succumbing to the elements, so the clock is ticking. I wanted to see how well deep learning can be leveraged to ensure this history is not lost. While I didn't quite get there during the course, I'm continuing the work in hopes of delivering a standalone model that can be given to museums for transcription.
I trained the model at three different epoch counts: 60, 100, and 200.
60 epochs worked decently well, but the model still missed the mark on several of the validation words.
Below you can see how it handled predicting the words from the training dataset.

And here is how it handled the words from my own dataset.

100 epochs seemed to perform better, but the model is starting to drift toward overfitting.
Here is how it handled my dataset.

Some predictions were better and some were worse.
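One way to put a number on the overfitting drift is to compare training and validation loss per epoch. The sketch below is a minimal, hypothetical helper, not the code used in this project. It assumes the per-epoch losses are available as plain lists (Keras exposes them as `history.history["loss"]` and `history.history["val_loss"]` when validation data is passed to `fit`), and the numbers in the example are made up for illustration, not results from my runs.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


def epochs_since_improvement(val_losses):
    """Count how many epochs ran after the best validation loss."""
    return len(val_losses) - best_epoch(val_losses)


def loss_gaps(train_losses, val_losses):
    """Per-epoch gap (validation minus training loss).

    A gap that keeps growing while training loss keeps falling
    is a common sign of overfitting.
    """
    return [v - t for t, v in zip(train_losses, val_losses)]


if __name__ == "__main__":
    # Made-up illustrative values, NOT measurements from this project.
    train = [0.90, 0.60, 0.40, 0.25, 0.15]
    val = [0.95, 0.70, 0.60, 0.65, 0.80]

    print("best epoch:", best_epoch(val))
    print("epochs since improvement:", epochs_since_improvement(val))
    print("gaps:", [round(g, 2) for g in loss_gaps(train, val)])
```

If the best epoch falls well before the end of training, the extra epochs are probably fitting the training words rather than improving on new handwriting.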
Finally, I decided to try 200 epochs. While this run shows some overfitting, especially in the performance graph, I found the results were better overall.
Despite the overfitting, you can see the results below.

My model wasn’t the most accurate. However, a lot of its output is readable. I was surprised it worked as well as it did given the legibility and photo-quality difficulties that come with these documents.
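"Readable but not fully accurate" can be measured with a character error rate (CER): the number of character edits needed to turn each prediction into the true word, divided by the total number of true characters. The sketch below is a hedged illustration in plain Python, not the evaluation I ran. The function names and the sample word pairs are hypothetical.

```python
def edit_distance(predicted, truth):
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn `predicted` into `truth`."""
    previous = list(range(len(truth) + 1))
    for i, p_char in enumerate(predicted, 1):
        current = [i]
        for j, t_char in enumerate(truth, 1):
            substitution = previous[j - 1] + (0 if p_char == t_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(pairs):
    """CER over (predicted, truth) word pairs: total edits / total true characters."""
    total_edits = sum(edit_distance(p, t) for p, t in pairs)
    total_chars = sum(len(t) for _, t in pairs)
    if total_chars == 0:
        raise ValueError("ground-truth text must not be empty")
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical example pairs, not output from my model.
    samples = [("Windsr", "Windsor"), ("town", "town")]
    print(f"CER: {character_error_rate(samples):.3f}")
```

A word-level accuracy score counts a prediction that is one letter off as a complete miss, while CER gives partial credit, which matches the "a lot of it is readable" impression more closely.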
The model was designed for single words, not sentences or pages. More work is needed to scale it up to handle the thousands of pages the museum has.
I didn’t accomplish my goal of transcribing all of Windsor’s documents. However, I did gain excellent experience in deep learning that I will continue to lean on.
I’m not finished with this project. I’m going to keep working on it so that I can help Windsor preserve its history by digitizing its documents.



