GitHub - burnsmatt92/PracticumII

Handwritten History: Automated Digitization and Transcription of Historical Documents

MSDS696 – Data Science Practicum II

The goal of this project was to see whether I could build an effective optical character recognition (OCR) model to offer the Windsor Art & Heritage Center of Colorado for their large collection of historical city documents. I went with TensorFlow and Keras as they were ranked highest for free models that excel at handwriting OCR. Through a partnership with the Windsor Art & Heritage Center, I was given hundreds of thousands of pages with no transcription. These books are over a hundred years old and are starting to show their age and succumb to the elements, so the clock is ticking. I wanted to see how well deep learning can be leveraged to ensure this history is not lost. While I didn't quite get there during the course, I'm continuing my work in hopes of delivering a standalone model that can be given to museums for transcription.

Dataset Example

Training

I trained the model at three different epoch counts: 60, 100, and 200.

60 Epochs

Training for 60 epochs worked decently well, but the model still missed the mark on several of the validation words.

Below you can see how it handled predicting the words from the training dataset.

And here is how it handled the words from my own dataset.

100 Epochs

100 epochs seemed to perform better, but we are starting to drift in the direction of overfitting.
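One way to put a number on that drift is to compare the training and validation loss curves epoch by epoch. The sketch below is a minimal illustration, not code from the notebooks: `summarize_overfitting` is a hypothetical helper, and it assumes the two curves are plain lists such as the `loss` and `val_loss` entries of the `history.history` dictionary that Keras returns from `model.fit`. The numbers in the usage example are made up.

```python
def summarize_overfitting(train_loss, val_loss):
    """Report the best validation epoch and how far validation loss drifted after it.

    Hypothetical helper for illustration; both arguments are per-epoch loss lists.
    """
    if len(train_loss) != len(val_loss) or not val_loss:
        raise ValueError("loss curves must be non-empty and the same length")

    # Index of the epoch with the lowest validation loss (first one on ties).
    best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)

    return {
        "best_epoch": best_idx + 1,  # 1-based, to match how epochs are reported
        "best_val_loss": val_loss[best_idx],
        # Gap between validation and training loss at the last epoch.
        "final_gap": val_loss[-1] - train_loss[-1],
        # How much validation loss has risen since its best point.
        "val_rise_since_best": val_loss[-1] - val_loss[best_idx],
    }


# Made-up curves for illustration only, not results from this project.
example_train = [2.0, 1.2, 0.8, 0.5, 0.3]
example_val = [2.1, 1.4, 1.1, 1.2, 1.5]
summary = summarize_overfitting(example_train, example_val)
print(summary)
```

A widening `final_gap` together with a positive `val_rise_since_best` is the usual signature of overfitting: training loss keeps falling while validation loss has already turned upward.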

Here is how it handled my dataset.

Some better and some worse.

200 Epochs

Finally, I decided to try 200 epochs. While the model shows some overfitting, especially in the performance graph, I find the results were better overall.

But you can see the results, despite overfitting, below.

Conclusion

My model wasn’t the most accurate. However, a lot of its output is readable. I was surprised it worked as well as it did given the legibility and photo-quality difficulties that arise.
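"Readable but not exact" can be measured with a character error rate (CER): the edit distance between the predicted word and the true word, divided by the length of the true word. The sketch below is an assumption about how this could be scored, not the metric used in the notebooks. Both function names are hypothetical, and the word pairs in the usage example are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(truth, predicted):
    """Edit distance normalized by the length of the ground-truth string."""
    if not truth:
        raise ValueError("ground-truth string must be non-empty")
    return edit_distance(truth, predicted) / len(truth)


# Made-up (truth, prediction) pairs for illustration only.
pairs = [("Windsor", "Windsor"), ("Windsor", "Winsor"), ("council", "counsil")]
for truth, predicted in pairs:
    print(truth, predicted, round(character_error_rate(truth, predicted), 3))
```

A prediction that is one letter off still scores a low CER, which matches the intuition that it remains readable even though an exact-match accuracy would count it as a miss.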

The model was designed for individual words, not sentences or pages. More work needs to be done to scale it up to handle the thousands of pages the museum has.

I didn’t accomplish my goal of transcribing all of Windsor’s documents. However, I did gain excellent experience in deep learning that I will continue to lean on.

I’m not finished with this project. I’m going to continue working on this so that I can help Windsor preserve their history with the digitization of their documents.

