Historical handwritten dataset & benchmark results for Ethiopic script recognition

bdu-birhanu/HHD-Ethiopic

HHD-Ethiopic

A text-line level historical handwritten Ethiopic OCR Dataset

Overview

This repository contains HHD-Ethiopic, a text-line level OCR dataset designed for historical handwritten Ethiopic text-image recognition, together with baseline models and human-level performance figures for benchmarking. The full paper is here.

Dataset Details

The HHD-Ethiopic dataset consists of ~80k text-line images extracted from historical handwritten Ethiopic manuscripts dating from the 18th to the 20th century. Each text-line image is accompanied by its ground-truth transcription. The dataset can be downloaded directly from Hugging Face (HHD-Ethiopic Dataset) and/or Zenodo (HHD-Ethiopic Dataset). Additional synthetically generated Ethiopic text-line images and their corresponding ground-truth texts are available from this link.

Sample text-line images and their corresponding ground-truth text are shown below. For a more thorough tutorial about the dataset, see formats of the dataset.

| No. | Text-line Image | Ground-Truth Text |
|---|---|---|
| 1 | [Image 1] | ወጽራኅየኒ፡ቦአ፡ቅድሜሁ፡ውስተ፡ዕዘኒሁ |
| 2 | [Image 2] | ፍራስ፡እሳት፡ወጽሩዓን |
| 3 | [Image 3] | ወአንሰ፡በብዝኃ፡አሀውዕ፡ቢተኩ |
| 4 | [Image 4] | ወአድኅነከ፡ይትፌሥሑ። |

Getting Started

In the current implementation, the NumPy format of the HHD-Ethiopic dataset is used for training and testing the baseline models. Download the dataset.

After downloading HHD-Ethiopic, install the requirements. For demonstration purposes, we use only the Train data and Test data stored in NumPy format. To train and test all of the baseline models, use the code under the all source codes link.

pip install -r requirements.txt
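Before training, it can help to confirm that the downloaded NumPy files load and that images and labels line up. The following is a minimal sketch, not the repository's own loader: the function name `load_split`, the demo file names, and the array shapes are all hypothetical, and the demo writes small synthetic arrays so that it runs without the real download.

```python
import numpy as np


def load_split(image_file, label_file):
    """Load one split (images and labels) from two .npy files.

    Hypothetical helper: the real file names and array layouts are
    defined by the dataset download, not by this sketch.
    """
    images = np.load(image_file)
    labels = np.load(label_file)
    # Every text-line image needs exactly one transcription.
    if len(images) != len(labels):
        raise ValueError(
            f"{len(images)} images but {len(labels)} labels; "
            "the two files may belong to different splits"
        )
    return images, labels


if __name__ == "__main__":
    import os
    import tempfile

    # Synthetic stand-in data (arbitrary shapes), used only for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        x_path = os.path.join(tmp, "x_demo.npy")
        y_path = os.path.join(tmp, "y_demo.npy")
        np.save(x_path, np.zeros((4, 32, 128), dtype=np.uint8))
        np.save(y_path, np.zeros((4, 10), dtype=np.int64))
        images, labels = load_split(x_path, y_path)
        print(images.shape, labels.shape)
```

To check the real data, call `load_split` with the paths of the downloaded Train or Test image and label files. If the labels were saved as object arrays, `np.load` would also need `allow_pickle=True`; whether that applies here is an assumption to verify against the download.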

To train the model from scratch

$ python3 train_model_plain_CTC.py

Alternatively, you can run the training code demonstration directly in Google Colab.

To predict/test

$ python3 test_model_plain_CTC.py

Alternatively, you can run the testing code demonstration directly in Google Colab.

*Please note that the two Colab demos provided here use the HPopt-Attn-CTC implementation as a sample.

Sample testing results

Sample results and Character Error Rate (CER) per line are shown below:

| Image | Ground-truth | Prediction | Edit Distance | CER per line (%) |
|---|---|---|---|---|
| [image] | ሰፉሐከ፡የማነከ፡ወውሕጠቶሙ፡ምድር። | ሰፉሕከ፡የማነከ፡ወውሕጠቶሙ፡ምድ። | 2 | 9 |
| [image] | ምድር፡ይኔጽር፡ዘሀሎ፡በየብስ፡ | ምድር፡ይኔጽር፡ዘሀሎ፡በየብስ፡ | 1 | 5 |
| [image] | ለብሔረ፡ኢትዮጵያ | አብሒረ፡ኢትየጵያ | 4 | 40 |
| [image] | ዓገሠ።በዝሕማም፡መሥጋ፡ | ዓገሠ።በዝሕማም፡በሥጋ፡ | 2 | 20 |
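The per-line figures above combine an edit distance with a Character Error Rate. As a rough illustration of how the two relate, the sketch below computes the Levenshtein distance between a ground-truth line and a prediction and divides it by the ground-truth length. The function names `edit_distance` and `cer_percent` are illustrative and do not come from the repository, and the choice to count Unicode code points without any text normalisation or rounding is an assumption, so the repository's evaluation code may report slightly different values.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    # previous[j] holds the distance between the reference prefix seen
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def cer_percent(reference, hypothesis):
    """Character Error Rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # First sample row from the table above.
    truth = "ሰፉሐከ፡የማነከ፡ወውሕጠቶሙ፡ምድር።"
    predicted = "ሰፉሕከ፡የማነከ፡ወውሕጠቶሙ፡ምድ።"
    print(edit_distance(truth, predicted), round(cer_percent(truth, predicted), 2))
```

Normalising by the ground-truth length means a CER above 100% is possible when a prediction is much longer than its reference.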

Feedback

We welcome contributions and feedback from the research community to further enhance the HHD-Ethiopic dataset and code. If you have any suggestions, please feel free to send them via email: birhanu.hailub@gmail.com or ethiopic.dataset@gmail.com

Acknowledgments

We would like to express our gratitude to the Ethiopian National Archive and Library Agency (ENALA) for providing access to the historical handwritten documents used in creating the HHD-Ethiopic dataset. We are also grateful to the ICT4D research center, Bahir Dar Institute of Technology, and ChaLearn for their funding. Furthermore, we would like to acknowledge the support and contributions of the annotators who made this dataset possible.

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
