geezorg/emufi-training-set

EMUFI Training Set Collection

This repository contains pages from selected manuscripts, along with their typed transcriptions (ground truth), to serve as a training set for OCR systems. The collection has been developed to train systems (in particular Transkribus) to recognize symbols unique to the Geʾez manuscript tradition, such as bichromatic punctuation and numerals, ligatures, character and punctuation variants, and specialized marks.

The transcribed documents in the collection use the EMUFI-assigned addresses for the non-standard symbols. A further goal of the training set is to develop an OCR model that targets the EMUFI encoding.
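As a rough illustration of what checking a transcription for such symbols might look like, the sketch below counts characters in the Unicode private-use category. It assumes the EMUFI addresses fall in a Private Use Area, which this README does not state. The function name `private_use_counts` and the code points in the sample string are hypothetical placeholders, not actual EMUFI assignments.

```python
import unicodedata
from collections import Counter


def private_use_counts(text: str) -> Counter:
    """Count characters whose Unicode general category is 'Co' (private use).

    Assumption: the non-standard symbols are encoded at private-use addresses.
    """
    return Counter(
        f"U+{ord(ch):04X}" for ch in text if unicodedata.category(ch) == "Co"
    )


# Placeholder sample: Ethiopic letters and a word separator mixed with
# two made-up private-use code points (not real EMUFI addresses).
sample = "ቃል\uE000፡\uE000\uE001"
print(private_use_counts(sample))
```

Run over a ground-truth file, a tally like this would give a quick view of which non-standard symbols a sample page actually exercises.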

Collections

Manuscripts and sample pages have been selected to provide a variety of the non-standard manuscript symbols previously surveyed for the EMUFI project. The manuscript samples in the collection are summarized below:

British Library

The British Library digitized manuscripts collection contains the following usage statement:

Usage Statement: The Ethiopian manuscripts published in digitised format by the British Library are to the best of our knowledge not in copyright under Ethiopian law. However the British Library recognises broader interests in the cultural heritage which the Ethiopian manuscripts represent. The manuscripts included are often of a religious nature, and the Library has taken considerable care not to distort or alter the underlying material. We ask users also to show appropriate respect in reusing the digital images of the Ethiopian manuscripts, which should not be altered or reused in ways that might be derogatory or offensive to the Ethiopian communities for whom they are of special cultural importance.

OR 598

Title: ግብረ ሕማማት, Lectionary of the Holy Week.
Home: https://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Or_598
Date: 1708-11
Folios: f.2r - f.8r (13 pages)

OR 603

Title: ነገር ማርያም, Nagara Māryām "The Story of Mary".
Home: https://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Or_603
Date: 1721-1730
Folios: f2.r - f7.v (12 pages)

OR 649

Title: ታምረ፡ ማርያም, Miracles of the blessed Virgin Mary.
Home: http://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Or_649
Date: 18th century
Folios: f10.r - f15.v (12 pages)

OR 699

Title: ገድሎሙ ለቅዱሳን በረላም ወይዋስፉ, The History of Baralam and Yewasef.
Home: http://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Or_699
Date: 1746-1755
Folios: f4.r - f8.v (11 pages)

OR 719

Title: ገድለ፡ ቅዱስ፡ ላሊበላ, The Acts of Lālibalā.
Home: http://www.bl.uk/manuscripts/FullDisplay.aspx?ref=Or_719
Date: 1431-4
Folios: f7.r - f12.v (12 pages)

Princeton Ethiopic Manuscripts (PEM)

The PEM resources link to the following rights statement, which indicates that there is no known copyright associated with the artifact: https://rightsstatements.org/page/NKC/1.0/?language=en

C0776 Item 61

Title: Princeton Ethiopic Manuscript No. 61: Homilies and Miracles of St. Michael.
Home: https://findingaids.princeton.edu/catalog/C0776_c01260
Date: 1800s
Folios: 352 - 364 (13 pages)

C0776 Item 65

Title: Princeton Ethiopic Manuscript No. 65: Miracles of Mary.
Home: https://findingaids.princeton.edu/catalog/C0776_c1384
Date: circa 1720
Folios: 15 - 29 (11 pages)
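The British Library entries above give folio ranges in recto/verso form together with a page count. Since each folio has two sides, the count for an inclusive range can be worked out as in the sketch below. The helper name `folio_page_count` and the normalized input form (`"2r"`, `"8r"`) are assumptions made for illustration; the entries themselves write folio references in slightly different ways.

```python
import re


def folio_page_count(first: str, last: str) -> int:
    """Count pages from folio side `first` to `last`, inclusive.

    Sides are given as a folio number plus 'r' (recto) or 'v' (verso),
    for example '2r' or '12v'.
    """

    def position(side: str) -> int:
        match = re.fullmatch(r"(\d+)([rv])", side)
        if match is None:
            raise ValueError(f"unrecognized folio side: {side!r}")
        number, face = int(match.group(1)), match.group(2)
        # Recto comes before verso on the same folio.
        return 2 * number + (1 if face == "v" else 0)

    return position(last) - position(first) + 1


print(folio_page_count("2r", "8r"))  # OR 598 range: 13
```

For the PEM entries, which are listed by plain page number, the inclusive count is simply `last - first + 1`.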

About

Geʾez manuscript samples for training OCR systems to recognize EMUFI extensions.
