annotations_creators | language_creators | languages | licenses | multilinguality | size_categories | source_datasets | task_categories | task_ids | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|
|
|
- Dataset Description
- Dataset Structure
- Dataset Creation
- Considerations for Using the Data
- Additional Information
- Repository: Github repository for ChrEn
- Paper: ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization
- Point of Contact: benfrey@email.unc.edu
ChrEn is a Cherokee-English parallel dataset to facilitate machine translation research between Cherokee and English. ChrEn is extremely low-resource contains 14k sentence pairs in total, split in ways that facilitate both in-domain and out-of-domain evaluation. ChrEn also contains 5k Cherokee monolingual data to enable semi-supervised learning.
The dataset is intended to use for machine-translation
between Enlish (en
) and Cherokee (chr
).
The dataset contains Enlish (en
) and Cherokee (chr
) text. The data encompasses both existing dialects of Cherokee: the Overhill dialect, mostly spoken in Oklahoma (OK), and the Middle dialect, mostly used in North Carolina (NC).
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Many of the source texts were translations of English materials, which means that the Cherokee structures may not be 100% natural in terms of what a speaker might spontaneously produce. Each text was translated by people who speak Cherokee as the first language, which means there is a high probability of grammaticality. These data were originally available in PDF version. We apply the Optical Character Recognition (OCR) via Tesseract OCR engine to extract the Cherokee and English text.
[More Information Needed]
[More Information Needed]
The sentences were manually aligned by Dr. Benjamin Frey a proficient second-language speaker of Cherokee, who also fixed the errors introduced by OCR. This process is time-consuming and took several months.
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
The dataset was gathered and annotated by Shiyue Zhang, Benjamin Frey, and Mohit Bansal at UNC Chapel Hill.
The copyright of the data belongs to original book/article authors or translators (hence, used for research purpose; and please contact Dr. Benjamin Frey for other copyright questions).
@inproceedings{zhang2020chren,
title={ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization},
author={Zhang, Shiyue and Frey, Benjamin and Bansal, Mohit},
booktitle={EMNLP2020},
year={2020}
}