annotations_creators

language_creators

languages

licenses

multilinguality

size_categories

source_datasets

task_categories

task_ids

monolingual

monolingual_raw

parallel

parallel_raw

no-annotation

found

expert-generated

found

monolingual

monolingual_raw

parallel

parallel_raw

chr

en

chr

en

chr

en

other-different-license-per-source

monolingual

monolingual_raw

parallel

parallel_raw

multilingual

monolingual

translation

monolingual

monolingual_raw

parallel

parallel_raw

100K<n<1M

1K<n<10K

10K<n<100K

original

monolingual

monolingual_raw

parallel

parallel_raw

conditional-text-generation

sequence-modeling

conditional-text-generation

monolingual

monolingual_raw

parallel

parallel_raw

machine-translation

language-modeling

machine-translation

Dataset Card for ChrEn

Dataset Description

Repository: GitHub repository for ChrEn
Paper: ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization
Point of Contact: benfrey@email.unc.edu

Dataset Summary

ChrEn is a Cherokee-English parallel dataset built to facilitate machine translation research between Cherokee and English. It is extremely low-resource, containing 14k sentence pairs in total, split in ways that facilitate both in-domain and out-of-domain evaluation. ChrEn also includes 5k of Cherokee monolingual data to enable semi-supervised learning.
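To illustrate what an in-domain versus out-of-domain split can look like, here is a minimal sketch in Python. It is not the procedure used to build ChrEn: the function `split_by_domain`, the `(source, chr, en)` tuple layout, and the source names are all hypothetical, chosen only to show the idea of holding out whole sources for out-of-domain evaluation while sampling in-domain evaluation data from the remaining sources.

```python
import random


def split_by_domain(pairs, held_out_sources, eval_fraction=0.1, seed=0):
    """Split (source, chr_sentence, en_sentence) tuples into three parts.

    Hypothetical helper, not the ChrEn authors' procedure:
    - pairs from `held_out_sources` form the out-of-domain evaluation set;
    - a random fraction of the remaining pairs forms the in-domain
      evaluation set;
    - everything else is training data.
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    in_domain, out_of_domain = [], []
    for pair in pairs:
        source = pair[0]
        if source in held_out_sources:
            out_of_domain.append(pair)
        else:
            in_domain.append(pair)

    rng.shuffle(in_domain)
    n_eval = int(len(in_domain) * eval_fraction)
    return {
        "train": in_domain[n_eval:],
        "in_domain_eval": in_domain[:n_eval],
        "out_of_domain_eval": out_of_domain,
    }


if __name__ == "__main__":
    # Placeholder data; the strings stand in for real sentence pairs.
    toy = [("book_a", f"chr {i}", f"en {i}") for i in range(20)]
    toy += [("article_b", f"chr {i}", f"en {i}") for i in range(5)]
    splits = split_by_domain(toy, {"article_b"}, eval_fraction=0.25)
    for name, part in splits.items():
        print(name, len(part))
```

Holding out entire sources, rather than random sentences, is what makes the second evaluation set out-of-domain: no sentence from those sources is ever seen during training.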

Supported Tasks and Leaderboards

The dataset is intended for machine translation between English (en) and Cherokee (chr).
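A minimal loading sketch, assuming the dataset is published on the Hugging Face Hub and the `datasets` library is installed. The four configuration names come from this card's metadata; the dataset identifier `"chr_en"` and the wrapper `load_chren` are assumptions made for illustration, so check the Hub page for the identifier that currently applies.

```python
# Configuration names as listed in this card's metadata.
CONFIGS = ("monolingual", "monolingual_raw", "parallel", "parallel_raw")


def load_chren(config="parallel", dataset_id="chr_en"):
    """Load one ChrEn configuration with the Hugging Face `datasets` library.

    `dataset_id` is an assumption; adjust it if the dataset lives under
    a different name on the Hub.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(dataset_id, config)


if __name__ == "__main__":
    dataset = load_chren("parallel")  # downloads data; needs network access
    print(dataset)
```

Printing the returned object shows the available splits and fields, which is a quick way to inspect the structure before training a translation model.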

Languages

The dataset contains English (en) and Cherokee (chr) text. The data covers both existing dialects of Cherokee: the Overhill dialect, mostly spoken in Oklahoma (OK), and the Middle dialect, mostly used in North Carolina (NC).

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Many of the source texts are translations of English materials, so the Cherokee structures may not be 100% natural in terms of what a speaker might spontaneously produce. Each text was translated by people who speak Cherokee as their first language, so there is a high probability of grammaticality. The data were originally available as PDFs. We applied Optical Character Recognition (OCR) with the Tesseract OCR engine to extract the Cherokee and English text.
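Once OCR has produced mixed Cherokee and English text, one plausible post-processing step is to separate lines by script. The sketch below is an illustration only, not the authors' pipeline: the helper names and the 0.5 threshold are hypothetical, and the OCR call itself is not shown. It relies on the fact that the Cherokee syllabary occupies the Unicode blocks U+13A0 to U+13FF and U+AB70 to U+ABBF.

```python
def is_cherokee(ch):
    """True if `ch` falls in the Cherokee or Cherokee Supplement block."""
    cp = ord(ch)
    return 0x13A0 <= cp <= 0x13FF or 0xAB70 <= cp <= 0xABBF


def cherokee_ratio(text):
    """Fraction of alphabetic characters in `text` that are Cherokee."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if is_cherokee(ch)) / len(letters)


def split_ocr_lines(lines, threshold=0.5):
    """Partition non-empty OCR lines into (cherokee_lines, english_lines).

    A line counts as Cherokee when at least `threshold` of its letters
    are in the Cherokee blocks; the threshold is an arbitrary choice.
    """
    chr_lines, en_lines = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines left over from page layout
        if cherokee_ratio(stripped) >= threshold:
            chr_lines.append(stripped)
        else:
            en_lines.append(stripped)
    return chr_lines, en_lines


if __name__ == "__main__":
    # "\u13e3\u13b3\u13a9" is the word for "Cherokee" in the syllabary.
    sample = ["\u13e3\u13b3\u13a9", "Cherokee", "   "]
    print(split_ocr_lines(sample))
```

A ratio-based test, rather than an all-or-nothing check, tolerates the occasional stray Latin character that OCR can introduce into a Cherokee line.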

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

The sentences were manually aligned by Dr. Benjamin Frey, a proficient second-language speaker of Cherokee, who also fixed the errors introduced by OCR. This process was time-consuming and took several months.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

The dataset was gathered and annotated by Shiyue Zhang, Benjamin Frey, and Mohit Bansal at UNC Chapel Hill.

Licensing Information

The copyright of the data belongs to the original book and article authors or translators (hence, the data is used for research purposes; please contact Dr. Benjamin Frey with other copyright questions).

Citation Information

@inproceedings{zhang2020chren,
  title={ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization},
  author={Zhang, Shiyue and Frey, Benjamin and Bansal, Mohit},
  booktitle={EMNLP2020},
  year={2020}
}
