IndT5: A Text-to-Text Transformer for 10 Indigenous Languages

In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpus, a new corpus for 10 Indigenous languages and Spanish. We also apply IndT5 to machine translation, investigating different approaches to translating between Spanish and the Indigenous languages as part of our contribution to the AmericasNLP 2021 Shared Task on Open Machine Translation.

IndT5

We train an Indigenous language model using the unified and flexible text-to-text transfer Transformer (T5) approach. T5 treats every text-based language task as a “text-to-text” problem, taking text as input and producing new text as output. T5 is essentially an encoder-decoder Transformer, with the encoder and decoder each similar in configuration and size to BERT_Base but with some architectural modifications. These modifications include applying a normalization layer before each sub-block and adding the sub-block's initial input to its output (a pre-norm residual connection).
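
The pre-norm arrangement described above can be illustrated with a minimal sketch. It is not the IndT5 implementation: it assumes a scale-only normalization (rescaling by the root mean square, as described for T5, with the learned gain omitted for brevity), and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sub-layer.

```python
import math
from typing import Callable, List

Vector = List[float]


def scale_only_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Rescale x by its root mean square (no mean subtraction, no bias)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def pre_norm_sub_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Normalize first, apply the sub-layer, then add the original input back."""
    normalized = scale_only_norm(x)
    transformed = sublayer(normalized)
    # Residual connection: the un-normalized input is added to the output.
    return [a + b for a, b in zip(x, transformed)]


def toy_sublayer(h: Vector) -> Vector:
    """Hypothetical sub-layer used only for illustration."""
    return [2.0 * v for v in h]


hidden = [3.0, 4.0]
output = pre_norm_sub_block(hidden, toy_sublayer)
```

Because the residual path carries the un-normalized input, a sub-layer that outputs zeros leaves the hidden state unchanged.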

Download IndT5 Model

The IndT5 model (PyTorch checkpoint) is available on the Hugging Face website for direct download and use, exclusively for research.
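
A minimal sketch of loading the checkpoint with the `transformers` library follows. The model identifier `UBC-NLP/IndT5` is an assumption based on the repository name and should be checked against the Hugging Face page; `load_indt5` and `generate_text` are hypothetical helper names introduced here for illustration.

```python
MODEL_ID = "UBC-NLP/IndT5"  # assumed Hugging Face identifier; verify before use


def load_indt5(model_id: str = MODEL_ID):
    """Download (or load from the local cache) the tokenizer and model."""
    # Imported lazily so this file can be read without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def generate_text(text: str, tokenizer, model, max_new_tokens: int = 64) -> str:
    """Encode the input text, run generation, and decode the first output."""
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (requires network access plus the transformers and torch packages):
#   tokenizer, model = load_indt5()
#   print(generate_text("your input text", tokenizer, model))
```

The released checkpoint is presumably the pre-trained language model, so it would likely need fine-tuning on parallel data before it produces useful translations.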

IndCorpus

We build IndCorpus, a collection of text in 10 Indigenous languages and Spanish totaling 1.17 GB, drawn from both Wikipedia and the Bible.

Demographic information for the 10 Indigenous languages

Language	Language Code	Main Location	Number of Speakers
Aymara	aym	Bolivia	1,677,100
Asháninka	cni	Peru	35,200
Bribri	bzd	Costa Rica	7,000
Guarani	gn	Paraguay	6,652,790
Hñähñu	oto	Mexico	88,500
Nahuatl	nah	Mexico	410,000
Quechua	quy	Peru	7,384,920
Rarámuri	tar	Mexico	9,230
Shipibo-Konibo	shp	Peru	22,500
Wixarika	hch	Mexico	52,500

Data size and number of sentences in the monolingual dataset (collected from Wikipedia and the Bible)

Target Language	Wiki Size (MB)	Wiki #Sentences	Bible Size (MB)	Bible #Sentences
Hñähñu	-	-	1.4	7.5K
Wixarika	-	-	1.3	7.5K
Nahuatl	5.8	61.1K	1.5	7.5K
Guarani	3.7	28.2K	1.3	7.5K
Bribri	-	-	1.5	7.5K
Rarámuri	-	-	1.9	7.5K
Quechua	5.9	97.3K	4.9	31.1K
Aymara	1.7	32.9K	5	30.7K
Shipibo-Konibo	-	-	1	7.9K
Asháninka	-	-	1.4	7.8K
Spanish	1.13K	5M	-	-
Total	1.15K	5.22M	19.8	125.3K

Parallel datasets for machine translation

The datasets are provided by the AmericasNLP 2021 Shared Task on Open Machine Translation (https://github.com/AmericasNLP/americasnlp2021).

Number of sentences in the parallel dataset (train, dev, and test sets)

Language Pair	Train	Dev	Test
es-aym	6,531	996	1,003
es-cni	3,883	883	1,003
es-bzd	7,506	996	1,003
es-gn	26,032	995	1,003
es-oto	4,889	599	1,003
es-nah	16,145	672	1,003
es-quy	125,008	996	1,003
es-tar	14,720	995	1,003
es-shp	14,592	996	1,003
es-hch	8,966	994	1,003
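
The training sets differ widely in size across language pairs. The short sketch below simply totals the Train column of the table above and reports each pair's share. The numbers are copied from the table, and the variable names are illustrative only.

```python
# Training-set sizes copied from the table above (sentences per language pair).
train_sizes = {
    "es-aym": 6_531,
    "es-cni": 3_883,
    "es-bzd": 7_506,
    "es-gn": 26_032,
    "es-oto": 4_889,
    "es-nah": 16_145,
    "es-quy": 125_008,
    "es-tar": 14_720,
    "es-shp": 14_592,
    "es-hch": 8_966,
}

total_train = sum(train_sizes.values())

# Fraction of all training sentences contributed by each pair.
shares = {pair: count / total_train for pair, count in train_sizes.items()}

largest_pair = max(train_sizes, key=train_sizes.get)
smallest_pair = min(train_sizes, key=train_sizes.get)

for pair, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{pair}: {train_sizes[pair]:>7,} sentences ({share:.1%})")
```

The Spanish-Quechua pair alone accounts for more than half of the 228,272 training sentences, which is worth keeping in mind when comparing results across languages.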

BibTeX

  title={IndT5: A Text-to-Text Transformer for 10 Indigenous Languages},
  author={Chen, Wei-Rui and Abdul-Mageed, Muhammad and Cavusoglu, Hasan and others},
  booktitle={Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas},
  pages={265--271},
  year={2021}
  }
