Massive vs. Curated Word Embeddings for Low-Resourced Languages. The Case of Yorùbá and Twi
In the paper, we compare pre-trained word embeddings (fastText and BERT) with embeddings trained on curated corpora for Yorùbá and Twi. For this purpose, we gather and select corpora and study the most appropriate techniques for these languages. We also create test sets to evaluate the word embeddings on a word similarity task (WordSim-353) and the contextual embeddings on a NER task. The corpora and the embeddings are available here.
For the comparison, we define 3 datasets according to the quality and quantity of textual data used for training:
- Curated Small Dataset (clean), C1: about 1.6 million tokens for Yorùbá and over 735k tokens for Twi. The clean text for Twi is the Bible; for Yorùbá, it is all texts marked under the C1 column in the figure below.
- Curated Small Dataset (clean + noisy), C2: we add noisy text to the clean corpus (Wikipedia articles for Twi and BBC Yorùbá news articles for Yorùbá). This increases the number of training tokens to 742k for Twi and to about 2 million for Yorùbá.
The evaluation datasets are the following:
- Translated WordSim-353 for Twi
- Translated WordSim-353 for Yorùbá
- Yorùbá named entity recognition (NER) dataset on GitHub: Yorùbá NER dataset
The dataset from the Niger-Volta Language Technologies Institute is available here.
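The word similarity evaluation can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: it assumes the usual WordSim-353 protocol (Spearman correlation between the cosine similarity of two word vectors and the human rating of the pair), and it assumes that pairs with out-of-vocabulary words are skipped. The function names, the in-memory `embeddings` dictionary, and the toy words and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate_wordsim(embeddings, pairs):
    """Return (Spearman rho, number of pairs actually scored).

    Assumption: pairs containing a word without a vector are skipped.
    """
    model_scores, human_scores = [], []
    for word1, word2, gold in pairs:
        if word1 in embeddings and word2 in embeddings:
            model_scores.append(cosine(embeddings[word1], embeddings[word2]))
            human_scores.append(gold)
    return spearman(model_scores, human_scores), len(human_scores)


# Toy data (hypothetical words and vectors, for illustration only).
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
toy_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0), ("a", "missing", 3.0)]
rho, used = evaluate_wordsim(toy_embeddings, toy_pairs)
print(f"Spearman rho = {rho:.3f} on {used} pairs")
```

In practice, the `embeddings` dictionary would be filled from one of the released embedding files and `pairs` from the translated WordSim-353 file for the corresponding language.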
To reproduce the NER results with our fine-tuned BERT models, please use a modified version of the BERT-NER code. First, copy all the BERT embeddings from the Google Drive folder, together with Google's uncased multilingual BERT base model, into the bert_models directory of the GitHub repository. Then run the following bash scripts in the BERT-NER directory:
- sh run_ner_yorubaPM.sh for the baseline model, i.e., uncased multilingual BERT
- sh run_ner_yorubaFMB.sh for uncased multilingual BERT fine-tuned on the Yorùbá corpus with the default multilingual vocab.txt
- sh run_ner_yorubaFM.sh for uncased multilingual BERT fine-tuned on the Yorùbá corpus with a mostly Yorùbá vocabulary (i.e., in vocab.txt). We found that fine-tuning with the Yorùbá vocabulary performed better than fine-tuning with the multilingual vocab.txt.
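The three runs above can also be driven from a small wrapper. This is only a sketch: the script names come from the list above, but the variant labels, the default `repo_dir` value, and the check for the bert_models directory are assumptions made for illustration. By default it performs a dry run and only prints the command.

```python
import subprocess
from pathlib import Path

# Script names are taken from the list above; the dictionary keys are
# hypothetical labels introduced for this sketch.
SCRIPTS = {
    "baseline": "run_ner_yorubaPM.sh",
    "finetuned_multilingual_vocab": "run_ner_yorubaFMB.sh",
    "finetuned_yoruba_vocab": "run_ner_yorubaFM.sh",
}


def build_command(variant):
    """Return the shell command (as an argument list) for a model variant."""
    return ["sh", SCRIPTS[variant]]


def run_variant(variant, repo_dir="BERT-NER", dry_run=True):
    """Run one NER script inside the BERT-NER checkout.

    Assumptions: repo_dir points at a local copy of the modified BERT-NER
    code and the models were already copied into its bert_models directory.
    """
    repo = Path(repo_dir)
    command = build_command(variant)
    if dry_run:
        # Only show what would be executed.
        print(f"(dry run) cd {repo} && {' '.join(command)}")
        return command
    if not (repo / "bert_models").is_dir():
        raise FileNotFoundError(f"expected a bert_models directory in {repo}")
    subprocess.run(command, cwd=repo, check=True)
    return command


for name in SCRIPTS:
    run_variant(name)
```

Passing `dry_run=False` executes the script; `check=True` makes the wrapper raise an error if the script exits with a non-zero status.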
If you use any of the resources on this page, please cite the paper:
Jesujoba O. Alabi, Kwabena Amponsah-Kaakyire, David I. Adelani, and Cristina España-Bonet. Massive vs. curated word embeddings for low-resourced languages. The case of Yorùbá and Twi. In LREC, 2020.