SwahBERT: Language model of Swahili

SwahBERT is a pretrained monolingual language model for Swahili.
The model was trained for 800K steps on a 105MB corpus collected from news sites, online discussion forums, and Wikipedia.
It was evaluated on several downstream tasks: emotion classification, news classification, sentiment classification, and named entity recognition.

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("swahbert-base-uncased")

# Tokenized input
text = "Mlima Kilimanjaro unapatikana Tanzania"
tokenized_text = tokenizer.tokenize(text)

# Tokens produced for the same sentence by each tokenizer:
# SwahBERT => ['mlima', 'kilimanjaro', 'unapatikana', 'tanzania']
# mBERT => ['ml', '##ima', 'ki', '##lima', '##nja', '##ro', 'una', '##patikana', 'tan', '##zania']
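The difference between the two tokenizations above can be quantified as the average number of subword tokens per word (often called fertility). The following is a minimal sketch that uses only the two token lists shown above, so it needs neither model; the helper names `fertility` and `merge_wordpieces` are illustrative and not part of this repository.

```python
# Token lists copied from the example above.
text = "Mlima Kilimanjaro unapatikana Tanzania"
swahbert_tokens = ['mlima', 'kilimanjaro', 'unapatikana', 'tanzania']
mbert_tokens = ['ml', '##ima', 'ki', '##lima', '##nja', '##ro',
                'una', '##patikana', 'tan', '##zania']


def fertility(tokens, n_words):
    """Average number of subword tokens per whitespace-separated word."""
    return len(tokens) / n_words


def merge_wordpieces(tokens):
    """Rejoin WordPiece tokens: a '##' prefix marks a continuation piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words


n_words = len(text.split())
print("SwahBERT fertility:", fertility(swahbert_tokens, n_words))
print("mBERT fertility:", fertility(mbert_tokens, n_words))
print("mBERT pieces rejoined:", merge_wordpieces(mbert_tokens))
```

A lower value means each word is kept closer to a single vocabulary entry, which is the behaviour the SwahBERT output shows for this sentence.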

Pre-training data

The text was extracted from the following sources:

News sites: United Nations news, Voice of America (VoA), Deutsche Welle (DW) and taifaleo
Forums: JaiiForum
Wikipedia.

Pre-trained Models

Download the models here:

SwahBERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 124M parameters
SwahBERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 111M parameters

Steps	vocab size	MLM acc	NSP acc	loss
800K	50K (uncased)	76.54	99.67	1.0667
800K	32K (cased)	76.94	99.33	1.0562

Emotion Dataset

We released the Swahili emotion dataset.
The data consists of ~13K emotion-annotated comments drawn from social media platforms and from a translated English dataset.
The data is multi-label: each comment is tagged with one or more of Ekman's six emotions (happy, surprise, sadness, fear, anger, and disgust) or marked as neutral.
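Multi-label targets like these are commonly encoded as multi-hot vectors before training a classifier. The sketch below shows one way to do that under stated assumptions: the label order, the `encode_labels` helper, and the choice to represent neutral as an all-zero vector are illustrative and are not taken from the dataset files.

```python
# The six emotion names come from the description above; their order here is an assumption.
EMOTIONS = ["happy", "surprise", "sadness", "fear", "anger", "disgust"]
NEUTRAL = "neutral"


def encode_labels(labels):
    """Turn a list of emotion names into a multi-hot vector.

    Assumption: a neutral comment carries no emotion, so it maps to all zeros.
    """
    label_set = set(labels)
    unknown = label_set - set(EMOTIONS) - {NEUTRAL}
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    if NEUTRAL in label_set and len(label_set) > 1:
        raise ValueError("neutral cannot be combined with other emotions")
    return [1 if emotion in label_set else 0 for emotion in EMOTIONS]


print(encode_labels(["happy", "surprise"]))
print(encode_labels(["neutral"]))
```

Vectors in this form can be paired with a sigmoid output layer and a binary cross-entropy loss, which is a common setup for multi-label classification.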

Evaluation

The model was tested on four downstream tasks, including our new emotion dataset.

F1-score of language models on downstream tasks

Tasks	SwahBERT	SwahBERT_cased	mBERT
Emotion	64.46	64.77	60.52
News	90.90	89.90	89.73
Sentiment	70.94	71.12	67.20
NER	88.50	88.60	89.36
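To make the comparison easier to read, the table can be reduced to the gain of each SwahBERT variant over mBERT. This is a small sketch that only does arithmetic on the F1 scores listed above; the dictionary layout and the `gain_over_mbert` helper are illustrative.

```python
# F1 scores copied from the table above: (SwahBERT, SwahBERT_cased, mBERT).
F1_SCORES = {
    "Emotion": (64.46, 64.77, 60.52),
    "News": (90.90, 89.90, 89.73),
    "Sentiment": (70.94, 71.12, 67.20),
    "NER": (88.50, 88.60, 89.36),
}


def gain_over_mbert(scores):
    """Return per-task F1 differences (uncased, cased) relative to mBERT."""
    gains = {}
    for task, (uncased, cased, mbert) in scores.items():
        gains[task] = (round(uncased - mbert, 2), round(cased - mbert, 2))
    return gains


for task, (uncased_gain, cased_gain) in gain_over_mbert(F1_SCORES).items():
    print(f"{task}: uncased {uncased_gain:+.2f}, cased {cased_gain:+.2f}")
```

Positive values indicate that SwahBERT scores higher than mBERT on that task; on these numbers, NER is the one task where mBERT is ahead.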

Citation

Please use the following citation if you use the model or dataset:

@inproceedings{martin-etal-2022-swahbert,
    title = "{S}wah{BERT}: Language Model of {S}wahili",
    author = "Martin, Gati  and Mswahili, Medard Edmund  and Jeong, Young-Seob  and Woo, Jiyoung",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.23",
    pages = "303--313"
    }
