<img src="https://github.com/UBC-NLP/afrolid/raw/main/images/afrolid_logo.jpg">

AfroLID is a neural language identification (LID) toolkit for 517 African languages and varieties. It draws on a multi-domain web dataset manually curated from 14 language families and written in five orthographic systems. AfroLID is described in the paper
[**AfroLID: A Neural Language Identification Tool for African Languages**](https://arxiv.org/abs/2210.11744).


## (1) Install AfroLID

In [1]:
!pip install -U git+https://github.com/UBC-NLP/afrolid.git --quiet

  Building wheel for afrolid (setup.py) ... done
  Building wheel for antlr4-python3-runtime (setup.py) ... done


## (2) Download the model

In [2]:
!wget https://demos.dlnlp.ai/afrolid/afrolid_model.tar.gz
!tar -xf afrolid_model.tar.gz

--2022-12-05 22:21:02--  https://demos.dlnlp.ai/afrolid/afrolid_model.tar.gz
Resolving demos.dlnlp.ai (demos.dlnlp.ai)... 74.208.236.113, 2607:f1c0:100f:f000::264
Connecting to demos.dlnlp.ai (demos.dlnlp.ai)|74.208.236.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2277022086 (2.1G) [application/gzip]
Saving to: ‘afrolid_model.tar.gz’


2022-12-05 22:21:33 (72.0 MB/s) - ‘afrolid_model.tar.gz’ saved [2277022086/2277022086]



## (3) Initialize the AfroLID object

In [16]:
import os, sys
import logging
from afrolid.main import classifier

In [29]:
logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=os.environ.get("LOGLEVEL", "INFO").upper(),
    force=True, # Resets any previous configuration
)
logger = logging.getLogger("afroli")


In [30]:
cl = classifier(logger, model_path="/content/afrolid_model")

2022-12-05 22:36:15 | INFO | afroli | Initalizing AfroLID's task and model.


| [input] dictionary: 64001 types
| [label] dictionary: 528 types
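
The `model_path` above is hard-coded to a Colab-style location. As a minimal sketch, assuming the archive was extracted into the current working directory as a folder named `afrolid_model`, the path can be resolved and checked before the classifier is created. `resolve_model_dir` is a hypothetical helper written for this illustration; it is not part of the `afrolid` package.

```python
from pathlib import Path


def resolve_model_dir(base_dir=".", name="afrolid_model"):
    """Return the absolute path of the extracted model folder, or raise if it is missing.

    Hypothetical helper for illustration; not part of the afrolid package.
    The folder name is an assumption based on the model_path used above.
    """
    model_dir = Path(base_dir).resolve() / name
    if not model_dir.is_dir():
        raise FileNotFoundError(
            f"{model_dir} not found; download and extract afrolid_model.tar.gz first."
        )
    return str(model_dir)
```

The returned string could then be passed in place of the literal path, for example `classifier(logger, model_path=resolve_model_dir())`.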


## (4) Get language prediction(s)

In [50]:
## Gold label = dip
text="6Acï looi aya në wuöt dït kɔ̈k yiic ku lɔ wuöt tɔ̈u tëmec piny de Manatha ku Eparaim ku Thimion , ku ɣään mec tɔ̈u të lɔ rut cï Naptali"
predicted_langs = cl.classify(text) # default max_outputs=3
print("Predicted languages:")
for lang in predicted_langs:
  print("     |-- ISO: {}\tName: {}\tScript: {}\tScore: {}%".format(
                      lang,
                      predicted_langs[lang]['name'], 
                      predicted_langs[lang]['script'],
                      predicted_langs[lang]['score']))

2022-12-05 23:23:11 | INFO | afroli | Input text: 6Acï looi aya në wuöt dït kɔ̈k yiic ku lɔ wuöt tɔ̈u tëmec piny de Manatha ku Eparaim ku Thimion , ku ɣään mec tɔ̈u të lɔ rut cï Naptali


Predicted languages:
     |-- ISO: dip	Name: Dinka, Northeastern	Script: Latin	Score: 100.0%


In [52]:
## Gold label = kmy
text="Ama vuodieke nɩŋ mana n Chʋa Ŋmɩŋ dɩ nagɩna yɩ mɩŋ , nan keŋ n jigiŋ a yi mɩŋ yada , ta n kaaŋ yagɩ vuodieke nɩŋ dɩ kienene n jigiŋ"
predicted_langs = cl.classify(text)  # default max_outputs=3
print("Predicted languages:")
for lang in predicted_langs:
  print("     |-- ISO: {}\tName: {}\tScript: {}\tScore: {}%".format(
                      lang,
                      predicted_langs[lang]['name'], 
                      predicted_langs[lang]['script'],
                      predicted_langs[lang]['score']))

2022-12-05 23:24:28 | INFO | afroli | Input text: Ama vuodieke nɩŋ mana n Chʋa Ŋmɩŋ dɩ nagɩna yɩ mɩŋ , nan keŋ n jigiŋ a yi mɩŋ yada , ta n kaaŋ yagɩ vuodieke nɩŋ dɩ kienene n jigiŋ


Predicted languages:
     |-- ISO: kma	Name: Konni	Script: Latin	Score: 68.42%
     |-- ISO: kmy	Name: Koma	Script: Latin	Score: 31.58%
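
In the second example the score is split between two languages, so the top result alone can be misleading. The sketch below picks the highest-scoring language and flags whether it clears a threshold. It assumes `cl.classify` returns a dictionary keyed by ISO code with a numeric `score` in percent, as the printed output suggests; `top_prediction` and the `min_score` value of 90 are illustrative choices, not part of AfroLID.

```python
def top_prediction(predicted_langs, min_score=90.0):
    """Return (iso, score, is_confident) for the highest-scoring language.

    Assumes `predicted_langs` maps ISO codes to dicts with a numeric 'score'
    in percent. `min_score` is an arbitrary illustrative threshold.
    """
    if not predicted_langs:
        # Nothing was predicted, so there is no language to report.
        return None, None, False
    iso = max(predicted_langs, key=lambda code: predicted_langs[code]["score"])
    score = float(predicted_langs[iso]["score"])
    return iso, score, score >= min_score


# Dictionary shaped like the second example's output (values copied from it).
example = {
    "kma": {"name": "Konni", "script": "Latin", "score": 68.42},
    "kmy": {"name": "Koma", "script": "Latin", "score": 31.58},
}
iso, score, is_confident = top_prediction(example)
print(iso, score, is_confident)  # kma 68.42 False
```

With the real classifier, the same function could be applied to the result of `cl.classify(text)` so that low-confidence inputs are routed to manual review.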


## (5) Integrate with Pandas


In [32]:
!wget https://raw.githubusercontent.com/UBC-NLP/afrolid/main/examples/examples.tsv -O examples.tsv

--2022-12-05 22:50:39--  https://raw.githubusercontent.com/UBC-NLP/afrolid/main/examples/examples.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5464 (5.3K) [text/plain]
Saving to: ‘examples.tsv’


2022-12-05 22:50:39 (51.4 MB/s) - ‘examples.tsv’ saved [5464/5464]



In [41]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
df = pd.read_csv("examples.tsv", sep="\t")
df

| Unnamed: 0 | gold_label | content |
|---|---|---|
| 0 | dip | 6Acï looi aya në wuöt dït kɔ̈k yiic ku lɔ wuöt... |
| 1 | twi | Aprannaa ason no gyigyei akyi no , ɔbɔfo hoɔde... |
| 2 | tex | Akatek zin u̱zu̱ngtti̱ lo̱o̱c kar zin ki̱di̱ng... |
| 3 | aha | Dwɔɔnʋ 4 : 8 Mɩnla mfasʋ yɛ yenyia wɔ Nyamɩnlɩ... |
| 4 | ngo | Vamteta vakulu na vagogo va vandu vamkotili “ ... |
| 5 | akp | 1Nnɛgbe kama ne , lonya bɔi lalaa ɔwɛ̃ gɔ ɔto ... |
| 6 | bst | ዬሱ̈ሲ ዋ ን ዎይና ጃ̇ ላ̈ሚ̇ዴ 1ዓይዚ̇ ቃም̇ ፥ ጋሊላ ጋዳ̇ ዎ ቃ̈... |
| 7 | spp | 20Ɲyɛ sùpyire t'à pyi ti ɲyɛ a kwû yire yyefuy... |
| 8 | dsh | 56Hééllá lulle hatallá ˈdíéllá he giri ˈdeeny ... |
| 9 | sgw | ፍጥረት እንም ይእግዘር ክብር ያትየሽ 1በሰሜ ያነቦ እንም ይእግዘር ክብር... |


In [47]:
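def get_afrolid_prediction(text):
  # Return the top-1 prediction as (ISO code, score, name, script).
  predictions = cl.classify(text, max_outputs=1)
  for lang in predictions:
    return lang, predictions[lang]['score'], predictions[lang]['name'], predictions[lang]['script']
  # No prediction: keep the 4-tuple shape so zip(*...) can still unpack into four columns.
  return None, None, None, None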

In [49]:
df['predict_iso'], df['predict_score'], df['predict_name'], df['predict_script'] = zip(*df['content'].progress_apply(get_afrolid_prediction))
df

  0%|          | 0/21 [00:00<?, ?it/s]2022-12-05 23:19:54 | INFO | afroli | Input text: 6Acï looi aya në wuöt dït kɔ̈k yiic ku lɔ wuöt tɔ̈u tëmec piny de Manatha ku Eparaim ku Thimion , ku ɣään mec tɔ̈u të lɔ rut cï Naptali .
 10%|▉         | 2/21 [00:01<00:10,  1.83it/s]2022-12-05 23:19:55 | INFO | afroli | Input text: Aprannaa ason no gyigyei akyi no , ɔbɔfo hoɔdenfo no kasa bio : “ Na ɔbɔfo a mihuu no sɛ ogyina po ne asase so no maa ne nsa nifa so kyerɛɛ ɔsoro .
 14%|█▍        | 3/21 [00:01<00:10,  1.68it/s]2022-12-05 23:19:56 | INFO | afroli | Input text: Akatek zin u̱zu̱ngtti̱ lo̱o̱c kar zin ki̱di̱ngdi̱nga̱n Ye̱su̱ .
 19%|█▉        | 4/21 [00:02<00:10,  1.65it/s]2022-12-05 23:19:57 | INFO | afroli | Input text: Dwɔɔnʋ 4 : 8 Mɩnla mfasʋ yɛ yenyia wɔ Nyamɩnlɩ yɩ ɔlɔlɛ zʋ a ?
 24%|██▍       | 5/21 [00:03<00:09,  1.63it/s]2022-12-05 23:19:57 | INFO | afroli | Input text: Vamteta vakulu na vagogo va vandu vamkotili “ Wihenga mambu genago kwa uhotola woki ?
 29%|██▊       | 6/21 [00:03<

| Unnamed: 0 | gold_label | content | predict_iso | predict_score | predict_name | predict_script |
|---|---|---|---|---|---|---|
| 0 | dip | 6Acï looi aya në wuöt dït kɔ̈k yiic ku lɔ wuöt... | dip | 100.0 | Dinka, Northeastern | Latin |
| 1 | twi | Aprannaa ason no gyigyei akyi no , ɔbɔfo hoɔde... | twi | 99.97 | Twi | Latin |
| 2 | tex | Akatek zin u̱zu̱ngtti̱ lo̱o̱c kar zin ki̱di̱ng... | tex | 100.0 | Tennet | Latin |
| 3 | aha | Dwɔɔnʋ 4 : 8 Mɩnla mfasʋ yɛ yenyia wɔ Nyamɩnlɩ... | aha | 100.0 | Ahanta | Latin |
| 4 | ngo | Vamteta vakulu na vagogo va vandu vamkotili “ ... | ngo | 100.0 | Ngoni | Latin |
| 5 | akp | 1Nnɛgbe kama ne , lonya bɔi lalaa ɔwɛ̃ gɔ ɔto ... | akp | 100.0 | Siwu | Latin |
| 6 | bst | ዬሱ̈ሲ ዋ ን ዎይና ጃ̇ ላ̈ሚ̇ዴ 1ዓይዚ̇ ቃም̇ ፥ ጋሊላ ጋዳ̇ ዎ ቃ̈... | bst | 100.0 | Basketo | Ethiopic |
| 7 | spp | 20Ɲyɛ sùpyire t'à pyi ti ɲyɛ a kwû yire yyefuy... | spp | 100.0 | Sénoufo, Supyire | Latin |
| 8 | dsh | 56Hééllá lulle hatallá ˈdíéllá he giri ˈdeeny ... | dsh | 100.0 | Daasanach | Latin |
| 9 | sgw | ፍጥረት እንም ይእግዘር ክብር ያትየሽ 1በሰሜ ያነቦ እንም ይእግዘር ክብር... | sgw | 100.0 | Sebat Bet Gurage | Latin |
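
Once predictions sit next to the gold labels, checking them is a one-line comparison. The sketch below is one possible way to compute accuracy and list mismatches; `score_predictions` is a hypothetical helper, and the toy frame only copies three rows from the table above so the example runs without the model.

```python
import pandas as pd

# Toy rows copied from the first three lines of the table above.
toy = pd.DataFrame({
    "gold_label": ["dip", "twi", "tex"],
    "predict_iso": ["dip", "twi", "tex"],
    "predict_score": [100.0, 99.97, 100.0],
})


def score_predictions(frame):
    """Return (accuracy, mismatches) comparing predict_iso with gold_label."""
    correct = frame["predict_iso"] == frame["gold_label"]
    # The mean of a boolean Series is the fraction of True values.
    return float(correct.mean()), frame[~correct]


accuracy, mismatches = score_predictions(toy)
print(f"Accuracy: {accuracy:.2%} ({len(mismatches)} mismatches)")
```

Calling `score_predictions(df)` on the notebook's full frame would apply the same check to every row, and the returned `mismatches` frame shows which inputs to inspect.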
