<img src="https://github.com/UBC-NLP/afrolid/raw/main/images/afrolid_logo.jpg">

AfroLID is a neural language identification (LID) toolkit for 517 African languages and varieties. It draws on a multi-domain web dataset manually curated from 14 language families and five orthographic systems. AfroLID is described in this paper: 
[**AfroLID: A Neural Language Identification Tool for African Languages**](https://arxiv.org/abs/2210.11744).


### Check that the languages ID'd are spoken within East Africa

**ISO 639-1 codes for languages spoken in East Africa (EA)**\
English - 'en'\
Swahili - 'sw'\
Ganda - 'lg'\
Kirundi - 'rn'\
French - 'fr'\
Somali - 'so'\
Arabic - 'ar'\
Amharic - 'am'\
Tigrinya - 'ti'\
Kinyarwanda - 'rw'
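
The cells below use this list in two spellings: two-letter ISO 639-1 codes for CLD3 (`allowed_languages`) and three-letter ISO 639-3 codes for AfroLID and Franc (`whitelist`). A minimal sketch that keeps both in one table, pairing the codes in the order they appear in this notebook. The names `EA_LANGUAGES`, `to_iso3`, `iso1_codes` and `iso3_codes` are illustrative and not part of any of the libraries used here.

```python
# ISO 639-1 code -> (ISO 639-3 code used later in this notebook, language name)
EA_LANGUAGES = {
    'en': ('eng', 'English'),
    'sw': ('swh', 'Swahili'),
    'lg': ('lug', 'Ganda'),
    'rn': ('run', 'Kirundi'),
    'fr': ('fra', 'French'),
    'so': ('som', 'Somali'),
    'ar': ('arb', 'Arabic'),
    'am': ('amh', 'Amharic'),
    'ti': ('tir', 'Tigrinya'),
    'rw': ('kin', 'Kinyarwanda'),
}

def to_iso3(iso1_code):
    """Return the three-letter code for a two-letter code, or None if it is not listed."""
    entry = EA_LANGUAGES.get(iso1_code)
    return entry[0] if entry else None

# Both code lists derived from the single table, so they cannot drift apart
iso1_codes = list(EA_LANGUAGES)
iso3_codes = [iso3 for iso3, _name in EA_LANGUAGES.values()]
```

Deriving both lists from one table means a language only has to be added or removed in one place.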

# LID with AfroLID

In [1]:
!pip install -U git+https://github.com/UBC-NLP/afrolid.git --q

  Preparing metadata (setup.py) ... done

In [2]:
!pip install pycountry

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycountry
  Downloading pycountry-22.3.5.tar.gz (10.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: pycountry
  Building wheel for pycountry (pyproject.toml) ... done
  Created wheel for pycountry: filename=pycountry-22.3.5-py2.py3-none-any.whl size=10681845 sha256=29f6841122cd3c6bf3c6f26d9fc18718147b8a6eae81707b02e0bc30419a59c3
  Stored in directory: /root/.cache/pip/wheels/e2/aa/0f/c224e473b464387170b83ca7c66947b4a7e33e8d903a679748
Successfully built pycountry
Installing collected packages: pycountry
Successfully installed pycountry-22.3.5


In [3]:
! wget https://demos.dlnlp.ai/afrolid/afrolid_model.tar.gz
!tar -xf afrolid_model.tar.gz

--2023-03-03 06:54:04--  https://demos.dlnlp.ai/afrolid/afrolid_model.tar.gz
Resolving demos.dlnlp.ai (demos.dlnlp.ai)... 74.208.236.113, 2607:f1c0:100f:f000::264
Connecting to demos.dlnlp.ai (demos.dlnlp.ai)|74.208.236.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2277022086 (2.1G) [application/gzip]
Saving to: ‘afrolid_model.tar.gz’


2023-03-03 06:56:17 (16.5 MB/s) - ‘afrolid_model.tar.gz’ saved [2277022086/2277022086]



In [4]:
import os, sys
import logging
from afrolid.main import classifier

In [5]:
logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=os.environ.get("LOGLEVEL", "INFO").upper(),
    force=True, # Resets any previous configuration
)
logger = logging.getLogger("afrolid")


In [6]:
cl = classifier(logger, model_path="/content/afrolid_model")

2023-03-03 06:57:07 | INFO | afrolid | Initalizing AfroLID's task and model.


| [input] dictionary: 64001 types
| [label] dictionary: 528 types


In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
SA_dir = r'/content/drive/MyDrive/MIT/MIT 807 mini-dissertation/Data/SA-tweets.json'
KEN_dir = r'/content/drive/MyDrive/MIT/MIT 807 mini-dissertation/Data/KEN-tweets.json'
TZ_dir = r'/content/drive/MyDrive/MIT/MIT 807 mini-dissertation/Data/TZ-tweets.json' 

In [55]:
# country specific variables
dir = TZ_dir
country = 'TZ'
allowed_languages = ['en','sw','lg','rn','fr','so','ar','am','ti','rw']

In [9]:
import pandas as pd
import json

# open json files 
with open(dir, 'r') as f:
    dfs = {k: pd.read_json(v) for k, v in json.load(f).items()}

2023-03-03 06:57:54 | INFO | numexpr.utils | NumExpr defaulting to 2 threads.


In [10]:
from tqdm import tqdm
tqdm.pandas()

# Combine all dataframes into one and add a column for the key
df = pd.concat(dfs, keys=dfs.keys())
df = df.reset_index(level=1, drop=True)
df = df.reset_index()
df = df.rename(columns={'index': 'key'})
df.head(3)

Unnamed: 0,key,Datetime,Tweet Id,Text,Username,Location
0,daladala,2023-02-08 07:49:38,1623227542218985473,"Dar Es Salaam watu wana hasira sana, ukimgusa ...",fadhilikangusi,"Dar Es Salaam, Tanzania"
1,daladala,2023-01-31 13:00:19,1620406624908345344,Muonekano wa Kituo Kipya cha Daladala cha Kiny...,raphyrodrick,"Dar es Salaam, Tanzania"
2,daladala,2023-01-27 06:19:14,1618856135645356034,Lakini kumpisha mtu mzima kwenye seat ya dalad...,DejohB,"Dar es Salaam, Tanzania"


In [None]:
import nltk
nltk.download('popular')

In [None]:
from nltk.tokenize import ToktokTokenizer
import re

# clean data and remove punctuation characters
token = ToktokTokenizer()
punct = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~'
# compile the character class once instead of on every call
punct_regex = re.compile('[%s]' % re.escape(punct))

def clean_punct(text):
    # tokenize, then strip punctuation characters from each token
    words = token.tokenize(text)
    punctuation_filtered = [punct_regex.sub('', w) for w in words]
    return ' '.join(punctuation_filtered)

df['Text'] = df['Text'].apply(clean_punct)
df.head(3)
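
The log lines in the AfroLID section show cleaned tweets that still contain fragments such as `httpstcos8xjQXsnLx`, which appears to be what is left of a shortened link once its punctuation is stripped. A possible variation, sketched with the standard library only, removes links before removing punctuation. The function name `clean_text` and the URL pattern are assumptions for illustration, and unlike `clean_punct` this version does not use the NLTK tokenizer.

```python
import re

PUNCT = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~'  # same characters as in the cell above
URL_RE = re.compile(r'https?://\S+')           # assumed pattern: http(s) links up to the next whitespace
STRIP_PUNCT = str.maketrans('', '', PUNCT)     # translation table that deletes every character in PUNCT

def clean_text(text):
    """Drop links, delete punctuation, and collapse repeated whitespace."""
    no_urls = URL_RE.sub(' ', text)
    no_punct = no_urls.translate(STRIP_PUNCT)
    return ' '.join(no_punct.split())

example = clean_text("Habari, Dar! https://t.co/abc")
```

Whether removing links changes the predictions on this dataset has not been checked here.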


## Get LID using AfroLID

In [67]:
whitelist = ['eng', 'swh', 'lug', 'run', 'fra', 'som', 'arb', 'amh', 'tir', 'kin']

In [69]:
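def get_afrolid_prediction(text):
  # classify returns a dict keyed by language code; with max_outputs=1 it holds only the top prediction
  predictions = cl.classify(text, max_outputs=1)
  for lang, info in predictions.items():
    if lang in whitelist:
      return lang, info['score'], info['name'], info['script']
  # the top prediction is outside the whitelist, or nothing was returned
  return 'NA', 0, 'unrecognised', 'NA'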
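
With `max_outputs=1` only the top prediction is checked, so a tweet whose best guess falls outside the whitelist is always marked as unrecognised. One possible variation, sketched here on a plain dictionary shaped like the `predictions` value used above (language code mapped to a dict with `score`, `name` and `script`), picks the highest-scoring whitelisted language among several candidates. The helper name `best_whitelisted` and the sample data are hypothetical, and whether requesting more outputs from AfroLID improves results on these tweets is untested.

```python
def best_whitelisted(predictions, whitelist):
    """Return (code, score, name, script) for the best whitelisted prediction."""
    # Keep only whitelisted languages, paired with their score for ranking
    candidates = [(info['score'], lang)
                  for lang, info in predictions.items()
                  if lang in whitelist]
    if not candidates:
        # Same fallback tuple as get_afrolid_prediction
        return 'NA', 0, 'unrecognised', 'NA'
    _score, lang = max(candidates)
    info = predictions[lang]
    return lang, info['score'], info['name'], info['script']

# Made-up predictions, for illustration only
sample = {
    'zul': {'score': 55.0, 'name': 'Zulu', 'script': 'Latin'},
    'swh': {'score': 40.0, 'name': 'Swahili', 'script': 'Latin'},
}
result = best_whitelisted(sample, ['eng', 'swh'])
```

A lower-ranked whitelisted language comes with a lower score, so a score threshold may be needed before trusting it.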

In [71]:
df['predict_iso_afrolid'], df['predict_score_afrolid'], df['predict_name_afrolid'], df['predict_script_afrolid'] = zip(*df['Text'].progress_apply(get_afrolid_prediction))
df.head(3)

  0%|          | 0/998 [00:00<?, ?it/s]2023-03-03 08:34:39 | INFO | afrolid | Input text: Dar Es Salaam watu wana hasira sana  ukimgusa kidogo kwenye daladala anakupa bonge la tusi 
  0%|          | 2/998 [00:01<08:46,  1.89it/s]2023-03-03 08:34:40 | INFO | afrolid | Input text: Muonekano wa Kituo Kipya cha Daladala cha Kinyerezi pamoja na Barabara ya Lami  KM 71  Manispaa ya Ilala Kituo hiki kina uwezo wa kupokea Daladala 90 kwa wakati mmoja Ujenzi umetekelezwa na ortamisemitz kupitia Mradi wa DMDP Jijini Dar es Salaam httpstcos8xjQXsnLx
  0%|          | 3/998 [00:02<12:15,  1.35it/s]2023-03-03 08:34:41 | INFO | afrolid | Input text: Lakini kumpisha mtu mzima kwenye seat ya daladala sio part ya Maadili Mazuri kwa Upande wa Dar es salaam
  0%|          | 4/998 [00:03<14:03,  1.18it/s]2023-03-03 08:34:42 | INFO | afrolid | Input text: Kuna huyu mtu ana hadithia hapa eti anamwaka mzima hajawahi kukaa kwenye daladala Dar es salaam  always huwa anasimama tu🙌😂
  1%|          | 5/998 [00:03<

Unnamed: 0,key,Datetime,Tweet Id,Text,Username,Location,predict_iso_afrolid,predict_score_afrolid,predict_name_afrolid,predict_script_afrolid,predict_iso_cld3,predict_score_cld3,predict_proportion_cld3,predict_is_reliable_cld3,predict_iso_franc,predict_score_franc,predict_name_cld3,predict_name_franc
0,daladala,2023-02-08 07:49:38,1623227542218985473,Dar Es Salaam watu wana hasira sana ukimgusa ...,fadhilikangusi,"Dar Es Salaam, Tanzania",swh,99.92,Swahili,Latin,sw,0.999916,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
1,daladala,2023-01-31 13:00:19,1620406624908345344,Muonekano wa Kituo Kipya cha Daladala cha Kiny...,raphyrodrick,"Dar es Salaam, Tanzania",swh,100.0,Swahili,Latin,sw,0.999891,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
2,daladala,2023-01-27 06:19:14,1618856135645356034,Lakini kumpisha mtu mzima kwenye seat ya dalad...,DejohB,"Dar es Salaam, Tanzania",swh,99.97,Swahili,Latin,sw,0.999998,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili


In [73]:
df['predict_name_afrolid'].unique()

array(['Swahili', 'Somali', 'unrecognised', 'Kinyarwanda', 'Luganda'],
      dtype=object)

# LID using CLD3

In [16]:
!pip install pycld3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycld3
  Downloading pycld3-0.22-cp38-cp38-manylinux1_x86_64.whl (13.6 MB)
Installing collected packages: pycld3
Successfully installed pycld3-0.22


In [75]:
import cld3

allowed_languages = ['en','sw','lg','rn','fr','so','ar','am','ti','rw']

def get_cld3_prediction(text):
  predictions = cld3.get_language(text)
  # get_language returns None when it cannot make a prediction (e.g. empty text)
  if predictions is not None and predictions.language in allowed_languages:
    return predictions.language, predictions.probability, predictions.proportion, predictions.is_reliable
  return 'unrecognised', 0, 1.0, False


In [76]:
df['predict_iso_cld3'], df['predict_score_cld3'], df['predict_proportion_cld3'], df['predict_is_reliable_cld3'] = zip(*df['Text'].progress_apply(get_cld3_prediction))
df.head(3)

100%|██████████| 998/998 [00:00<00:00, 2204.14it/s]


Unnamed: 0,key,Datetime,Tweet Id,Text,Username,Location,predict_iso_afrolid,predict_score_afrolid,predict_name_afrolid,predict_script_afrolid,predict_iso_cld3,predict_score_cld3,predict_proportion_cld3,predict_is_reliable_cld3,predict_iso_franc,predict_score_franc,predict_name_cld3,predict_name_franc
0,daladala,2023-02-08 07:49:38,1623227542218985473,Dar Es Salaam watu wana hasira sana ukimgusa ...,fadhilikangusi,"Dar Es Salaam, Tanzania",swh,99.92,Swahili,Latin,sw,0.999916,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
1,daladala,2023-01-31 13:00:19,1620406624908345344,Muonekano wa Kituo Kipya cha Daladala cha Kiny...,raphyrodrick,"Dar es Salaam, Tanzania",swh,100.0,Swahili,Latin,sw,0.999891,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
2,daladala,2023-01-27 06:19:14,1618856135645356034,Lakini kumpisha mtu mzima kwenye seat ya dalad...,DejohB,"Dar es Salaam, Tanzania",swh,99.97,Swahili,Latin,sw,0.999998,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili


In [77]:
df['predict_iso_cld3'].unique()

array(['sw', 'unrecognised', 'en', 'so', 'fr'], dtype=object)
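
In the outputs above, the AfroLID scores look like percentages (for example `99.92`), while the CLD3 probabilities lie between 0 and 1 (for example `0.999916`). A small sketch for putting both on the same 0 to 1 scale before comparing them. The 0 to 100 range for AfroLID is inferred from the printed values only, the function names are hypothetical, and the `0.9` threshold is an arbitrary placeholder.

```python
def afrolid_score_to_unit(score):
    """Rescale an AfroLID score (assumed to be 0-100) to the 0-1 range."""
    return score / 100.0

def is_confident(unit_score, threshold=0.9):
    """True when a 0-1 score reaches the (arbitrary) threshold."""
    return unit_score >= threshold

afrolid_unit = afrolid_score_to_unit(99.92)  # value taken from the first row above
cld3_unit = 0.999916                          # CLD3 probability from the same row
both_confident = is_confident(afrolid_unit) and is_confident(cld3_unit)
```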

# LID using Franc

In [20]:
!pip install pyfranc

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyfranc
  Downloading pyfranc-0.1.1-py3-none-any.whl (262 kB)
Installing collected packages: pyfranc
Successfully installed pyfranc-0.1.1


In [21]:
from pyfranc import franc

In [78]:
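def get_franc_prediction(text):
  predictions = franc.lang_detect(text, whitelist = whitelist)
  # take the first (language, score) pair in the returned list
  if predictions:
    return predictions[0][0], predictions[0][1]
  # fallback so the two-column unpacking below never receives None
  return 'NA', 0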

In [79]:
df['predict_iso_franc'], df['predict_score_franc']= zip(*df['Text'].progress_apply(get_franc_prediction))
df.head(3)

100%|██████████| 998/998 [00:02<00:00, 477.07it/s]


Unnamed: 0,key,Datetime,Tweet Id,Text,Username,Location,predict_iso_afrolid,predict_score_afrolid,predict_name_afrolid,predict_script_afrolid,predict_iso_cld3,predict_score_cld3,predict_proportion_cld3,predict_is_reliable_cld3,predict_iso_franc,predict_score_franc,predict_name_cld3,predict_name_franc
0,daladala,2023-02-08 07:49:38,1623227542218985473,Dar Es Salaam watu wana hasira sana ukimgusa ...,fadhilikangusi,"Dar Es Salaam, Tanzania",swh,99.92,Swahili,Latin,sw,0.999916,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
1,daladala,2023-01-31 13:00:19,1620406624908345344,Muonekano wa Kituo Kipya cha Daladala cha Kiny...,raphyrodrick,"Dar es Salaam, Tanzania",swh,100.0,Swahili,Latin,sw,0.999891,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
2,daladala,2023-01-27 06:19:14,1618856135645356034,Lakini kumpisha mtu mzima kwenye seat ya dalad...,DejohB,"Dar es Salaam, Tanzania",swh,99.97,Swahili,Latin,sw,0.999998,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili


In [80]:
df['predict_iso_franc'].unique()

array(['swh', 'eng', 'som', 'fra', 'lug', 'kin', 'run'], dtype=object)

# Sanity check

In [81]:
df.head(3)

Unnamed: 0,key,Datetime,Tweet Id,Text,Username,Location,predict_iso_afrolid,predict_score_afrolid,predict_name_afrolid,predict_script_afrolid,predict_iso_cld3,predict_score_cld3,predict_proportion_cld3,predict_is_reliable_cld3,predict_iso_franc,predict_score_franc,predict_name_cld3,predict_name_franc
0,daladala,2023-02-08 07:49:38,1623227542218985473,Dar Es Salaam watu wana hasira sana ukimgusa ...,fadhilikangusi,"Dar Es Salaam, Tanzania",swh,99.92,Swahili,Latin,sw,0.999916,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
1,daladala,2023-01-31 13:00:19,1620406624908345344,Muonekano wa Kituo Kipya cha Daladala cha Kiny...,raphyrodrick,"Dar es Salaam, Tanzania",swh,100.0,Swahili,Latin,sw,0.999891,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
2,daladala,2023-01-27 06:19:14,1618856135645356034,Lakini kumpisha mtu mzima kwenye seat ya dalad...,DejohB,"Dar es Salaam, Tanzania",swh,99.97,Swahili,Latin,sw,0.999998,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili


In [83]:
# decode the ISO codes to language names for cld3 and franc

import pycountry

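def get_language_name_cld3(code):
  # no match for the two-letter code (e.g. 'unrecognised') maps to 'unrecognised'
  try:
    lang = pycountry.languages.get(alpha_2=code)
  except LookupError:
    return 'unrecognised'
  return lang.name if lang is not None else 'unrecognised'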

df['predict_name_cld3'] = df['predict_iso_cld3'].apply(get_language_name_cld3)
df.head(3)


Unnamed: 0,key,Datetime,Tweet Id,Text,Username,Location,predict_iso_afrolid,predict_score_afrolid,predict_name_afrolid,predict_script_afrolid,predict_iso_cld3,predict_score_cld3,predict_proportion_cld3,predict_is_reliable_cld3,predict_iso_franc,predict_score_franc,predict_name_cld3,predict_name_franc
0,daladala,2023-02-08 07:49:38,1623227542218985473,Dar Es Salaam watu wana hasira sana ukimgusa ...,fadhilikangusi,"Dar Es Salaam, Tanzania",swh,99.92,Swahili,Latin,sw,0.999916,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
1,daladala,2023-01-31 13:00:19,1620406624908345344,Muonekano wa Kituo Kipya cha Daladala cha Kiny...,raphyrodrick,"Dar es Salaam, Tanzania",swh,100.0,Swahili,Latin,sw,0.999891,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
2,daladala,2023-01-27 06:19:14,1618856135645356034,Lakini kumpisha mtu mzima kwenye seat ya dalad...,DejohB,"Dar es Salaam, Tanzania",swh,99.97,Swahili,Latin,sw,0.999998,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili


In [85]:
iso_langs_EA = {
    'eng': 'English',
    'swh': 'Swahili',
    'lug': 'Ganda',
    'run': 'Kirundi',
    'fra': 'French',
    'som': 'Somali',
    'arb': 'Arabic',
    'amh': 'Amharic',
    'tir': 'Tigrinya',
    'kin': 'Kinyarwanda'
}

def map_language(iso_code):
  return iso_langs_EA.get(iso_code, 'unrecognised')

df['predict_name_franc'] = df['predict_iso_franc'].apply(map_language)
df.head(3)


Unnamed: 0,key,Datetime,Tweet Id,Text,Username,Location,predict_iso_afrolid,predict_score_afrolid,predict_name_afrolid,predict_script_afrolid,predict_iso_cld3,predict_score_cld3,predict_proportion_cld3,predict_is_reliable_cld3,predict_iso_franc,predict_score_franc,predict_name_cld3,predict_name_franc
0,daladala,2023-02-08 07:49:38,1623227542218985473,Dar Es Salaam watu wana hasira sana ukimgusa ...,fadhilikangusi,"Dar Es Salaam, Tanzania",swh,99.92,Swahili,Latin,sw,0.999916,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
1,daladala,2023-01-31 13:00:19,1620406624908345344,Muonekano wa Kituo Kipya cha Daladala cha Kiny...,raphyrodrick,"Dar es Salaam, Tanzania",swh,100.0,Swahili,Latin,sw,0.999891,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili
2,daladala,2023-01-27 06:19:14,1618856135645356034,Lakini kumpisha mtu mzima kwenye seat ya dalad...,DejohB,"Dar es Salaam, Tanzania",swh,99.97,Swahili,Latin,sw,0.999998,1.0,True,swh,1.0,Swahili (macrolanguage),Swahili


In [86]:
# group the languages in a new df considering key, tweet ID, datetime, text, location and language names for the different LID tools

df_new = df[['key', 'Datetime', 'Tweet Id', 'Text', 'Location', 'predict_name_afrolid', 'predict_name_cld3', 'predict_name_franc']] 
df_new



Unnamed: 0,key,Datetime,Tweet Id,Text,Location,predict_name_afrolid,predict_name_cld3,predict_name_franc
0,daladala,2023-02-08 07:49:38,1623227542218985473,Dar Es Salaam watu wana hasira sana ukimgusa ...,"Dar Es Salaam, Tanzania",Swahili,Swahili (macrolanguage),Swahili
1,daladala,2023-01-31 13:00:19,1620406624908345344,Muonekano wa Kituo Kipya cha Daladala cha Kiny...,"Dar es Salaam, Tanzania",Swahili,Swahili (macrolanguage),Swahili
2,daladala,2023-01-27 06:19:14,1618856135645356034,Lakini kumpisha mtu mzima kwenye seat ya dalad...,"Dar es Salaam, Tanzania",Swahili,Swahili (macrolanguage),Swahili
3,daladala,2022-09-24 18:21:32,1573739424382423043,Kuna huyu mtu ana hadithia hapa eti anamwaka m...,"Dar es Salaam, Tanzania",Swahili,Swahili (macrolanguage),Swahili
4,daladala,2022-09-19 09:35:17,1571795051818192898,Hongera sana SuluhuSamia kwa Kupata Siti nzuri...,Dar es salaam,Swahili,Swahili (macrolanguage),Swahili
...,...,...,...,...,...,...,...,...
993,bajaj,2019-03-01 04:42:02,1101341804618690560,Tanzania Trade Fair 2019 participation of Baj...,,Swahili,unrecognised,French
994,bajaj,2019-03-01 04:37:31,1101340668473929728,Bajaj Electricals at the Tanzania trade fair 2...,,Swahili,English,French
995,bajaj,2018-12-21 07:38:29,1076019057890074624,In Dar Es Salaam The Business and Busiest Cit...,"Dar es Salaam, Tanzania",Swahili,English,English
996,bajaj,2018-07-18 14:49:14,1019594953532600320,Safari bora 🚗 huanza unapoendeshwa kwa Bajaj u...,"Dar es Salaam, Tanzania",Swahili,Swahili (macrolanguage),Swahili


In [87]:
df_new['predict_name_cld3'].unique()

array(['Swahili (macrolanguage)', 'unrecognised', 'English', 'Somali',
       'French'], dtype=object)
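
The three tools label the same language differently: pycountry reports `Swahili (macrolanguage)` for CLD3's `sw`, and AfroLID reports `Luganda` where the Franc mapping uses `Ganda`. Before comparing the three name columns, the labels need to be aligned. The following is a minimal sketch of that step plus a simple majority vote. The alias table, the function names and the `'no agreement'` label are assumptions for illustration, not part of the notebook's pipeline.

```python
from collections import Counter

# Labels seen in this notebook that refer to the same language
NAME_ALIASES = {
    'Swahili (macrolanguage)': 'Swahili',  # pycountry name for 'sw'
    'Luganda': 'Ganda',                    # AfroLID name vs. the Franc mapping
}

def normalise_name(name):
    """Map tool-specific labels onto one shared label."""
    return NAME_ALIASES.get(name, name)

def majority_language(afrolid_name, cld3_name, franc_name):
    """Return the label at least two tools agree on, ignoring 'unrecognised'."""
    votes = Counter(normalise_name(n) for n in (afrolid_name, cld3_name, franc_name))
    votes.pop('unrecognised', None)
    if not votes:
        return 'unrecognised'
    name, count = votes.most_common(1)[0]
    return name if count >= 2 else 'no agreement'

# Label combinations taken from rows 0, 995 and 993 of the table above
row_0 = majority_language('Swahili', 'Swahili (macrolanguage)', 'Swahili')
row_995 = majority_language('Swahili', 'English', 'English')
row_993 = majority_language('Swahili', 'unrecognised', 'French')
```

Applied row by row to `df_new`, a function like this would give a single consensus column. Rows without agreement are probably the ones worth inspecting by hand.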

In [88]:
df_new.to_csv('/content/drive/MyDrive/MIT/MIT 807 mini-dissertation/Data/{}_tweets_with_LID.csv'.format(country), index=False)