<h2>Language Validation</h2>
<p>This notebook validates our language detection approach for tweets and assesses the accuracy of the model used.</p>
<p>We use an external dataset of texts with correctly labeled languages. The dataset is available on Kaggle and can be accessed through <a href="https://www.kaggle.com/datasets/basilb2s/language-detection?resource=download">this link</a>.</p>

In [1]:
# Importing modules
from langdetect import detect
import pandas as pd
from ISO6391 import languages

The module `ISO6391.py` provides a tuple of language names and their codes in the ISO 639-1 standard. This tuple is available on GitHub and can be accessed via <a href="https://gist.github.com/alexanderjulo/4073388">this link</a>.
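<p>For orientation, the sketch below shows the shape this notebook relies on: each entry is a <code>(code, name)</code> pair, so index 0 is the ISO 639-1 code and index 1 is the language name. The name <code>languages_excerpt</code> and its three entries are an illustrative excerpt made up for this example, not a copy of the module.</p>

```python
# Illustrative excerpt only: the real tuple in ISO6391.py is much longer.
languages_excerpt = (
    ("en", "English"),
    ("nl", "Dutch; Flemish"),
    ("sv", "Swedish"),
)

# Index 0 holds the ISO 639-1 code, index 1 holds the language name.
codes = [entry[0] for entry in languages_excerpt]
names = [entry[1] for entry in languages_excerpt]
print(codes)
print(names)
```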

In [2]:
# Loading the test data
df=pd.read_csv('data/lang_validation.csv')

In [3]:
# Checking if the data has been loaded correctly
df.head()

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English


<h4>Checking if all of the language names match</h4>

In [4]:
languagesInDataframe=df['Language'].unique()

In [5]:
languageNames=[i[1] for i in languages]

In [6]:
# There are a couple of adjustments needed in the languageNames list: 
def adjustForLanguages(lang):
    if lang=="Spanish; Castilian":
        return "Spanish"
    if lang=="Dutch; Flemish":
        return "Dutch"
    if lang=="Greek, Modern (1453-)":
        return "Greek"
    if lang=="Swedish":
        # Sorry Alicia. The package was the one who misspelled it :(
        return "Sweedish"
    if lang=="Portuguese":
        return "Portugeese"
    return lang
languageNames=list(map(adjustForLanguages,languageNames))
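<p>The same adjustments could also be written as a lookup table, which keeps all renames in one place. This is only an alternative sketch; <code>RENAMES</code>, <code>adjust_name</code> and <code>sample</code> are hypothetical names that do not appear elsewhere in the notebook.</p>

```python
# ISO names on the left, dataset spellings on the right.
# The misspellings are intentional: they match the labels in the dataset.
RENAMES = {
    "Spanish; Castilian": "Spanish",
    "Dutch; Flemish": "Dutch",
    "Greek, Modern (1453-)": "Greek",
    "Swedish": "Sweedish",
    "Portuguese": "Portugeese",
}

def adjust_name(name):
    """Return the dataset spelling for an ISO name, or the name unchanged."""
    return RENAMES.get(name, name)

sample = ["English", "Dutch; Flemish", "Swedish"]
adjusted = [adjust_name(n) for n in sample]
print(adjusted)
```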

In [7]:
# Expecting nothing to be printed here
for lang in languagesInDataframe:
    if lang not in languageNames:
        print(lang)
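<p>The same check can be expressed as a set difference, which collects every unmatched label at once. The two lists below are small hypothetical stand-ins for <code>languagesInDataframe</code> and <code>languageNames</code>, chosen so that one label fails to match.</p>

```python
# Hypothetical stand-ins for the dataset labels and the known language names.
dataset_labels = ["English", "Dutch", "Sweedish"]
known_names = ["English", "Dutch", "Swedish"]

# Labels present in the dataset but missing from the known names.
unmatched = sorted(set(dataset_labels) - set(known_names))
print(unmatched)
```

<p>An empty list means every dataset label has a matching language name.</p>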

In [8]:
langsDict={}
for i in range(len(languages)):
    langsDict[languageNames[i]]=languages[i][0]

<h3>Testing the accuracy</h3>

In [35]:
correctDetections=0
numberOfSamples=len(df.index)

In [36]:
mistakes={}
for i in list(langsDict.values()):
    mistakes[i]=[]

In [37]:
for index,row in df.iterrows():
    try:
        detection=detect(row["Text"])
        actualLanguage=langsDict[row["Language"]]
        if detection==actualLanguage:
            correctDetections+=1
        else:
            mistakes[actualLanguage].append(detection)
    except Exception:  # e.g. detect() fails on text with no detectable features
        print(row)

Text        (2008).
Language    Spanish
Name: 5087, dtype: object
Text            3].
Language    Russian
Name: 6035, dtype: object
Text           4]).
Language    Russian
Name: 6056, dtype: object
Text             .
Language    Arabic
Name: 9109, dtype: object
Text             .
Language    Arabic
Name: 9110, dtype: object


<p>As the output of the previous block shows, detection failed for five data points in our test set, whose text contains only digits and punctuation. Compared to the set's size of 10337, this is negligible.</p> <p>The accuracy of the model is as follows:</p>
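<p>One way to spot such rows before detection is to check whether a text contains any letters at all. The sketch below assumes that the failures come from texts without alphabetic characters, which matches the rows printed above but has not been verified against the library. <code>has_letters</code> and <code>samples</code> are hypothetical names.</p>

```python
def has_letters(text):
    """Return True if the text contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)

# The first four strings mirror the failing rows; the last one is ordinary text.
samples = ["(2008).", "3].", "4]).", ".", "Nature is everywhere."]
detectable = [s for s in samples if has_letters(s)]
print(detectable)
```

<p>Because <code>str.isalpha</code> is Unicode-aware, the check also accepts text in non-Latin scripts such as Arabic or Cyrillic.</p>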

In [38]:
print("Accuracy: ",(correctDetections/numberOfSamples))
print("Number of correct detections: ",correctDetections)

Accuracy:  0.9551127019444713


<h4>Model accuracy</h4>
<p>When the code was run on June 20th, the model correctly detected the language of 9873 of the 10337 samples.</p>
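<p>The reported accuracy counts the five rows printed by the exception handler as incorrect, because they stay in the denominator. As a rough sketch of how much this matters, the arithmetic below compares the accuracy over all rows with the accuracy over only the rows where detection ran. The variable names are hypothetical.</p>

```python
correct = 9873   # correct detections reported for this run
total = 10337    # rows in the test set
failed = 5       # rows printed by the exception handler

accuracy_all = correct / total
accuracy_detectable = correct / (total - failed)
print(f"Accuracy over all rows: {accuracy_all:.4f}")
print(f"Accuracy over detectable rows: {accuracy_detectable:.4f}")
```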

<h3>Likely Detection Errors</h3>

<p>The list below shows, for each language, which other languages it was mistaken for:</p>

In [42]:
for lang in list(mistakes.keys()):
    if len(mistakes[lang])>0 :
        print(lang)
        print(len(mistakes[lang]))
        print(mistakes[lang])
        print('-------------------------')

ar
1
['fa']
-------------------------
da
66
['en', 'de', 'no', 'sq', 'no', 'no', 'no', 'no', 'no', 'no', 'sv', 'sv', 'no', 'no', 'no', 'no', 'no', 'no', 'so', 'no', 'nl', 'no', 'fr', 'hr', 'so', 'sv', 'no', 'no', 'no', 'no', 'sv', 'id', 'no', 'no', 'no', 'af', 'no', 'no', 'no', 'nl', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'af', 'et', 'hr', 'no', 'et', 'af', 'no', 'no', 'no']
-------------------------
de
22
['en', 'et', 'et', 'ca', 'fr', 'et', 'hu', 'af', 'af', 'no', 'af', 'en', 'af', 'da', 'pl', 'nl', 'et', 'sv', 'af', 'en', 'af', 'fr']
-------------------------
nl
74
['af', 'en', 'de', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'no', 'hr', 'af', 'af', 'fr', 'af', 'af', 'de', 'af', 'hr', 'hr', 'cs', 'af', 'af', 'ro', 'pl', 'af', 'af', 'sl', 'cy', 'af', 'en', 'af', 'af', 'es', 'af', 'af', 'da', 'af', 'af', 'da', 'af', 'so', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'sl', 'af', 'af', 'af', 'af', '

<p>Dutch, Danish, Spanish and English (in that order) are the languages most likely to be detected as another language.</p>
<p>Dutch is usually detected as Afrikaans, Danish is usually detected as Norwegian, and Spanish is often detected as Portuguese.</p>
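<p>Reading these patterns off the raw lists is tedious, so a frequency count per language can help. The sketch below uses <code>collections.Counter</code> on a small hypothetical sample shaped like the <code>mistakes</code> dictionary. The values are made up for illustration and are not the notebook's results.</p>

```python
from collections import Counter

# Hypothetical sample: true language code -> list of wrongly detected codes.
mistakes_sample = {
    "da": ["no", "no", "sv", "no"],
    "nl": ["af", "af", "en"],
    "ar": ["fa"],
}

# For each language, keep the most frequent wrong detection and its count.
summary = {}
for lang, wrong in mistakes_sample.items():
    if wrong:
        summary[lang] = Counter(wrong).most_common(1)[0]

# Print languages with the most errors first.
for lang in sorted(summary, key=lambda k: len(mistakes_sample[k]), reverse=True):
    confused_with, count = summary[lang]
    errors = len(mistakes_sample[lang])
    print(f"{lang}: {errors} errors, most often detected as {confused_with} ({count})")
```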