# **Introduction**
Before writing any code, there are several factors to consider.

* The number of supported languages and the nature of the languages themselves
Distinguishing between three clearly different languages such as English, Persian, and Chinese is easier than distinguishing between three closely related languages A, B, and C.

* Problem-solving level

    *   Document level (one language label for the whole text)
    *   Span level (one language label per word)

* Problem-solving methods:

    *   Baseline method
    *   RNN-based method
    *   Transformer-based method

* Length of input text
The length of the input text is another important factor, because it affects both the problem-solving level and the choice of method.

* Hardware and speed limitations
Training more advanced models usually requires more capable hardware.
Larger models also tend to be slower at prediction time.

* Expected accuracy

* Available data

* Unknown or virtual language


# Data Gathering
**List of all supported Languages in their three-letter codes ([ISO 639-3](https://www.ethnologue.com/codes/))**


1.   ara: Arabic
2.   nld: Dutch
3.   eng: English
4.   ita: Italian
5.   fra: French
6.   deu: German
7.   pes: Persian
8.   rus: Russian
9.   spa: Spanish
10.  tur: Turkish


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# !wget -P ./data/ https://downloads.tatoeba.org/exports/per_language/ara/ara_sentences.tsv.bz2

In [2]:
%%shell
# Create the directory where the files will be saved (no-op if it already exists)
mkdir -p /content/drive/MyDrive/lang_detect/data
lang_list=("ara" "nld" "eng" "fra" "deu" "ita" "pes" "rus" "spa" "tur")

for lang in "${lang_list[@]}"; do
  wget -P /content/drive/MyDrive/lang_detect/data "https://downloads.tatoeba.org/exports/per_language/${lang}/${lang}_sentences.tsv.bz2"
done

--2024-07-05 11:43:33--  https://downloads.tatoeba.org/exports/per_language/ara/ara_sentences.tsv.bz2
Resolving downloads.tatoeba.org (downloads.tatoeba.org)... 94.130.77.194
Connecting to downloads.tatoeba.org (downloads.tatoeba.org)|94.130.77.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 716722 (700K) [application/octet-stream]
Saving to: ‘/content/drive/MyDrive/lang_detect/data/ara_sentences.tsv.bz2.17’


2024-07-05 11:43:35 (784 KB/s) - ‘/content/drive/MyDrive/lang_detect/data/ara_sentences.tsv.bz2.17’ saved [716722/716722]

--2024-07-05 11:43:35--  https://downloads.tatoeba.org/exports/per_language/nld/nld_sentences.tsv.bz2
Resolving downloads.tatoeba.org (downloads.tatoeba.org)... 94.130.77.194
Connecting to downloads.tatoeba.org (downloads.tatoeba.org)|94.130.77.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2279912 (2.2M) [application/octet-stream]
Saving to: ‘/content/drive/MyDrive/lang_detect/data/nld_sentences.ts
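The `.17` suffix in the log above shows that `wget` found an earlier copy of the file and saved a numbered duplicate instead of overwriting it. The sketch below shows one way to skip files that are already present, written in Python. The helper names `tatoeba_url` and `download_if_missing` are hypothetical; the URL pattern is the one used in the shell cell.

```python
import os
import urllib.request

BASE_URL = "https://downloads.tatoeba.org/exports/per_language"


def tatoeba_url(lang: str) -> str:
    """Build the per-language export URL (same pattern as the shell cell)."""
    return f"{BASE_URL}/{lang}/{lang}_sentences.tsv.bz2"


def download_if_missing(lang: str, data_dir: str) -> str:
    """Download one language file unless it already exists; return the local path."""
    os.makedirs(data_dir, exist_ok=True)
    target = os.path.join(data_dir, f"{lang}_sentences.tsv.bz2")
    if not os.path.exists(target):
        urllib.request.urlretrieve(tatoeba_url(lang), target)
    return target


# Only the URL is built here; no network request is made.
print(tatoeba_url("ara"))
```

Because the later `glob` pattern matches only names ending in `.tsv.bz2`, the numbered duplicates are ignored by the rest of the notebook, but they still take up space on the drive.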



In [4]:
import re
import os

from glob import glob
import pandas as pd
import numpy as np
import csv

In [5]:
root_path = "/content/drive/MyDrive/lang_detect"

In [6]:
lang_files = glob(os.path.join(root_path, 'data', '*.tsv.bz2'))
print(f'Number of language files: {len(lang_files)}')
lang_files

Number of language files: 10


['/content/drive/MyDrive/lang_detect/data/nld_sentences.tsv.bz2',
 '/content/drive/MyDrive/lang_detect/data/ita_sentences.tsv.bz2',
 '/content/drive/MyDrive/lang_detect/data/tur_sentences.tsv.bz2',
 '/content/drive/MyDrive/lang_detect/data/fra_sentences.tsv.bz2',
 '/content/drive/MyDrive/lang_detect/data/ara_sentences.tsv.bz2',
 '/content/drive/MyDrive/lang_detect/data/spa_sentences.tsv.bz2',
 '/content/drive/MyDrive/lang_detect/data/pes_sentences.tsv.bz2',
 '/content/drive/MyDrive/lang_detect/data/eng_sentences.tsv.bz2',
 '/content/drive/MyDrive/lang_detect/data/deu_sentences.tsv.bz2',
 '/content/drive/MyDrive/lang_detect/data/rus_sentences.tsv.bz2']

## 1. Creating Document level dataset

In [7]:
df_Document_level = pd.DataFrame(columns=['lang', 'text'])

TRUNCATE_SIZE = 5000

for lang in lang_files:
    # Read the language file into a DataFrame
    df = pd.read_csv(lang, sep='\t', names=["sent_id", "lang", "text"], quoting=csv.QUOTE_NONE)

    # Drop the 'sent_id' column
    del df['sent_id']

    # Apply text cleaning (removing punctuations)
    df['text'] = df['text'].apply(lambda txt: re.sub(
        r'([！？｡。＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠'
        r'｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏.,!?()])', r' ', txt))

    # Keep sentences of 35 to 50 characters, shuffle them, and keep at most TRUNCATE_SIZE
    # **************
    df = df.loc[(df['text'].str.len() <= 50) & (df['text'].str.len() >= 35)]
    df = df.sample(frac=1, random_state=1)[:TRUNCATE_SIZE]
    df_Document_level = pd.concat([df_Document_level, df])
    # **************

    print(f'{os.path.basename(lang)[:3]}: {len(df)}')

# Reset index of the final DataFrame
df_Document_level.reset_index(drop=True, inplace=True)
# Display the resulting document-level DataFrame
df_Document_level.head()

nld: 5000
ita: 5000
tur: 5000
fra: 5000
ara: 5000
spa: 5000
pes: 5000
eng: 5000
deu: 5000
rus: 5000


,lang,text
0,nld,Ondanks alles begon Tom te ontspannen
1,nld,Schaf de intellectuele eigendom af
2,nld,Tom en ik hebben elkaar drie jaar niet gesproken
3,nld,Bacteriën zijn maar kleine onschadelijke cellen
4,nld,Ligt Tom nog steeds in het ziekenhuis
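The cleaning and length-filtering steps can be tried in isolation on a few made-up sentences. This is a minimal sketch, not the notebook's exact code: `clean_and_filter` is a hypothetical helper and `PUNCT_PATTERN` is a deliberately reduced subset of the punctuation class used above.

```python
import re

import pandas as pd

# Reduced, illustrative subset of the punctuation class (an assumption)
PUNCT_PATTERN = re.compile(r"[.,!?()…“”]")


def clean_and_filter(df: pd.DataFrame, min_len: int = 35, max_len: int = 50) -> pd.DataFrame:
    """Replace punctuation with spaces, then keep rows whose text length is in range."""
    out = df.copy()
    out["text"] = out["text"].str.replace(PUNCT_PATTERN, " ", regex=True)
    lengths = out["text"].str.len()
    # between() is inclusive on both ends, matching the <= and >= checks above
    return out.loc[lengths.between(min_len, max_len)].reset_index(drop=True)


toy = pd.DataFrame({
    "lang": ["eng", "eng", "eng"],
    "text": [
        "Too short.",
        "This sentence has a comfortable length, I think.",
        "This sentence is far too long to survive the length filter applied in this step.",
    ],
})
filtered = clean_and_filter(toy)
print(filtered)
```

Punctuation is replaced by a space rather than deleted, so cleaning does not change a sentence's length; only the middle sentence passes the filter.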


In [8]:
tmp_df = df_Document_level.copy()
tmp_df['text'] = tmp_df['text'].apply(lambda txt: txt.split())
tmp_df = tmp_df.explode('text', ignore_index=True)
tmp_df.head()

,lang,text
0,nld,Ondanks
1,nld,alles
2,nld,begon
3,nld,Tom
4,nld,te


In [9]:
tmp_df.groupby(['lang']).nunique()

lang,text
ara,11879
deu,6450
eng,4838
fra,6635
ita,5598
nld,5770
pes,7974
rus,8748
spa,7493
tur,9309


## 2. Creating Span level dataset

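At span level (used for the Transformer-based method), each sample is a phrase split into words, paired with a list that gives the language of each word: `[word1, word2, ..., wordN]` and `[word1_lang, word2_lang, ..., wordN_lang]`.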
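A minimal sketch of this word-to-label alignment, using a hypothetical helper `label_tokens` and two made-up sentences:

```python
def label_tokens(phrases):
    """Turn (lang_code, sentence) pairs into parallel token and label lists."""
    tokens, labels = [], []
    for lang, sentence in phrases:
        words = sentence.split()
        tokens.extend(words)
        # Every word inherits the language of the sentence it came from
        labels.extend([lang] * len(words))
    return tokens, labels


tokens, labels = label_tokens([("eng", "Tom is here"), ("deu", "Ich bin müde")])
print(tokens)  # ['Tom', 'is', 'here', 'Ich', 'bin', 'müde']
print(labels)  # ['eng', 'eng', 'eng', 'deu', 'deu', 'deu']
```

The two lists always have the same length, which is what a token-classification model expects.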

In [10]:
import random

In [11]:
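def create_span_level_data(all_langs, num_span_samples=500):
    """Build synthetic mixed-language samples with one language label per word.

    Each sample concatenates 6 to 8 sentences drawn from df_Document_level;
    every whitespace-separated word is labeled with the language of its sentence.
    """
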
    all_span = []
    all_span_label = []

    for i in range(num_span_samples):
        if i % 100 == 0:
            print(f"sample {i}/{num_span_samples}")
        random.seed(i)
        num_phrase = random.choice(range(6, 9))
        random.seed(i)
        num_langs = random.choice(range(4, 9))
        random.seed(i)
        langs = random.choices(all_langs, k=num_langs)[:num_phrase]
        random.seed(i)
        # **************
        augmented_langs = langs + random.choices(langs, k=num_phrase - len(langs))
        # **************

        span = []
        span_label = []

        for j, lang in enumerate(augmented_langs):
            # **************
            phrase = df_Document_level[df_Document_level["lang"] == lang].sample(n=1, random_state = i+j)
            splitted_phrase = phrase.iloc[0]["text"].split()
            span.extend(splitted_phrase)
            span_label.extend([lang]*len(splitted_phrase))
            # **************

        all_span.append(span)
        all_span_label.append(span_label)

    span_level_data = pd.DataFrame(data={"span": all_span, "label": all_span_label})

    return span_level_data


In [12]:
all_langs = df_Document_level["lang"].unique()
df_Span_level = create_span_level_data(num_span_samples=5000, all_langs=all_langs)
df_Span_level

sample 0/5000
sample 100/5000
sample 200/5000
...
sample 4900/5000


,span,label
0,"[Ich, kann, nicht, körperlich, arbeiten, Tom, ...","[deu, deu, deu, deu, deu, eng, eng, eng, eng, ..."
1,"[Non, andiamo, a, costruire, arene, in, Uganda...","[ita, ita, ita, ita, ita, ita, ita, deu, deu, ..."
2,"[""Большое, спасибо"", -, сказал, он, с, улыбкой...","[rus, rus, rus, rus, rus, rus, rus, rus, rus, ..."
3,"[""Bunlar, kimin, köpekleri, "", ""Onlar, benim, ...","[tur, tur, tur, tur, tur, tur, tur, tur, spa, ..."
4,"[Sonuna, kadar, okuduğunuz, için, teşekkürler,...","[tur, tur, tur, tur, tur, ita, ita, ita, ita, ..."
...,...,...
4995,"[او, مرا, نادیده, گرفت, حتی, زمانی, که, مرا, د...","[pes, pes, pes, pes, pes, pes, pes, pes, pes, ..."
4996,"[Tom, spricht, ebenso, gut, Französisch, wie, ...","[deu, deu, deu, deu, deu, deu, deu, spa, spa, ..."
4997,"[Ты, хочешь, научиться, это, делать, да, Lei, ...","[rus, rus, rus, rus, rus, rus, ita, ita, ita, ..."
4998,"[برای, اسکی, کردن, اغلب, به, جای, دوری, می, رف...","[pes, pes, pes, pes, pes, pes, pes, pes, pes, ..."


In [13]:
df_Document_level.to_csv(os.path.join(root_path, "Document_level.csv"), index=False)
df_Span_level.to_csv(os.path.join(root_path, "Span_level_data.csv"), index=False)
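One caveat when saving the span-level data as CSV: the `span` and `label` columns hold Python lists, and `to_csv` writes each list as its string representation. Reading the file back therefore yields strings unless they are parsed again. The sketch below shows one way to do that with `ast.literal_eval`, using an in-memory buffer and a made-up row instead of the real file.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    "span": [["Tom", "is", "here", "Ich", "bin", "müde"]],
    "label": [["eng", "eng", "eng", "deu", "deu", "deu"]],
})

# Round trip through CSV text, as to_csv/read_csv would do on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# The converters turn strings such as "['Tom', 'is', ...]" back into lists
loaded = pd.read_csv(
    buffer,
    converters={"span": ast.literal_eval, "label": ast.literal_eval},
)
print(type(loaded.loc[0, "span"]))
```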

# **Document level**

In [14]:
df_Document_level = pd.read_csv(os.path.join(root_path, "Document_level.csv"))

## **1. Baseline model**

In [15]:
import string

# Map each ISO 639-3 code to its language name
lang_label = {
    'ara': 'Arabic',
    'nld': 'Dutch',
    'eng': 'English',
    'ita': 'Italian',
    'fra': 'French',
    'deu': 'German',
    'pes': 'Persian',
    'rus': 'Russian',
    'spa': 'Spanish',
    'tur': 'Turkish'
}

# Build a dictionary mapping each language to the set of characters that occur in its texts.
lang_dict = dict()

for lang in lang_label:

    lang_texts = df_Document_level.loc[df_Document_level['lang'] == lang]

    # Join all sentences of this language into one string, separated by spaces.
    # **************
    concatenated_text = ' '.join(lang_texts['text'])
    # **************

    # Remove punctuation and digits, then lowercase
    # **************
    # Create a translation table to remove punctuation
    translation_table = str.maketrans("", "", string.punctuation + "؟«»٪" + string.digits + "۰۱۲۳۴۵۶۷۸۹")
    # Apply the translation table to drop those characters, then lowercase the result
    cleaned_text = concatenated_text.translate(translation_table).lower()
    # **************

    # Inserting the language and its specific characters into the lang_dict dictionary.
    # **************
    lang_dict[lang] = set(cleaned_text)
    # **************

print(lang_dict)

{'ara': {'g', 'ْ', 'i', '\u200c', 'n', 'ز', '\u202e', '؛', 'أ', 'ـ', 'ط', 'خ', 'w', 'f', 'c', '٦', '١', '٣', 'ء', '\u202c', 'ﻹ', 'ع', ' ', 'ش', 'ذ', 'ُ', 'ظ', 'م', 'ر', 'ب', 'u', 'o', 'y', '٤', 'ة', 'h', 't', 'k', 'ß', 'ى', 'ٓ', '،', 'إ', 'ٌ', '\xa0', 'ٱ', 's', 'd', 'ص', 'ک', 'ي', 'س', 'ِ', 'ض', 'د', '\u200f', 'ﻻ', '٠', '٢', 'ل', 'a', 'm', 'َ', 'ً', 'v', 'ؤ', 'ف', 'ا', 'ن', 'ج', 'ئ', 'غ', 'ث', 'ت', 'r', 'ّ', 'چ', 'l', 'و', 'ك', 'ق', 'e', 'ح', 'ٍ', '٩', 'b', 'ی', 'آ', 'ه'}, 'nld': {'é', 'ō', 'ó', 'ë', 'í', 'g', 'ā', 'ș', 'z', 'j', 'ț', 's', ' ', 'd', 'x', 'ă', 'p', 'k', 'ı', 'i', 'r', '\u200b', 'n', 'ï', 'ö', 'l', 'u', 'e', 'o', 'y', 'w', 'a', 'm', 'á', 'b', 'f', 'q', 'h', 't', 'c', 'ß', 'è', 'v', 'ī'}, 'eng': {'é', 'g', 'z', 'j', 's', ' ', 'd', 'x', 'p', 'i', 'r', 'n', 'ö', 'l', 't', 'u', 'e', 'y', 'o', 'w', 'a', 'm', 'ê', 'f', 'b', 'ä', 'q', 'h', 'k', 'c', 'v'}, 'ita': {'é', 'í', 'g', 'z', 'j', 's', ' ', 'd', 'x', 'p', 'k', 'i', 'r', 'n', 'ì', 'l', 'u', 'e', 'o', 'y', 'ò', 'w', 'm', '

In [16]:
for key, value in lang_label.items():
    print('{}: {}'.format(key, len(lang_dict[key])))

ara: 89
nld: 44
eng: 31
ita: 35
fra: 46
deu: 38
pes: 77
rus: 54
spa: 43
tur: 39


In [17]:
# Removing the shared characters from languages in order to create a unique character list for each language.
uniqe_char_dict = {}

# ************** 15 Min
for lang_1 in lang_label.keys():
    uniqe_char_dict[lang_1] = lang_dict[lang_1].copy()
    for lang_2 in lang_label.keys():
        if lang_1 != lang_2:
            uniqe_char_dict[lang_1] -= lang_dict[lang_2]
# **************
print(uniqe_char_dict)

{'ara': {'ٓ', 'إ', 'ْ', '\u202c', 'ٱ', 'ﻹ', '\u202e', 'ﻻ', 'ـ', '٠', '٢', 'ٍ', '٩', '٤', 'ة', '٦', '١', '٣'}, 'nld': {'ș', 'ț', 'ă', 'ī'}, 'eng': set(), 'ita': {'ì', 'ò'}, 'fra': {'î', 'ɛ', 'č', 'œ', 'ḍ', 'ô'}, 'deu': {'‚'}, 'pes': {'š', '\u200d', 'ٔ', 'پ', 'ژ', 'گ', 'ۀ'}, 'rus': {'ы', 'ш', 'щ', 'ь', 'й', 'у', 'п', 'х', 'ч', 'ц', 'л', 'и', '́', 'я', 'р', 'к', 'о', 'а', 'м', 'т', 'б', '₫', 'д', 'г', '―', 'н', 'ф', 'ж', 'с', 'ю', 'ё', 'з', 'е', 'ъ', 'в', 'э'}, 'spa': {'ñ', 'ǔ', 'º', '¡', '¿', 'ǎ', 'ú'}, 'tur': {'ş', 'ğ', '̇', '²'}}


In [18]:
for key, value in lang_label.items():
    print('{}: {}'.format(key, len(uniqe_char_dict[key])))

ara: 18
nld: 4
eng: 0
ita: 2
fra: 6
deu: 1
pes: 7
rus: 36
spa: 7
tur: 4
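The set-difference step can be seen on a toy example. In this sketch the helper `unique_chars` and the three tiny character sets are made up for illustration; the logic is the same "remove everything any other language also uses" idea as in the cell above.

```python
def unique_chars(char_sets):
    """Return, per language, the characters that appear in no other language's set."""
    result = {}
    for lang, chars in char_sets.items():
        # Union of the character sets of all other languages
        others = set().union(*(s for other, s in char_sets.items() if other != lang))
        result[lang] = chars - others
    return result


toy_sets = {
    "eng": set("abc"),
    "spa": set("abcñ"),
    "rus": set("абв"),
}
unique = unique_chars(toy_sets)
print(unique["eng"])  # set()
print(unique["spa"])  # {'ñ'}
```

As in the real counts above, English ends up with an empty set: every character it uses also occurs in at least one other language, so a method that relies only on unique characters has nothing to recognize English by.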


### 1.1. Predict the first language whose unique characters appear in the input text

In [19]:
test_data = df_Document_level.sample(n=1000, random_state=1)

acc = 0
for index, row in test_data.iterrows():
    txt_chars = set(row['text'])
    for key, value in lang_label.items():
        if len(uniqe_char_dict[key].intersection(txt_chars)) > 0:
            if key == row['lang']:
                acc += 1
            break

print('Acc: {}'.format(acc / len(test_data.index)))

Acc: 0.293
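A property of the first-match rule worth keeping in mind is that its answer depends on the order in which languages are checked. The sketch below illustrates this with a hypothetical helper `detect_first_match` and made-up unique-character sets.

```python
def detect_first_match(text, unique_sets):
    """Return the first language (in dict order) sharing a unique character with text."""
    chars = set(text)
    for lang, lang_chars in unique_sets.items():
        if chars & lang_chars:
            return lang
    return None  # no unique character of any language was found


toy_unique = {"spa": {"ñ", "¿"}, "tur": {"ş", "ğ"}}
reversed_unique = dict(reversed(list(toy_unique.items())))
mixed = "el niño y Ayşe"

print(detect_first_match(mixed, toy_unique))       # spa
print(detect_first_match(mixed, reversed_unique))  # tur
print(detect_first_match("hello", toy_unique))     # None
```

A text with no unique character of any language gets no prediction at all, and the evaluation loop above counts that as an error.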


### 1.2. Predict the language that shares the most unique characters with the input text

In [20]:
test_data = df_Document_level.sample(n=1000, random_state=1)

acc = 0

for index, row in test_data.iterrows():
    # ************** 10 Min
    txt_chars = set(row['text'])
    max_shared_chars = 0
    detected_lang = None
    for lang, lang_chars in uniqe_char_dict.items():
        shared_chars = len(txt_chars & lang_chars)
        if shared_chars > max_shared_chars:
            max_shared_chars = shared_chars
            detected_lang = lang
    if detected_lang == row['lang']:
        acc += 1
    # **************

print('Acc: {}'.format(acc / len(test_data.index)))

Acc: 0.293
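The maximum-overlap rule removes the dependence on language order when one language clearly dominates, but it still returns nothing when the text contains no unique character. Since the unique-character set for English is empty, no English sentence can ever be predicted correctly by this rule. The sketch below shows the rule on toy sets; the helper name `detect_max_overlap` and its `default` parameter are assumptions added for illustration, not part of the notebook.

```python
def detect_max_overlap(text, unique_sets, default=None):
    """Return the language sharing the most unique characters with text."""
    chars = set(text)
    best_lang, best_count = default, 0
    for lang, lang_chars in unique_sets.items():
        count = len(chars & lang_chars)
        if count > best_count:
            best_lang, best_count = lang, count
    return best_lang


toy_unique = {"eng": set(), "spa": {"ñ", "¿", "ú"}, "tur": {"ş", "ğ"}}

print(detect_max_overlap("¿Dónde está el niño de Ayşe", toy_unique))       # spa
print(detect_max_overlap("Where is the child", toy_unique))                # None
print(detect_max_overlap("Where is the child", toy_unique, default="eng")) # eng
```

Falling back to a default language is one possible design choice; whether it helps depends on how often texts without any unique character really belong to that language.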


### 1.3. Applying the baseline method to a simpler problem

This section restricts the test data to three languages (Persian, English, and Russian) and evaluates the baseline method again.



In [21]:
test_data = df_Document_level[df_Document_level["lang"].isin(["pes", "eng", "rus"])].sample(n=1000, random_state=1)

from typing import Optional

# ************** 5 Min
def detect_language(row: pd.Series) -> Optional[str]:
    """Return the language sharing the most unique characters with the text, or None."""
    # Note: all ten languages remain candidates; only the test data is restricted
    txt_chars = set(row['text'])
    max_shared_chars = 0
    detected_lang = None
    for lang, lang_chars in uniqe_char_dict.items():
        shared_chars = len(txt_chars & lang_chars)
        if shared_chars > max_shared_chars:
            max_shared_chars = shared_chars
            detected_lang = lang
    return detected_lang
# **************

acc = 0

for index, row in test_data.iterrows():
    if detect_language(row) == row['lang']:
        acc += 1

print('Acc: {}'.format(acc / len(test_data)))

Acc: 0.511
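A single overall accuracy hides which languages the baseline handles well and which it misses entirely. A per-language breakdown is one way to look into this. The sketch below uses a hypothetical helper `per_language_accuracy` and toy labels; the numbers it prints are not results from this notebook.

```python
import pandas as pd


def per_language_accuracy(true_langs, predicted_langs):
    """Return the fraction of correct predictions for each true language."""
    results = pd.DataFrame({"true": list(true_langs), "pred": list(predicted_langs)})
    results["correct"] = results["true"] == results["pred"]
    return results.groupby("true")["correct"].mean()


# Toy labels only (None stands for "no language detected")
true = ["eng", "eng", "rus", "rus", "pes", "pes"]
pred = [None, None, "rus", "rus", "pes", "ara"]

acc_by_lang = per_language_accuracy(true, pred)
print(acc_by_lang)
```

With the real predictions, the same grouping would show directly how much of the error comes from languages that have few or no unique characters.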
