<a href="https://colab.research.google.com/github/diegofrl/BT---Cross-Lingual-Classification/blob/main/BT_DoYouSprichstFran%C3%A7ais.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook setup

In [1]:
!git clone https://github.com/diegofrl/BT---Cross-Lingual-Classification.git

Cloning into 'BT---Cross-Lingual-Classification'...
remote: Enumerating objects: 210, done.[K
remote: Counting objects: 100% (31/31), done.[K
remote: Compressing objects: 100% (31/31), done.[K
remote: Total 210 (delta 17), reused 0 (delta 0), pack-reused 179[K
Receiving objects: 100% (210/210), 44.83 MiB | 19.28 MiB/s, done.
Resolving deltas: 100% (32/32), done.


In [2]:
!apt-get install -y hunspell hunspell-de-at hunspell-de-ch hunspell-de-de hunspell-en-au hunspell-en-ca hunspell-en-gb hunspell-en-us hunspell-fr hunspell-it
!apt-get install -y libhunspell-dev
!pip install hunspell==0.3.4
!pip install langdetect
!pip install nltk
!pip install pandas
!pip install scikit-learn
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm
!python -m spacy download de_core_news_sm
!python -m spacy download it_core_news_sm
!pip install transformers

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  dictionaries-common emacsen-common hunspell-fr-classical libhunspell-1.7-0
  libtext-iconv-perl
Suggested packages:
  wordlist libreoffice-writer openoffice.org-hunspell | openoffice.org-core
The following NEW packages will be installed:
  dictionaries-common emacsen-common hunspell hunspell-de-at hunspell-de-ch
  hunspell-de-de hunspell-en-au hunspell-en-ca hunspell-en-gb hunspell-en-us
  hunspell-fr hunspell-fr-classical hunspell-it libhunspell-1.7-0
  libtext-iconv-perl
0 upgraded, 15 newly installed, 0 to remove and 38 not upgraded.
Need to get 2,843 kB of archives.
After this operation, 12.9 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/main amd64 libtext-iconv-perl amd64 1.7-7 [13.8 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal/main amd64 emacsen-common all 3.0.4 [14.9 kB]
Get:3 http://ar

## Langdetect setup
Remove all unused languages from the langdetect library

In [3]:
!find /usr/local/lib/python3.10/dist-packages/langdetect/profiles -type f ! \( -name 'en' -o -name 'fr' -o -name 'de' -o -name 'it' \) -delete

## Import dependencies

In [4]:
import hunspell
from langdetect import detect_langs, detect
import nltk
nltk.download('punkt')
import os
import re
import pandas as pd
from sklearn import svm
import joblib
import csv
from math import ceil
import numpy as np
from collections import defaultdict
import spacy
import torch
from transformers import AutoTokenizer, AutoModel
from keras.utils import to_categorical
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from keras.models import Sequential, load_model
from keras.layers import Dense
import string
import unicodedata
import plotly.express as px
from sklearn import preprocessing

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Text Parsing Functions
In order for the segmentation of words and sentences to be consistent among the notebook, these methods are the only ones to use for that purpose.

In [5]:
def get_sentences(txt):
    sents = nltk.sent_tokenize(txt)

    return sents


def get_words(txt):
    word_regex = re.compile(r"[^\d\W]+")
    words_list = word_regex.findall(txt)

    return words_list

## Processing langdetect outputs
Modify the result of the 'detect_langs' function from the 'langdetect' library to obtain the likelihood of an input being in a particular language.

In [6]:
def get_prob_for_language(output, language_code):
    NB_PROFILES = 4
    prob = 0.0

    for prediction in output[:NB_PROFILES]:
        if prediction.lang == language_code:
            prob = round(prediction.prob, 2)
            break

    return prob

# Data Processing

## Features extraction
Extract the features from the cross-lingual text files and store them in CSV files. Note that the columns wordLength, wordIsEN, sentenceIsEN, wordIsDE, sentenceIsDE, wordIsFR, sentenceIsFR, wordIsIT, sentenceIsIT are features used to train the langdetect-based machine learning approach.

In [None]:
texts_directory = "/content/BT---Cross-Lingual-Classification/CrosslingualTexts"
output_directory = "/content/BT---Cross-Lingual-Classification/Data"

if not os.path.exists(output_directory):
    os.makedirs(output_directory)

header = ['MainLanguage', 'Category','Filename', 'SentenceIndex', 'Sentence', 'Word', 'wordLength',
          'wordIsEN', 'sentenceIsEN',
          'wordIsDE', 'sentenceIsDE',
          'wordIsFR', 'sentenceIsFR',
          'wordIsIT', 'sentenceIsIT']

for text_filename in os.listdir(texts_directory):
    if text_filename.endswith(".txt"):
        inputText = open(f"{texts_directory}/{text_filename}", "r") 
        text = inputText.read()
        inputText.close()

        data = []
        sentences = get_sentences(text)
        main_language, category = text_filename.split('_')
        category = category.split('.')[0]
        sentence_index = 0
        filename = os.path.splitext(text_filename)[0]

        # For each sentence, predict the probabilities isEn, isFr, isIT and isDE
        for s in sentences:
            words = get_words(s)
            spl = None
            try:
                spl = detect_langs(s)
            except:
                pass
            # For each word of the sentence, predict predict the probabilities isEn, isFr, isIT and isDE
            for w in words:
                if w != "":
                    wpl = None
                    try:
                        wpl = detect_langs(w)
                    except:
                        pass
                    row = [main_language, category, filename, sentence_index, s, w, len(w),
                           get_prob_for_language(wpl, "en") if wpl else 0.0,
                           get_prob_for_language(spl, "en") if spl else 0.0,
                           get_prob_for_language(wpl, "de") if wpl else 0.0,
                           get_prob_for_language(spl, "de") if spl else 0.0,
                           get_prob_for_language(wpl, "fr") if wpl else 0.0,
                           get_prob_for_language(spl, "fr") if spl else 0.0,
                           get_prob_for_language(wpl, "it") if wpl else 0.0,
                           get_prob_for_language(spl, "it") if spl else 0.0]
                    data.append(row)
            sentence_index += 1

        df = pd.DataFrame(data, columns=header)
        df.index.name = 'index'
        output_filename = filename + "_values" + '.csv'
        output_path = os.path.join(output_directory, output_filename)
        df.to_csv(output_path, index=True)

## Merge the targets
Merge the feature files created previously with the target files and verify that the values always match their corresponding target to avoid an eventual offset problem.

In [None]:
data_dir = "/content/BT---Cross-Lingual-Classification/Data"

feature_files = [f for f in os.listdir(data_dir) if f.endswith("_values.csv")]

# Loop over the feature files
for feature_file in feature_files:
    try:
        # Get the corresponding target file
        target_file = feature_file.replace("_values.csv", ".csv")

        # Load the feature file and target file into dataframes
        feature_df = pd.read_csv(os.path.join(data_dir, feature_file))
        target_df = pd.read_csv(os.path.join(data_dir, target_file))

        # Add an index column to the target dataframe
        target_df['index'] = target_df.index

        # Create a new dataframe to store the rows where the word is the same in both dataframes
        valid_rows_df = pd.DataFrame(columns=feature_df.columns)

        # Loop over each row in the feature dataframe
        for i, row in feature_df.iterrows():
            # Get the word from the feature dataframe
            feature_word = row['Word']

            # Get the word from the target dataframe
            target_word = target_df.iloc[i]['Word']

            # Check if the words are the same
            if feature_word == target_word:
                # If the words are the same, add this row to the valid rows dataframe using pandas.concat
                valid_rows_df = pd.concat([valid_rows_df, pd.DataFrame(row).T], ignore_index=True)

        # Merge the valid rows and target dataframes
        merged_df = pd.merge(valid_rows_df, target_df, on="index")

        # Save the merged dataframe to a new file
        output_file = feature_file.replace("_values.csv", "_merged.csv")
        merged_df.to_csv(os.path.join(data_dir, output_file), index=False)

    except Exception as e:
        print(f"Error processing file: {feature_file}. {e}")

## Merge all files in one dataset

In [None]:
merged_dir = "/content/BT---Cross-Lingual-Classification/Data" 
output_file = "/content/BT---Cross-Lingual-Classification/Datasets/dataset.csv"  # path to output file

header_written = False  # flag to keep track of whether the header has been written yet

# Loop over all files in the merged directory that end with '_merged.csv'
for file_name in os.listdir(merged_dir):
    if file_name.endswith("_merged.csv"):
        # Read the current file into a DataFrame
        current_df = pd.read_csv(os.path.join(merged_dir, file_name))
        
        # Remove the 'index' column
        current_df = current_df.drop(columns=['index'])
        
        # Replace NaN values with 0.0
        current_df = current_df.fillna(value=0.0)
        
        # If the header has not been written yet, write it to the output file
        if not header_written:
            current_df.to_csv(output_file, mode="w", index=False)
            header_written = True
        # Otherwise, write only the rows to the output file
        else:
            current_df.to_csv(output_file, mode="a", header=False, index=False)

In [None]:
# Load the dataset
df = pd.read_csv('/content/BT---Cross-Lingual-Classification/Datasets/dataset.csv')

df #Display an overview of the dataset

Unnamed: 0,MainLanguage,Category,Filename,SentenceIndex,Sentence,Word_x,wordLength,wordIsEN,sentenceIsEN,wordIsDE,sentenceIsDE,wordIsFR,sentenceIsFR,wordIsIT,sentenceIsIT,Word_y,Language
0,de,HistoricalContext,de_HistoricalContext,0,Im Laufe der Geschichte haben sich verschieden...,Im,2,0.00,0.0,1.0,1.0,0.0,0.0,0.0,0.0,Im,de
1,de,HistoricalContext,de_HistoricalContext,0,Im Laufe der Geschichte haben sich verschieden...,Laufe,5,0.00,0.0,1.0,1.0,0.0,0.0,0.0,0.0,Laufe,de
2,de,HistoricalContext,de_HistoricalContext,0,Im Laufe der Geschichte haben sich verschieden...,der,3,0.00,0.0,1.0,1.0,0.0,0.0,0.0,0.0,der,de
3,de,HistoricalContext,de_HistoricalContext,0,Im Laufe der Geschichte haben sich verschieden...,Geschichte,10,0.00,0.0,1.0,1.0,0.0,0.0,0.0,0.0,Geschichte,de
4,de,HistoricalContext,de_HistoricalContext,0,Im Laufe der Geschichte haben sich verschieden...,haben,5,0.00,0.0,1.0,1.0,0.0,0.0,0.0,0.0,haben,de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15414,it,ArtAndLitterature,it_ArtAndLitterature,25,Il compositore tedesco Ludwig van Beethoven è ...,i,1,0.94,0.0,0.0,0.0,0.0,0.0,0.0,1.0,i,it
15415,it,ArtAndLitterature,it_ArtAndLitterature,25,Il compositore tedesco Ludwig van Beethoven è ...,suoi,4,0.00,0.0,0.0,0.0,0.0,0.0,1.0,1.0,suoi,it
15416,it,ArtAndLitterature,it_ArtAndLitterature,25,Il compositore tedesco Ludwig van Beethoven è ...,concerti,8,0.00,0.0,0.0,0.0,0.0,0.0,1.0,1.0,concerti,it
15417,it,ArtAndLitterature,it_ArtAndLitterature,25,Il compositore tedesco Ludwig van Beethoven è ...,per,3,0.00,0.0,0.0,0.0,0.0,0.0,1.0,1.0,per,it


## Split dataset into training and testing sets
Split the dataset into training and testing sets. Randomly select 30% of the sentences of each text file for the testing and 70% for the training of the machine learning approaches.

In [None]:
# Get the unique filenames
filenames = df['Filename'].unique()

# Create empty dataframes for testing and training sets
test_df = pd.DataFrame(columns=df.columns)
train_df = pd.DataFrame(columns=df.columns)

# Set a random seed for reproducibility
np.random.seed(0)

# Loop through each filename
for filename in filenames:
    # Get the rows for the current filename
    file_df = df[df['Filename'] == filename]
    
    # Calculate the number of sentences to keep for testing
    test_sentences = ceil(file_df['SentenceIndex'].nunique() * 0.3)
    
    # Randomly select sentence indices for testing
    test_indices = np.random.choice(file_df['SentenceIndex'].unique(), size=test_sentences, replace=False)
    
    # Split the data into testing and training sets
    test_data = file_df[file_df['SentenceIndex'].isin(test_indices)]
    train_data = file_df[~file_df['SentenceIndex'].isin(test_indices)]
    
    # Append the data to the testing and training dataframes using pandas.concat
    test_df = pd.concat([test_df, test_data], ignore_index=True)
    train_df = pd.concat([train_df, train_data], ignore_index=True)

# Save the testing and training sets to CSV files
test_df.to_csv('test_data.csv', index=False)
train_df.to_csv('train_data.csv', index=False)

# Rule-Based approach

In [7]:
# List of language variations with the order based on the number of speakers
EN = ["en_US", "en_GB", "en_CA", "en_AU"]
FR = ["fr_FR", "fr_CA", "fr_BE", "fr_CH", "fr_LU"]
DE = ["de_DE", "de_AT", "de_CH", "de_BE", "de_LU", "de_LI"]
IT = ["it_IT", "it_CH"]

# Function to check if a word exists in the dictionary of a language
def check_word(word, languages):
    for language in languages:
        for variation in language:
            # Creating a Hunspell object for each language variation
            hobj = hunspell.HunSpell(
                os.path.join("/usr/share/hunspell", variation + ".dic"),
                os.path.join("/usr/share/hunspell", variation + ".aff"),
            )
            # If the word exists in the language variation dictionary, return the language
            if hobj.spell(word):
                return variation.split("_")[0]
    # If the word is not found in any language, return an empty string
    return ""

# Function to get language predictions for each word in a text
def rb_get_predictions(text, base_language):
    languages = []
    # Determine the order of languages to check based on the base language
    if base_language == "en":
        languages = [EN, FR, DE, IT]
    elif base_language == "fr":
        languages = [FR, EN, DE, IT]
    elif base_language == "de":
        languages = [DE, EN, FR, IT]
    elif base_language == "it":
        languages = [IT, EN, FR, DE]
    else:
        # Return -1 if the base language is not supported
        return -1

    # Extract words from the text
    words = get_words(text)
    result = []
    for i, word in enumerate(words):
        # Check the language of the word
        language = check_word(word, languages)
        # If the word doesn't exist in any language,
        # take the language of the previous word or the base language if it's the first word
        if language == "":
            if i != 0:
                language = result[i - 1][1]
            else:
                language = base_language
        # Append the word and its predicted language to the result
        result.append([word, language])
    return result

# Example usage
base_language = "fr"
text = "Salut comment allez-vous?"
# Get predictions for each word in the text
result = rb_get_predictions(text, base_language)
print(result)


[['Salut', 'fr'], ['comment', 'fr'], ['allez', 'fr'], ['vous', 'fr']]


# Named Entity Recognition (NER) approach

In [8]:
# Loading Spacy language models
fr_model = spacy.load("fr_core_news_sm")
de_model = spacy.load("de_core_news_sm")
it_model = spacy.load("it_core_news_sm")
en_model = spacy.load("en_core_web_sm")

# Function to extract foreign named entities from the text
def extract_foreign_entities(text, base_language):
    foreign_entities = []
    # List of named entity types that are likely to be in a foreign language
    probable_foreign_ne = ["NORP", "FAC", "GPE", "LOC", "LANGUAGE", "PERSON", "ORG", "PRODUCT", "EVENT", "WORK_OF_ART", "LAW"]
    
    # Select the appropriate model based on the base language
    if base_language == "fr":
        model = fr_model
    elif base_language == "de":
        model = de_model
    elif base_language == "it":
        model = it_model
    elif base_language == "en":
        model = en_model

    # Apply the model to the text to extract named entities
    doc = model(text)
    for ent in doc.ents:
        # If the entity type is likely to be in a foreign language
        if ent.label_ in probable_foreign_ne:
            entity_text = ent.text

            # Try to detect the language of the entity
            try:
                entity_lang = detect(entity_text)
            except:
                # If language detection fails, assume it's the base language
                entity_lang = base_language

            # If the detected language is not the base language, add it to the list of foreign entities
            if entity_lang != base_language:
                foreign_entities.append([entity_text, entity_lang])
    return foreign_entities

# Function to label foreign entities in the text
def label_text(text, lst):
    index = 0
    for element in lst:
        # Replace the foreign entity in the text with a label
        text = text.replace(element[0], f"FOREIGN_ENTITY_{str(index)} ", 1)
        index += 1
    return text

# Function to replace labels in a list with the corresponding foreign entities
def replace_label(lst, foreign_lst):
    for i in range(len(lst)):
        if lst[i][0].startswith("FOREIGN_ENTITY_"):
            index = int(lst[i][0].split('_')[-1])
            if index < len(foreign_lst):
                lst[i] = foreign_lst[index]
    return lst

# Function to classify words as being in the base language or a foreign language
def classify_words(text, base_language, foreign_entities):
    word_regex = re.compile(r"\w+")
    # Extract words from the text
    words = word_regex.findall(text)
    words_with_lang = []
    for word in words:
        # Assume each word is in the base language
        words_with_lang.append([word, base_language])
    # Replace labels in the list with the corresponding foreign entities
    pred = replace_label(words_with_lang, foreign_entities)
    return pred

# Function to split labeled entities into individual words
def extract_by_word(pred):
    output_lst = []
    for elem in pred:
        word, lang = elem
        # Split the entity into words
        words = get_words(word)
        for w in words:
            # Append each word and its language to the output list
            output_lst.append([w, lang])
    return output_lst

# Function to get language predictions for each word in a text using named entity recognition
def ner_get_predictions(text, base_language):
    # Extract foreign named entities from the text
    foreign_entities = extract_foreign_entities(text, base_language)
    # Label foreign entities in the text
    labeled_text = label_text(text, foreign_entities)
    # Classify words as being in the base language or a foreign language
    pred = classify_words(labeled_text, base_language, foreign_entities)
    # Split labeled entities into individual words
    pred = extract_by_word(pred)
    return pred

# Example usage
text = "Turn right on Heilmanstrasse."
print(ner_get_predictions(text, 'en'))


[['Turn', 'en'], ['right', 'en'], ['on', 'en'], ['Heilmanstrasse', 'de']]


# Langdetect based machine learning approach

## Data pre-processing
Preparation of the data to train and evaluate the model's accuracy

In [None]:
# Define the feature names
feature_names = ['wordLength',
                 'wordIsEN', 'sentenceIsEN',
                 'wordIsDE', 'sentenceIsDE',
                 'wordIsFR', 'sentenceIsFR',
                 'wordIsIT', 'sentenceIsIT']

# Load the training data from the train_data.csv file
train_df = pd.read_csv('/content/BT---Cross-Lingual-Classification/Datasets/train_data.csv')

# Get the feature and target data for training
X_train = train_df[feature_names]
y_train = train_df['Language']

# Load the testing data from the test_data.csv file
test_df = pd.read_csv('/content/BT---Cross-Lingual-Classification/Datasets/test_data.csv')

# Filter the rows to exclude words with a length smaller or equal to 1
test_df = test_df[test_df['wordLength'] > 1]

# Get the feature and target data for testing
X_test = test_df[feature_names]
y_test = test_df['Language']

## Model training
Train the machine learning model and print its accuracy

In [None]:
svm_classifier = svm.SVC(kernel='linear')

svm_classifier.fit(X_train, y_train)

accuracy = svm_classifier.score(X_test, y_test)
print('Accuracy:', accuracy)

Accuracy: 0.8644910309055543


In [None]:
# Save the model
joblib.dump(svm_classifier, '/content/BT---Cross-Lingual-Classification/ML_Models/langdetect-based_model.joblib')

['/content/BT---Cross-Lingual-Classification/ML_Models/langdetect-based_model.joblib']

## Inference

In [None]:
def ml_get_predictions(text, model):
    # Define the feature names
    feature_names = ['wordLength',
                    'wordIsEN', 'sentenceIsEN',
                    'wordIsDE', 'sentenceIsDE',
                    'wordIsFR', 'sentenceIsFR',
                    'wordIsIT', 'sentenceIsIT']
    
    predictions = []
    sentences = get_sentences(text)

    # For each sentence, predict the probabilities isEn, isFr, isIT and isDE
    for s in sentences:
        words = get_words(s)
        try:
          spl = detect_langs(s)
        except:
          pass
        # For each word of the sentence, predict the probabilities isEn, isFr, isIT and isDE
        for w in words:
            if w:
                try:
                  wpl = detect_langs(w)
                except:
                  pass
                feature_values = [[len(w),
                                    get_prob_for_language(wpl, "en") if wpl else 0.0,
                                    get_prob_for_language(spl, "en") if spl else 0.0,
                                    get_prob_for_language(wpl, "de") if wpl else 0.0,
                                    get_prob_for_language(spl, "de") if spl else 0.0,
                                    get_prob_for_language(wpl, "fr") if wpl else 0.0,
                                    get_prob_for_language(spl, "fr") if spl else 0.0,
                                    get_prob_for_language(wpl, "it") if wpl else 0.0,
                                    get_prob_for_language(spl, "it") if spl else 0.0]]

                X_df = pd.DataFrame(feature_values, columns=feature_names)
                rf_prediction = model.predict(X_df)
                language = rf_prediction[0]
                predictions.append([w, language])

    return predictions

# Example usage
langdetectbased_model = joblib.load('/content/BT---Cross-Lingual-Classification/ML_Models/langdetect-based_model.joblib')
text = '''Hallo, wie geht es dir? See you there!''' #cross-lingual text example
print(ml_get_predictions(text, langdetectbased_model))

[['Hallo', 'de'], ['wie', 'de'], ['geht', 'de'], ['es', 'fr'], ['dir', 'de'], ['See', 'en'], ['you', 'en'], ['there', 'en']]


https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


# Transformer-based machine learning approach

## Data pre-processing

Get sentences with their targets from the training set with this form:

```
[[sentence0, [[word0, word0_lang], [word1, word1_lang]], [sentence1, [[word0, word0_lang], [word1, word1_lang], [word2, word2_lang], ...]
```


In [None]:
train_text_and_labels = []
data = defaultdict(lambda: defaultdict(list))

with open('/content/BT---Cross-Lingual-Classification/Datasets/train_data.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        data[(row['MainLanguage'], row['Category'], row['Sentence'])]['words'].append([row['Word_x'], row['Language']])

for key, value in data.items():
    train_text_and_labels.append([key[2], value['words']])

## Vectorisation

Load the transformers tokenizer and model

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-uncased')
model = AutoModel.from_pretrained('bert-base-multilingual-uncased')

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.72M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/672M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The method create_dataframe vectorizes each word of a sentence and assigns the corresponding target. It returns a dataframe with the columns word, vector and label. The vector column corresponds to the vector of the word in the context of the sentence.

In [None]:
def create_dataframe(text_and_labels):
    # Extract the text and word-language pairs from the input list
    text = text_and_labels[0]
    word_language_pairs = text_and_labels[1]

    # Tokenize the text
    tokens = tokenizer.tokenize(text)
    # Add special tokens
    tokens = ['[CLS]'] + tokens + ['[SEP]']
    # Convert tokens to input IDs
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    # Create tensor from input IDs
    with torch.no_grad():
        embeddings = model(torch.tensor([input_ids]))[0]

    # Create a dictionary for easier lookup of language labels
    language_lookup = {word: lang for word, lang in word_language_pairs}

    # Get vector of each word and its corresponding language label
    word_vectors = []
    labels = []
    for i in range(1, len(tokens) - 1):
        if re.match(r'\w', tokens[i]):
            word_vector = embeddings[0][i].numpy()
            word_vectors.append(word_vector)
            # Look up the language label based on the token
            labels.append(language_lookup.get(tokens[i].lower(), 'unknown'))

    # Create a dataframe with word, vector, and label columns
    words = [t for t in tokens[1:-1] if re.match(r'\w', t)]
    df = pd.DataFrame({'word': words, 'vector': word_vectors, 'label': labels})
    return df

# Example usage
input_list = ['Fahren Sie von Avenue Anatole France zur Avenue de Suffren.', [['fahren', 'de'], ['sie', 'de'], ['von', 'de'], ['avenue', 'fr'], ['anatole', 'fr'], ['france', 'fr'], ['zur', 'de'], ['avenue', 'fr'], ['de', 'fr'], ['suffren', 'fr']]]
result = create_dataframe(input_list)
result.head() #Display an overview of the result

Unnamed: 0,word,vector,label
0,fahren,"[0.28685632, 0.52861565, 0.18523292, 0.279134,...",de
1,sie,"[-0.05587529, 0.016487246, 0.043123193, -0.089...",de
2,von,"[-0.23886408, 0.3358966, 0.27372605, 0.1606422...",de
3,avenue,"[-0.094035216, -0.016968282, 0.01749765, 0.089...",fr
4,ana,"[-0.23211592, 0.14859611, -0.15711416, -0.1094...",unknown


This code creates a dataframe for each sentence of the training set containing for each row: the word, its corresponding vector and its corresponding target (fr, en, de or it). Each dataframe is added to a list.

In [None]:
train_dataframes = []
for lst in train_text_and_labels:
  try:
    res = create_dataframe(lst)
  except:
    print(lst)
    pass
  train_dataframes.append(res)

Concatenate the dataframes list into one single dataframe

In [None]:
df_train = pd.concat(train_dataframes, ignore_index=True)
df_train = df_train.drop(df_train[~df_train.label.isin(['fr', 'en', 'de', 'it'])].index)

df_train.head() #Display an overview of the training set with transformers vectors

Unnamed: 0,word,vector,label
1,vi,"[-0.10989693, 0.032251377, 0.2825029, -0.71415...",it
3,di,"[-0.08477768, 0.67187977, 0.099835515, -0.1873...",it
4,alcuni,"[-0.14359927, 0.40742853, 0.35666615, -0.09985...",it
6,internazionali,"[0.22442648, -0.048540007, 0.37618932, -0.2372...",it
7,il,"[0.43887684, 0.53484637, 0.24410775, -0.158896...",it


Encode the label (target) column into integers

In [None]:
le = preprocessing.LabelEncoder()
le.fit(df_train['label'])
df_train['label'] = le.transform(df_train['label'])

df_train.head()

Unnamed: 0,word,vector,label
1,vi,"[-0.10989693, 0.032251377, 0.2825029, -0.71415...",3
3,di,"[-0.08477768, 0.67187977, 0.099835515, -0.1873...",3
4,alcuni,"[-0.14359927, 0.40742853, 0.35666615, -0.09985...",3
6,internazionali,"[0.22442648, -0.048540007, 0.37618932, -0.2372...",3
7,il,"[0.43887684, 0.53484637, 0.24410775, -0.158896...",3


In [None]:
df_train.to_csv('/content/BT---Cross-Lingual-Classification/Datasets/transformer_train_data.csv', index=True)

Same steps for the test set

In [None]:
test_text_and_labels = []
data = defaultdict(lambda: defaultdict(list))

with open('/content/BT---Cross-Lingual-Classification/Datasets/test_data.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        data[(row['MainLanguage'], row['Category'], row['Sentence'])]['words'].append([row['Word_x'], row['Language']])

for key, value in data.items():
    test_text_and_labels.append([key[2], value['words']])

test_dataframes = []
for lst in test_text_and_labels:
  try:
    res = create_dataframe(lst)
  except:
    print(lst)
    pass
  test_dataframes.append(res)

df_test = pd.concat(test_dataframes, ignore_index=True)
df_test = df_test.drop(df_test[~df_test.label.isin(['fr', 'en', 'de', 'it'])].index)

le = preprocessing.LabelEncoder()
le.fit(df_test['label'])
df_test['label'] = le.transform(df_test['label'])

In [None]:
df_test.to_csv('/content/BT---Cross-Lingual-Classification/Datasets/transformer_test_data.csv', index=True)

Separates the target column from the others

In [None]:
X_train = df_train['vector'].tolist()
y_train = df_train['label'].tolist()

X_test = df_test['vector'].tolist()
y_test = df_test['label'].tolist()

## Models training

Train a linear SVC model and print its performance metrics using pre-built functions

In [None]:
# create a linear SVC model
svc = LinearSVC(max_iter=10000)
# Train the model on the training data
svc.fit(X_train, y_train)

# Predict the labels of the testing set
y_pred = svc.predict(X_test)

# Calculate the classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

# Print the classification metrics
print("Accuracy: {:.2f}".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1 score: {:.2f}".format(f1))

Accuracy: 0.98
Precision: 0.98
Recall: 0.98
F1 score: 0.98


In [None]:
# Save the model
joblib.dump(svc, '/content/BT---Cross-Lingual-Classification/ML_Models/transformer_svc_model.joblib')

['/content/BT---Cross-Lingual-Classification/ML_Models/transformer_svc_model.joblib']

Train a neural network model and print its performance metrics using pre-built functions

In [None]:
# Convert target variable into one-hot encoded format
y_train_one_hot = to_categorical(y_train, num_classes=4)
y_test_one_hot = to_categorical(y_test, num_classes=4)

# Convert X_train and X_test to NumPy arrays
X_train = np.array(X_train)
X_test = np.array(X_test)

nn = Sequential()
nn.add(Dense(32, input_dim=X_train.shape[1], activation='relu'))
nn.add(Dense(16, activation='relu'))
nn.add(Dense(4, activation='softmax'))

nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

nn.fit(X_train, y_train_one_hot, epochs=5, batch_size=30)

y_pred_one_hot = nn.predict(X_test)
y_pred = y_pred_one_hot.argmax(axis=1)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

print("Accuracy: {:.2f}".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1 score: {:.2f}".format(f1))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy: 0.98
Precision: 0.98
Recall: 0.98
F1 score: 0.98


In [None]:
# Save the model
nn.save('/content/BT---Cross-Lingual-Classification/ML_Models/transformer_nn_model.h5')

## Inference

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-uncased')
model = AutoModel.from_pretrained('bert-base-multilingual-uncased')

def vectorize_text(text):
    def is_subword(token):
        return token[:2] == "##"

    # Convert text to lowercase
    text = text.lower()
    
    # Tokenize the text
    tokens = tokenizer.tokenize(text)
    
    # Add special tokens for BERT model
    tokens = ['[CLS]'] + tokens + ['[SEP]']
    
    # Convert tokens to input IDs using the tokenizer
    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Generate word embeddings using the pre-trained model
    with torch.no_grad():
        embeddings = model(torch.tensor([input_ids]))[0]

    word_vectors = []
    words = []
    current_word = ""
    current_word_vector = None

    # Iterate over tokens to extract word vectors
    for i in range(1, len(tokens)-1):
        token = tokens[i]
        if not is_subword(token):
            # If it's a new word, add the previous word and its vector to the list
            if current_word:
                words.append(current_word)
                word_vectors.append(current_word_vector)
            
            # Start a new word
            current_word = token
            current_word_vector = embeddings[0][i].numpy()
        else:
            # Append the subword to the current word and update its vector
            current_word += token[2:]
            current_word_vector += embeddings[0][i].numpy()

    # Add the last word and its vector to the list
    if current_word:
        words.append(current_word)
        word_vectors.append(current_word_vector)
    
    # Create a DataFrame with the words and their corresponding vectors
    df = pd.DataFrame({'word': words, 'vector': word_vectors})
    return df


# Function to get predictions using a neural network model
def tf_nn_get_predictions(text, model):
    # Vectorize the text
    df = vectorize_text(text)
    
    # Store word-label tuples
    word_label_tuples = []

    # Iterate over words and vectors
    for word, vector in zip(df['word'], df['vector']):
        # Skip words containing punctuation
        if any(p in word for p in string.punctuation):
            continue

        # Reshape the vector for prediction
        sample = np.array(vector).reshape(1, -1)
        
        # Make prediction using the neural network model
        y_pred_one_hot = model.predict(sample, verbose=0)
        
        # Get the predicted label
        label = np.argmax(y_pred_one_hot, axis=1)[0]
        
        # Map label to string representation
        labels_str = {0: 'de', 1:'en', 2: 'fr', 3: 'it'}
        label_str = labels_str.get(label)
        
        # Append word and label tuple to the list
        word_label_tuples.append([word, label_str])

    return word_label_tuples


# Function to get predictions using a support vector machine model
def tf_svc_get_predictions(text, model):
    # Vectorize the text
    df = vectorize_text(text)
    
    # Store word-label tuples
    word_label_tuples = []

    # Iterate over words and vectors
    for word, vector in zip(df['word'], df['vector']):
        # Skip words containing punctuation
        if any(p in word for p in string.punctuation):
            continue

        # Reshape the vector for prediction
        sample = np.array(vector).reshape(1, -1)
        
        # Make prediction using the support vector machine model
        label = model.predict(sample)[0]
        
        # Map label to string representation
        labels_str = {0: 'de', 1:'en', 2: 'fr', 3: 'it'}
        label_str = labels_str.get(label)
        
        # Append word and label tuple to the list
        word_label_tuples.append([word, label_str])

    return word_label_tuples


# Example usage
# Load the neural network model
nn_model = load_model('/content/BT---Cross-Lingual-Classification/ML_Models/transformer_nn_model.h5')

# Define the text for prediction
text = "How do you say ich arbeite in Zürich in English?"

# Get predictions using the neural network model
predictions = tf_nn_get_predictions(text, nn_model)

# Print the predictions
print(predictions)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.72M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/672M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[['how', 'en'], ['do', 'it'], ['you', 'en'], ['say', 'en'], ['ich', 'de'], ['arbeite', 'de'], ['in', 'de'], ['zurich', 'de'], ['in', 'en'], ['english', 'en']]


# Evaluation

## Tuples comparison

The compare_predictions function compares the prediction list of an approach with the corresponding target prediction list and returns the percentage of same tuples [word, language]. It makes sure that the tuples follow the same syntax (no digit, one word per tuple, lowercase)

In [None]:
# Function to preprocess text
def preprocess_text(text):
    # Remove accents from text
    text_without_accents = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
    
    # Find words in the text using regular expression
    words = re.findall(r'[^\d\W]+', text_without_accents)
    
    # Join the words and convert to lowercase
    return ' '.join(words).lower()


# Function to preprocess tuples
def preprocess_tuples(tuples_list):
    result = []
    
    # Iterate over tuples in the list
    for t in tuples_list:
        # Skip tuples starting with a digit
        if not t[0].isdigit():
            # Preprocess the text in the tuple and split into words
            words = preprocess_text(t[0]).split()
            
            # Append each word and the corresponding label to the result list
            for word in words:
                result.append((word, t[1]))
    
    return result



# Function to compare predictions
def compare_predictions(list1, list2):
    # Preprocess the tuple lists
    preprocessed_list1 = preprocess_tuples(list1)
    preprocessed_list2 = preprocess_tuples(list2)

    offset = 0
    same_tuples = 0
    for i in range(len(preprocessed_list1)):
        j = i + offset
        if i < len(preprocessed_list2) and j < len(preprocessed_list1):
            if preprocessed_list1[j][0] != preprocessed_list2[i][0]:
                print("Warning:", preprocessed_list1[j][0], "!=", preprocessed_list2[i][0])
                offset += 1
            elif preprocessed_list1[j] == preprocessed_list2[i]:
                # Count the number of same tuples in the preprocessed lists
                same_tuples += 1

    # Calculate the percentage of same tuples
    return (same_tuples / len(preprocessed_list1)) * 100

This code reads data from the test set and processes it to create a list of lists. Each inner list contains four elements: the main language, the category, the sentence, and a list of words with their corresponding languages. The words are grouped by the main language, category, and sentence.

## Data pre-processing

In [None]:
result = []
data = defaultdict(lambda: defaultdict(list))

with open('/content/BT---Cross-Lingual-Classification/Datasets/test_data.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        data[(row['MainLanguage'], row['Category'], row['Sentence'])]['words'].append([row['Word_x'], row['Language']])

for key, value in data.items():
    result.append([key[0], key[1], key[2], value['words']])

## Algorithms evaluation

Evaluate each approach on the entire test set. Note that this operation can take a while (approximately 25 minutes)

In [None]:
# Load the machine learning models
langdetectbased_model = joblib.load('/content/BT---Cross-Lingual-Classification/ML_Models/langdetect-based_model.joblib')
tf_nn_model = load_model('/content/BT---Cross-Lingual-Classification/ML_Models/transformer_nn_model.h5')
tf_svc_model = joblib.load('/content/BT---Cross-Lingual-Classification/ML_Models/transformer_svc_model.joblib')

def test_approach_on_category(data, algorithms, category):
    results = []
    for item in data:
        target_predictions = item[3]
        text = item[2]
        language = item [0]

        if item[1] == category:
            accuracies = []

            for approach in algorithms:

                if approach == 'NER':
                    ner_predictions = ner_get_predictions(text, language)
                    accuracy = compare_predictions(ner_predictions, target_predictions)
                elif approach == 'Langdetect ML':
                    langdetect_ml_predictions = ml_get_predictions(text, langdetectbased_model)
                    accuracy = compare_predictions(langdetect_ml_predictions, target_predictions)
                elif approach == 'Transformer NN':
                    tf_nn_predictions = tf_nn_get_predictions(text, tf_nn_model)
                    accuracy = compare_predictions(tf_nn_predictions, target_predictions)
                elif approach == 'Transformer SVC':
                    tf_svc_predictions = tf_svc_get_predictions(text, tf_svc_model)
                    accuracy = compare_predictions(tf_svc_predictions, target_predictions)
                elif approach == 'Rule-based':
                    rb_predictions = rb_get_predictions(text, language)
                    accuracy = compare_predictions(rb_predictions, target_predictions)
                else:
                    accuracy = None  # Default value if the comparison is not applicable
                accuracies.append(accuracy)

            results.append((text, language, category) + tuple(accuracies))

    return results

categories = ['Navigation', 'PopCulture', 'Expressions', 'Gastronomy', 'HistoricalContext', 'ArtAndLitterature', 'TechnologyAndScience', 'Jargon', 'Quotations']
algorithms = ['NER', 'Langdetect ML', 'Rule-based', 'Transformer NN', 'Transformer SVC']


data = []
for category in categories:
    results = test_approach_on_category(result, algorithms, category)
    data.extend(results)

# Create a DataFrame
df_results = pd.DataFrame(data, columns=['Sentence', 'Language', 'Category'] + algorithms)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations




## Visualize evaluation results

In [None]:
df_results # Display an overview of the results of the evaluation

Unnamed: 0,Sentence,Language,Category,NER,Langdetect ML,Rule-based,Transformer NN,Transformer SVC
0,Continuez sur la rue Victoria Street.,fr,Navigation,83.333333,66.666667,66.666667,100.000000,83.333333
1,Continuez sur la rue Main Street.,fr,Navigation,83.333333,66.666667,66.666667,100.000000,100.000000
2,Tournez à droite sur la rue Oak Avenue.,fr,Navigation,100.000000,75.000000,87.500000,87.500000,75.000000
3,Continuez tout droit sur la rue Church Street.,fr,Navigation,87.500000,75.000000,87.500000,100.000000,87.500000
4,Tournez à gauche sur la rue Sunnydale Drive.,fr,Navigation,75.000000,75.000000,75.000000,100.000000,87.500000
...,...,...,...,...,...,...,...,...
320,"""Der Mensch ist, was er isst"" afferma Ludwig F...",it,Quotations,62.500000,87.500000,70.833333,95.833333,87.500000
321,"Nella celebre opera di Shakespeare, Amleto si ...",it,Quotations,57.692308,57.692308,84.615385,96.153846,96.153846
322,"Voltaire ci ricorda che ""Le mieux est l'ennemi...",it,Quotations,68.000000,64.000000,84.000000,96.000000,100.000000
323,"Théophile Gautier afferma che ""L'art pour l'ar...",it,Quotations,66.666667,66.666667,66.666667,100.000000,95.238095


Save the results in an Excel file

In [None]:
df_results.to_excel('/content/BT---Cross-Lingual-Classification/results.xlsx', index=True)

In [None]:
import pandas as pd

df_results = pd.read_excel('/content/BT---Cross-Lingual-Classification/results.xlsx')
df_results #Display the results evaluation

Unnamed: 0.1,Unnamed: 0,Sentence,Language,Category,NER,Langdetect ML,Rule-based,Transformer NN,Transformer SVC
0,0,Continuez sur la rue Victoria Street.,fr,Navigation,83.333333,66.666667,66.666667,100.000000,83.333333
1,1,Continuez sur la rue Main Street.,fr,Navigation,83.333333,66.666667,66.666667,100.000000,100.000000
2,2,Tournez à droite sur la rue Oak Avenue.,fr,Navigation,100.000000,75.000000,87.500000,87.500000,75.000000
3,3,Continuez tout droit sur la rue Church Street.,fr,Navigation,87.500000,75.000000,87.500000,100.000000,87.500000
4,4,Tournez à gauche sur la rue Sunnydale Drive.,fr,Navigation,75.000000,75.000000,75.000000,100.000000,87.500000
...,...,...,...,...,...,...,...,...,...
320,320,"""Der Mensch ist, was er isst"" afferma Ludwig F...",it,Quotations,62.500000,87.500000,70.833333,95.833333,87.500000
321,321,"Nella celebre opera di Shakespeare, Amleto si ...",it,Quotations,57.692308,57.692308,84.615385,96.153846,96.153846
322,322,"Voltaire ci ricorda che ""Le mieux est l'ennemi...",it,Quotations,68.000000,64.000000,84.000000,96.000000,100.000000
323,323,"Théophile Gautier afferma che ""L'art pour l'ar...",it,Quotations,66.666667,66.666667,66.666667,100.000000,95.238095


In [None]:
# Reshape the DataFrame for the boxplot
melted_df = pd.melt(df_results, id_vars=['Sentence', 'Language', 'Category'],
                    value_vars=['NER', 'Langdetect ML', 'Rule-based', 'Transformer NN', 'Transformer SVC'],
                    var_name='Algorithm', value_name='Precision')

melted_df

Unnamed: 0,Sentence,Language,Category,Algorithm,Precision
0,Continuez sur la rue Victoria Street.,fr,Navigation,NER,83.333333
1,Continuez sur la rue Main Street.,fr,Navigation,NER,83.333333
2,Tournez à droite sur la rue Oak Avenue.,fr,Navigation,NER,100.000000
3,Continuez tout droit sur la rue Church Street.,fr,Navigation,NER,87.500000
4,Tournez à gauche sur la rue Sunnydale Drive.,fr,Navigation,NER,75.000000
...,...,...,...,...,...
1620,"""Der Mensch ist, was er isst"" afferma Ludwig F...",it,Quotations,Transformer SVC,87.500000
1621,"Nella celebre opera di Shakespeare, Amleto si ...",it,Quotations,Transformer SVC,96.153846
1622,"Voltaire ci ricorda che ""Le mieux est l'ennemi...",it,Quotations,Transformer SVC,100.000000
1623,"Théophile Gautier afferma che ""L'art pour l'ar...",it,Quotations,Transformer SVC,95.238095


In [None]:
# Create a boxplot for each algorithm
fig = px.box(melted_df, x='Algorithm', y='Precision', title='Algorithm Comparison', color='Algorithm')

# Show the plot
fig.show()

In [None]:
# Create the box plot using Plotly
fig = px.box(melted_df, x='Category', y='Precision', color='Algorithm')

# Set the title and axis labels
fig.update_layout(title='Precision Comparison of Algorithms by Category',
                  xaxis_title='Category', yaxis_title='Precision')

# Show the plot
fig.show()

In [None]:
# Create the box plot using Plotly
fig = px.box(melted_df, x='Algorithm', y='Precision', color='Language')

# Set the title and axis labels
fig.update_layout(title='Precision Comparison of Algorithms by Language',
                  xaxis_title='Category', yaxis_title='Precision')

# Show the plot
fig.show()