<a href="https://colab.research.google.com/github/datapumpernickel/ep_debate/blob/main/translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Detection of European Parliament Debates

This notebook loads both the sentence-level and the speech-level data. It predicts the language of each speech from its full text and merges the result back to the sentence level. The language is predicted on the full speech text because, in our tests, this proved more reliable: the language detection algorithm has more material to work with, and ambiguous short sentences are embedded in their larger speech context.
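
The idea of predicting once per speech and passing the label on to every sentence can be sketched in plain Python. This is only an illustration: `label_sentences` and `toy_detect` are hypothetical names, the toy detector is a crude stand-in for the real fastText model, and the example texts are invented.

```python
def label_sentences(speeches, sentences, detect):
    """Attach each speech's language to all of its sentences.

    speeches:  dict mapping text_id -> full speech text
    sentences: list of dicts with at least 'text_id' and 'sentence'
    detect:    any callable mapping a text to a language code
    """
    # Predict once per speech, on the full text.
    speech_language = {text_id: detect(text) for text_id, text in speeches.items()}
    # Propagate the speech-level label to every sentence of that speech.
    return [
        {**row, "language": speech_language.get(row["text_id"])}
        for row in sentences
    ]


def toy_detect(text):
    # Hypothetical stand-in for the real model, for illustration only.
    german_markers = {"der", "die", "das", "und", "ist"}
    words = set(text.lower().replace(".", " ").replace(",", " ").split())
    return "deu_Latn" if words & german_markers else "eng_Latn"


# Invented example data: the one-word sentences are ambiguous on their own.
speeches = {
    1: "Das ist richtig. Europa. Die Kommission und der Rat.",
    2: "This is correct. Europe.",
}
sentences = [
    {"text_id": 1, "sentence": "Europa."},
    {"text_id": 2, "sentence": "Europe."},
]
labelled = label_sentences(speeches, sentences, toy_detect)
```

A one-word sentence gives a detector almost nothing to work with, but it inherits a label from the speech it belongs to.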

#### Installation of necessary packages

This code most likely does not run on Windows machines, at least not without some trouble, because fasttext needs a C++ compiler to build.

In [None]:
!pip install fasttext
!pip install -U pip transformers
!pip install sentencepiece

#### Import data from nextcloud repo

In [None]:
import pandas as pd

## speech level data
full_text = pd.read_csv("https://nextcloud.swp-berlin.org/s/wzcrP4zjSCTyWT2/download")

## sentence level data
data = pd.read_csv("https://nextcloud.swp-berlin.org/s/gj7b5xrMNyK8d4A/download")

In [None]:
data.head()

In [None]:
full_text.head()

In [None]:
## drop rows with missing text and reset the index so the loops don't get confused
## drop=True discards the old index instead of keeping it as an extra column
full_text.dropna(subset=['text'], inplace=True)
full_text.reset_index(drop=True, inplace=True)

data.dropna(subset=['sentence'], inplace=True)
data.reset_index(drop=True, inplace=True)

In [None]:
# download the pretrained language identification model (skipped if the file already exists)
!wget -nc https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin
import fasttext

# wget saves into the working directory, which is /content on Colab
pretrained_lang_model = "lid218e.bin" # path of pretrained model file
translation_model = fasttext.load_model(pretrained_lang_model)

In [None]:
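from tqdm import tqdm
tqdm.pandas()  # enables .progress_apply on pandas objects

def custom_language_detection(text):
    """Return the most likely language label (e.g. 'eng_Latn') for a text."""
    # fasttext predicts one line at a time, so replace line breaks with spaces
    text = text.replace("\n"," ")
    # despite its name, translation_model is the language identification model
    predictions = translation_model.predict(text, k=1)
    # predictions is (labels, probabilities); strip the '__label__' prefix
    input_lang = predictions[0][0].replace('__label__', '')
    return input_lang

full_text['language'] = full_text['text'].progress_apply(custom_language_detection)
full_text = full_text[["language","text_id","text"]]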
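
The label parsing can also keep the model's confidence, which helps when judging doubtful predictions later. The following is a minimal sketch, assuming the prediction has the usual fastText shape of a tuple of labels and a sequence of probabilities. `parse_prediction` is a hypothetical helper, and `example_prediction` is a made-up value rather than real model output.

```python
def parse_prediction(prediction, prefix="__label__"):
    """Split a (labels, probabilities) pair into (language, score)."""
    labels, scores = prediction
    label = labels[0]
    # Strip the prefix only where it actually occurs at the start.
    if label.startswith(prefix):
        label = label[len(prefix):]
    return label, float(scores[0])


# Made-up value shaped like the result of model.predict(text, k=1).
example_prediction = (("__label__deu_Latn",), [0.97])
language, score = parse_prediction(example_prediction)
```

With the real model, the same helper could be applied to the result of `translation_model.predict(text, k=1)`, and low scores could be flagged for manual review.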


In [None]:
full_text.tail(10)

#### Visually inspect the results

Clearly, the results are not perfect. There are some odd occurrences of texts supposedly in Korean. However, these do not seem to be important or long speeches, but rather short interventions and procedural information from the parliament.
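
One way to check whether the odd labels sit on short texts is to compare text lengths per detected language. This is a plain-Python sketch on invented toy records. `length_by_language` is a hypothetical helper, and the records are not taken from the debate data.

```python
from collections import defaultdict
from statistics import median


def length_by_language(records):
    """Return {language: (number of texts, median length in characters)}."""
    lengths = defaultdict(list)
    for rec in records:
        lengths[rec["language"]].append(len(rec["text"]))
    return {lang: (len(vals), median(vals)) for lang, vals in lengths.items()}


# Toy records, invented for illustration only.
records = [
    {"language": "eng_Latn", "text": "Madam President, I would like to thank the rapporteur for her work."},
    {"language": "eng_Latn", "text": "The Commission has presented a proposal that deserves our support."},
    {"language": "kor_Hang", "text": "(Applause)"},
]
summary = length_by_language(records)
```

A language with few texts and a very small median length is a likely candidate for misclassified procedural snippets.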

In [None]:
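## merge the speech-level language back onto the sentence-level data
filtered_data = pd.merge(data, full_text[['text_id', 'language']], on='text_id', how='left')

## keep only sentences from speeches that were not detected as English
filtered_data = filtered_data[filtered_data['language'] != 'eng_Latn']

# Keep only specific columns
filtered_data = filtered_data[['text_id', 'session_id', 'id_speaker', 'Sentence_id', 'sentence','language']]

## drop duplicate sentences, then renumber the rows
filtered_data = filtered_data.drop_duplicates(subset='sentence').reset_index(drop=True)
filtered_data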

#### Store results

In [None]:
import os
os.makedirs("04_clean_data", exist_ok=True)  ## make sure the output folder exists
full_text.to_csv("04_clean_data/language_detection.csv")
pd.merge(data, full_text, on='text_id', how='left').to_csv("04_clean_data/language_detection_sentence.csv")