The goal of this project is to create a language detection model using scikit-learn and the [European Parliament Proceedings Parallel Corpus](https://statmt.org/europarl/).

General Outline:
1. Download the [Europarl corpus](https://statmt.org/europarl/v7/europarl.tgz)
2. Preprocess the data
3. Train a model
4. Test the model
5. Create a function that allows user input and returns the language of the sample text
6. Next steps and Conclusions

The code will be entirely in Python.


Limitations:
Due to GitHub's file size limits and the memory available on my computer, this repo and model are built from very small samples of the dataset.

In [1]:
#import requests
import pandas as pd
import os
import re

In [2]:
import tarfile

In [None]:
#corpus_url = r'https://statmt.org/europarl/v7/europarl.tgz'

#r = requests.get(corpus_url)

In [12]:
# Opening the corpus
# I downloaded the file manually to speed up the process
# The with-block closes the archive even if extraction fails
with tarfile.open('europarl.tgz') as archive:
    print(archive.getnames())
    archive.extractall('./corpus_folder')

['txt', 'txt/fi', 'txt/fi/ep-11-03-09-010-03.txt', 'txt/fi/ep-09-10-20-015.txt', 'txt/fi/ep-10-06-16-003.txt', 'txt/fi/ep-08-05-19-022.txt', 'txt/fi/ep-08-01-16-007.txt', 'txt/fi/ep-10-10-21-011-02.txt', 'txt/fi/ep-97-01-30.txt', 'txt/fi/ep-11-02-15-022.txt', 'txt/fi/ep-07-06-07-005.txt', 'txt/fi/ep-10-11-10-003.txt', 'txt/fi/ep-10-10-06-002.txt', 'txt/fi/ep-06-12-12-011.txt', 'txt/fi/ep-03-09-04.txt', 'txt/fi/ep-11-01-20-018.txt', 'txt/fi/ep-10-09-22-005-08.txt', 'txt/fi/ep-09-03-12-007-07.txt', 'txt/fi/ep-08-04-22-005-19.txt', 'txt/fi/ep-07-05-22-001.txt', 'txt/fi/ep-06-10-12-007-05.txt', 'txt/fi/ep-07-06-18-005.txt', 'txt/fi/ep-07-11-15-013.txt', 'txt/fi/ep-09-04-22-006-16.txt', 'txt/fi/ep-08-03-11-010-03.txt', 'txt/fi/ep-09-05-06-004-12.txt', 'txt/fi/ep-06-10-26-006-10.txt', 'txt/fi/ep-09-12-14-021.txt', 'txt/fi/ep-07-06-27-003.txt', 'txt/fi/ep-10-03-09-011.txt', 'txt/fi/ep-10-07-08-006-02.txt', 'txt/fi/ep-08-09-24-010-03.txt', 'txt/fi/ep-07-07-12-005.txt', 'txt/fi/ep-07-04-24-007-
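
Since the model is only trained on small samples, it may not be necessary to unpack the whole archive. Below is a minimal sketch of extracting only selected languages, assuming the archive keeps the `txt/<lang>/` layout shown in the output above. `extract_languages` is a hypothetical helper, not part of this notebook.

```python
import tarfile

def extract_languages(archive_path, dest, languages):
    """Extract only the files stored under txt/<lang>/ for the given language labels."""
    prefixes = tuple(f"txt/{lang}/" for lang in languages)
    with tarfile.open(archive_path) as tar:
        # Keep regular files whose archive path starts with one of the prefixes
        members = [m for m in tar.getmembers()
                   if m.isfile() and m.name.startswith(prefixes)]
        tar.extractall(dest, members=members)
    return [m.name for m in members]
```

For example, `extract_languages('europarl.tgz', './corpus_folder', ['en', 'fr'])` would unpack only the English and French folders.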

Guide to the language labels:

| Label | Language |
|----|----|
| bg | Bulgarian |
| cs | Czech |
| da | Danish |
| de | German |
| el | Greek |
| en | English |
| es | Spanish |
| et | Estonian |
| fi | Finnish |
| fr | French |
| hu | Hungarian |
| it | Italian |
| lt | Lithuanian |
| lv | Latvian |
| nl | Dutch |
| pl | Polish |
| pt | Portuguese |
| ro | Romanian |
| sk | Slovak |
| sl | Slovene |
| sv | Swedish |


In [19]:
# Example of one of the txt files within the corpus
# The files are UTF-8 encoded, so set the encoding explicitly
example_path = os.path.join("corpus_folder", "txt", "bg", "ep-07-01-15-011.txt")
with open(example_path, "r", encoding="utf8") as ex_txt:
    print(ex_txt.read())

<CHAPTER ID="011">
Въпроси с искане за устен отговор и писмени декларации (внасяне): вж. протокола



In [None]:
# I want to turn these into lines in a csv file
# That means I'll need to combine all the files into a single txt file
# I then need to remove the <CHAPTER ID> headings
# Then I'll want to remove the punctuation before inputting everything into the model

In [58]:
# Create a new "compiled" folder next to the txt folder
# This is where the combined files will be written
path = os.path.join("corpus_folder", "compiled")

# exist_ok avoids an error when the cell is re-run
os.makedirs(path, exist_ok=True)



In [60]:
# Folder that holds one subfolder per language
directory = os.path.join('corpus_folder', 'txt')

# Iterate over the language folders, skipping anything that is not a folder
for folder in os.scandir(directory):
    if not folder.is_dir():
        continue
    lang = folder.name  # folder names are the two-letter language labels

    # All files for this language are appended to a single output file
    output_path = os.path.join('corpus_folder', 'compiled', lang + '_comp.txt')

    with open(output_path, "a", encoding="utf8") as file2:
        # Iterate over the files in the language folder
        for filename in os.scandir(folder.path):
            if not filename.is_file():
                continue
            print(filename.path)

            with open(filename.path, "r", encoding="utf8") as file1:
                # Write each line on its own line, even if the last one lacks a newline
                file2.writelines(line.rstrip('\n') + '\n' for line in file1)

        


corpus_folder\txt\pt\ep-00-01-17.txt
corpus_folder\txt\pt\ep-00-01-18.txt
corpus_folder\txt\pt\ep-00-01-19.txt
corpus_folder\txt\pt\ep-00-01-20.txt
corpus_folder\txt\pt\ep-00-01-21.txt
corpus_folder\txt\pt\ep-00-02-02.txt
corpus_folder\txt\pt\ep-00-02-03.txt
corpus_folder\txt\pt\ep-00-02-14.txt
corpus_folder\txt\pt\ep-00-02-15.txt
corpus_folder\txt\pt\ep-00-02-16.txt
corpus_folder\txt\pt\ep-00-02-17.txt
corpus_folder\txt\pt\ep-00-02-18.txt
corpus_folder\txt\pt\ep-00-03-01.txt
corpus_folder\txt\pt\ep-00-03-02.txt
corpus_folder\txt\pt\ep-00-03-13.txt
corpus_folder\txt\pt\ep-00-03-14.txt
corpus_folder\txt\pt\ep-00-03-15.txt
corpus_folder\txt\pt\ep-00-03-16.txt
corpus_folder\txt\pt\ep-00-03-17.txt
corpus_folder\txt\pt\ep-00-03-29.txt
corpus_folder\txt\pt\ep-00-03-30.txt
corpus_folder\txt\pt\ep-00-04-10.txt
corpus_folder\txt\pt\ep-00-04-11.txt
corpus_folder\txt\pt\ep-00-04-12.txt
corpus_folder\txt\pt\ep-00-04-13.txt
corpus_folder\txt\pt\ep-00-04-14.txt
corpus_folder\txt\pt\ep-00-05-03.txt
c

In [64]:
# There was an error with "\pl\ep-09-10-22-010.txt",
# so I initially planned to skip the rest of the Polish files.
# It is working now, so I'm not sure what the issue was.

Now that each language folder is combined into a single text file,
I will turn each file into a CSV, and then into a pandas DataFrame.

In [79]:
# I need to remove all markup tags from the text before converting it to a csv
# The cleaned files are written to a separate folder
path = os.path.join("corpus_folder", "cleaned")

# exist_ok avoids an error when the cell is re-run
os.makedirs(path, exist_ok=True)

In [134]:
# This function takes in a compiled txt file, removes the markup tags (e.g. <CHAPTER ID=...>),
# lowercases the text, and writes the result to the cleaned folder
def clean_txt(file):
    with open(file, encoding='utf8') as f:
        in_text = f.read()

    lang = os.path.basename(file)[:2]
    out_file = os.path.join('corpus_folder', 'cleaned', lang + '_clean.txt')
    clean_ex = re.sub(r'<.*?>', '', in_text).lower().strip()

    # Commas could also be removed here to simplify the csv conversion.
    # That step is disabled because punctuation is removed later,
    # during the normalization process
    #clean_ex = re.sub(',', '', clean_ex)

    with open(out_file, 'w', encoding='utf8') as f:
        f.write(clean_ex)
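
To see what the tag-stripping pattern does, here is a small standalone check on a made-up snippet written in the style of the corpus files. The sample string is illustrative only, and `strip_tags` is a hypothetical name for the same three steps used in `clean_txt`.

```python
import re

def strip_tags(text):
    """Remove <...> markup, lowercase the text, and trim surrounding whitespace."""
    return re.sub(r'<.*?>', '', text).lower().strip()

# Made-up sample: two tag lines and two text lines
sample = '<CHAPTER ID="011">\nResumption of the SESSION\n<SPEAKER ID="1">\nThank you.'
print(strip_tags(sample))
```

The non-greedy `.*?` stops at the first `>`, so each tag is removed separately and the text between two tags is kept.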

In [135]:
folder = os.path.join('corpus_folder', 'compiled')

for filename in os.scandir(folder):
    clean_txt(filename.path)
    print(filename.path)
    

corpus_folder\compiled\bg_comp.txt
corpus_folder\compiled\cs_comp.txt
corpus_folder\compiled\da_comp.txt
corpus_folder\compiled\de_comp.txt
corpus_folder\compiled\el_comp.txt
corpus_folder\compiled\en_comp.txt
corpus_folder\compiled\es_comp.txt
corpus_folder\compiled\et_comp.txt
corpus_folder\compiled\fi_comp.txt
corpus_folder\compiled\fr_comp.txt
corpus_folder\compiled\hu_comp.txt
corpus_folder\compiled\it_comp.txt
corpus_folder\compiled\lt_comp.txt
corpus_folder\compiled\lv_comp.txt
corpus_folder\compiled\nl_comp.txt
corpus_folder\compiled\pl_comp.txt
corpus_folder\compiled\pt_comp.txt
corpus_folder\compiled\ro_comp.txt
corpus_folder\compiled\sk_comp.txt
corpus_folder\compiled\sl_comp.txt
corpus_folder\compiled\sv_comp.txt


In [2]:
# Folder for the per-language csv samples
path = os.path.join("corpus_folder", "csv")

# Create the directory; exist_ok makes this safe to re-run
os.makedirs(path, exist_ok=True)


In [3]:
import warnings
warnings.simplefilter("ignore")

In [4]:
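folder = os.path.join('corpus_folder', 'cleaned')

# Take a random sample of 300 lines from each language
for filename in os.scandir(folder):
    lang = filename.name[:2]

    # '<>' is not expected to occur in the cleaned text, so each line is read as a single column
    # A multi-character separator requires the python parsing engine
    df = pd.read_table(filename.path, sep='<>', header=None, engine='python', encoding='utf8')
    df.columns = ['Text']
    df['Language'] = lang

    subset = df.sample(n=300)

    # `path` is the csv folder defined above
    out_path = os.path.join(path, lang + '.csv')
    subset.to_csv(out_path)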
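
An alternative to the unusual separator is to read the lines directly and build the DataFrame from a list. This is a minimal sketch, assuming one sentence per line in each cleaned file. `sample_lines` and the fixed `seed` are my own additions for illustration.

```python
import pandas as pd

def sample_lines(path, lang, n=300, seed=0):
    """Load the non-empty lines of a text file and return a random sample of up to n rows."""
    with open(path, encoding="utf8") as f:
        lines = [line.strip() for line in f if line.strip()]
    df = pd.DataFrame({"Text": lines, "Language": lang})
    # Cap n at the number of available rows so short files do not raise an error
    return df.sample(n=min(n, len(df)), random_state=seed)
```

Fixing `random_state` means the same rows are drawn on every run, which makes the later steps easier to reproduce.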

In [5]:
# Begin here when re-running
folder = os.path.join('corpus_folder', 'csv')
file_names = os.listdir(folder)

I'm combining random samples from each language,
then removing unnecessary columns, shuffling the new dataframe, and setting a new index.

In [6]:
df1 = pd.concat(
    map(pd.read_csv, ('./corpus_folder/csv/'+i for i in file_names)), ignore_index=True)

In [7]:
df1 = df1.drop('Unnamed: 0', axis=1)

In [8]:
df1 = df1.sample(frac=1)
df1 = df1.reset_index(drop=True)

In [9]:
df1.head()

Unnamed: 0,Text,Language
0,"în scris. - după cum ne-am temut, votul strâns...",ro
1,"certes, la crise économique générale aggrave l...",fr
2,"oczywiście jestem całkowicie przekonany, że ws...",pl
3,"we therefore have some great challenges here, ...",en
4,o parlamento também assumiu uma posição favorá...,pt


Now I'll prepare the dataframe for model training.

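I decided against Spark NLP because I wasn't sure how to use it. Note that the model cells below use scikit-learn (`LabelEncoder`, `CountVectorizer`, `MultinomialNB`) rather than NLTK.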

In [10]:
x = df1["Text"]
y = df1["Language"]

In [11]:
# Label Encoding
# Converting the names of languages to a numerical form
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)


In [12]:
# Further Text Preprocessing
# creating a list for appending the preprocessed text
data_list = []
# iterating through all the text
for text in x:
    # removing symbols, newline characters, and numbers
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    # removing square brackets
    text = re.sub(r'[\[\]]', ' ', text)
    # converting the text to lower case
    text = text.lower()
    # appending to data_list
    data_list.append(text)
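
Wrapping the cleaning step in a function would make it possible to apply the same normalization to new text at prediction time. A sketch, with `normalize` as a hypothetical helper name:

```python
import re

def normalize(text):
    """Replace digits, newlines, square brackets, and common symbols with spaces, then lowercase."""
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9\[\]]', ' ', text)
    return text.lower()

# Each removed character becomes a space, so the words stay separated
print(normalize('Hello, World [2024]!').split())
```

Letters, including accented and non-Latin ones, are left untouched, so only the symbols listed in the character class are affected.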

In [13]:
# Bag of words
# Turning the text into a numerical form

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# .toarray() converts the sparse count matrix into a dense array
X = cv.fit_transform(data_list).toarray()


# The dense array needs a huge amount of memory. Approximate size by sample size:
# 15,000 -> 1.49 TiB
# 7,000 -> 502 GB
# 1,000 -> 26.5 GB
# 700 -> 15.2 GB
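
The memory figures above come from the dense array produced by `.toarray()`. `CountVectorizer` returns a SciPy sparse matrix by default, and `MultinomialNB` accepts sparse input, so one option is to skip the conversion. A minimal sketch on toy sentences (the texts and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["hello my name is john", "salut je m'appelle john",
         "good morning everyone", "bonjour tout le monde"]
labels = ["en", "fr", "en", "fr"]

vectorizer = CountVectorizer()
# The result stays sparse: only the non-zero counts are stored
X_sparse = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X_sparse, labels)  # no .toarray() needed

print(type(X_sparse).__name__, X_sparse.shape)
print(clf.predict(vectorizer.transform(["bonjour tout le monde"])))
```

One trade-off is that `numpy.savetxt` expects a dense array, so saving a sparse split would need a different format, for example `scipy.sparse.save_npz`.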

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
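
Because the split is random, re-running the cell produces a different partition each time. Passing `random_state` fixes the shuffle, and `stratify` keeps the class proportions the same in the training and test sets. A sketch on toy arrays (the seed value 42 is arbitrary, and `split` is a hypothetical wrapper):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and the encoded labels
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

def split(seed=42):
    # stratify=y_demo keeps the class balance the same in train and test
    return train_test_split(X_demo, y_demo, test_size=0.20,
                            random_state=seed, stratify=y_demo)

first = split()
second = split()
# With the same seed, both calls return the same partition
print(all(np.array_equal(a, b) for a, b in zip(first, second)))
```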

In [16]:
# Training a multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)

In [None]:
# Folder for the saved train-test split
path = os.path.join("corpus_folder", "tts")

# exist_ok avoids an error when the cell is re-run
os.makedirs(path, exist_ok=True)

Saving the train-test split for model recreation if necessary

In [52]:
from numpy import savetxt

In [53]:
# Save these values to a file, because if I run the sampler again,
# the data will be shuffled randomly


savetxt('./corpus_folder/tts/x_train.txt', x_train, delimiter=',', fmt='%s', encoding='utf8')
savetxt('./corpus_folder/tts/y_train.txt', y_train, delimiter=',', fmt='%s', encoding='utf8')
savetxt('./corpus_folder/tts/x_test.txt', x_test, delimiter=',', fmt='%s', encoding='utf8')
savetxt('./corpus_folder/tts/y_test.txt', y_test, delimiter=',', fmt='%s', encoding='utf8')

In [None]:
# This is how you can read the data back in
# (loadtxt takes dtype, not fmt; the saved values are integers)
# from numpy import loadtxt
# x_train = loadtxt('./corpus_folder/tts/x_train.txt', delimiter=',', dtype=int, encoding='utf8')
# y_train = loadtxt('./corpus_folder/tts/y_train.txt', delimiter=',', dtype=int, encoding='utf8')
# x_test = loadtxt('./corpus_folder/tts/x_test.txt', delimiter=',', dtype=int, encoding='utf8')
# y_test = loadtxt('./corpus_folder/tts/y_test.txt', delimiter=',', dtype=int, encoding='utf8')

In [28]:
y_pred = model.predict(x_test)

Measure the model accuracy, recall, precision, and confusion matrix

In [41]:
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, precision_score
ac = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
rc = recall_score(y_test, y_pred, average='micro')
pr = precision_score(y_test, y_pred, average='micro')


In [42]:
print("Accuracy is :",ac)
print(cm)
print(rc)
print(pr)

Accuracy is : 0.9976190476190476
[[67  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0 64  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0 57  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0 59  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0 62  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0 51  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0 57  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0 61  0  1  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0 65  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0 62  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0 59  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0 58  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0 61  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0 59  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  1  0  0  0  0  0
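
With `average='micro'` on a single-label multiclass problem, precision and recall both reduce to accuracy, so the three scores computed above are the same quantity. Macro-averaged or per-class scores give a more informative view. A small sketch on made-up labels (these are not the notebook's results):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

# Made-up labels: one French sample is misclassified as Spanish
y_true = ["en", "en", "fr", "fr", "es", "es"]
y_hat = ["en", "en", "fr", "es", "es", "es"]

acc = accuracy_score(y_true, y_hat)
micro_precision = precision_score(y_true, y_hat, average="micro")
micro_recall = recall_score(y_true, y_hat, average="micro")
macro_precision = precision_score(y_true, y_hat, average="macro")
macro_recall = recall_score(y_true, y_hat, average="macro")

print(acc, micro_precision, micro_recall)   # all three are identical
print(macro_precision, macro_recall)        # these differ from accuracy
print(classification_report(y_true, y_hat, zero_division=0))
```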

Save the model for later

In [21]:
import pickle

In [22]:
# Folder for the saved models
path = os.path.join("corpus_folder", "models")

# exist_ok avoids an error when the cell is re-run
os.makedirs(path, exist_ok=True)

In [24]:
# Save the model into a pickle file
model_path = os.path.join("corpus_folder", "models", "model1.p")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

In [25]:
# Load the model back from the pickle file
# with open(model_path, "rb") as f:
#     model1 = pickle.load(f)
# y_pred = model1.predict(x_test)
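
Classifying new text also requires the fitted `CountVectorizer` (`cv`) and `LabelEncoder` (`le`), so the pickled classifier on its own is not enough to reproduce predictions in a fresh session. One option is to bundle the vectorizer and classifier in a scikit-learn `Pipeline` and pickle that single object. A sketch on toy data, writing to a temporary folder (the file name `pipeline.p` is an assumption):

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["hello my name is john", "salut je m'appelle john",
         "good morning everyone", "bonjour tout le monde"]
labels = ["en", "fr", "en", "fr"]

# String labels are accepted directly, so no separate encoder has to be saved
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

pipeline_path = os.path.join(tempfile.mkdtemp(), "pipeline.p")
with open(pipeline_path, "wb") as f:
    pickle.dump(pipeline, f)

with open(pipeline_path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(["bonjour tout le monde"]))
```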

Create a function that takes in user input

In [44]:
def predict(text):
    x = cv.transform([text]).toarray()  # converting text to a bag-of-words vector
    lang = model.predict(x)  # predicting the language
    lang = le.inverse_transform(lang)  # finding the language label corresponding to the predicted value
    print("The language is in", lang[0])  # printing the language label

| Label | Language |
|----|----|
| bg | Bulgarian |
| cs | Czech |
| da | Danish |
| de | German |
| el | Greek |
| en | English |
| es | Spanish |
| et | Estonian |
| fi | Finnish |
| fr | French |
| hu | Hungarian |
| it | Italian |
| lt | Lithuanian |
| lv | Latvian |
| nl | Dutch |
| pl | Polish |
| pt | Portuguese |
| ro | Romanian |
| sk | Slovak |
| sl | Slovene |
| sv | Swedish |

In [45]:
predict('Hello, my name is John')

The language is in en


In [46]:
predict('Salut, je m\'appelle John')

The language is in fr


In [50]:
# This example was initially misclassified, and required a second sentence before being recognized correctly.
predict('Holà me llamo John. Tengo veinte años')

The language is in es
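
Short inputs give the classifier very few words to work with, which may be why the Spanish example needed a second sentence. Returning the class probability alongside the label makes that uncertainty visible. A sketch using `predict_proba` on a toy model, where `predict_with_confidence` is a hypothetical helper and the training sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["hello my name is john", "salut je m'appelle john",
         "good morning everyone", "bonjour tout le monde"]
labels = ["en", "fr", "en", "fr"]

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

def predict_with_confidence(vectorizer, clf, text):
    """Return the most likely label and its estimated probability."""
    probs = clf.predict_proba(vectorizer.transform([text]))[0]
    best = probs.argmax()
    return clf.classes_[best], float(probs[best])

print(predict_with_confidence(vectorizer, clf, "bonjour tout le monde"))
# Words the model has never seen leave only the class priors to decide
print(predict_with_confidence(vectorizer, clf, "xyz"))
```

A low probability could then be used to ask the user for a longer sample instead of returning a guess.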


# Next Steps

* The main limitation of this project was the lack of memory on my device. A workaround would be to run this code on a cloud virtual machine with more memory.
* The dataset consists of speech from a political context, which limits how well the model applies to text from other contexts.
* The accuracy, precision, and recall of the model are remarkably high, at >0.997 for all three metrics. I am concerned that the model is overfit to the data.
  * A potential fix would be to increase the size of the samples.
* I am concerned that including languages with starkly different alphabets also contributes to the overfitting. For certain languages, all the model needs to do is look for a few specific characters to rule out the other languages.
  * In the future, I would try to expand beyond the current 21 languages, focusing on additional languages that use the same or similar alphabets.
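
One way to probe the overfitting concern is k-fold cross-validation, which scores the model on several different held-out splits instead of a single one. The sketch below uses toy sentences; on the real data, `texts` and `labels` would be the `Text` and `Language` columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data: six made-up sentences per language
texts = ["hello my name is john", "good morning everyone", "thank you very much",
         "how are you today", "the weather is nice", "see you tomorrow",
         "salut je m'appelle john", "bonjour tout le monde", "merci beaucoup",
         "comment allez vous", "il fait beau", "a demain"]
labels = ["en"] * 6 + ["fr"] * 6

# With the vectorizer inside the pipeline, the vocabulary is learned
# from the training folds only, never from the held-out fold
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, texts, labels, cv=3)

print(scores, scores.mean())
```

If the scores vary widely between folds, or drop well below the single-split accuracy, that would support the overfitting concern.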