# FELLOWSHIP.AI - EUROPARL DATASET

## INTRODUCTION 

### Project was done as a challenge requirement for the Fellowship.AI machine learning fellow application. Data for the project was collected from 2 places: training data was collected from http://www.statmt.org/europarl/ from the link titled "source releast (text files), 1.5 GB" and the testing data was collected from the website of fellowship.ai, from the challenge tab. Majority of time was spent on data extraction, cleaning and pre-processing. 

## DATA EXTRACTION 

### The source release (text files) data was downloaded in .tgz format. This was unzipped to get the europarl.tar file. Contents of this file were extracted using the python tarfile library. This extraction setup a folder called txt in the working directory. This folder had 21 sub-folders, one for each european language. Each of these sub-folders had thousands of text files. The files contain text from the european parliament in the 21 different languages.

### A small sample of text files were imported for use as training data. Data was extracted using the train_extract and train_extract1 functions. Only difference between the 2 functions is the number of files they import. While train_extract extracts 1000 files, train_extract1 extracts only 10 files from the sub-folder. The 2 different functions were used on different sets of languages. Some languages have text files with many hundreds of lines while some languages have text files with very few lines. The different functions ensure that the training text corpora for different languages have similar distribution of words.

## DATA CLEANUP

### The training dataset contained html type tags <> with text inside them. The text contains chapter names, speaker names etc. The tags and everything inside them were cleaned up using regex. After this, the training data for each language was sentence tokenized. Further, a dataframe with these sentences as one column and the respective language in the other column was created. While the train_extract function was called for each language, I realize that this is an inefficient way to write code. I end up having 21 similar code blocks. I think a better way to do this would have been to use a recursive glob function that would match the name of the sub-folder and the language argument of the train_extract function. The language argument would be a list containing all 21 languages. The function could loop over the sub-folders and check if the sub-folder matches with the language in the list. If they do, the new column would be populated with the name of that respective language. 

### The format of the testing data was a single text file containing text corpora in all 21 languages in alphabetical order of languages. Each sentence, at the start, had a prefix indicating the language of the sentence. The testing data was first sentence tokenized. After this, regex was used to extract each prefix from every sentence. Further, a dataframe was created where one column was the sentences in different languages and the other column contained the language prefix. Finally, NA values were backfilled.

## FEATURE ENGINEERING

### No feature engineering was done since the classifier tries to classify based on language text. Due to the nature of the problem statement, it was felt that feature engineering was not required.

## MACHINE LEARNING

### Naive Bayes classifier was used to build the model. Training data was first run through a count vectorizer. After this, the model was built using the training set. This model was further validated on the holdout set, giving an accuracy of 98.87%. Finally, the model was run on the actual testing set, giving an accuracy of 96.08%. From here, a separate dataset was built by extracting out only the misclassified instances. A misclassification table shows that the highest rate of misclassification happened when the model classified the language as Danish. 

In [1]:
# Function to extract training data, 
# sentence tokenize and build new feature called language, showing the respective european language

import glob
import os
import itertools
import re
import pandas as pd
import numpy as np
import nltk

def train_extract(file_list,lang):
        
    train=[]

    for file_path in file_list[:1000]:
        with open(file_path,encoding='utf8') as f_input:
            train.append(f_input.read())

    train_clean=[re.sub(r'(.*?\<(.*)>.*|\n)','',i).strip() for i in train]

    train_clean=[re.sub(r"'",'',i).strip() for i in train_clean]
        
    train_clean1=''.join(train_clean)

    train_token=pd.Series(nltk.sent_tokenize(train_clean1))

    train_token1=pd.DataFrame(train_token)

    train_token1['Language']=lang

    return train_token1

In [2]:
# Same purpose as above function but imports only 1 file to ensure 
# equal distribution of words across languages

def train_extract1(file_list,lang):
        
    train=[]

    for file_path in file_list[:1]:
        with open(file_path,encoding='utf8') as f_input:
            train.append(f_input.read())

    train_clean=[re.sub(r'(.*?\<(.*)>.*|\n)','',i).strip() for i in train]

    train_clean=[re.sub(r"'",'',i).strip() for i in train_clean]
        
    train_clean1=''.join(train_clean)

    train_token=pd.Series(nltk.sent_tokenize(train_clean1))

    train_token1=pd.DataFrame(train_token)

    train_token1['Language']=lang

    return train_token1

In [3]:
# Extract Bulgarian

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/bg/*.txt', recursive=True)

lang='bg'

train_bg=train_extract(file_list,lang)

In [4]:
# Extract Czech

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/cs/*.txt', recursive=True)

lang='cs'

train_cs=train_extract(file_list,lang)

In [5]:
# Extract Danish

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/da/*.txt', recursive=True)

lang='da'

train_da=train_extract1(file_list,lang)

In [6]:
# Extract German

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/de/*.txt', recursive=True)

lang='de'

train_de=train_extract1(file_list,lang)

In [7]:
# Extract Greek

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/el/*.txt', recursive=True)

lang='el'

train_el=train_extract1(file_list,lang)

In [8]:
# Extract English

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/en/*.txt', recursive=True)

lang='en'

train_en=train_extract1(file_list,lang)

In [9]:
# Extract Spanish

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/es/*.txt', recursive=True)

lang='es'

train_es=train_extract1(file_list,lang)

In [10]:
# Extract Estonian

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/et/*.txt', recursive=True)

lang='et'

train_et=train_extract(file_list,lang)

In [11]:
# Extract Finnish

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/fi/*.txt', recursive=True)

lang='fi'

train_fi=train_extract1(file_list,lang)

In [12]:
# Extract French

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/fr/*.txt', recursive=True)

lang='fr'

train_fr=train_extract1(file_list,lang)

In [13]:
# Extract Hungarian

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/hu/*.txt', recursive=True)

lang='hu'

train_hu=train_extract(file_list,lang)

In [14]:
# Extract Italian

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/it/*.txt', recursive=True)

lang='it'

train_it=train_extract1(file_list,lang)

In [15]:
# Extract Lithuanian

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/lt/*.txt', recursive=True)

lang='lt'

train_lt=train_extract(file_list,lang)

In [16]:
# Extract Latvian

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/lv/*.txt', recursive=True)

lang='lv'

train_lv=train_extract(file_list,lang)

In [17]:
# Extract Dutch

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/nl/*.txt', recursive=True)

lang='nl'

train_nl=train_extract1(file_list,lang)

In [18]:
# Extract Polish

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/pl/*.txt', recursive=True)

lang='pl'

train_pl=train_extract(file_list,lang)

In [19]:
# Extract Portugese

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/pt/*.txt', recursive=True)

lang='pt'

train_pt=train_extract1(file_list,lang)

In [20]:
# Extract Romanian

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/ro/*.txt', recursive=True)

lang='ro'

train_ro=train_extract(file_list,lang)

In [21]:
# Extract Slovak

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/sk/*.txt', recursive=True)

lang='sk'

train_sk=train_extract(file_list,lang)

In [22]:
# Extract Slovene

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/sl/*.txt', recursive=True)

lang='sl'

train_sl=train_extract(file_list,lang)

In [23]:
# Extract Swedish

file_list = glob.glob('C:/Users/Arun/Notebooks/Fellowship.ai/europarl/txt/sv/*.txt', recursive=True)

lang='sv'

train_sv=train_extract1(file_list,lang)

In [24]:
# Concatenate all language training data into one combined training dataframe

train_final=pd.concat([train_bg,train_cs,train_da,train_de,train_el,train_en,train_es,train_et,train_fi,train_fr,train_hu,train_it,train_lt,train_lv,train_nl,train_pl,train_pt,train_ro,train_sk,train_sl,train_sv])

train_final=train_final.rename(columns={0:'Text'})

print(train_final.head())

                                                Text Language
0                          Състав на Парламента: вж.       bg
1  протоколиОдобряване на протокола от предишното...       bg
2             протоколиПроверка на пълномощията: вж.       bg
3                 протоколиВнасяне на документи: вж.       bg
4  протоколиВъпроси с искане за устен отговор и п...       bg


In [25]:
# Count number of sentences in each language of combined training data.
# This is to simply check if the training data has a good representation of all languages.

train_count=train_final.groupby(['Language']).count()

print(train_count)

          Text
Language      
bg         902
cs         481
da         783
de         681
el         579
en         613
es         621
et         463
fi         614
fr         598
hu         499
it         565
lt         872
lv         938
nl         764
pl         531
pt         579
ro         471
sk         475
sl         576
sv         705


In [26]:
# Extract testing data

test=open('europarl_test.txt','rb').read()
test=test.decode("utf-8")

In [27]:
# Sentence tokenize test data

import re
import nltk
import pandas as pd

test1=pd.Series(nltk.sent_tokenize(test))

print(test1.head())

0    ﻿bg\t"Европа 2020" не трябва да стартира нов к...
1    bg\t(CS) Най-голямата несправедливост на сегаш...
2    bg\t(DE) Г-жо председател, г-н член на Комисия...
3    bg\t(DE) Г-н председател, бих искал да започна...
4    bg\t(DE) Г-н председател, въпросът за правата ...
dtype: object


In [28]:
# Extract sentence prefix from beginning of each sentence and build actual output

test_lang=test1.str.extractall('(^bg|^cs|^da|^de|^el|^en|^es|^et|^fi|^fr|^hu|^it|^lt|^lv|^nl|^pl|^pt|^ro|^sk|^sl|^sv)')

test_lang.reset_index(inplace=True)

del test_lang['match']

test_lang=test_lang.rename(columns={0:'Language','level_0':'key'})

#print(test_lang.head())

In [29]:
# Remove language prefix,\t and other unneccessary text

test_text=[pd.Series(re.sub(r"(bg\t)|(cs\t)|(da\t)|(de\t)|(el\t)|(en\t)|(es\t)|(et\t)|(fi\t)|(fr\t)|(hu\t)|(it\t)|(lt\t)|(lv\t)|(nl\t)|(pl\t)|(pt\t)|(ro\t)|(sk\t)|(sl\t)|(sv\t)|(\(BG\))|(\(CS\))|(\(DA\))|(\(DE\))|(\(EL\))|(\(EN\))|(\(ES\))|(\(ET\))|(\(FI\))|(\(FR\))|(\(HU\))|(\(IT)\)|(\(LT)\)|(\(LV\))|(\(NL\))|(\(PL\))|(\(PT\))|(\(RO\))|(\(SK\))|(\(SL\))|(\(SV\))",'',i).strip()) for i in test1] 

#print(test_text)


In [30]:
# Create the dataframe containing text corpora

test_text_df=pd.DataFrame(test_text)

test_text_df.reset_index(inplace=True)

test_text_df=test_text_df.rename(columns={0:'Text','index':'key'})

#print(test_text_df)

In [31]:
# Join language feature to text corpora dataframe

test_final=test_text_df.set_index('key').join(test_lang.set_index('key'))

#print(test_final)

In [32]:
# Backfill NA values of Language feature

test_final1=test_final.copy()

test_final1=test_final1.fillna(method='backfill')
    
print(test_final1.head())
    

                                                  Text Language
key                                                            
0    ﻿"Европа 2020" не трябва да стартира нов конку...       bg
1    Най-голямата несправедливост на сегашната обща...       bg
2    Г-жо председател, г-н член на Комисията, по пр...       bg
3    Г-н председател, бих искал да започна с комент...       bg
4    Г-н председател, въпросът за правата на човека...       bg


In [33]:
# Split training data into training(70%) and holdout(30%)

from sklearn.model_selection import train_test_split


X_train, X_holdout, y_train, y_holdout = train_test_split(train_final['Text'], 
                                                    train_final['Language'], 
                                                    random_state=0,test_size=0.3)

In [34]:
# Count vectorize training and holdout data
# Build multinomial naive bayes classifier using training data
# Fit this classifier on holdout data

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

cv=CountVectorizer()

X_train_transformed=cv.fit_transform(X_train)
X_holdout_transformed=cv.transform(X_holdout)
    
mnb=MultinomialNB(alpha=0.1)
mnb.fit(X_train_transformed,y_train)
y_pred_holdout=mnb.predict(X_holdout_transformed)


In [35]:
# Check if actual values of holdout set and predicted values of holdout set have same length 
# Print accuracy of classifier on holdout set

print(len(y_holdout))
print(len(y_pred_holdout))

print(np.mean(y_holdout==y_pred_holdout))

3993
3993
0.986977210118


In [36]:
# Fit same classifier on testing data

X_test=test_final1['Text']
y_test=test_final1['Language']

X_test_transformed=cv.transform(X_test)

mnb.fit(X_test_transformed,y_test)
y_pred_test=mnb.predict(X_test_transformed)

In [37]:
# Print accuracy of classifier on testing dataset

print(np.mean(y_test==y_pred_test))

0.962791541882


In [38]:
# Try to find instances of misclassification
# On testing set, build new binary column that shows 1 if classifier predicted 
# correctly else shows 0
# Build new dataframe by only extracting misclassified instances of actual and predicted
# language
# Turn that dataframe into a matrix with actual language as rows and misclassified 
# predicted language as column.

ml_df1=pd.DataFrame(y_test)
ml_df2=pd.DataFrame(y_pred_test)

ml_df1=ml_df1.rename(columns={'Language':'Actual Language'})

ml_df2=ml_df2.rename(columns={0:'Predicted Language'})

ml_df=ml_df1.join(ml_df2)

ml_df['Match']=ml_df.apply(lambda x: 1 if x['Actual Language']==x['Predicted Language'] else 0, axis=1)

ml_df_misclassified=ml_df[ml_df['Match']==0]

ml_df_misclassified = ml_df_misclassified.groupby(['Actual Language','Predicted Language']).size().unstack(fill_value=0)

print(ml_df_misclassified)

Predicted Language  bg  cs  da  de  en  es  et  fr  hu  it  lt  lv  nl  pl  \
Actual Language                                                              
bg                   0   0  28   0   0   0   0   0   0   0   0   0   0   0   
cs                   0   0  63   0   2   0   0   0   1   0   2   0   0  15   
da                   0   0   0   2   1   1   0   0   0   0   2   0   0   0   
de                   0   1  11   0   0   1   0   0   0   0   0   0   0   0   
el                   0   0   7   0   0   0   0   0   0   0   0   0   0   0   
en                   0   0   3   1   0   1   1   0   0   0   0   0   0   0   
es                   0   0   6   0   0   0   0   0   0   0   0   0   0   0   
et                   0   0  72   1   4   0   0   0   1   1   1   0   2   0   
fi                   0   0   8   0   0   0   3   0   0   0   0   0   0   0   
fr                   0   0   4   0   0   0   0   0   0   0   0   0   0   0   
hu                   0   0  56   0   0   1   0   0   0   1   1  