# EA Assignment - Project Overview
__Authored by: Álvaro Bartolomé del Canto (alvarobartt @ GitHub)__

---

<img src="https://media-exp1.licdn.com/dms/image/C561BAQFjp6F5hjzDhg/company-background_10000/0?e=2159024400&v=beta&t=OfpXJFCHCqdhcTu7Ud-lediwihm0cANad1Kc_8JcMpA">

## Roadmap

Before going through the explanation and conclusions of every NLP task researched and developed for the project, we start with the roadmap, from the start date on Friday, July 31, to the end date on Tuesday, August 4.

<img src="https://i.ibb.co/y4sGM4y/roadmap.png">

## Exploratory Data Analysis

Before starting any NLP project, we first need to explore and understand the data we have 
so as to decide how we are going to tackle the problem we are facing.

We can see the dataset statistics from the GitHub repository [FerreroJeremy/Cross-Language-Dataset](https://github.com/FerreroJeremy/Cross-Language-Dataset):

Sub-corpus | Alignment | Authors | Translations | Translators | Alteration | NE (%)
--- | --- | --- | --- | --- | --- | --- |
__Wikipedia__ | Comparable | Anyone | - | - | Noise | 8.37
__PAN-PC-11__ |  Parallel |  Professional authors | Human | Professional | Yes | 3.24
__APR (Amazon Product Reviews)__ | Parallel | Anyone | Machine | Google Translate | No | 6.04
__Conference papers__ | Comparable | Computer scientists | Human | Computer scientists | Noise | 9.36

During the EDA it is common to plot several features to gain insight into how the data is
structured across the documents, and thus to find a suitable way to tackle the problem and the upcoming NLP
steps. Some visualizations are provided below and explained right after:

<img src="https://i.ibb.co/f2Wddzz/eda-plots.png">

In this case, we plotted the distribution of the documents per context and language, and the median length of
the documents per context. The plots show that Wikipedia is the most populated context and that French is the 
language with the most texts. Both the APR and the Conference papers texts are the shortest in characters, while 
the PAN11 texts fall between Wikipedia and the other two contexts.
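
The statistics behind such plots can be computed with a few lines of standard-library Python. The following is only a minimal sketch: the `documents` list is a hypothetical toy corpus standing in for the real dataset, and the language codes are assumed labels.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical toy corpus: (context, language, text) triples standing in
# for the real documents of the Cross-Language-Dataset
documents = [
    ("wikipedia", "fr", "Electronic Arts est une société américaine de jeux vidéo."),
    ("wikipedia", "en", "Electronic Arts is an American video game company."),
    ("wikipedia", "fr", "Le siège se situe à Redwood City."),
    ("apr", "en", "Great game."),
    ("apr", "es", "Muy buen juego."),
    ("pan11", "es", "Era una tarde tranquila de verano."),
]

# Number of documents per context and per language
docs_per_context = Counter(context for context, _, _ in documents)
docs_per_language = Counter(language for _, language, _ in documents)

# Median document length (in characters) per context
lengths = defaultdict(list)
for context, _, text in documents:
    lengths[context].append(len(text))
median_length = {context: median(values) for context, values in lengths.items()}

print(docs_per_context.most_common())
print(docs_per_language.most_common())
print(median_length)
```

The resulting counters and the `median_length` dictionary can then be passed to any plotting library to draw the bar charts.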

## Text Preprocessing

__When it comes to NLP, data preprocessing is one of the most important tasks, if not the most important__, since it
adds value to the raw data.

For this project, since we are facing a Multi-Lingual Multi-Context dataset, we need to develop
a custom preprocessor which preprocesses the texts regardless of the language (English, French or Spanish)
and which also includes some more specific preprocessing related to the different contexts.

The steps towards a proper preprocessing are defined as follows:

1. __Clean Tabs and Line Breaks__: line breaks and tabs are common in text, so we replace them 
by a space to make sure that removing them does not join different words together.
2. __Convert to Unidecode__: so as to unify all the data, convert every str to unidecode, which replaces
the accented vowels by their regular unaccented form, etc.
3. __Substitute Regular Expressions__: from a given collection of regular expressions, every match of 
any of them in the text is replaced by a space and thereby removed.
4. __Lower Case__: convert every str to lower case, so that the same word with different capitalizations 
is identified as the same word, since all the characters will match.
5. __Split by Apostrophes__: since both English and French use the apostrophe to abbreviate text, words are 
split at the apostrophe, if found, so as to obtain two separate words from the apostrophe-joined word.
6. __Remove Small Words__: a threshold has been set so as to remove the words with fewer than 3 characters, 
since those words do not provide any useful information to the models we need to train.
7. __Remove Stopwords__: the default stopwords of every language are removed, together with 
some additional stopwords manually identified per language and context, so as 
to provide a complete and specific stopword removal.
8. __Remove Extra Spaces__: as every regular expression match and unknown character has been replaced by a space, 
multiple spaces are now substituted by a single space so as to return a str which is indeed a 
space-separated list of tokens.
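
The eight steps above can be condensed into a single function. This is only a sketch under stated assumptions, not the project's preprocessor: `TOY_STOPWORDS` and `MIN_WORD_LENGTH` are illustrative names, a single simplified regular expression stands in for the collection of patterns, and `strip_accents` is a standard-library stand-in for `unidecode`.

```python
import re
import unicodedata

# Assumptions: a tiny illustrative stopword list and a 3-character threshold
TOY_STOPWORDS = {"the", "and", "est", "une", "los", "las"}
MIN_WORD_LENGTH = 3


def strip_accents(text):
    """Stdlib stand-in for unidecode: drops combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    text = text.replace("\t", " ").replace("\n", " ")  # 1. tabs and line breaks
    text = strip_accents(text)                           # 2. unaccented form
    text = re.sub(r"[^\w\s']|\d+", " ", text)            # 3. punctuation and numbers
    text = text.lower()                                  # 4. lower case
    text = text.replace("'", " ")                        # 5. split by apostrophes
    tokens = [
        word for word in text.split()                    # 8. split() drops extra spaces
        if len(word) >= MIN_WORD_LENGTH                  # 6. remove small words
        and word not in TOY_STOPWORDS                    # 7. remove stopwords
    ]
    return " ".join(tokens)


print(preprocess("L'école est\tfermée: 2 jours!"))
```

Calling `str.split()` without arguments covers the last step for free, since it discards any run of whitespace.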

In [1]:
from unidecode import unidecode

In [2]:
import re

URL_PATTERN = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
HTML_PATTERN = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
PUNCTUATION_PATTERN = re.compile(r'[^\w\s]')
NUMBER_PATTERN = re.compile(r'[\d]+')
SPACES_PATTERN = re.compile(r'[ ]{2,}')

BASE_PATTERNS = (
    URL_PATTERN, HTML_PATTERN, PUNCTUATION_PATTERN,
    NUMBER_PATTERN, SPACES_PATTERN
)

In [3]:
from nltk.corpus import stopwords

spanish_stopwords = stopwords.words('spanish')
english_stopwords = stopwords.words('english')
french_stopwords = stopwords.words('french')

STOPWORDS = english_stopwords + spanish_stopwords + french_stopwords

ADDITIONAL_STOPWORDS = [
    'much', 'despues', 'first', 'categoria', 'aqui', 'thumb', 'also', 'tres', 'asi', 
    'three', 'one', 'still', 'aquella', 'like', 'aquel', 'mas', 'tal', 'tan', 'hacia', 
    'went', 'two', 'new', 'even', 'would', 'tras', 'could', 'pues', 'without', 'category', 
    'many', 'twoone', 'tambien', 'well', 'solo', 'dos'
]

STOPWORDS += ADDITIONAL_STOPWORDS
STOPWORDS = set(STOPWORDS)

<img src="https://lh3.googleusercontent.com/proxy/9k4nmTrJg_hynJgjFrZSE6UBwY0tWprel-V8TUTbuS-8G7rapNbfogDYt0KWSYZvPZmSwZPznT2asEZTr9uztYNGm-Y0W1GcXBY2YTU">

In [4]:
class CustomPreProcessor(object):
    """
    Custom PreProcessor

    Preprocesses the introduced raw text to transform it into clean text. This
    preprocessing pipe is regex based.

        >>> from apinlp.nlp.preprocessing import CustomPreProcessor
        >>> preprocessor = CustomPreProcessor()
        >>> print(preprocessor._preprocess("Visit us at https://www.ea.com/"))
        visit us
    """
    
    def __init__(self, strip_accents=True):
        self.strip_accents = strip_accents
        
        self.patterns = BASE_PATTERNS
        self.additional_patterns = (SPACES_PATTERN,)

        self.stopwords = STOPWORDS
    
    def _preprocess(self, text):
        """Cleans and applies a preprocessing layer to raw text"""
        text = text.replace('\t', ' ').replace('\n', ' ')
        
        if self.strip_accents:
            text = unidecode(text)

        for pattern in self.patterns:
            text = pattern.sub(' ', text)

        text = text.strip().lower()
        text = text.replace("'", " ")
        
        # Drop stopwords in a single pass; set membership makes each lookup O(1)
        stopwords = {word.lower() for word in self.stopwords}
        text = ' '.join(word for word in text.split(' ') if word not in stopwords)
            
        for pattern in self.additional_patterns:
            text = pattern.sub(' ', text)
    
        return text

In [6]:
preprocessor = CustomPreProcessor()

In [7]:
preprocessor._preprocess(text="Visit us at https://www.ea.com/")

'visit us'

In [8]:
preprocessor._preprocess(text="Visítanos en https://www.ea.com/")

'visitanos'

In [9]:
preprocessor._preprocess(text="Visitez-nous sur https://www.ea.com/")

'visitez'

Finally, we can see an example of how the WordClouds improve with the preprocessed 
data compared to the raw one.

<img src="https://i.ibb.co/N1mJpPb/wordcloud-comparison.png">

## Text Classification Model

We are facing an NLP Text Classification problem which consists of classifying multilingual data into its context
regardless of the language in which the text is written.

First of all, we need to define a vectorizer to transform the (already preprocessed) input text into a vector, 
and then train a model fitted on those vectors. In this case we use the TF-IDF Vectorizer, which is well suited 
to this problem: it weights the number of occurrences of each word inside a document against the number of 
documents in which that word appears, so as to measure how relevant a word is to a document and later predict 
the context into which that piece of text should be classified.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

In [11]:
texts = [
    'i would love to work at ea',
    'me encantaria trabajar en ea',
    'je adorerais travailler chez ea'
]

matrix = vectorizer.fit_transform(texts)

In [12]:
import pandas as pd

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(matrix.todense(), index=texts, columns=vectorizer.get_feature_names_out())

Text | adorerais | at | chez | ea | en | encantaria | je | love | me | to | trabajar | travailler | work | would
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
i would love to work at ea | 0.0 | 0.432385 | 0.0 | 0.255374 | 0.0 | 0.0 | 0.0 | 0.432385 | 0.0 | 0.432385 | 0.0 | 0.0 | 0.432385 | 0.432385
me encantaria trabajar en ea | 0.0 | 0.0 | 0.0 | 0.283217 | 0.479528 | 0.479528 | 0.0 | 0.0 | 0.479528 | 0.0 | 0.479528 | 0.0 | 0.0 | 0.0
je adorerais travailler chez ea | 0.479528 | 0.0 | 0.479528 | 0.283217 | 0.0 | 0.0 | 0.479528 | 0.0 | 0.0 | 0.0 | 0.0 | 0.479528 | 0.0 | 0.0
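
To make the weights above less of a black box, they can be recomputed by hand. The sketch below assumes the default `TfidfVectorizer` settings (tokens of two or more word characters, smoothed IDF and L2 normalisation) and uses only the standard library.

```python
import math
import re
from collections import Counter

texts = [
    "i would love to work at ea",
    "me encantaria trabajar en ea",
    "je adorerais travailler chez ea",
]

# Same default tokenization as TfidfVectorizer: tokens of 2+ word characters,
# which is why the single-letter word "i" has no column
tokenized = [re.findall(r"\b\w\w+\b", text.lower()) for text in texts]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    counts = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, tf in counts.items()
    }
    # L2-normalise so that every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


first = tfidf(tokenized[0])
print(round(first["love"], 6), round(first["ea"], 6))
```

The word `ea` appears in all three documents, so its IDF is the minimum possible and it ends up with the lowest weight in every row, while words unique to one document get the highest.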


Once the vectorization is completed, we need to decide which classification model to use, depending 
on both the scope and the model's requirements/limitations. In this case, we tested several 
classification models over random stratified folds to see which of them performed better.

<img src="https://i.ibb.co/3fKmZ6w/text-classification-models.png">

After training several classification models over random stratified shuffle folds, we
decided to proceed with the `LinearSVC` model, since it seemed to be the most consistent one in both time and
accuracy. The resulting Pipeline looks as follows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=5)),
    ('clf', LinearSVC())
])
```
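
The comparison over random stratified shuffle folds mentioned earlier could be sketched as follows. The toy corpus and the list of candidate classifiers are assumptions made for illustration only; the models actually compared in the project are the ones shown in the figure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (already preprocessed) with two contexts
texts = [
    "great product fast delivery", "bad product broke quickly",
    "love this product works great", "cheap product poor quality",
    "excellent quality product recommended", "terrible product waste money",
    "city founded century war", "government world war history",
    "album released band music", "film released century city",
    "football player team game", "river located north city",
]
labels = ["apr"] * 6 + ["wikipedia"] * 6

# Assumed candidates, chosen only to illustrate the comparison
candidates = {
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Random stratified shuffle folds: each fold keeps the label proportions
folds = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=42)

scores = {}
for name, classifier in candidates.items():
    model = Pipeline([("vect", TfidfVectorizer()), ("clf", classifier)])
    fold_scores = cross_val_score(model, texts, labels, cv=folds, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Wrapping the vectorizer and the classifier in one `Pipeline` ensures the vocabulary is fitted only on the training part of every fold.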

Since we used a LabelEncoder to transform the target variable, i.e. the document's context, we need to retrieve the dictionary which maps each assigned/encoded ID to the real value of the context. This is required since the model predicts the int value instead of the categorical value (str), so in order to interpret the results we need to undo/revert the encoding.

In [13]:
import json

with open('../research/resources/id2context.json', 'r') as f:
    ID2CONTEXT = json.load(f)
    
ID2CONTEXT

{'3': 'wikipedia', '1': 'conference_papers', '0': 'apr', '2': 'pan11'}
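
The IDs in this dictionary follow the sorted order of the context names, which is how `LabelEncoder` assigns them. How the JSON file was actually generated is not shown here, so the following is only a plausible sketch of building and inverting such a mapping; it also shows why the predicted int has to be converted to `str` before the lookup.

```python
import json

# The four contexts of the dataset, as they appear in ID2CONTEXT
contexts = ["wikipedia", "pan11", "apr", "conference_papers", "wikipedia"]

# Like LabelEncoder, assign the IDs following the sorted order of the classes
classes = sorted(set(contexts))
context2id = {context: idx for idx, context in enumerate(classes)}
id2context = {idx: context for context, idx in context2id.items()}

# JSON object keys are always strings, so the int IDs come back as str
restored = json.loads(json.dumps(id2context))

predicted_id = 3  # e.g. the int returned by the classifier
print(restored[str(predicted_id)])
```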

Finally, we just load the trained pipeline from the .joblib file where it was previously dumped. Since the pipeline already includes both the vectorizer and the classifier, there is no need to import any other resource so as to test the trained model.

In [14]:
from joblib import load

pipeline = load('../research/resources/text-classification-pipeline.joblib')

Once all the required resources are loaded, we just need to retrieve raw data from any of the available contexts and test both the preprocessing and the text classification pipeline with it.

__Note__: so as to test it we will be using some pieces of text from Wikipedia written in English, French and Spanish; but you can play around with the text values so as to create your own texts in order to manually evaluate the text classification model.

### Spanish Wikipedia

In [15]:
text = """
Electronic Arts Inc. (EA) es una empresa estadounidense desarrolladora y distribuidora de videojuegos para ordenador y videoconsolas, fundada por Trip Hawkins.

Sus oficinas centrales están en Redwood City, California. Tiene estudios en varias ciudades de Estados Unidos, en Canadá, Suecia, Corea del Sur, China e Inglaterra. Posee diversas subsidiarias, como EA Sports, encargada de los simuladores deportivos, EA Games para los demás juegos, y subsidiarias adquiridas durante el tiempo como Maxis, entre otras. Electronics Arts también posee la mayor distribución del mundo en este sector, con oficinas en países como Brasil, Polonia y República Checa.

Actualmente, desarrolla y publica juegos que incluyen los títulos de EA Sports FIFA, Madden NFL, NHL, NBA Live y UFC. Otras franquicias establecidas por EA incluyen Battlefield, Need for Speed, Los Sims, Medal of Honor, Command & Conquer, así como nuevas franquicias como Dead Space, Mass Effect, Dragon Age, Army of Two, Titanfall y Star Wars: The Old Republic. Sus títulos de escritorio aparecen en Origin, una plataforma de distribución digital de juegos en línea para ordenadores.

Actualmente es la segunda third-party más importante de la industria de los Videojuegos, con un valor de mercado de 33 mil millones de dólares.7
"""

In [16]:
preprocessed_text = preprocessor._preprocess(text=text)
preprocessed_text

'electronic arts inc ea empresa estadounidense desarrolladora distribuidora videojuegos ordenador videoconsolas fundada trip hawkins oficinas centrales estan redwood city california estudios varias ciudades unidos canada suecia corea china inglaterra posee diversas subsidiarias ea sports encargada simuladores deportivos ea games demas juegos subsidiarias adquiridas tiempo maxis electronics arts posee mayor distribucion mundo sector oficinas paises brasil polonia republica checa actualmente desarrolla publica juegos incluyen titulos ea sports fifa madden nfl nhl nba live ufc franquicias establecidas ea incluyen battlefield need speed sims medal honor command conquer nuevas franquicias dead space mass effect dragon age army titanfall star wars old republic titulos escritorio aparecen origin plataforma distribucion digital juegos linea ordenadores actualmente segunda third party importante industria videojuegos valor mercado mil millones dolares'

In [17]:
ID2CONTEXT[str(pipeline.predict([preprocessed_text])[0])]

'wikipedia'

### English Wikipedia

In [18]:
text = """
Electronic Arts Inc. (EA) is an American video game company headquartered in Redwood City, California. It is the second-largest gaming company in the Americas and Europe by revenue and market capitalization after Activision Blizzard and ahead of Take-Two Interactive and Ubisoft as of March 2018.[4]

Founded and incorporated on May 27, 1982, by Apple employee Trip Hawkins, the company was a pioneer of the early home computer games industry and was notable for promoting the designers and programmers responsible for its games. EA published numerous games and productivity software for personal computers and later experimented on techniques to internally develop games, leading to the 1987 release of Skate or Die!.

Currently, EA develops and publishes games of established franchises, including Battlefield, Need for Speed, The Sims, Medal of Honor, Command & Conquer, Dead Space, Mass Effect, Dragon Age, Army of Two, Titanfall, and Star Wars, as well as the EA Sports titles FIFA, Madden NFL, NBA Live, NHL, and EA Sports UFC.[5] Their desktop titles appear on self-developed Origin, an online gaming digital distribution platform for PCs and a direct competitor to Valve's Steam and Epic Games' Store. EA also owns and operates major gaming studios such as EA Tiburon in Orlando, EA Vancouver in Burnaby, DICE in Sweden and Los Angeles, BioWare in Edmonton and Austin, and Respawn Entertainment in Los Angeles.[6] 
"""

In [19]:
preprocessed_text = preprocessor._preprocess(text=text)
preprocessed_text

'electronic arts inc ea american video game company headquartered redwood city california second largest gaming company americas europe revenue market capitalization activision blizzard ahead take interactive ubisoft march founded incorporated may apple employee trip hawkins company pioneer early home computer games industry notable promoting designers programmers responsible games ea published numerous games productivity software personal computers later experimented techniques internally develop games leading release skate die currently ea develops publishes games established franchises including battlefield need speed sims medal honor command conquer dead space mass effect dragon age army titanfall star wars ea sports titles fifa madden nfl nba live nhl ea sports ufc desktop titles appear self developed origin online gaming digital distribution platform pcs direct competitor valve steam epic games store ea owns operates major gaming studios ea tiburon orlando ea vancouver burnaby di

In [20]:
ID2CONTEXT[str(pipeline.predict([preprocessed_text])[0])]

'wikipedia'

### French Wikipedia

In [21]:
text = """
Electronic Arts ou EA (NASDAQ : EA [archive]) est une société américaine fondée le 28 mai 1982 et dont le siège se situe à Redwood City en Californie1. EA est l'un des principaux développeurs et producteurs mondiaux de jeux vidéo.

La société occupe la place de leader sur ce marché jusqu'en 2008, notamment grâce à des rachats de sociétés et de franchises de jeux, mais aussi en acquérant les droits de licences sportives, comme celles de la FIFA, la NBA, la NFL, ou encore celle de la LNH.

Electronic Arts est, en 2013, la troisième plus grande société commercialisant des jeux vidéo, par chiffre d'affaires, après avoir été la 4e en 2012 et 20113. 
"""

In [22]:
preprocessed_text = preprocessor._preprocess(text=text)
preprocessed_text

'electronic arts ea nasdaq ea archive societe americaine fondee mai dont siege situe redwood city californie ea principaux developpeurs producteurs mondiaux jeux video societe occupe place leader marche jusqu notamment grace rachats societes franchises jeux aussi acquerant droits licences sportives comme celles fifa nba nfl encore celle lnh electronic arts troisieme plus grande societe commercialisant jeux video chiffre affaires apres avoir ete'

In [23]:
ID2CONTEXT[str(pipeline.predict([preprocessed_text])[0])]

'wikipedia'

## Topic Modelling

In this case, we use the preprocessed data to fit a Topic Modelling algorithm in order
to detect the hidden topics and gain a deeper understanding of what the data is about 
and into which topics it is separated.

NLP Topic Modelling is a relevant part of the analysis, since it allows us to gain more insights about the
dataset we have, but since it is unsupervised, it requires us to tune the parameters until we can point out useful
conclusions which make sense from the given dataset.

We used the LDA (Latent Dirichlet Allocation) algorithm to identify the hidden topics in the dataset. As a first 
use case, we ran the Topic Modelling just on English texts from Wikipedia, to test whether it worked as expected and also to evaluate the results on one of the most populated contexts.

<img src="https://i.ibb.co/mt9cnVz/topic-modelling.png">

1. __Politics/History__: it seems to be a politics and/or history topic, since its main words include city, world, war, government, century, etc., so we can easily infer its topic.
2. __Music/Movies/Entertainment__: this topic also seems pretty clear, since some of the main words of the documents classified into it are film, album, music, band, song, released, rock, etc.
3. __Industry/Research/Chemistry__: this topic is far apart from the others and it is probably the most uncertain one, since it mixes words related to industry, research, chemistry, etc.; still, as it is clearly different from all the other topics, we can infer what it covers.
4. __Sports/Games__: even though this is not a big topic, it is one of the clearest ones, since it contains a lot of words related to sports such as nba, football, player, game, tennis, etc.
5. __Technology/Software__: it is the smallest one, but it seems pretty clear that it is about technology and software, judging by the most relevant words it contains.

__Note__: Topic Modelling has been applied and analysed in detail for every possible combination of context and language.

## Conclusions & Future Work

__Both objectives have been successfully completed and their respective reports have been generated, tackling the problem as a Data Scientist should, including a detailed Story Telling on each research part developed.__

In addition to the defined objectives, a detailed exploratory data analysis and text preprocessing have been researched/developed too, since this is probably the most relevant part of an NLP Data Scientist's work when tackling an NLP problem, as it adds value to the raw data.

- `Objective 1`: the created model has been fit with 80% of the documents from every context and language and tested with the remaining 20% of the data with balanced contexts and languages too, __achieving an accuracy of up to 98% on the validation set__. Also this model has been dumped into a JOBLIB file so that it can be tested over unseen data.

- `Objective 2`: __the topic modelling problem has been broken down into one topic model per context and language, so as to get more insights and analyse the hidden topics__ that can be found in each collection of documents, also with pretty satisfactory results, evaluated under manual supervision.
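
The evaluation protocol of `Objective 1` (an 80/20 split balanced by context and language, followed by dumping the model) could look roughly like the sketch below. The toy corpus, the language codes and the temporary file path are assumptions; only the 80/20 proportion, the stratification idea and the use of joblib come from the text.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 10 preprocessed documents per context
texts = [f"product review quality price item{i}" for i in range(10)]
texts += [f"city history century government place{i}" for i in range(10)]
labels = ["apr"] * 10 + ["wikipedia"] * 10
languages = (["en"] * 5 + ["es"] * 5) * 2

# Stratify on context AND language so both stay balanced in the two sets
strata = [f"{context}_{language}" for context, language in zip(labels, languages)]
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=strata, random_state=42
)

pipeline = Pipeline([("vect", TfidfVectorizer()), ("clf", LinearSVC())])
pipeline.fit(X_train, y_train)
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(f"accuracy: {accuracy:.2f}")

# Dump the fitted pipeline and load it back, as would be done for unseen data
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
dump(pipeline, path)
restored = load(path)
```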

To sum up, even though the project tasks have been achieved and some extra points have been made, __there is still some work ahead__.

As Future Work, __the main line of research should focus on developing a consistent Machine Translation model to translate text from French and Spanish into English__, which should further improve the results even though they are already pretty accurate.

Another Future Work line of research should be the __design of Deep Learning models maybe in TensorFlow or PyTorch (usually more suitable for NLP)__, since we are presenting a simple use case along this project, but reality is a bit more complex, so tackling the problem using Deep Learning models should improve the model's performance when the input data is bigger, more contexts are provided and more languages too.

Finally, __multilingual word embeddings should be used to improve the model's performance whatever the input data is: the embeddings would let us "translate" every Spanish or French word into English by taking its closest English word embedding__, so that the input is Multi-Lingual but the model only ever sees a single language. Also, __when deploying the model into a production environment a reliable language detection layer should be applied__, so as to either apply the word embeddings if the text is written in French or Spanish, or discard the text if it is neither English, Spanish nor French.
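
The "closest word embedding" idea can be illustrated with a nearest-neighbour lookup by cosine similarity. Everything in this sketch is an assumption made for illustration: the 3-dimensional vectors are made up, and real multilingual embeddings would first have to be aligned into one shared space.

```python
import math

# Made-up 3-dimensional vectors, assumed to live in one shared (aligned) space;
# real multilingual embeddings have hundreds of dimensions
ENGLISH = {
    "game": (0.9, 0.1, 0.0),
    "company": (0.1, 0.9, 0.1),
    "city": (0.0, 0.2, 0.9),
}
FOREIGN = {
    "juego": (0.8, 0.2, 0.1),    # Spanish
    "societe": (0.2, 0.8, 0.0),  # French
    "ciudad": (0.1, 0.1, 0.8),   # Spanish
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def to_english(word):
    """Returns the English word whose embedding is closest, or the word itself."""
    if word not in FOREIGN:
        return word  # unknown or already English: leave untouched
    return max(ENGLISH, key=lambda candidate: cosine(FOREIGN[word], ENGLISH[candidate]))


print(" ".join(to_english(w) for w in "societe juego ciudad ea".split()))
```

With such a lookup in front of the vectorizer, the classifier would only need an English vocabulary.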

## References

1. [_A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab. In the 10th edition of the Language Resources and Evaluation Conference (LREC 2016)_](https://www.researchgate.net/publication/301861882_A_Multilingual_Multi-Style_and_Multi-Granularity_Dataset_for_Cross-Language_Textual_Similarity_Detection)

2. [_Word Translation Without Parallel Data. Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer and Hervé Jégou. In the ICLR 2018 Computation and Language (cs.CL)_](https://arxiv.org/pdf/1710.04087.pdf)

3. [_Exploiting similarities among languages for machine translation. Tomas Mikolov, Quoc V. Le and Ilya Sutskever. In the Computation and Language (cs.CL)_](https://arxiv.org/abs/1309.4168)

4. [_Language-specific models in multilingual topic tracking. Leah S. Larkey, Fangfang Feng, Margaret Connell and Victor Lavrenko, In the SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval_](https://dl.acm.org/doi/abs/10.1145/1008992.1009061)

## Thank you for your attention!