<a href="https://colab.research.google.com/github/gentlemarc/Test-Associate-Data-Scientist/blob/master/Associate_Data_Scientist_Part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Associate Data Scientist - Technical Test - Part 2




In the second part of this assignment, I'm going to perform a topic model analysis on the provided dataset.



## 1 What is Topic Modelling?

Topic modeling is an unsupervised technique for analyzing large volumes of text by grouping documents according to the themes they share. The text data has no labels attached to it; instead, the model groups the documents into clusters based on similar characteristics, such as the words they have in common.



### 1.1 Approaches to the Problem

There are several existing algorithms we can use for topic modeling. The most common are:


* **Latent Semantic Analysis (LSA/LSI)**
* **Probabilistic Latent Semantic Analysis (pLSA)**
* **Latent Dirichlet Allocation (LDA)**









I will use Latent Dirichlet Allocation (LDA) from the Gensim package, along with Mallet's implementation (via Gensim). **Mallet has an efficient implementation of LDA.** It is often reported to run faster and to give better topic separation.



### 1.2 LDA Explanation

LDA treats each document as a mixture of topics in certain proportions, and each topic as a collection of keywords, again in certain proportions.

Once you give the algorithm the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics until it reaches a good composition of the topic-keyword distribution.

What is a topic, and how is it represented?

A topic is a collection of dominant keywords that are typical representatives of it. Just by looking at the keywords, you can usually identify what the topic is about.
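The generative view described above can be sketched as a toy program. This is only an illustration of how LDA assumes documents are produced, not of how Gensim or Mallet infer topics. The two topics, their keyword probabilities and the prior `alpha` are made-up values, and the helper names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate a toy document following LDA's generative story.

    topics: list of dicts mapping keyword -> probability (one dict per topic).
    alpha:  Dirichlet prior over the topic proportions (one value per topic).
    """
    theta = sample_dirichlet(alpha, rng)  # topic proportions of this document
    words = []
    for _ in range(length):
        # Pick a topic according to the document's topic proportions
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a keyword according to that topic's keyword proportions
        vocab = list(topics[k])
        probs = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Two made-up topics, purely for illustration
toy_topics = [
    {"film": 0.5, "actor": 0.3, "award": 0.2},
    {"verb": 0.4, "corpus": 0.4, "grammar": 0.2},
]

rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=10, rng=rng)
print([round(t, 2) for t in theta])
print(doc)
```

Fitting LDA goes in the opposite direction: given only the words of each document, it estimates the topic proportions and the keyword proportions that most plausibly generated them.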

The following are the key factors in obtaining well-separated topics:



1. The quality of text processing.
2. The variety of topics the text talks about.
3. The choice of topic modeling algorithm.
4. The number of topics fed to the algorithm.
5. The algorithm's tuning parameters.




## 2 Load the dataset and Import the libraries

The first step is to upload the data as a .zip file (Colab doesn't allow uploading folders).

In [1]:
from zipfile import ZipFile

filename = '/content/documents_challenge.zip'

with ZipFile(filename, 'r') as zip_file:  # avoid shadowing the built-in zip()
  zip_file.extractall()
print('Done')

Done


As in the first part, we merge all the documents into a single dataframe.

In [2]:
#Import some libraries
import pandas as pd
import os 
import sys
from collections import defaultdict
import re


# Get all the url files in all the folders and append into a list
data_folder = r'/content/documents_challenge'

#we shall store all the file names in this list
filelist = []


for root, dirs, files in os.walk(data_folder):
    for file in files:
        #append the file name to the list
        filelist.append(os.path.join(root,file))
        #print(file)

#Print the len of the data
print("Number of files", len(filelist))

print("Example of data")
print(filelist[0:3])



Number of files 23128
Example of data
['/content/documents_challenge/Conference_papers/en/article-13-22-en.txt', '/content/documents_challenge/Conference_papers/en/article-20-2-en.txt', '/content/documents_challenge/Conference_papers/en/article-25-1-en.txt']


Build the dataframe again so that the data is easier to work with.

In [3]:
#Let's merge all the data in a dataframe with columns name 

#results = defaultdict(list)

results = []

for files in filelist:

        #Amazon Reviews. 
        if('APR' in files):
            #Create a column to put the type file, variable to predict.
            type_file = 'APR'
            lang = re.search(r'APR/(.*?)/apr-', files).group(1)

        #Conference Papers.    
        elif('Conference_papers' in files):
           
            type_file = 'Conference Paper'
            lang = re.search(r'Conference_papers/(.*?)/article', files).group(1)
        
        #PAN 11
        elif('PAN11' in files):
            type_file = 'PAN11'
            lang = re.search(r'PAN11/(.*?)/pan-', files).group(1)
            #lang = lang.split("\\" )[1]

        #Wikipedia
        else:
            type_file = 'Wikipedia'
            lang = re.search(r'Wikipedia/(.*?)/', files).group(1)
    
        try:
            with open(files, "r",  encoding="UTF-8") as file_open:

                #results["file"] = type_file
                #results["lang"] = lang
                #results["text"].append(file_open.read())
                
                results.append ([lang, type_file, file_open.read()])

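        except (OSError, UnicodeDecodeError):
            # Report the path being read; `file` would be a stale name left over from the os.walk loop
            print("Error in file: ", files)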

#Create the DataFrame
corpus_df = pd.DataFrame(results, columns=['Language', 'Category', 'Text'])

#Print the 10 first values
corpus_df.head(10)

| | Language | Category | Text |
|---|---|---|---|
| 0 | en | Conference Paper | DICOVALENCE, a valence dictionary of\n French,... |
| 1 | en | Conference Paper | DI-LSA\n The technique proposed independently ... |
| 2 | en | Conference Paper | The methodology consists in introducing semant... |
| 3 | en | Conference Paper | In order to do so, we evaluate our approaches ... |
| 4 | en | Conference Paper | These experiments show that it could be risky ... |
| 5 | en | Conference Paper | For each EDU, annotators identify how outcomes... |
| 6 | en | Conference Paper | Abstract. In this article, we analyse the modi... |
| 7 | en | Conference Paper | We also obtain a list of more than 200 multiwo... |
| 8 | en | Conference Paper | When the configurator used is itself\n object-... |
| 9 | en | Conference Paper | Let be C a finite set of n concepts, a concept... |
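The language and category extraction in the cell above relies on one regular expression per folder. A minimal alternative sketch reads both values from the path components instead. It assumes every file sits at `<category folder>/<language>/<file name>`, which matches the example paths printed earlier but is not verified for the whole dataset; `parse_path` and `CATEGORY_NAMES` are hypothetical names.

```python
from pathlib import PurePosixPath

# Folder name -> category label used in the dataframe (labels copied from the cell above)
CATEGORY_NAMES = {
    'APR': 'APR',
    'Conference_papers': 'Conference Paper',
    'PAN11': 'PAN11',
    'Wikipedia': 'Wikipedia',
}


def parse_path(path):
    """Return (language, category) for a path shaped like .../<folder>/<lang>/<file>."""
    parts = PurePosixPath(path).parts
    folder, lang = parts[-3], parts[-2]
    # Unknown folders fall back to their own name (an assumption of this sketch)
    return lang, CATEGORY_NAMES.get(folder, folder)


example = '/content/documents_challenge/Conference_papers/en/article-13-22-en.txt'
print(parse_path(example))
```

Compared with the regex version, this does not depend on file name prefixes such as `apr-` or `pan-`, but it breaks if a category stores files in deeper subfolders.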


Next we load some libraries for the text pre-processing that comes later. We will use spaCy models for lemmatization.


**Lemmatization means converting a word to its dictionary base form (its lemma)**. For example, the lemma of the word ‘machines’ is ‘machine’. Other examples: ‘walking’ → ‘walk’, ‘mice’ → ‘mouse’, and so on.



In [4]:
# Download NLTK stopwords
import nltk


#Load the stopwords for each language present in the data.
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
es_stop = set(nltk.corpus.stopwords.words('spanish'))
fr_stop = set(nltk.corpus.stopwords.words('french'))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [5]:
#print an example of Spanish stopwords

es_list = list(es_stop) 
es_list[0:15]


['todo',
 'tenido',
 'hubisteis',
 'sin',
 'estarás',
 'haya',
 'esté',
 'tenía',
 'del',
 'hubieran',
 'hay',
 'esas',
 'pero',
 'mis',
 'estuvieras']

In [6]:
#Download the data for lemmatization
!python3 -m spacy download en
!python3 -m spacy download es
!python3 -m spacy download fr

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
Collecting es_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-2.2.5/es_core_news_sm-2.2.5.tar.gz (16.2MB)
Building wheels for collected packages: es-core-news-sm
  Building wheel for es-core-news-sm (setup.py) ... done
  Created wheel for es-core-news-sm: filename=es_core_news_sm-2.2.5-cp36-none-any.whl size=16172934 sha256=12b8eea84a412ab5ff00dae2f9da1536b8eb2e1836d7bc10b71d21803426e875
  Stored in directory: /tmp/pip-ephem-wheel-cache-c864mhto/wheels/05/4f/66/9d0c806f86de08e8645d67996798c49e1512f9c3a250d74242
Successfully built es-core-news-sm
Inst

**Import Necessary Packages:**


In [7]:
# pyLDAvis is a package we will use for data visualization. We will explain more later

!pip install pyLDAvis

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
Collecting funcy
  Downloading https://files.pythonhosted.org/packages/ce/4b/6ffa76544e46614123de31574ad95758c421aae391a1764921b8a81e1eae/funcy-1.14.tar.gz (548kB)
Building wheels for collected packages: pyLDAvis, funcy
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97712 sha256=2e6105c34e2cb3d845b16914eff594285c18411b17c31b78bbd5e1f5bb66c10a
  Stored in directory: /root/.cache/pip/wheels/98/71/24/513a99e58bb6b8465bae4d2d5e9dba8f0bef8179e3051ac414
  Building wheel for funcy (setup.py) ... done
  Created wheel for funcy: filename=funcy-1.14-py2.py3-none-any.whl size=32042 sha256=17bc5efb

In [8]:
import re
#Numpy and Pandas for data handling
import numpy as np
import pandas as pd
from pprint import pprint


# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis # Visualize the topics-keywords
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline


### Split the Dataset per Language.

To make the topic model analysis easier to explain, we split the dataset into French, Spanish and English documents.

In [None]:
corpus_df.head()

| | Language | Category | Text |
|---|---|---|---|
| 0 | en | Conference Paper | The train/dev/test split is\n the same as in (... |
| 1 | en | Conference Paper | Contribution of conceptual vectors to lexical ... |
| 2 | en | Conference Paper | As it would not be then reasonable nor easy to... |
| 3 | en | Conference Paper | Interaction Grammars\n We briefly introduce IG... |
| 4 | en | Conference Paper | The latter, rather unusual,\n test configurati... |


In [9]:
english_df = corpus_df.loc[corpus_df['Language'] == 'en']
spanish_df = corpus_df.loc[corpus_df['Language'] == 'es']
french_df = corpus_df.loc[corpus_df['Language'] == 'fr']

Convert the datasets to lists. This makes it easier to inspect the full content and clean it. Let's print some data.


In [10]:
#Convert to list and print the 5 first and the 5 last results
english = english_df.Text.values.tolist()

print("First English results")
print(english[0:5])
print("\n")
print("Last English Results")
print(english[-5:])


First English results
['DICOVALENCE, a valence dictionary of\n French, formerly known as PROTON (van den\n Eynde and Mertens, 2002), which has been\n based on the pronominal approach. In version\n 1.1, this dictionary details the subcategorization\n frames of more than 3,700 verbs (table 1\n gives an example of a DICOVALENCE entry).\n We extracted the simple and multiword prepositions\n it contains (i.e. more than 40), as well\n as their associated semantic classes.', 'DI-LSA\n The technique proposed independently by Bestgen\n (2002), DI-LSA, is very similar to the one proposed by\n Turney and Littman. The main difference is at the level\n of the benchmarks used to evaluate a word. While\n SO-LSA uses a few benchmarks selected a priori,\n DI-LSA is based on lexicons that contain several\n hundred words rated by judges on the\n pleasant-unpleasant scale This kind of lexicon was\n initially developed in the field of content analysis. As\n early as 1965, Heise proposed to constitute a val

In [11]:
#Convert to list and print the first 2 and the last 2 results
spanish = spanish_df.Text.values.tolist()

print("First spanish results")
print(spanish[0:2])
print("\n")
print("Last spanish Results")
print(spanish[-2:])


First spanish results
[' |fecha de defunción = |lugar de defunción = |otros nombres = |cónyuge = |hijos = |sitio web = |premios óscar = |premios globo de oro = |premios bafta = |premios emmy = |premios sag = |premios tony = |premios grammy = |premios cannes = |premios san sebastian = |premios goya = |premios cesar = |premios ariel = |premios condor = |otros premios = |imdb = 0156940 Dominic Chianese (nacido el 24 de febrero o el 2 de septiembre de 1931)http://www.filmreference.com/film/36/Dominic-Chianese.html&lt;/ref&gt; es un actor Italo-Americano quizás más conocido por su papel de Junior Soprano en la serie de HBO TV, Los Soprano, un papel que le concedió dos nominaciones a a los premios Emmy.BiografíaChianese nació en el municipio del Bronx, en Nueva York, hijo de un albañil. Se graduó en la prestigiosa Bronx High School of Science en 1948. Trabajó como albañil con su padre y asistía a la escuela nocturna durante la década de 1950, consiguiendo su licenciatura en teatro y declamac

In [12]:
#Convert to list and print the first 2 and the last 2 results
french = french_df.Text.values.tolist()

print("First french results")
print(french[0:2])
print("\n")
print("Last french Results")
print(french[-2:])


First french results
['Un exemple de conjonction entre préférences est Pourrais-je avoir un petit déjeuner et un repas végétarien\n ? où l’agent exprime deux préférences qu’il souhaite satisfaire et il aimerait en avoir\n au moins une des deux s’il ne peut pas les avoir toutes. La sémantique des disjonctions est une\n modalité de choix libre. Par exemple, Je suis libre lundi ou mardi signifie que lundi ou mardi\n est un jour possible pour se rencontrer et que l’agent est indifférent entre les deux.', 'Puisque la liste des termes associés à chaque concept de notre ontologie est\n courte, ce trait aide à retrouver des lexicalisations supplémentaires ; (2) le segment contient\n une disjonction ou une conjonction ; (3) le GN est dans la portée d’une négation, d’un modal ou\n d’un verbe d’action du domaine (se rencontrer, réserver). La portée des négations et des modaux\n est résolue de manière simplifiée en utilisant l’arbre syntaxique de l’UD; (4) le segment contient\n un mot d’opinion (b

## 3 Clean the Data


As we can see, there are a lot of special characters and other information that can add unnecessary noise to the topic model analysis. The most complicated documents are the ones **coming from Wikipedia**.

If we look closely at the Wikipedia data, we can see HTML entities like **&ndash** and file extensions like **.jpg** (used to load pictures), which could add noise to our topic model.

Other expressions to remove are:

* & *word_between* ;
* < *word_between* >
* URLs
* Everything between [[ and ]]. Anything inside double square brackets is part of the Wiki markup and could add unnecessary words when looking for hidden topics.
* **jpg** files
* Special characters like '|' or @ are less important, because **they are removed during tokenization.**



Here are some functions to clean the data. It is also important to **remove the accents in French and Spanish.**

In [15]:
#This is an example of text with a lot of unnecessary data.

textc = ['[[Imagen:RoyalAcademy20040807 CopyrightKaihsuTai.jpg|thumb|right|250px|\'Real Academia de Arte\', Londres]] La Royal Academy of Arts es una institución artística con sede en Piccadilly, @hola.jpg Londres.La Real Academia surgió a partir de una disputa en la Sociedad de Artistas, por el liderazgo, entre dos arquitectos, Sir William Chambers y James Paine. Paine ganó, pero Chambers juró venganza y usó sus conexiones con el rey para crear una nueva institución artística, la Real Academia, en 1768. Los cuarenta fundadores fueron admitidos el 10 de diciembre de 1768. Sir Joshua Reynolds fue el primer presidente, y Benjamin West el segundo.La Real Academia no recibe apoyo financiero del estado ni de la Corona. Obtiene ingresos de sus exposiciones y de donaciones. La Academia dirige una escuela de arte para postgraduados, con sede en Burlington House. Los alumnos suelen hacer dos exposiciones al año.El número de académicos está limitado a 80. Se busca el equilibrio entre las distintas disciplinas, y así, se suele exigir que haya, por ejemplo, al menos 14 escultores y 12 arquitectos. Además de los miembros de la academia (R.A.), existen asociados (A.R.A.), pero no es requisito previo para ser académico.La elección como Presidente de la Real Academia (P.R.A.) suele garantizar ser nombrado caballero, si es que el presidente no ostenta ya tal rango.Los miembros del público pueden unirse a la Academia como "Amigos", haciendo donaciones, lo cual es otra de las fuentes de financiación.Lista de principales académicosThomas Gainsborough (1768)William Hunter (1768; primer académico profesor de anatomía)Angelica Kauffmann (1768)Sir Joshua Reynolds (1768; Presidente 1768&ndash;1792)Benjamin West (1768; Presidente 1792&ndash;1805, 1806&ndash;1820)Sir Thomas Lawrence (1794; Presidente 1820&ndash;1830)John Flaxman (1800; Profesor de Escultura 1810&ndash;1826)Sir John Soane (1802; Profesor de la Academia, de arquitectura 1806&ndash;1837)J. M. W. 
Turner (1802)John Constable (1829)William Dyce (1848)John Everett Millais (1863; Presidente 1896)Alfred Waterhouse (1885)John William Waterhouse (1895)George Frederic Watts (1897)Sir Aston Webb (1903)Eduardo Paolozzi (1979)Peter Blake (1981)David Hockney (1991)PresidentesPresidenteMandatoSir Joshua Reynolds1768&ndash;1792Benjamin West1792&ndash;1805James Wyatt1805&ndash;1806Benjamin West1806&ndash;1820Sir Thomas Lawrence1820&ndash;1830Sir Martin Archer Shee1830&ndash;1850Sir Charles Lock Eastlake1850&ndash;1865Sir Francis Grant1866&ndash;1878Frederic Leighton, Lord Leighton1878&ndash;1896Sir John Everett MillaisFebrero&ndash;agosto 1896Sir Edward Poynter1896&ndash;1918Sir Aston Webb1919&ndash;1924Sir Frank Dicksee1924&ndash;1928Sir William Llewellyn1928&ndash;1938Sir Edwin Lutyens1938&ndash;1944Sir Alfred Munnings1944&ndash;1949Sir Gerald Kelly1949&ndash;1954Sir Albert Richardson1954&ndash;1956Sir Charles Wheeler1956&ndash;1966Sir Thomas Monnington1966&ndash;1976Sir Hugh Casson1976&ndash;1984Sir Roger de Grey1984&ndash;1993Sir Philip Dowson1993&ndash;1999Phillip King1999&ndash;2004Sir Nicholas Grimshaw2004&ndash;actualidadDirecciónRoyal Academy of Arts. Burlington House. Piccadilly. London W1J 0BDEnlaces externosPágina oficial de la Royal AcademyCategory:Museos de Londresar:الأكاديمية الملكية للفنون ca:Royal Academy of Arts cs:Royal Academy of Arts de:Royal Academy of Arts en:Royal Academy fr:Royal Academy he:האקדמיה המלכותית לאמנויות hu:Királyi Művészeti Akadémia it:Royal Academy of Arts ja:ロイヤル・アカデミー・オブ・アーツ nl:Royal Academy of Arts no:Royal Academy pl:Royal Academy pt:Academia Real Inglesa ru:Королевская Академия художеств simple:Royal Academy of Arts', ' thumb|right|200px| Basílica de San Andrea Vercelli (Vërsèj en piamontés) es una ciudad de Italia en la región del Piamonte, provincia de Vercelli. Tiene unos 60.000 habitantes y está situada a la orilla derecha del río Sesia. 
Se encuentra en medio de una gran llanura, entre Milán y Turín, muy bien irrigada y rodeada de campos de arroz, producto del que exporta a todo el mundo y del que es uno de los mayores mercados europeos. Su nombre deriva del celta Wercel, (Guardia de los celtas).HistoriaFue la capital de los Libiquis (Oppidium Vercellae) y formó parte de la Galia Cisalpina. En el 101&amp;nbsp;a.&amp;nbsp;C. se libró una batalla en sus alededores entre los romanos dirigidos por el cónsul Cayo Mario contra los Cimbrios, que fueron derrotados en la Batalla de Vercelae. En el 89&amp;nbsp;a.&amp;nbsp;C. la ciudad va a recibir el derecho romano. En tiempos de Estrabón era una villa fortificada, pero después se va a convertir en municipio (42&nbsp;a.&nbsp;C.)']
textc 

['[[Imagen:RoyalAcademy20040807 CopyrightKaihsuTai.jpg|thumb|right|250px|\'Real Academia de Arte\', Londres]] La Royal Academy of Arts es una institución artística con sede en Piccadilly, @hola.jpg Londres.La Real Academia surgió a partir de una disputa en la Sociedad de Artistas, por el liderazgo, entre dos arquitectos, Sir William Chambers y James Paine. Paine ganó, pero Chambers juró venganza y usó sus conexiones con el rey para crear una nueva institución artística, la Real Academia, en 1768. Los cuarenta fundadores fueron admitidos el 10 de diciembre de 1768. Sir Joshua Reynolds fue el primer presidente, y Benjamin West el segundo.La Real Academia no recibe apoyo financiero del estado ni de la Corona. Obtiene ingresos de sus exposiciones y de donaciones. La Academia dirige una escuela de arte para postgraduados, con sede en Burlington House. Los alumnos suelen hacer dos exposiciones al año.El número de académicos está limitado a 80. Se busca el equilibrio entre las distintas disci

In [14]:
#Create a function to remove noisy markup from the raw text
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 


def clean_characters(lst):

   """
   Remove special characters and Wiki/HTML residue from a list of strings
   """
   lst = [re.sub(r'\s', ' ', sent) for sent in lst] # Replace every whitespace character (including \n) with a space
   lst = [re.sub(r'&.*?;', ' ', sent) for sent in lst] # Remove HTML entities such as &ndash;
   lst = [re.sub(r'\[[^\]]*\].', ' ', sent) for sent in lst] # Remove bracketed Wiki markup plus the character that follows it
   lst = [sent.replace('jpg', '') for sent in lst] # Remove the 'jpg' extension left by image names

   return lst




In [16]:
#Let's see the text after the cleaning
clean_characters(textc)

['  La Royal Academy of Arts es una institución artística con sede en Piccadilly, @hola. Londres.La Real Academia surgió a partir de una disputa en la Sociedad de Artistas, por el liderazgo, entre dos arquitectos, Sir William Chambers y James Paine. Paine ganó, pero Chambers juró venganza y usó sus conexiones con el rey para crear una nueva institución artística, la Real Academia, en 1768. Los cuarenta fundadores fueron admitidos el 10 de diciembre de 1768. Sir Joshua Reynolds fue el primer presidente, y Benjamin West el segundo.La Real Academia no recibe apoyo financiero del estado ni de la Corona. Obtiene ingresos de sus exposiciones y de donaciones. La Academia dirige una escuela de arte para postgraduados, con sede en Burlington House. Los alumnos suelen hacer dos exposiciones al año.El número de académicos está limitado a 80. Se busca el equilibrio entre las distintas disciplinas, y así, se suele exigir que haya, por ejemplo, al menos 14 escultores y 12 arquitectos. Además de los 

This is good enough for now: the remaining special characters and the punctuation will be removed in the tokenization step.
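The list of expressions at the start of this section also mentions URLs and `<...>` tags, which `clean_characters` does not target. A minimal sketch of a fuller cleaner follows. The regular expressions, the function name `clean_text` and the choice to decode HTML entities with the standard library's `html.unescape` (twice, to handle double-escaped forms such as `&amp;nbsp;`) are assumptions of this example, not part of the pipeline used in this notebook.

```python
import html
import re

# Patterns for the noise listed in this section (assumed, not tuned on the full corpus)
WIKI_LINK_RE = re.compile(r'\[\[.*?\]\]')          # [[ ... ]] Wiki markup
TAG_RE = re.compile(r'<[^>]+>')                    # <ref>, </ref>, ...
URL_RE = re.compile(r'https?://\S+|www\.\S+')      # http(s) and bare www links
IMAGE_RE = re.compile(r'\S+\.jpg\b', re.IGNORECASE)  # image file names


def clean_text(text):
    """Remove Wiki markup, HTML tags, URLs and image names from one document."""
    # Decode entities first so that '&lt;/ref&gt;' becomes '</ref>' and can be stripped as a tag
    text = html.unescape(html.unescape(text))
    text = WIKI_LINK_RE.sub(' ', text)
    text = TAG_RE.sub(' ', text)
    text = URL_RE.sub(' ', text)
    text = IMAGE_RE.sub(' ', text)
    # Collapse runs of whitespace (including non-breaking spaces) into single spaces
    return re.sub(r'\s+', ' ', text).strip()


sample = ("[[Imagen:Foo.jpg|thumb]] Hola &lt;/ref&gt; mundo "
          "http://www.example.com/x @hola.jpg fin")
print(clean_text(sample))
```

Decoding entities instead of deleting them keeps characters such as the dash in year ranges, which the tokenizer discards later anyway.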

## 4 Tokenize Words and Clean Up Text


Tokenization breaks the raw text into small chunks, such as words or sentences, called **tokens**. These tokens are the units that the later steps and the topic model work with, and analyzing their sequence helps in interpreting the meaning of the text.

Gensim’s **simple_preprocess()** is great for tokenization: it lowercases the text and drops punctuation and numbers. Additionally, I have set **deacc=True** to strip the accent marks.

In [17]:

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; the tokenizer itself drops punctuation

data_words = list(sent_to_words(textc))

#Print an example of tokenization using the example above
print(data_words[:1])
  

[['imagen', 'royalacademy', 'jpg', 'thumb', 'right', 'px', 'real', 'academia', 'de', 'arte', 'londres', 'la', 'royal', 'academy', 'of', 'arts', 'es', 'una', 'institucion', 'artistica', 'con', 'sede', 'en', 'piccadilly', 'hola', 'jpg', 'londres', 'la', 'real', 'academia', 'surgio', 'partir', 'de', 'una', 'disputa', 'en', 'la', 'sociedad', 'de', 'artistas', 'por', 'el', 'liderazgo', 'entre', 'dos', 'arquitectos', 'sir', 'william', 'chambers', 'james', 'paine', 'paine', 'gano', 'pero', 'chambers', 'juro', 'venganza', 'uso', 'sus', 'conexiones', 'con', 'el', 'rey', 'para', 'crear', 'una', 'nueva', 'institucion', 'artistica', 'la', 'real', 'academia', 'en', 'los', 'cuarenta', 'fundadores', 'fueron', 'admitidos', 'el', 'de', 'diciembre', 'de', 'sir', 'joshua', 'reynolds', 'fue', 'el', 'primer', 'presidente', 'benjamin', 'west', 'el', 'segundo', 'la', 'real', 'academia', 'no', 'recibe', 'apoyo', 'financiero', 'del', 'estado', 'ni', 'de', 'la', 'corona', 'obtiene', 'ingresos', 'de', 'sus', 'ex
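The accent stripping visible in the output above ('institucion', 'artistica') can be reproduced with the standard library alone. The sketch below decomposes each character, drops the combining marks and recomposes the string. `strip_accents` is a hypothetical helper written for illustration and is not a Gensim function.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks, e.g. 'institución' -> 'institucion'."""
    # NFD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFD', text)
    # Category 'Mn' (nonspacing mark) covers the combining accents
    without_marks = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    return unicodedata.normalize('NFC', without_marks)


print(strip_accents('institución artística'))
print(strip_accents('préférences'))
```

Note that this also turns 'ñ' into 'n', which is why 'albañil' shows up as 'albanil' in the tokenized Spanish text.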

In [18]:
#Convert the 3 different dataframes to lists

english_list = english_df.Text.values.tolist()
spanish_list = spanish_df.Text.values.tolist()
french_list = french_df.Text.values.tolist()


In [19]:
#Apply the function clean_characters to the different languages

english_list = clean_characters(english_list)
spanish_list = clean_characters(spanish_list)
french_list = clean_characters(french_list)




In [20]:
#Apply the function sent_to_words to get the tokenization arrays

english_words = list(sent_to_words(english_list))
french_words = list(sent_to_words(french_list))
spanish_words = list(sent_to_words(spanish_list))


## 5 Creating Bigram and Trigram Models

**Bigrams** are two words that frequently occur together in the documents.

**Trigrams** are three words that frequently occur together.

**Gensim’s Phrases** model can build and apply bigrams, trigrams, quadgrams and more. Its two important arguments are **min_count** and **threshold**. The higher the values of these parameters, the harder it is for words to be combined into phrases.

Some examples in our data are *defuncion_lugar*, *premios_oscar_premios_globo* or *otros_nombres_conyuge*
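To see why higher `min_count` and `threshold` values produce fewer phrases, here is a minimal sketch of the default scoring rule used by Phrases: `(count(a, b) - min_count) / count(a) / count(b) * vocabulary_size`, with a pair merged only when its score exceeds `threshold`. The function `phrase_scores` and the toy sentences are made up for this example. Gensim's own bookkeeping differs (its vocabulary also stores the bigram keys), so absolute scores will not match; only the shape of the formula is the point.

```python
from collections import Counter


def phrase_scores(sentences, min_count):
    """Score every adjacent word pair with the default Phrases formula (simplified)."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: unigram vocabulary only
    return {
        (a, b): (n_ab - min_count) / unigrams[a] / unigrams[b] * vocab_size
        for (a, b), n_ab in bigrams.items()
    }


toy = [
    ['new', 'york', 'is', 'big'],
    ['new', 'york', 'is', 'far'],
    ['the', 'city', 'is', 'new'],
]

scores = phrase_scores(toy, min_count=1)
threshold = 1.0
merged = sorted(pair for pair, score in scores.items() if score > threshold)
print(merged)
```

Raising `min_count` lowers every score, and raising `threshold` raises the bar a score must clear, so both make merging rarer.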

In [21]:
# Build the bigram and trigram models

# ENGLISH
bigram_en = gensim.models.Phrases(english_words, min_count=4, threshold=80) # higher threshold fewer phrases.
trigram_en = gensim.models.Phrases(bigram_en[english_words], threshold=80)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod_en = gensim.models.phrases.Phraser(bigram_en)
trigram_mod_en = gensim.models.phrases.Phraser(trigram_en)





In [22]:
# See trigram example in English
print(trigram_mod_en[bigram_mod_en[english_words[0]]])

['dicovalence', 'valence', 'dictionary', 'of', 'french', 'formerly', 'known', 'as', 'proton', 'van_den', 'eynde', 'and', 'mertens', 'which', 'has', 'been', 'based', 'on', 'the', 'pronominal', 'approach', 'in', 'version', 'this', 'dictionary', 'details', 'the', 'frames', 'of', 'more', 'than', 'verbs', 'table', 'gives', 'an', 'example', 'of', 'dicovalence', 'entry', 'we', 'extracted', 'the', 'simple', 'and', 'multiword_prepositions', 'it', 'contains', 'more', 'than', 'as', 'well', 'as', 'their', 'associated', 'semantic', 'classes']


In [27]:
# SPANISH
bigram_es = gensim.models.Phrases(spanish_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram_es = gensim.models.Phrases(bigram_es[spanish_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod_es = gensim.models.phrases.Phraser(bigram_es)
trigram_mod_es = gensim.models.phrases.Phraser(trigram_es)



In [26]:
 # See trigram example in Spanish
print(trigram_mod_es[bigram_mod_es[spanish_words[0]]])

['fecha', 'de', 'defuncion_lugar', 'de', 'defuncion', 'otros_nombres_conyuge', 'hijos_sitio_web', 'premios_oscar_premios_globo', 'de', 'oro_premios_bafta', 'premios_emmy_premios_sag', 'premios_tony_premios_grammy', 'premios_cannes_premios', 'san_sebastian_premios_goya', 'premios_cesar_premios_ariel', 'premios_condor_otros', 'premios_imdb', 'dominic_chianese', 'nacido', 'el', 'de', 'febrero', 'el', 'de', 'septiembre', 'de', 'http_www', 'filmreference', 'com', 'film', 'dominic_chianese', 'html_ref', 'es', 'un', 'actor', 'italo', 'americano', 'quizas', 'mas', 'conocido', 'por', 'su', 'papel', 'de', 'junior', 'soprano', 'en', 'la', 'serie', 'de', 'hbo', 'tv', 'los', 'soprano', 'un', 'papel', 'que', 'le', 'concedio', 'dos', 'nominaciones', 'los', 'premios_emmy', 'nacio', 'en', 'el', 'municipio', 'del', 'bronx', 'en', 'nueva_york', 'hijo', 'de', 'un', 'albanil', 'se', 'graduo', 'en', 'la', 'prestigiosa', 'bronx', 'high_school', 'of', 'science', 'en', 'trabajo', 'como', 'albanil', 'con', 'su'

In [28]:
# FRENCH
bigram_fr = gensim.models.Phrases(french_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram_fr = gensim.models.Phrases(bigram_fr[french_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod_fr = gensim.models.phrases.Phraser(bigram_fr)
trigram_mod_fr = gensim.models.phrases.Phraser(trigram_fr)



In [29]:
# See trigram example in French
print(trigram_mod_fr[bigram_mod_fr[french_words[0]]])

['un', 'exemple', 'de', 'conjonction', 'entre', 'preferences', 'est', 'pourrais', 'je', 'avoir', 'un', 'petit_dejeuner', 'et', 'un', 'repas', 'vegetarien', 'ou', 'agent', 'exprime', 'deux', 'preferences', 'qu', 'il', 'souhaite', 'satisfaire', 'et', 'il', 'aimerait', 'en', 'avoir', 'au', 'moins', 'une', 'des', 'deux', 'il', 'ne', 'peut', 'pas', 'les', 'avoir', 'toutes', 'la', 'semantique', 'des', 'disjonctions', 'est', 'une', 'modalite', 'de', 'choix', 'libre', 'par', 'exemple', 'je_suis', 'libre', 'lundi', 'ou', 'mardi', 'signifie', 'que', 'lundi', 'ou', 'mardi', 'est', 'un', 'jour', 'possible', 'pour', 'se', 'rencontrer', 'et', 'que', 'agent', 'est', 'indifferent', 'entre', 'les', 'deux']


In [None]:
print("Tokens in first English document", len(trigram_mod_en[bigram_mod_en[english_words[0]]]))
print("Tokens in first Spanish document", len(trigram_mod_es[bigram_mod_es[spanish_words[0]]]))
print("Tokens in first French document", len(trigram_mod_fr[bigram_mod_fr[french_words[0]]]))

Tokens in first English document 20
Tokens in first Spanish document 3806
Tokens in first French document 46


In the examples above, we can see which words frequently occur together in each language.

Note that the numbers printed above are the token counts of the first document in each language after phrase merging, not counts of trigrams, so the much larger Spanish figure mainly reflects a longer first document. Before drawing any conclusions, let's **remove the stop words**.
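To count the merged phrases themselves rather than all tokens, one option is to count the tokens that contain the `_` delimiter. This is a minimal sketch: `count_merged_phrases` is a hypothetical helper, the sample tokens are copied from the Spanish output above, and it assumes that underscores only come from Phrases (its default delimiter) and not from the raw text.

```python
from collections import Counter


def count_merged_phrases(tokens):
    """Return {number_of_joined_words: count} for tokens merged by Phrases."""
    sizes = Counter(token.count('_') + 1 for token in tokens)
    # Keep only real phrases (two or more words joined together)
    return {n: c for n, c in sizes.items() if n > 1}


sample = ['fecha', 'de', 'defuncion_lugar', 'premios_oscar_premios_globo',
          'dominic_chianese', 'nueva_york']
print(count_merged_phrases(sample))
```

Because the trigram model runs on top of the bigram output, it can join two bigrams, which is how four-word tokens such as *premios_oscar_premios_globo* appear.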



## 6 Remove Stopwords, Make Bigrams and Lemmatize

The phrase models are ready. Let’s define the functions to remove the stopwords, make bigrams and lemmatize, and then call them sequentially.

In [32]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words_en = stopwords.words('english')
stop_words_fr = stopwords.words('french')
stop_words_es = stopwords.words('spanish')

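#Let's extend the stopwords with some unnecessary words coming from the Wikipedia corpus
#('thumb' is the token that actually appears in the tokenized markup; 'thumbs' is kept from the original list)
stop_words_en.extend(['jpg', 'thumb', 'thumbs', 'px'])
stop_words_fr.extend(['jpg', 'thumb', 'thumbs', 'px'])
stop_words_es.extend(['jpg', 'thumb', 'thumbs', 'px'])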
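One detail worth checking: the tokens were de-accented with `deacc=True`, while the NLTK lists keep their accents (for example 'estarás' in the Spanish sample printed earlier), so an accented stop word never matches its de-accented token. A minimal sketch of one way around this adds a de-accented copy of every stop word. The helper names are hypothetical and this step is not part of the pipeline above.

```python
import unicodedata


def strip_accents(word):
    """Drop combining accent marks from a word ('estarás' -> 'estaras')."""
    decomposed = unicodedata.normalize('NFD', word)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')


def build_stopword_set(stop_words):
    """Return the stop words plus their de-accented variants."""
    return set(stop_words) | {strip_accents(w) for w in stop_words}


def filter_tokens(docs, stop_set):
    """Remove stop words from already tokenized documents."""
    return [[tok for tok in doc if tok not in stop_set] for doc in docs]


# A few Spanish stop words taken from the NLTK sample printed earlier
sample_stops = ['estarás', 'esté', 'tenía', 'pero']
stop_set = build_stopword_set(sample_stops)

docs = [['pero', 'estaras', 'aqui', 'tenia', 'razon']]
print(filter_tokens(docs, stop_set))
```

Using a set also makes each membership test constant time, which matters with more than twenty thousand documents.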



In [33]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts, stop_words):
  
  # Pass a stop words dictionary and a corpus of texts
  return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]



def make_bigrams(texts, lang):
  # Apply the bigram model of the given language (English by default)

  #Spanish Bigram
  if lang == 'es':
    return [bigram_mod_es[doc] for doc in texts]
  
  #French Bigram
  elif lang == 'fr':
    return [bigram_mod_fr[doc] for doc in texts]
  
  #English Bigram
  return [bigram_mod_en[doc] for doc in texts]
  
def make_trigrams(texts, lang):
  # Apply the bigram model and then the trigram model of the given language (English by default)

  #Spanish Trigram
  if lang == 'es':
    return [trigram_mod_es[bigram_mod_es[doc]] for doc in texts]

  #French Trigram
  elif lang == 'fr':
    return [trigram_mod_fr[bigram_mod_fr[doc]] for doc in texts]

  #English Trigram
  return [trigram_mod_en[bigram_mod_en[doc]] for doc in texts]

In [35]:
#Define Lemmatization function
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
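The function above reads the pipeline from the global name `nlp`. With three language models in play, a variant that receives the pipeline as an argument makes it explicit which model lemmatizes which corpus. The sketch below shows that design. `lemmatize` and `fake_nlp` are hypothetical names, and `fake_nlp` is only a stand-in with a three-word lexicon so the example runs without downloading a model; a real call would pass the object returned by `spacy.load(...)`.

```python
from types import SimpleNamespace


def lemmatize(texts, nlp, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize tokenized documents with the pipeline passed in as `nlp`."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([tok.lemma_ for tok in doc if tok.pos_ in allowed_postags])
    return texts_out


# Stand-in for a spaCy pipeline: maps each word to a (lemma, part-of-speech) pair
FAKE_LEXICON = {
    'machines': ('machine', 'NOUN'),
    'walking': ('walk', 'VERB'),
    'the': ('the', 'DET'),
}


def fake_nlp(text):
    """Return token-like objects exposing .lemma_ and .pos_ like spaCy tokens do."""
    tokens = []
    for word in text.split():
        lemma, pos = FAKE_LEXICON.get(word, (word, 'X'))
        tokens.append(SimpleNamespace(lemma_=lemma, pos_=pos))
    return tokens


print(lemmatize([['the', 'machines', 'walking']], fake_nlp))
```

With this signature, each corpus is processed by passing its own model, for example the Spanish pipeline for the Spanish documents, rather than depending on whichever model was last assigned to a shared name.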

Let’s call the functions, once per language:

1. English
2. Spanish
3. French



In [36]:
# Remove Stop Words
data_words_nostops_en = remove_stopwords(english_words,stop_words_en)

# Form Bigrams
data_words_bigrams_en = make_bigrams(data_words_nostops_en, 'en')

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams_en, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['valence', 'formerly', 'know', 'base', 'pronominal', 'version', 'detail', 'frame', 'verb', 'table', 'give', 'example', 'dicovalence', 'entry', 'extract', 'simple', 'multiword_preposition', 'contain', 'well', 'associated', 'semantic', 'class']]


Let's work with the Spanish language

In [37]:
# Remove Stop Words
data_words_nostops_es = remove_stopwords(spanish_words,stop_words_es)

# Form Bigrams
data_words_bigrams_es = make_bigrams(data_words_nostops_es, 'es')

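# Initialize spacy 'es' model, keeping only tagger component (for efficiency)
# Note: lemmatization() as defined above calls the global `nlp` (the English pipeline loaded earlier),
# so nlp_es is loaded here but not used by the call below.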
nlp_es = spacy.load('es', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized_es = lemmatization(data_words_bigrams_es, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized_es[:1])

[['oro', 'premios_bafta', 'premios_emmy', 'premios_sag', 'premio', 'premios_cesar', 'premios_imdb', 'nacido', 'febrero', 'septiembre', 'http_www', 'filmreference', 'actor', 'nacio', 'actore', 'gilbert', 'direccion', 'raedler', 'despue', 'pasion', 'teatros', 'provinciale', 'restaurante', 'primera', 'television', 'monitor', 'sociocultural', 'amigo', 'hit', 'cancione', 'final', 'actor', 'final', 'lupertazzi', 'dominic_chianese', 'dominic_chianese']]


Let's work with the French language

In [38]:
# Remove Stop Words
data_words_nostops_fr = remove_stopwords(french_words,stop_words_fr)

# Form Bigrams
data_words_bigrams_fr = make_bigrams(data_words_nostops_fr, 'fr')

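# Initialize spacy 'fr' model, keeping only tagger component (for efficiency)
# Note: lemmatization() as defined above calls the global `nlp` (the English pipeline loaded earlier),
# so nlp_fr is loaded here but not used by the call below.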
nlp_fr = spacy.load('fr', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized_fr = lemmatization(data_words_bigrams_fr, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized_fr[:1])

[['preference', 'repa', 'preference', 'aimerait', 'moin', 'semantique', 'disjonction', 'modalite', 'libre', 'possible', 'agent']]


## 7 Create the Dictionary and Corpus needed for Topic Modelling

The two main inputs to the LDA topic model are the dictionary (**id2word**) and the **corpus**. Let’s create them.





In [39]:
# Create Dictionary. One for each language
id2word_en = corpora.Dictionary(data_lemmatized)
id2word_es = corpora.Dictionary(data_lemmatized_es)
id2word_fr = corpora.Dictionary(data_lemmatized_fr)


# Create Corpus. One for each Language
texts_en = data_lemmatized
texts_es = data_lemmatized_es
texts_fr = data_lemmatized_fr

# Term Document Frequency
corpus_en = [id2word_en.doc2bow(text) for text in texts_en]
corpus_es = [id2word_es.doc2bow(text) for text in texts_es]
corpus_fr = [id2word_fr.doc2bow(text) for text in texts_fr]

# Print an example
print(corpus_fr[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1)]]


In [40]:
print(corpus_es[:1])

[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]]


Gensim creates a unique id for each word in the document. The produced corpus shown above is a **mapping of (word_id, word_frequency).**

For example, (0, 2) above implies that word id 0 occurs twice in the first document. Likewise, (1, 1) means that word id 1 occurs once, and so on.

This is used as the **input to the LDA model**.
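
To make the (word_id, word_frequency) representation concrete, here is a small pure-Python sketch of the same idea. It is only an illustration: `build_vocab` and `to_bow` are hypothetical helpers, the two toy documents are invented, and ids are assigned in alphabetical order for readability, which is a simplification and not necessarily how Gensim's `Dictionary` numbers its tokens.

```python
from collections import Counter


def build_vocab(docs):
    """Map every distinct token to an integer id (alphabetical order, by assumption)."""
    tokens = sorted({tok for doc in docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_bow(doc, token2id):
    """Return the document as a sorted list of (word_id, word_frequency) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


# Invented toy documents, already tokenized and lemmatized
toy_docs = [["actor", "final", "actor", "amigo"], ["amigo", "teatro"]]

token2id = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]

print(token2id)
print(toy_corpus)
```

In the first toy document the pair for "actor" carries a frequency of 2 because the token appears twice, which is exactly how the pairs in `corpus_es` should be read.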
   

If you want to see what word a given id corresponds to, pass the id as a **key to the dictionary.**






In [41]:
id2word_es[2]

'amigo'

Or we can view the term frequencies of one document in human-readable form:



In [42]:
[[(id2word_es[word_id], freq) for word_id, freq in doc] for doc in corpus_es[:1]]


[[('actor', 2),
  ('actore', 1),
  ('amigo', 1),
  ('cancione', 1),
  ('despue', 1),
  ('direccion', 1),
  ('dominic_chianese', 2),
  ('febrero', 1),
  ('filmreference', 1),
  ('final', 2),
  ('gilbert', 1),
  ('hit', 1),
  ('http_www', 1),
  ('lupertazzi', 1),
  ('monitor', 1),
  ('nacido', 1),
  ('nacio', 1),
  ('oro', 1),
  ('pasion', 1),
  ('premio', 1),
  ('premios_bafta', 1),
  ('premios_cesar', 1),
  ('premios_emmy', 1),
  ('premios_imdb', 1),
  ('premios_sag', 1),
  ('primera', 1),
  ('provinciale', 1),
  ('raedler', 1),
  ('restaurante', 1),
  ('septiembre', 1),
  ('sociocultural', 1),
  ('teatros', 1),
  ('television', 1)]]

## 8 Building the Topic Model

Now we have everything we need to train our LDA model. In addition to the corpus and the dictionary, we need to provide the number of topics.

Apart from that, **alpha** and **eta** are hyperparameters that affect the sparsity of the topics. According to the Gensim docs, both default to a symmetric 1.0/num_topics prior; the model below sets alpha='auto', which learns an asymmetric prior from the corpus.

**chunksize** is the number of documents used in each training chunk, **update_every** determines how often the model parameters are updated, and **passes** is the total number of training passes.



### 8.1 Working only with the Spanish Language

As seen in the above results, Spanish is the language with the largest number of results in terms of bigrams and trigrams. There is no specific reason for that.

We will discuss later the possible reasons why Spanish gives better results and the things we could improve in this model. To simplify the notebook, from here on **we will continue with only one language** (Spanish, as we said).

In [43]:
# Build LDA model
lda_model_es = gensim.models.ldamodel.LdaModel(corpus=corpus_es,
                                           id2word=id2word_es,
                                           num_topics=20, 
                                           random_state=42,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

## 8 View the Topics in the LDA Model

The above LDA model is built with 20 different topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.



In [44]:
# Print the Keyword in the topics
pprint(lda_model_es.print_topics())  # pprint produces readable, nicely formatted views of data structures
doc_lda = lda_model_es[corpus_es]

[(0,
  '0.053*"sabe" + 0.018*"irresistible" + 0.017*"horrible" + 0.016*"morale" + '
  '0.015*"desagradable" + 0.014*"pariente" + 0.012*"debe" + 0.012*"conocer" + '
  '0.012*"loco" + 0.009*"presento"'),
 (1,
  '0.037*"extension" + 0.028*"brazo" + 0.017*"chiste" + 0.010*"cancer" + '
  '0.008*"trato" + 0.008*"exteriore" + 0.008*"singular" + 0.008*"indios" + '
  '0.007*"solitario" + 0.007*"vino"'),
 (2,
  '0.086*"usted" + 0.027*"palabra" + 0.014*"sere" + 0.013*"idea" + '
  '0.013*"amigo" + 0.012*"siempre" + 0.011*"imposible" + 0.010*"corte" + '
  '0.009*"sabio" + 0.009*"instante"'),
 (3,
  '0.093*"voz" + 0.040*"espanole" + 0.013*"despue" + 0.010*"vox" + '
  '0.010*"ustede" + 0.010*"felice" + 0.009*"conmigo" + 0.009*"prosiguio" + '
  '0.008*"leer" + 0.008*"cancione"'),
 (4,
  '0.033*"despue" + 0.018*"noche" + 0.017*"tenia" + 0.017*"iban" + '
  '0.016*"call" + 0.013*"nuevo" + 0.009*"francese" + 0.009*"creia" + '
  '0.009*"breve" + 0.009*"enorme"'),
 (5,
  '0.076*"pue" + 0.037*"tre" + 0.036*"

### 8.1 Interpret the data





Topic 0 is represented as 0.053 * *sabe* + 0.018 * *irresistible* + 0.017 * *horrible* + 0.016 * *morale* + 0.015 * *desagradable* + 0.014 * *pariente* + 0.012 * *debe* + 0.012 * *conocer* + 0.012 * *loco* + 0.009 * *presento*.

It means the top 10 keywords in this topic are:

sabe, irresistible, horrible, morale, desagradable, pariente, and so on.

0.053 is the weight that the word "sabe" has in the topic.

**The greater the weight, the more important the keyword is in the topic.**



| Weight in Topic | Word |
|-------|-----------|
| 0.053 | sabe |
| 0.018 | irresistible |
| 0.017 | horrible |
| 0.016 | morale |
| 0.015 | desagradable |
| 0.014 | pariente |
| 0.012 | debe |
| 0.012 | conocer |
| 0.012 | loco |
| 0.009 | presento |



In other words, the top 10 keywords that contribute to this topic are sabe, irresistible, horrible, and so on, and the weight of "morale" on topic 0 is 0.016.
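
The weighted-keyword string printed for topic 0 can also be split into (word, weight) pairs programmatically. The sketch below parses the string exactly as it appears in the output above; `parse_topic` is a hypothetical helper written for this illustration, not a Gensim function.

```python
import re


def parse_topic(topic_str):
    """Split a string like '0.053*"sabe" + 0.018*"irresistible"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]


# Topic 0 as printed by print_topics() above
topic_0 = ('0.053*"sabe" + 0.018*"irresistible" + 0.017*"horrible" + 0.016*"morale" + '
           '0.015*"desagradable" + 0.014*"pariente" + 0.012*"debe" + 0.012*"conocer" + '
           '0.012*"loco" + 0.009*"presento"')

keywords = parse_topic(topic_0)
for word, weight in keywords:
    print(f"{weight:.3f}  {word}")
```

When working with the model object directly, parsing is not needed: Gensim's `show_topic(topicid, topn)` method should return the same information as a list of (word, probability) pairs.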

## 9 Compute Model Perplexity and Coherence Score

Model perplexity and topic coherence provide a convenient measure to judge how good a given topic model is.   




  
**Perplexity**. In general, perplexity is a measurement of **how well a probability model predicts a sample**. In the context of Natural Language Processing, perplexity is one way to **evaluate language models**.

The best language model is the one that best predicts an unseen test set. Perplexity is the inverse probability of the test set, normalized by the number of words. [Detailed info and formulas](https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3)

**Coherence Score**. Topic coherence measures score a single topic by measuring the degree of semantic similarity between its high-scoring words. These measures help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference.

The coherence measure we will use is c_v. The c_v measure is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity.
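
As a rough illustration of the NPMI building block only, the sketch below scores word pairs from document-level co-occurrence. It is not the full c_v pipeline: there is no sliding window, no segmentation and no cosine-similarity step. The function name `npmi` and the toy documents are assumptions made for this example.

```python
import math


def npmi(word_a, word_b, docs):
    """Normalized PMI of two words, with probabilities estimated as document frequencies."""
    n_docs = len(docs)
    p_a = sum(word_a in doc for doc in docs) / n_docs
    p_b = sum(word_b in doc for doc in docs) / n_docs
    p_ab = sum(word_a in doc and word_b in doc for doc in docs) / n_docs
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0   # the words co-occur in every document
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


# Invented toy corpus: each document is a set of lemmas
toy_docs = [
    {"actor", "teatro"},
    {"actor", "teatro", "premio"},
    {"premio", "oro"},
    {"oro", "noche"},
]

for pair in [("actor", "teatro"), ("actor", "premio"), ("actor", "oro")]:
    print(pair, round(npmi(pair[0], pair[1], toy_docs), 3))
```

NPMI ranges from -1 (the words never appear together) through 0 (they co-occur as often as independence would predict) to 1 (they always appear together), so a topic whose top words have high pairwise NPMI tends to read as more coherent.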

In [49]:
# Compute Perplexity
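print('\nPerplexity: ', lda_model_es.log_perplexity(corpus_es))  # log_perplexity returns the per-word likelihood bound, not the perplexity itself; gensim estimates perplexity as 2**(-bound), and a lower perplexity is better.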

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model_es, texts=data_lemmatized_es, dictionary=id2word_es, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -10.698730047120495

Coherence Score:  0.42470641754696103
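
The negative number printed as "Perplexity" is the value returned by `log_perplexity`, which the Gensim documentation describes as a per-word likelihood bound; the same documentation gives the perplexity estimate as 2 raised to the negative bound. A minimal sketch of that conversion, using the value printed above (`bound_to_perplexity` is a hypothetical helper name):

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate, 2 ** (-bound)."""
    return 2 ** (-per_word_bound)


# Value printed by lda_model_es.log_perplexity(corpus_es) above
bound = -10.698730047120495

perplexity = bound_to_perplexity(bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

Because of the sign flip, a bound closer to zero corresponds to a lower perplexity, which is the direction that indicates a better fit.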


## 9 Visualize the topics-keywords

Now that the LDA model is built, the next step is to examine the produced topics and the associated keywords. The pyLDAvis package provides an interactive chart for this purpose and is designed to work well with interactive notebooks.




In [50]:
# Visualize the topics
# Note: recent pyLDAvis releases expose this module as pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model_es, corpus_es, id2word_es)
vis


Some conclusions about this visualization. On the left side, we can see bubbles of different sizes. Each bubble represents a topic. As we can imagine, **the larger the bubble, the more frequent that topic is**.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.

A model with too many topics, which is what happens in our case, will typically have many overlapping, small bubbles clustered in one region of the chart.

If you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.

We have successfully built a first topic model, although the overlap noted above suggests that the number of topics could still be tuned.





## 9 Things to Test in this Area

Here's a list of things we could try in this model. I'm sure some of them would improve the quality of the output.

1. **Multilingual Document Classification**: The idea here would be to build a language-agnostic NLP application, able to train a document classifier on the dataset of one language and generalize its predictions to datasets in other languages.
2. **Improve the Lemmatization**: The lemmatization process for Spanish is not well done. Unlike the English lemmatizer, the spaCy Spanish lemmatizer does not use PoS tagging information. Instead, it picks the first match in a list of inflected verbs and lemmas, e.g. ideo → idear, ideas → idear, idea → idear, ideamos → idear, etc.
3. **Build an LDA Mallet Model**: We have seen Gensim's built-in version of the LDA algorithm. Mallet's version often gives better-quality topics.
4. **Find the optimal number of topics for LDA**: To do that, we could build many LDA models with different numbers of topics (k) and pick the one that gives the highest coherence value.
5. **Find the dominant topic in each document**: One of the practical applications of topic modeling is to determine what topic a given document is about. To find that, we look for the topic number that has the highest percentage contribution in that document.
6. **Modify parameters in the Gensim model**: By increasing min_count and threshold, we could obtain more solid topics for Spanish.
7. **Check other algorithms, like LSA**.
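
Two of the ideas in the list above, choosing the number of topics by coherence and finding the dominant topic of a document, reduce to a simple argmax. The sketch below shows that logic in plain Python. The helper names `sweep_num_topics` and `dominant_topic` are hypothetical, and the scores and topic distribution are invented toy values, not results from this notebook.

```python
def sweep_num_topics(candidate_ks, train_and_score):
    """Score every candidate k and return (best_k, all_scores).

    train_and_score is a callable taking k and returning a coherence value.
    """
    scores = {k: train_and_score(k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores


def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


# Invented coherence values standing in for real training runs
toy_coherence = {5: 0.1, 10: 0.3, 15: 0.2}
best_k, scores = sweep_num_topics([5, 10, 15], toy_coherence.get)
print("Best k:", best_k)

# Invented topic distribution for a single document
toy_doc_topics = [(0, 0.10), (3, 0.65), (7, 0.25)]
print("Dominant topic:", dominant_topic(toy_doc_topics))
```

In the notebook, `train_and_score(k)` would presumably train an `LdaModel` with `num_topics=k` and return `CoherenceModel(...).get_coherence()`, while the (topic_id, probability) pairs for a document could be obtained with `lda_model_es.get_document_topics(bow)`.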

