# EA Assignment 06 - Topic Modelling Analysis
__Authored by: Álvaro Bartolomé del Canto (alvarobartt @ GitHub)__

---

<img src="https://media-exp1.licdn.com/dms/image/C561BAQFjp6F5hjzDhg/company-background_10000/0?e=2159024400&v=beta&t=OfpXJFCHCqdhcTu7Ud-lediwihm0cANad1Kc_8JcMpA">

This Jupyter Notebook starts with a short recap of the previous one, `02 - Data Preprocessing.ipynb`, where we defined the NLP preprocessing pipeline that turns the raw input text into text suitable for modelling. At the end of that notebook we dumped the resulting `pandas.DataFrame` into a JSON Lines file, so that the data does not have to be preprocessed again in every notebook. This notebook therefore starts by loading the preprocessed data; for details on how the data is preprocessed, refer to the previous notebook.

## Loading PreProcessed Data

__Reproducibility Warning__: you will not find the `PreProcessedDocuments.jsonl` file when cloning the repository from GitHub, since it is listed in the `.gitignore` file because of GitHub's size limits for large files. To reproduce this Jupyter Notebook, first run `02 - Data Preprocessing.ipynb`, where the NLP preprocessing pipeline is explained and this file is generated.
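As a reminder of what a JSON Lines file looks like, here is a minimal round-trip sketch using only the standard library. The records and the `example.jsonl` file name are made up for illustration; the actual dump is produced in the previous notebook and may be written differently.

```python
import json
import os
import tempfile

# Hypothetical records standing in for rows of the preprocessed DataFrame.
records = [
    {"lang": "en", "context": "wikipedia", "preprocessed_text": "watchmen comic book"},
    {"lang": "fr", "context": "wikipedia", "preprocessed_text": "bande dessinee"},
]

# JSON Lines: one self-contained JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading it back is a matter of parsing each non-empty line separately.
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]
```

Because each line is parsed independently, the file can be read as a stream without loading one large JSON document into memory.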

In [1]:
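import json

# Each line of the JSON Lines file holds one preprocessed document.
data = list()

with open('PreProcessedDocuments.jsonl', 'r') as f:
    # Iterate over the file lazily instead of reading all lines at once.
    for line in f:
        data.append(json.loads(line))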

In [2]:
import pandas as pd

data = pd.DataFrame(data)
data.head()

| | lang | context | preprocessed_text |
|---|---|---|---|
| 0 | en | wikipedia | watchmen twelve issue comic book limited serie... |
| 1 | en | wikipedia | citigroup center formerly citicorp center tall... |
| 2 | en | wikipedia | birth_place death_date death_place party conse... |
| 3 | en | wikipedia | marbod maroboduus born died king marcomanni no... |
| 4 | en | wikipedia | sylvester medal bronze medal awarded every yea... |


In [3]:
data.shape

(23011, 3)

In [4]:
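# Keep only the English Wikipedia texts; .copy() avoids a SettingWithCopyWarning
# when a new column is added to the filtered DataFrame later on.
data = data[(data['lang'] == 'en') & (data['context'] == 'wikipedia')].copy()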
data.shape

(4000, 3)

In [5]:
data['tokenized_text'] = data['preprocessed_text'].str.split(' ')
data.head()

| | lang | context | preprocessed_text | tokenized_text |
|---|---|---|---|---|
| 0 | en | wikipedia | watchmen twelve issue comic book limited serie... | [watchmen, twelve, issue, comic, book, limited... |
| 1 | en | wikipedia | citigroup center formerly citicorp center tall... | [citigroup, center, formerly, citicorp, center... |
| 2 | en | wikipedia | birth_place death_date death_place party conse... | [birth_place, death_date, death_place, party, ... |
| 3 | en | wikipedia | marbod maroboduus born died king marcomanni no... | [marbod, maroboduus, born, died, king, marcoma... |
| 4 | en | wikipedia | sylvester medal bronze medal awarded every yea... | [sylvester, medal, bronze, medal, awarded, eve... |


---

## Tackling the Topic Modelling

To tackle this problem, we will create one Topic Modelling model for each unique combination of 'context' and 'language', since we do not have a proper Machine Translation model to translate/unify all the text in the dataset into English. Applying the same Topic Modelling model to data written in different languages is of little use, because texts on the same topic and in the same context share hardly any vocabulary across languages.

For example, two Wikipedia texts on the topic "Historical Events", one written in English and the other in French, will have hardly any words/tokens in common except for language-independent ones such as names, surnames, etc.
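The point can be illustrated with a small token-overlap check. This is only a sketch: the three token lists below are invented, and the Jaccard index is just one convenient way to measure shared vocabulary, not something used elsewhere in this notebook.

```python
def jaccard(tokens_a, tokens_b):
    """Share of distinct tokens that two documents have in common."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up, already preprocessed texts about the same historical event.
english = "napoleon defeated battle waterloo french army retreated".split()
french = "napoleon vaincu bataille waterloo armee francaise retiree".split()
# A second made-up English text about a related event.
other_english = "french army won battle austerlitz napoleon commanded".split()

# Only the proper nouns survive across languages.
shared_across_languages = set(english) & set(french)

cross_language_overlap = jaccard(english, french)
same_language_overlap = jaccard(english, other_english)
```

In this toy case the two English texts overlap far more than the English/French pair, even though all three cover the same subject area.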

We will therefore test some commonly used Topic Modelling algorithms on the data split by context and language, in order to estimate a suitable number of topics from the main words of each topic and then visualize the topics in a 2D plot.
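A minimal sketch of that split, written with plain Python records so it runs on its own. The rows and the `other_context` label are hypothetical; on the actual `pandas.DataFrame`, grouping on the `context` and `lang` columns (for instance with `groupby`) would play the same role.

```python
from collections import defaultdict

# Hypothetical rows mimicking the columns of the preprocessed data.
records = [
    {"lang": "en", "context": "wikipedia", "preprocessed_text": "watchmen comic book"},
    {"lang": "en", "context": "wikipedia", "preprocessed_text": "citigroup center tower"},
    {"lang": "fr", "context": "wikipedia", "preprocessed_text": "bande dessinee"},
    {"lang": "en", "context": "other_context", "preprocessed_text": "some other text"},
]

# Collect the tokenized documents of every (context, language) combination.
groups = defaultdict(list)
for record in records:
    key = (record["context"], record["lang"])
    groups[key].append(record["preprocessed_text"].split(" "))

# Each group would then get its own dictionary, corpus and topic model.
group_sizes = {key: len(docs) for key, docs in groups.items()}
```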

---

## Topic Modelling with LDA

Latent Dirichlet Allocation ...
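As a rough intuition for the generative story that LDA assumes, the sketch below draws a per-document topic mixture from a Dirichlet distribution and then picks a topic and a word for every position. The two topics, their word probabilities and the `alpha` values are invented for illustration, and this is not how gensim fits the model: fitting runs the process in reverse, inferring topics from observed documents.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, doc_alpha, length, rng):
    """Generate one document following the LDA generative process."""
    # 1. Draw the document's mixture over topics.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this position according to the mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each maps words to probabilities.
topic_word = [
    {"film": 0.5, "album": 0.3, "band": 0.2},
    {"city": 0.5, "war": 0.3, "century": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(topic_word, doc_alpha=[0.5, 0.5], length=8, rng=rng)
```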


__Note__: since we are just testing the most common Topic Modelling algorithms, we will be using just the Wikipedia texts written in English. 

In [6]:
import gensim

In [7]:
id2word = gensim.corpora.Dictionary(data['tokenized_text'])
list(id2word.token2id.items())[:5]

[('abandon', 0),
 ('abbreviated', 1),
 ('abc', 2),
 ('abilities', 3),
 ('ability', 4)]

In [8]:
# Drop tokens that appear in fewer than 10 documents or in more than 5% of all documents.
id2word.filter_extremes(no_below=10, no_above=0.05)
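The effect of `no_below` and `no_above` can be sketched in pure Python as a filter on document frequency. This is a simplified illustration rather than gensim's implementation: it ignores the `keep_n` cap, and `filter_by_document_frequency` is a hypothetical helper name.

```python
from collections import Counter

def filter_by_document_frequency(documents, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    num_docs = len(documents)
    # Count in how many documents each token occurs (not how often overall).
    doc_freq = Counter(token for doc in documents for token in set(doc))
    max_docs = no_above * num_docs
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus: "used" is in every document, several tokens occur only once.
docs = [
    ["film", "music", "used"],
    ["film", "band", "used"],
    ["city", "war", "used"],
    ["city", "county", "used"],
]

kept = filter_by_document_frequency(docs, no_below=2, no_above=0.5)
```

Here the ubiquitous token `used` and the one-off tokens are dropped, which leaves only `film` and `city`.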

In [9]:
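# Convert each document into a bag-of-words: a list of (token_id, count) pairs.
# Note: allow_update=True adds any token missing from the dictionary back into it,
# so tokens removed by filter_extremes above are re-added here.
corpus = [
    id2word.doc2bow(document=text, allow_update=True) for text in data['tokenized_text']
]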
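For readers unfamiliar with the bag-of-words format, here is a minimal pure-Python sketch of what such a conversion produces. `doc2bow_sketch` and the tiny `token2id` mapping are hypothetical; the sketch skips tokens that are not in the mapping, which mirrors a dictionary that is not being updated.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Turn a token list into sorted (token_id, count) pairs,
    ignoring tokens that are not in the mapping."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

# Hypothetical mapping from tokens to integer ids.
token2id = {"film": 0, "city": 1}

# "used" is not in the mapping, so it does not appear in the result.
bow = doc2bow_sketch(["film", "used", "film", "city"], token2id)
```

Word order is discarded in this representation; only the identity and count of each known token remain.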

In [10]:
%time lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=5)

CPU times: user 2min 9s, sys: 4min 21s, total: 6min 31s
Wall time: 45.9 s


In [11]:
from pprint import pprint
pprint(lda.print_topics(num_words=5))

[(0,
  '0.003*"species" + 0.003*"game" + 0.003*"team" + 0.003*"may" + '
  '0.003*"season"'),
 (1, '0.005*"used" + 0.003*"may" + 0.003*"use" + 0.002*"water" + 0.002*"form"'),
 (2,
  '0.005*"film" + 0.005*"album" + 0.004*"music" + 0.003*"band" + '
  '0.003*"american"'),
 (3,
  '0.004*"county" + 0.003*"used" + 0.003*"system" + 0.002*"software" + '
  '0.002*"city"'),
 (4,
  '0.004*"city" + 0.003*"world" + 0.003*"war" + 0.002*"time" + '
  '0.002*"century"')]


In [12]:
# lda.save('resources/lda/model')

In [18]:
# Note: recent pyLDAvis releases renamed this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [19]:
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)

In [20]:
vis = pyLDAvis.display(vis, template_type='notebook')
vis