# EA Assignment 06 - Topic Modelling Analysis
__Authored by: Álvaro Bartolomé del Canto (alvarobartt @ GitHub)__

---

<img src="https://media-exp1.licdn.com/dms/image/C561BAQFjp6F5hjzDhg/company-background_10000/0?e=2159024400&v=beta&t=OfpXJFCHCqdhcTu7Ud-lediwihm0cANad1Kc_8JcMpA">

As explained in the previous Jupyter Notebook, `research/05 - Topic Modelling.ipynb`, LDA was selected as the Topic Modelling algorithm because it performs well and has a visualization library that is very useful when analysing the topics identified across the documents.

Note also that the previous Jupyter Notebook applied LDA only to the English Wikipedia texts, whereas here we need to extend that research to every unique context-language pair; the number of topics will therefore be specific to each pair, and the results will be analysed independently.

## Loading PreProcessed Data

__Reproducibility Warning__: you will not find the `PreProcessedDocuments.jsonl` file when cloning the repository from GitHub, since it is listed in the `.gitignore` file due to GitHub's quotas on large files. If you want to reproduce this Jupyter Notebook, please refer to `02 - Data Preprocessing.ipynb`, where the NLP preprocessing pipeline is explained and this file is generated.

In [11]:
import json

data = list()

with open('PreProcessedDocuments.jsonl', 'r') as f:
    for line in f.readlines():
        data.append(json.loads(line))
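
The loading cell above reads the whole file with `readlines()` and assumes the file exists. As a minimal alternative sketch, assuming one JSON object per line and UTF-8 encoding (both assumptions, not confirmed by the preprocessing notebook), the file can be streamed and a clearer error raised when it is missing. The helper names `parse_jsonl` and `load_jsonl` are hypothetical and not part of this project:

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_jsonl(path):
    """Load a .jsonl file, failing with a helpful message if it is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; it is git-ignored, so it must be generated first"
        )
    # Iterating over the file object streams it line by line
    with path.open("r", encoding="utf-8") as f:
        return parse_jsonl(f)
```

With this sketch, `load_jsonl('PreProcessedDocuments.jsonl')` would return the same list of dictionaries that the cell above builds.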

In [12]:
import pandas as pd

data = pd.DataFrame(data)
data.head()

|   | lang | context | preprocessed_text |
|---|------|---------|-------------------|
| 0 | en | wikipedia | watchmen twelve issue comic book limited serie... |
| 1 | en | wikipedia | citigroup center formerly citicorp center tall... |
| 2 | en | wikipedia | birth_place death_date death_place party conse... |
| 3 | en | wikipedia | marbod maroboduus born died king marcomanni no... |
| 4 | en | wikipedia | sylvester medal bronze medal awarded every yea... |


In [13]:
data.shape

(23011, 3)

In [14]:
combinations = data[['lang', 'context']].drop_duplicates()
combinations = [(row['lang'], row['context']) for index, row in combinations.iterrows()]
combinations

[('en', 'wikipedia'),
 ('es', 'wikipedia'),
 ('fr', 'wikipedia'),
 ('en', 'conference_papers'),
 ('fr', 'conference_papers'),
 ('en', 'apr'),
 ('fr', 'apr'),
 ('en', 'pan11'),
 ('es', 'pan11')]
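
To see how many documents fall into each pair (the shapes reported in the sections below), a small sketch using only the standard library can help. The `records` list is toy data standing in for the rows of `data`, and `toy_combinations` is a hypothetical name chosen to avoid overwriting the notebook's `combinations` variable:

```python
from collections import Counter

# Toy stand-in for the notebook's rows (hypothetical values)
records = [
    {"lang": "en", "context": "wikipedia"},
    {"lang": "en", "context": "wikipedia"},
    {"lang": "fr", "context": "apr"},
]

# Counter keeps the pairs in first-seen order, like drop_duplicates()
pair_counts = Counter((r["lang"], r["context"]) for r in records)
toy_combinations = list(pair_counts)

for (lang, context), n_docs in pair_counts.items():
    print(f"{lang}/{context}: {n_docs} documents")
```

On the real DataFrame, `data.groupby(['lang', 'context']).size()` should give the same counts, although in sorted rather than first-seen order.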

In [18]:
data['tokenized_text'] = data['preprocessed_text'].str.split(' ')
data.head()

|   | lang | context | preprocessed_text | tokenized_text |
|---|------|---------|-------------------|----------------|
| 0 | en | wikipedia | watchmen twelve issue comic book limited serie... | [watchmen, twelve, issue, comic, book, limited... |
| 1 | en | wikipedia | citigroup center formerly citicorp center tall... | [citigroup, center, formerly, citicorp, center... |
| 2 | en | wikipedia | birth_place death_date death_place party conse... | [birth_place, death_date, death_place, party, ... |
| 3 | en | wikipedia | marbod maroboduus born died king marcomanni no... | [marbod, maroboduus, born, died, king, marcoma... |
| 4 | en | wikipedia | sylvester medal bronze medal awarded every yea... | [sylvester, medal, bronze, medal, awarded, eve... |
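
One detail of the tokenization cell is worth illustrating: splitting on a literal `' '` produces empty-string tokens wherever the text contains repeated or trailing spaces, while splitting without an argument does not. Whether the preprocessed text actually contains such spaces is not known here, so this is only a sketch on a toy string:

```python
text = "watchmen  twelve issue "  # note the double and the trailing space

tokens_explicit = text.split(" ")  # splits on every single space
tokens_default = text.split()      # splits on runs of whitespace

print(tokens_explicit)
print(tokens_default)
```

If empty tokens did show up in the dictionary, calling `data['preprocessed_text'].str.split()` without an argument would be one way to avoid them.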


---

## Topic Modelling with LDA

To tackle this problem, we will create one LDA model for each unique combination of 'context' and 'language'. Without a proper Machine Translation model to translate all the text in the dataset into English, applying a single Topic Modelling model to data written in different languages is of little use: texts on the same topic in the same context but in different languages would share little vocabulary, so the model would not relate them to each other.

Throughout this section we will therefore fit a specific LDA model per language-context pair in order to discover the hidden topics and explore the data in detail.

In [19]:
from pprint import pprint

We will use `gensim` to fit the LDA model and `pyLDAvis` to visualize the results.

In [20]:
import gensim

import pyLDAvis.gensim
pyLDAvis.enable_notebook()

__Reproducibility Warning__: if you re-run any of the fitted LDA models, you will lose the current results on which the analysis is based, and the new results may differ, since LDA training is stochastic and no random seed is fixed in the cells below. Please take this into consideration before proceeding.
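
For future runs, one way to reduce this variability is to fix the random seed and persist the fitted model. The sketch below is not what this notebook did (the models analysed here were fitted without a seed, so it cannot recover the exact topics reported below). `fit_lda`, `seed` and `save_path` are hypothetical names, and the seed value 42 is arbitrary; `random_state`, `save` and `load` are part of the `gensim` `LdaModel` API:

```python
def fit_lda(corpus, id2word, num_topics, passes=5, seed=42, save_path=None):
    """Fit an LDA model with a fixed seed and optionally persist it."""
    # Imported here so the helper can be defined without loading gensim
    from gensim.models.ldamodel import LdaModel

    lda = LdaModel(
        corpus=corpus,
        id2word=id2word,
        num_topics=num_topics,
        passes=passes,
        random_state=seed,  # fixes the random initialisation
    )
    if save_path is not None:
        lda.save(save_path)  # reload later with LdaModel.load(save_path)
    return lda
```

Even with a fixed seed, results may still differ across `gensim` versions or platforms, so saving the fitted model is the safer way to preserve an analysed result.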

## Wikipedia LDA

### English

In [42]:
aux = data[(data['lang'] == 'en') & (data['context'] == 'wikipedia')]
aux.shape

(4000, 4)

In [43]:
id2word = gensim.corpora.Dictionary(aux['tokenized_text'])
list(id2word.token2id.items())[:5]

[('abandon', 0),
 ('abbreviated', 1),
 ('abc', 2),
 ('abilities', 3),
 ('ability', 4)]

In [44]:
id2word.filter_extremes(no_below=10)

In [45]:
corpus = [
    id2word.doc2bow(document=text, allow_update=True) for text in aux['tokenized_text']
]
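
To make the filtering step more concrete, the sketch below reimplements the document-frequency rule in plain Python. It is an illustration of the idea, not `gensim`'s code: `filter_vocab` is a hypothetical helper, and it ignores the `keep_n` cap. Note that `filter_extremes(no_below=10)` leaves `no_above` at its default (0.5 according to the `gensim` documentation), so tokens present in more than half of the documents are dropped as well.

```python
from collections import Counter


def filter_vocab(docs, no_below=5, no_above=0.5):
    """Keep tokens found in >= no_below docs and <= no_above fraction of docs."""
    # Document frequency: count each token once per document
    dfs = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in dfs.items() if no_below <= df <= max_docs}


docs = [["film", "music"], ["film", "war"], ["film", "war"], ["king", "war"]]
print(sorted(filter_vocab(docs, no_below=2, no_above=0.8)))
```

A related caveat: as far as the `gensim` documentation describes, `doc2bow(..., allow_update=True)` adds unseen tokens to the dictionary, so tokens removed by `filter_extremes` may be re-added while the corpus is built. Comparing `len(id2word)` right after filtering and again after building `corpus` would show whether that happens here.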

In [46]:
%time lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=5)

CPU times: user 2min 13s, sys: 4min 29s, total: 6min 43s
Wall time: 47.8 s


In [47]:
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)

__Note__: if you are previewing the static Jupyter Notebook, you will not be able to see the LDA plot: it is rendered dynamically with D3.js, and its state is not kept in the uploaded Notebook. To interact with the LDA plot, please refer to `research/06 - Topic Modelling Analysis.html`, which you can open in your web browser.

In [20]:
vis = pyLDAvis.display(vis, template_type='notebook')
vis

__TOPICS__:

1. __Politics/History__: this appears to be a politics and/or history topic, since its main words include city, world, war, government, century, etc.
2. __Music/Movies/Entertainment__: this topic also seems quite clear, since some of the main words of the documents assigned to it are film, album, music, band, song, released, rock, etc.
3. __Industry/Research/Chemistry__: this topic lies far apart from the others and is probably the most uncertain one, since it mixes words related to industry, research, chemistry, etc.; still, because it is clearly separated from all the other topics, a label can be inferred.
4. __Sports/Games__: although this is not a large topic, it is one of the clearest, since it contains many sports-related words such as nba, football, player, game, tennis, etc.
5. __Technology/Software__: this is the smallest topic, but its most relevant words make it quite clear that it is about technology and software.

__Note__: since pyLDAvis lets us adjust the term relevance dynamically, we use its relevance setting to identify the topics better; some topics may be unclear at the default value of 1.0, but lowering it may reveal new hidden topics.
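
To illustrate why changing this setting reorders the terms, the sketch below computes the relevance score as defined in the LDAvis method, where the weight `lam` (λ) balances a term's probability within the topic against its lift over the whole corpus. The function name and the probabilities are toy assumptions, not values taken from the models in this notebook:

```python
import math


def relevance(p_term_given_topic, p_term, lam=1.0):
    """Relevance of a term for a topic, following the LDAvis definition."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


# A frequent generic term vs. a rarer but topic-specific term (toy numbers)
generic = {"p_term_given_topic": 0.05, "p_term": 0.05}
specific = {"p_term_given_topic": 0.02, "p_term": 0.002}

for lam in (1.0, 0.3):
    print(lam, relevance(lam=lam, **generic), relevance(lam=lam, **specific))
```

At `lam=1.0` the generic term ranks higher because only its in-topic probability counts; at `lam=0.3` the topic-specific term overtakes it, which is the effect used throughout this analysis to surface more distinctive terms.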

### French

As described in `research/05 - Topic Modelling.ipynb`, we found 5 main topics across the English Wikipedia documents (also described above), so we will use the same number of topics as our starting point, even though the results may vary. Any additional change or improvement will be reported in this sub-section.

In [54]:
aux = data[(data['lang'] == 'fr') & (data['context'] == 'wikipedia')]
aux.shape

(5588, 4)

In [55]:
id2word = gensim.corpora.Dictionary(aux['tokenized_text'])
list(id2word.token2id.items())[:5]

[('acces', 0),
 ('ainsi', 1),
 ('alcelaphus', 2),
 ('allemande', 3),
 ('ancienne', 4)]

In [56]:
id2word.filter_extremes(no_below=10)

In [57]:
corpus = [
    id2word.doc2bow(document=text, allow_update=True) for text in aux['tokenized_text']
]

In [58]:
%time lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=5)

CPU times: user 1min 50s, sys: 3min 46s, total: 5min 37s
Wall time: 39.1 s


In [59]:
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)

__Note__: as noted for the first LDA plot, the interactive visualization is not rendered in the static Jupyter Notebook preview; to interact with it, please refer to `research/06 - Topic Modelling Analysis.html`.

In [60]:
vis = pyLDAvis.display(vis, template_type='notebook')
vis

__TOPICS__:

1. __Music/Films/Entertainment__: this is the largest topic, competing with the 2nd one explained below, and it contains many words related mainly to music and films, along with some other words related to the entertainment industry in general.

2. __History__: this topic contains texts about historical events and people, since its main words are closely tied to both, such as roi (king), guerre (war), saint (saint), armee (army), etc., which makes a lot of sense for Wikipedia content.

3. __City/Lifetime__: this topic is somewhat confusing, since it is hard to assign to a single category, but its main words reveal that it is mainly related to city objects, places, etc., plus some other lifetime-related words, so it is both city-based and geo-based.

4. __Politics__: this is also one of the easiest topics to identify, since its main words are strongly related to politics regardless of the term relevance setting, even though it overlaps with the History topic.

5. __Sports__: this topic also seems clear, since its main words, whatever the term relevance setting, are related to sports, mainly football, and to subtopics such as the French league or the Champions Cup.

We noticed a small difference between this model and the English one. It is not in the number of topics, since after trying different values we ended up with the same number of topics as for the English texts. Rather, the Technology/Software topic identified for the English texts is not present across the French documents, while a new City/Lifetime topic, described above, appears instead.

### Spanish

As stated for the French Wikipedia texts, we will start with the same number of topics used for the English Wikipedia in the POC Jupyter Notebook, and tune every parameter to reach the result in which the hidden topics can best be identified. This sub-section is therefore also subject to changes, which will be mentioned where they apply.

In [78]:
aux = data[(data['lang'] == 'es') & (data['context'] == 'wikipedia')]
aux.shape

(4000, 4)

In [79]:
id2word = gensim.corpora.Dictionary(aux['tokenized_text'])
list(id2word.token2id.items())[:5]

[('abril', 0),
 ('africa', 1),
 ('alectroenas', 2),
 ('alguna', 3),
 ('allaudi', 4)]

In [80]:
id2word.filter_extremes(no_below=10)

In [81]:
corpus = [
    id2word.doc2bow(document=text, allow_update=True) for text in aux['tokenized_text']
]

In [82]:
%time lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=5)

CPU times: user 1min 17s, sys: 2min 25s, total: 3min 42s
Wall time: 29.9 s


In [83]:
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)

__Note__: as noted for the first LDA plot, the interactive visualization is not rendered in the static Jupyter Notebook preview; to interact with it, please refer to `research/06 - Topic Modelling Analysis.html`.

In [84]:
vis = pyLDAvis.display(vis, template_type='notebook')
vis

__TOPICS__:

1. __History__: this topic is consistently one of the clearest, since it contains very specific words related to historical events and people, such as gobierno (government), guerra (war), rey (king), ejercito (army), imperio (empire), etc., which clearly belong to the History topic.

2. __Technology/Science__: this may be the most ambiguous topic, since it contains top terms from different categories. Those terms can nevertheless be placed under Technology or Science: the topic combines words like celulas (cells), especie (species), carbono (carbon), etc., which are related to Science (and perhaps to Wildlife too), with words like windows (windows), sistema (system), software (software), sistemas (systems), copyleft (copyleft), etc., which are strongly related to Technology (and which, under a wider understanding of Science, could be classified into the same topic).

3. __Music/Films/Entertainment__: also among the clearest and largest topics, containing words related to the entertainment industry, mainly music and films, although some other subtopics are also present.

4. __Sports__: this is also one of the clearest topics; although it is probably the smallest, it lies far away from the other topics and contains terms related to sports, mainly football.

5. __Religion/Spirituality__: this seems to be a new hidden topic, since it was not present across the French and English texts, and it contains many words related to both religion and spirituality, such as espiritu (spirit), dios (god), budismo (Buddhism), santo (saint), bautismo (baptism), etc.

In this model a new hidden topic appeared, Religion/Spirituality, which contains terms not seen among the documents of the same context in the other languages.

### Conclusion

To sum up, the topics identified across the Wikipedia context were History/Politics, Sports, Music/Films/Entertainment, Technology/Science and Religion/Spirituality. The results were quite similar across languages, but some minor differences were spotted, so some of the identified hidden topics were refined per language.

---

## APR

In this case, the Amazon Product Reviews (APR) Topic Modelling was tested with many different combinations, but after inspecting the data filenames we noticed that the reviews belong to 3 different categories: Books, DVDs and Music. Once that feature was spotted, the tuning process was reduced to the BoW filtering, since we can infer that the number of topics to model for both French and English is 3.

__Note__: this is the only case with this particular feature, since it is the only dataset that contained tagged labels; in every other Topic Modelling problem the parameters were adjusted manually through repeated testing.
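
Since the categories are known for this dataset, one way to check how well the topics line up with them is a small contingency table of label versus dominant topic. The sketch below uses toy values only: the `labels` list (as it might be parsed from the filenames) and the `dominant_topics` list (as it might be derived from `lda.get_document_topics`) are both hypothetical, since neither is available in this notebook's DataFrame.

```python
from collections import Counter

# Hypothetical inputs: one filename-derived label and one dominant topic id per review
labels = ["books", "books", "dvd", "music", "music", "dvd"]
dominant_topics = [0, 0, 1, 2, 2, 0]

# Count how often each (label, topic) pair occurs
table = Counter(zip(labels, dominant_topics))

for label in sorted(set(labels)):
    row = {topic: table[(label, topic)] for topic in sorted(set(dominant_topics))}
    print(label, row)
```

A row concentrated in a single topic would suggest that the model recovered that category; a row spread over several topics, like `dvd` in this toy data, would point to the kind of confusion discussed for the French reviews below.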

### English

In [107]:
aux = data[(data['lang'] == 'en') & (data['context'] == 'apr')]
aux.shape

(3540, 4)

In [108]:
id2word = gensim.corpora.Dictionary(aux['tokenized_text'])
list(id2word.token2id.items())[:5]

[('album', 0), ('attract', 1), ('audience', 2), ('explosive', 3), ('fans', 4)]

In [109]:
id2word.filter_extremes(no_below=10)

In [110]:
corpus = [
    id2word.doc2bow(document=text, allow_update=True) for text in aux['tokenized_text']
]

As already mentioned, we will use 3 topics as the final solution: fewer topics lead to an error where one topic mixes diverse data, and the data filenames indicate that there are just 3 different APR categories.

In [111]:
%time lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=5)

CPU times: user 10.8 s, sys: 5.51 ms, total: 10.8 s
Wall time: 10.8 s


In [112]:
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)

__Note__: as noted for the first LDA plot, the interactive visualization is not rendered in the static Jupyter Notebook preview; to interact with it, please refer to `research/06 - Topic Modelling Analysis.html`.

In [113]:
vis = pyLDAvis.display(vis, template_type='notebook')
vis

__TOPICS__:

1. __Books__: the top terms of the first topic are strongly related to books and reading, including book, author, read, story, reading, etc., which makes it easy to infer that this is the Books topic.

2. __Films/Series__: this is perhaps the most ambiguous topic, since it is not that far from the other topics; still, it is far enough, and its large diameter suggests that many documents fall into it. We can infer that it is about films and series from top terms such as series, dvd, film, season, movie, episodes, etc.

3. __Music__: this topic was also identified cleanly, since its words match the Music category well; some examples are album, group, artist, rock, music, song, guitar, etc.

### French

In [114]:
aux = data[(data['lang'] == 'fr') & (data['context'] == 'apr')]
aux.shape

(2367, 4)

In [115]:
id2word = gensim.corpora.Dictionary(aux['tokenized_text'])
list(id2word.token2id.items())[:5]

[('associate', 0), ('bernard', 1), ('blurt', 2), ('classic', 3), ('fair', 4)]

In [116]:
id2word.filter_extremes(no_below=10)

In [117]:
corpus = [
    id2word.doc2bow(document=text, allow_update=True) for text in aux['tokenized_text']
]

As already mentioned, we will use 3 topics as the final solution: fewer topics lead to an error where one topic mixes diverse data, and the data filenames indicate that there are just 3 different APR categories.

In [121]:
%time lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)

CPU times: user 10.3 s, sys: 8.17 ms, total: 10.3 s
Wall time: 10.3 s


In [122]:
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)

__Note__: as noted for the first LDA plot, the interactive visualization is not rendered in the static Jupyter Notebook preview; to interact with it, please refer to `research/06 - Topic Modelling Analysis.html`.

In [123]:
vis = pyLDAvis.display(vis, template_type='notebook')
vis

__TOPICS__:

1. __Films/Series__: this topic and the second one were the hardest to identify, since their top terms are quite similar and the model was initially confusing them. Even so, we reached a reasonable topic identification, where this topic can be recognised from top terms such as serie (series), speciaux (special), effets (effects), episode (episode), etc.

2. __Books__: like the previous one, this topic was somewhat confusing to identify at first, but some terms revealed that it groups the Books reviews, such as livre (book), recit (story), lis (read), etc.

3. __Music__: this was the easiest to identify, since it lies far from the other topics, which suggests that its terms are distinctive and less likely to be misassigned; we could infer the label from terms such as album (album), rock (rock), disque (record), ecoute (listen), groupe (group), etc.

### Conclusion

In this case the Topic Modelling had a shortcut: the filenames in this directory follow a naming pattern with 3 different values, which are indeed the topics of the APRs: Books, DVDs and Music.

Even so, the parameter tuning process and the analysis needed to identify the different topics were still performed.

After some tests the English APR model revealed the main topics, while the French one took a little longer: at first it confused the Films/Series (DVD) category with the Books category, but after increasing the number of passes of the algorithm (10 in the final model) the separation improved and the topics became clearer.

A feature that also helped was that the number of samples per topic was roughly balanced in the dataset, which led to topics of similar size (diameter).

---

## Conference Papers

This is the least populated dataset, so its Topic Modelling should be harder than the others, since there are fewer samples from which to identify each pre-defined topic or pattern, if one is indeed defined.

Since we have no prior idea of how to approach Topic Modelling for Conference Papers, we will try several Topic Modelling configurations to obtain the largest set of non-overlapping topics, and then try to infer the category of each topic from its top terms.
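
In this notebook the overlap between topics was judged visually in pyLDAvis. As a complementary and purely illustrative sketch, the overlap between the top terms of each pair of topics can also be quantified with the Jaccard similarity. The helper names and the toy term lists below are hypothetical; in practice the terms could be taken from `lda.show_topic(topic_id)`.

```python
import itertools


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def max_topic_overlap(top_terms_per_topic):
    """Largest pairwise Jaccard similarity among the topics' top terms."""
    pairs = itertools.combinations(top_terms_per_topic, 2)
    return max((jaccard(a, b) for a, b in pairs), default=0.0)


# Toy top terms (hypothetical); in practice they could come from lda.show_topic()
topics = [
    ["vector", "annotation", "dictionary"],
    ["research", "linguistic", "method"],
    ["graph", "nodes", "method"],
]
print(max_topic_overlap(topics))
```

Under this sketch, one could fit models with several values of `num_topics` and prefer the largest value whose maximum overlap stays low; what counts as "low" would have to be chosen by inspection.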

### English

In [194]:
aux = data[(data['lang'] == 'en') & (data['context'] == 'conference_papers')]
aux.shape

(363, 4)

In [195]:
id2word = gensim.corpora.Dictionary(aux['tokenized_text'])
list(id2word.token2id.items())[:5]

[('approaches', 0),
 ('aspects', 1),
 ('candidate', 2),
 ('covered', 3),
 ('evaluation', 4)]

In [196]:
id2word.filter_extremes(no_below=2)

In [197]:
corpus = [
    id2word.doc2bow(document=text, allow_update=True) for text in aux['tokenized_text']
]

In [201]:
%time lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=5)

CPU times: user 1.28 s, sys: 1.6 ms, total: 1.28 s
Wall time: 1.28 s


In [202]:
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)

__Note__: as noted for the first LDA plot, the interactive visualization is not rendered in the static Jupyter Notebook preview; to interact with it, please refer to `research/06 - Topic Modelling Analysis.html`.

In [203]:
vis = pyLDAvis.display(vis, template_type='notebook')
vis

After many tests, the most suitable number of topics was found to be 3. Some of the topics are still unclear, but all of them are clearly related to NLP. The topics identified across the English Conference Papers are broken down below.

__TOPICS__:

1. __Vectorization/Annotation__: some top terms of this topic suggest that the conference papers assigned to it are related to text vectorization and text annotation. Some of those terms are vector, vectors, annotation, dictionary, etc. Since all these papers focus on NLP, we can also infer that this topic contains conference papers that rely on some form of text vectorization and/or annotation to build text summarization models, chatbots, etc.

2. __Linguistics Research__: this topic is clearer than the previous one, with top terms such as research, linguistic, references, study, method, etc., which imply that these conference papers focus mainly on state-of-the-art linguistics research that can later be applied to NLP techniques and models.

3. __Graphs__: this topic also seems clearer than the first one, since it is strongly focused on graphs (nodes and their connections), with top terms such as nodes, agents, unify, polarities, hierarchies, etc. These terms suggest that the conference papers in this topic focus on graph analysis of text, community detection, etc.

### French

In [204]:
aux = data[(data['lang'] == 'fr') & (data['context'] == 'conference_papers')]
aux.shape

(242, 4)

In [205]:
id2word = gensim.corpora.Dictionary(aux['tokenized_text'])
list(id2word.token2id.items())[:5]

[('algorithme', 0),
 ('apports', 1),
 ('approche', 2),
 ('article', 3),
 ('cet', 4)]

In [206]:
id2word.filter_extremes(no_below=2)

In [207]:
corpus = [
    id2word.doc2bow(document=text, allow_update=True) for text in aux['tokenized_text']
]

In [214]:
%time lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=5)

CPU times: user 844 ms, sys: 1.47 ms, total: 845 ms
Wall time: 845 ms


In [215]:
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)

__Note__: as noted for the first LDA plot, the interactive visualization is not rendered in the static Jupyter Notebook preview; to interact with it, please refer to `research/06 - Topic Modelling Analysis.html`.

In [216]:
vis = pyLDAvis.display(vis, template_type='notebook')
vis

After analysing the generated topics and the top terms across all the conference papers written in French, we determined that all the conference papers available in the dataset are strongly related to NLP research; on that basis, some topics have been identified and are broken down below.

__TOPICS__:

1. __Graphs/Structures__: this is the least clear of the identified topics, although it makes sense in light of the model applied to the English papers, where there was a clear topic related to graphs and structures used to analyse text data with NLP techniques. In this case, the label is supported by the terms graphe, tag, unification, construction, cooccurrents, structures, etc.

2. __Sequence Models / Machine Translation__: the top terms of this topic are commonly seen in the area of sequence models, which are mainly applied to Machine Translation problems, together with other terms that suggest a strong relation to MT, such as sequence, transducteur, sequentiel, automatique, lexiques, mots, etc.

3. __Word Sense Disambiguation__: this topic seems more specific. We determined that it may be related to Word Sense Disambiguation techniques, since it contains words that are part of the WSD state of the art or are closely related to it, such as dissimilarite, paraphrases, thematique, composantes, metaphore, etc. There may be only a few conference papers about WSD, because the topic also seems to contain words related to NLU, which goes hand in hand with WSD.

4. __Segmentation/Categorization__: this topic is clearer than the others, since its top terms include words tied to segmentation and categorization techniques, such as segmentation, categorisation, analyses, discursive, etc. We can therefore identify that the conference papers assigned to this topic are strongly related to text segmentation or categorization, and they also include some words related to sentiment (polarity) analysis or similar.

### Conclusion

Unlike the previous models, here it was harder to identify the hidden topics, since all the documents fall within a single overarching topic, NLP; that is, the conference papers are NLP papers. A more advanced knowledge of state-of-the-art NLP was therefore required to interpret the terms, and even with that in mind, the topics were harder to label manually.

Anyway, results seem to be coherent and some relevant features have been inferred from the Topic Modelling results after some parameter tuning.

---

## PAN11

We will end the Topic Modelling Analysis with the PAN11 documents used for plagiarism detection, in order to identify the hidden topics across all the documents available.

### English

In [245]:
aux = data[(data['lang'] == 'en') & (data['context'] == 'pan11')]
aux.shape

(1747, 4)

In [246]:
id2word = gensim.corpora.Dictionary(aux['tokenized_text'])
list(id2word.token2id.items())[:5]

[('abstinence', 0),
 ('accompanied', 1),
 ('according', 2),
 ('acknowledge', 3),
 ('acrescentaban', 4)]

In [247]:
id2word.filter_extremes(no_below=20)

In [248]:
corpus = [
    id2word.doc2bow(document=text, allow_update=True) for text in aux['tokenized_text']
]

In [252]:
%time lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=5)

CPU times: user 21.8 s, sys: 26.3 ms, total: 21.9 s
Wall time: 21.9 s


In [253]:
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)

__Note__: as noted for the first LDA plot, the interactive visualization is not rendered in the static Jupyter Notebook preview; to interact with it, please refer to `research/06 - Topic Modelling Analysis.html`.

In [254]:
vis = pyLDAvis.display(vis, template_type='notebook')
vis

Since the PAN11 dataset is used for plagiarism detection, we assume that the text in the documents comes mainly from student activities, so we expected topics similar to the ones obtained. The Topic Modelling algorithm was tested with many different numbers of topics: whenever we used more than 3 topics, they overlapped and became harder to label. We therefore decided to use just 3 topics, for which the information available seems reasonably clear compared to any other number tried.

__TOPICS__:

1. __Literature__: this is the most uncertain topic, since it contains many unconnected words, some of which share an entity type (for example man, girl, woman, father, etc.) but show no further relationship, so the inferred topic is literature.

2. __Spain and Spaniards__: this topic was easier to identify, since many of its terms are strongly related to Spain and to Spanish personalities such as the painter Velazquez (the term 'art' also appears), along with other words such as spain, spanish, toledo, country, government, congress, etc. Some words may also relate to Hispanic culture more broadly, including the Hispanic countries of South America and not just Spain.

3. __Nature and Natives__: this topic also seems clear, since it contains many terms related to nature such as rio, river, mountains, water, nature, etc. and terms that may be related to natives such as indians, canoes, opium, etc.

This Topic Modelling was quite hard, since the content makes the topics difficult to identify even when looking carefully at every topic across different models and parameter settings.

### Spanish

In [255]:
aux = data[(data['lang'] == 'es') & (data['context'] == 'pan11')]
aux.shape

(1164, 4)

In [256]:
id2word = gensim.corpora.Dictionary(aux['tokenized_text'])
list(id2word.token2id.items())[:5]

[('aca', 0),
 ('acobardaba', 1),
 ('acordose', 2),
 ('alegria', 3),
 ('aliento', 4)]

In [257]:
id2word.filter_extremes(no_below=10)

In [258]:
corpus = [
    id2word.doc2bow(document=text, allow_update=True) for text in aux['tokenized_text']
]

In [265]:
%time lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=5)

CPU times: user 14.6 s, sys: 2.03 ms, total: 14.6 s
Wall time: 14.6 s


In [266]:
vis = pyLDAvis.gensim.prepare(lda, corpus, id2word)

__Note__: as noted for the first LDA plot, the interactive visualization is not rendered in the static Jupyter Notebook preview; to interact with it, please refer to `research/06 - Topic Modelling Analysis.html`.

In [267]:
vis = pyLDAvis.display(vis, template_type='notebook')
vis

As mentioned in the English PAN11 analysis, identifying the topics was quite hard, but some were identified after in-depth research. Note that these topics should not be considered reliable, since the level of uncertainty is fairly high.

__TOPICS__:

1. __History (related to Nature and Natives)__: this topic seems reasonably clear, since it contains many terms related to nature such as rio, river, mountains, water, nature, etc. and terms that may be related to natives such as indians, canoes, opium, etc. It also carries a heavy historical load, so these documents may be part of history projects about natives and, by extension, nature.

2. __Literature (related to Chemistry and/or Medicine)__: this topic is really confusing. It contains many Spanish names, which could suggest a relation to Spanish culture, like one of the topics described for the previous model. After adjusting the term relevance setting, however, some chemistry and medicine terms appear, such as dosis, medicamentos, mucosas, afecciones, neuralgias, ulceras, etc. This may indicate a chemistry or medicine topic in which the names belong to people related to these fields, to the authors, etc. Since the names appear more frequently than the rest of the terms, we infer that these documents are literature that may be based on medicine or chemistry.

The identified topics therefore seem to correlate to some degree with the ones identified for English, although more assumptions were made for this last topic. Those assumptions may be wrong, but they are the conclusions reached after analysing the topics' top terms.

### Conclusion

This was the most uncertain and hardest Topic Modelling model to interpret: the topics were unclear regardless of the number of topics, the extremes filtering, etc., since the words seemed too generic and came from diverse documents with no strong relation between them. Even so, some reasonable results have been presented after a lot of parameter tuning.

---

## Topic Modelling Conclusions

After applying the LDA algorithm to every context-language combination of documents, each with its own manually tuned parameters, and analysing each of the identified topics in detail through its visualization, the main conclusions are the following.

The more populated the data was, the more accurate the Topic Modelling results were: the topics identified for Wikipedia in every language (English, French and Spanish) were very clear thanks to the amount of data fed to the model, and they required no additional background, since the top terms of each topic were basic ones. On the other hand, the Conference Papers were harder because of the complexity of each topic's top words, since determining the hidden topics required a deep understanding of NLP. The hardest case was PAN11: establishing the proper number of topics was very time-consuming, and the top terms did not help, as it was really hard to find relationships between them in order to identify each topic.

Producing the results and the analysis was expensive but rewarding in the end. Some conclusions about possible future work have been drawn and will be explained in the last Jupyter Notebook of this project.