### Table of contents:

* [4. Topic modeling](#chapter4)
    * [4.1 Requirements](#section_4_1)
    * [4.2 Imports](#section_4_2)
    * [4.3 Get data](#section_4_3)
    * [4.4 With top2vec](#section_4_4)
    * [4.5 With Latent Dirichlet Allocation (LDA)](#section_4_5)

# 4. Topic modeling <a class="anchor" id="chapter4"></a>

Topic modeling is an unsupervised machine learning technique that analyzes a set of documents and groups related words into clusters, each of which is interpreted as a topic.

Two algorithms were used:

- Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects the topics present in a text collection and generates jointly embedded topic, document and word vectors. Its main benefits: it finds the number of topics automatically, works on short texts, and doesn't ignore semantics.
- LDA groups texts based on the words they contain and the probability of each word belonging to a given topic. The algorithm outputs the topic-word distribution. It is the most popular topic modeling algorithm, but it needs pre-processing (possibly multiple rounds), needs to know the number of topics in advance, and ignores semantics.
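
To make the "ignores semantics" point concrete: LDA works on bag-of-words counts, so word order is discarded before the model ever sees the text. The following is a minimal sketch using only the standard library; the helper `bag_of_words` and the two example sentences are hypothetical and are not part of this notebook's pipeline.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings but exactly the same words
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

# A bag-of-words model cannot tell these two documents apart
same_representation = bag_of_words(doc_a) == bag_of_words(doc_b)
print(same_representation)
```

Embedding-based approaches such as Top2Vec start from learned vectors rather than raw counts, which is one reason they can capture differences that counts alone miss.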

In our experiments, Top2Vec worked much better than LDA on this dataset.

Note: run this notebook with a GPU enabled.

References:

1. https://github.com/ddangelov/Top2Vec
2. https://radimrehurek.com/gensim/models/ldamulticore.html

## 4.1 Requirements <a class="anchor" id="section_4_1"></a>

In [None]:
pip install gensim==3.8.3

In [None]:
pip install top2vec

In [3]:
#pip install pyLDAvis

## 4.2 Imports <a class="anchor" id="section_4_2"></a>

In [1]:
import numpy as np
import pandas as pd

from top2vec import Top2Vec

#import gensim.corpora as corpora
#from gensim.models import CoherenceModel
#from gensim.models.ldamulticore import LdaMulticore
#import pyLDAvis
#import pyLDAvis.gensim_models
#import matplotlib.pyplot as plt
#import matplotlib.colors as mcolors
#%matplotlib inline
#from wordcloud import WordCloud


## 4.3 Get data <a class="anchor" id="section_4_3"></a>

In [3]:
# top2vec doesn't need pre-processing, but we will still use the already cleaned dataset

data = pd.read_pickle('data_preprocessed_tm.pkl')
data.head()


In [None]:
# Map each document's position in the dataset to its document ID
# (assumes the dataset has an "ID" column)

ids = {position: doc_id for position, doc_id in enumerate(data["ID"])}

In [4]:
content = data.loc[:,'content']
content.head()

Unnamed: 0,label,content
0,0,prisão perpétua homem tentou assassinar senado...
1,0,john nash matemático mente brilhante morre aci...
2,1,mito reeleição mínima garantida cavaco sairá d...
3,0,morreu rita levintalcini grande dama ciência i...
4,0,trás porta amarela homem problemas psicológico...


## 4.4 With top2vec <a class="anchor" id="section_4_4"></a>

In [None]:
# Convert the content column to a list of strings

documents = content.tolist()

In [None]:
# Train a Top2Vec model on our news dataset

model = Top2Vec(documents, speed="learn", workers=8)

In [None]:
# Total number of topics found

total_topics = model.get_num_topics()

print(f"Found: {total_topics} topics.")

In [None]:
# For each topic, the top 50 words are returned, in order of semantic similarity to the topic

topic_words, word_scores, topic_nums = model.get_topics(total_topics)
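
The ordering "by semantic similarity to the topic" can be pictured as ranking word vectors by their cosine similarity to a topic vector. The sketch below illustrates that idea with hand-made 2D vectors; it is not Top2Vec's implementation, and `cosine_similarity`, `rank_words`, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_words(topic_vector, word_vectors, top_n=50):
    """Return the top_n (word, score) pairs, most similar to the topic first."""
    scored = [
        (word, cosine_similarity(topic_vector, vector))
        for word, vector in word_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy data: made-up vectors, not taken from a trained model
toy_topic = [1.0, 0.0]
toy_words = {
    "election": [0.9, 0.1],
    "football": [0.0, 1.0],
    "parliament": [1.0, 0.0],
}

ranked = rank_words(toy_topic, toy_words, top_n=2)
print(ranked)
```

In this toy setup, words pointing in nearly the same direction as the topic vector rank first, which mirrors how `topic_words` and `word_scores` line up for each topic.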

In [None]:
# Wordcloud for each topic

for topic in topic_nums:
    model.generate_topic_wordcloud(topic)

In [None]:
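# Search documents by topic, ordered by decreasing similarity to the topic
# (the result is stored in topic_documents so the full documents list is not overwritten)

topic_documents, document_scores, document_indexes = model.search_documents_by_topic(topic_num=0, num_docs=10)

# Translate the returned document indexes back to document IDs
document_ids = [ids.get(index) for index in document_indexes]

document_ids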

## 4.5 With Latent Dirichlet Allocation (LDA) <a class="anchor" id="section_4_5"></a>

In [5]:
# Generate document-terms matrix

#content_words = [doc.split() for doc in content]

#dictionary = corpora.Dictionary(content_words)

#doc_term_matrix = [dictionary.doc2bow(text) for text in content_words]
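
For readers without gensim at hand, the document-term matrix built above can be approximated with the standard library. This is a rough sketch of the same idea, not gensim's implementation: `build_vocabulary`, `to_bow`, and the toy documents are hypothetical, and token ids are assigned in order of first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocabulary = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_bow(tokens, vocabulary):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(vocabulary[token] for token in tokens if token in vocabulary)
    return sorted(counts.items())


# Toy documents: made-up examples in the style of the cleaned dataset
toy_docs = ["morreu grande dama ciência", "grande homem grande mente"]
toy_tokens = [doc.split() for doc in toy_docs]

toy_vocabulary = build_vocabulary(toy_tokens)
toy_matrix = [to_bow(tokens, toy_vocabulary) for tokens in toy_tokens]
print(toy_matrix)
```

Each document becomes a sparse list of `(token_id, count)` pairs, which is the shape of input that the LDA cell below expects in `doc_term_matrix`.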

In [6]:
# Create model, with 6 topics to be found

#lda_model = LdaMulticore(corpus=doc_term_matrix, id2word=dictionary, num_topics=6, random_state=42)

#print('\nPerplexity: ', lda_model.log_perplexity(doc_term_matrix))

#coherence_model_lda = CoherenceModel(model=lda_model, texts=content_words, dictionary=dictionary, coherence='c_v')
#coherence_lda = coherence_model_lda.get_coherence()
#print('\nCoherence: ', coherence_lda)

In [7]:
# Visualize topics found

#pyLDAvis.enable_notebook()
#vis = pyLDAvis.gensim_models.prepare(lda_model, doc_term_matrix, dictionary)
#vis

#print(lda_model.print_topics())