
## Using Gensim for Topic Modeling

We’re going to start with the gensim implementations because they offer more functionality out of the box; that functionality can then be replicated with sklearn. Let’s first prepare the dataset we’ll be working with.


In [None]:
!pip install sastrawi
!pip install pyldavis
!pip install gensim==3.8.0

import nltk
from bs4 import BeautifulSoup
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')


In [None]:
import pkg_resources
pkg_resources.get_distribution("gensim").version


In [None]:
!git clone https://github.com/project303/dataset.git
  
# The articles in the file are separated by the marker 'BERHENTI DISINI'
with open('dataset/Berita.txt', encoding="utf8") as f:
    article = f.read().split('BERHENTI DISINI')
len(article)

Strip the HTML tags from the data with ``BeautifulSoup``:

In [None]:
# Strip HTML markup from each article, keeping only the visible text
article = [BeautifulSoup(text, 'html.parser').get_text() for text in article]
print(article[0][:100])

Tokenize, remove stopwords, and stem:

In [None]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()

In [None]:
# Build the stopword set once; set membership is faster than scanning a list on every call
STOPWORDS = set(stopwords.words('indonesian'))

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # keep only tokens that contain letters (drops numeric tokens and raw punctuation)
    # and are not stopwords; compare in lowercase because the stopword list is lowercase
    filtered_tokens = [
        token for token in tokens
        if re.search('[a-zA-Z]', token) and token.lower() not in STOPWORDS
    ]
    return [stemmer.stem(t) for t in filtered_tokens]

In [None]:
# For gensim we need to tokenize the data and filter out stopwords
tokenized_data = []
for text in article:
    tokenized_data.append(tokenize_and_stem(text))

# Build a Dictionary - association word to numeric id
dictionary = corpora.Dictionary(tokenized_data)
 
# Transform the collection of texts to a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
 
# Have a look at what the document at index 20 looks like: [(word_id, count), ...]
print(corpus[20])
# [(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2),  ...
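
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what the dictionary and ``doc2bow`` do conceptually. The helper names ``build_vocab`` and ``to_bow`` and the toy documents are made up for illustration; gensim's own implementation differs in details such as how ids are assigned.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Hypothetical, already-stemmed toy documents (not taken from the dataset)
toy_docs = [["tim", "menang", "tim"], ["harga", "naik", "harga", "harga"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)
```

Tokens that are not in the vocabulary are silently dropped, which mirrors how ``doc2bow`` treats words it has not seen by default.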


In [None]:
NUM_TOPICS = 4

# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary, alpha='auto', eval_every=5)
 
# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

The cell above trains two models, both implemented in the gensim package: LDA (Latent Dirichlet Allocation) and LSI (Latent Semantic Indexing, also known as Latent Semantic Analysis).

Let’s now display the topics the two models have inferred:

In [None]:
print("LDA Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words of each topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
print("=" * 20)
 
print("LSI Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words of each topic
    print("Topic #%s:" % idx, lsi_model.print_topic(idx, 10))
 
print("=" * 20)

Let’s now put the models to work and transform unseen documents to their topic distribution:

In [None]:
text = "Pertandingan berjalan dengan seru. Tim lawan berhasil dikalahkan dengan skor 1-0."
bow = dictionary.doc2bow(tokenize_and_stem(text))

print(lda_model[bow]) 
print(lsi_model[bow])
print(bow)

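The LDA result is a list of ``(topic_id, probability)`` pairs and can be interpreted as a distribution over topics. The LSI result holds coordinates in the latent space instead; they can be negative and do not form a probability distribution.
Gensim also offers a simple way of performing similarity queries using topic models.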

In [None]:
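from gensim import similarities

# Index every document by its LDA topic vector; num_features is the vector length
lda_index = similarities.MatrixSimilarity(lda_model[corpus], num_features=NUM_TOPICS)

# Query the index with the topic vector of the unseen document
# (named `sims` so the imported `similarities` module is not shadowed)
sims = lda_index[lda_model[bow]]
# Sort by similarity, highest first
sims = sorted(enumerate(sims), key=lambda item: -item[1])

# Top 10 most similar documents as (document_id, similarity) pairs
print(sims[:10])

# Show the beginning of the most similar document
document_id, similarity = sims[0]
print(article[document_id][:1000])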
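
``MatrixSimilarity`` ranks documents by the cosine similarity between their vectors. The sketch below shows that computation in plain Python. The ``cosine`` helper and the topic vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: similarity with a zero vector is 0
    return dot / (norm_u * norm_v)

# Hypothetical topic distributions over 4 topics for three documents
toy_topic_vectors = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.5, 0.3, 0.2, 0.0],
]
# Hypothetical topic distribution of the query document
toy_query = [0.8, 0.2, 0.0, 0.0]

# Rank documents by similarity to the query, highest first
toy_ranked = sorted(
    ((doc_id, cosine(toy_query, vec)) for doc_id, vec in enumerate(toy_topic_vectors)),
    key=lambda item: -item[1],
)
print(toy_ranked)
```

Documents dominated by the same topic as the query end up at the top of the ranking, which is the behaviour the similarity query above relies on.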

Notice how the topic weights in the LDA output for the unseen document add up to approximately 1 (gensim drops topics whose probability falls below a small threshold, so the printed values can sum to slightly less). That’s not a coincidence: LDA models each document as being generated by a mixture of topics, and its purpose is to estimate how much of the document was generated by each topic.

LDA is fitted iteratively. A sampling-based view of the procedure has two main steps (gensim’s ``LdaModel`` itself uses online variational Bayes, but the intuition carries over):

   - In the initialization stage, each word is assigned to a random topic.
   - Iteratively, the algorithm goes through each word and reassigns it to a topic, taking into consideration:
        - how probable the word is under each topic
        - how probable each topic is in the document
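
The two steps above can be sketched as a toy collapsed Gibbs sampler. This is an illustration only, not what gensim runs: the function name ``toy_lda_gibbs``, the hyperparameter values, and the toy word ids are all assumptions made for this example.

```python
import random

def toy_lda_gibbs(docs, num_topics, num_words, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # words of doc d assigned to topic k
    topic_word = [[0] * num_words for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                             # words assigned to topic k overall
    assignments = []

    # Initialization: assign every word occurrence to a random topic
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    # Iteration: revisit each word occurrence and resample its topic
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # weight = P(word | topic) * P(topic | document), up to a constant
                weights = [
                    (topic_word[t][w] + beta) / (topic_total[t] + beta * num_words)
                    * (doc_topic[d][t] + alpha)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Normalize the smoothed counts into per-document topic distributions
    return [
        [(c + alpha) / (len(doc) + alpha * num_topics) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]

# Hypothetical corpus: three documents over a vocabulary of four word ids
toy_word_ids = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 2, 3, 1]]
toy_theta = toy_lda_gibbs(toy_word_ids, num_topics=2, num_words=4)
for row in toy_theta:
    print([round(p, 3) for p in row], "sum =", round(sum(row), 3))
```

Each returned row is a topic distribution for one document, so its entries are positive and sum to 1 by construction.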

Because LDA yields a probability distribution over topics for each document and over words for each topic, its results are easy to visualize. We’re going to use a specialized tool called pyLDAvis:

In [None]:
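import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis  # newer pyLDAvis releases
except ImportError:
    import pyLDAvis.gensim as gensimvis  # older pyLDAvis releases

pyLDAvis.enable_notebook()
panel = gensimvis.prepare(lda_model, corpus, dictionary)
panel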