### Using Gensim (LDA) for Topic Modeling, *a low fat tutorial*

Author: Argyris Argyrou, PhD student at Cyprus University of Technology

Packages required

In [1]:
from nltk.tokenize import RegexpTokenizer
from gensim import corpora, models
from nltk.stem.porter import PorterStemmer

import gensim

Feel free to play with the sentences below.

In [2]:
doc_a="I eat fish and vegetables."
doc_b="Fish are pets."
doc_c="My kitten eats fish."

Collect the documents into a list

In [3]:
doc_set = [doc_a,doc_b,doc_c]
print(doc_set)

['I eat fish and vegetables.', 'Fish are pets.', 'My kitten eats fish.']


Basic data cleaning

In [4]:
# tokenizer that splits text into runs of word characters
tokenizer = RegexpTokenizer(r'\w+')

# small hand-picked stop word list
mystopwords = ['i', 'are', 'my', 'and']

# create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# list that collects the cleaned tokens of each document
texts = []

# loop through document list
for doc in doc_set:

    # lowercase and tokenize the document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in mystopwords]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(token) for token in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)
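
To make the first two cleaning steps easier to inspect, here is a minimal sketch that uses only the standard library. It assumes that `re.findall(r'\w+', ...)` behaves roughly like `RegexpTokenizer(r'\w+')` for these sentences, and it deliberately skips stemming, so the tokens are not yet reduced to forms such as "veget". The helper name `tokenize_and_filter` is made up for this illustration.

```python
import re

# Same sample sentences and stop word list as in the cells above.
doc_set = ["I eat fish and vegetables.", "Fish are pets.", "My kitten eats fish."]
mystopwords = ['i', 'are', 'my', 'and']

def tokenize_and_filter(doc, stopwords):
    """Lowercase the text, keep runs of word characters, drop stop words."""
    tokens = re.findall(r'\w+', doc.lower())
    return [token for token in tokens if token not in stopwords]

# One list of tokens per document; no stemming is applied in this sketch.
filtered = [tokenize_and_filter(doc, mystopwords) for doc in doc_set]
print(filtered)
```

Punctuation disappears because `\w+` only matches letters, digits and underscores, which is why no separate punctuation-removal step is needed.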

Prerequisites for generating our topic models

In [6]:
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert each tokenized document into a bag of words: a list of (term id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model (it is randomly initialized, so exact weights vary between runs)
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)
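
The bag-of-words step can be hard to picture, so here is a small standard-library sketch of the idea. It is not gensim's implementation: the stemmed tokens are typed in by hand (assumed from the stems that appear in the printed topics), the ids are assigned alphabetically (gensim may number terms differently), and `to_bow` is a hypothetical helper.

```python
from collections import Counter

# Stemmed tokens as assumed from the tutorial's output (hard-coded for illustration).
texts = [['eat', 'fish', 'veget'], ['fish', 'pet'], ['kitten', 'eat', 'fish']]

# Give every distinct term an integer id. Alphabetical order is an arbitrary
# choice for this sketch; gensim's Dictionary may assign different ids.
vocab = sorted({term for doc in texts for term in doc})
token2id = {term: idx for idx, term in enumerate(vocab)}

def to_bow(doc):
    """Return the document as a sorted list of (term id, count) pairs."""
    counts = Counter(token2id[term] for term in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in texts]
print(token2id)
print(bow_corpus)
```

Each document becomes a short list of pairs that only mentions the terms it contains, which is the sparse shape of data the LDA model is trained on.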

Let's see which topics the model found in our documents. Each topic is printed as a weighted list of its three most probable (stemmed) terms.

In [7]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(0, '0.202*"fish" + 0.200*"veget" + 0.200*"eat"'), (1, '0.305*"eat" + 0.304*"fish" + 0.174*"kitten"'), (2, '0.365*"fish" + 0.361*"pet" + 0.091*"eat"')]
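
Each topic above is a string of `weight*"term"` pairs. As a rough sketch of how to read it, the snippet below parses the printed strings into (term, weight) tuples using only the standard library. The strings are copied from the output above, and `parse_topic` is a hypothetical helper; gensim's `show_topics(formatted=False)` should give similar pairs directly from the model.

```python
import re

# Topic strings copied from the printed output above.
topics = [
    (0, '0.202*"fish" + 0.200*"veget" + 0.200*"eat"'),
    (1, '0.305*"eat" + 0.304*"fish" + 0.174*"kitten"'),
    (2, '0.365*"fish" + 0.361*"pet" + 0.091*"eat"'),
]

def parse_topic(topic_string):
    """Split a 'weight*"term" + ...' string into (term, weight) tuples."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(term, float(weight)) for weight, term in pairs]

parsed = {topic_id: parse_topic(text) for topic_id, text in topics}

# Show the highest-weighted term of every topic.
for topic_id, terms in parsed.items():
    top_term, top_weight = terms[0]
    print(topic_id, top_term, top_weight)
```

With only three tiny documents, "fish" carries a large weight in every topic, so the topics are best read by their second and third terms ("veget", "kitten", "pet").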


This *low fat tutorial* is an abridged version of the tutorial by *Jordan Barber*, available here:<br>
https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html