                Beginner's Guide to Topic Modeling in Python

    Table of Contents

    1 Latent Dirichlet Allocation for Topic Modeling
        1.1 Parameters of LDA
    2 Python Implementation
        2.1 Preparing documents
        2.2 Cleaning and Preprocessing
        2.3 Preparing document term matrix
        2.4 Running LDA model
        2.5 Results
    3 Tips to improve results of topic modeling
        3.1 Frequency Filter
        3.2 Part of Speech Tag Filter
        3.3 Batch Wise LDA
    4 Topic Modeling for Feature Selection


#https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

    Preparing Documents

In [79]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]
doc_complete

['Sugar is bad to consume. My sister likes to have sugar, but not my father.',
 'My father spends a lot of time driving my sister around to dance practice.',
 'Doctors suggest that driving may cause increased stress and blood pressure.',
 'Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.',
 'Health experts say that Sugar is not good for your lifestyle.']

        Cleaning and Preprocessing

In [86]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    """Lowercase, drop stopwords and punctuation, then lemmatize each word."""
    normalized = ""
    print("\n", doc)
    # Remove English stopwords (tokens are lowercased before the comparison).
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    print(stop_free)
    # Remove punctuation characters.
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    print(punc_free)

    for word in punc_free.split():
        y = lemma.lemmatize(word, "v")  # lemmatize the word as a verb
        y = lemma.lemmatize(y, "a")     # then as an adjective
        y = lemma.lemmatize(y, "n")     # then as a noun
        normalized = normalized + " " + y
    print(normalized)
    return normalized


doc_clean = [clean(doc).split() for doc in doc_complete]  
doc_clean


 Sugar is bad to consume. My sister likes to have sugar, but not my father.
sugar bad consume. sister likes sugar, father.
sugar bad consume sister likes sugar father
 sugar bad consume sister like sugar father

 My father spends a lot of time driving my sister around to dance practice.
father spends lot time driving sister around dance practice.
father spends lot time driving sister around dance practice
 father spend lot time drive sister around dance practice

 Doctors suggest that driving may cause increased stress and blood pressure.
doctors suggest driving may cause increased stress blood pressure.
doctors suggest driving may cause increased stress blood pressure
 doctor suggest drive may cause increase stress blood pressure

 Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.
sometimes feel pressure perform well school, father never seems drive sister better.
sometimes feel pressure perform well school father never se

[['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
 ['father',
  'spend',
  'lot',
  'time',
  'drive',
  'sister',
  'around',
  'dance',
  'practice'],
 ['doctor',
  'suggest',
  'drive',
  'may',
  'cause',
  'increase',
  'stress',
  'blood',
  'pressure'],
 ['sometimes',
  'feel',
  'pressure',
  'perform',
  'well',
  'school',
  'father',
  'never',
  'seem',
  'drive',
  'sister',
  'good'],
 ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle']]

    Preparing Document-Term Matrix

In [87]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)
print(dictionary)
# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
doc_term_matrix

Dictionary(33 unique tokens: ['perform', 'health', 'may', 'pressure', 'lifestyle']...)


[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)],
 [(2, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1)],
 [(2, 1),
  (5, 1),
  (12, 1),
  (15, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1)],
 [(0, 1), (25, 1), (29, 1), (30, 1), (31, 1), (32, 1)]]
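Each row above is one document written as (token id, count) pairs. As a rough illustration of that representation, here is a minimal pure-Python sketch that does not use gensim. The helper names `build_vocab` and `to_bow` are hypothetical, and ids are assigned in order of first appearance, which may not match the ids that gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Map each unique token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Two of the cleaned documents from the preprocessing step.
docs = [
    ['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
    ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle'],
]

vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
print(bow[0])
print(bow[1])
```

The pair (0, 2) in the first row says that token 0 ("sugar" in this sketch) occurs twice in that document; word order is discarded.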

    Running LDA Model

In [103]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

ldamodel

<gensim.models.ldamodel.LdaModel at 0x7fa8c2425b38>

    Results

In [104]:
print(ldamodel.print_topics(num_topics=3, num_words=2))


[(0, '0.051*"pressure" + 0.051*"good"'), (1, '0.085*"sister" + 0.085*"father"'), (2, '0.030*"sugar" + 0.030*"good"')]
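Each topic is printed as a string of weight*"word" terms joined by " + ". To work with the weights as numbers, one option is to parse those strings. The sketch below assumes exactly the string format shown in the output above; `parse_topic` is a hypothetical helper, not part of gensim. (gensim's `show_topic` method should return (word, weight) pairs directly, which avoids parsing altogether.)

```python
import re

# Topic strings copied from the printed output above.
topics = [
    (0, '0.051*"pressure" + 0.051*"good"'),
    (1, '0.085*"sister" + 0.085*"father"'),
    (2, '0.030*"sugar" + 0.030*"good"'),
]

# Matches one term of the form  weight*"word"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.051*"pressure" + ...' into [('pressure', 0.051), ...]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

parsed = {topic_id: parse_topic(text) for topic_id, text in topics}
for topic_id, terms in parsed.items():
    print(topic_id, terms)
```

Read this way, topic 1 is dominated by "sister" and "father", while topics 0 and 2 both include "good", which suggests the topics are not cleanly separated on such a small corpus.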


        Tips to improve results of topic modeling
        1. Frequency Filter
        2. Part of Speech Tag Filter
        3. Batch Wise LDA
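
As an illustration of the first tip, a frequency filter can be sketched in plain Python: drop tokens that appear in too few documents (likely noise) or in too many (likely uninformative). The function name `frequency_filter` and both thresholds are assumptions chosen for this toy corpus, not values from the original guide. In gensim, `Dictionary.filter_extremes` (with its `no_below` and `no_above` parameters) serves a similar purpose.

```python
from collections import Counter

def frequency_filter(docs, min_docs=2, max_doc_ratio=0.8):
    """Keep tokens found in at least min_docs documents and in at most
    max_doc_ratio of all documents (thresholds are illustrative)."""
    # Document frequency: count each token once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    keep = {
        token for token, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_doc_ratio
    }
    return [[token for token in doc if token in keep] for doc in docs]

# Three of the cleaned documents from the preprocessing step.
docs = [
    ['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
    ['father', 'spend', 'lot', 'time', 'drive', 'sister', 'around', 'dance', 'practice'],
    ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle'],
]

filtered = frequency_filter(docs)
print(filtered)
```

On a corpus this small the filter removes most tokens; on a realistic corpus the thresholds would need tuning before the filtered documents are passed to the dictionary and LDA steps.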