# Latent Dirichlet Allocation for Topic Modeling

There are several approaches to extracting topics from text, such as term frequency-inverse document frequency (TF-IDF) and non-negative matrix factorization (NMF). Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling techniques, and it is the one discussed in this article.

LDA assumes that each document is produced from a mixture of topics, and that each topic generates words according to its own probability distribution. Given a collection of documents, LDA works backwards to infer which topics would have produced those documents in the first place.
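
The generative story described above can be sketched with NumPy. This is only an illustration under assumed values: the toy vocabulary, the two topic distributions in `topic_word`, the `alpha` prior, and the `generate_document` helper are all hypothetical and are not part of the gensim workflow used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 6-word vocabulary.
vocab = ["sugar", "health", "diet", "father", "sister", "driving"]

# Each row is one topic's probability distribution over the vocabulary.
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "health"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "family"-like topic
])

def generate_document(n_words, alpha=(0.5, 0.5)):
    """Generate one document the way LDA assumes documents are produced."""
    # Draw this document's mixture of topics from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)          # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(theta.round(2), words)
```

Fitting LDA runs this process in reverse: only the words are observed, and both the per-document mixtures and the per-topic word distributions have to be inferred.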

In [1]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

In [2]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
def clean(doc):
    # Remove stop words from the lowercased document
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    print(stop_free)
    # Remove punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    print(punc_free)
    # Lemmatize each remaining word
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    print(normalized)
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]

sugar bad consume. sister likes sugar, father.
sugar bad consume sister likes sugar father
sugar bad consume sister like sugar father
father spends lot time driving sister around dance practice.
father spends lot time driving sister around dance practice
father spends lot time driving sister around dance practice
doctors suggest driving may cause increased stress blood pressure.
doctors suggest driving may cause increased stress blood pressure
doctor suggest driving may cause increased stress blood pressure
sometimes feel pressure perform well school, father never seems drive sister better.
sometimes feel pressure perform well school father never seems drive sister better
sometimes feel pressure perform well school father never seems drive sister better
health experts say sugar good lifestyle.
health experts say sugar good lifestyle
health expert say sugar good lifestyle


In [3]:
doc_clean

[['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
 ['father',
  'spends',
  'lot',
  'time',
  'driving',
  'sister',
  'around',
  'dance',
  'practice'],
 ['doctor',
  'suggest',
  'driving',
  'may',
  'cause',
  'increased',
  'stress',
  'blood',
  'pressure'],
 ['sometimes',
  'feel',
  'pressure',
  'perform',
  'well',
  'school',
  'father',
  'never',
  'seems',
  'drive',
  'sister',
  'better'],
 ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle']]

In [4]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
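
To make the document-term representation concrete, here is a plain-Python sketch of the idea behind a bag-of-words conversion: each document becomes a list of `(term_id, count)` pairs. The `token2id` mapping and the `to_bow` helper are hypothetical stand-ins written for illustration. They are not gensim's implementation, and the ids gensim assigns may differ.

```python
from collections import Counter

# Two of the cleaned documents from above.
docs = [
    ["sugar", "bad", "consume", "sister", "like", "sugar", "father"],
    ["health", "expert", "say", "sugar", "good", "lifestyle"],
]

token2id = {}  # hypothetical term dictionary: token -> integer id

def to_bow(tokens):
    """Convert a token list into sorted (term_id, count) pairs."""
    counts = Counter(tokens)
    # Give every unseen token the next free id.
    for tok in sorted(counts):
        token2id.setdefault(tok, len(token2id))
    return sorted((token2id[tok], n) for tok, n in counts.items())

bows = [to_bow(d) for d in docs]
print(token2id)
print(bows)
```

In the first document "sugar" occurs twice, so its pair carries a count of 2. Word order is discarded, which is what LDA expects as input.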



In [5]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

In [6]:
print(ldamodel.print_topics(num_topics=3, num_words=3))
# Sample of the output format; the values differ from the run below because LDA is randomly initialized:
# 0.168*health + 0.083*sugar + 0.072*bad
# 0.061*consume + 0.050*drive + 0.050*sister
# 0.049*pressur + 0.049*father + 0.049*sister

[(0, '0.057*"father" + 0.057*"sister" + 0.056*"pressure"'), (1, '0.135*"sugar" + 0.054*"like" + 0.054*"consume"'), (2, '0.079*"driving" + 0.045*"cause" + 0.045*"increased"')]
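
Each topic in this output is a weighted list of words, where the weight reflects how strongly the word is associated with the topic. As a small sketch for reading the result, the strings can be split into `(word, weight)` pairs with a regular expression. The `parse_topic` helper is hypothetical, and the `topics` list is copied from the output above rather than recomputed.

```python
import re

# Topic strings copied from the printed output above.
topics = [
    (0, '0.057*"father" + 0.057*"sister" + 0.056*"pressure"'),
    (1, '0.135*"sugar" + 0.054*"like" + 0.054*"consume"'),
    (2, '0.079*"driving" + 0.045*"cause" + 0.045*"increased"'),
]

def parse_topic(topic_str):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

for topic_id, topic_str in topics:
    terms = parse_topic(topic_str)
    top_word, top_weight = max(terms, key=lambda t: t[1])
    print(f"topic {topic_id}: top word {top_word!r} ({top_weight:.3f})")
```

Read this way, topic 1 is dominated by "sugar", while topics 0 and 2 lean toward family and driving-related stress.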
