# Topic Modeling Example - Basic

Reference - https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

## Preparing Documents

In [118]:
# Sample documents that together form the corpus.

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."

doc2 = "My father spends a lot of time driving my sister around to dance practice."

doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."

doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."

doc5 = "Health experts say that Sugar is not good for your lifestyle."


In [119]:
# compile documents

list_doc_complete = [doc1, doc2, doc3, doc4, doc5]
list_doc_complete

['Sugar is bad to consume. My sister likes to have sugar, but not my father.',
 'My father spends a lot of time driving my sister around to dance practice.',
 'Doctors suggest that driving may cause increased stress and blood pressure.',
 'Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.',
 'Health experts say that Sugar is not good for your lifestyle.']

## Cleaning and Preprocessing

This involves the following steps:
- Remove punctuation
- Remove stopwords
- Normalize the corpus (lemmatize each word)

In [120]:
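import nltk
# WordNet data is needed by WordNetLemmatizer. The stopwords corpus used in the
# next cell must also be installed (nltk.download('stopwords') if it is missing).
nltk.download('wordnet')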

[nltk_data] Downloading package wordnet to /home/pabhijit/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [121]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop = set(stopwords.words('english'))

exclude = set(string.punctuation)

lemma = WordNetLemmatizer()

In [122]:
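def clean(doc):
    # Lowercase, split on whitespace, and drop English stopwords.
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])

    # Strip punctuation characters.
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)

    # Lemmatize each remaining word (e.g. "likes" -> "like").
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())

    return normalized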

In [123]:
list_doc_clean = [clean(doc).split() for doc in list_doc_complete] 
list_doc_clean

[['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
 ['father',
  'spends',
  'lot',
  'time',
  'driving',
  'sister',
  'around',
  'dance',
  'practice'],
 ['doctor',
  'suggest',
  'driving',
  'may',
  'cause',
  'increased',
  'stress',
  'blood',
  'pressure'],
 ['sometimes',
  'feel',
  'pressure',
  'perform',
  'well',
  'school',
  'father',
  'never',
  'seems',
  'drive',
  'sister',
  'better'],
 ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle']]

## Preparing Document-Term Matrix

- The text documents taken together are known as the corpus.
- To run a mathematical model on a text corpus, it is good practice to convert it into a matrix representation first.
- The LDA model looks for repeating term patterns across the entire document-term (DT) matrix.
- Python has many libraries for text mining; gensim is one that is scalable, robust, and efficient. The following code shows how to convert a corpus into a document-term matrix.

In [124]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dict_dictionary = corpora.Dictionary(list_doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
list_dtm = [dict_dictionary.doc2bow(doc) for doc in list_doc_clean]

In [125]:
print(dict_dictionary)

Dictionary(35 unique tokens: ['bad', 'consume', 'father', 'like', 'sister']...)


In [126]:
list_dtm

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)],
 [(2, 1), (4, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(8, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1)],
 [(2, 1),
  (4, 1),
  (18, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1)],
 [(5, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]]
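
To make the (id, count) pairs above concrete, here is a minimal pure-Python sketch of the bag-of-words encoding. It is not gensim's implementation: the helper name build_bow is hypothetical, and the id assignment (unseen terms numbered alphabetically within each document) is an assumption chosen to match the ids shown in the output above.

```python
from collections import Counter

def build_bow(tokenized_docs):
    """Return (token2id, bows) for a list of token lists.

    Hypothetical helper for illustration only; gensim's Dictionary
    does considerably more than this.
    """
    token2id = {}
    bows = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        # Assumption: unseen terms get the next free ids in alphabetical order.
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        # One (term id, count) pair per distinct term, sorted by id.
        bows.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, bows

# Two of the cleaned documents from above (the first and the last).
docs = [
    ['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
    ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle'],
]
token2id, bows = build_bow(docs)
print(bows[0])  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)]
```

The pair (5, 2) records that the term with id 5, sugar, occurs twice in the first document; every other term occurs once.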

## Running LDA Model
The next step is to create an LDA model object and train it on the document-term matrix.
Training takes a few parameters as input: num_topics (the number of topics to extract), id2word (the dictionary that maps term ids back to words), and passes (the number of training passes over the corpus).
The gensim module supports both LDA model estimation from a training corpus and inference of the topic distribution on new, unseen documents.

In [127]:
# Alias for the LDA model class in gensim
Lda = gensim.models.ldamodel.LdaModel

# Train the LDA model on the document-term matrix.
ldamodel = Lda(list_dtm, num_topics=3, id2word=dict_dictionary, passes=50)

## Results

In [129]:
ldamodel.print_topics(num_topics=3, num_words=3)

[(0, '0.076*"sugar" + 0.076*"father" + 0.076*"sister"'),
 (1, '0.065*"driving" + 0.064*"dance" + 0.064*"around"'),
 (2, '0.050*"pressure" + 0.050*"stress" + 0.050*"cause"')]

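- Each line is a topic, listing its top terms and their weights.
- In the output above, topic 0 (sugar, father, sister) can be labelled Family, and topic 2 (pressure, stress, cause) can be labelled Bad Health. LDA training is randomized, so the topics and weights can differ between runs.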