# LDA - General Understanding

LDA assumes that each document is produced from a mixture of topics, and that each topic generates words according to its own probability distribution. Given a dataset of documents, LDA works backwards and tries to figure out which topics would have created those documents in the first place.
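The generative story above can be sketched in a few lines. This is a minimal illustration, not the gensim implementation: the vocabulary, the two topic-word distributions, and the `generate_document` helper are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 6-word vocabulary.
vocab = ["sugar", "health", "doctor", "father", "sister", "driving"]
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # topic 0: health-related words
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # topic 1: family-related words
])

def generate_document(n_words, alpha=(0.5, 0.5)):
    """Sample one document following the generative story LDA assumes."""
    # 1. Draw the document's topic mixture from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position from the mixture.
        z = rng.choice(len(theta), p=theta)
        # 3. Pick a word from that topic's word distribution.
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(theta, words)
```

Training LDA amounts to running this story in reverse: only the words are observed, and both the topic mixtures and the topic-word distributions have to be inferred.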

LDA can also be viewed as a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix. LDA converts the document-term matrix into two lower-dimensional matrices: a document-topic matrix and a topic-term (vocabulary) matrix.
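To make the shapes concrete, here is a small sketch of the two factor matrices. The sizes and the random values are illustrative assumptions only; a trained model would learn these matrices from the corpus.

```python
import numpy as np

n_docs, n_topics, n_terms = 5, 3, 40   # illustrative sizes, not taken from the data

rng = np.random.default_rng(0)
# Each row is a probability distribution, so each row sums to 1.
doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)     # shape (5, 3)
topic_term = rng.dirichlet(np.ones(n_terms), size=n_topics)   # shape (3, 40)

# Multiplying the factors gives, for every document, a probability for every term.
doc_term_probs = doc_topic @ topic_term                       # shape (5, 40)
print(doc_topic.shape, topic_term.shape, doc_term_probs.shape)
```

The two factors hold 5×3 + 3×40 numbers instead of the 5×40 entries of the full matrix, and the saving grows quickly with corpus and vocabulary size.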

LDA iterates through each word w in each document d and tries to replace the word's current topic assignment with a new one. A new topic t is assigned to word w with a probability P, which is the product of two probabilities, p1 and p2.

For every topic t, both probabilities are calculated. p1 = p(topic t | document d) is the proportion of words in document d that are currently assigned to topic t. p2 = p(word w | topic t) is the proportion of assignments to topic t, over all documents, that come from word w.

The current topic assignment of the word is then replaced by a topic drawn with probability proportional to the product p1 × p2.
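The update for a single word can be sketched as follows. This is a simplified Gibbs-sampling-style step written for illustration; the function name, the count matrices, and the small smoothing constants `alpha` and `beta` (added so that no probability is exactly zero) are assumptions of this sketch and not what gensim runs internally.

```python
import numpy as np

def resample_topic(d, w, old_t, doc_topic_counts, topic_word_counts, rng,
                   alpha=0.1, beta=0.01):
    """Resample the topic of one occurrence of word w in document d."""
    n_topics, n_vocab = topic_word_counts.shape
    # Remove the word's current assignment before computing the proportions.
    doc_topic_counts[d, old_t] -= 1
    topic_word_counts[old_t, w] -= 1

    # p1: share of words in document d currently assigned to each topic.
    p1 = (doc_topic_counts[d] + alpha) / (doc_topic_counts[d].sum() + n_topics * alpha)
    # p2: share of each topic's assignments (over all documents) that come from word w.
    p2 = (topic_word_counts[:, w] + beta) / (topic_word_counts.sum(axis=1) + n_vocab * beta)

    # Draw the new topic with probability proportional to p1 * p2.
    p = p1 * p2
    p = p / p.sum()
    new_t = rng.choice(n_topics, p=p)

    # Record the new assignment.
    doc_topic_counts[d, new_t] += 1
    topic_word_counts[new_t, w] += 1
    return new_t, p

# Toy counts: 2 documents, 2 topics, 3 vocabulary words (8 word tokens in total).
rng = np.random.default_rng(0)
doc_topic_counts = np.array([[3, 1], [0, 4]])
topic_word_counts = np.array([[2, 1, 0], [1, 1, 3]])
new_t, p = resample_topic(0, 0, 0, doc_topic_counts, topic_word_counts, rng)
print(new_t, p)
```

Repeating this step for every word over many passes is what gradually moves the assignments toward coherent topics.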

# Parameters of LDA

Alpha and Beta Hyperparameters – Alpha represents document-topic density and beta represents topic-word density. With a higher alpha, documents are composed of more topics; with a lower alpha, documents contain fewer topics. Similarly, with a higher beta, topics are composed of a large number of words from the corpus; with a lower beta, they are composed of only a few words.

Number of Topics – Number of topics to be extracted from the corpus. One approach to choosing this value is based on the Kullback-Leibler (KL) divergence score.

Number of Topic Terms – Number of terms that make up a single topic.

Number of Iterations / passes – Maximum number of iterations allowed for the LDA algorithm to converge.
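The effect of alpha described above can be checked by sampling topic mixtures from a Dirichlet distribution, which is the prior LDA places on them. This is a small sketch using NumPy only; the number of topics and the two alpha values are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 5

dominant_share = {}
for alpha in (0.1, 10.0):
    # Sample 1000 document-topic mixtures with a symmetric prior.
    mixtures = rng.dirichlet([alpha] * n_topics, size=1000)
    # Average weight of the single largest topic in each sampled document.
    dominant_share[alpha] = mixtures.max(axis=1).mean()
    print(f"alpha={alpha}: average share of the dominant topic = {dominant_share[alpha]:.2f}")
```

With the small alpha, most of a document's weight lands on one topic; with the large alpha, the weight is spread across all topics. In gensim's `LdaModel`, these two hyperparameters are exposed as the `alpha` and `eta` arguments.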

In [46]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

# Cleaning and preprocessing

In [1]:
# Remove punctuation and stop words, then normalize (lemmatize)
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import nltk
lemma = WordNetLemmatizer()

In [21]:
# Download the NLTK resources used below (only needed once)
nltk.download('stopwords')
nltk.download('wordnet')

In [41]:
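def clean_text(doc):
    # Note: matching is case-sensitive, so capitalized stop words such as "I" or "My" are kept
    stop_words = set(stopwords.words('english'))
    # Strip punctuation characters
    clean_punc = ''.join(ch for ch in doc if ch not in string.punctuation)
    # Drop stop words
    clean_stopwords = " ".join(word for word in clean_punc.split() if word not in stop_words)
    # Lemmatize each remaining word (treated as a noun by default)
    lemmatized_message = " ".join(lemma.lemmatize(word) for word in clean_stopwords.split())
    return lemmatized_message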

In [43]:
clean_text("Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.")

'Sometimes I feel pressure perform well school father never seems drive sister better'

In [48]:
clean_doc = [clean_text(doc).split() for doc in doc_complete]

# Gensim and converting text into matrix

In [50]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(clean_doc)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in clean_doc]
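To show what the dictionary and the bag-of-words conversion produce, here is a plain-Python sketch of the same idea. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins written for this example; gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["sugar", "bad", "sugar"], ["father", "sugar"]]
token2id = build_vocab(docs)
bows = [to_bow(d, token2id) for d in docs]
print(token2id)  # {'sugar': 0, 'bad': 1, 'father': 2}
print(bows)      # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Each document becomes a sparse list of (term id, count) pairs, which is the row format the LDA model consumes.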



In [51]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

In [52]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(0, '0.045*"Sugar" + 0.045*"driving" + 0.045*"stress"'), (1, '0.047*"My" + 0.047*"father" + 0.047*"sister"'), (2, '0.052*"father" + 0.052*"sister" + 0.052*"pressure"')]
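Each topic is printed as a string of `weight*"word"` terms. If the weights are needed as numbers, the strings can be parsed with a small helper like the one below. `parse_topic` is a hypothetical function written for this example; gensim can also return the pairs directly through `show_topics(formatted=False)`.

```python
import re

def parse_topic(topic_str):
    """Turn '0.045*"Sugar" + 0.045*"driving"' into [('Sugar', 0.045), ('driving', 0.045)]."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)]

# Two of the topics from the output above
topics = [(0, '0.045*"Sugar" + 0.045*"driving" + 0.045*"stress"'),
          (1, '0.047*"My" + 0.047*"father" + 0.047*"sister"')]
parsed = {topic_id: parse_topic(s) for topic_id, s in topics}
print(parsed[0])  # [('Sugar', 0.045), ('driving', 0.045), ('stress', 0.045)]
```

Because LDA training is randomized, the topics and weights will generally differ from run to run unless a fixed seed is used.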
