# Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a popular approach for topic modeling. It works by identifying the key topics within a set of text documents, and the key words that make up each topic.

Under LDA, each document is assumed to have a mix of underlying (latent) topics, each topic with a certain probability of occurring in the document. Individual text documents can therefore be represented by the topics that make them up.

In this way, LDA topic modeling can be used to categorize or classify documents based on their topic content.
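
The generative assumption behind this description can be sketched in a few lines of plain Python. This is only an illustration: the two toy topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are hypothetical and come neither from the dataset used below nor from gensim.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    # 1. Draw the document's own mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    topics = list(topic_word)
    words = []
    for _ in range(n_words):
        # 2. Choose a topic for this word position according to the mixture.
        topic = rng.choices(topics, weights=theta)[0]
        # 3. Choose a word from that topic's word distribution.
        vocab, probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical toy topics (assumed for illustration only).
toy_topics = {
    "acting": {"perform": 0.5, "actor": 0.3, "cast": 0.2},
    "plot": {"stori": 0.6, "charact": 0.3, "twist": 0.1},
}
rng = random.Random(2018)
theta, doc = generate_document(toy_topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta)  # topic proportions for this document, summing to 1
print(doc)    # eight words drawn from the two topics
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.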

Each LDA topic model requires:

1. A set of documents for training the model—the training corpus
2. A dictionary of words to form the vocabulary used in the model—this can be derived from the training corpus


Once a model has been trained, it can be applied to a new set of documents to identify the topics in those new documents.

## Dataset

In [4]:
import pandas as pd

data = pd.read_csv(r'C:\Users\HP\Downloads\Captions-insta (1).csv', encoding = 'unicode_escape')
# Copy the caption column so that adding 'index' does not write to a view of `data`
data_text = data[['Captions']].copy()
data_text['index'] = data_text.index
documents = data_text

In [5]:
len(documents)

10662

In [6]:
documents.head()

| | Captions | index |
|---|---|---|
| 0 | the rock is destined to be the 21st century's ... | 0 |
| 1 | the gorgeously elaborate continuation of " the... | 1 |
| 2 | effective but too tepid biopic | 2 |
| 3 | if you sometimes like to go to the movies to h... | 3 |
| 4 | emerges as something rare , an issue movie tha... | 4 |


## Pre-processing

In [8]:
pip install gensim

Collecting gensim
  Downloading gensim-4.0.1-cp38-cp38-win_amd64.whl (23.9 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.0.0-py3-none-any.whl (56 kB)
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.0.1 smart-open-5.0.0
Note: you may need to restart the kernel to use updated packages.


In [9]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

# Seed NumPy's global random generator so that runs are more repeatable
np.random.seed(2018)



In [10]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:
print(WordNetLemmatizer().lemmatize('went', pos='v'))

go


In [12]:
stemmer = SnowballStemmer('english')
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 'agreed', 'owned',
                  'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization', 'sensational',
                  'traditional', 'reference', 'colonizer', 'plotted']
singles = [stemmer.stem(word) for word in original_words]
pd.DataFrame(data={'original word': original_words, 'stemmed': singles})

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [13]:
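# Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce the lemma to its stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stopwords and tokens of 3 characters or fewer,
    # then lemmatize and stem what remains
    return [
        lemmatize_stemming(token)
        for token in gensim.utils.simple_preprocess(text)
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3
    ]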

## Reviewing a Pre-Processed Document

In [14]:
doc_sample = documents[documents['index'] == 4310].values[0][0]

print('original document: ')
print(doc_sample.split(' '))
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['the', 'tug-of-war', 'at', 'the', 'core', 'of', 'beijing', 'bicycle', 'becomes', 'weighed', 'down', 'with', 'agonizing', 'contrivances', ',', 'overheated', 'pathos', 'and', 'long', ',', 'wistful', 'gazes', '.', '']


 tokenized and lemmatized document: 
['core', 'beij', 'bicycl', 'weigh', 'agon', 'contriv', 'overheat', 'patho', 'long', 'wist', 'gaze']


In [16]:
processed_docs = documents['Captions'].map(preprocess)

In [17]:
processed_docs[:10]

0    [rock, destin, centuri, conan, go, splash, gre...
1    [gorgeous, elabor, continu, lord, ring, trilog...
2                              [effect, tepid, biopic]
3             [like, movi, wasabi, good, place, start]
4    [emerg, rare, issu, movi, honest, keen, observ...
5    [film, provid, great, insight, neurot, mindset...
6               [offer, rare, combin, entertain, educ]
7    [pictur, liter, show, road, hell, pave, good, ...
8    [steer, turn, snappi, screenplay, curl, edg, c...
9    [care, offer, refresh, differ, slice, asian, c...
Name: Captions, dtype: object

## Bag of Words on the Dataset

Create a dictionary from `processed_docs`. The dictionary maps each unique token in the training set to an integer id and tracks how many documents each token appears in.

In [18]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [19]:
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

0 arnold
1 centuri
2 claud
3 conan
4 damm
5 destin
6 go
7 greater
8 jean
9 rock
10 schwarzenegg
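
The id assignment visible in the output above can be mimicked in plain Python. This is a rough sketch of the idea rather than gensim's implementation: the helper name `build_token2id` and the toy corpus are hypothetical, and the sketch assumes that new tokens within a document are added in sorted order, which is consistent with the alphabetical ids shown for the first document.

```python
def build_token2id(tokenized_docs):
    """Assign consecutive integer ids to unseen tokens, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        # Assumption: new tokens from one document are added in sorted order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Hypothetical toy corpus of already pre-processed documents.
toy_docs = [["rock", "destin", "centuri"], ["effect", "tepid", "biopic", "rock"]]
token2id = build_token2id(toy_docs)
print(token2id)
# {'centuri': 0, 'destin': 1, 'rock': 2, 'biopic': 3, 'effect': 4, 'tepid': 5}
```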


## Gensim doc2bow

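For each document, we create a bag-of-words representation: a list of (token id, count) pairs recording which dictionary words appear in the document and how many times. Before doing so, `filter_extremes` drops tokens that appear in fewer than 15 documents or in more than 50% of all documents, then keeps at most the 100,000 most frequent of the remaining tokens. We save the result to `bow_corpus`, then check the document we selected earlier.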

In [20]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
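
The effect of the `filter_extremes` call above can be approximated in plain Python. The sketch below is a simplified stand-in, not gensim's code: the function name `filter_by_document_frequency` and the toy documents are hypothetical, and edge cases such as rounding of the `no_above` threshold may differ from the library.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, no_below, no_above, keep_n=None):
    """Keep tokens that are neither too rare nor too common across documents."""
    num_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        # Count each token once per document (document frequency).
        doc_freq.update(set(doc))
    kept = [
        token
        for token, df in doc_freq.items()
        if no_below <= df <= no_above * num_docs
    ]
    # Most frequent tokens first; ties broken alphabetically for determinism.
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return kept[:keep_n]

# Hypothetical toy documents.
toy_docs = [["film", "good"], ["film", "bad"], ["film", "good", "long"], ["film"]]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(kept)  # ['good']
```

Here `film` is dropped because it appears in every document, while `bad` and `long` are dropped because each appears in only one.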

In [21]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(212, 1), (999, 1), (1075, 1)]

In [22]:
bow_doc_4310 = bow_corpus[4310]

for token_id, token_count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], token_count))

Word 212 ("long") appears 1 time.
Word 999 ("core") appears 1 time.
Word 1075 ("contriv") appears 1 time.
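
Only three tokens of document 4310 survive because the rest were removed from the dictionary by the filtering step. The conversion itself can be sketched with `collections.Counter`. The helper `doc_to_bow` and the ids in `token2id` are hypothetical and chosen for illustration.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    # Tokens missing from the dictionary are silently ignored.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical ids; "beij" is assumed to have been filtered out of the dictionary.
token2id = {"long": 0, "core": 1, "contriv": 2}
tokens = ["core", "beij", "long", "contriv", "long"]
bow = doc_to_bow(tokens, token2id)
print(bow)  # [(0, 2), (1, 1), (2, 1)]
```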


In [23]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [24]:
corpus_tfidf = tfidf[bow_corpus]

In [25]:
from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.4656267432394765),
 (1, 0.4188792390668849),
 (2, 0.4452270344009362),
 (3, 0.28266201549194714),
 (4, 0.3926158109071403),
 (5, 0.4188792390668849)]
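
The weights above can be reproduced in spirit with a short calculation. The sketch assumes gensim's default settings (raw term frequency, base-2 logarithm of the inverse document frequency, and normalization to unit length). The function name `tfidf_vector` and the toy document frequencies are hypothetical.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by tf * log2(N / df), then L2-normalize."""
    weights = [
        (token_id, tf * math.log2(num_docs / doc_freq[token_id]))
        for token_id, tf in bow
    ]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        # Every token occurs in all documents, so no weight remains.
        return []
    return [(token_id, w / norm) for token_id, w in weights]

# Hypothetical corpus statistics: 4 documents; token 0 is in 1 of them, token 1 in 2.
doc_freq = {0: 1, 1: 2}
vector = tfidf_vector([(0, 1), (1, 1)], doc_freq, num_docs=4)
print(vector)  # the rarer token 0 receives the larger weight
```

Because of the normalization step, the squared weights of a document sum to 1, which also holds for the six weights printed above.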


In [26]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [27]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.024*"movi" + 0.017*"film" + 0.012*"like" + 0.009*"director" + 0.009*"look" + 0.009*"matter" + 0.008*"good" + 0.008*"subject" + 0.007*"screen" + 0.007*"work"
Topic: 1 
Words: 0.044*"movi" + 0.038*"film" + 0.013*"charact" + 0.010*"stori" + 0.008*"peopl" + 0.008*"give" + 0.008*"great" + 0.008*"work" + 0.007*"tell" + 0.007*"look"
Topic: 2 
Words: 0.033*"movi" + 0.032*"film" + 0.019*"like" + 0.018*"good" + 0.009*"go" + 0.008*"feel" + 0.007*"stori" + 0.007*"moment" + 0.006*"watch" + 0.006*"thing"
Topic: 3 
Words: 0.014*"perform" + 0.013*"comedi" + 0.013*"director" + 0.011*"like" + 0.011*"look" + 0.010*"movi" + 0.010*"effect" + 0.008*"start" + 0.007*"time" + 0.007*"life"
Topic: 4 
Words: 0.032*"movi" + 0.021*"funni" + 0.013*"littl" + 0.012*"drama" + 0.010*"power" + 0.010*"entertain" + 0.010*"enjoy" + 0.009*"surpris" + 0.008*"good" + 0.007*"film"
Topic: 5 
Words: 0.023*"comedi" + 0.014*"film" + 0.010*"humor" + 0.009*"make" + 0.008*"drama" + 0.008*"movi" + 0.008*"stori" + 0.0

In [28]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [29]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.009*"movi" + 0.008*"film" + 0.008*"cast" + 0.006*"good" + 0.006*"thing" + 0.006*"like" + 0.006*"go" + 0.006*"funni" + 0.006*"stori" + 0.005*"thriller"
Topic: 1 Word: 0.013*"film" + 0.010*"best" + 0.010*"movi" + 0.008*"stori" + 0.008*"think" + 0.007*"year" + 0.007*"actor" + 0.006*"littl" + 0.006*"like" + 0.005*"power"
Topic: 2 Word: 0.010*"film" + 0.009*"movi" + 0.009*"give" + 0.008*"sentiment" + 0.008*"make" + 0.008*"director" + 0.007*"famili" + 0.007*"comedi" + 0.007*"stori" + 0.006*"good"
Topic: 3 Word: 0.010*"movi" + 0.010*"film" + 0.008*"come" + 0.007*"charm" + 0.007*"like" + 0.007*"interest" + 0.006*"time" + 0.006*"better" + 0.006*"summer" + 0.006*"expect"
Topic: 4 Word: 0.010*"film" + 0.008*"movi" + 0.008*"charact" + 0.008*"stori" + 0.006*"want" + 0.006*"make" + 0.006*"worth" + 0.006*"right" + 0.006*"documentari" + 0.005*"take"
Topic: 5 Word: 0.022*"movi" + 0.010*"film" + 0.008*"great" + 0.007*"love" + 0.007*"come" + 0.007*"year" + 0.007*"life" + 0.006*"like" + 0

## Classification of the topics

In [30]:
processed_docs[4310]

['core',
 'beij',
 'bicycl',
 'weigh',
 'agon',
 'contriv',
 'overheat',
 'patho',
 'long',
 'wist',
 'gaze']

In [31]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.7749382257461548	 
Topic: 0.067*"film" + 0.025*"stori" + 0.021*"time" + 0.014*"like" + 0.011*"love" + 0.010*"movi" + 0.009*"come" + 0.008*"know" + 0.008*"feel" + 0.008*"director"

Score: 0.025010637938976288	 
Topic: 0.017*"movi" + 0.014*"like" + 0.014*"time" + 0.011*"good" + 0.010*"film" + 0.010*"watch" + 0.010*"come" + 0.010*"make" + 0.009*"perform" + 0.008*"kid"

Score: 0.025008689612150192	 
Topic: 0.014*"perform" + 0.013*"comedi" + 0.013*"director" + 0.011*"like" + 0.011*"look" + 0.010*"movi" + 0.010*"effect" + 0.008*"start" + 0.007*"time" + 0.007*"life"

Score: 0.025007857009768486	 
Topic: 0.040*"film" + 0.024*"movi" + 0.022*"charact" + 0.012*"direct" + 0.010*"heart" + 0.010*"like" + 0.008*"good" + 0.008*"littl" + 0.008*"year" + 0.008*"get"

Score: 0.025007523596286774	 
Topic: 0.039*"movi" + 0.027*"like" + 0.019*"work" + 0.018*"film" + 0.010*"life" + 0.010*"feel" + 0.009*"real" + 0.009*"self" + 0.008*"charact" + 0.007*"better"

Score: 0.02500677853822708	 
Topic: 0.04

In [32]:
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.7748898267745972	 
Topic: 0.022*"movi" + 0.010*"film" + 0.008*"great" + 0.007*"love" + 0.007*"come" + 0.007*"year" + 0.007*"life" + 0.006*"like" + 0.006*"work" + 0.006*"turn"

Score: 0.025021912530064583	 
Topic: 0.021*"film" + 0.008*"work" + 0.008*"movi" + 0.007*"like" + 0.006*"dark" + 0.006*"perform" + 0.006*"life" + 0.006*"feel" + 0.006*"long" + 0.006*"dull"

Score: 0.025017863139510155	 
Topic: 0.010*"film" + 0.009*"movi" + 0.009*"look" + 0.009*"time" + 0.009*"comedi" + 0.009*"like" + 0.007*"joke" + 0.006*"origin" + 0.005*"long" + 0.004*"dumb"

Score: 0.02501489594578743	 
Topic: 0.013*"film" + 0.010*"best" + 0.010*"movi" + 0.008*"stori" + 0.008*"think" + 0.007*"year" + 0.007*"actor" + 0.006*"littl" + 0.006*"like" + 0.005*"power"

Score: 0.025012794882059097	 
Topic: 0.015*"like" + 0.013*"film" + 0.012*"movi" + 0.009*"feel" + 0.008*"entertain" + 0.007*"get" + 0.007*"minut" + 0.006*"hour" + 0.006*"cinemat" + 0.006*"screen"

Score: 0.025010203942656517	 
Topic: 0.010*"film"
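
Both models give document 4310 one score near 0.775 and scores near 0.025 for the remaining topics. A back-of-the-envelope calculation reproduces this pattern under two assumptions that are not stated in the notebook: that the models use a symmetric prior of `alpha = 1 / num_topics`, and that all three tokens retained for this document are attributed to a single topic.

```python
num_topics = 10
alpha = 1.0 / num_topics  # assumed symmetric prior per topic
n_tokens = 3              # tokens of document 4310 left after filtering

# Smoothed topic proportions: (prior + assigned tokens) / (total prior + all tokens)
denominator = num_topics * alpha + n_tokens
dominant = (alpha + n_tokens) / denominator
other = alpha / denominator

print(round(dominant, 3), round(other, 3))
```

With so few tokens left after filtering, the prior has a large influence on the scores, so the assignment to a dominant topic rests on very little evidence.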