## INTRO

Topic modelling of movie plots using LDA (Latent Dirichlet Allocation).
LDA will be performed on both the bag-of-words representation and the TF-IDF representation of the plots.

In [51]:
import pandas as pd

In [6]:
import gensim

In [8]:
#converts a document into a list of tokens ignoring tokens that are either too short or too long
from gensim.utils import simple_preprocess

In [9]:
from gensim.parsing.preprocessing import STOPWORDS

In [14]:
from nltk.stem import WordNetLemmatizer,SnowballStemmer #for lemmatization and stemming respectively

In [11]:
from nltk.stem.porter import *

In [12]:
import numpy as np

In [13]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [24]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\omw-1.4.zip.


True

### Loading the data

In [2]:
path_to_the_data = r'Data\wiki_movie_plots_deduped.csv'  # raw string so the backslash is not read as an escape

In [3]:
dataset = pd.read_csv(path_to_the_data)

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34886 entries, 0 to 34885
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Release Year      34886 non-null  int64 
 1   Title             34886 non-null  object
 2   Origin/Ethnicity  34886 non-null  object
 3   Director          34886 non-null  object
 4   Cast              33464 non-null  object
 5   Genre             34886 non-null  object
 6   Wiki Page         34886 non-null  object
 7   Plot              34886 non-null  object
dtypes: int64(1), object(7)
memory usage: 2.1+ MB


## Data Preprocessing

1) Break each plot into tokens.

2) Tokens of length < 4 are removed, since they are mostly prepositions, articles and abbreviations.

3) Stop words are removed (stop words occur very frequently in sentences and usually carry little information). (Query: does one need to remove stop words when using TF-IDF?)

4) Words are lemmatized (treated as verbs here), so that variations in word form due to tense and part of speech become uniform.

5) Words are stemmed so as to reduce them to their root form.
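
Steps 1 to 3 can be tried in isolation with a small dependency-free sketch. This only approximates what `simple_preprocess` plus stop-word filtering do: `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names made up for this illustration, the stop-word list is deliberately tiny, and lemmatization and stemming are left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list (gensim's STOPWORDS is far larger).
TOY_STOPWORDS = frozenset({"the", "with", "from", "that", "this", "into", "then"})


def toy_preprocess(text, min_len=4, max_len=15):
    """Lowercase, split on non-letters, drop too short/long tokens and stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        tok for tok in tokens
        if min_len <= len(tok) <= max_len and tok not in TOY_STOPWORDS
    ]


sentence = "The bartender is working at a saloon, serving drinks with a smile."
print(toy_preprocess(sentence))
# ['bartender', 'working', 'saloon', 'serving', 'drinks', 'smile']
```

Short words such as "is", "at" and "a" disappear through the length filter alone, which is why the length cut-off already removes many stop words.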



In [21]:
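# build the lemmatizer and the stemmer once, instead of once per token
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer(language = 'english')

def return_stemmed_lemmatized_token(token):
    ''' Lemmatizes the token as a verb and then stems the result'''
    lemmatized_token = lemmatizer.lemmatize(token, pos='v')
    
    stemmed_token = stemmer.stem(lemmatized_token)
    
    return stemmed_token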
    

In [22]:
def return_list_of_processed_tokens(text):
    ''' Given a text, returns the list of lemmatized and stemmed tokens, dropping stop words and tokens of length < 4'''
    
    list_of_tokens = []
    
    tokens_in_sen = gensim.utils.simple_preprocess(doc = text,min_len=4) # only returns tokens of length at least 4
    
    
    for token in tokens_in_sen :
        
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            stemmed_lemmatized_token = return_stemmed_lemmatized_token(token)
            
            list_of_tokens.append(stemmed_lemmatized_token)
    
    
    return list_of_tokens
    

## Obtaining the tokens 

In [25]:
tokens_from_the_plot = dataset['Plot'].map(return_list_of_processed_tokens)

In [26]:
tokens_from_the_plot.head()

0    [bartend, work, saloon, serv, drink, custom, f...
1    [moon, paint, smile, face, hang, park, night, ...
2    [film, minut, long, compos, shot, girl, sit, b...
3    [last, second, consist, shot, shoot, wood, win...
4    [earliest, know, adapt, classic, fairytal, fil...
Name: Plot, dtype: object

There you go! The plots are now converted to lemmatized and stemmed tokens.

## Building the Dictionary

In [32]:
dictonary_of_unique_words = gensim.corpora.Dictionary(tokens_from_the_plot)

In [33]:
dictonary_of_unique_words.num_docs

34886

In [38]:
# remove tokens that occur in fewer than 10 movie plots
# remove tokens that appear in more than 50% of the movie plots (these are unlikely to say much about the topic of a movie)

# once the above steps are completed, keep_n keeps only the 100000 most frequent of the remaining tokens

# note : no_below, no_above and keep_n all work on document frequency (the number of plots a token occurs in), not on the total count of the token across the corpus
dictonary_of_unique_words.filter_extremes(no_below=10, no_above=0.5, keep_n=100000)  
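
To see what this kind of filtering does, here is a small pure-Python sketch on a made-up corpus. It is not gensim's code: `toy_filter_extremes` is a hypothetical helper that mirrors the behaviour as I understand it (filter on document frequency, then keep the `keep_n` tokens with the highest document frequency), and the alphabetical tie-break is my own choice.

```python
from collections import Counter


def toy_filter_extremes(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    num_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = int(no_above * num_docs)  # turn the fraction into an absolute count
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Rank by document frequency (ties broken alphabetically here), keep the top keep_n.
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]


toy_docs = [
    ["kill", "love", "father"],
    ["kill", "love", "villag"],
    ["kill", "polic", "polic", "polic"],
    ["kill", "father"],
]
surviving = toy_filter_extremes(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(surviving)  # ['father', 'love']
```

Here "kill" is dropped because it is in every document, and "polic" is dropped even though it occurs three times, because all three occurrences are in a single document.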

## Bag of Words

A matrix in which each row is a document (a plot, in this case) and each column is a unique word in the vocabulary; each cell holds the number of times that word occurs in that document. gensim stores each row sparsely, listing only the words that actually occur in the document.


In [42]:
# gensim doc2bow returns a list of tuples of the form (id of the word, freq of the word in the doc)

In [41]:
bow_corpus = [dictonary_of_unique_words.doc2bow(each_plot_token) for each_plot_token in tokens_from_the_plot]

In [44]:
bow_corpus[100] # for a particular plot, holds the frequency of each vocabulary word that occurs in it

[(42, 1),
 (48, 1),
 (139, 1),
 (172, 2),
 (173, 1),
 (223, 1),
 (254, 1),
 (264, 1),
 (308, 1),
 (315, 1),
 (319, 1),
 (416, 1),
 (457, 1),
 (523, 1),
 (532, 1),
 (587, 1),
 (657, 1),
 (707, 2),
 (767, 1),
 (771, 1),
 (882, 3),
 (1120, 1),
 (1180, 1),
 (1371, 1),
 (1688, 1),
 (2019, 2),
 (2046, 1),
 (2211, 1),
 (2220, 1),
 (2221, 1),
 (2222, 1),
 (2223, 1),
 (2224, 1),
 (2225, 4),
 (2226, 1),
 (2227, 6),
 (2228, 1),
 (2229, 1),
 (2230, 1),
 (2231, 1),
 (2232, 1),
 (2233, 1),
 (2234, 1),
 (2235, 1),
 (2236, 1)]
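
The (id, count) pairs above can be reproduced in miniature. The sketch below is a simplified stand-in for `doc2bow`, not gensim's implementation: `toy_doc2bow` and the three-word vocabulary are hypothetical. It also shows that tokens missing from the vocabulary are simply skipped, which matters later for unseen documents.

```python
from collections import Counter


def toy_doc2bow(token2id, tokens):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are skipped."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


toy_token2id = {"father": 0, "kill": 1, "love": 2}  # hypothetical vocabulary
toy_bow = toy_doc2bow(toy_token2id, ["love", "kill", "love", "charect"])
print(toy_bow)  # [(1, 1), (2, 2)]
```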

## TF-IDF method of vectorizing sentences

With raw counts (bag of words or n-grams), rarely occurring words are often eliminated to reduce the space requirement, even though they can be informative. TF-IDF addresses this by reweighting words instead of relying on counts alone.

TF-IDF increases the weight of a word if it occurs multiple times in a single document and decreases the weight if it occurs in many documents (in which case it probably behaves like a stop word).
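
The weighting can be worked through by hand on a toy corpus. The sketch below assumes the scheme weight = count * log2(N / df) followed by L2 normalisation of each document, which is my understanding of gensim's default `TfidfModel` settings; `toy_tfidf` and the corpus are hypothetical and only meant to show the arithmetic.

```python
import math


def toy_tfidf(bow_corpus):
    """Weight (token_id, count) pairs by count * log2(N / df), then L2-normalise."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _count in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, cnt * math.log2(num_docs / doc_freq[tid])) for tid, cnt in doc]
        # A token present in every document has idf = log2(1) = 0 and is dropped.
        weights = [(tid, w) for tid, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _tid, w in weights))
        weighted_corpus.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return weighted_corpus


toy_corpus = [
    [(0, 1), (1, 2)],          # token 0 appears in every document
    [(0, 1), (2, 1)],
    [(0, 3), (1, 1), (2, 1)],
]
toy_weighted = toy_tfidf(toy_corpus)
for doc in toy_weighted:
    print(doc)
```

Token 0 occurs three times in the last document, yet it gets no weight at all because it appears in every document.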

In [45]:
from gensim import corpora, models

In [48]:
tf_idf_model = models.TfidfModel(bow_corpus) # fitting the tf - idf model

In [50]:
# getting the tf idf represnetation of each document

corpus_tfidf = tf_idf_model[bow_corpus]

## Using LDA to do topic modelling of movie plots

LDA here stands for Latent Dirichlet Allocation (not Linear Discriminant Analysis, which separates labelled classes). It is an unsupervised model that treats each document as a mixture of topics and each topic as a probability distribution over words; fitting the model estimates both from the corpus.

Applying LDA on the TF-IDF and bag-of-words representations
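
gensim's `LdaMulticore` fits Latent Dirichlet Allocation, and the generative story behind that model can be sketched in a few lines of plain Python. This is only an illustration of how the model assumes documents come about, not of how gensim fits it; the two topics, the `alpha` values and the helper names are all made up.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Pick topic proportions for the document, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        topic_id = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[topic_id].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topics: each one is a probability distribution over a tiny vocabulary.
toy_topics = [
    {"love": 0.5, "marri": 0.3, "famili": 0.2},
    {"kill": 0.5, "polic": 0.3, "murder": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Training runs this story in reverse: given only the words of each plot, it estimates the topic-word distributions and the per-plot topic proportions.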

In [52]:
#using TF IDF

In [58]:
# initializing LDA

# corpus = vectorized result
# id2word = mapping from word ids to words. Note that, as seen above, the corpus only contains the ids of the words and not the words themselves
# workers = since we are using the multiprocessing version of the LDA implementation from gensim, this is the number of worker processes

lda_model_tfidf = gensim.models.LdaMulticore(corpus = corpus_tfidf, num_topics=4, id2word=dictonary_of_unique_words,workers=4)

In [64]:
def show_topics(model,num_words):
    for idx, topic in model.show_topics(num_words=num_words): # prints the top num_words words of each topic
        print('Topic: {} Words & Probability: {}'.format(idx, topic))

In [65]:
show_topics(lda_model_tfidf,5)

Topic: 0 Words & Probability: 0.002*"film" + 0.002*"kill" + 0.001*"life" + 0.001*"villag" + 0.001*"stori"
Topic: 1 Words & Probability: 0.002*"vijay" + 0.002*"love" + 0.002*"villag" + 0.002*"kill" + 0.002*"father"
Topic: 2 Words & Probability: 0.002*"love" + 0.002*"marri" + 0.002*"raju" + 0.002*"father" + 0.002*"famili"
Topic: 3 Words & Probability: 0.002*"famili" + 0.002*"father" + 0.002*"love" + 0.001*"mother" + 0.001*"girl"


In [62]:
lda_model_bow = gensim.models.LdaMulticore(corpus = bow_corpus, num_topics=4, id2word=dictonary_of_unique_words,workers=4)

In [66]:
show_topics(lda_model_bow,5)

Topic: 0 Words & Probability: 0.008*"kill" + 0.006*"leav" + 0.005*"take" + 0.004*"tell" + 0.004*"return"
Topic: 1 Words & Probability: 0.006*"father" + 0.006*"kill" + 0.006*"leav" + 0.005*"tell" + 0.004*"friend"
Topic: 2 Words & Probability: 0.006*"love" + 0.006*"kill" + 0.005*"famili" + 0.005*"tell" + 0.005*"take"
Topic: 3 Words & Probability: 0.007*"love" + 0.007*"tell" + 0.006*"leav" + 0.005*"kill" + 0.005*"father"


#### Thus we clearly observe that the topics differ between the bag-of-words and TF-IDF representations

## Testing and Visualization 

Random Sampling for Bag of words

In [67]:
sample = tokens_from_the_plot[23]

In [69]:
# get the bag of words representation for the sample
bow_sample = dictonary_of_unique_words.doc2bow(sample)

In [75]:
lda_model_bow[bow_sample]

[(2, 0.11183456), (3, 0.88134974)]

In [77]:
# list the topics for the sample, highest score first

for index,score in sorted(lda_model_bow[bow_sample],key=lambda pair: pair[1],reverse=True):
    print('\n Score :{}\t \nTopic{}'.format(score, lda_model_bow.print_topic(index,10)))


 Score :0.9063881039619446	 
Topic0.007*"love" + 0.007*"tell" + 0.006*"leav" + 0.005*"kill" + 0.005*"father" + 0.005*"marri" + 0.005*"time" + 0.005*"come" + 0.005*"friend" + 0.004*"go"

 Score :0.08679026365280151	 
Topic0.006*"love" + 0.006*"kill" + 0.005*"famili" + 0.005*"tell" + 0.005*"take" + 0.004*"friend" + 0.004*"leav" + 0.004*"go" + 0.004*"help" + 0.004*"return"


Random Sampling with Tfidf 

In [80]:
# get the tf idf vector representation for the sample
tf_idf_sample = tf_idf_model[bow_sample]

In [84]:
lda_model_tfidf[tf_idf_sample]

[(0, 0.6495862), (1, 0.03226512), (2, 0.28528637), (3, 0.032862287)]
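
One thing to watch when ranking these pairs: each tuple is (topic id, probability), so a plain `sorted(..., reverse=True)` orders by topic id, not by probability. A small check using the values printed above (the variable names are just for illustration):

```python
topic_scores = [(0, 0.6495862), (1, 0.03226512), (2, 0.28528637), (3, 0.032862287)]

# Tuples compare on their first element first, so this orders by topic id.
by_topic_id = sorted(topic_scores, reverse=True)
# Sorting on the second element orders by probability instead.
by_score = sorted(topic_scores, key=lambda pair: pair[1], reverse=True)

print([topic_id for topic_id, _ in by_topic_id])  # [3, 2, 1, 0]
print([topic_id for topic_id, _ in by_score])     # [0, 2, 3, 1]
```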

In [85]:
# list the topics for the sample, highest score first

for index,score in sorted(lda_model_tfidf[tf_idf_sample],key=lambda pair: pair[1],reverse=True):
    print('\n Score :{}\t \nTopic{}'.format(score, lda_model_tfidf.print_topic(index,10)))


 Score :0.7172452211380005	 
Topic0.002*"film" + 0.002*"kill" + 0.001*"life" + 0.001*"villag" + 0.001*"stori" + 0.001*"young" + 0.001*"famili" + 0.001*"girl" + 0.001*"father" + 0.001*"year"

 Score :0.21758344769477844	 
Topic0.002*"love" + 0.002*"marri" + 0.002*"raju" + 0.002*"father" + 0.002*"famili" + 0.001*"stori" + 0.001*"arjun" + 0.001*"brother" + 0.001*"marriag" + 0.001*"film"

 Score :0.03289150819182396	 
Topic0.002*"famili" + 0.002*"father" + 0.002*"love" + 0.001*"mother" + 0.001*"girl" + 0.001*"wife" + 0.001*"marri" + 0.001*"film" + 0.001*"friend" + 0.001*"kill"

 Score :0.03227980062365532	 
Topic0.002*"vijay" + 0.002*"love" + 0.002*"villag" + 0.002*"kill" + 0.002*"father" + 0.002*"polic" + 0.001*"murder" + 0.001*"marri" + 0.001*"famili" + 0.001*"ajay"


### Comments

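Strangely, the BOW model is the more decisive one for sample 23: it puts about 0.9 of the probability on a single topic, whereas the TF-IDF model puts about 0.7 on its top topic. Topic ids are not comparable between the two models, so only the scores can be compared.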

## Unseen Document

In [86]:
# define unseen text

unseen_document = "The main charecter runs out of the house and tells his friend to get some help"

In [89]:
# get the bag of words representation for the unseen document (tokens that are not in the dictionary are ignored)
bow_sample = dictonary_of_unique_words.doc2bow(return_list_of_processed_tokens(unseen_document))

Bag of Words

In [90]:
# list the topics for the unseen document, highest score first

for index,score in sorted(lda_model_bow[bow_sample],key=lambda pair: pair[1],reverse=True):
    print('\n Score :{}\t \nTopic{}'.format(score, lda_model_bow.print_topic(index,10)))


 Score :0.8845025897026062	 
Topic0.007*"love" + 0.007*"tell" + 0.006*"leav" + 0.005*"kill" + 0.005*"father" + 0.005*"marri" + 0.005*"time" + 0.005*"come" + 0.005*"friend" + 0.004*"go"

 Score :0.03881495073437691	 
Topic0.006*"love" + 0.006*"kill" + 0.005*"famili" + 0.005*"tell" + 0.005*"take" + 0.004*"friend" + 0.004*"leav" + 0.004*"go" + 0.004*"help" + 0.004*"return"

 Score :0.03852028772234917	 
Topic0.006*"father" + 0.006*"kill" + 0.006*"leav" + 0.005*"tell" + 0.004*"friend" + 0.004*"go" + 0.004*"take" + 0.004*"meet" + 0.004*"year" + 0.004*"come"

 Score :0.03816216066479683	 
Topic0.008*"kill" + 0.006*"leav" + 0.005*"take" + 0.004*"tell" + 0.004*"return" + 0.004*"friend" + 0.004*"go" + 0.004*"film" + 0.003*"father" + 0.003*"come"


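Clearly the LDA on bag of words assigns the unseen document to topic 3: the words of the top-scoring topic match Topic 3 in the bag-of-words topic listing printed earlier. The "0" right after "Topic" in the printout is the start of the first word weight, not a topic id.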

Tf IDF 

In [91]:
# get the tf idf vector representation for the sample
tf_idf_sample = tf_idf_model[bow_sample]

In [92]:
# list the topics for the unseen document, highest score first

for index,score in sorted(lda_model_tfidf[tf_idf_sample],key=lambda pair: pair[1],reverse=True):
    print('\n Score :{}\t \nTopic{}'.format(score, lda_model_tfidf.print_topic(index,10)))


 Score :0.7449244260787964	 
Topic0.002*"famili" + 0.002*"father" + 0.002*"love" + 0.001*"mother" + 0.001*"girl" + 0.001*"wife" + 0.001*"marri" + 0.001*"film" + 0.001*"friend" + 0.001*"kill"

 Score :0.08571653068065643	 
Topic0.002*"film" + 0.002*"kill" + 0.001*"life" + 0.001*"villag" + 0.001*"stori" + 0.001*"young" + 0.001*"famili" + 0.001*"girl" + 0.001*"father" + 0.001*"year"

 Score :0.08547670394182205	 
Topic0.002*"love" + 0.002*"marri" + 0.002*"raju" + 0.002*"father" + 0.002*"famili" + 0.001*"stori" + 0.001*"arjun" + 0.001*"brother" + 0.001*"marriag" + 0.001*"film"

 Score :0.08388233184814453	 
Topic0.002*"vijay" + 0.002*"love" + 0.002*"villag" + 0.002*"kill" + 0.002*"father" + 0.002*"polic" + 0.001*"murder" + 0.001*"marri" + 0.001*"famili" + 0.001*"ajay"


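The TF-IDF model also assigns the unseen sentence to its own topic 3 (the top-scoring words match Topic 3 in the TF-IDF topic listing), but with a lower probability: about 0.74 versus 0.88 for bag of words. Topic ids are not comparable between the two models.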