# Topic Modelling

Topic modelling identifies the topics within texts. For the drama reviews, this helps in understanding what viewers are saying about the dramas.

This notebook is the fourth in a five-part series on my drama reviews project.


## 1. Import libraries and load CSV file

In [2]:
import pandas as pd
import numpy as np
import gensim
from gensim import corpora

import pyLDAvis # libraries for visualization
import pyLDAvis.gensim
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [36]:
df = pd.read_csv('drama_reviews_processed.csv')
df.head()

Unnamed: 0,drama_title,user_name,overall_rating,story_rating,cast_rating,music_rating,rewatch_value_rating,reviews,sentiment,reviews_processed,language,reviews_processed2x,reviews_lemmatized
0,Dear My Friends (2016),iamgeralddd,10.0,10.0,10.0,10.0,10.0,Thank you writer Noh for making this heart-wa...,1,thank you writer noh for making this heart war...,en,thank writer noh making heart warming story co...,heart warming story live drama excited weekend...
1,Dear My Friends (2016),Dounie,10.0,10.0,10.0,8.5,9.0,"I know for some, stories following and tellin...",1,i know for some stories following and telling ...,en,know stories following telling lives older peo...,story old people promise boring decide try fun...
2,Dear My Friends (2016),Pelin,10.0,10.0,10.0,10.0,10.0,"Story ""A realistic, cheerful story about “twi...",1,story a realistic cheerful story about twiligh...,en,story realistic cheerful story twilight youths...,story realistic cheerful story twilight young ...
3,Dear My Friends (2016),silent_whispers,9.0,9.0,10.0,10.0,7.0,When I heard about a drama that would be comi...,1,when i heard about a drama that would be comin...,en,heard drama would coming 2016 twilight youths ...,drama twilight youth life long friend drama lo...
4,Dear My Friends (2016),Dana,9.0,9.0,10.0,7.0,3.0,In a sometimes overwhelming world of perfect ...,1,in a sometimes overwhelming world of perfect f...,en,sometimes overwhelming world perfect faces scr...,overwhelming world perfect script dear friend ...


We use the lemmatized reviews because they keep only the meaningful words for analysis; stopwords, which are not helpful here, have already been removed. Some reviews have no words left after this processing, so we drop the rows with nothing in the 'reviews_lemmatized' column.

In [44]:
df = df.dropna(subset=['reviews_lemmatized'])
df = df.reset_index(drop=True)

## 2. Building a Latent Dirichlet Allocation (LDA) model

Topic modelling is new to me, so I used https://www.analyticsvidhya.com/blog/2018/10/mining-online-reviews-topic-modeling-lda/ as a guide. LDA assumes that each document is a mixture of topics and that each topic is a distribution over words. It finds the topics by working backwards from the documents:

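1. Assume k topics occur across all documents
2. For each document, randomly assign each word to one of the k topics
3. For each word w in document d and each topic t, find: (1) p(topic t | document d), the share of words in d currently assigned to t, and (2) p(word w | topic t), the share of assignments to t that come from w
4. Reassign w to a new topic, chosen with probability proportional to p(topic t | document d) * p(word w | topic t)
5. Repeat steps 3 and 4 until the assignments settle, which gives the topic composition of each document and the word composition of each topic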
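The steps above can be sketched as a toy sampler in plain Python. This is only an illustration of the reassignment idea, not what gensim runs internally (gensim's `LdaModel` uses an online variational Bayes algorithm). The function name `toy_lda_gibbs`, the smoothing values `alpha` and `beta`, and the toy documents are all assumptions made for this sketch.

```python
import random
from collections import defaultdict

def toy_lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler that follows the five steps above (illustration only)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables used to estimate the two probabilities in step 3.
    doc_topic = [[0] * k for _ in docs]                # words in doc d assigned to topic t
    topic_word = [defaultdict(int) for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                              # all words assigned to topic t

    # Step 2: give every word occurrence a random starting topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            t = rng.randrange(k)
            doc_assign.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(doc_assign)

    # Steps 3 to 5: repeatedly resample the topic of each word.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the word's current assignment out of the counts.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # Weight for topic t is proportional to
                # p(topic t | document d) * p(word w | topic t), with smoothing.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                t_new = rng.choices(range(k), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return doc_topic, topic_word

# Hypothetical toy documents, already tokenized.
toy_docs = [
    ["school", "student", "teacher", "school"],
    ["love", "boy", "girl", "love"],
    ["school", "teacher", "student"],
    ["boy", "love", "couple"],
]
doc_topic, topic_word = toy_lda_gibbs(toy_docs, k=2)
print(doc_topic)  # per-document topic counts
```

Each row of `doc_topic` approximates the topic composition of one document, and each entry of `topic_word` approximates the word composition of one topic.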

In [46]:
tokenized_reviews = df['reviews_lemmatized'].str.split() # split each review into a list of tokens

In [47]:
dictionary = corpora.Dictionary(tokenized_reviews) # create a dictionary of all the words

In [48]:
doc_term_matrix = [dictionary.doc2bow(rev) for rev in tokenized_reviews] # create a document term matrix

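A document term matrix records how often each term occurs in each document of a collection. Here it is stored sparsely: `doc2bow` turns each review into a list of (token id, count) pairs covering only the words that review contains.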
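To make the bag-of-words representation concrete, here is a minimal plain-Python sketch of the same idea. It is not gensim's implementation: the helpers `build_vocab` and `to_bow` and the toy reviews are made up for illustration, and gensim may assign token ids in a different order.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical toy reviews, already tokenized.
toy_reviews = [
    ["story", "heart", "warming", "story"],
    ["drama", "story", "friend"],
]
token2id = build_vocab(toy_reviews)
bow = [to_bow(doc, token2id) for doc in toy_reviews]
print(bow)  # [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1)]]
```

In the first toy review "story" (id 0) appears twice, so its pair is `(0, 2)`; words that do not occur in a review simply have no pair.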

In [49]:
LDA = gensim.models.ldamodel.LdaModel # alias for gensim's LDA model class

lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=7, random_state=100,
                chunksize=1000, passes=50) # Build LDA model

In [50]:
lda_model.print_topics()

[(0,
  '0.060*"drama" + 0.029*"story" + 0.029*"character" + 0.023*"good" + 0.017*"episode" + 0.013*"time" + 0.010*"love" + 0.010*"great" + 0.010*"music" + 0.009*"many"'),
 (1,
  '0.068*"time" + 0.063*"actor" + 0.034*"day" + 0.029*"boy" + 0.029*"main" + 0.025*"break" + 0.018*"year" + 0.018*"drama" + 0.017*"love" + 0.017*"episode"'),
 (2,
  '0.059*"character" + 0.048*"drama" + 0.030*"certain" + 0.020*"storyline" + 0.018*"beautiful" + 0.018*"story" + 0.015*"scene" + 0.015*"interesting" + 0.013*"life" + 0.013*"role"'),
 (3,
  '0.092*"drama" + 0.023*"japanese" + 0.022*"platonic" + 0.021*"watch" + 0.019*"old" + 0.019*"life" + 0.018*"first" + 0.018*"thing" + 0.018*"whole" + 0.017*"tearjerker"'),
 (4,
  '0.034*"woman" + 0.026*"lead" + 0.025*"drama" + 0.023*"main" + 0.023*"enough" + 0.022*"gay" + 0.020*"character" + 0.019*"lakorn" + 0.018*"story" + 0.016*"mature"'),
 (5,
  '0.025*"life" + 0.023*"school" + 0.021*"family" + 0.017*"people" + 0.016*"student" + 0.016*"love" + 0.015*"friend" + 0.014*

Looking at these keywords, I made some guesses about the topics (numbered as in the output above):
- Topic 0 seems to reflect positive sentiment, with words like "good", "love" and "great"
- Topic 1 seems to be about romance dramas, with words like "boy", "break" and "love"
- Topic 5 seems to be related to family / school life
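The topic strings returned by `print_topics()` are easier to compare once they are split into (word, weight) pairs. The sketch below parses one such string with a regular expression; `parse_topic` is a helper made up for this example, and the input is the start of topic 0 from the output above. gensim's `lda_model.show_topic(topic_id)` returns similar pairs directly, so parsing is only needed when all you have is the printed text.

```python
import re

def parse_topic(topic_string):
    """Split a print_topics() string into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First four terms of topic 0, copied from the printed output.
topic_0 = '0.060*"drama" + 0.029*"story" + 0.029*"character" + 0.023*"good"'
parsed = parse_topic(topic_0)
print(parsed)
# [('drama', 0.06), ('story', 0.029), ('character', 0.029), ('good', 0.023)]
```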

## 3. Topic visualization

In [51]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
vis



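Apart from clusters 3, 4 and 5, the topics overlap heavily in the visualization. Topic modelling may therefore be of limited use for separating the topics in drama reviews from the MyDramaList website. Note that pyLDAvis numbers its clusters from 1 and by default orders them by size, so cluster numbers need not match the topic indices printed by `print_topics()`.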
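One rough way to see why the topics look alike is to count how many topics each top keyword appears in. The sketch below does this for the five topics whose keyword lists are printed in full in the `print_topics()` output (topics 0 to 4, using gensim's indices rather than pyLDAvis cluster numbers). It is a simple keyword count, not the distance measure pyLDAvis uses, and the threshold of three topics is an arbitrary choice for illustration.

```python
from collections import Counter

# Top 10 keywords per topic, copied from the print_topics() output.
top_words = {
    0: ["drama", "story", "character", "good", "episode",
        "time", "love", "great", "music", "many"],
    1: ["time", "actor", "day", "boy", "main",
        "break", "year", "drama", "love", "episode"],
    2: ["character", "drama", "certain", "storyline", "beautiful",
        "story", "scene", "interesting", "life", "role"],
    3: ["drama", "japanese", "platonic", "watch", "old",
        "life", "first", "thing", "whole", "tearjerker"],
    4: ["woman", "lead", "drama", "main", "enough",
        "gay", "character", "lakorn", "story", "mature"],
}

# Count the number of topics in which each keyword appears.
topic_counts = Counter(
    word for words in top_words.values() for word in set(words)
)

# Keep keywords shared by at least three of the five topics.
shared = {word: n for word, n in topic_counts.items() if n >= 3}
print(shared)
```

Here "drama" appears in all five keyword lists, and "story" and "character" appear in three each, so generic review vocabulary accounts for much of the overlap between these topics.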

I made some guesses about the more distinct topics as well:
- Cluster 4 could revolve around comedy dramas, as words like "comedy", "humor", "funny" and "dramatic" were used
- Cluster 5 could represent youth romance dramas, as the words "break", "couple", "girl", "boy" and "fault" were used
- Cluster 3 could reflect school life dramas, as words such as "school", "student", "young", "friendship" and "teacher" were used