# Movie Classification Team 11

# Latent Dirichlet Allocation

### Team Members:
Andrew Lund, Nicholas Morgam, Amay Umradia, Charles Webb

**The purpose of this notebook, as part of the future scope of work:**
1. To explore the dataset of TMDB plots for 1000 movies using Latent Dirichlet Allocation (LDA). We will primarily use vanilla LDA, together with what we learned from our previous modeling techniques.
2. To visualize the clusters for the unlabelled dataset.
3. As future work, to tune the LDA hyperparameters and to try more advanced models similar to LDA.

In [1]:
#import libraries and set seaborn styling
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tmdbsimple as tmdb
import requests
import time
import numpy as np
from ast import literal_eval
from collections import Counter
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import models
sns.set_context('talk')
sns.set_style('ticks')



In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

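# import from the public sklearn.decomposition package; the private
# sklearn.decomposition.online_lda path is no longer importable in recent scikit-learn releases
from sklearn.decomposition import LatentDirichletAllocation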

import matplotlib.pyplot as plt
import seaborn as sns

import pyLDAvis
import pyLDAvis.sklearn  # note: recent pyLDAvis releases expose this module as pyLDAvis.lda_model

---
# Load the movie data from /data

We will be using the TMDB plots as our predictor variable throughout this notebook. The idea is to apply Latent Dirichlet Allocation to these plots and observe how the movies group together based on the words they share.

In [3]:
movies = pd.read_csv('data/movies.csv')

#define tokenizer
tokenizer = RegexpTokenizer(r'\w+')
#set stop words list
english_stop = get_stop_words('en')
print(len(english_stop))

#function to clean plots
def clean_plot(plot):
    '''
    clean_plot()
    -applies the following to the plot of a movie:
        1) lowercases the string
        2) tokenizes it into words
        3) removes English stop words

    -inputs: plot (string)
    
    -outputs: list of tokens from the plot
    '''
    plot = plot.lower()
    plot = tokenizer.tokenize(plot)
    plot = [word for word in plot if word not in english_stop]
    return plot

#apply to the TMDB, IMDB and combined plots in the movies df
movies['tmdb_clean_plot'] = movies['tmdb_plot'].apply(clean_plot)
movies['imdb_clean_plot'] = movies['imdb_plot'].apply(clean_plot)
movies['combined_clean_plot'] = movies['combined_plots'].apply(clean_plot)

movies.head(2)

174


Unnamed: 0.1,Unnamed: 0,tmdb_id,imdb_id,tmdb_genres,imdb_genres,binary_tmdb,binary_imdb,tmdb_plot,imdb_plot,popularity,...,combined_bow_plots,combined_clean_plot,tmdb_w2v_plot_mean,imdb_w2v_plot_mean,combined_w2v_plot_mean,tmdb_w2v_plot_matrix,imdb_w2v_plot_matrix,combined_w2v_plot_matrix,post_combined_clean_plot,post_tmdb_clean_plot
0,0,278,tt0111161,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Framed in the 1940s for the double murder of h...,Chronicles the experiences of a formerly succe...,28.527767,...,"(0, 1092)\t0.15089615016\r\r\r\r\n (0, 811)...","[framed, 1940s, double, murder, wife, lover, u...",[ 1.41657051e-02 3.57291475e-02 3.5566851...,[ 4.66356799e-03 9.01858658e-02 -1.2476068...,[ 0.00908005 0.064875 0.00985374 0.060550...,"[[-0.08300781 0.25390625 0.07128906 ..., -0....","[[ 0.0201416 0.11474609 -0.35742188 ..., -0....","[[-0.08300781 0.25390625 0.07128906 ..., -0....",framed 1940s double murder wife lover upstandi...,framed 1940s double murder wife lover upstandi...
1,1,238,tt0068646,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Spanning the years 1945 to 1955 a chronicle o...,When the aging head of a famous crime family d...,36.965452,...,"(0, 1773)\t0.104854849055\r\r\r\r\n (0, 287...","[spanning, years, 1945, 1955, chronicle, ficti...",[-0.01682084 0.05966978 -0.00681898 0.042978...,[-0.01332631 0.0813482 0.03576481 0.067564...,[ -1.48730669e-02 7.17528313e-02 1.6916243...,"[[ 0.05175781 0.02502441 -0.12255859 ..., 0....","[[-0.07470703 0.49804688 -0.07373047 ..., 0....","[[ 0.05175781 0.02502441 -0.12255859 ..., 0....",spanning years 1945 1955 chronicle fictional i...,spanning years 1945 1955 chronicle fictional i...
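
As a minimal sketch of what `clean_plot` does to a single plot, the snippet below reproduces the three steps with the standard library only. It assumes that `RegexpTokenizer(r'\w+')` behaves like `re.findall(r'\w+', ...)`, and `STOP_WORDS` is a tiny hypothetical stand-in for the 174-word list returned by `get_stop_words('en')`. The sample sentence is made up for illustration.

```python
import re

# Hypothetical miniature stop list; the notebook uses get_stop_words('en') instead.
STOP_WORDS = {"a", "is", "for", "the", "of", "his"}

def clean_plot_sketch(plot):
    """Lowercase, tokenize on runs of word characters, drop stop words."""
    tokens = re.findall(r"\w+", plot.lower())
    return [token for token in tokens if token not in STOP_WORDS]

# Made-up sample sentence, not taken from the dataset.
sample = "A banker is framed for the murder of his wife; the film follows the man's new life."
tokens = clean_plot_sketch(sample)
print(tokens)
print(" ".join(tokens))  # the joined form is what a vectorizer would consume
```

Note that `\w+` splits on apostrophes, so `man's` becomes the two tokens `man` and `s`. This is why stray `s` tokens appear in the cleaned plots (for example `man s unique` in the cleaned Shawshank plot).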


# Function to join the list of tokens back into a single string

In [4]:
def post_process(list1):
    str1 = " ".join(list1)
    return str1

movies['post_tmdb_clean_plot'] = movies['tmdb_clean_plot'].apply(lambda x: post_process(x))

In [5]:
movies.post_combined_clean_plot[0]

'framed 1940s double murder wife lover upstanding banker andy dufresne begins new life shawshank prison puts accounting skills work amoral warden long stretch prison dufresne comes admired inmates including older prisoner named red integrity unquenchable sense hope chronicles experiences formerly successful banker prisoner gloomy jailhouse shawshank found guilty crime commit film portrays man s unique way dealing new torturous life along way befriends number fellow prisoners notably wise long term inmate named red j s golden'

## Apply TF-IDF to the cleaned TMDB plots

In [6]:
english_stop = get_stop_words('en')

In [7]:
tfidf_vectorizer = TfidfVectorizer(
    max_features=8000,
    max_df=0.9,
    min_df=0.02,
    stop_words=english_stop,
    ngram_range=(1, 10),
    lowercase=True,
)
  
tmdb_bow = tfidf_vectorizer.fit_transform(movies['post_tmdb_clean_plot'])
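
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on three made-up documents. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows) and treats `min_df` and `max_df` as document-frequency proportions, as in the call above. The function name `tfidf_sketch` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(docs, min_df=0.0, max_df=1.0):
    """Return (vocabulary, rows) of smoothed-IDF, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in tokenized for term in set(doc))
    # keep terms whose document-frequency proportion lies in [min_df, max_df]
    vocab = sorted(t for t, c in df.items() if min_df <= c / n_docs <= max_df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard against empty rows
        rows.append([w / norm for w in raw])
    return vocab, rows

# Made-up toy corpus of already-cleaned plots.
docs = [
    "prison banker prison hope",
    "crime family crime boss",
    "banker crime hope",
]
vocab, rows = tfidf_sketch(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, `prison` gets a higher weight than `banker` because it occurs twice there and in no other document. With roughly 1000 plots, `min_df=0.02` would keep only terms found in at least about 20 plots, and `max_df=0.9` would drop terms found in more than 90% of them.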

### Without a fixed random seed, LDA gives different outputs on each run
https://stats.stackexchange.com/questions/171463/topic-modeling-lda-gives-different-outputs
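
A minimal sketch of that behaviour, assuming scikit-learn is installed. The toy corpus and the helper `fit_topics` are hypothetical, and the model is far smaller than the one used in this notebook. Fitting twice with the same `random_state` should reproduce the document-topic matrix, while another seed will usually give a different one.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus of already-cleaned plots.
docs = [
    "soldier army battle war",
    "war soldier general battle",
    "family father son home",
    "mother family home daughter",
    "detective murder police crime",
    "crime police gangster murder",
]
X = CountVectorizer().fit_transform(docs)

def fit_topics(seed):
    """Fit a small online LDA model and return the document-topic matrix."""
    lda = LatentDirichletAllocation(
        n_components=2, learning_method="online", max_iter=10, random_state=seed
    )
    return lda.fit_transform(X)

run_a = fit_topics(10)
run_b = fit_topics(10)  # same seed as run_a
run_c = fit_topics(11)  # different seed

print("same seed reproduces the result:", np.allclose(run_a, run_b))
print("max abs difference with another seed:", np.abs(run_a - run_c).max())
```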


---
**Function for LDA: takes as input the number of topics (which should be tuned for better results), the n-gram tokens, and the stop words**
---
**Use LDA's `fit_transform` to get the contribution of each topic to each movie**

In [8]:
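lda = LatentDirichletAllocation(n_components=19, max_iter=100, learning_offset=200, learning_method='online', random_state=10)

# fit_transform returns one row of topic proportions per movie; scale to percentages
tmdb_bow_prob = lda.fit_transform(tmdb_bow) * 100

# index the rounded percentages by movie title for display
tmdb_prob_data = pd.DataFrame(np.around(tmdb_bow_prob, 2), index=movies.title)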

In [9]:
#https://pandas.pydata.org/pandas-docs/stable/style.html
cm = sns.light_palette("lightblue", as_cmap=True)
dataframe_colored = tmdb_prob_data.sample(n=5, random_state=5).style.background_gradient(cmap=cm)
dataframe_colored


title,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
This Is England,1.09,1.09,1.09,1.09,1.09,1.09,1.09,1.09,1.09,1.09,1.09,1.09,1.09,1.09,1.09,80.43,1.09,1.09,1.09
Romeo and Juliet,2.01,2.01,2.01,2.01,2.01,2.01,2.01,2.01,2.01,2.01,63.86,2.01,2.01,2.01,2.01,2.01,2.01,2.01,2.01
Annie Hall,1.53,1.53,1.53,1.53,1.53,1.53,1.53,1.53,1.53,1.53,72.39,1.53,1.53,1.53,1.53,1.53,1.53,1.53,1.53
The Green Mile,1.32,1.32,1.32,1.32,1.32,1.32,1.32,1.32,1.32,12.83,1.32,55.84,1.32,1.32,1.32,1.32,10.19,1.32,1.32
Out of the Past,1.19,1.19,1.19,1.19,1.19,1.19,1.19,1.19,1.19,1.19,1.19,1.19,1.19,1.19,1.19,71.2,1.19,8.59,1.19
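
Each row in the sample above is a per-movie topic distribution in percent, so it should sum to roughly 100 (up to rounding), and its largest entry marks the dominant topic. The sketch below rebuilds the five sampled rows from the values shown and picks the dominant topic for each. The helper names `expand` and `dominant_topic` are hypothetical.

```python
N_TOPICS = 19

# (baseline percentage, {topic index: percentage}) copied from the sampled rows above
sampled_rows = {
    "This Is England": (1.09, {15: 80.43}),
    "Romeo and Juliet": (2.01, {10: 63.86}),
    "Annie Hall": (1.53, {10: 72.39}),
    "The Green Mile": (1.32, {9: 12.83, 11: 55.84, 16: 10.19}),
    "Out of the Past": (1.19, {15: 71.2, 17: 8.59}),
}

def expand(baseline, peaks, n_topics=N_TOPICS):
    """Rebuild the full row: the baseline everywhere except the listed topics."""
    return [peaks.get(topic, baseline) for topic in range(n_topics)]

def dominant_topic(distribution):
    """Index of the topic with the largest share."""
    return max(range(len(distribution)), key=distribution.__getitem__)

summary = {}
for title, (baseline, peaks) in sampled_rows.items():
    row = expand(baseline, peaks)
    summary[title] = (dominant_topic(row), round(sum(row), 2))
    print(f"{title:18s} dominant topic: {summary[title][0]:2d}  row sum: {summary[title][1]}")
```

On the full DataFrame, something like `tmdb_prob_data.idxmax(axis=1)` should give the dominant topic for every movie at once.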


In [10]:
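# rebuild a vectorizer with the same settings as tfidf_vectorizer and refit it,
# so that a fitted vectorizer can be handed to pyLDAvis below
param = TfidfVectorizer(**tfidf_vectorizer.get_params())
#print(param.stop_words)
params = param.fit_transform(movies.post_tmdb_clean_plot)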

### The visualization below renders the topics as clusters
##### The 19 topics do not necessarily correspond to 19 genres. A topic can be associated with multiple genres, since the top words from each movie's TMDB plot are selected and then classified into a particular group

## Topic 6 relates to War movies
## Topic 4 relates to Drama/Family
## Topic 1 relates to Drama/Family/Crime/Thriller

In [12]:
#http://pyldavis.readthedocs.io/en/latest/
pyLDAvis.enable_notebook()
py_data = pyLDAvis.sklearn.prepare(lda, tmdb_bow, param,mds='tsne')
pyLDAvis.display(py_data)
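
Topic labels such as "War" or "Drama/Family" come from reading the most relevant terms in the visualization. A quick non-interactive check is to list the highest-weighted terms per topic, sketched here with NumPy on a small hypothetical topic-term matrix. In this notebook the inputs would presumably be `lda.components_` and the fitted vectorizer's vocabulary (for example `get_feature_names_out()` in recent scikit-learn versions). The weights, the terms, and the helper `top_terms` are made up for illustration.

```python
import numpy as np

# Hypothetical vocabulary and topic-term weights (rows = topics, columns = terms).
feature_names = np.array(
    ["army", "battle", "family", "father", "murder", "police", "soldier", "son"]
)
components = np.array([
    [9.0, 7.5, 0.2, 0.3, 0.5, 0.4, 8.1, 0.6],
    [0.3, 0.2, 8.4, 7.9, 0.4, 0.3, 0.2, 6.8],
    [0.4, 0.5, 1.2, 0.6, 9.3, 8.8, 0.3, 0.4],
])

def top_terms(components, feature_names, n_top=3):
    """Return the n_top highest-weighted terms for every topic."""
    # argsort is ascending, so reverse each row and keep the first n_top columns
    order = np.argsort(components, axis=1)[:, ::-1][:, :n_top]
    return [feature_names[row].tolist() for row in order]

topic_terms = top_terms(components, feature_names)
for topic_id, terms in enumerate(topic_terms):
    print(f"Topic {topic_id}: {', '.join(terms)}")
```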


In [13]:
# graphlab (GraphLab Create) is not installed in this environment, so importing it
# raised ModuleNotFoundError; the scikit-learn based pyLDAvis helpers are used instead
import pandas as pd
import pyLDAvis

In [18]:
pyLDAvis.enable_notebook()
py_data = pyLDAvis.sklearn.prepare(lda, tmdb_bow, param,mds='tsne')
py_data

In [17]:
p = pyLDAvis.prepared_data_to_html(py_data)

