# Movie Classification Team 11

# Latent Dirichlet Allocation

### Team Members:
Andrew Lund, Nicholas Morgam, Amay Umradia, Charles Webb

**The purpose of this notebook is to lay the groundwork for future work:**
1. Explore the TMDB plots for 1000 movies using Latent Dirichlet Allocation (LDA). We primarily use vanilla LDA, informed by what we learned from our earlier modelling techniques.
2. Visualize the topic clusters for the unlabelled dataset.
3. As future work, tune the LDA hyperparameters and try more advanced models similar to LDA.

In [2]:
#import libraries and set seaborn styling
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tmdbsimple as tmdb
import requests
import time
import numpy as np
from ast import literal_eval
from collections import Counter
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import models
sns.set_context('talk')
sns.set_style('ticks')



---
# Load the movie data from /data

We use the TMDB plots as our predictor variable throughout this notebook. The idea is to run Latent Dirichlet Allocation on these plots and observe how the movies group into topics based on the words they share.
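
To make the idea concrete before applying it to the real plots, here is a minimal sketch of LDA on a toy corpus. The four short "plots" below are invented for illustration and are not taken from the TMDB data. The sketch uses raw word counts from `CountVectorizer`, which is the input LDA is usually described with; the notebook itself feeds tf-idf weights instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy plots (not from the TMDB dataset)
toy_plots = [
    "mafia crime family boss murder",
    "crime boss murder police detective",
    "love romance wedding couple heart",
    "couple love heart romance family",
]

# Bag-of-words counts: one row per plot, one column per word
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_plots)

# Fit a two-topic model and get each plot's topic mixture
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)

# Each row is a probability distribution over the two topics
print(doc_topics.round(2))
```

Each row of `doc_topics` sums to 1, so a movie is described as a mixture of topics rather than being assigned to a single class.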

In [41]:
movies = pd.read_csv('data/movies.csv')
movies.head(2)

Unnamed: 0,tmdb_id,imdb_id,tmdb_genres,imdb_genres,binary_tmdb,binary_imdb,tmdb_plot,imdb_plot,popularity,release_date,...,imdb_bow_plot,combined_plots,combined_bow_plots,combined_clean_plot,tmdb_w2v_plot_mean,imdb_w2v_plot_mean,combined_w2v_plot_mean,tmdb_w2v_plot_matrix,imdb_w2v_plot_matrix,combined_w2v_plot_matrix
0,278,tt0111161,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Framed in the 1940s for the double murder of h...,Chronicles the experiences of a formerly succe...,28.527767,1994-09-23,...,"(0, 398)\t0.22753905257\n (0, 759)\t0.21510...",Framed in the 1940s for the double murder of h...,"(0, 1092)\t0.15089615016\n (0, 811)\t0.1508...","[framed, 1940s, double, murder, wife, lover, u...","[0.0141657, 0.0357291, 0.0355669, 0.0669593, 0...","[0.00466357, 0.0901859, -0.0124761, 0.0549854,...","[0.00908005, 0.064875, 0.00985374, 0.0605507, ...","[[-0.0830078, 0.253906, 0.0712891, 0.0151978, ...","[[0.0201416, 0.114746, -0.357422, -0.228516, 0...","[[-0.0830078, 0.253906, 0.0712891, 0.0151978, ..."
1,238,tt0068646,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Spanning the years 1945 to 1955 a chronicle o...,When the aging head of a famous crime family d...,36.965452,1972-03-14,...,"(0, 515)\t0.172597155095\n (0, 938)\t0.2252...",Spanning the years 1945 to 1955 a chronicle o...,"(0, 1773)\t0.104854849055\n (0, 287)\t0.089...","[spanning, years, 1945, 1955, chronicle, ficti...","[-0.0168208, 0.0596698, -0.00681898, 0.0429789...","[-0.0133263, 0.0813482, 0.0357648, 0.0675641, ...","[-0.0148731, 0.0717528, 0.0169162, 0.0566821, ...","[[0.0517578, 0.0250244, -0.122559, 0.196289, 0...","[[-0.074707, 0.498047, -0.0737305, 0.0727539, ...","[[0.0517578, 0.0250244, -0.122559, 0.196289, 0..."


In [609]:
type(movies.tmdb_clean_plot[0])

list

---
**LDA function: takes the movies DataFrame, the number of topics, the maximum vocabulary size, the longest n-gram length, and the stop words**

In [606]:
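def lda_process(movies,n_comp,feat,n_gram,stop_words):
    """Fit LDA on the cleaned TMDB plots.

    movies: DataFrame with 'tmdb_clean_plot' (token lists) and 'title' columns
    n_comp: number of topics; feat: max vocabulary size; n_gram: longest n-gram
    stop_words: list of words for the vectorizer to ignore
    Returns the per-movie topic weights (in percent), the fitted LDA model,
    the document-term matrix and a fitted vectorizer (for pyLDAvis).
    """
    # Join each token list back into one string, the input TfidfVectorizer expects
    movies['post_tmdb_clean_plot'] = movies['tmdb_clean_plot'].apply(post_process)
    lda = LatentDirichletAllocation(n_components=n_comp, max_iter=50,learning_method='online',learning_offset=100.)
    
    # Pass the stop_words argument through to the vectorizer
    tf = TfidfVectorizer(max_features=feat,max_df=0.9,min_df=0.02,stop_words=stop_words,ngram_range=(1,n_gram),lowercase=True)
    # Debug output: check the input types and the raw token lists
    print(type(movies.tmdb_clean_plot[0]))
    print(type(movies.post_tmdb_clean_plot))
    print(movies.tmdb_clean_plot)
    vectorized = tf.fit_transform(movies.post_tmdb_clean_plot)

    lda.fit(vectorized)
    print(vectorized)
    def print_top_words(model, feature_names, n_top_words=1000):
        # Print the highest-weighted words of every topic, strongest first
        for topic_idx, topic in enumerate(model.components_):
            print("Topic #%d:" % topic_idx)
            print(" ".join([feature_names[i]
                            for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()

    # To inspect the topics, uncomment the lines below
    # (get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names())
    #print("Topics in LDA model:")
    #tf_feature_names = tf.get_feature_names_out()
    #print_top_words(lda, tf_feature_names)
    
    # Per-movie topic distribution, scaled to percentages
    v = lda.transform(vectorized)
    v = v * 100
    #print (v)
    idx = pd.Index(movies.title)
    df = pd.DataFrame(v, index=idx)#, columns=target_genres)
    
    # Refit a vectorizer with the same parameters so pyLDAvis can read the vocabulary
    tfidf_vectorizer = TfidfVectorizer(**tf.get_params())
    dtm_tfidf = tfidf_vectorizer.fit_transform(movies.post_tmdb_clean_plot)
    print(dtm_tfidf.shape)
    
    return df,lda,vectorized,tfidf_vectorizer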
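
The function above returns the fitted model but leaves its topic printout commented out. As a rough sketch of how the top words of each topic could be pulled out afterwards, the hypothetical helper `top_words_per_topic` below (not part of the original notebook) reads the `components_` matrix of a fitted model. To stay self-contained it is demonstrated on a small stand-in object instead of a real LDA fit.

```python
import numpy as np
from types import SimpleNamespace

def top_words_per_topic(model, feature_names, n_top_words=10):
    """Return, for each topic, its n_top_words highest-weighted words."""
    topics = []
    for weights in model.components_:
        # Indices of the largest weights, strongest first
        top_idx = np.argsort(weights)[::-1][:n_top_words]
        topics.append([feature_names[i] for i in top_idx])
    return topics

# Stand-in for a fitted model: two topics over a four-word vocabulary (made-up weights)
fake_model = SimpleNamespace(
    components_=np.array([[5.0, 1.0, 0.5, 3.0],
                          [0.2, 4.0, 6.0, 0.1]])
)
vocab = ["crime", "love", "wedding", "murder"]

print(top_words_per_topic(fake_model, vocab, n_top_words=2))
# → [['crime', 'murder'], ['wedding', 'love']]
```

With the objects returned by `lda_process`, the call would presumably look like `top_words_per_topic(lda, tfidf_vectorizer.get_feature_names_out())`, assuming a scikit-learn release that provides `get_feature_names_out()`.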

In [607]:
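from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords as nltk_stopwords
from nltk.tag import pos_tag
from sklearn.decomposition import LatentDirichletAllocation

import matplotlib.pyplot as plt
import seaborn as sns

import pyLDAvis
# Newer pyLDAvis releases expose this module as pyLDAvis.lda_model
import pyLDAvis.sklearn

# English stop-word list (requires the NLTK 'stopwords' corpus to be downloaded)
stopwords = nltk_stopwords.words('english')

def post_process(list1):
    # Join a list of tokens back into a single space-separated string
    str1 = ' '.join(list1)
    return str1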

df,lda,vectorized,tfidf_vectorizer = lda_process(movies,19,8000,4,stopwords)
cm = sns.light_palette("lightblue", as_cmap=True)
s = df.sample(n=5,random_state=5).style.background_gradient(cmap=cm)
s




<class 'list'>
<class 'pandas.core.series.Series'>
0      [framed, 1940s, double, murder, wife, lover, u...
1      [spanning, years, 1945, 1955, chronicle, ficti...
2      [true, story, businessman, oskar, schindler, s...
3      [continuing, saga, corleone, crime, family, yo...
4      [standalone, version, series, pilot, alternate...
5      [direction, ruthless, instructor, talented, yo...
6      [creator, popular, video, game, system, dies, ...
7      [burger, loving, hit, man, philosophical, part...
8      [orbiting, quiet, backwater, planet, massed, f...
9      [ticking, time, bomb, insomniac, slippery, soa...
10     [former, prohibition, era, jewish, gangster, r...
11     [supernatural, tale, set, death, row, southern...
12     [larcenous, real, estate, clerk, marion, crane...
13     [serving, time, insanity, state, mental, hospi...
14     [man, low, iq, accomplished, great, things, li...
15     [batman, raises, stakes, war, crime, help, lt,...
16     [defense, prosecution, rested,

title,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
This Is England,1.08726,1.08726,1.08726,1.08726,1.08726,1.08726,1.08726,1.08726,1.08726,1.08726,7.04029,1.08726,1.08726,1.08726,1.08726,1.08726,74.4763,1.08726,1.08726
Romeo and Juliet,2.00778,2.00778,2.00778,2.00778,2.00778,63.8599,2.00778,2.00778,2.00778,2.00778,2.00778,2.00778,2.00778,2.00778,2.00778,2.00778,2.00778,2.00778,2.00778
Annie Hall,1.53409,1.53409,72.3864,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409,1.53409
The Green Mile,1.32129,1.32129,1.32129,17.7101,1.32129,33.8584,1.32129,1.32129,1.32129,1.32129,1.32129,27.2909,1.32129,1.32129,1.32129,1.32129,1.32129,1.32129,1.32129
Out of the Past,1.18893,1.18893,1.18893,1.18893,1.18893,78.5993,1.18893,1.18893,1.18893,1.18893,1.18893,1.18893,1.18893,1.18893,1.18893,1.18893,1.18893,1.18893,1.18893


In [597]:
df.head(5).style.background_gradient(cmap=cm)

title,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
The Shawshank Redemption,1.39027,1.39027,1.39027,1.39027,1.39027,1.39027,1.39027,1.39027,1.39027,66.8032,1.39027,1.39027,1.39027,1.39027,1.39027,1.39027,1.39027,9.56217,1.39027
The Godfather,1.52566,1.52566,1.52566,1.52566,1.52566,1.52566,1.52566,1.52566,1.52566,1.52566,1.52566,1.52566,1.52566,1.52566,1.52566,22.522,1.52566,1.52566,51.5418
Schindler's List,1.27478,1.27478,1.27478,1.27478,1.27478,49.6832,1.27478,1.27478,1.27478,1.27478,1.27478,1.27478,1.27478,1.27478,1.27478,1.27478,1.27478,1.27478,28.6455
The Godfather: Part II,1.47206,1.47206,1.47206,1.47206,1.47206,1.47206,1.47206,1.47206,1.47206,49.4025,14.2924,1.47206,1.47206,1.47206,1.47206,12.752,1.47206,1.47206,1.47206
Twin Peaks,2.63158,2.63158,2.63158,2.63158,2.63158,2.63158,2.63158,2.63158,2.63158,2.63158,2.63158,2.63158,2.63158,52.6316,2.63158,2.63158,2.63158,2.63158,2.63158
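
Each row above is a movie's topic mixture in percent, so one simple way to read the table is to take the topic with the largest weight per movie. The sketch below does this with `idxmax` on a small DataFrame holding four of the topic columns for the first two movies; the values are copied from the table above, and the name `summary` is only illustrative.

```python
import pandas as pd

# Four topic columns (9, 15, 17, 18) for two movies, values taken from the table above
sample = pd.DataFrame(
    {
        9: [66.8032, 1.52566],
        15: [1.39027, 22.522],
        17: [9.56217, 1.52566],
        18: [1.39027, 51.5418],
    },
    index=pd.Index(["The Shawshank Redemption", "The Godfather"], name="title"),
)

# Column label of the largest weight in each row, plus the weight itself
summary = pd.DataFrame({
    "dominant_topic": sample.idxmax(axis=1),
    "weight": sample.max(axis=1),
})
print(summary)
```

Applied to the full output of `lda_process`, the same two calls on `df` would give one dominant topic per movie, which could then be compared against the known genres.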


In [590]:
from IPython.display import Image
Image(filename='data/pic1.JPG') 

<IPython.core.display.Image object>

In [535]:
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda, vectorized, tfidf_vectorizer)

In [591]:
from IPython.display import Image
Image(filename='data/pic2.JPG') 

<IPython.core.display.Image object>

In [585]:
df,lda,vectorized,tfidf_vectorizer = lda_process(movies,8,8000000,20,stopwords)
cm = sns.light_palette("lightblue", as_cmap=True)
s = df.sample(n=5,random_state=5).style.background_gradient(cmap=cm)
s

(1000, 155)


title,0,1,2,3,4,5,6,7
This Is England,2.58356,10.391,2.59923,2.59416,2.59135,2.58981,2.5914,74.0594
Romeo and Juliet,4.76851,4.78304,4.76879,4.77602,4.77695,66.5736,4.77581,4.77726
Annie Hall,3.64346,3.64556,3.64442,3.64521,3.64602,3.65508,3.64462,74.4756
The Green Mile,3.13926,3.15059,3.14518,3.14159,16.7245,14.697,17.759,38.2429
Out of the Past,2.87275,2.84914,2.8361,80.1038,2.83639,2.83279,2.83426,2.83481


In [586]:
pyLDAvis.enable_notebook()
py_data = pyLDAvis.sklearn.prepare(lda, vectorized, tfidf_vectorizer)
pyLDAvis.display(py_data)
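
The two runs above use 19 and 8 topics, both chosen by hand. As one possible starting point for the hyperparameter tuning listed as future work, the sketch below compares a few candidate topic counts by held-out perplexity (lower is better). The corpus is a hypothetical toy example, the candidate values are arbitrary, and perplexity is only one of several criteria that could be used.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the joined plot strings
docs = [
    "crime boss murder police", "police detective murder crime",
    "love romance wedding couple", "couple love heart romance",
    "space alien planet ship", "planet ship alien war",
    "crime family boss war", "love family wedding heart",
]

# Build the vocabulary on all documents, then hold out a quarter for scoring
X = CountVectorizer().fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

candidate_topics = [2, 3, 4]
perplexity = {}
for k in candidate_topics:
    model = LatentDirichletAllocation(n_components=k, learning_method='online',
                                      random_state=0)
    model.fit(X_train)
    # Perplexity of the held-out documents under the fitted model
    perplexity[k] = model.perplexity(X_test)

best_k = min(perplexity, key=perplexity.get)
print(perplexity, best_k)
```

On the real data the same loop could run over the document-term matrix built inside `lda_process`, and could be extended to other parameters such as `learning_offset` or `max_iter`.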