##### Austin Hancock

## MSDS 7337 - Section 401
## Homework - #8
[Data Science @ Southern Methodist University](https://datascience.smu.edu/)

## Table of Contents
* [Description](#Description)
* [Tools](#Tools)
* [Hypothesis](#Hypothesis)
* [Data Extraction](#Data-Extraction)
* [Results](#Results)
    * [NMF Model](#NMF-Model)
    * [LDA Model](#LDA-Model)
* [Conclusion](#Conclusion)

## <a name="Description"></a>Description
For the Final Project I will be addressing the following:

    - Using the IMDB movie reviews from homeworks 5 and 7 as the dataset, perform topic modeling analysis 
    working with NMF and/or LDA.

## <a name="Tools"></a>Tools

In [10]:
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import nltk; print("NLTK", nltk.__version__)
import bs4; print("bs4", bs4.__version__)
from bs4 import BeautifulSoup
import urllib
from urllib import request
import re; print("re", re.__version__)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd; print("Pandas", pd.__version__)
import sklearn; print("sklearn", sklearn.__version__)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
from sklearn.cluster import KMeans
from collections import Counter
from sklearn import svm
import numpy as np; print("np", np.__version__)
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
from sklearn.decomposition import TruncatedSVD
from time import time
from sklearn.feature_extraction import text

Windows-10-10.0.17134-SP0
Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
NLTK 3.2.4
bs4 4.6.0
re 2.2.1
Pandas 0.20.3
sklearn 0.19.1
np 1.14.2



The twython library has not been installed. Some functionality from the twitter package will not be available.



## <a name="Hypothesis"></a>Hypothesis

To begin, I need to decide which topic modeling approach is best for the dataset I am working with: Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF). While LDA allows topics to share keywords and can separate word tenses (both of which could be useful in topic modeling), NMF forces a sparse number of topics and is better suited for analyzing movie reviews, since these tend to be very short. My hypothesis is therefore that NMF will be the better fit.

In addition to which topic modeling approach will be best, I am also going to predict that the number of topics which produces an intuitive separation is 7. This prediction comes from homework 7, in which I found that, using the K-Means clustering method, having 7 clusters produced feature names that made it intuitive to understand what each cluster's underlying reviews were about.

## <a name="Data-Extraction"></a>Data Extraction

To begin, I first need to bring in the reviews from homeworks 5 and 7.

In [12]:
# Links to reviews of movies
moana_url = "https://www.imdb.com/title/tt3521164/reviews?ref_=tt_ql_3"
frozen_url = 'https://www.imdb.com/title/tt2294629/reviews?ref_=tt_ql_3'
coco_url = 'https://www.imdb.com/title/tt2380307/reviews?ref_=tt_ql_3'
LionKing_url = 'https://www.imdb.com/title/tt0110357/reviews?ref_=tt_ql_3'

def find_permalinks(url):
    page = request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    review_containers = soup.find_all('div', class_ = 'review-container')
    for review in review_containers:
        permalinks.append("https://www.imdb.com" + review.find('a', attrs={'href': re.compile("/review")}).get('href'))
        
# Permalinks to reviews
permalinks = []

find_permalinks(moana_url)
find_permalinks(frozen_url)
find_permalinks(coco_url)
find_permalinks(LionKing_url)
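
The concatenation inside `find_permalinks` implies that the review hrefs are site-relative. As a small sketch of that one step in isolation, `urllib.parse.urljoin` can build the absolute URLs. The helper name `absolute_review_links` and the `hrefs` values below are hypothetical placeholders, not real review IDs.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"
REVIEW_HREF = re.compile("/review")  # same pattern used in find_permalinks

def absolute_review_links(hrefs, base_url=BASE_URL):
    """Keep hrefs that look like review links and make them absolute."""
    return [urljoin(base_url, h) for h in hrefs if REVIEW_HREF.search(h)]

# Hypothetical hrefs, for illustration only
hrefs = ["/review/rw0000001/", "/title/tt0000000/", "/review/rw0000002/"]
links = absolute_review_links(hrefs)
print(links)
```

Returning a list, rather than appending to a module-level `permalinks`, also makes the step easier to test on its own.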

In [13]:
# Grab the main review text from each link and add it to the list
reviews = []
ratings = []  # declared for completeness; no ratings are collected below

for link in permalinks:
    page = request.urlopen(link)
    soup = BeautifulSoup(page, 'html.parser')
    review_containers = soup.find_all('div', class_ = 'review-container')
    for review in review_containers:
        reviews.append(review.find(class_ = 'text show-more__control').text)

In [14]:
# Confirm extraction of the reviews
print(reviews[0])

Moana is a return to the classic Disney formula, the clichés and characters ripped from a number of other animated films. However, the pure beauty and skill of the production rises the old story into new heights.Following from the success of Zootropolis, Moana follows a more traditional narrative we know and love; the princess who wishes for something more and is whisked on a supernatural adventure. We know this story so well yet Moana seems fresh and thrilling as if the plot was innovative. Perhaps this is due to the Polynesian setting or the morally ambiguous Maui, played perfectly by Dwayne Johnson, but most likely it is it the simple magic of Disney – the wonder for both children and adults has reached its peak with the perfection of the classic formula. For once, the clichés make the film more enjoyable. The quality of the animation helps too: it's clear they have reached the pinnacle of blending realistic textures with stylised designs, creating an aesthetic beauty that few other

## <a name="Results"></a>Results

#### <a name="NMF-Model"></a>NMF Model

I begin my first approach by creating a count vectorizer and building the NMF model.
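
To make the factorization itself concrete, here is a minimal sketch of NMF using multiplicative updates in plain NumPy. It is an illustration under simplifying assumptions, not scikit-learn's implementation (whose default solver in this version is, as far as the documentation indicates, coordinate descent). The function name `nmf_multiplicative`, the toy matrix, and all parameter values are made up for the example.

```python
import numpy as np

def nmf_multiplicative(X, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (docs x terms) as W @ H with non-negative W and H."""
    rng = np.random.RandomState(seed)
    n_docs, n_terms = X.shape
    W = rng.rand(n_docs, n_topics) + eps   # document-topic weights
    H = rng.rand(n_topics, n_terms) + eps  # topic-term weights
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy count matrix: two documents use terms 0-1, two use terms 2-3
X = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 4, 1],
              [0, 0, 1, 4]], dtype=float)

W, H = nmf_multiplicative(X, n_topics=2)
error = np.linalg.norm(X - W @ H)
print("reconstruction error:", round(error, 3))
```

In the notebook, `nmf_Z` plays the role of `W` (one row of topic weights per review) and `nmf_model.components_` plays the role of `H`.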

In [15]:
# Vectorize the reviews using CountVectorizer
count_vectorizer = CountVectorizer(min_df=5, 
                                   max_df=0.9,
                                   stop_words='english',
                                   lowercase=True,
                                   token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')  # raw string avoids invalid-escape warnings
reviews_vectorized = count_vectorizer.fit_transform(reviews)

# Build NMF Model
num_topics = 7
nmf_model = NMF(n_components=num_topics)
nmf_Z = nmf_model.fit_transform(reviews_vectorized)

print('Model shape: ',nmf_Z.shape)
print('First review: ',nmf_Z[0])

Model shape:  (100, 7)
First review:  [0.83179423 0.05038102 0.29532762 0.55819483 0.         0.
 0.        ]


Above, you can see that the first review is weighted across several topics: it has non-zero weights on four of the seven. Let's try to get some more information on which words make up these topics.
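
As a quick sketch of how to read such a row, the weights can be normalised into shares and the dominant topic picked with `argmax`. The numbers are copied from the output above; the variable names are only illustrative.

```python
import numpy as np

# Topic weights for the first review, copied from the output above
first_review = np.array([0.83179423, 0.05038102, 0.29532762,
                         0.55819483, 0.0, 0.0, 0.0])

shares = first_review / first_review.sum()         # relative weight per topic
dominant_topic = int(np.argmax(shares))            # index of the largest weight
active_topics = np.flatnonzero(first_review).tolist()  # topics with non-zero weight

print("dominant topic:", dominant_topic)
print("active topics:", active_topics)
print("shares:", shares.round(3))
```

Applying the same `argmax` to every row of `nmf_Z` would give one dominant topic per review.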

In [16]:
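def print_topics(model, vectorizer, top_n=10):
    # Look up the vocabulary once instead of once per word
    feature_names = vectorizer.get_feature_names()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so walk backwards to take the top_n largest weights
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])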

print("NMF Model:")
print_topics(nmf_model, count_vectorizer)
print("=" * 20)

NMF Model:
Topic 0:
[('disney', 5.350093693114414), ('movie', 2.118641608750007), ('movies', 1.3661164345433725), ('frozen', 1.1861026335242362), ('characters', 0.8554951766628711), ('animation', 0.835582590084551), ('people', 0.8020675726594527), ('songs', 0.7455918375414003), ('films', 0.7102453879579702), ('hans', 0.687288660106188)]
Topic 1:
[('elsa', 4.28303229513917), ('anna', 2.2103288538113848), ('film', 1.7901741719587492), ('love', 1.6105314943504727), ('story', 1.5835112226825865), ('beast', 1.3238383512034901), ('character', 1.0011744951701387), ('frozen', 0.9749414935418004), ('better', 0.5743931033269009), ('interesting', 0.5630382062608369)]
Topic 2:
[('king', 2.585949664269403), ('simba', 2.188990680204031), ('film', 2.001058414130342), ('lion', 1.9733890095787952), ('scar', 1.2278433143369314), ('animated', 0.9194031827589225), ('mufasa', 0.8530724149819668), ('best', 0.8512950142407938), ('story', 0.8467764476041851), ('animation', 0.7400838796365373)]
Topic 3:
[('moa

Great! Now we can see which words are in each of our topics. But the above output doesn't do much to tell me how the reviews differ from each other in an intuitive manner. Let's see if we can create some visualizations to give us an overall understanding of how these reviews differ from one another.

In [17]:
# initialize BokehJS
output_notebook()

In [44]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(reviews_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(reviews))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

> The above plot allows you to quickly see which documents are more similar to one another. Those that are far away from the pack tend to be unique. These unique reviews - 33, 36, 25, 95, etc. - tend to be much longer and use a wider vocabulary than the other reviews. Below, I display review 33 as an example of this.

In [19]:
print(reviews[33])



In [20]:
# Separation of words
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(reviews_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], count_vectorizer.get_feature_names()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

> The above plot gives a quick view of the words within our reviews. Since the TruncatedSVD model is fit on raw counts from the count vectorizer, words that occur often end up far from the origin, so this plot highlights the most frequent words. From the plot, we can see that the words 'movie', 'disney', 'elsa', 'anna', and 'love' occur quite often. Plots like these can be very useful for getting a quick glimpse at what your documents mention the most.
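
The link between word frequency and distance from the origin can be sketched with a plain NumPy SVD on a toy word-by-document count matrix. This is an assumption-laden illustration: the counts are made up, and `np.linalg.svd` stands in for `TruncatedSVD`, which should give the same coordinates up to sign and numerical approximation.

```python
import numpy as np

words = ["movie", "disney", "elsa", "ocean"]
# Rows are words, columns are three toy documents (made-up counts)
word_doc_counts = np.array([
    [5, 4, 6],
    [3, 1, 2],
    [0, 2, 0],
    [1, 0, 0],
], dtype=float)

# Keep the two strongest components, as TruncatedSVD(n_components=2) would
U, S, Vt = np.linalg.svd(word_doc_counts, full_matrices=False)
words_2d = U[:, :2] * S[:2]

distance = np.linalg.norm(words_2d, axis=1)         # distance from the origin in the plot
full_norm = np.linalg.norm(word_doc_counts, axis=1)  # length of the raw count vector

for w, d, total in zip(words, distance, word_doc_counts.sum(axis=1)):
    print(f"{w:7s} total count={total:.0f}  distance from origin={d:.2f}")
```

Because the projection can only shrink a word's count vector, a word needs large counts to land far from the origin.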

Now that we have a general understanding of how the NMF model works, let's try to fine-tune it a bit by creating different model variations.

First, I noticed from the initial topic modeling that some words appear in many of the documents, such as "movie" and "film". I am going to add these to the stop words.

In [21]:
my_stop_words = text.ENGLISH_STOP_WORDS.union(["movie", "film"])

# I would have also removed 'just', but since this can be in reference to morality when used as an adjective, 
# I will leave it for now. In the future, it would be beneficial to apply a POS-tagger and check for instances such as this.
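
One thing to watch: `union` returns a frozenset, and some newer scikit-learn releases appear to validate `stop_words` more strictly and expect a list. That is an assumption about versions other than the 0.19.1 used here, but converting to a list is a cheap safeguard. In the sketch below, `BASE_STOP_WORDS` is a small hypothetical stand-in for `text.ENGLISH_STOP_WORDS`, so the snippet runs without scikit-learn.

```python
# Hypothetical stand-in for sklearn's text.ENGLISH_STOP_WORDS (a frozenset)
BASE_STOP_WORDS = frozenset(["the", "and", "a", "of"])

extra_stop_words = ["movie", "film"]
my_stop_words_set = BASE_STOP_WORDS.union(extra_stop_words)  # still a frozenset

# A sorted list is deterministic and list-typed
my_stop_words_list = sorted(my_stop_words_set)
print(my_stop_words_list)
```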

Now let's try changing from a standard count vectorizer to the tf-idf vectorizer.
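
To show what the switch changes, here is a small NumPy sketch of tf-idf weighting. It assumes `TfidfVectorizer`'s default settings (smoothed idf, L2 row normalisation, no sublinear tf); the counts and term names are made up.

```python
import numpy as np

# Toy term counts: 3 documents x 3 terms (made-up numbers)
terms = ["movie", "elsa", "simba"]
counts = np.array([
    [3, 2, 0],
    [2, 0, 0],
    [4, 0, 3],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed inverse document frequency
tfidf = counts * idf                               # re-weight the raw counts
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalise each document

print(dict(zip(terms, idf.round(3))))
print(tfidf.round(3))
```

In the first toy document, 'movie' has the higher raw count, but 'elsa' ends up with the larger tf-idf weight because it appears in fewer documents.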

In [22]:
n_features = 1000
n_top_words = 10
n_components= 7

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
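
The slice `topic.argsort()[:-n_top_words - 1:-1]` used in `print_top_words` is compact but easy to misread. A tiny sketch with a hypothetical five-word vocabulary and made-up weights shows what it selects.

```python
import numpy as np

feature_names = ["anna", "disney", "elsa", "king", "simba"]  # hypothetical vocabulary
topic = np.array([0.2, 0.9, 0.1, 0.7, 0.4])                   # hypothetical topic weights
n_top_words = 3

# argsort sorts ascending; the negative-step slice walks back from the end
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]

# Equivalent, more explicit form: reverse first, then take the first n
same_idx = topic.argsort()[::-1][:n_top_words]

print(top_words)
```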

In [23]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words=my_stop_words)
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(reviews)
print("done in %0.3fs." % (time() - t0))

Extracting tf-idf features for NMF...
done in 0.049s.


Using this new vectorizer, let's compare the Frobenius norm and the Kullback-Leibler divergence as objective functions. Note the change in how the NMF model parameters are set up.
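
For reference, the two objectives can be written out directly. The sketch below is a plain NumPy rendering of the squared Frobenius loss and the generalized Kullback-Leibler divergence between `X` and `W @ H`. It leaves out the regularisation terms controlled by `alpha` and `l1_ratio`, and the function names and toy matrices are hypothetical.

```python
import numpy as np

def frobenius_loss(X, W, H):
    """Half the squared Frobenius norm of the reconstruction error."""
    residual = X - W @ H
    return 0.5 * np.sum(residual ** 2)

def generalized_kl_loss(X, W, H, eps=1e-12):
    """Generalized KL divergence; terms with X == 0 contribute no log part."""
    WH = W @ H
    mask = X > 0
    log_term = np.sum(X[mask] * np.log(X[mask] / (WH[mask] + eps)))
    return log_term - X.sum() + WH.sum()

W = np.array([[1.0, 0.0], [0.0, 2.0]])
H = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])

X_exact = W @ H              # perfectly reconstructed
X_off = X_exact.copy()
X_off[0, 0] += 1.0           # one entry is off by 1

for name, X in [("exact", X_exact), ("off by one", X_off)]:
    print(name, frobenius_loss(X, W, H), generalized_kl_loss(X, W, H))
```

Both losses are zero for a perfect reconstruction, but they penalise the same mismatch differently, which is one reason the two fits can yield different topics.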

In [24]:
# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (len(reviews), n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print()
print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=100 and n_features=1000...
done in 0.051s.


Topics in NMF model (Frobenius norm):
Topic #0: disney like just great characters movies animation best story animated
Topic #1: elsa anna love hans kristoff princess ice queen prince frozen
Topic #2: coco miguel dead family land death musician pixar tradition grandmother
Topic #3: moana maui zootopia character ocean plot fun need heart good
Topic #4: simba scar king mufasa lion father best cub nala stampede
Topic #5: mexican pixar born mexico culture make cried saw just family
Topic #6: accurate playing favorite great mermaid following fond food force forced



In [25]:
# Fit the NMF model
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (len(reviews), n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print()
print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=100 and n_features=1000...
done in 0.501s.


Topics in NMF model (generalized Kullback-Leibler divergence):
Topic #0: story characters animation disney best music songs years animated life
Topic #1: queen snow frozen anna elsa second bad princess powers plot
Topic #2: great pixar time story good visuals work coco miguel ll
Topic #3: moana little characters maui 10 disney fun far new good
Topic #4: important beautiful young reason friend instead watching uncle end did
Topic #5: mexican movies pixar just disney world make takes mind saw
Topic #6: great like say people story watched seen think just want



When comparing the two objective functions, it appears that the Frobenius norm produces much more specific feature names for each topic. From the Frobenius norm implementation, I can identify the topics as:
    
    - 0: People who like Disney movies/characters in general
    - 1: People who are describing the movie Frozen and its plot/characters
    - 2: The movie Coco and its plot
    - 3: People who like the movie Moana and find it to be fun
    - 4: People who are describing the movie The Lion King and its plot/characters
    - 5: People describing how the movie Coco incorporates Mexican culture
    - 6: No inference can be made from the features in this topic

With the Kullback-Leibler divergence implementation, however, the features do not shed much light on what the underlying reviews are about:

    - 0: Reviews about Disney movies and their music
    - 1: Reviews about the movie Frozen
    - 2: People who like the animation and story of the movie Coco
    - 3: People who like the movie Moana
    - 4: No inference can be made from the features in this topic
    - 5: No inference can be made from the features in this topic
    - 6: No inference can be made from the features in this topic
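
One rough way to put a number on "more specific" is to count how many top words are shared by more than one topic. The sketch below does this for the two sets of top-word lists printed above. It is only a heuristic, and the helper name `shared_words` is made up for this example.

```python
from collections import Counter

# Top words copied from the Frobenius norm output above
frobenius_topics = [
    "disney like just great characters movies animation best story animated",
    "elsa anna love hans kristoff princess ice queen prince frozen",
    "coco miguel dead family land death musician pixar tradition grandmother",
    "moana maui zootopia character ocean plot fun need heart good",
    "simba scar king mufasa lion father best cub nala stampede",
    "mexican pixar born mexico culture make cried saw just family",
    "accurate playing favorite great mermaid following fond food force forced",
]

# Top words copied from the Kullback-Leibler output above
kl_topics = [
    "story characters animation disney best music songs years animated life",
    "queen snow frozen anna elsa second bad princess powers plot",
    "great pixar time story good visuals work coco miguel ll",
    "moana little characters maui 10 disney fun far new good",
    "important beautiful young reason friend instead watching uncle end did",
    "mexican movies pixar just disney world make takes mind saw",
    "great like say people story watched seen think just want",
]

def shared_words(topics):
    """Return the words that appear in the top-word list of more than one topic."""
    counts = Counter(word for topic in topics for word in set(topic.split()))
    return sorted(word for word, n in counts.items() if n > 1)

print("Frobenius:", shared_words(frobenius_topics))
print("KL:       ", shared_words(kl_topics))
```

Fewer shared words means the topics overlap less in vocabulary, which is consistent with the Frobenius norm topics being easier to tell apart.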

With our topics in hand, let's see the SVD visualizations we used in the initial model, but this time using our new word list and NMF model.

In [26]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(tfidf)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(reviews))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

In [27]:
# Separation of words
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(tfidf.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], tfidf_vectorizer.get_feature_names()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=600, plot_height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

> From the above plots, we can see that our documents and words are a bit more spread out than in the original NMF model. This is due, in part, to adding two very frequent words ('movie', 'film') to the stop words list. Since these two words appeared in many of the documents, they made the documents look more similar to each other, and removing them lowered the similarity scores. Although this makes the documents less similar, I felt it necessary to remove these words: all of the reviews are about movies, so I did not want them to show up as keywords.

Now that we have found the best-tuned NMF model, I want to see if changing the number of topics will give us better insights into the reviews. To do this, I will return the top features for a model of 4 topics and a model of 10 topics.
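
The effect of dropping shared high-frequency words on similarity can be sketched with two made-up reviews. For simplicity the example uses raw counts and cosine similarity rather than the tf-idf features used in the notebook; the counts and helper names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def drop(counts, stop_words):
    """Remove stop words from a term-count dict."""
    return {w: c for w, c in counts.items() if w not in stop_words}

# Made-up term counts for two reviews of different films
review_a = {"movie": 3, "film": 1, "elsa": 2}
review_b = {"movie": 3, "film": 1, "simba": 2}

extra_stop_words = {"movie", "film"}
before = cosine(review_a, review_b)
after = cosine(drop(review_a, extra_stop_words), drop(review_b, extra_stop_words))

print("similarity with 'movie'/'film':   ", round(before, 3))
print("similarity without 'movie'/'film':", round(after, 3))
```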

In [28]:
n_components= 4

# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (len(reviews), n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print()
print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=100 and n_features=1000...
done in 0.154s.


Topics in NMF model (Frobenius norm):
Topic #0: simba king lion disney best animated animation great scar characters
Topic #1: elsa anna love hans frozen princess disney kristoff ice queen
Topic #2: pixar mexican coco family miguel mexico dead culture make death
Topic #3: moana maui disney good character zootopia plot just story great



In [29]:
n_components= 10

# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (len(reviews), n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print()
print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=100 and n_features=1000...
done in 0.116s.


Topics in NMF model (Frobenius norm):
Topic #0: disney just like characters movies animation best story great animated
Topic #1: elsa anna love hans kristoff princess ice queen prince frozen
Topic #2: coco miguel dead family land death musician pixar tradition grandmother
Topic #3: moana maui zootopia character ocean plot fun heart need ending
Topic #4: simba scar king mufasa lion best father cub nala stampede
Topic #5: mexican pixar born mexico culture make cried saw just family
Topic #6: accurate playing favorite great mermaid following fond food force forced
Topic #7: boring new positive don frankly reviews level play certainly word
Topic #8: daughter dia los frozen muertos walking background happens depth planet
Topic #9: great responsibility lessons learned terrific inspired forget younger childhood cast



> Much like the homework 7 results, separating the reviews into 4 topics produced a topic model that split the reviews by movie. While this may have its uses, I still hold that a higher number of topics provides more insight into what the reviews are about beyond which movie they belong to. In the model with 10 topics, the feature words' ability to describe the reviews within a topic began to deteriorate at topic #6.

From the NMF analysis above, I have found that the best NMF model is the Frobenius norm implementation with 6 topics (topics #0 through #5, the ones that remained interpretable). Next, I will perform topic analysis using an LDA model and then compare the results.

#### <a name="LDA-Model"></a>LDA Model

The optimal range for the number of topics in this type of model is 30-50. Since I only have 100 reviews, I will go with a lower number and use 6.

In [47]:
n_components = 6
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words=my_stop_words)
t0 = time()
tf = tf_vectorizer.fit_transform(reviews)
print("done in %0.3fs." % (time() - t0))
print()

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (len(reviews), n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Extracting tf features for LDA...
done in 0.036s.

Fitting LDA models with tf features, n_samples=100 and n_features=1000...
done in 0.283s.

Topics in LDA model:
Topic #0: simba king lion disney animated scar story animation mufasa father
Topic #1: disney family just moana like little love anna going end
Topic #2: king lion simba scar story mufasa great disney irons really
Topic #3: pixar just make coco inside life say characters time love
Topic #4: elsa disney story like love great just anna characters best
Topic #5: disney just story like love elsa great anna characters movies



> Due to the sparse number of reviews (100), the LDA model did not produce features that would allow you to identify what the reviews within each topic are about. Another hindrance with this model is the sharing of keywords. Since many of the reviews compare the movie they are about to other movies, the topics tend to blend movies together in terms of features.
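
The keyword sharing can be made visible with a Jaccard similarity between the top-word lists printed above. This is a rough sketch; the `jaccard` helper is written here for illustration and is not part of scikit-learn.

```python
# Top words copied from the LDA output above
lda_topics = [
    "simba king lion disney animated scar story animation mufasa father",
    "disney family just moana like little love anna going end",
    "king lion simba scar story mufasa great disney irons really",
    "pixar just make coco inside life say characters time love",
    "elsa disney story like love great just anna characters best",
    "disney just story like love elsa great anna characters movies",
]

def jaccard(topic_a, topic_b):
    """Share of words two topics have in common (0 = disjoint, 1 = identical)."""
    a, b = set(topic_a.split()), set(topic_b.split())
    return len(a & b) / len(a | b)

# Compare every pair of topics and report the most similar ones first
pairs = [(jaccard(lda_topics[i], lda_topics[j]), i, j)
         for i in range(len(lda_topics))
         for j in range(i + 1, len(lda_topics))]
for score, i, j in sorted(pairs, reverse=True)[:3]:
    print("topics %d and %d: %.2f" % (i, j, score))
```

Pairs with a score close to 1 are effectively the same topic, which is the blending described above.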

## <a name="Conclusion"></a>Conclusion

From the analysis, I found the best approach for topic modeling of the 100 movie reviews from HWs 5 and 7 to be NMF with the Frobenius norm implementation, because the features produced were more intuitive and the degradation of features occurred more slowly. The optimal number of topics for this sparse data set was found to be 6, because the features of subsequent topics provided no insight into the reviews.

In order to produce a greater number of topics / better model, it would be necessary to pull in a greater number of reviews.