### Homework 7
Sean Kennedy, SMU


In [42]:
import itertools
import re

import nltk
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.chunk import *
from nltk.chunk.util import *
from nltk.chunk.regexp import *
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

pd.options.display.max_colwidth=500



Cluster the reviews that you collected in homework 5, by doing the following:

1.	In Python, select any one of the clustering methods covered in this course. Run it over the collection of reviews, and show at least two different ways of clustering the reviews, e.g., changing k in k-Means clustering or changing where you “cut” in Agnes or Diana.  

2.	Try to write a short phrase to characterize (give a natural interpretation of) what each cluster is generally centered on semantically. Is this hard to do in some cases? If so, make note of that fact. 

3.	Explain which of the two clustering results from question 1 is preferable (if one of them is), and why. 

NOTE: Code for scraping IMDB website modified from https://shravan-kuchkula.github.io/scrape_imdb_movie_reviews/#step-4-for-each-of-the-movie-reviews-link-get-a-positive-user-review-link-and-a-negative-movie-review-link


### Code

In [68]:
grammar = """
    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}
           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS>+}
    """
def k_means(reviews, clusters=5, show_top=10):
    # build a TfidfVectorizer with the English stop words
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(reviews)

    # execute KMeans on the vectorized data
    model = KMeans(n_clusters=clusters, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)

    # sort each centroid's term weights in descending order
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    terms = vectorizer.get_feature_names_out()

    # print out the top terms per cluster for the user
    for i in range(clusters):
        print(f'cluster {i}')
        for ind in order_centroids[i, :show_top]:
            print(terms[ind])
        print('\n')

def extract_titles(movie_soup):
    title = movie_soup.find('meta', {'name': 'title'})
    return title['content']


def best_worst(reviews):
    worst = reviews.index(min(reviews))

    best = reviews.index(max(reviews))

    return worst, best


def extract_reviews(movie_soup):
    user_review_ratings = [
        tag.previous_element
        for tag in movie_soup.find_all('span', attrs={'class': 'point-scale'})
    ]

    worst, best = best_worst(list(map(int, user_review_ratings)))

    user_review_list = movie_soup.find_all('a', attrs={'class': 'title'})

    w_review_tag = user_review_list[worst]
    b_review_tag = user_review_list[best]

    w_review_link = 'https://www.imdb.com' + w_review_tag['href']
    b_review_link = 'https://www.imdb.com' + b_review_tag['href']

    return w_review_link, b_review_link


def extract_review_text(review_soup):
    tag = review_soup.find('div', attrs={'class': 'text show-more__control'})
    return tag.getText()


def sentence_tokenize(text):
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [get_np_tags(sent) for sent in sentences]
    return sentences


def get_np_tags(sentence):
    nps = []

    cp = nltk.RegexpParser(grammar)
    tree = cp.parse(sentence)

    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            t = subtree
            t = ' '.join(word for word, tag in t.leaves())
            nps.append(t)

    return nps

### Review Load (from HW5)

In [46]:
# build the URL from adjacent string literals so it contains no embedded newline
url = ('https://www.imdb.com/search/title/?title_type=feature&user_rating=6.0,10.0'
       '&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250')
response = requests.get(url, verify=False)
print(response)
soup = BeautifulSoup(response.text)

movie_titles = [
    tag.attrs['href'] for tag in soup.findAll('a', attrs={'class': None})
    if tag.attrs['href'].startswith('/title') and tag.attrs['href'].endswith('/')
]

# de-duplicate while preserving order so titles, URLs and reviews stay aligned
movie_titles = list(dict.fromkeys(movie_titles))

movie_urls = [
    'https://www.imdb.com' + x + 'reviews' for x in movie_titles
]

assert len(movie_urls) >= 100

movie_pages = [
    BeautifulSoup(requests.get(url, verify=False).text) for url in movie_urls 
]

titles = [extract_titles(x) for x in movie_pages]
#display(titles)

reviews = [extract_reviews(x) for x in movie_pages]
#display(reviews)

bad_reviews = [
    extract_review_text(BeautifulSoup(requests.get(url[0], verify=False).text)) for url in reviews 
]

good_reviews = [
    extract_review_text(BeautifulSoup(requests.get(url[1], verify=False).text)) for url in reviews 
]

There are a total of 250 movie user reviews
Displaying 10 user reviews links


['/title/tt0468569/',
 '/title/tt1375666/',
 '/title/tt6751668/',
 '/title/tt0816692/',
 '/title/tt0114369/',
 '/title/tt0102926/',
 '/title/tt7286456/',
 '/title/tt0482571/',
 '/title/tt0407887/',
 '/title/tt0114814/']

In [50]:
all_data = list(
    zip(titles, movie_titles, good_reviews, bad_reviews,
        [url[1] for url in reviews], [url[0] for url in reviews]))

In [4]:
all_data = pd.DataFrame(all_data)

There are a total of 250 movie user reviews
Displaying 10 user reviews links


['https://www.imdb.com/title/tt0468569/reviews',
 'https://www.imdb.com/title/tt1375666/reviews',
 'https://www.imdb.com/title/tt6751668/reviews',
 'https://www.imdb.com/title/tt0816692/reviews',
 'https://www.imdb.com/title/tt0114369/reviews',
 'https://www.imdb.com/title/tt0102926/reviews',
 'https://www.imdb.com/title/tt7286456/reviews',
 'https://www.imdb.com/title/tt0482571/reviews',
 'https://www.imdb.com/title/tt0407887/reviews',
 'https://www.imdb.com/title/tt0114814/reviews']

In [52]:
all_data.columns = pd.Index(
    ['title', 'url', 'good_review', 'bad_review', 'good_url', 'bad_url'])

In [5]:
all_data.set_index(['title'], inplace=True)

There are a total of 500 individual movie reviews
Displaying 10 reviews


['https://www.imdb.com/review/rw2599771/',
 'https://www.imdb.com/review/rw5478826/',
 'https://www.imdb.com/review/rw2286063/',
 'https://www.imdb.com/review/rw4692192/',
 'https://www.imdb.com/review/rw5589331/',
 'https://www.imdb.com/review/rw5195256/',
 'https://www.imdb.com/review/rw3119344/',
 'https://www.imdb.com/review/rw5145037/',
 'https://www.imdb.com/review/rw1136748/',
 'https://www.imdb.com/review/rw0370669/']

In [59]:
all_reviews = list(all_data['good_review'])
all_reviews.extend(list(all_data['bad_review']))

In [55]:
all_data['good_review_chunks'] = all_data['good_review'].apply(
    sentence_tokenize)
all_data['bad_review_chunks'] = all_data['bad_review'].apply(sentence_tokenize)
all_data.head()

|   | movie | user_review_permalink | user_review | sentiment |
|---|---|---|---|---|
| 0 | The Dark Knight | https://www.imdb.com/review/rw2599771/ | Let's open this review with the fact that Batman is THE BEST superhero ever, with no other even coming close to his levels. So when a sequel is made for an unforgettable movie like "Batman Begins" expectations are EXTREMELY high. I think that if you could describe this movie in a few words it would be "152 minutes of pure awesomeness." Christopher Nolan probably created the best ever comic book adaptation of a movie. The movie itself has been adapted in a much more realistic fashion than the... | negative |
| 1 | The Dark Knight | https://www.imdb.com/review/rw5478826/ | Confidently directed, dark, brooding, and packed with impressive action sequences and a complex story, The Dark Knight includes a career-defining turn from Heath Ledger as well as other Oscar worthy performances, TDK remains not only the best Batman movie, but comic book movie ever created. | positive |
| 2 | Inception | https://www.imdb.com/review/rw2286063/ | I have to say to make such an impressive trailer and such an uninteresting film, takes some doing.Here you have most of the elements that would make a very good film. You have great special effects, a sci-fi conundrum, beautiful visuals and good sound. Yet the most important part of the film is missing. There is no plot, character or soul to this film. It's like having a beautiful building on the outside with no paint or decoration on the inside.It's an empty shell of a film. There is no ten... | negative |
| 3 | Inception | https://www.imdb.com/review/rw4692192/ | My 3rd time watching this movie! Yet, it still stunned my mind, kept me enjoyed its every moment and left me with many thoughts afterward.\nFor someone like me, who've rarely slept without dream, it's so exciting watching how Christopher Nolan had illustrated every single characteristic of dream on the big screen. As it's been done so sophisticatedly, I do believe the rumour that Nolan had spent 10 years to finish the script of Inception. In my opinion, it's been so far the greatest achievem... | positive |
| 4 | Parasite | https://www.imdb.com/review/rw5589331/ | I find it somewhat odd that some people have interpreted this film as anticapitalist. It shines a very narrow light on poverty but it is all conveyed in a very surreal scenario which makes it quite difficult to emotionally engage with the situations that transpire. However, that's not the only reason.The main characters are not particularly likeable, which I'm sure defenders of the film would describe them as morally grey, given the prevailing cultural obsession with the moral ambiguity.It's... | negative |


***1.	In Python, select any one of the clustering methods covered in this course. Run it over the collection of reviews, and show at least two different ways of clustering the reviews, e.g., changing k in k-Means clustering or changing where you “cut” in Agnes or Diana.*** 

Running k-means on the raw (non-tokenized) reviews with **5 clusters** (scikit-learn)

In [76]:
k_means(all_reviews)

Top terms per cluster:
Cluster 0:
 movie
 people
 film
 like
 good
 turkish
 think
 real
 story
 know
Cluster 1:
 film
 movie
 just
 story
 like
 best
 good
 time
 really
 don
Cluster 2:
 film
 hitchcock
 murder
 movie
 films
 best
 time
 just
 man
 grant
Cluster 3:
 movie
 film
 character
 good
 ending
 just
 great
 story
 like
 plot
Cluster 4:
 movie
 action
 film
 movies
 seen
 time
 best
 watch
 ve
 great




In [8]:
Running k-means on the raw (non-tokenized) reviews with **3 clusters** (scikit-learn)

Top terms per cluster:
Cluster 0:
 message
 media
 movie
 fred
 lou
 symbolism
 dead
 true
 film
 evil
Cluster 1:
 movie
 film
 just
 like
 story
 don
 good
 people
 know
 think
Cluster 2:
 film
 movie
 plot
 films
 park
 time
 acting
 really
 oldboy
 way
Cluster 3:
 movie
 good
 film
 10
 fans
 action
 eastwood
 best
 movies
 love
Cluster 4:
 action
 movie
 time
 film
 best
 alien
 fi
 sci
 love
 films
Cluster 5:
 film
 story
 movie
 list
 films
 com
 city
 imdb
 10
 best
Cluster 6:
 film
 movie
 police
 action
 watch
 time
 characters
 character
 end
 great
Cluster 7:
 hitchcock
 grant
 bourne
 cary
 hepburn
 film
 bond
 kennedy
 spy
 hannay
Cluster 8:
 bond
 craig
 daniel
 action
 mission
 007
 james
 movie
 series
 goldfinger
Cluster 9:
 bourne
 cruise
 max
 film
 vincent
 korea
 murder
 action
 spielberg
 insider




In [75]:
k_means(all_reviews, clusters=3)

Top terms per cluster:
Cluster 0:
 corrugated
 chevaux
 les
 deux
 shells
 diaboliques
 principal
 lover
 apartment
 school
Cluster 1:
 movie
 imdb
 action
 film
 movies
 just
 time
 good
 10
 best
Cluster 2:
 horror
 film
 dead
 zombies
 mall
 house
 wednesday
 lives
 evil
 concerns
Cluster 3:
 film
 good
 like
 movie
 really
 great
 just
 character
 perfect
 best
Cluster 4:
 bond
 goldfinger
 film
 max
 martial
 arts
 gibson
 teenagerdefinite
 ussexy
 actionmoralitya
Cluster 5:
 film
 movie
 best
 people
 like
 love
 story
 performance
 time
 just
Cluster 6:
 action
 film
 bourne
 man
 police
 movie
 films
 story
 just
 best
Cluster 7:
 film
 just
 time
 life
 way
 movie
 like
 good
 films
 carlito
Cluster 8:
 spanish
 cell
 experiment
 prisoners
 211
 prison
 escape
 alcatraz
 film
 alicia
Cluster 9:
 film
 city
 crowe
 corruption
 best
 character
 confidential
 police
 story
 la
Cluster 10:
 war
 film
 turing
 movie
 donovan
 like
 character
 elizabeth
 just
 alma
Cluster 11:
 park

***2.	Try to write a short phrase to characterize (give a natural interpretation of) what each cluster is generally centered on semantically. Is this hard to do in some cases? If so, make note of that fact.***

Taking a look at the top 10 words in each centroid can give an indication of the semantic center of the cluster.
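As a minimal sketch of how such a top-terms list is obtained, the snippet below ranks one centroid's weights from largest to smallest and reads off the matching vocabulary entries, which is the role the `argsort()[:, ::-1]` step plays in `k_means`. The vocabulary and weights are invented for illustration, and `top_terms` is a hypothetical helper that is not part of the notebook.

```python
def top_terms(centroid, terms, show_top=10):
    """Return the show_top terms with the largest centroid weights."""
    # indices of the centroid's weights, sorted from largest to smallest
    ranked = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in ranked[:show_top]]


# Hypothetical vocabulary and centroid weights, made up for this example.
terms = ["acting", "bond", "film", "hitchcock", "movie", "plot"]
centroid = [0.02, 0.31, 0.18, 0.00, 0.12, 0.05]

print(top_terms(centroid, terms, show_top=3))  # ['bond', 'film', 'movie']
```

A term ranks highly when its average TF-IDF weight across the cluster's reviews is large, so very common words can still dominate a list if they are frequent inside the cluster.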

cluster 0:
movie
film
like
story
action
time
best
just
good
really

***Things you might read in a movie review.***

cluster 1:
movie
film
like
good
just
don
action
plot
people
story


***A slightly more granular list of things you may read in a movie review***


cluster 2:
film
man
time
hannay
love
just
family
true
tom
story

***Some names that could be part of a love story***


cluster 3:
film
way
hitchcock
miller
pacino
story
great
al
time
scene

***Some all time classic actors and film directors***

cluster 4:
film
events
films
com
story
imdb
seen
list
www
http

***Another uninformative cluster, dominated by URL fragments (com, imdb, www, http).***

***Overall, these clusters are not very distinctive, apart from cluster 3, which seems to have a semantic center around certain actors and directors.***

cluster 0:
film
action
movie
just
time
great
like
watch
bond
plot

***Generic terms from movie reviews - and James Bond!***

cluster 1:
film
best
time
movie
story
films
like
great
just
acting

***Generic terms from movie reviews - Part Two***


cluster 2:
movie
story
good
just
like
movies
really
don
time
film

***More wildly positive generic terms from movie reviews***

***None of these clusters is particularly interesting or distinguishable. They contain a variety of generic terms that would not create much separation if we were looking to label or categorize the reviews in some way.***
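One rough way to put a number on this overlap is the Jaccard similarity between the three top-10 term lists listed just above. This is only a sketch: `jaccard` is a helper introduced here for illustration, and comparing top-term sets is an informal check rather than a standard clustering metric.

```python
from itertools import combinations

# Top-10 terms copied from the three clusters listed above.
top_terms = {
    0: ["film", "action", "movie", "just", "time", "great", "like", "watch", "bond", "plot"],
    1: ["film", "best", "time", "movie", "story", "films", "like", "great", "just", "acting"],
    2: ["movie", "story", "good", "just", "like", "movies", "really", "don", "time", "film"],
}


def jaccard(a, b):
    """Fraction of distinct terms shared by two lists (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# pairwise overlap between every pair of clusters
for i, j in combinations(top_terms, 2):
    print(f"clusters {i} and {j}: {jaccard(top_terms[i], top_terms[j]):.2f}")

# terms that appear in every list cannot help tell the clusters apart
shared = set.intersection(*(set(t) for t in top_terms.values()))
print(sorted(shared))  # ['film', 'just', 'like', 'movie', 'time']
```

Half of each list (film, just, like, movie, time) is shared by all three clusters, which is consistent with the lack of separation noted above.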

***3.	Explain which of the two clustering results from question 1 is preferable (if one of them is), and why.***

Using more clusters appeared to produce more distinctive clusters than using fewer. A simple test of that theory is to run the algorithm again with 20 clusters.

In [78]:
k_means(all_reviews, clusters=20, show_top=15)

cluster 0
movie
feel
imdb
10
don
story
watch
opinion
characters
like
waste
review
high
just
rating


cluster 1
film
fred
lynch
man
alan
21
susan
5th
book
tony
edward
november
watch
grams
fail


cluster 2
hanks
movie
tom
cinema
spielberg
cruise
film
spanish
amazing
turkish
like
quality
noir
assassination
alo


cluster 3
film
brilliant
lecter
soderbergh
clarice
garrigan
just
swan
performance
superb
portman
pacion
selick
lead
ballet


cluster 4
fi
sci
film
best
predator
action
movie
gattaca
alien
time
lousy
runner
blade
favorite
love


cluster 5
action
film
movie
films
miller
max
just
characters
best
time
like
story
incredibly
character
better


cluster 6
movie
film
good
really
movies
like
people
characters
just
watch
think
story
character
best
going


cluster 7
movie
horror
film
story
movies
just
ve
seen
like
best
don
time
tarantino
got
simply


cluster 8
hitchcock
film
love
hitch
true
time
movie
seen
think
vertigo
stewart
grant
bruno
fans
rear


cluster 9
lee
killer
serial
lola
run
jone

***A cursory look at the clusters generated with 20 centroids shows that the clusters start to differentiate. Most still contain general review vocabulary (like, best, 10, etc.), but many tilt towards a particular genre, director, or actor. This is a clear improvement over 5 clusters (the default) or 3. Further tuning could locate the trade-off between accuracy and generalization in the choice of k: a value that is too high could produce clusters that are far too specific for the intended use.***
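One hedged way to explore that trade-off is to sweep k and watch the within-cluster sum of squares (inertia) fall; the value of k where the drop flattens is a common informal guide. The sketch below uses a small pure-Python k-means on made-up 2-D points so that it runs without the review data. It is a simplified stand-in, not scikit-learn's implementation, and `k_means_inertia` is a hypothetical helper. With the notebook's model, the corresponding value is available as `model.inertia_` after `fit`.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means_inertia(points, k, seed=0, max_iter=100):
    """Run plain Lloyd's k-means and return the final inertia."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(max_iter):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # inertia: squared distance from every point to its closest centroid
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Made-up points forming three loose groups, for illustration only.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

for k in (1, 2, 3, 4, 9):
    print(k, round(k_means_inertia(points, k), 3))
```

Inertia reaches zero once every point has its own centroid, so a low value alone does not make a clustering useful; that is the over-specific end of the trade-off described above.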