### NLP Homework: 7
Sean Kennedy, SMU

Cluster the reviews that you collected in homework 5, by doing the following:



In [42]:
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import itertools
import nltk
from nltk.tokenize import word_tokenize 
from nltk.chunk import *
from nltk.chunk.util import *
from nltk.chunk.regexp import *
from IPython import display
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

### Code

In [68]:
grammar = """
    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}
           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS>+}
    """
def k_means(reviews, clusters=5, show_top=10):
    # build a TfidfVectorizer with the English stop words
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(reviews)

    # execute KMeans on the vectorized data
    model = KMeans(n_clusters=clusters, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)

    # print out the top terms per cluster for the user
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    terms = vectorizer.get_feature_names_out()
    
    for i in range(clusters):
        print(f'cluster {i}')
        for ind in order_centroids[i, :show_top]:
            print(f'{terms[ind]}')
        print('\n')

def extract_titles(movie_soup):
    title = movie_soup.find('meta', {'name': 'title'})
    return title['content']


def best_worst(reviews):
    worst = reviews.index(min(reviews))

    best = reviews.index(max(reviews))

    return worst, best


def extract_reviews(movie_soup):
    user_review_ratings = [
        tag.previous_element
        for tag in movie_soup.find_all('span', attrs={'class': 'point-scale'})
    ]

    worst, best = best_worst(list(map(int, user_review_ratings)))

    user_review_list = movie_soup.find_all('a', attrs={'class': 'title'})

    w_review_tag = user_review_list[worst]
    b_review_tag = user_review_list[best]

    w_review_link = 'https://www.imdb.com' + w_review_tag['href']
    b_review_link = 'https://www.imdb.com' + b_review_tag['href']

    return w_review_link, b_review_link


def extract_review_text(review_soup):
    tag = review_soup.find('div', attrs={'class': 'text show-more__control'})
    return tag.getText()


def sentence_tokenize(text):
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [get_np_tags(sent) for sent in sentences]
    return sentences


def get_np_tags(sentence):
    nps = []

    cp = nltk.RegexpParser(grammar)
    tree = cp.parse(sentence)

    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            t = subtree
            t = ' '.join(word for word, tag in t.leaves())
            nps.append(t)

    return nps

### Review Load (from HW5)

In [46]:
# build the URL from adjacent string literals so it contains no embedded newline
url = ('https://www.imdb.com/search/title/?title_type=feature&user_rating=6.0,10.0'
       '&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250')
response = requests.get(url, verify=False)
print(response)
soup = BeautifulSoup(response.text)

# keep only links of the form /title/.../ (one per movie, possibly repeated)
movie_titles = [
    tag.attrs['href'] for tag in soup.find_all('a', attrs={'class': None})
    if tag.attrs['href'].startswith('/title') and tag.attrs['href'].endswith('/')
]


movie_urls = [
    'https://www.imdb.com' + x + 'reviews' for x in set(movie_titles)
]

assert len(movie_urls) >= 100

movie_pages = [
    BeautifulSoup(requests.get(url, verify=False).text) for url in movie_urls 
]

titles = [extract_titles(x) for x in movie_pages]
#display(titles)

reviews = [extract_reviews(x) for x in movie_pages]
#display(reviews)

bad_reviews = [
    extract_review_text(BeautifulSoup(requests.get(url[0], verify=False).text)) for url in reviews 
]

good_reviews = [
    extract_review_text(BeautifulSoup(requests.get(url[1], verify=False).text)) for url in reviews 
]



<Response [200]>

In [50]:
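# NOTE: titles and reviews follow the order of movie_urls (built from
# set(movie_titles)); movie_titles keeps page order with duplicates, so the
# 'url' column is not guaranteed to line up with 'title'.
all_data = list(
    zip(titles, movie_titles, good_reviews, bad_reviews,
        [url[1] for url in reviews], [url[0] for url in reviews]))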

In [51]:
all_data = pd.DataFrame(all_data)

In [52]:
all_data.columns = pd.Index(
    ['title', 'url', 'good_review', 'bad_review', 'good_url', 'bad_url'])

In [53]:
all_data.set_index(['title'], inplace=True)

In [59]:
all_reviews = list(all_data['good_review'])
all_reviews.extend(list(all_data['bad_review']))

In [55]:
all_data['good_review_chunks'] = all_data['good_review'].apply(
    sentence_tokenize)
all_data['bad_review_chunks'] = all_data['bad_review'].apply(sentence_tokenize)
all_data.head()

| title | url | good_review | bad_review | good_url | bad_url | good_review_chunks | bad_review_chunks |
|---|---|---|---|---|---|---|---|
| Source Code (2011) - IMDb | /title/tt0468569/ | What a movie. The price is worth paying. This ... | Oh, how we all love us some good old fashioned... | https://www.imdb.com/review/rw2407588/ | https://www.imdb.com/review/rw2407030/ | [[movie], [price], [movie, point], [lot of mov... | [[Action- Thrillers], [Last year, heroes, trai... |
| District 9 (2009) - IMDb | /title/tt0468569/ | I do not give out ratings of 10 lightly, but h... | Very rarely do I come out of the cinema so ani... | https://www.imdb.com/review/rw2110659/ | https://www.imdb.com/review/rw2128160/ | [[ratings, film in years, rating.Neill Blomkam... | [[cinema, film, outstanding.The documentary st... |
| Witness for the Prosecution (1957) - IMDb | /title/tt1375666/ | At the end of the day the films you give top m... | I love it when a movie captivates me, carries ... | https://www.imdb.com/review/rw1790103/ | https://www.imdb.com/review/rw1311149/ | [[end, day, films, top marks, films, constant ... | [[movie, end], [masterpiece, human maneuvering... |
| Dunkirk (2017) - IMDb | /title/tt1375666/ | Dunkirk is, in my opinion, yet another masterp... | Yes, awful. How can anyone honestly give this ... | https://www.imdb.com/review/rw3760993/ | https://www.imdb.com/review/rw3766547/ | [[Dunkirk, opinion, masterpiece from mastermin... | [[], [anyone, film, positive recommendation], ... |
| Serpico (1973) - IMDb | /title/tt6751668/ | When Frank Serpico joined the police force he ... | Al Pacino dominates every scene he is in as Fr... | https://www.imdb.com/review/rw0131523/ | https://www.imdb.com/review/rw1945261/ | [[Frank Serpico, police force, ideals and eage... | [[Al Pacino, scene, Frank Serpico, New York Ci... |


***1.	In Python, select any one of the clustering methods covered in this course. Run it over the collection of reviews, and show at least two different ways of clustering the reviews, e.g., changing k in k-Means clustering or changing where you “cut” in Agnes or Diana.*** 

Running k-means on the raw (non-tokenized) reviews with **5 clusters** (scikit-learn)

In [76]:
k_means(all_reviews)

cluster 0
movie
film
like
story
action
time
best
just
good
really


cluster 1
movie
film
like
good
just
don
action
plot
people
story


cluster 2
film
man
time
hannay
love
just
family
true
tom
story


cluster 3
film
way
hitchcock
miller
pacino
story
great
al
time
scene


cluster 4
film
events
films
com
story
imdb
seen
list
www
http




Running k-means on the raw (non-tokenized) reviews with **3 clusters** (scikit-learn)

In [75]:
k_means(all_reviews, clusters=3)

cluster 0
film
action
movie
just
time
great
like
watch
bond
plot


cluster 1
film
best
time
movie
story
films
like
great
just
acting


cluster 2
movie
story
good
just
like
movies
really
don
time
film




***2.	Try to write a short phrase to characterize (give a natural interpretation of) what each cluster is generally centered on semantically. Is this hard to do in some cases? If so, make note of that fact.***

The top 10 terms of each centroid give an indication of the semantic center of each cluster. The 5-cluster run is interpreted first, followed by the 3-cluster run.
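
As a rough illustration of how those top terms are obtained, the sketch below ranks the columns of a small centroid matrix the same way `k_means` does with `argsort()[:, ::-1]`. The vocabulary, the weights, and the helper name `top_terms` are invented for this example and are not taken from the reviews.

```python
import numpy as np

# Hypothetical vocabulary and centroid weights (2 clusters x 5 terms),
# invented purely for illustration.
terms = np.array(['film', 'hitchcock', 'movie', 'pacino', 'plot'])
centroids = np.array([
    [0.30, 0.05, 0.50, 0.00, 0.20],
    [0.25, 0.60, 0.10, 0.40, 0.05],
])

def top_terms(centroids, terms, show_top=3):
    # argsort sorts ascending, so reverse each row to put the
    # indices of the highest-weighted terms first.
    order = centroids.argsort()[:, ::-1]
    return [terms[row[:show_top]].tolist() for row in order]

for i, words in enumerate(top_terms(centroids, terms)):
    print(f'cluster {i}: {", ".join(words)}')
```

A term ranks highly when its average TF-IDF weight across the reviews assigned to that cluster is large, which is why very frequent words such as "film" can appear near the top of several clusters at once.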

cluster 0:
movie
film
like
story
action
time
best
just
good
really

***Things you might read in a movie review.***

cluster 1:
movie
film
like
good
just
don
action
plot
people
story


***A slightly more granular list of things you may read in a movie review***


cluster 2:
film
man
time
hannay
love
just
family
true
tom
story

***Some names that could be part of a love story***


cluster 3:
film
way
hitchcock
miller
pacino
story
great
al
time
scene

***Some all-time classic actors and film directors***

cluster 4:
film
events
films
com
story
imdb
seen
list
www
http

***Another boring cluster...***
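
The terms `com`, `imdb`, `www` and `http` in this cluster suggest that some reviews contain links. One possible cleanup, sketched here under the assumption that links should simply be dropped, is to strip URLs before vectorizing. The helper `strip_urls` and the sample text are hypothetical and not part of the original notebook.

```python
import re

# Matches http(s) links and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def strip_urls(text):
    """Replace URLs with a space and collapse repeated whitespace."""
    without_urls = URL_PATTERN.sub(' ', text)
    return ' '.join(without_urls.split())

# Made-up review text, used only to show the effect.
sample = 'See my list at http://www.imdb.com/list/example and www.example.com today.'
print(strip_urls(sample))
```

It could be applied as `all_reviews = [strip_urls(r) for r in all_reviews]` before calling `k_means`.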

***Overall, the 5-cluster results are not very distinctive, apart from cluster 3, which seems to have a semantic center around specific actors and directors.***

The 3-cluster run follows:

cluster 0:
film
action
movie
just
time
great
like
watch
bond
plot

***Generic terms from movie reviews - and James Bond!***

cluster 1:
film
best
time
movie
story
films
like
great
just
acting

***Generic terms from movie reviews - Part Two***


cluster 2:
movie
story
good
just
like
movies
really
don
time
film

***More wildly positive generic terms from movie reviews***

***None of these clusters is particularly interesting or distinguishable. They contain a variety of generic terms that wouldn't create much separation if we were looking to label or categorize the reviews in some way.***
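
One way to quantify "generic" is document frequency: the share of reviews in which a term appears. The sketch below, using only the standard library and a made-up mini corpus, lists terms that occur in most documents; such terms are candidates for an extended stop word list. The helper name `common_terms` and the 60% threshold are assumptions made for illustration.

```python
import re
from collections import Counter

def common_terms(docs, min_share=0.6):
    """Return terms that occur in at least `min_share` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        # Tokens are runs of two or more letters, lowercased.
        doc_freq.update(set(re.findall(r'[a-z]{2,}', doc.lower())))
    cutoff = min_share * len(docs)
    return sorted(term for term, n in doc_freq.items() if n >= cutoff)

# Made-up mini corpus standing in for the reviews.
docs = [
    'A great film with a great story',
    'This film is a waste of time',
    'Hitchcock made this film a classic',
    'The story drags but the film looks good',
]
print(common_terms(docs))
```

`TfidfVectorizer` also accepts a `max_df` parameter that ignores terms above a given document frequency, which may achieve a similar effect without a hand-built list.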

***3.	Explain which of the two clustering results from question 1 is preferable (if one of them is), and why.***

The 5-cluster result is preferable: using more clusters produced clusters that were more distinctive than using fewer. A simple test of that theory is to run the algorithm again with 20 clusters.

In [78]:
k_means(all_reviews, clusters=20, show_top=15)

cluster 0
movie
feel
imdb
10
don
story
watch
opinion
characters
like
waste
review
high
just
rating


cluster 1
film
fred
lynch
man
alan
21
susan
5th
book
tony
edward
november
watch
grams
fail


cluster 2
hanks
movie
tom
cinema
spielberg
cruise
film
spanish
amazing
turkish
like
quality
noir
assassination
alo


cluster 3
film
brilliant
lecter
soderbergh
clarice
garrigan
just
swan
performance
superb
portman
pacion
selick
lead
ballet


cluster 4
fi
sci
film
best
predator
action
movie
gattaca
alien
time
lousy
runner
blade
favorite
love


cluster 5
action
film
movie
films
miller
max
just
characters
best
time
like
story
incredibly
character
better


cluster 6
movie
film
good
really
movies
like
people
characters
just
watch
think
story
character
best
going


cluster 7
movie
horror
film
story
movies
just
ve
seen
like
best
don
time
tarantino
got
simply


cluster 8
hitchcock
film
love
hitch
true
time
movie
seen
think
vertigo
stewart
grant
bruno
fans
rear


cluster 9
lee
killer
serial
lola
run
jone

***A cursory look at the clusters generated with 20 centroids shows that the differentiation between clusters starts to take form. Most clusters still contain general review-sentiment vocabulary (like, best, 10, etc.), but many also tilt towards a particular genre, director, or actor. This is much better than the results with 5 clusters (the function's default) or 3 clusters. Further tuning could be done to find where the tradeoff between accuracy and generalization lies for the parameter k. Choosing a value that is too high could lead to clusters that are far too specific for the intended use.***
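
The further tuning mentioned above could start with a sweep over candidate values of k. The sketch below records the k-means inertia and the silhouette score for each k on a made-up mini corpus. The silhouette score is not used in the original notebook; it is swapped in here as one common unsupervised measure of cluster separation. The helper name `score_k`, the fixed seed, and `n_init=10` are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def score_k(docs, k_values, seed=0):
    """Return {k: (inertia, silhouette)} for each candidate k."""
    X = TfidfVectorizer(stop_words='english').fit_transform(docs)
    scores = {}
    for k in k_values:
        # A fixed random_state makes repeated runs comparable.
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        scores[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return scores

# Made-up mini corpus; in the notebook this would be `all_reviews`.
docs = [
    'Hitchcock builds suspense in every scene',
    'Classic Hitchcock suspense with a great twist',
    'The sci fi action is loud and fun',
    'Great sci fi effects and alien action',
    'A true story about family and love',
    'A moving family story about love and loss',
    'Pacino dominates the police drama',
    'A gritty police drama with Pacino',
]
for k, (inertia, sil) in score_k(docs, [2, 3, 4]).items():
    print(f'k={k}: inertia={inertia:.3f}, silhouette={sil:.3f}')
```

Inertia tends to fall as k grows, so it is usually read for an "elbow" rather than a minimum, while a higher silhouette score suggests better-separated clusters. Neither replaces reading the top terms per cluster as done above.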