### Homework 7
Balaji Avvaru

In [1]:
#import required libraries
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize 
import numpy as np
import re
import pandas as pd
import itertools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

#### Cluster the reviews that you collected in homework 5, by doing the following:  


1.	In Python, select any one of the clustering methods covered in this course. Run it over the collection of reviews, and show at least two different ways of clustering the reviews, e.g., changing k in k-Means clustering or changing where you “cut” in Agnes or Diana.  

2.	Try to write a short phrase to characterize (give a natural interpretation of) what each cluster is generally centered on semantically. Is this hard to do in some cases? If so, make note of that fact. 

3.	Explain which of the two clustering results from question 1 is preferable (if one of them is), and why. 

Submit all of your inputs and outputs and your code for this assignment, along with a brief written explanation of your findings. 


In [2]:
#IMDB website URL
base_url = "https://www.imdb.com"

# Search query: 100 thriller feature films rated at least 4.0 with at least 50,000 votes, sorted by rating
# (adjacent string literals are joined so the URL contains no line break)
url = ('https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0'
       '&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=100')

# Convert IMDB url to a BeautifulSoup object
response = requests.get(url)
movies_soup = BeautifulSoup(response.text, 'html.parser')

# get movie tags 
movie_tags = movies_soup.find_all('a', attrs={'class': None})

# keep only the hrefs that point to a movie title page
movie_tags = [tag.attrs['href'] for tag in movie_tags 
                  if tag.attrs['href'].startswith('/title') and tag.attrs['href'].endswith('/')]

# remove duplicate links
movie_tags = list(dict.fromkeys(movie_tags))

# Print out the number of movie links we have and show the first 5 items
print("There are a total of " + str(len(movie_tags)) + " movie links")
print("Displaying first 5 movie links")
movie_tags[:5]

There are a total of 100 movie links
Displaying first 5 movie links


['/title/tt0468569/',
 '/title/tt1375666/',
 '/title/tt6751668/',
 '/title/tt0114369/',
 '/title/tt0102926/']
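
The search URL from cell 2 can also be assembled from a parameter dictionary, which avoids stray line breaks inside the string. A minimal sketch using only the standard library; the variable names `params` and `search_url` are illustrative and not part of the original notebook:

```python
from urllib.parse import urlencode

# Query parameters copied from the search URL in cell 2.
params = {
    "title_type": "feature",
    "user_rating": "4.0,10.0",
    "num_votes": "50000,",
    "genres": "thriller",
    "view": "simple",
    "sort": "user_rating,desc",
    "count": 100,
}

# safe="," keeps the commas readable instead of encoding them as %2C.
search_url = "https://www.imdb.com/search/title/?" + urlencode(params, safe=",")
print(search_url)
```

The resulting string can be passed to `requests.get` in place of `url`.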

In [3]:
# build out the list of reviews
review_links = [base_url + tag + 'reviews' for tag in movie_tags]

print("There are a total of " + str(len(review_links)) + " movie user reviews")
print("Displaying first 5 user reviews full url")
review_links[:5]

There are a total of 100 movie user reviews
Displaying first 5 user reviews full url


['https://www.imdb.com/title/tt0468569/reviews',
 'https://www.imdb.com/title/tt1375666/reviews',
 'https://www.imdb.com/title/tt6751668/reviews',
 'https://www.imdb.com/title/tt0114369/reviews',
 'https://www.imdb.com/title/tt0102926/reviews']

In [4]:
# get a list of soup objects
movie_soups = []
for link in review_links:
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')
    movie_soups.append(soup)


In [5]:
# collect the lowest- and highest-rated review link for each movie
movie_review_list = []
for movie_soup in movie_soups:
    # get the user ratings as integers
    user_review_ratings = [int(tag.previous_element) for tag in 
                           movie_soup.find_all('span', attrs={'class': 'point-scale'})]
    
    # the lowest-rated review is treated as the negative review
    # and the highest-rated review as the positive review
    n_index = user_review_ratings.index(min(user_review_ratings))
    p_index = user_review_ratings.index(max(user_review_ratings))
    
    # get the review tags
    user_review_list = movie_soup.find_all('a', attrs={'class':'title'})
    
    # get the negative and positive review tags
    n_review_tag = user_review_list[n_index]
    p_review_tag = user_review_list[p_index]
    
    # build the full negative and positive review links
    n_review_link = base_url + n_review_tag['href']
    p_review_link = base_url + p_review_tag['href']
    
    movie_review_list.append(n_review_link)
    movie_review_list.append(p_review_link)

movie_review_list[:5]

['https://www.imdb.com/review/rw1945777/',
 'https://www.imdb.com/review/rw1999145/',
 'https://www.imdb.com/review/rw2365579/',
 'https://www.imdb.com/review/rw2879376/',
 'https://www.imdb.com/review/rw5204791/']
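
The index lookup in cell 5 can be sketched in isolation. The helper below is a hypothetical addition (the name `extreme_rating_indexes` is not in the original notebook). It returns `None` for an empty list, a case in which `min()` and `max()` would otherwise raise a `ValueError`:

```python
def extreme_rating_indexes(ratings):
    """Return (lowest_index, highest_index) for a list of rating strings.

    Returns None when the list is empty, e.g. if a page shows no rated reviews.
    Ties resolve to the first occurrence, matching list.index().
    """
    values = [int(r) for r in ratings]
    if not values:
        return None
    return values.index(min(values)), values.index(max(values))


print(extreme_rating_indexes(["10", "3", "8", "3", "10"]))  # (1, 0)
print(extreme_rating_indexes([]))  # None
```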

In [6]:
# get review text from the review link
review_texts = []
for url in movie_review_list:
    # get the review_url's soup
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # find div tags with class text show-more__control
    tag = soup.find('div', attrs={'class': 'text show-more__control'})
    review_texts.append(tag.getText())

review_texts[:5]

["The first film in the re-imagining of the series was a big hit, but this sequel was a global success, especially with the superb performance by the star of Brokeback Mountain who tragically died from a (prescribed) drugs overdose shortly after filming had finished, from director Christopher Nolan (Memento, Insomnia). Basically a criminal terrorist and mastermind calling himself the Joker (posthumous Oscar, BAFTA and Golden Globe winning Heath Ledger) robs the bank run by the mob, and to take on the Mafia district attorney Harvey Dent (Aaron Eckhart) becomes the new face for justice and hope in Gotham City, with the help of Batman aka Bruce Wayne (Christian Bale) and Lieutenant James 'Jim' Gordon (Gary Oldman). Mob bosses Sal Maroni (Eric Roberts), Gambol (Michael Jai White) and the Chechen (Ritchie Coaster), who have had Chinese accountant Lau (Chin Han) hide their funds, are confronted by the Joker because he wants to kill the, but they all refuse to help, putting a bounty on him. T

In [7]:
# define a function for KMeans clustering with a default k value of 5
def getKMeans(reviews, kVal = 5):
    # build a TfidfVectorizer with the English stop words
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(reviews)

    # execute KMeans on the vectorized data
    # (no random_state is set, so cluster assignments vary between runs)
    model = KMeans(n_clusters=kVal, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)

    # print out the top terms per cluster for the user
    print("Top terms per cluster:")
    # sort each centroid's term weights in descending order
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2
    terms = vectorizer.get_feature_names_out()
    
    for i in range(kVal):
        print("Cluster %d:" % i)
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind])
        print('\n')

    print("\n")
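
Because `getKMeans` only prints its results and sets no `random_state`, each run can yield different clusters, and there is no way to see which review landed in which cluster. Below is a minimal sketch of a variant that fixes the seed and returns the fitted objects. The name `fit_kmeans`, the seed value, `n_init=10`, and the toy reviews are illustrative assumptions, not part of the original notebook:

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_kmeans(reviews, k=5, seed=42):
    """Hypothetical variant of getKMeans that returns the fitted objects."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(reviews)
    # A fixed random_state makes repeated runs give the same clusters;
    # n_init=10 keeps the best of ten random starts.
    model = KMeans(n_clusters=k, init="k-means++", max_iter=100,
                   n_init=10, random_state=seed)
    model.fit(X)
    return model, vectorizer


# Toy stand-in for review_texts (not real scraped data).
toy_reviews = [
    "hitchcock builds suspense with a murder plot",
    "hitchcock murder mystery with suspense",
    "batman fights the joker in gotham",
    "the joker terrorizes gotham and batman",
]
model, vectorizer = fit_kmeans(toy_reviews, k=2)

# model.labels_[i] is the cluster assigned to toy_reviews[i].
print(Counter(model.labels_.tolist()))
```

With the real data, `model.labels_` could be zipped with `movie_review_list` to inspect which reviews share a cluster.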

In [8]:
# Execute the K-Means function on the reviews. We'll initially use the default number of clusters which is 5
getKMeans(review_texts)

Top terms per cluster:
Cluster 0:
 baran
 graphic
 violence
 novels
 felt
 line
 bit
 ultra
 good
 city


Cluster 1:
 film
 movie
 just
 really
 man
 like
 world
 action
 logan
 time


Cluster 2:
 film
 best
 movie
 great
 oscar
 dicaprio
 think
 police
 100
 story


Cluster 3:
 film
 10
 films
 best
 great
 bethany
 cox
 lot
 story
 brilliant


Cluster 4:
 hitchcock
 film
 movie
 like
 great
 people
 watch
 time
 good
 story






In [9]:
# Execute the K-Means function on the reviews, use a number of clusters equal to 10
getKMeans(review_texts,10)

Top terms per cluster:
Cluster 0:
 film
 island
 viewer
 time
 shark
 investigation
 people
 cole
 quite
 victim


Cluster 1:
 film
 10
 great
 best
 pacino
 welles
 given
 dog
 superb
 wonderful


Cluster 2:
 film
 films
 bergman
 baran
 like
 thought
 kurosawa
 samurai
 persona
 people


Cluster 3:
 stewart
 hitchcock
 bond
 logan
 100
 james
 kelly
 window
 rear
 number


Cluster 4:
 movie
 action
 film
 like
 good
 say
 just
 time
 make
 story


Cluster 5:
 film
 johnny
 movie
 swan
 kubrick
 black
 style
 perros
 amores
 dreams


Cluster 6:
 kill
 wife
 milland
 film
 amy
 murder
 hitchcock
 nick
 100
 husband


Cluster 7:
 film
 best
 going
 great
 man
 noir
 movie
 story
 nicholson
 really


Cluster 8:
 best
 10
 film
 anime
 cox
 great
 bethany
 blade
 runner
 outstanding


Cluster 9:
 film
 haider
 india
 liked
 plot
 pakistan
 kashmir
 long
 really
 grant






In [10]:
# Execute the K-Means function on the reviews, use a number of clusters equal to 20
getKMeans(review_texts,20)

Top terms per cluster:
Cluster 0:
 swan
 reno
 besson
 portman
 brando
 leon
 natalie
 jean
 saint
 ballerina


Cluster 1:
 adrián
 logan
 movie
 superhero
 mills
 sonny
 keller
 things
 police
 bank


Cluster 2:
 100
 number
 best
 nominated
 amy
 years
 nick
 film
 cole
 greatest


Cluster 3:
 film
 violence
 people
 long
 baran
 ve
 action
 movie
 bit
 did


Cluster 4:
 film
 usual
 perros
 amores
 films
 lot
 suspects
 la
 just
 bourne


Cluster 5:
 film
 movie
 great
 really
 story
 like
 nascimento
 bit
 watch
 characters


Cluster 6:
 hitchcock
 10
 stewart
 performance
 film
 murder
 beautiful
 superb
 bethany
 cox


Cluster 7:
 macmurray
 money
 johnny
 soap
 ideas
 double
 dog
 indemnity
 stanwyck
 fred


Cluster 8:
 film
 willis
 original
 great
 bond
 remake
 probably
 twist
 bruce
 mystery


Cluster 9:
 welles
 film
 friend
 logan
 life
 evil
 orson
 harry
 arnab
 experience


Cluster 10:
 bergman
 films
 film
 thought
 persona
 like
 movie
 foreign
 death
 kill


Cluster 

In looking at the clusters defined above we see the following:

- With K=5, the clusters have no clear semantic center. The word "film" is a top term in four of the five clusters and "movie" in three, alongside other generic words such as "great" and "story", so the clusters are hard to tell apart and hard to label with a short phrase.

- With K=10, some clusters start to form around specific movies. For example, Cluster 3 (stewart, hitchcock, james, kelly, window, rear) points to the movie "Rear Window", and Cluster 9 (haider, india, pakistan, kashmir) points to the movie "Haider". On the flip side, certain clusters don't look too good. Cluster 4 (movie, action, film, like, good) is one of these: it reads like a list of generic review words.

- With K=20, more clusters form around specific movies, and reviews of similar movies start to be grouped together. For example, Cluster 18 seems to focus on the movie "The Hunt for Red October". There are still some "odd" clusters that carry little information. For example, Cluster 14 is just a collection of generic words that could describe any movie.
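
One way to check how much the K=5 clusters overlap is to count how many clusters each top term appears in. A minimal sketch, with the term lists copied from the K=5 output above; the threshold of three clusters is an arbitrary choice for illustration:

```python
from collections import Counter

# Top-10 terms per cluster, copied from the K=5 output above.
top_terms = {
    0: ["baran", "graphic", "violence", "novels", "felt",
        "line", "bit", "ultra", "good", "city"],
    1: ["film", "movie", "just", "really", "man",
        "like", "world", "action", "logan", "time"],
    2: ["film", "best", "movie", "great", "oscar",
        "dicaprio", "think", "police", "100", "story"],
    3: ["film", "10", "films", "best", "great",
        "bethany", "cox", "lot", "story", "brilliant"],
    4: ["hitchcock", "film", "movie", "like", "great",
        "people", "watch", "time", "good", "story"],
}

# Count in how many clusters each term appears.
cluster_counts = Counter(
    term for terms in top_terms.values() for term in set(terms)
)

# Terms shared by at least three of the five clusters carry little signal.
shared = sorted(t for t, n in cluster_counts.items() if n >= 3)
print(shared)  # ['film', 'great', 'movie', 'story']
```

Terms that show up in most clusters are candidates for an extended stop-word list.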

Explain which of the two clustering results from question 1 is preferable (if one of them is), and why.
From the clustering runs that were performed (K=5, K=10, K=20), the higher values of K are preferable. As described in question 2, the runs with K above 5 came closer to grouping the reviews by specific topics, and in some cases by specific movie. This gives a richer analysis and clusters that are easier to explain than those from K=5, where the clusters didn't have much meaning.
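
The comparison above rests on reading the top terms by eye. A numeric check could complement it. The sketch below uses the silhouette score, which the original analysis did not use; it is swapped in here only as one possible measure. The function name `silhouette_by_k`, the seed, and the toy reviews are assumptions for illustration. Scores on sparse TF-IDF vectors are often low, so they are best treated as a rough guide.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def silhouette_by_k(reviews, k_values, seed=42):
    """Return {k: silhouette score} for each candidate number of clusters."""
    X = TfidfVectorizer(stop_words="english").fit_transform(reviews)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X)
        # Ranges from -1 to 1; higher means tighter, better-separated clusters.
        scores[k] = silhouette_score(X, labels)
    return scores


# Toy stand-in for review_texts (not real scraped data).
toy_reviews = [
    "hitchcock builds suspense with a murder plot",
    "hitchcock murder mystery with suspense",
    "batman fights the joker in gotham",
    "the joker terrorizes gotham and batman",
    "a samurai epic directed by kurosawa",
    "kurosawa directs a samurai story",
]
scores = silhouette_by_k(toy_reviews, [2, 3])
print(scores)
```

On the scraped data the call would be `silhouette_by_k(review_texts, [5, 10, 20])`.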