### Natural Language Processing - Homework 8

Submitted by: Apurv Mittal

Collaborated with Ravi Sivaraman

### Question 1

First we fetch the reviews as covered in Homework # 5.

In [1]:
import requests
from bs4 import BeautifulSoup
import urllib.request
import nltk
from nltk.tag import pos_tag


# Defining function to form the IMDB URL for movie reviews

def get_review_permalink(movieid: str):
    new_url = f'https://www.imdb.com/title/{movieid}/reviews?ref_=tt_urv'
    imdb_data = urllib.request.urlopen(new_url).read().decode("UTF-8")
    soup = BeautifulSoup(imdb_data, "html.parser")
    permalink = []
    for link in soup.find_all('a'):
        a_text = link.text.strip()
        if a_text == "Permalink":
            review_link = link.get('href')
            review_link = "https://imdb.com" + review_link
            permalink.append(review_link)
    return permalink

# Defining function to fetch the review from the URL

def get_review(reviewurl: str):
    review_data = urllib.request.urlopen(reviewurl).read().decode("UTF-8")
    soup = BeautifulSoup(review_data, "html.parser")
    for eachdiv in soup.find_all('div'):
        if eachdiv.has_attr('class'):
            review_body_div_class = ['text', 'show-more__control']
            review_body_div_class.sort()
            divclass = eachdiv['class']
            divclass.sort()
            if review_body_div_class == divclass:
                return (eachdiv.text.strip())

# Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

comedy_movies = ['tt0099088',
                 'tt0099422',
                 'tt0099611',
                 'tt0099785',
                 'tt0099938',
                 'tt0100142',
                 'tt0100405',
                 'tt0100758',
                 'tt0101272',
                 'tt0101587',
                 'tt0101786',
                 'tt0101902',
                 'tt0102032',
                 'tt0102057',
                 'tt0102059',
                 'tt0102492',
                 'tt0102510',
                 'tt0102943',
                 'tt0103060',
                 'tt0103639',
                 'tt0104070'
                                
                 ]

# Collecting roughly 100 reviews across the movies

permalinks_review = []
counter = 1
for each_movie in comedy_movies:
    # The limit is checked once per movie, so the final count can exceed 100
    if counter > 100:
        break
    permalinks = get_review_permalink(each_movie)
    for review_url in permalinks:
        review = get_review(review_url)
        permalinks_review.append(review)
        counter += 1


As in Homework 5, we first fetch the IMDB reviews for the comedy movies and store them all in a single list, `permalinks_review`.

### Sentiment Analyzer

We use the `vader` lexicon sentiment analyzer for this exercise. First we set up the analyzer and the libraries it needs, then add some `dummy` sentences to test it.

We then print the sentiment scores for each sentence. The analyzer reports `Negative`, `Neutral` and `Positive` weights for each sentence and, based on those weights, a `Compound` score. A positive compound score means the overall sentiment is positive, and a negative compound score means it is negative. The magnitude of the score indicates how intense that sentiment is.

Reference code: https://realpython.com/python-nltk-sentiment-analysis/

In [2]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# The analyzer needs the VADER lexicon; download it once if it is missing
nltk.download('vader_lexicon', quiet=True)

dummy_sentences = [
    "Most reviews are bad.",
    "Some reviews are good.",
    "If reviews are good, it means the movie is really good.",
    "If reviews are bad then movie could be bad or good.",
    "Better to watch movies with more good reviews."
    ]

# Create the analyzer once and reuse it for every sentence
sentiment = SentimentIntensityAnalyzer()

for sentence in dummy_sentences:
    print(sentence)
    score = sentiment.polarity_scores(sentence)
    # Print the four scores in alphabetical order of their keys
    for k in sorted(score):
        print('{0}: {1}, '.format(k, score[k]), end='')
    print()


Most reviews are bad.
compound: -0.5809, neg: 0.556, neu: 0.444, pos: 0.0, 
Some reviews are good.
compound: 0.4404, neg: 0.0, neu: 0.508, pos: 0.492, 
If reviews are good, it means the movie is really good.
compound: 0.7003, neg: 0.0, neu: 0.608, pos: 0.392, 
If reviews are bad then movie could be bad or good.
compound: -0.6249, neg: 0.391, neu: 0.447, pos: 0.162, 
Better to watch movies with more good reviews.
compound: 0.7264, neg: 0.0, neu: 0.496, pos: 0.504, 
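
The sign rule described above can be sketched as a small helper. The function name `sentiment_label` and its optional `threshold` parameter are hypothetical and not part of `nltk`; with the default of `0` it labels purely by the sign of the compound score, as in this homework. The scores are copied from the output above.

```python
def sentiment_label(compound: float, threshold: float = 0.0) -> str:
    """Map a compound score to a label.

    `threshold` is a hypothetical neutral band: scores within
    [-threshold, threshold] are treated as neutral.
    """
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the printed output for the dummy sentences
dummy_scores = {
    "Most reviews are bad.": -0.5809,
    "Some reviews are good.": 0.4404,
    "If reviews are good, it means the movie is really good.": 0.7003,
    "If reviews are bad then movie could be bad or good.": -0.6249,
    "Better to watch movies with more good reviews.": 0.7264,
}

labels = {text: sentiment_label(score) for text, score in dummy_scores.items()}
for text, label in labels.items():
    print(f"{label:>8}: {text}")
```

A non-zero `threshold` would mark weakly scored sentences as neutral instead of forcing them into one of the two classes.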


### Question 2

For this question, we start from the work done in Homework 7 and take the clusters created using `k-Means` with `k=16`. A sample of the clusters is printed below.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

from sklearn.cluster import KMeans


vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(permalinks_review)


k_val = 16
kmeans = KMeans(n_clusters=k_val, init='k-means++', max_iter=200, n_init=10)
kmeans.fit(X)
labels=kmeans.labels_
review_clustered_df=pd.DataFrame(list(zip(permalinks_review,labels)),columns=['review','cluster'])
print(review_clustered_df.sort_values(by=['cluster']))

                                               review  cluster
99  A heavily dysfunctional family are going away ...        0
96  Home Alone will stand the test of time, methin...        0
86  This an absurd storyline. Hollywood loves to s...        0
79  My response to this movie is strictly prejudic...        0
50  Frankenhooker (1990) ** 1/2 (out of 4)More mad...        1
..                                                ...      ...
17  To come back for a sequel to a beloved film is...       14
2   This movie continues from BTTF2. Marty McFly (...       14
18  The final installment in the Back to the Futur...       14
6   I thought this was a fun film, but for reasons...       14
95  I like the bit when he slaps on aftershave and...       15

[100 rows x 2 columns]


Now we pass the clusters through the `vader` sentiment analyzer from Question # 1.

For each cluster we split the text into sentences, score each sentence, and print the `average`, `median`, `high` and `low` of the per-sentence compound scores.

In [17]:
from statistics import mean, median

# Create the analyzer once and reuse it for every cluster
sentiment = SentimentIntensityAnalyzer()

for clusterid in range(k_val):
    # Gather all reviews assigned to this cluster into one text
    cluster_data_list = review_clustered_df.loc[review_clustered_df['cluster'] == clusterid]['review'].tolist()
    cluster_data = "".join(cluster_data_list)

    print("Cluster # ", clusterid+1)

    # Compound score of every sentence in the cluster
    scores = [
        sentiment.polarity_scores(sentence)["compound"]
        for sentence in nltk.sent_tokenize(cluster_data)
    ]

    print("Average:", mean(scores), "Median:", median(scores), "Low:", min(scores), "High:", max(scores))

Cluster #  1
Average: -0.04519466666666667 Median: 0.0 Low: -0.9442 High: 0.92
Cluster #  2
Average: 0.10699528795811518 Median: 0.0258 Low: -0.974 High: 0.9455
Cluster #  3
Average: 0.16936570512820512 Median: 0.2023 Low: -0.8604 High: 0.9974
Cluster #  4
Average: 0.3229175 Median: 0.45940000000000003 Low: -0.8271 High: 0.9912
Cluster #  5
Average: 0.20986315789473683 Median: 0.0 Low: -0.7845 High: 0.9834
Cluster #  6
Average: 0.2874 Median: 0.20885 Low: 0.0 High: 0.8479
Cluster #  7
Average: 0.11902321428571429 Median: 0.0 Low: -0.842 High: 0.9485
Cluster #  8
Average: -0.15586785714285714 Median: -0.0386 Low: -0.8977 High: 0.8705
Cluster #  9
Average: 0.14957777777777778 Median: 0.0 Low: -0.7003 High: 0.8304
Cluster #  10
Average: 0.323225 Median: 0.27115 Low: 0.0 High: 0.7506
Cluster #  11
Average: 0.2845179487179487 Median: 0.3071 Low: -0.9561 High: 0.9569
Cluster #  12
Average: 0.24975333333333333 Median: 0.2724 Low: -0.6467 High: 0.9712
Cluster #  13
Average: 0.55015 Median: 0.5

Based on the scores printed above, the average score is positive for most clusters. For cluster `1` and cluster `8`, however, the average score is negative. The median is also negative for cluster `8` (-0.0386), while for cluster `1` it is exactly `0`. This tells us that the overall sentiment of these two clusters leans `Negative`, more clearly so for cluster `8`.

Similarly, the average score for clusters `5`, `7` and `9` is positive, but the median score is `0`, which tells us their overall sentiment is more `neutral` than `positive`. Clusters `6` and `10` have a low of `0`, so none of their sentences scored negative.
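
To see how a cluster can have a positive average and a median of `0`, here is a minimal sketch with made-up per-sentence scores (they are not taken from the reviews). When most sentences score exactly `0`, a few strongly positive ones pull the mean up while the median stays at `0`.

```python
from statistics import mean, median

# Hypothetical per-sentence compound scores: mostly neutral, a few opinionated
scores = [0.0, 0.0, 0.0, 0.0, 0.6, 0.8, -0.2]

avg = mean(scores)
med = median(scores)

# Share of sentences that the analyzer scored as exactly neutral
neutral_share = sum(1 for s in scores if s == 0) / len(scores)

print("Average:", avg, "Median:", med, "Neutral share:", neutral_share)
```

Reading the median alongside the average therefore gives a quick check on whether a positive average reflects most sentences or only a handful of them.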

Among the clusters shown, cluster `13` has the highest average score and also a high median score. This cluster clearly holds the most `positive` reviews.

We can cross-check this analysis against the `vader` scores for each cluster's full text, including the `compound`, `positive` and `negative` scores.

In [18]:
# Create the analyzer once and reuse it for every cluster
sentiment = SentimentIntensityAnalyzer()

for clusterid in range(k_val):
    # Gather all reviews assigned to this cluster into one text
    cluster_data_list = review_clustered_df.loc[review_clustered_df['cluster'] == clusterid]['review'].tolist()
    cluster_data = "".join(cluster_data_list)

    # Score the whole cluster text at once
    score = sentiment.polarity_scores(cluster_data)
    print("\nCluster # ", clusterid+1)
    for k in sorted(score):
        print('{0}: {1}, '.format(k, score[k]), end='')


Cluster #  1
compound: -0.9791, neg: 0.144, neu: 0.725, pos: 0.13, 
Cluster #  2
compound: 0.9999, neg: 0.103, neu: 0.737, pos: 0.159, 
Cluster #  3
compound: 1.0, neg: 0.095, neu: 0.722, pos: 0.183, 
Cluster #  4
compound: 0.9999, neg: 0.059, neu: 0.741, pos: 0.2, 
Cluster #  5
compound: 0.9996, neg: 0.099, neu: 0.718, pos: 0.183, 
Cluster #  6
compound: 0.9432, neg: 0.04, neu: 0.81, pos: 0.15, 
Cluster #  7
compound: 0.999, neg: 0.129, neu: 0.693, pos: 0.178, 
Cluster #  8
compound: -0.9922, neg: 0.199, neu: 0.664, pos: 0.137, 
Cluster #  9
compound: 0.9348, neg: 0.033, neu: 0.874, pos: 0.093, 
Cluster #  10
compound: 0.8805, neg: 0.0, neu: 0.787, pos: 0.213, 
Cluster #  11
compound: 0.9997, neg: 0.103, neu: 0.698, pos: 0.199, 
Cluster #  12
compound: 0.9997, neg: 0.086, neu: 0.715, pos: 0.199, 
Cluster #  13
compound: 0.8016, neg: 0.0, neu: 0.582, pos: 0.418, 
Cluster #  14
compound: 0.9966, neg: 0.029, neu: 0.691, pos: 0.28, 
Cluster #  15
compound: 0.9999, neg: 0.055, neu: 0.786,

The above scores confirm our analysis: clusters `1` and `8` have negative sentiment, and the other clusters shown have positive sentiment.
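
One caveat when reading the whole-cluster scores: the `compound` values sit very close to `1` or `-1`. A likely reason is the way `vader` squashes the summed word valences into the range -1 to 1, which saturates on long texts. The sketch below reimplements that squashing step as we understand it, x / sqrt(x² + alpha) with alpha = 15. The function name `squash` and the input sums are made up for illustration.

```python
import math


def squash(valence_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded valence sum into (-1, 1).

    alpha=15 is assumed to match the default used by vader.
    """
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# Hypothetical valence sums: a short sentence versus progressively longer texts
for total in (2.0, 10.0, 50.0, -50.0):
    print(total, round(squash(total), 4))
```

Because long texts accumulate large sums, their compound scores mostly show the direction of the sentiment, whereas the per-sentence averages computed earlier give a more graded picture.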

#### -End of Homework 8-