# NLP - HW8
### Miguel Bonilla

- [1. Load Sentiment Vocabulary](#1.-Load-Sentiment-Vocabulary)
    - [a. Normalize Data](#a.-Normalize-Data)
    - [b. Vader Sentiment Analyzer](#b.-Vader-Sentiment-Analyzer)
- [2. Compute Sentiment Scores for Each Cluster](#2.-Compute-Sentiment-Scores-for-Each-Cluster)
- [3. Sentiment Analysis of Chunks](#3.-Sentiment-Analysis-of-Chunks)

Perform a vocabulary-based sentiment analysis of the movie reviews you used in homework 5 and
homework 7, by doing the following:
1. In Python, load one of the sentiment vocabularies referenced in the textbook, and run the
sentiment analyzer as explained in the corresponding reference. Add words to the
sentiment vocabulary, if you think you need to, to better fit your particular text collection.
2. For each of the clusters you created in homework 7, compute the average, median, high,
and low sentiment scores for each cluster. Explain whether you think this reveals anything
interesting about the clusters.
3. For extra credit, analyze sentiment of chunks as follows:
     - a. Take the chunks from homework 5, and in Python, run each chunk individually
through your sentiment analyzer that you used in question 1. If the chunk registers
a nonneutral sentiment, save it in a tabular format (the chunk, the sentiment score).
     - b. Now sort the table twice, once to show the highest negative-sentiment-scoring
chunks at the top and again to show the highest positive-sentiment-scoring chunks
at the top. Examine the upper portions of both sorted lists, to identify any trends,
and explain what you see.

Submit all of your inputs and outputs and your code for this assignment, along with a brief written
explanation of your findings.

In [1]:
import pandas as pd
import numpy as np
from requests import get
from bs4 import BeautifulSoup
import contractions
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import nltk
from unidecode import unidecode
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from collections import Counter
import ast

### 1. Load Sentiment Vocabulary
#### a. Normalize Data

In [2]:
### assign headers, since IMDB rejects requests that lack a User-Agent
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'}

In [3]:
## load static URL list (from HW5)
url_list = pd.read_csv("https://raw.githubusercontent.com/boneeyah/DS7337/main/mb_hw5_urls.csv")

In [4]:
# function goes through the table with the URLs and requests each direct URL
# parses the content of each page to grab the main review
# returns a dataframe with the movie title, review id, and the raw review text
def grab_review(links_table):
    text = []
    for i in range(len(links_table)):
        review = get(links_table.url[i], headers=headers) # headers must be passed by keyword; the second positional argument is params
        review_soup = BeautifulSoup(review.content, 'html.parser')
        text.append(review_soup.find(class_='text show-more__control').text)
    return(pd.DataFrame({'movie':links_table.movie,
                         'review':links_table.review,
                         'text':text                         
                        }))

In [5]:
review_text = grab_review(url_list)

In [6]:
special = ['\x96',':',',','-','(',')','[',']','–','/','#','``',';','.','&','"',"''",'?','!','....','--','...','*','..',"'"]
stop_words = nltk.corpus.stopwords.words('english') + ['movie','film','horror', 'thing','quiet','place','alien','covenant','shining','films','john','one']
special = stop_words + ["'s","'t","'d","'ll","'m","'re","'ve","n't"] + special

In [7]:
def normalize_list(review_list):
    expanded_words = [contractions.fix(term) for term in review_list] #expands contractions
    term = [word_tokenize(term.lower()) for term in expanded_words] #tokenizes the expanded strings
    blank_list = []
    for i in range(len(review_list)):
        blank_list.append(' '.join([unidecode(w) for w in term[i] if w not in special and w.isalpha()])) #removes special characters numbers, puts string back together
    return(np.array(blank_list))

In [8]:
review_text['norm_text'] = normalize_list(review_text.text)

In [9]:
cv = CountVectorizer(ngram_range=(1,2), max_df=.8,min_df=20)
cv_matrix = cv.fit_transform(review_text.norm_text)
cv_matrix.shape

(100, 68)

In [10]:
NUM_Clusters = 4
km4 = KMeans(n_clusters=NUM_Clusters, max_iter=1000,n_init=500,random_state=326).fit(cv_matrix)
km4

In [11]:
Counter(km4.labels_)

Counter({0: 20, 2: 66, 3: 4, 1: 10})

In [12]:
review_text['cluster'] = km4.labels_

#### b. Vader Sentiment Analyzer

In [13]:
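def sent_analyzer(review,threshold,verbose):
    # scores each string with VADER and labels it from the compound score
    # note: `verbose` is accepted but not used
    analyzer = SentimentIntensityAnalyzer() # requires the NLTK 'vader_lexicon' resource
    pol_scores = []
    comb_score = []
    for i in range(len(review)):
        score_list = analyzer.polarity_scores(review[i]) # dict with 'neg', 'neu', 'pos', 'compound'
        agg_score = score_list['compound']
        comb_score.append(agg_score)
        # positive at or above +threshold, negative at or below -threshold, neutral otherwise
        pol_scores.append('positive' if agg_score >=threshold else 'negative' if agg_score <=-threshold else 'neutral')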
    return(pd.DataFrame({'compound score':comb_score,'sentiment':pol_scores}))
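
The labeling rule inside `sent_analyzer` can be read in isolation. The following is a minimal sketch, not part of the original notebook; `label_sentiment` is a hypothetical helper name and the scores in the loop are illustrative values chosen to reach each branch.

```python
def label_sentiment(compound, threshold=0.1):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# illustrative scores only
for score in (0.9294, -0.9619, 0.05, -0.1027):
    print(score, label_sentiment(score))
```

With a threshold of .1, any compound score strictly between -0.1 and 0.1 is treated as neutral.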
        

In [14]:
agg_list = sent_analyzer(review_text.norm_text,.1,True)
review_text = review_text.join(agg_list)
review_text

,movie,review,text,norm_text,cluster,compound score,sentiment
0,The_Thing,The_Thing_0,"""I know I'm human. And if you were all these t...",know human things would attack right still hum...,0,0.9294,positive
1,The_Thing,The_Thing_1,"A classic film. John Carpenter's ""The Thing"" i...",classic carpenter entertaining ever made fast ...,0,0.9989,positive
2,The_Thing,The_Thing_2,John Carpenter shows how much he loves the 195...,carpenter shows much loves original giving utm...,2,0.8807,positive
3,The_Thing,The_Thing_3,"""The ultimate in alien terror,"" it says. It's ...",ultimate terror says wrong greatest ever made ...,2,0.9829,positive
4,The_Thing,The_Thing_4,"Remake of the classic 1951 ""The Thing From Ano...",remake classic another world men completely is...,2,-0.9619,negative
...,...,...,...,...,...,...,...
95,The_Shining,The_Shining_20,Horror films are not my cup of tea. But The Sh...,cup tea gotten reconsider preference incredibl...,1,0.9777,positive
96,The_Shining,The_Shining_21,The Shining is directed by Stanley Kubrick and...,directed stanley kubrick based novel steven ki...,2,0.9915,positive
97,The_Shining,The_Shining_22,The Shining (1980) is a movie in my DVD collec...,dvd collection recently rewatched hbomax story...,1,0.9923,positive
98,The_Shining,The_Shining_23,"*!!- SPOILERS - !!*Before I begin this, let me...",spoilers begin let say advantages seeing big s...,3,0.9967,positive


The VADER sentiment analyzer appears to assign reasonable sentiment scores to the normalized corpus. No additional words were added to the VADER lexicon.

### 2. Compute Sentiment Scores for Each Cluster

In [15]:
review_text.groupby('cluster').describe(percentiles = [.5])

Summary of `compound score` by cluster:
cluster,count,mean,std,min,50%,max
0,20.0,0.39899,0.867324,-0.9777,0.9529,0.9989
1,10.0,0.60632,0.616949,-0.8925,0.92605,0.9923
2,66.0,0.011556,0.849469,-0.9915,0.0009,0.9933
3,4.0,0.508025,0.972619,-0.9509,0.99315,0.9967


The results give some insight into the sentiment of each cluster. The most distinct cluster is cluster 2 (N=66), whose median (0.0009) and mean (0.012) sentiment scores are both close to zero. Clusters 0 (N=20), 1 (N=10), and 3 (N=4) have very similar minimum and maximum sentiment scores and only slightly different medians (0.95, 0.93, and 0.99, respectively), but their mean sentiment scores differ more (0.40, 0.61, and 0.51, respectively).
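
The four statistics requested in the assignment (average, median, high, and low) can also be computed directly on the score column. The following is a minimal sketch on a small made-up frame; `demo` and its values are illustrative and are not the review data.

```python
import pandas as pd

# illustrative data only; in the notebook this would be review_text
demo = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1],
    'compound score': [0.9, -0.5, 0.2, 0.4, 0.6],
})

# mean, median, high, and low compound score per cluster
summary = (demo.groupby('cluster')['compound score']
               .agg(['mean', 'median', 'max', 'min']))
print(summary)
```

Selecting the score column before aggregating keeps the output to one row per cluster and one column per statistic.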

### 3. Sentiment Analysis of Chunks

The VADER sentiment analyzer takes plain strings as input and does not take POS tags into account. We therefore start by reconstructing each chunk as a string: iterate through the tuples in a chunk, take the first element (the word) from each tuple, and join the words together.
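
For a single chunk, the reconstruction can be sketched as follows. This assumes each chunk is stored as the string form of a list of `(word, tag)` tuples; `chunk_to_text` is a hypothetical helper name used only for illustration.

```python
import ast

def chunk_to_text(chunk_str):
    """Turn a stringified list of (word, POS tag) tuples into a plain string."""
    chunk = ast.literal_eval(chunk_str)  # parse the string into a list of tuples
    return ' '.join(word for word, _tag in chunk)  # keep only the words

print(chunk_to_text("[('John', 'NNP'), ('Carpenter', 'NNP')]"))
```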

In [16]:
np_chunks = pd.read_csv("https://raw.githubusercontent.com/boneeyah/DS7337/main/mb_hw5_npchunks.csv")

In [17]:
review_id = np_chunks.iloc[:,0]
np_chunks = np_chunks['NP-chunks']

In [18]:
np_chunks

0             [('This', 'DT'), ('thing', 'NN')]
1                            [('inside', 'NN')]
2           [('an', 'DT'), ('imitation', 'NN')]
3                            [('nobody', 'NN')]
4       [('John', 'NNP'), ('Carpenter', 'NNP')]
                         ...                   
7691                          [('screw', 'NN')]
7692                          [('cause', 'NN')]
7693                          [('death', 'NN')]
7694                         [('mayhem', 'NN')]
7695            [('the', 'DT'), ('film', 'NN')]
Name: NP-chunks, Length: 7696, dtype: object

In [19]:
total_list = []
for chunk_str in np_chunks:
    chunk = ast.literal_eval(chunk_str) # parse the stored string into a list of (word, tag) tuples once
    total_list.append(' '.join(word for word, tag in chunk)) # keep only the words

In [20]:
np_chunks_sent = pd.DataFrame({'review_id':review_id,'review':total_list})
np_chunks_sent = np_chunks_sent.join(sent_analyzer(np_chunks_sent['review'],.1,True))

In [21]:
np_chunks_sent

,review_id,review,compound score,sentiment
0,The_Thing_0,This thing,0.0000,neutral
1,The_Thing_0,inside,0.0000,neutral
2,The_Thing_0,an imitation,0.0000,neutral
3,The_Thing_0,nobody,0.0000,neutral
4,The_Thing_0,John Carpenter,0.0000,neutral
...,...,...,...,...
7691,The_Shining_24,screw,-0.1027,negative
7692,The_Shining_24,cause,0.0000,neutral
7693,The_Shining_24,death,-0.5994,negative
7694,The_Shining_24,mayhem,0.0000,neutral
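
The assignment asks that only chunks with a nonneutral sentiment be saved. A minimal sketch of that filter, using a few rows copied from the output above; the names `rows` and `nonneutral` are assumptions for illustration.

```python
import pandas as pd

# a few rows copied from the chunk table
rows = pd.DataFrame({
    'review_id': ['The_Thing_0', 'The_Shining_24', 'The_Shining_24', 'The_Shining_24'],
    'review': ['This thing', 'screw', 'cause', 'death'],
    'compound score': [0.0, -0.1027, 0.0, -0.5994],
    'sentiment': ['neutral', 'negative', 'neutral', 'negative'],
})

# keep only chunks labeled positive or negative
nonneutral = rows[rows['sentiment'] != 'neutral'].reset_index(drop=True)
print(nonneutral)
```

Applied to the full table, the same filter would leave neutral chunks out of the sorted lists and the saved file.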


In [22]:
np_chunks_sent.sort_values("compound score", ascending=False).head()

,review_id,review,compound score,sentiment
6484,The_Shining_13,a brilliant cinematic masterpiece,0.836,positive
3842,Alien_Covenant_2,Heaven Paradise,0.8176,positive
1664,The_Thing_19,a true masterpiece,0.7845,positive
7154,The_Shining_22,A true masterpiece,0.7845,positive
2873,A_Quiet_Place_13,great talent,0.7845,positive


In [23]:
np_chunks_sent.sort_values("compound score").head()

,review_id,review,compound score,sentiment
3743,Alien_Covenant_2,mad killer,-0.8176,negative
2959,A_Quiet_Place_15,worst problem,-0.7783,negative
2821,A_Quiet_Place_13,unfair criticism,-0.7184,negative
4239,Alien_Covenant_6,panicky idiot,-0.7003,negative
4954,Alien_Covenant_17,either kill,-0.6908,negative


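A possible improvement would be to add genre-specific words such as 'horror', 'frightening', 'fear', and 'scary' to the lexicon as positive terms, since in reviews of horror movies they likely signal praise rather than criticism.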
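
NLTK's `SentimentIntensityAnalyzer` exposes its vocabulary as a `lexicon` dictionary that maps words to valence scores, so such an adjustment could be sketched as below. This was not run in the notebook; `HORROR_TERMS`, `add_domain_terms`, and the valence values are assumptions for illustration and would need tuning.

```python
# Hypothetical valences on VADER's -4 (most negative) to +4 (most positive) scale.
HORROR_TERMS = {'horror': 1.5, 'frightening': 1.5, 'fear': 1.0, 'scary': 1.5}

def add_domain_terms(analyzer, terms):
    """Add or override entries on an analyzer that exposes a `lexicon` dict."""
    analyzer.lexicon.update(terms)
    return analyzer

# Usage with NLTK (requires the 'vader_lexicon' resource):
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# analyzer = add_domain_terms(SentimentIntensityAnalyzer(), HORROR_TERMS)
# analyzer.polarity_scores('a scary masterpiece')
```

Because `sent_analyzer` creates its own analyzer internally, it would need to accept or build the adjusted analyzer for the new terms to take effect.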

In [24]:
np_chunks_sent.to_csv('MB_hw8_chunk_sent.csv')
review_text.to_csv('MB_h8_review_sent.csv')
review_text.groupby('cluster').describe(percentiles = [.5]).to_csv('MB_h8_sent_agg.csv')