# NLP - HW7
### Miguel Bonilla

In [1]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from requests import get
import re
import contractions
from sklearn.feature_extraction.text import CountVectorizer

- [1. Clustering with K-Means](#1.-Clustering-with-K-Means)
    - [a. Loading and Normalizing Data](#a.-Loading-and-Normalizing-Data)
    - [b. K-Means with K = 6](#b.-K-Means-with-K-=-6)
    - [c. K-Means with K = 4](#c.-K-Means-with-K-=-4)
- [2. Characterize Each Cluster](#2.-Characterize-Each-Cluster)
- [3. Explain Which Result is Preferable](#3.-Explain-Which-Result-is-Preferable)

Cluster the reviews that you collected in homework 5, by doing the following:  
1. In Python, select any one of the clustering methods covered in this course. Run it over the
collection of reviews, and show at least two different ways of clustering the reviews, e.g.,
changing k in k-Means clustering or changing where you “cut” in Agnes or Diana.  
2. Try to write a short phrase to characterize (give a natural interpretation of) what each
cluster is generally centered on semantically. Is this hard to do in some cases? If so, make
note of that fact.  
3. Explain which of the two clustering results from question 1 is preferable (if one of them is),
and why.  
Submit all of your inputs and outputs and your code for this assignment, along with a brief written
explanation of your findings.

### 1. Clustering with K-Means
#### a. Loading and Normalizing Data

In [2]:
# Set a browser-like User-Agent header, since IMDb rejects requests without one
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'}

In [3]:
## load static URL list (from HW5)

url_list = pd.read_csv("https://raw.githubusercontent.com/boneeyah/DS7337/main/mb_hw5_urls.csv")

In [4]:
# Walks the URL table, requests each review page, and parses the response
# to extract the main review text.
# Returns a dataframe with the movie title, review id, and review text.
def grab_review(links_table):
    text = []
    for i in range(len(links_table)):
        # headers must be passed by keyword; the second positional argument of get() is params
        review = get(links_table.url[i], headers=headers)
        review_soup = BeautifulSoup(review.content, 'html.parser')
        text.append(review_soup.find(class_='text show-more__control').text)
    return pd.DataFrame({'movie': links_table.movie,
                         'review': links_table.review,
                         'text': text})

In [5]:
review_text = grab_review(url_list)

In [6]:
special = ['\x96',':',',','-','(',')','[',']','–','/','#','``',';','.','&','"',"''",'?','!','....','--','...','*','..',"'"]
stop_words = nltk.corpus.stopwords.words('english') + ['movie','film','horror', 'thing','quiet','place','alien','covenant','shining','films','john','one']
special = stop_words + ["'s","'t","'d","'ll","'m","'re","'ve","n't"] + special

In [7]:
def normalize_list(review_list):
    # lowercase and tokenize each review, drop stop words and punctuation, then rejoin
    exclude = set(special)
    normalized = []
    for review in review_list:
        tokens = word_tokenize(review.lower())
        normalized.append(' '.join(w for w in tokens if w not in exclude))
    return np.array(normalized)

In [8]:
norm_corpus = normalize_list(review_text.text)

#### b. K-Means with K = 6

In [9]:
cv = CountVectorizer(ngram_range=(1,2), max_df=.8,min_df=20)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix.shape

(100, 77)

In [10]:
from sklearn.cluster import KMeans
NUM_Clusters = 6
km = KMeans(n_clusters=NUM_Clusters, max_iter=1000,n_init=500,random_state=326).fit(cv_matrix)
km

In [11]:
from collections import Counter
Counter(km.labels_)

Counter({0: 11, 3: 61, 1: 15, 4: 2, 5: 10, 2: 1})

In [12]:
review_text['kmeans_cluster'] = km.labels_

In [13]:
movie_clusters = (review_text[['movie','review','kmeans_cluster']].sort_values(by='kmeans_cluster',ascending=False))

In [14]:
feature_names = cv.get_feature_names_out()
topn_features = 15
ordered_centroids = km.cluster_centers_.argsort()[:,::-1]
for cluster_num in range(NUM_Clusters):
    key_features = [feature_names[index] for index in ordered_centroids[cluster_num,:topn_features]]
    movies = movie_clusters[movie_clusters['kmeans_cluster'] == 
                           cluster_num]['movie'].value_counts().to_string()
    print('CLUSTER#'+str(cluster_num+1))
    print('Key Features:',key_features)
    print('Movies:\n', movies,sep='')
    print('---------------------')

CLUSTER#1
Key Features: ['effects', 'also', 'time', 'ever', 'best', 'good', 'seen', 'see', 'story', 'much', 'made', 'first', 'the', 'years', 'even']
Movies:
The_Thing         8
A_Quiet_Place     2
Alien_Covenant    1
---------------------
CLUSTER#2
Key Features: ['would', 'like', 'make', 'even', 'could', 'get', 'people', 'movies', 'time', 'story', 'old', 'good', 'made', 'go', 'around']
Movies:
A_Quiet_Place     8
Alien_Covenant    5
The_Shining       1
The_Thing         1
---------------------
CLUSTER#3
Key Features: ['jack', 'the', 'also', 'scene', 'would', 'scary', 'like', 'way', 'say', 'even', 'could', 'nicholson', 'family', 'done', 'little']
Movies:
The_Shining    1
---------------------
CLUSTER#4
Key Features: ['get', 'time', 'like', 'characters', 'even', 'see', 'well', 'great', 'good', 'really', 'story', 'would', 'the', 'still', 'much']
Movies:
Alien_Covenant    19
The_Thing         16
A_Quiet_Place     15
The_Shining       11
---------------------
CLUSTER#5
Key Features: ['jack'

#### c. K-Means with K = 4

In [15]:
NUM_Clusters = 4
km2 = KMeans(n_clusters=NUM_Clusters, max_iter=1000,n_init=500,random_state=326).fit(cv_matrix)
km2

In [16]:
Counter(km2.labels_)

Counter({2: 20, 1: 66, 3: 4, 0: 10})

In [17]:
df = review_text.copy(deep=True)
df['kmeans_cluster'] = km2.labels_

In [18]:
movie_clusters2 = (df[['movie','review','kmeans_cluster']].sort_values(by='kmeans_cluster',ascending=False))

In [19]:
topn_features = 10
ordered_centroids2 = km2.cluster_centers_.argsort()[:,::-1]
for cluster_num in range(NUM_Clusters):
    key_features = [feature_names[index] for index in ordered_centroids2[cluster_num,:topn_features]]
    movies = movie_clusters2[movie_clusters2['kmeans_cluster'] == 
                           cluster_num]['movie'].value_counts().to_string()
    print('CLUSTER#'+str(cluster_num+1))
    print('Key Features:',key_features)
    print('Movies:\n', movies,sep='')
    print('---------------------')

CLUSTER#1
Key Features: ['jack', 'nicholson', 'the', 'family', 'best', 'time', 'like', 'us', 'well', 'ever']
Movies:
The_Shining    10
---------------------
CLUSTER#2
Key Features: ['get', 'like', 'time', 'would', 'characters', 'good', 'well', 'even', 'the', 'see']
Movies:
Alien_Covenant    20
The_Thing         18
A_Quiet_Place     17
The_Shining       11
---------------------
CLUSTER#3
Key Features: ['would', 'even', 'like', 'time', 'story', 'made', 'make', 'good', 'much', 'also']
Movies:
A_Quiet_Place     8
The_Thing         7
Alien_Covenant    5
---------------------
CLUSTER#4
Key Features: ['jack', 'scene', 'also', 'way', 'like', 'family', 'story', 'the', 'well', 'better']
Movies:
The_Shining    4
---------------------


### 2. Characterize Each Cluster

#### a. Cluster Interpretation K = 6

- Cluster 1: Centered on reviews that describe a film as one of the best ever made, or the best in years, with great effects.
- Cluster 2: Centered on reviews that are very generic, with less specificity than reviews in other clusters.
- Cluster 3: Centered on a single review that talks about Jack Nicholson, deals with family, and discusses a scary scene.
- Cluster 4: Centered on reviews that mention great characters and a good story.
- Cluster 5: Centered on reviews that mention Jack, deal with family, and mention a specific scene.
- Cluster 6: Centered on reviews that focus on Jack Nicholson and mention that the film has a great story and is scary.

#### b. Cluster Interpretation K = 4

- Cluster 1: This cluster is centered on reviews that focus on Jack Nicholson and mention the family.
- Cluster 2: This cluster is centered on reviews that focus on the movie having good characters.
- Cluster 3: This cluster is centered on reviews that focus on the movie having a good story.
- Cluster 4: This cluster is centered on reviews of The Shining that focus on Jack (Nicholson), a specific scene, and the story.
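
Several clusters share the same generic top terms ('like', 'time', 'would'), which makes them hard to characterize from the centroid ranking alone. One possible aid, sketched below, is to rank terms by how much more weight they carry in one centroid than in the others. The helper `distinctive_terms` and the toy centroid values are hypothetical and made up for illustration; they are not taken from the fitted models.

```python
import numpy as np

def distinctive_terms(centroids, feature_names, topn=5):
    """For each cluster, rank terms by centroid weight minus the mean
    weight of that term across all other clusters."""
    centroids = np.asarray(centroids, dtype=float)
    result = []
    for c in range(centroids.shape[0]):
        # average centroid of every cluster except c
        others = np.delete(centroids, c, axis=0).mean(axis=0)
        contrast = centroids[c] - others
        top = np.argsort(contrast)[::-1][:topn]
        result.append([feature_names[i] for i in top])
    return result

# Hypothetical centroids (illustrative numbers only)
toy_features = ['like', 'time', 'jack', 'effects', 'characters']
toy_centroids = [
    [1.0, 0.9, 2.0, 0.1, 0.2],
    [1.1, 1.0, 0.0, 1.5, 0.3],
    [0.9, 1.0, 0.1, 0.2, 1.4],
]

toy_result = distinctive_terms(toy_centroids, toy_features, topn=2)
for i, terms in enumerate(toy_result, start=1):
    print(f'CLUSTER#{i}:', terms)
```

With the notebook's objects, the call would presumably be `distinctive_terms(km2.cluster_centers_, feature_names)`, which should push shared terms down the list in favor of cluster-specific ones.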

### 3. Explain Which Result is Preferable

The result with 4 clusters is preferable in this case. While both the K = 6 and K = 4 clustering models return results that mostly separate "The Shining" from the other movies, the K = 4 model does a better job of defining that separation, with clusters that appear to be semantically more distinct than the K = 6 clusters.
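
The judgment above is qualitative. One way it could be checked numerically is with a silhouette score, which is higher when clusters are tighter and better separated. The sketch below is only an illustration: it uses synthetic data from `make_blobs` as a stand-in for `cv_matrix`, and a smaller `n_init` than the notebook, so its numbers say nothing about the review corpus itself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for cv_matrix (assumption: the real matrix is not reproduced here)
X, _ = make_blobs(n_samples=100, n_features=5, centers=4,
                  cluster_std=1.0, random_state=326)

scores = {}
for k in (4, 6):
    # fit K-Means and score the resulting partition
    labels = KMeans(n_clusters=k, n_init=10, random_state=326).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f'K={k}: silhouette = {scores[k]:.3f}')
```

On the real data, replacing `X` with `cv_matrix` and reusing `km.labels_` and `km2.labels_` would give one score per model to set beside the interpretation of the key features.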

In retrospect, three of the previously selected movies are very similar in theme and plot. "The Thing", "Alien Covenant", and "A Quiet Place" all have alien monsters that drive the horror, which makes "The Shining" the most different of the four movies. The K = 4 model does a better job of creating clusters that show this difference.
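
How well a clustering separates "The Shining" can be summarized from the per-cluster movie counts. The sketch below uses the counts printed in the K = 4 output and computes two simple figures: the share of "The Shining" reviews that land in clusters containing only that movie, and overall purity against the movie labels. The helper names `isolated_share` and `purity` are hypothetical. The K = 6 counts are left out because that output is truncated.

```python
# Per-cluster movie counts copied from the K = 4 output
k4_counts = {
    1: {'The_Shining': 10},
    2: {'Alien_Covenant': 20, 'The_Thing': 18, 'A_Quiet_Place': 17, 'The_Shining': 11},
    3: {'A_Quiet_Place': 8, 'The_Thing': 7, 'Alien_Covenant': 5},
    4: {'The_Shining': 4},
}

def isolated_share(cluster_counts, movie):
    """Fraction of a movie's reviews that fall in clusters containing only that movie."""
    total = sum(c.get(movie, 0) for c in cluster_counts.values())
    isolated = sum(c[movie] for c in cluster_counts.values() if set(c) == {movie})
    return isolated / total

def purity(cluster_counts):
    """Fraction of all reviews that belong to the majority movie of their cluster."""
    n = sum(sum(c.values()) for c in cluster_counts.values())
    majority = sum(max(c.values()) for c in cluster_counts.values())
    return majority / n

print(f"Shining isolated share: {isolated_share(k4_counts, 'The_Shining'):.2f}")  # 0.56
print(f"Purity: {purity(k4_counts):.2f}")  # 0.42
```

The low purity reflects the large mixed cluster of 66 reviews, while the isolated share shows that 14 of the 25 "The Shining" reviews end up in clusters of their own.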