# Question

Use the IMDB Movie Review dataset to perform the clustering process and identify the popular terms in the clusters.

# Movie Review Dataset

In this notebook, we perform `K-Means` clustering on the `IMDB 50K Movie Review Dataset` and identify the most popular terms in each cluster.

* [Reference](https://medium.com/@MSalnikov/text-clustering-with-k-means-and-tf-idf-f099bcf95183)

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score 

import numpy as np
import pandas as pd

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
nltk.download('punkt')

import re

import pickle

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Import the Dataset

In this section, we import the dataset and analyse it.

In [4]:
df = pd.read_csv("IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [6]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,negative
freq,5,25000


In [7]:
# Convert the sentiment column into numerical values

sentiment_map = {
    'positive' : 0,
    'negative' : 1
}

df['sentiment'] = df['sentiment'].map(sentiment_map)

In [8]:
# Get the number of output classes

num_target_classes = len(df['sentiment'].unique())

In [9]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,0
1,A wonderful little production. <br /><br />The...,0
2,I thought this was a wonderful way to spend ti...,0
3,Basically there's a family where a little boy ...,1
4,"Petter Mattei's ""Love in the Time of Money"" is...",0


## 2. Pre-Processing Function

In [10]:
def preprocess(review) :
    
    # Remove HTML tags
    TAG_RE = re.compile(r'<[^>]+>')
    review = TAG_RE.sub('', review)
    
    # Remove punctuations and numbers
    review = re.sub('[^a-zA-Z]', ' ', review)

    # Single character removal
    review = re.sub(r"\s+[a-zA-Z]\s+", ' ', review)

    # Removing multiple spaces
    review = re.sub(r'\s+', ' ', review)
    
    # Convert to lower case
    review = review.lower()
    
    # Delete extra spaces
    review = review.strip()
    
    # Delete stop words
    stop_words = set(stopwords.words("english"))
    words = nltk.tokenize.word_tokenize(review)
    filtered_words = [word for word in words if word not in stop_words]
    review = " ".join(filtered_words)
    
    # Return the processed text
    return review
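
To make the effect of each cleaning step concrete, here is a minimal sketch that mirrors the regular expressions above on one sample sentence. It is a simplification: `preprocess_demo` and `DEMO_STOP_WORDS` are hypothetical names, the tiny stop-word set stands in for NLTK's English list, and `str.split()` stands in for `word_tokenize`, so that the snippet runs without downloading NLTK data.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption for this sketch)
DEMO_STOP_WORDS = {"a", "the", "is"}

def preprocess_demo(review):
    review = re.sub(r'<[^>]+>', '', review)          # remove HTML tags
    review = re.sub('[^a-zA-Z]', ' ', review)        # replace punctuation and digits with spaces
    review = re.sub(r"\s+[a-zA-Z]\s+", ' ', review)  # drop single letters surrounded by whitespace
    review = re.sub(r'\s+', ' ', review)             # collapse repeated whitespace
    review = review.lower().strip()
    # str.split() stands in for nltk.tokenize.word_tokenize here
    words = [w for w in review.split() if w not in DEMO_STOP_WORDS]
    return " ".join(words)

sample = "A wonderful little production. <br /><br />The filming technique is 10/10!"
cleaned = preprocess_demo(sample)
print(cleaned)  # wonderful little production filming technique
```

Note that the leading "A" survives the single-character step (it has no whitespace before it) and is only removed later as a stop word.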

## 3. TF-IDF Vectorization

In [11]:
# Initialize the vectorizer instance with the pre-processing function written above

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocess)

In [12]:
# Convert the reviews in the dataset into a list to feed into the vectorizer

reviews_list = df['review'].tolist()

In [13]:
# Apply the vectorizer on the reviews

tfidf = tfidf_vectorizer.fit_transform(reviews_list)
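
As a rough illustration of what `fit_transform` returns, the sketch below vectorizes a three-document toy corpus (the corpus and the `toy_*` names are assumptions made for this example, not part of the notebook). It uses the default `TfidfVectorizer` settings rather than the custom preprocessor, and shows that the result has one row per document, one column per vocabulary term, and L2-normalized rows.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, assumed purely for illustration
toy_reviews = ["good movie", "bad movie", "good film good story"]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_reviews)  # sparse matrix

# vocabulary_ maps each term to its column index
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_tfidf.shape)  # (number of documents, number of terms)

# With the default norm='l2', every row has unit Euclidean length
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)
```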

## 4. K Means Clustering

In [14]:
# Initialize the model

model = KMeans(n_clusters=num_target_classes)

In [15]:
# Fit the model on the prepared tfidf

model.fit(tfidf)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
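
The fitted estimator above reports `random_state=None`, so centroid initialization, and therefore the cluster numbering, may differ between runs. A minimal sketch of one way to make a run repeatable is to pass a fixed `random_state`; the toy points and the seed value `0` are assumptions for this example, not settings used in the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups (assumed data, for illustration only)
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# A fixed seed makes the initialization, and hence the labels, repeatable
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = toy_model.fit_predict(points)

print(labels)
print(toy_model.cluster_centers_)
```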

In [16]:
# Save the trained model

with open("q1_model.pkl", "wb") as f:
    pickle.dump(model, f)
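
For completeness, here is a small sketch of the save-and-reload round trip on a toy model. The file name `toy_model.pkl` and the toy data are hypothetical. Keep in mind that predicting on new text also requires the fitted vectorizer, which the cell above does not save.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Toy data and model, assumed for illustration only
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Write to a temporary directory so the sketch leaves the working directory untouched
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Load the model back and confirm it assigns the same clusters
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(restored.predict(points), toy_model.predict(points))
print(same)
```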

## 5. Make Predictions

In this section, we use the trained model to assign any text we write to one of the clusters.

In [24]:
# Enter a review

predict_reviews = ["tf and idf is awesome!", "bad movie"]

In [25]:
# Make the predictions

model.predict(tfidf_vectorizer.transform(predict_reviews))

array([1, 0], dtype=int32)

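Here the model assigns the two sentences to different clusters. K-Means cluster indices are arbitrary, so the `1` and `0` in the output are cluster IDs, not the `negative` and `positive` codes from `sentiment_map`; the output alone does not show that the sentiment was predicted correctly. The second sentence lands in cluster 0, whose top terms (listed in Section 6) include `bad`.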
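
Because K-Means is unsupervised, its cluster IDs need not line up with the numeric sentiment labels. One way to compare a clustering with known labels regardless of numbering is `homogeneity_score`, which is imported at the top of the notebook but not used. The sketch below uses made-up label lists to show how it behaves. On the real data, the analogous call would presumably be `homogeneity_score(df['sentiment'], model.labels_)`, which is not run here.

```python
from sklearn.metrics import homogeneity_score

# Made-up ground-truth classes, for illustration only
true_labels = [0, 0, 1, 1]

# Same grouping as the truth, but with the cluster IDs swapped
swapped_ids = [1, 1, 0, 0]
score_swapped = homogeneity_score(true_labels, swapped_ids)

# Each cluster mixes both classes equally
mixed_ids = [0, 1, 0, 1]
score_mixed = homogeneity_score(true_labels, mixed_ids)

# A score near 1 means each cluster contains a single class; near 0 means clusters are mixed
print(score_swapped, score_mixed)
```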

## 6. Identify top terms

In this section, we identify the top terms in each cluster to understand the trends in the dataset.

In [30]:
print("Top terms per cluster:")

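# For each centroid, sort the term indices by weight in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]

# Note: scikit-learn >= 1.0 provides get_feature_names_out();
# get_feature_names() was removed in version 1.2
terms = tfidf_vectorizer.get_feature_names()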
for i in range(num_target_classes):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print("-----------------------------")


Top terms per cluster:
Cluster 0:
 movie
 bad
 like
 movies
 one
 good
 really
 even
 see
 would
-----------------------------
Cluster 1:
 film
 one
 movie
 like
 good
 story
 time
 well
 show
 would
-----------------------------
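
The ranking step relies on `argsort()[:, ::-1]`, which sorts each centroid's column indices by ascending weight and then reverses them so that the heaviest terms come first. A minimal sketch with a hypothetical three-term vocabulary and invented centroid weights:

```python
import numpy as np

# Hypothetical vocabulary and centroid weights, for illustration only
toy_terms = ["bad", "film", "movie"]
toy_centers = np.array([[0.1, 0.7, 0.2],    # centroid of cluster 0
                        [0.5, 0.05, 0.3]])  # centroid of cluster 1

# Ascending argsort per row, then reverse each row to get descending order
toy_order = toy_centers.argsort()[:, ::-1]

# Keep the two highest-weighted terms per cluster
top_terms = [[toy_terms[i] for i in row[:2]] for row in toy_order]
print(top_terms)  # [['film', 'movie'], ['bad', 'movie']]
```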
