# SMAI Assignment - 2

## Question 1: Naive Bayes and Clustering

### Part 1: Naive Bayes

[Files](https://drive.google.com/drive/folders/1OUVrOMp2jSSBDJSqvEyXDFTrhiyZnqit?usp=sharing)

You will perform sentiment analysis on a product review dataset containing customer reviews and star ratings from four classes (1, 2, 4, 5). You can use sklearn for this question. Your tasks are as follows:

1.   Clean the text by removing punctuation and preprocess it using techniques such as stop word removal, stemming, etc. You can explore anything!
1.  Create BoW features using word counts. You can choose which words form the features so that performance is optimised. Use the train-test split provided in `train_test_index.pickle` and report any interesting observations based on metrics such as accuracy, precision, recall and F1 score (you can use `classification_report` in sklearn).
1. Repeat Task 2 with TF-IDF features.

In [11]:
import pickle
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [6]:
with open('train_test_index.pickle', 'rb') as handle:
    train_test_index_dict = pickle.load(handle)

In [26]:
print(train_test_index_dict.keys())


dict_keys(['train_index', 'test_index'])


In [27]:
train_indices = train_test_index_dict['train_index']
test_indices = train_test_index_dict['test_index']

In [9]:
import pandas as pd

data = pd.read_csv('product_reviews.csv')
data.head()

Unnamed: 0,text,stars,sentiment
0,Total bill for this horrible service? Over $8G...,1.0,0
1,Went in for a lunch. Steak sandwich was delici...,5.0,1
2,This place has gone down hill. Clearly they h...,1.0,0
3,"Walked in around 4 on a Friday afternoon, we s...",1.0,0
4,Michael from Red Carpet VIP is amazing ! I rea...,4.0,1


In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/abhinavanand/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [13]:
stemmer = PorterStemmer()

In [14]:
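# Build the stop word set once; rebuilding it for every word is very slow.
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Replace every non-letter character (punctuation, digits) with a space.
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.lower()
    words = text.split()
    # Drop stop words, then stem what remains.
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)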

In [17]:
data['cleaned_text'] = data['text'].apply(clean_text)

# Task 2: Bag of Words

In [19]:
from sklearn.feature_extraction.text import CountVectorizer


In [20]:
vectorizer = CountVectorizer(max_features=1500) 
X_bow = vectorizer.fit_transform(data['cleaned_text']).toarray()

In [24]:
X_bow

2

In [31]:
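# Split the data using the provided indices.
# .iloc makes the label selection positional, matching the NumPy indexing of X_bow.

X_train_bow, X_test_bow = X_bow[train_indices], X_bow[test_indices]
y_train, y_test = data['stars'].iloc[train_indices], data['stars'].iloc[test_indices]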

In [32]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [33]:
classifier = MultinomialNB()  # initialize the classifier
classifier.fit(X_train_bow,y_train)


In [34]:
y_pred_bow = classifier.predict(X_test_bow)

In [35]:
print(classification_report(y_test, y_pred_bow))

              precision    recall  f1-score   support

         1.0       0.69      0.72      0.70      1149
         2.0       0.44      0.43      0.43       587
         4.0       0.47      0.60      0.53      1981
         5.0       0.83      0.74      0.78      5082

    accuracy                           0.68      8799
   macro avg       0.61      0.62      0.61      8799
weighted avg       0.70      0.68      0.69      8799



# Task 3: TF-IDF features
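
The notebook stops at this heading, so what follows is only a minimal sketch of how a TF-IDF variant could mirror the BoW pipeline above. The toy reviews, labels and index arrays (`texts`, `stars`, `toy_train_idx`, `toy_test_idx`) are hypothetical stand-ins for `data['cleaned_text']`, `data['stars']`, `train_indices` and `test_indices`. `max_features=1500` simply copies the BoW setting, and fitting the vectorizer on the training split only is a choice made here, not something the notebook does. No results are claimed for the real dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for data['cleaned_text'] and data['stars'].
texts = np.array([
    "horribl servic rude staff",
    "terribl food never come back",
    "amaz steak sandwich delici",
    "great servic friendli staff love",
    "worst experi horribl food",
    "love place amaz food great",
])
stars = np.array([1.0, 1.0, 5.0, 5.0, 1.0, 5.0])

# Hypothetical stand-ins for train_indices and test_indices.
toy_train_idx = np.array([0, 1, 2, 3])
toy_test_idx = np.array([4, 5])

# Learn the vocabulary and IDF weights on the training split only,
# then reuse them to transform the test split.
tfidf = TfidfVectorizer(max_features=1500)
X_train_tfidf = tfidf.fit_transform(texts[toy_train_idx])
X_test_tfidf = tfidf.transform(texts[toy_test_idx])

# MultinomialNB accepts the sparse, non-negative TF-IDF matrix directly.
tfidf_classifier = MultinomialNB()
tfidf_classifier.fit(X_train_tfidf, stars[toy_train_idx])
y_pred_tfidf = tfidf_classifier.predict(X_test_tfidf)

print(classification_report(stars[toy_test_idx], y_pred_tfidf, zero_division=0))
```

On the real data, the same calls would take `data['cleaned_text'].iloc[train_indices]` and `data['cleaned_text'].iloc[test_indices]` in place of the toy arrays, and the resulting report can be compared line by line with the BoW report.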

### Part 2: Clustering

You will perform k-means clustering on the same product reviews dataset from Part 1. In this question, instead of computing features statistically, you will use the embeddings obtained from a neural sentiment analysis model (huggingface: siebert/sentiment-roberta-large-english).

You can use sklearn for this question. Your tasks are as follows:


1. Perform k-means clustering using sklearn. Try various values for the number of clusters (k) and plot the elbow curve, that is, the WCSS (Within-Cluster Sum of Squares) for each value of k. WCSS is the sum of the squared distances between each point and the centroid of its cluster.
1. Repeat Task 1 with each of the cluster initialisation methods: k-means++ and Forgy ("random" in sklearn).
1. Since the ground truth labels (star ratings) are available in this case, we can evaluate the clustering using metrics such as purity, NMI and Rand score. Implement these metrics from scratch and evaluate the clustering. [Reference](https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
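
The three metrics in the last task can be sketched with NumPy alone. This is one possible from-scratch formulation following the linked reference (NMI normalised by the mean of the two entropies); the function names are hypothetical, and returning 1.0 when both entropies are zero is an assumption made here for the degenerate single-cluster, single-class case. The toy labels reproduce the 17-point worked example from the reference.

```python
import numpy as np

def contingency_table(labels_true, labels_pred):
    """Count matrix: rows are clusters, columns are ground-truth classes."""
    _, class_ids = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, cluster_ids = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    table = np.zeros((cluster_ids.max() + 1, class_ids.max() + 1), dtype=np.int64)
    np.add.at(table, (cluster_ids, class_ids), 1)
    return table

def purity(labels_true, labels_pred):
    # Assign each cluster to its majority class and count the correct points.
    table = contingency_table(labels_true, labels_pred)
    return table.max(axis=1).sum() / table.sum()

def entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def nmi(labels_true, labels_pred):
    table = contingency_table(labels_true, labels_pred)
    n = table.sum()
    cluster_sizes = table.sum(axis=1)
    class_sizes = table.sum(axis=0)
    # Counts expected if clusters and classes were independent.
    expected = np.outer(cluster_sizes, class_sizes) / n
    nonzero = table > 0
    mutual_info = np.sum(table[nonzero] / n * np.log(table[nonzero] / expected[nonzero]))
    denom = (entropy(cluster_sizes) + entropy(class_sizes)) / 2
    return mutual_info / denom if denom > 0 else 1.0  # assumption for the degenerate case

def rand_index(labels_true, labels_pred):
    table = contingency_table(labels_true, labels_pred)
    n = table.sum()
    pairs = lambda x: x * (x - 1) // 2                 # number of unordered pairs
    total_pairs = pairs(n)
    tp = pairs(table).sum()                            # same cluster, same class
    same_cluster = pairs(table.sum(axis=1)).sum()      # TP + FP
    same_class = pairs(table.sum(axis=0)).sum()        # TP + FN
    tn = total_pairs - same_cluster - same_class + tp  # different cluster, different class
    return (tp + tn) / total_pairs

# Worked example from the reference: 3 clusters over 17 points, 3 classes.
toy_pred = [0] * 6 + [1] * 6 + [2] * 5
toy_true = [0, 0, 0, 0, 0, 1] + [0, 1, 1, 1, 1, 2] + [0, 0, 2, 2, 2]

print(round(purity(toy_true, toy_pred), 2))      # 0.71
print(round(nmi(toy_true, toy_pred), 2))         # 0.36
print(round(rand_index(toy_true, toy_pred), 2))  # 0.68
```

For the assignment, `toy_true` would be replaced by the star ratings and `toy_pred` by the `labels_` attribute of a fitted `KMeans` model.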

In [None]:
import gzip
import numpy as np

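# Open the gzip-compressed .npy file and load the embedding matrix;
# the context manager closes the file once loading is done.
with gzip.open('roberta_embeds.npy.gz', 'rb') as f:
    embeds = np.load(f)
print(embeds.shape)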
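
With the embeddings loaded, the elbow curve for both initialisation methods could be sketched as follows. This is a minimal illustration, not a finished solution: `toy_embeds` is a hypothetical synthetic stand-in for `embeds`, and the range of k, `n_init=10` and `random_state=0` are arbitrary choices made here. sklearn exposes the WCSS of a fitted model as the `inertia_` attribute.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical stand-in for `embeds`: three Gaussian blobs in 8 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 8))
toy_embeds = np.vstack([c + rng.normal(size=(100, 8)) for c in centers])

k_values = list(range(1, 9))
wcss = {}
for init in ("k-means++", "random"):  # "random" is sklearn's name for Forgy initialisation
    wcss[init] = []
    for k in k_values:
        km = KMeans(n_clusters=k, init=init, n_init=10, random_state=0)
        km.fit(toy_embeds)
        wcss[init].append(km.inertia_)  # inertia_ is the within-cluster sum of squares
    plt.plot(k_values, wcss[init], marker="o", label=init)

plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.title("Elbow curve")
plt.legend()
plt.show()
```

Swapping `toy_embeds` for `embeds` runs the same loop on the real data; the "elbow" is the value of k after which the curve flattens, and plotting both initialisations on one axis makes any difference between them easy to see.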