# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
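The split-then-cross-validate workflow described above can be sketched as follows. This is a minimal illustration on randomly generated count data, not the IMDB files; the array sizes, the `stratify` option, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for a bag-of-words matrix: 200 documents, 20 count features
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 20))
y = rng.integers(0, 2, size=200)

# 80% training, 20% validation (stratify keeps the class ratio similar in both parts)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 10-fold cross-validation on the training portion only
clf = MultinomialNB()
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=10, scoring="accuracy")
print("Mean CV accuracy:", cv_scores.mean())

# Refit on the whole training portion, then check the held-out validation set
clf.fit(X_tr, y_tr)
val_acc = clf.score(X_val, y_val)
print("Validation accuracy:", val_acc)
```

The test file would only be scored once, after a model has been chosen from the cross-validation and validation results.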


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
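To make the four measurements concrete, here is a small sketch that computes them by hand from true and predicted labels. The helper name `binary_metrics` and the toy label lists are made up for illustration; in practice `sklearn.metrics.classification_report` reports the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (e.g. a class that is never predicted)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
result = binary_metrics(y_true, y_pred)
print(result)
```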


In [1]:
#Write your code here.

import pandas as pd

#Load and format the training data into a data frame
sentiments = []
texts = []

with open('/content/stsa-train.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(' ', 1)
        sentiment = int(parts[0])
        text = parts[1]
        sentiments.append(sentiment)
        texts.append(text)

train_data = pd.DataFrame({'sentiment': sentiments, 'text': texts})

train_data

Unnamed: 0,sentiment,text
0,1,"a stirring , funny and finally transporting re..."
1,0,apparently reassembled from the cutting-room f...
2,0,they presume their audience wo n't sit still f...
3,1,this is a visually stunning rumination on love...
4,1,jonathan parker 's bartleby should have been t...
...,...,...
6915,1,"painful , horrifying and oppressively tragic ,..."
6916,0,take care is nicely performed by a quintet of ...
6917,0,"the script covers huge , heavy topics in a bla..."
6918,0,a seriously bad film with seriously warped log...


In [2]:
sentiments = []
texts = []

with open('/content/stsa-test.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(' ', 1)
        sentiment = int(parts[0])
        text = parts[1]
        sentiments.append(sentiment)
        texts.append(text)

test_data = pd.DataFrame({'sentiment': sentiments, 'text': texts})
test_data

Unnamed: 0,sentiment,text
0,0,"no movement , no yuks , not much of anything ."
1,0,"a gob of drivel so sickly sweet , even the eag..."
2,0,"gangs of new york is an unapologetic mess , wh..."
3,0,"we never really feel involved with the story ,..."
4,1,this is one of polanski 's best films .
...,...,...
1816,0,"an often-deadly boring , strange reading of a ..."
1817,0,the problem with concept films is that if the ...
1818,0,"safe conduct , however ambitious and well-inte..."
1819,0,"a film made with as little wit , interest , an..."


In [3]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

import re

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))
# Compile the HTML-tag pattern once instead of on every call
cleanr = re.compile('<.*?>')

def preprocess(sentence):
    sentence = str(sentence)
    sentence = sentence.lower()
    sentence = sentence.replace('{html}', "")
    # Strip HTML tags, URLs and digits
    cleantext = re.sub(cleanr, '', sentence)
    rem_url = re.sub(r'http\S+', '', cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    # Keep word characters only, then drop short tokens and stopwords
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
    # Stem first, then lemmatize the stems
    stem_words = [stemmer.stem(w) for w in filtered_words]
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(lemma_words)

# Apply the preprocessing to the training data
train_data['text'] = train_data['text'].apply(preprocess)

# Apply the preprocessing to the test data
test_data['text'] = test_data['text'].apply(preprocess)

# Convert the text data into a numerical representation using Bag of Words
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_data['text'])
y_train = train_data['sentiment']
X_test = vectorizer.transform(test_data['text'])
y_test = test_data['sentiment']

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [4]:
from sklearn.model_selection import train_test_split
# Split the training data into training and validation data with a ratio of 80:20
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
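In the cells above, the `CountVectorizer` is fit on the whole training file before the split, so its vocabulary also reflects the validation rows. One way to keep each fold's vectorizer restricted to that fold's training portion is to wrap the vectorizer and the classifier in a scikit-learn pipeline. A minimal sketch, assuming made-up toy sentences and labels (none of this text comes from the dataset):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Invented toy corpus: 10 positive and 10 negative snippets
pos_texts = ["good great fun film number %d" % i for i in range(10)]
neg_texts = ["bad boring dull film number %d" % i for i in range(10)]
toy_texts = pos_texts + neg_texts
toy_labels = [1] * 10 + [0] * 10

# The vectorizer is refit inside every fold, on that fold's training rows only
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
fold_scores = cross_val_score(pipe, toy_texts, toy_labels, cv=5, scoring="accuracy")
print("Per-fold accuracy:", fold_scores)
```

The same pipeline object can be passed to `fit` and `predict` with raw strings, which avoids carrying a separately fitted vectorizer around.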

In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

In [6]:
# Create the Multinomial Naive Bayes model and perform 10-fold cross-validation on the training data
nb = MultinomialNB()
scores = cross_val_score(nb, X_train, y_train, cv=10, scoring='accuracy')
print("Multinomial Naive Bayes Mean accuracy: ", scores.mean())
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_val)
print("Multinomial Naive Bayes Classification Report:\n", classification_report(y_val, y_pred_nb))

Multinomial Naive Bayes Mean accuracy:  0.7785433572048753
Multinomial Naive Bayes Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.74      0.77       671
           1       0.77      0.84      0.81       713

    accuracy                           0.79      1384
   macro avg       0.79      0.79      0.79      1384
weighted avg       0.79      0.79      0.79      1384



In [7]:
# Create the SVM model and perform 10-fold cross-validation on the training data
svm = SVC()
scores = cross_val_score(svm, X_train, y_train, cv=10, scoring='accuracy')
print("SVM Mean accuracy: ", scores.mean())
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_val)
print("SVM Classification Report:\n", classification_report(y_val, y_pred_svm))

SVM Mean accuracy:  0.7584850601575914
SVM Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.73      0.77       671
           1       0.77      0.84      0.80       713

    accuracy                           0.79      1384
   macro avg       0.79      0.78      0.78      1384
weighted avg       0.79      0.79      0.78      1384



In [8]:
# Create the KNN model and perform 10-fold cross-validation on the training data
knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
print("KNN Mean accuracy: ", scores.mean())
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_val)
print("KNN Classification Report:\n", classification_report(y_val, y_pred_knn))

KNN Mean accuracy:  0.5765881538833144
KNN Classification Report:
               precision    recall  f1-score   support

           0       0.56      0.61      0.58       671
           1       0.60      0.54      0.57       713

    accuracy                           0.58      1384
   macro avg       0.58      0.58      0.58      1384
weighted avg       0.58      0.58      0.58      1384



In [9]:
# Create the Decision Tree model and perform 10-fold cross-validation on the training data
dt = DecisionTreeClassifier()
scores = cross_val_score(dt, X_train, y_train, cv=10, scoring='accuracy')
print("Decision Tree Mean accuracy: ", scores.mean())

dt.fit(X_train, y_train)
y_pred = dt.predict(X_val)
print("Classification Report for Decision Tree Model: \n", classification_report(y_val, y_pred))

Decision Tree Mean accuracy:  0.6679855856796861
Classification Report for Decision Tree Model: 
               precision    recall  f1-score   support

           0       0.68      0.64      0.66       671
           1       0.68      0.72      0.70       713

    accuracy                           0.68      1384
   macro avg       0.68      0.68      0.68      1384
weighted avg       0.68      0.68      0.68      1384



In [10]:
# Create the Random Forest model and perform 10-fold cross-validation on the training data
rf = RandomForestClassifier()
scores = cross_val_score(rf, X_train, y_train, cv=10, scoring='accuracy')
print("Random Forest Mean accuracy: ", scores.mean())
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_val)
print("Classification Report for Random Forest Model: \n", classification_report(y_val, y_pred_rf))

Random Forest Mean accuracy:  0.7299449670651061
Classification Report for Random Forest Model: 
               precision    recall  f1-score   support

           0       0.77      0.69      0.73       671
           1       0.74      0.80      0.77       713

    accuracy                           0.75      1384
   macro avg       0.75      0.75      0.75      1384
weighted avg       0.75      0.75      0.75      1384



In [11]:
# Create the XGBoost model and perform 10-fold cross-validation on the training data
xgb = XGBClassifier()
scores = cross_val_score(xgb, X_train, y_train, cv=10, scoring='accuracy')
print("XGBoost Mean accuracy: ", scores.mean())
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_val)
print("Classification Report for XGBoost Model: \n", classification_report(y_val, y_pred_xgb))

XGBoost Mean accuracy:  0.716043112396446
Classification Report for XGBoost Model: 
               precision    recall  f1-score   support

           0       0.73      0.73      0.73       671
           1       0.75      0.75      0.75       713

    accuracy                           0.74      1384
   macro avg       0.74      0.74      0.74      1384
weighted avg       0.74      0.74      0.74      1384



In [12]:
#Training a Word2Vec Model
nltk.download('punkt')
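# Tokenize the reviews into sentences
# Note: sent_tokenize returns whole-sentence strings, so each "token" passed to
# Word2Vec below is a full sentence rather than a single word
sentences = [nltk.sent_tokenize(review) for review in train_data['text']]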

from gensim.models import Word2Vec

# train model
model = Word2Vec(sentences, min_count=1, vector_size=300)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.key_to_index)
# access vector for one sentence
print(model.wv.get_vector('offer new insight matter charact exactli spring life'))
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Word2Vec<vocab=6899, vector_size=300, alpha=0.025>
[-2.7902445e-04 -3.2790855e-03 -1.0501762e-03  4.1931868e-05
  1.8708174e-03 -2.8985785e-03  1.2176129e-03 -3.1945610e-03
  1.6634866e-03  2.5917320e-03  1.8078133e-03  2.8512545e-03
 -3.2797336e-04 -2.6543073e-03  1.3295221e-03  1.1467402e-03
  2.2770402e-03  2.2057907e-03 -1.3995854e-03  1.6731803e-03
  2.3202591e-03  1.2234731e-03 -3.3294440e-03 -3.0530309e-03
 -2.4362532e-03  1.4558275e-03  2.7001460e-04 -2.3820950e-03
 -8.9830277e-04  2.3827453e-03  1.3769603e-03 -5.2890100e-04
 -1.7734810e-03 -3.1183537e-03 -2.6763459e-03  2.2984720e-03
 -1.7644898e-03 -5.0507742e-04 -1.1627702e-03 -2.7044734e-03
  2.2734403e-04 -1.1609737e-03 -6.9915928e-04  2.2314782e-03
 -2.0104253e-03 -1.7740738e-03  2.6385942e-03  1.4671132e-03
  1.4610529e-04  9.0708176e-04 -9.7704690e-04  3.5650333e-04
 -9.0388855e-04 -1.8372322e-03  2.3970092e-03 -3.0308127e-04
  2.1171998e-03  1.6272315e-03  1.3714135e-03 -1.5134811e-05
 -4.1125735e-04 -5.2097639e-05  1.

In [13]:
import numpy as np
import string
import warnings
warnings.filterwarnings('ignore')

# Define a function to vectorize a sentence
def vectorize_sentence(sentence, model):
    # Combine the words in the sentence into a single string
    sentence = ' '.join(sentence)
    # Remove punctuation and convert to lowercase
    sentence = sentence.translate(str.maketrans('', '', string.punctuation)).lower()
    # Split sentence into words
    words = sentence.split()
    # Filter out words that are not in the Word2Vec model
    words = [word for word in words if word in model.wv]
    # If there are no words in the sentence that are in the Word2Vec model, return a zero vector
    if len(words) == 0:
        return np.zeros(model.vector_size)
    # Otherwise, return the average of the word vectors in the sentence
    else:
        return np.mean([model.wv[word] for word in words], axis=0)


# Vectorize the train set
X_train_vec = np.array([vectorize_sentence(sentence, model) for sentence in sentences])
y_train = train_data['sentiment']

# Split train set into train and validation set
X_train, X_val, y_train, y_val = train_test_split(X_train_vec, y_train, test_size=0.2, random_state=42)

# Vectorize the test set
# vectorize_sentence expects a list of strings, so split each review into words first
X_test = np.array([vectorize_sentence(sentence.split(), model) for sentence in test_data['text']])
y_test = test_data['sentiment']

from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Train SVM model
svm = SVC(kernel='linear', random_state=42,)
# Perform 10-fold cross-validation on the training data
scores = cross_val_score(svm, X_train, y_train, cv=10)

# Print the cross-validation scores
print('Mean cross-validation score:', scores.mean())
svm.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = svm.predict(X_val)

# Evaluate the model
print(classification_report(y_val, y_pred))

Mean cross-validation score: 0.5233018455291452
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       671
           1       0.52      1.00      0.68       713

    accuracy                           0.52      1384
   macro avg       0.26      0.50      0.34      1384
weighted avg       0.27      0.52      0.35      1384
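The report above shows class 0 never being predicted. One possible cause, offered here as a hypothesis rather than a confirmed diagnosis, is that the vocabulary was built from whole-sentence strings, so word-level lookups mostly miss and many documents collapse to the zero vector. The sketch below illustrates the effect with a tiny hand-written lookup table standing in for a trained model; `toy_vectors` and `average_vector` are hypothetical names.

```python
import numpy as np

# Hand-written 2-d "embeddings" standing in for a trained word-vector model
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "film": np.array([0.0, 1.0]),
    "bad": np.array([-1.0, 0.0]),
}

def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Word-level tokens hit the lookup table
word_level = average_vector("good film".split(), toy_vectors, dim=2)

# A whole sentence used as one token misses it and yields the zero vector
sentence_level = average_vector(["good film"], toy_vectors, dim=2)

print(word_level)
print(sentence_level)
```

If most rows of the feature matrix are identical zero vectors, a classifier has nothing to separate the classes with and tends to fall back on the majority class.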



## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [14]:
# Write your code here
#importing necessary libraries
import re
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [15]:
import pandas as pd
df = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')

In [16]:
new = df.head(1000).copy()  # Copy so that adding columns below does not write into a slice of df
new['Reviews']

0      I feel so LUCKY to have found this used (phone...
1      nice phone, nice up grade from my pantach revu...
2                                           Very pleased
3      It works good but it goes slow sometimes but i...
4      Great phone to replace my lost phone. The only...
                             ...                        
995    It's a decent for the price.. I've had this on...
996                                   Is good cell phone
997    Amazing phone. Cables and case included, also ...
998                                             Excelent
999       Excellent, it meets the requirements requested
Name: Reviews, Length: 1000, dtype: object

In [17]:
# Special characters removal
new['After noise removal'] = new['Reviews'].apply(lambda x: ''.join(re.sub(r"[^a-zA-Z0-9]+", ' ', charctr) for charctr in x ))

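# Punctuation removal
# (pandas >= 2.0 defaults to regex=False, which matches the pattern literally; pass regex=True to apply it as a regular expression)
new['Punctuation removal'] = new['After noise removal'].str.replace(r'[^\w\s]', '')

# Remove numbers
# (the same regex=False default applies here, so digits survive unless regex=True is passed)
new['Remove numbers'] = new['Punctuation removal'].str.replace(r'\d+', '')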

# Stopwords removal
stop_word = stopwords.words('english')
new['Stopwords removal'] = new['Remove numbers'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_word))

# Lower Casing
new['Lower casing'] = new['Stopwords removal'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# Tokenization
new['Tokenization'] = new['Lower casing'].apply(lambda x: TextBlob(x).words)

# Stemming
st = PorterStemmer()
new['Stemming'] = new['Tokenization'].apply(lambda x: " ".join([st.stem(word) for word in x]))

# Lemmatization
new['Lemmatization'] = new['Stemming'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
new['Lemmatization']

0      i feel lucki found use phone u use hard phone ...
1      nice phone nice grade pantach revu veri clean ...
2                                              veri plea
3        it work good goe slow sometim good phone i love
4      great phone replac lost phone the thing volum ...
                             ...                        
995    it decent price i one 6 month the con 1 i pret...
996                                   is good cell phone
997    amaz phone cabl case includ also screen pritec...
998                                                excel
999                            excel meet requir request
Name: Lemmatization, Length: 1000, dtype: object

In [18]:
#Implementing K-means using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer  # Import the TfidfVectorizer class for generating TF-IDF features

# Instantiate a TfidfVectorizer object with default parameters
tfidf_vect = TfidfVectorizer()

# Use the fit_transform method of the TfidfVectorizer object to generate a TF-IDF matrix from the 'Lemmatization' column of the 'new' dataframe
tfidf = tfidf_vect.fit_transform(new['Lemmatization'].values)

# Print the shape of the resulting TF-IDF matrix
tfidf.shape

(1000, 2620)

In [19]:
from sklearn.cluster import KMeans  # Import the KMeans class from scikit-learn

# Instantiate a KMeans object with 10 clusters and a random state of 99
model_tf = KMeans(n_clusters=10, random_state=99)

# Use the fit method of the KMeans object to fit the model to the TF-IDF matrix generated in the previous step
model_tf.fit(tfidf)

In [20]:
# Assign the cluster labels generated by the KMeans algorithm to a variable
labels_tf = model_tf.labels_

# Assign the cluster centers generated by the KMeans algorithm to a variable
cluster_center_tf = model_tf.cluster_centers_

# Print the cluster centers to the console
print(cluster_center_tf)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.00097592 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [21]:
# Use the get_feature_names_out method of the TfidfVectorizer object to extract the terms in the corpus
terms1 = tfidf_vect.get_feature_names_out()

# Print terms 1 through 99 (the slice starts at index 1, so the very first term is skipped)
print(terms1[1:100])

['00pm' '03' '04' '0mp' '10' '100' '1080p' '10k' '11' '110' '115' '12'
 '13' '14' '1400' '15' '16' '169' '1700' '178mb' '18' '1800' '1900' '1999'
 '1gb' '1it' '1st' '1thi' '1x' '20' '2001' '200mb' '200ppi' '2012' '2013'
 '2014' '2015' '2016' '2017' '2100' '2100mhz' '2250' '2300mah' '24' '25'
 '25cm' '288' '29' '2g' '2gb' '2mp' '2nd' '2pm' '2sim' '30' '300'
 '300megabyt' '302' '303' '30ish' '30text' '31' '32' '32g' '32gb' '39'
 '3d' '3g' '3gcellphon' '3inch' '3week' '40' '400mp2' '450' '4g' '4gb'
 '50' '500' '512mb' '55' '55th' '5c' '5g' '5in' '5j' '5mm' '5mp' '5th'
 '60' '600' '628' '635' '64' '64bit' '64gig' '6in' '70' '700' '710']


In [22]:
df1 = new  # Assign the DataFrame 'new' to a new variable 'df1'
df1['Tfidf Clus Label'] = model_tf.labels_  # Add a new column 'Tfidf Clus Label' to 'df1' containing the clustering labels obtained from 'model_tf'
df1[['Lemmatization','Tfidf Clus Label']].head()  # Select the 'Lemmatization' and 'Tfidf Clus Label' columns from 'df1' and display the first five rows using the `head()` method

Unnamed: 0,Lemmatization,Tfidf Clus Label
0,i feel lucki found use phone u use hard phone ...,5
1,nice phone nice grade pantach revu veri clean ...,5
2,veri plea,1
3,it work good goe slow sometim good phone i love,4
4,great phone replac lost phone the thing volum ...,5


In [23]:
df1.groupby(['Tfidf Clus Label'])['Reviews'].count()

Tfidf Clus Label
0     32
1    384
2     35
3     51
4     61
5    255
6     59
7     59
8     21
9     43
Name: Reviews, dtype: int64

In [24]:
print("Top clusters:")
order_centroids = model_tf.cluster_centers_.argsort()[:, ::-1]
for i in range(10):
    print("Cluster %d:" % i, end='')
    # Print the 10 highest-weighted terms of this centroid on one line
    for ind in order_centroids[i, :10]:
        print(' %s' % terms1[ind], end='')
    print()

Top clusters:
Cluster 0: arriv return dead wast product work thi phone disappoint condit
Cluster 1: work phone it bad good like bueno use well fine
Cluster 2: excelent telefono bueno muy producto far fastest faster fast fashion
Cluster 3: excel product recommend seller thank 100 phone fash function five
Cluster 4: good veri phone product price buy tank thank mobil recomend
Cluster 5: phone use work the it card sim would one thi
Cluster 6: love phone it best like screen buy price camera good
Cluster 7: great phone work easi use love expect good thi price
Cluster 8: ok far wcdma it quit peopl someon buy slow phone
Cluster 9: the horribl phone work item overal pictur good never take

In [25]:
for i in range(10):
    print("3 reviews assigned to cluster ", i)
    print("-" * 70)
    print(df1.iloc[df1.groupby(['Tfidf Clus Label']).groups[i][0]]['Reviews'])
    print('\n')
    print(df1.iloc[df1.groupby(['Tfidf Clus Label']).groups[i][2]]['Reviews'])
    print('\n')
    print(df1.iloc[df1.groupby(['Tfidf Clus Label']).groups[i][6]]['Reviews'])
    print('\n')
    print("_" * 70)

3 reviews assigned to cluster  0
----------------------------------------------------------------------
Great. Arrived quickly.


arrived broken and forgot to send back


The phone is not powering on I sent it to Jamaica WI and will return once I receive it back


______________________________________________________________________
3 reviews assigned to cluster  1
----------------------------------------------------------------------
Very pleased


It's battery life is great. It's very responsive to touch. The only issue is that sometimes the screen goes black and you have to press the top button several times to get the screen to re-illuminate.


unfortunately Sprint could not activate the phone due to the blocking issue with the phone, the matter was handled very well and quickly. Very satisfied with the service.


______________________________________________________________________
3 reviews assigned to cluster  2
-----------------------------------------------------------

In [26]:
# Generating bag of words features and Implementing K-means.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
bow = count_vect.fit_transform(new['Reviews'].values)

In [27]:
from sklearn.cluster import KMeans  # Importing the KMeans clustering algorithm from sklearn.cluster library
model = KMeans(n_clusters = 10,init='k-means++',random_state=99)  # Creating an instance of the KMeans clustering algorithm with 10 clusters, k-means++ initialization, and a fixed random state of 99.
model.fit(bow)  # Fitting the KMeans clustering model on the bag-of-words (bow) matrix.

In [28]:
labels = model.labels_  # Obtaining the cluster labels for each data point in the input matrix using the 'labels_' attribute of the KMeans model
cluster_center = model.cluster_centers_  # Obtaining the cluster centers (centroids) for each of the clusters using the 'cluster_centers_' attribute of the KMeans model

In [29]:
from sklearn import metrics
print(metrics.silhouette_score(bow, labels, metric='euclidean'))

0.39528026327024723
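A single silhouette score is easier to interpret when compared across several values of `n_clusters`. The sketch below sweeps a few candidate values on synthetic blobs; the data, the range of `k`, and the `n_init` setting are assumptions for illustration and say nothing about the best `k` for the review data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-d data with four generated groups
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit K-means for each candidate k and record the silhouette score
sil_by_k = {}
for k in range(2, 8):
    demo_labels = KMeans(n_clusters=k, n_init=10, random_state=99).fit_predict(X_demo)
    sil_by_k[k] = silhouette_score(X_demo, demo_labels, metric="euclidean")

for k, score in sil_by_k.items():
    print(k, round(score, 3))

# The k with the highest score is one reasonable candidate, not a guarantee
best_k = max(sil_by_k, key=sil_by_k.get)
print("Highest silhouette at k =", best_k)
```

Scores range from -1 to 1; values near 1 indicate tight, well-separated clusters, and values near 0 indicate overlapping ones.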


In [30]:
new['Bow Label'] = model.labels_
new[['Lemmatization','Bow Label']].head()

Unnamed: 0,Lemmatization,Bow Label
0,i feel lucki found use phone u use hard phone ...,9
1,nice phone nice grade pantach revu veri clean ...,0
2,veri plea,0
3,it work good goe slow sometim good phone i love,0
4,great phone replac lost phone the thing volum ...,9


In [31]:
#Implementing DBSCAN
from sklearn.cluster import DBSCAN  # Importing the DBSCAN clustering algorithm from the sklearn.cluster library
import numpy as np  # Importing the numpy library and aliasing it as np

minPts = 2 * 100  # Minimum points for DBSCAN: twice the dimensionality (100) of the Word2Vec sentence vectors built below

def lower_bound(nums, target):  # Return the index of the first element of sorted nums that is >= target
    l, r = 0, len(nums) - 1
    while l <= r:  # Binary search
        mid = l + (r - l) // 2
        if nums[mid] >= target:
            r = mid - 1
        else:
            l = mid + 1
    return l

def compute200thnearestneighbour(x, data):  # Squared distance from x to its 200th nearest point in data
    dists = []  # Sorted list of the 200 smallest squared distances seen so far
    for val in data:
        dist = np.sum((x - val) ** 2)
        if len(dists) < 200:
            # Still filling up: insert at the sorted position
            dists.insert(lower_bound(dists, dist), dist)
        elif dists[199] > dist:
            # Closer than the current 200th: insert in order and drop the largest
            dists.insert(lower_bound(dists, dist), dist)
            dists.pop()

    return dists[199]  # Requires data to contain at least 200 points

In [32]:
list_of_sent_train = list()

for i in new["Lower casing"].values:
  list_of_sent_train.append(i.split())

In [33]:
import gensim

# Train Word2Vec model (the default vector_size is 100)
list_of_sent_train = [review.split() for review in new["Lower casing"].values]
w2v_model = gensim.models.Word2Vec(list_of_sent_train, workers=4)

import numpy as np

# Create sentence vectors by averaging the Word2Vec vectors of each review's words
sent_vectors = []
for sent in list_of_sent_train:
    sent_vec = np.zeros(w2v_model.vector_size)
    # Starts at 1 so empty reviews do not divide by zero; note this divides by (n + 1)
    cnt_words = 1
    for word in sent:
        try:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
        except KeyError:
            # Word is not in the vocabulary (e.g. below min_count)
            pass
    sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
sent_vectors = np.array(sent_vectors)
sent_vectors = np.nan_to_num(sent_vectors)

# Calculate 200th nearest neighbor for each sentence vector
twohundrethneigh = []
for val in sent_vectors[:300]:
    twohundrethneigh.append(compute200thnearestneighbour(val, sent_vectors[:300]))
twohundrethneigh.sort()

# Train DBSCAN model
model = DBSCAN(eps=5, min_samples=minPts, n_jobs=-1)
model.fit(sent_vectors)
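The k-th nearest neighbour distances computed above are normally used to pick `eps`, rather than hard-coding it. A minimal sketch of that idea on synthetic 2-d data follows, using scikit-learn's `NearestNeighbors`; the data, `k = 5`, and the 95th-percentile cutoff are illustrative assumptions, not values tuned for the reviews.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Two synthetic, well-separated groups of 100 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 2)),
    rng.normal(5.0, 0.3, size=(100, 2)),
])

k = 5  # plays the role of min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X_demo)
# When queried with the fitted data, the first neighbour of each point is the point itself
neigh_dists, _ = nn.kneighbors(X_demo)
k_dist = np.sort(neigh_dists[:, -1])

# Crude stand-in for reading the "elbow" of the sorted k-distance curve
eps_demo = float(np.percentile(k_dist, 95))

demo_labels = DBSCAN(eps=eps_demo, min_samples=k).fit_predict(X_demo)
n_clusters = len(set(demo_labels)) - (1 if -1 in demo_labels else 0)
n_noise = int(np.sum(demo_labels == -1))
print("eps:", round(eps_demo, 3), "clusters:", n_clusters, "noise points:", n_noise)
```

Checking how many points receive the noise label `-1`, and how many distinct labels appear, is a quick way to see whether a given `eps` and `min_samples` pair has put everything into a single cluster.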

In [34]:
new['AVG-W2V Clus Label'] = model.labels_
new[['Lemmatization','AVG-W2V Clus Label']].head()

Unnamed: 0,Lemmatization,AVG-W2V Clus Label
0,i feel lucki found use phone u use hard phone ...,0
1,nice phone nice grade pantach revu veri clean ...,0
2,veri plea,0
3,it work good goe slow sometim good phone i love,0
4,great phone replac lost phone the thing volum ...,0


In [35]:
#Implementing Hierarchical Clustering
import scipy
from scipy.cluster import hierarchy

In [36]:
from sklearn.cluster import AgglomerativeClustering

# Instantiate AgglomerativeClustering with 5 clusters and Ward linkage.
# Ward linkage only supports Euclidean distance, which is the default, so the distance
# argument is omitted (the `affinity` keyword was removed in newer scikit-learn releases).
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')

# Fit the clustering model to the sent_vectors data using the fit_predict() method
Agg = cluster.fit_predict(sent_vectors)

# "hier" is a second name for the "new" DataFrame, not a copy
hier = new

# Store the cluster labels in the AVG-W2V Clus Label column (this overwrites the DBSCAN labels saved there earlier)
hier['AVG-W2V Clus Label'] = cluster.labels_

# Display the first 5 rows of the hier DataFrame
hier.head(5)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,After noise removal,Punctuation removal,Remove numbers,Stopwords removal,Lower casing,Tokenization,Stemming,Lemmatization,Tfidf Clus Label,Bow Label,AVG-W2V Clus Label
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel so LUCKY to have found this used phone...,I feel LUCKY found used phone us used hard pho...,i feel lucky found used phone us used hard pho...,"[i, feel, lucky, found, used, phone, us, used,...",i feel lucki found use phone us use hard phone...,i feel lucki found use phone u use hard phone ...,5,9,0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice up grade from my pantach revu...,nice phone nice grade pantach revue Very clean...,nice phone nice grade pantach revue very clean...,"[nice, phone, nice, grade, pantach, revue, ver...",nice phone nice grade pantach revu veri clean ...,nice phone nice grade pantach revu veri clean ...,5,0,2
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,Very pleased,Very pleased,Very pleased,Very pleased,very pleased,"[very, pleased]",veri pleas,veri plea,1,0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good but it goes slow sometimes but i...,It works good goes slow sometimes good phone I...,it works good goes slow sometimes good phone i...,"[it, works, good, goes, slow, sometimes, good,...",it work good goe slow sometim good phone i love,it work good goe slow sometim good phone i love,4,0,0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone to replace my lost phone The only...,Great phone replace lost phone The thing volum...,great phone replace lost phone the thing volum...,"[great, phone, replace, lost, phone, the, thin...",great phone replac lost phone the thing volum ...,great phone replac lost phone the thing volum ...,5,9,0


In [37]:
# Read two reviews from each cluster.
for i in range(5):
    print("2 reviews assigned to cluster ", i)
    print("-" * 70)
    print(hier.iloc[hier.groupby(['AVG-W2V Clus Label']).groups[i][0]]['Lemmatization'])
    print('\n')
    print(hier.iloc[hier.groupby(['AVG-W2V Clus Label']).groups[i][1]]['Lemmatization'])
    print('\n')
    print("_" * 70)

2 reviews assigned to cluster  0
----------------------------------------------------------------------
i feel lucki found use phone u use hard phone line someon upgrad sold one my son like old one final fell apart 2 5 year want upgrad thank seller realli appreci honesti said use phone i recommend seller highli would


it work good goe slow sometim good phone i love


______________________________________________________________________
2 reviews assigned to cluster  1
----------------------------------------------------------------------
veri plea


describ fast ship


______________________________________________________________________
2 reviews assigned to cluster  2
----------------------------------------------------------------------
nice phone nice grade pantach revu veri clean set easi set never android phone fantast say least perfect size surf social medium great phone samsung


phone look good stay charg buy new batteri still stay charg long i trash money lost nev

In [38]:
hier.groupby(['AVG-W2V Clus Label'])['Reviews'].count()

AVG-W2V Clus Label
0    346
1    108
2    301
3    100
4    145
Name: Reviews, dtype: int64
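The `scipy.cluster.hierarchy` module imported earlier offers another route to the same kind of result: build the full merge tree once, then cut it at any number of clusters. A minimal sketch on synthetic 3-d points follows; the data and the choice of three clusters are assumptions for the example.

```python
import numpy as np
from scipy.cluster import hierarchy

# Three synthetic, well-separated groups of 20 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.2, size=(20, 3)),
    rng.normal(3.0, 0.2, size=(20, 3)),
    rng.normal(-3.0, 0.2, size=(20, 3)),
])

# Ward linkage records every merge; Z has one row per merge (n - 1 rows)
Z = hierarchy.linkage(X_demo, method="ward")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1)
flat_labels = hierarchy.fcluster(Z, t=3, criterion="maxclust")
cluster_sizes = np.bincount(flat_labels)[1:]
print(Z.shape)
print(cluster_sizes)
```

Because the tree is computed once, trying a different number of clusters only requires another `fcluster` call, and `hierarchy.dendrogram(Z)` can be used to plot the merges.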

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [39]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
This assignment gave a clear understanding of Machine learning models both
Supervised and Unsupervised. In question 1, I worked with classification models
and in question 2, I worked with clustering methods.




'''

'\nPlease write your answer here:\nThis assignment gave a clear understanding of Machine learning models both \nSupervised and Unsupervised. In question 1, I worked with classification models \nand in question 2, I worked with clustering methods.\n\n\n\n\n'