# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [59]:
train_path = 'exercise05_datacollection\exercise09_datacollection\stsa-train.txt'
test_path = 'exercise05_datacollection\exercise09_datacollection\stsa-test.txt'

def load_data(filepath):
    reviews = []
    labels = []
    with open(filepath, 'r', encoding='utf-8') as file:
        for line in file:
            split_line = line.split(maxsplit=1)  # Split only on the first space
            if len(split_line) == 2:  # Ensure there's both a label and a review
                labels.append(int(split_line[0]))
                reviews.append(split_line[1].strip())  # Remove any extra whitespace

    return reviews, labels


train_reviews, train_labels = load_data(train_path)
test_reviews, test_labels = load_data(test_path)

In [60]:
train_reviews , test_reviews

(['a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films',
  'apparently reassembled from the cutting-room floor of any given daytime soap .',
  "they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science-fiction elements of bug-eyed monsters and futuristic women in skimpy clothes .",
  'this is a visually stunning rumination on love , memory , history and the war between art and commerce .',
  "jonathan parker 's bartleby should have been the be-all-end-all of the modern-office anomie films .",
  'campanella gets the tone just right -- funny in the middle of sad in the middle of hopeful .',
  'a fan film that for the uninitiated plays better on video with the sound turned down .',
  'béart and berling are both superb , while huppert ... is magnificent .',
  'a little less extreme than in the past , with longer exposition sequences between them , and

In [61]:
train_labels , test_labels

([1,
  0,
  0,
  1,
  1,
  1,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  1,
  1,
  1,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  1,
  0,
  1,
  0,
  0,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  1,
  1,
  1,
  1,
  1,
  0,
  1,
  0,
  1,
  1,
  1,
  1,
  0,
  1,
  1,
  1,
  0,
  1,
  0,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  1,
  1,
  0,
  1,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  1,
  1,
  0,
  1,
  1,
  1,
  0,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  1,
  1,
  0,
  1,
  1,
  1,
  1,
  0,
  1,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  1,
  0,
  1,
  1,
  1,
  0,
  1,
  1,
  1,
  1,
  1,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  1,
  0,
  1,
  1,
  0,
  1,
  1,
  1,
  1,
  0,
  0,
  1,
  1,
  1,
  1,
  0,
  0,
  1,
  0,
  1,
  1,


In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_reviews)
X_test = vectorizer.transform(test_reviews)

y_train = train_labels
y_test = test_labels

In [63]:
from sklearn.model_selection import train_test_split

X_traintfidf, X_valtfidf, y_traintfidf, y_valtfidf = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [64]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

classifiers = {
    "MultinomialNB": MultinomialNB(),
    "SVM": make_pipeline(StandardScaler(with_mean=False), SVC()),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric='logloss')
}

# Results dictionary
results = {}



In [65]:
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    labels_pred = clf.predict(X_val)
    accuracy = accuracy_score(y_val, labels_pred)
    recall = recall_score(y_val, labels_pred)
    precision = precision_score(y_val, labels_pred)
    f1 = f1_score(y_val, labels_pred)
    
    results[name] = {
        "Accuracy": accuracy,
        "Recall": recall,
        "Precision": precision,
        "F1 Score": f1
    }

# Display results
for clf_name, metrics in results.items():
    print(f"{clf_name} Performance:")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")
    print("\n")


MultinomialNB Performance:
Accuracy: 0.9408
Recall: 0.9691
Precision: 0.9201
F1 Score: 0.9440


SVM Performance:
Accuracy: 0.9884
Recall: 0.9930
Precision: 0.9847
F1 Score: 0.9888


KNN Performance:
Accuracy: 0.5108
Recall: 0.0533
Precision: 0.9500
F1 Score: 0.1009


Decision Tree Performance:
Accuracy: 0.9986
Recall: 1.0000
Precision: 0.9972
F1 Score: 0.9986


Random Forest Performance:
Accuracy: 0.9986
Recall: 1.0000
Precision: 0.9972
F1 Score: 0.9986


XGBoost Performance:
Accuracy: 0.7984
Recall: 0.9229
Precision: 0.7460
F1 Score: 0.8251




In [69]:
for name, clf in classifiers.items():
    y_pre_test = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pre_test)
    recall = recall_score(y_test, y_pre_test)
    precision = precision_score(y_test, y_pre_test)
    f1 = f1_score(y_test, y_pre_test)
    
    results[name] = {
        "Accuracy": accuracy,
        "Recall": recall,
        "Precision": precision,
        "F1 Score": f1
    }

# Display results
for clf_name, metrics in results.items():
    print(f"{clf_name} Performance:")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")
    print("\n")

MultinomialNB Performance:
Accuracy: 0.7985
Recall: 0.8845
Precision: 0.7542
F1 Score: 0.8142


SVM Performance:
Accuracy: 0.7501
Recall: 0.8317
Precision: 0.7146
F1 Score: 0.7687


KNN Performance:
Accuracy: 0.5058
Recall: 0.0165
Precision: 0.7143
F1 Score: 0.0323


Decision Tree Performance:
Accuracy: 0.6634
Recall: 0.7085
Precision: 0.6492
F1 Score: 0.6775


Random Forest Performance:
Accuracy: 0.7452
Recall: 0.7547
Precision: 0.7400
F1 Score: 0.7473


XGBoost Performance:
Accuracy: 0.6974
Recall: 0.8042
Precision: 0.6621
F1 Score: 0.7263




In [66]:
import gensim
from gensim.models import Word2Vec

sentences_train = [review.split() for review in train_reviews]
sentences_test = [review.split() for review in test_reviews]

w2v_model = Word2Vec(sentences=sentences_train, vector_size=100, window=5, min_count=1, workers=4)

def document_vector(doc):
    doc = [word for word in doc if word in w2v_model.wv.index_to_key]
    return np.mean(w2v_model.wv[doc], axis=0) if doc else np.zeros(w2v_model.vector_size)

X_train_w2v = np.array([document_vector(doc) for doc in sentences_train])
X_test_w2v = np.array([document_vector(doc) for doc in sentences_test])

In [67]:
X_trainw2v, X_valw2v, y_trainw2v, y_valw2v = train_test_split(X_train_w2v, y_train, test_size=0.2, random_state=42)

In [71]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_trainw2v_scaled = scaler.fit_transform(X_trainw2v)
X_valw2v_scaled = scaler.transform(X_valw2v)

In [73]:
for name, clf in classifiers.items():
    clf.fit(X_trainw2v_scaled, y_trainw2v)
    labels_pred = clf.predict(X_valw2v_scaled)
    accuracy = accuracy_score(y_valw2v, labels_pred)
    recall = recall_score(y_val, labels_pred)
    precision = precision_score(y_val, labels_pred)
    f1 = f1_score(y_val, labels_pred)
    
    results[name] = {
        "Accuracy": accuracy,
        "Recall": recall,
        "Precision": precision,
        "F1 Score": f1
    }

# Display results
for clf_name, metrics in results.items():
    print(f"{clf_name} Performance:")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")
    print("\n")

MultinomialNB Performance:
Accuracy: 0.5260
Recall: 0.9341
Precision: 0.5224
F1 Score: 0.6700


SVM Performance:
Accuracy: 0.5838
Recall: 0.7770
Precision: 0.5705
F1 Score: 0.6580


KNN Performance:
Accuracy: 0.5289
Recall: 0.6045
Precision: 0.5381
F1 Score: 0.5694


Decision Tree Performance:
Accuracy: 0.5238
Recall: 0.5540
Precision: 0.5367
F1 Score: 0.5452


Random Forest Performance:
Accuracy: 0.5441
Recall: 0.5975
Precision: 0.5532
F1 Score: 0.5745


XGBoost Performance:
Accuracy: 0.5715
Recall: 0.6129
Precision: 0.5796
F1 Score: 0.5958




In [75]:
for name, clf in classifiers.items():
    y_pre_test = clf.predict(X_test_w2v)
    accuracy = accuracy_score(y_test, y_pre_test)
    recall = recall_score(y_test, y_pre_test)
    precision = precision_score(y_test, y_pre_test)
    f1 = f1_score(y_test, y_pre_test)
    
    results[name] = {
        "Accuracy": accuracy,
        "Recall": recall,
        "Precision": precision,
        "F1 Score": f1
    }

# Display results
for clf_name, metrics in results.items():
    print(f"{clf_name} Performance on Test Data:")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")
    print("\n")

MultinomialNB Performance on Test Data:
Accuracy: 0.4992
Recall: 1.0000
Precision: 0.4992
F1 Score: 0.6659


SVM Performance on Test Data:
Accuracy: 0.4992
Recall: 1.0000
Precision: 0.4992
F1 Score: 0.6659


KNN Performance on Test Data:
Accuracy: 0.5261
Recall: 0.9142
Precision: 0.5142
F1 Score: 0.6582


Decision Tree Performance on Test Data:
Accuracy: 0.5008
Recall: 0.0000
Precision: 0.0000
F1 Score: 0.0000


Random Forest Performance on Test Data:
Accuracy: 0.4909
Recall: 0.0231
Precision: 0.3500
F1 Score: 0.0433


XGBoost Performance on Test Data:
Accuracy: 0.5008
Recall: 0.0000
Precision: 0.0000
F1 Score: 0.0000




  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


- As MultinomialNB worked well on data and word2vec worked well , we do the cross validation on the x_train and y_train which was vectorzied using word2vec

In [76]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

classifier = MultinomialNB()
# Perform 10-fold cross-validation
cv_scores = cross_val_score(classifier, X_train, y_train, cv=10)

print("CV average score: %.2f" % cv_scores.mean())

CV average score: 0.78


In [78]:
classifier.fit(X_train, y_train)
print(classifier.score(X_test , y_test))

0.7984623833058759


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

In [39]:
import pandas as pd

file_path = 'Amazon_Unlocked_Mobile.csv'
data = pd.read_csv(file_path)

data.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


# word to vec

In [41]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download stopwords
import nltk
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = re.sub(r'\W', ' ', str(text))
    text = text.lower()
    text = re.sub(r'\s+[a-z]\s+', ' ', text)
    text = re.sub(r'^[a-z]\s+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    words = word_tokenize(text)
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)

data['Reviews'] = data['Reviews'].apply(preprocess_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shash\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shash\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [42]:
for row in data['Reviews']:
    print(row.split())
    break

['like', 'design', 'features', 'iphone', '5', 'purchased', 'refurb', 'model', 'battery', 'recall', 'list', 'clearly', 'serious', 'battery', 'life', 'issues', 'returning', 'hooray', 'amazon', 'return', 'policy', 'purchase', 'new', 'one', 'different', 'amazon', 'vendor']


In [43]:
from gensim.models import Word2Vec

sentences = [row.split() for row in data['Reviews']]
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

def get_average_word2vec(tokens_list, vector, generate_missing=False, k=100):
    if len(tokens_list) < 1:
        return np.zeros(k)
    if generate_missing:
        vectorized = [vector[word] if word in vector else np.random.rand(k) for word in tokens_list]
    else:
        vectorized = [vector[word] if word in vector else np.zeros(k) for word in tokens_list]
    length = len(vectorized)
    summed = np.sum(vectorized, axis=0)
    averaged = np.divide(summed, length)
    return averaged

data['w2v'] = data['Reviews'].apply(lambda x: get_average_word2vec(x.split(), w2v_model.wv))

In [48]:
from sklearn.cluster import KMeans, DBSCAN
from scipy.cluster.hierarchy import linkage, dendrogram

def apply_clustering(X, method='kmeans'):
    if method == 'kmeans':
        kmeans = KMeans(n_clusters=5)
        kmeans.fit(X)
        return kmeans.labels_
    elif method == 'dbscan':
        dbscan = DBSCAN(eps=0.5, min_samples=5)
        dbscan.fit(X)
        return dbscan.labels_
    elif method == 'hierarchical':
        Z = linkage(X, 'ward')
        return Z

# Example usage with Word2Vec vectors
X_w2v = np.array(list(data['w2v']))
labels_kmeans_w2v = apply_clustering(X_w2v, method='kmeans')

  super()._check_params_vs_input(X, default_n_init=10)


In [49]:
labels_kmeans_dbscan = apply_clustering(X_w2v, method='dbscan')

In [50]:
labels_kmeans_hierarchical = apply_clustering(X_w2v, method='hierarchical')

In [51]:
labels_kmeans_w2v , labels_kmeans_dbscan , labels_kmeans_hierarchical

(array([0, 0, 0, ..., 3, 0, 2]),
 array([0, 0, 0, ..., 0, 0, 0], dtype=int64),
 array([[2.96900000e+03, 5.91600000e+03, 0.00000000e+00, 2.00000000e+00],
        [3.87100000e+03, 9.61700000e+03, 0.00000000e+00, 2.00000000e+00],
        [6.67000000e+02, 7.01800000e+03, 0.00000000e+00, 2.00000000e+00],
        ...,
        [1.99890000e+04, 1.99940000e+04, 4.55405812e+01, 6.73600000e+03],
        [1.99930000e+04, 1.99960000e+04, 6.46716096e+01, 7.77300000e+03],
        [1.99950000e+04, 1.99970000e+04, 1.32225683e+02, 1.00000000e+04]]))

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [82]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''while compared to other assignmnets , I feel okay with this assignmnet as its model build and its more links are given for refence which made the assignmnet easy'''

'while compared to other assignmnets , I feel okay with this assignmnet as its model build and its more links are given for refence which made the assignmnet easy'