
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
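
The split-then-cross-validate procedure described above can be sketched as follows. This is a minimal illustration, not the required solution: the toy corpus is hypothetical and stands in for the real training file, and `MultinomialNB` is used only as an example classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the real training file
# (1 = positive, 0 = negative).
positive = [f"a wonderful and moving film, take {i}" for i in range(30)]
negative = [f"a dull and tedious film, take {i}" for i in range(30)]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# 80% training / 20% validation, stratified so both classes keep their ratio.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Keeping the vectorizer inside the pipeline means it is refit on the
# training part of every fold, so the held-out fold never shapes the vocabulary.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 10-fold stratified cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
print(f"10-fold F1: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")

# Refit on the full training portion, then check the validation set.
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

Swapping `MultinomialNB()` for another estimator leaves the rest of the sketch unchanged, which makes it straightforward to compare classifiers under the same folds.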


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
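
As a reminder of what the four measurements mean, here is a small sketch that computes them from the confusion counts of a binary classifier. The helper name `binary_metrics` and the example labels are made up for illustration; in practice `sklearn.metrics` provides the same quantities.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, recall, precision and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical labels: 3 true positives, 3 true negatives,
# 1 false positive and 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```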


In [1]:
# First, import all the necessary libraries:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.pipeline import make_pipeline

# Load a "<label> <text>" file into a DataFrame:
def preprocess_input(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
        data = {'label': [], 'text': []}
        for line in lines:
            # Split on the first space assuming the label is the first character
            split_line = line.split(' ', 1)
            if len(split_line) == 2:  # Ensure there is a text part after splitting
                data['label'].append(int(split_line[0]))  # Label is the first part as an integer
                data['text'].append(split_line[1].strip())  # Text is the second part, stripped of extra spaces/newlines
    return pd.DataFrame(data)


# Train several classifiers and report their metrics on a validation split:
def evaluate_models(data, labels):
    # Split the data into training (80%) and validation (20%) sets:
    X_train, X_validate, y_train, y_validate = train_test_split(data, labels, test_size=0.2, random_state=42, stratify=labels)

    # Vectorize the text; the vocabulary is fit on the training split only:
    vectorizer = CountVectorizer(stop_words='english', max_features=10000)
    X_train = vectorizer.fit_transform(X_train)
    X_validate = vectorizer.transform(X_validate)

    # Define the classifiers:
    classifiers = {
        'MultinomialNB': MultinomialNB(),
        'SVM': SVC(),
        'KNN': KNeighborsClassifier(),
        'Decision Tree': DecisionTreeClassifier(),
        'Random Forest': RandomForestClassifier()
    }

    # Train each classifier and print its validation metrics:
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_validate)
        print(f"{name} - Accuracy: {accuracy_score(y_validate, y_pred):.2f}")
        print(f"      - Recall: {recall_score(y_validate, y_pred):.2f}")
        print(f"      - Precision: {precision_score(y_validate, y_pred):.2f}")
        print(f"      - F1 Score: {f1_score(y_validate, y_pred):.2f}")


train_data = preprocess_input('/content/stsa-train.txt')
test_data = preprocess_input('/content/stsa-test.txt')

evaluate_models(train_data['text'], train_data['label'])


MultinomialNB - Accuracy: 0.76
      - Recall: 0.81
      - Precision: 0.75
      - F1 Score: 0.78
SVM - Accuracy: 0.74
      - Recall: 0.75
      - Precision: 0.75
      - F1 Score: 0.75
KNN - Accuracy: 0.55
      - Recall: 0.56
      - Precision: 0.57
      - F1 Score: 0.56
Decision Tree - Accuracy: 0.64
      - Recall: 0.75
      - Precision: 0.64
      - F1 Score: 0.69
Random Forest - Accuracy: 0.71
      - Recall: 0.76
      - Precision: 0.70
      - F1 Score: 0.73
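
The scores above are measured on the validation split. The question also asks for a final evaluation on the test file, which could look roughly like the sketch below. The helper `evaluate_on_test` and the toy sentences are hypothetical stand-ins; only three of the classifiers are shown to keep the example short.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def evaluate_on_test(train_texts, train_labels, test_texts, test_labels):
    """Fit each classifier on the training data and score it on the test data."""
    classifiers = {
        "MultinomialNB": MultinomialNB(),
        "SVM": SVC(),
        "Random Forest": RandomForestClassifier(random_state=42),
    }
    results = {}
    for name, clf in classifiers.items():
        # The vectorizer is fit on the training texts only.
        model = make_pipeline(CountVectorizer(), clf)
        model.fit(train_texts, train_labels)
        y_pred = model.predict(test_texts)
        results[name] = {
            "accuracy": accuracy_score(test_labels, y_pred),
            "recall": recall_score(test_labels, y_pred, zero_division=0),
            "precision": precision_score(test_labels, y_pred, zero_division=0),
            "f1": f1_score(test_labels, y_pred, zero_division=0),
        }
    return results


# Hypothetical toy data standing in for the training and test files.
train_texts = [f"a wonderful and moving film, take {i}" for i in range(20)]
train_texts += [f"a dull and tedious film, take {i}" for i in range(20)]
train_labels = [1] * 20 + [0] * 20
test_texts = ["wonderful film", "moving story", "dull plot", "tedious film"]
test_labels = [1, 1, 0, 0]

results = evaluate_on_test(train_texts, train_labels, test_texts, test_labels)
for name, scores in results.items():
    print(name, {metric: round(value, 2) for metric, value in scores.items()})
```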


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd

def preprocess_input(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
        data = {'label': [], 'text': []}
        for line in lines:
            split_line = line.split(' ', 1)
            if len(split_line) == 2:
                data['label'].append(int(split_line[0]))
                data['text'].append(split_line[1].strip())
    return pd.DataFrame(data)

train_data = preprocess_input('/content/stsa-train.txt')

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X = vectorizer.fit_transform(train_data['text'])

# Apply KMeans clustering
num_clusters = 2  # Assuming we want to find 2 clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(X)

# Predict the cluster for each data point
y_cluster_kmeans = kmeans.predict(X)

# Calculate Silhouette Score
silhouette_avg_kmeans = silhouette_score(X, y_cluster_kmeans)
print('Silhouette Score:', silhouette_avg_kmeans)

# DBSCAN
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Reload the data (preprocess_input is defined at the top of this cell)
train_data = preprocess_input('/content/stsa-train.txt')

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X = vectorizer.fit_transform(train_data['text'])

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
cluster_labels = dbscan.fit_predict(X)

# Examine cluster assignment and noise
train_data['cluster'] = cluster_labels
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
num_noise = list(cluster_labels).count(-1)

print(f"Number of clusters: {num_clusters}")
print(f"Number of noise points: {num_noise}")

# Optionally, calculate the silhouette score.
# Note: noise points (label -1) are NOT excluded here; they are scored as if they formed one more cluster.
if num_clusters > 1:
    silhouette_avg = silhouette_score(X, cluster_labels)
    print(f"Silhouette Score: {silhouette_avg}")

# Display head of the DataFrame to see some of the cluster assignments
print(train_data.head(20))

# Hierarchical clustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import pandas as pd

# Reload the data (preprocess_input is defined at the top of this cell)
train_data = preprocess_input('/content/stsa-train.txt')

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X = vectorizer.fit_transform(train_data['text'])

# Apply Agglomerative Clustering
# The choice of the number of clusters (n_clusters) could be informed by prior knowledge about the dataset or determined using methods like the elbow method.
n_clusters = 2
clustering = AgglomerativeClustering(n_clusters=n_clusters)
clustering.fit(X.toarray())  # Convert to dense array if the data size is manageable

# Calculate the Silhouette Score
silhouette_avg = silhouette_score(X.toarray(), clustering.labels_)
print(f"Silhouette Score: {silhouette_avg}")

# Word2Vec
import pandas as pd
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Load and preprocess the data
train_data = preprocess_input('/content/stsa-train.txt')

# Tokenize the documents
train_data['tokenized_text'] = train_data['text'].apply(lambda x: word_tokenize(x.lower()))

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=train_data['tokenized_text'], vector_size=100, window=5, min_count=1, workers=4)

# Create document-level embeddings by averaging word embeddings
def document_vector(doc):
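    # Remove out-of-vocabulary words (key_to_index is a dict, so the lookup is O(1))
    doc = [word for word in doc if word in word2vec_model.wv.key_to_index]
    # Average the word vectors; fall back to a zero vector for an empty document
    return np.mean(word2vec_model.wv[doc], axis=0) if doc else np.zeros(word2vec_model.vector_size)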

train_data['doc_vector'] = train_data['tokenized_text'].apply(document_vector)

# Create a matrix of document vectors
doc_vectors = np.vstack(train_data['doc_vector'])

# Apply KMeans clustering
num_clusters = 2  # Assuming we want to try clustering into positive and negative sentiment
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(doc_vectors)

# Silhouette Score
silhouette_avg = silhouette_score(doc_vectors, kmeans.labels_)
print(f"Silhouette Score: {silhouette_avg}")

# BERT
from transformers import BertTokenizer, BertModel
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import torch
import pandas as pd
import numpy as np

# Function to get embeddings in batches
def get_bert_embeddings(model, tokenizer, texts, batch_size=8):
    model.eval()
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        encoded_input = tokenizer(batch_texts, padding=True, truncation=True, return_tensors='pt', max_length=128)
        encoded_input = encoded_input.to(device)
        with torch.no_grad():
            output = model(**encoded_input)
        # Mean-pool over all token positions (note: padding positions are included in this mean)
        batch_embeddings = output.last_hidden_state.mean(dim=1).detach().cpu().numpy()
        embeddings.append(batch_embeddings)
    return np.vstack(embeddings)


# Reload the data (preprocess_input is defined at the top of this cell)
train_data = preprocess_input('/content/stsa-train.txt')

# Initialize BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Move model to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Get BERT embeddings in batches
embeddings = get_bert_embeddings(model, tokenizer, train_data['text'].tolist(), batch_size=8)  # Adjust batch size based on GPU memory

# Apply KMeans clustering
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(embeddings)

# Calculate the Silhouette Score
silhouette_avg = silhouette_score(embeddings, kmeans.labels_)
print(f"Silhouette Score: {silhouette_avg}")




Silhouette Score: 0.002837430743918387
Number of clusters: 2
Number of noise points: 6897
Silhouette Score: -0.28718194476151976
    label                                               text  cluster
0       1  a stirring , funny and finally transporting re...       -1
1       0  apparently reassembled from the cutting-room f...       -1
2       0  they presume their audience wo n't sit still f...       -1
3       1  this is a visually stunning rumination on love...       -1
4       1  jonathan parker 's bartleby should have been t...       -1
5       1  campanella gets the tone just right -- funny i...       -1
6       0  a fan film that for the uninitiated plays bett...       -1
7       1  béart and berling are both superb , while hupp...       -1
8       0  a little less extreme than in the past , with ...       -1
9       0                     the film is strictly routine .       -1
10      1  a lyrical metaphor for cultural and personal s...       -1
11      0  the most repugnant a
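
Almost every review is labeled as noise in the DBSCAN output above. One likely reason is that `TfidfVectorizer` L2-normalizes its rows by default, so `eps=0.5` with Euclidean distance only links reviews whose cosine similarity is at least about 0.875, which is rare for short texts. The silhouette score above also counts the noise label `-1` as a cluster. The sketch below, on hypothetical 2-D points rather than the review data, shows one way to leave noise out of the score; for text, tuning `eps` or using `metric='cosine'` would be worth trying.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Hypothetical 2-D points: two tight groups plus one far-away outlier.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, 20.0],
])

cluster_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

# DBSCAN marks noise with the label -1, which is not a real cluster.
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise = int(np.sum(cluster_labels == -1))
print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")

# Score only the points that were actually assigned to a cluster.
score = None
mask = cluster_labels != -1
if n_clusters > 1:
    score = silhouette_score(points[mask], cluster_labels[mask])
    print(f"Silhouette Score (noise excluded): {score:.3f}")
```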


Silhouette Score: 0.5105365514755249
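
The Word2Vec score above comes from clustering one vector per review, obtained by averaging the vectors of its words. A minimal sketch of that averaging step is shown below with a hypothetical three-dimensional embedding table instead of a trained gensim model.

```python
import numpy as np

# Hypothetical embedding table; a trained Word2Vec model would supply these vectors.
embeddings = {
    "great": np.array([1.0, 0.0, 0.0]),
    "film": np.array([0.0, 1.0, 0.0]),
    "boring": np.array([-1.0, 0.0, 0.0]),
}
vector_size = 3


def document_vector(tokens, embeddings, vector_size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)


# "unseen" is out of vocabulary, so it is skipped.
great_film = document_vector(["great", "film", "unseen"], embeddings, vector_size)
empty_doc = document_vector(["unseen"], embeddings, vector_size)
print(great_film)
print(empty_doc)
```

Averaging discards word order, so two reviews that use the same words in different arrangements end up with the same vector.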



**Write your response here:**





Comparing the clustering approaches by their Silhouette Scores on the sentiment dataset gives a mixed picture. K-means on TF-IDF features scored 0.0028, which is very close to zero and suggests heavily overlapping clusters. DBSCAN performed worst: it labeled 6,897 reviews as noise, and its negative score of -0.287 indicates that many points sit closer to another group than to their own. Hierarchical clustering scored essentially zero (-0.0006), again pointing to overlapping or unstructured clusters. Word2Vec embeddings clustered with K-means scored highest at 0.51, indicating compact, well-separated clusters. BERT embeddings clustered with K-means scored 0.099, which is higher than K-means on TF-IDF, DBSCAN, and hierarchical clustering but well below Word2Vec. Overall, by Silhouette Score, the Word2Vec embeddings produced the most distinct and well-separated clusters.
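
To make the interpretation of these scores concrete, the sketch below computes the silhouette coefficient by hand for hypothetical 2-D points. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`. The function name `silhouette_manual` is made up for this illustration and assumes at least two clusters.

```python
import numpy as np


def silhouette_manual(X, labels):
    """Per-point silhouette coefficients using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # a singleton cluster gets a score of 0 by convention
        # Mean distance to the other members of the same cluster
        # (the distance of the point to itself is 0, so divide by n_same - 1).
        a = dists[i, same].sum() / (n_same - 1)
        # Mean distance to the closest other cluster.
        b = min(
            dists[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores


# Hypothetical points forming two clearly separated groups.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_manual(X, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_manual(X, [0, 1, 0, 1])   # labels cut across the groups
print(f"Matching labels:    mean silhouette = {good.mean():.3f}")
print(f"Mismatched labels:  mean silhouette = {bad.mean():.3f}")
```

Labels that match the geometry give a mean close to 1, while labels that cut across the groups give a negative mean, which is the pattern used above to read the K-means, DBSCAN, and Word2Vec scores.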

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



This exercise covered two machine learning tasks: text classification and text clustering. I faced a few challenges along the way. Preprocessing was somewhat difficult, since the success of any ML model depends on how well the text is processed. I also had trouble choosing the right feature extraction method, which affects the performance of the clustering algorithms.
From these tasks I learned how machine learning libraries work, and I tried to understand how the different algorithms work and their implications. In particular, I learned to resolve errors such as data path errors and mismatches.
It would be better if there were more thorough error handling, parameter tuning, and comparative analysis.