# Comparison of Word Embedding and Sentence Embedding Techniques

## Introduction
Your CTO has asked you to build a machine-learning model that analyses reviews of the company's products and classifies them as positive or negative. In this exercise, we compare two popular text embedding techniques, word embeddings (Word2Vec) and sentence embeddings (Sentence-BERT), on a dataset of product reviews. We will see how each kind of embedding is generated and how it can be used as input features for a text classification task.

## Step 1: Install and Import Necessary Libraries

In [None]:
# Install the Sentence Transformers library for sentence-level embeddings (e.g., BERT-based)
!pip install -U sentence-transformers -q

# Install Gensim library, used for Word2Vec and other word embedding models
!pip install gensim -q

# Importing core Python libraries for data manipulation and visualization
import pandas as pd       # For working with structured data (dataframes, CSVs, etc.)
import numpy as np        # For numerical operations, arrays, and math functions
import matplotlib.pyplot as plt  # For plotting charts
import seaborn as sns     # For attractive statistical visualizations

# Importing Scikit-learn tools for model evaluation and splitting
from sklearn.model_selection import train_test_split  # To split data into train/test sets
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score  # To evaluate model performance
from sklearn.linear_model import LogisticRegression  # Logistic regression classifier

# For Word Embedding (Word2Vec)
from gensim.models import Word2Vec  # Gensim's implementation of Word2Vec
from nltk.tokenize import word_tokenize  # Tokenizer to split text into words

# For Sentence Embedding using pretrained transformer-based models
from sentence_transformers import SentenceTransformer  # Easily get embeddings for full sentences

# Download necessary NLTK data for tokenization
import nltk
nltk.download('punkt')        # Punkt tokenizer models in the older pickle format (used by older NLTK releases)
nltk.download('punkt_tab')    # Punkt tokenizer data in the newer tabular format (needed by word_tokenize in recent NLTK releases)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Step 2: Load and Preprocess Data

In [None]:
# Load the dataset from a GitHub URL
# The dataset is expected to have product reviews and sentiment labels
url = "https://raw.githubusercontent.com/wei-research/AdvNLP/main/Product_Review_Dataset.csv"
data = pd.read_csv(url)  # Read the CSV file into a pandas DataFrame

# Clean the review text to prepare for embedding
# Step 1: Convert all characters in the 'clean_comment' column to lowercase
data['clean_comment'] = data['clean_comment'].str.lower()

# Step 2: Remove all characters that are not letters or whitespace
# This removes punctuation, numbers, and special characters
data['clean_comment'] = data['clean_comment'].str.replace(r'[^a-zA-Z\s]', '', regex=True)  # Raw string avoids an invalid-escape warning for \s

# Display the first few rows of the cleaned dataset to verify changes
data.head()


Unnamed: 0,clean_comment,category
0,this smartphone has an amazing camera and batt...,1
1,the headphones stopped working after just a we...,0
2,the laptop is fast lightweight and perfect for...,1
3,these shoes are very uncomfortable and started...,0
4,i love this coffee maker it brews quickly and ...,1
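
To make the effect of the two cleaning steps concrete, here is a minimal sketch that applies the same lowercase-then-regex logic to a single string with the standard `re` module. The helper name `clean_review` and the sample review are made up for illustration; they are not part of the dataset or the notebook.

```python
import re

def clean_review(text):
    """Lowercase the text, then drop every character that is not a letter or whitespace."""
    return re.sub(r"[^a-zA-Z\s]", "", text.lower())

# Hypothetical review text, not taken from the dataset
sample = "Battery lasts 2 days, LOVE it!!!"
print(repr(clean_review(sample)))

# Apostrophes are removed as well, so contractions are merged into one token
print(repr(clean_review("Don't buy")))
```

Note that removing the digit leaves two consecutive spaces behind, and that "don't" becomes "dont". Neither breaks the tokenizer, but both are worth knowing when inspecting the cleaned text.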


## Step 3: Word Embedding with Word2Vec

In [None]:
# Tokenize each cleaned comment into individual words (tokens)
# Example: "great product" → ['great', 'product']
data['tokens'] = data['clean_comment'].apply(word_tokenize)

# Train a Word2Vec model on the tokenized comments
# - vector_size=100: each word will be represented by a 100-dimensional vector
# - window=5: context window size (5 words to the left and right)
# - min_count=1: include all words, even those that appear only once
# - workers=4: use 4 threads for parallel training
word2vec_model = Word2Vec(sentences=data['tokens'], vector_size=100, window=5, min_count=1, workers=4)

# Define a function to get the Word2Vec embedding for a given list of tokens
# For each comment, it averages the word vectors of all known words in the comment
def get_word2vec_embedding(tokens):
    # Keep only the vectors of words that are in the model's vocabulary
    vectors = [word2vec_model.wv[word] for word in tokens if word in word2vec_model.wv]
    # A comment with no known words has nothing to average (np.mean would return NaN),
    # so fall back to a zero vector to keep every row the same shape
    if not vectors:
        return np.zeros(word2vec_model.wv.vector_size)
    return np.mean(vectors, axis=0)  # Returns a 100-dimensional vector per comment

# Apply the embedding function to each tokenized comment
# Adds a new column with the averaged word vector per comment
data['word2vec_embeddings'] = data['tokens'].apply(get_word2vec_embedding)

# Show the first few comments alongside their generated Word2Vec embeddings
data[['clean_comment', 'word2vec_embeddings']].head()


Unnamed: 0,clean_comment,word2vec_embeddings
0,this smartphone has an amazing camera and batt...,"[0.002153913, 7.228344e-05, -0.00026805827, -0..."
1,the headphones stopped working after just a we...,"[-0.00023075377, 0.0023876259, 0.0005207668, -..."
2,the laptop is fast lightweight and perfect for...,"[0.0019213614, -0.0011442981, 0.0020381378, 0...."
3,these shoes are very uncomfortable and started...,"[-0.0033436287, 0.0010039428, -0.0008265797, -..."
4,i love this coffee maker it brews quickly and ...,"[-0.0022549175, 0.00081933977, 0.0010736331, 0..."
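
The averaging step can be illustrated without Gensim. The sketch below is a plain-Python stand-in under stated assumptions: `mean_pool` and the two-dimensional `toy_vectors` are hypothetical names and values chosen for readability, not the notebook's actual model. It shows two properties of mean pooling: unknown words are skipped, and word order has no effect on the result.

```python
def mean_pool(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`; return zeros if none match."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values dimension by dimension
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical 2-dimensional word vectors, for illustration only
toy_vectors = {
    "great": [1.0, 0.0],
    "product": [0.0, 1.0],
    "not": [-1.0, 0.5],
}

# "unseen" is not in the vocabulary, so it is ignored
print(mean_pool(["great", "product", "unseen"], toy_vectors, dim=2))  # [0.5, 0.5]

# Word order is lost: both orderings give the same averaged vector
print(mean_pool(["not", "great"], toy_vectors, dim=2) ==
      mean_pool(["great", "not"], toy_vectors, dim=2))  # True
```

Because the average ignores order, a classifier built on these features cannot tell "not great" apart from "great not", which is one reason averaged word vectors can lose information that a sentence-level model keeps.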


## Step 4: Sentence Embedding with Sentence-BERT

In [None]:
# Load a pre-trained sentence transformer model for generating sentence-level embeddings
# 'paraphrase-MiniLM-L6-v2' is a compact pre-trained model hosted on the Hugging Face Hub, suitable for sentence similarity and classification
sentence_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Generate sentence embeddings for each cleaned comment
# This model captures the overall semantic meaning of the entire sentence
# encode() accepts a list of texts and processes them in batches, which is faster than calling it once per row
data['sentence_embeddings'] = list(sentence_model.encode(data['clean_comment'].tolist()))

# Display the first few comments with their corresponding sentence embeddings
data[['clean_comment', 'sentence_embeddings']].head()




Unnamed: 0,clean_comment,sentence_embeddings
0,this smartphone has an amazing camera and batt...,"[-0.62249535, 0.3684942, -0.06423672, -0.17897..."
1,the headphones stopped working after just a we...,"[-0.14725561, -0.5442201, 0.22721826, -0.27950..."
2,the laptop is fast lightweight and perfect for...,"[-0.50854665, 0.40044096, -0.067704774, 0.1941..."
3,these shoes are very uncomfortable and started...,"[-0.62390167, -0.012481386, 0.34501055, 0.3130..."
4,i love this coffee maker it brews quickly and ...,"[-0.8667885, -0.48353273, -0.77278084, 0.39238..."
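
One common way to inspect embeddings like these is cosine similarity, which compares the direction of two vectors. The sketch below is a minimal plain-Python version with made-up vectors; the function name `cosine_similarity` is defined here for illustration and is not the notebook's code. In the notebook itself, one could pass two entries of `data['sentence_embeddings']` instead of the toy lists.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat its similarity as 0.0
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical vectors, for illustration only
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Values close to 1 indicate vectors pointing the same way, values near 0 indicate unrelated directions, and negative values indicate opposite directions.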


## Step 5: Compare Classification Performance

In [None]:
# Prepare feature arrays for classification
# Convert lists of vectors into 2D NumPy arrays for both embedding types
X_word2vec = np.vstack(data['word2vec_embeddings'].values)  # Shape: (num_samples, 100)
X_sentence = np.vstack(data['sentence_embeddings'].values)  # Shape: (num_samples, 384)

# Target variable: sentiment labels (in the preview above, 1 marks positive reviews and 0 marks negative ones)
y = data['category']

# Split Word2Vec-based features into training and test sets (80/20 split)
X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(
    X_word2vec, y, test_size=0.2, random_state=42
)

# Split Sentence-BERT-based features into training and test sets
# (same test_size and random_state as above, so both models are evaluated on the same reviews)
X_train_sent, X_test_sent, y_train_sent, y_test_sent = train_test_split(
    X_sentence, y, test_size=0.2, random_state=42
)

# -----------------------------
# Train and Evaluate Word2Vec Model
# -----------------------------

# Initialize a logistic regression classifier for Word2Vec features
model_w2v = LogisticRegression()

# Fit the model on the training data
model_w2v.fit(X_train_w2v, y_train_w2v)

# Predict sentiment categories on the test data
y_pred_w2v = model_w2v.predict(X_test_w2v)

# -----------------------------
# Train and Evaluate Sentence-BERT Model
# -----------------------------

# Initialize a logistic regression classifier for Sentence-BERT features
model_sent = LogisticRegression()

# Fit the model on the training data
model_sent.fit(X_train_sent, y_train_sent)

# Predict sentiment categories on the test data
y_pred_sent = model_sent.predict(X_test_sent)

# -----------------------------
# Evaluation Metrics for Word2Vec
# -----------------------------

# Print accuracy score for the Word2Vec model
print("Word2Vec Model Accuracy:", accuracy_score(y_test_w2v, y_pred_w2v))

# Print detailed classification metrics (precision, recall, F1-score)
print("Word2Vec Classification Report:")
print(classification_report(y_test_w2v, y_pred_w2v))

# -----------------------------
# Evaluation Metrics for Sentence-BERT
# -----------------------------

# Print accuracy score for the Sentence-BERT model
print("Sentence-BERT Model Accuracy:", accuracy_score(y_test_sent, y_pred_sent))

# Print detailed classification metrics for Sentence-BERT
print("Sentence-BERT Classification Report:")
print(classification_report(y_test_sent, y_pred_sent))


Word2Vec Model Accuracy: 0.9
Word2Vec Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.80      0.89         5
           1       0.83      1.00      0.91         5

    accuracy                           0.90        10
   macro avg       0.92      0.90      0.90        10
weighted avg       0.92      0.90      0.90        10

Sentence-BERT Model Accuracy: 1.0
Sentence-BERT Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         5
           1       1.00      1.00      1.00         5

    accuracy                           1.00        10
   macro avg       1.00      1.00      1.00        10
weighted avg       1.00      1.00      1.00        10
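
The numbers in the Word2Vec report can be reproduced by hand, which helps when reading `classification_report` output. The counts below are inferred from the report rather than printed by the notebook: a recall of 0.80 on 5 negative reviews means 4 were caught and 1 was predicted as positive, and a recall of 1.00 on 5 positive reviews means all 5 were caught. The helper `per_class_metrics` is a hypothetical name used only in this sketch.

```python
def per_class_metrics(tp, fp, fn):
    """Compute precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Counts inferred from the Word2Vec report (assumption, see text above):
# class 0: 4 correct, 0 reviews wrongly labelled 0, 1 missed
# class 1: 5 correct, 1 review wrongly labelled 1, 0 missed
counts = {0: (4, 0, 1), 1: (5, 1, 0)}

for label, (tp, fp, fn) in counts.items():
    p, r, f = per_class_metrics(tp, fp, fn)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Accuracy is the share of correct predictions over all 10 test reviews
print("accuracy:", (4 + 5) / 10)
```

The printed values match the report: 1.00 / 0.80 / 0.89 for class 0, 0.83 / 1.00 / 0.91 for class 1, and an accuracy of 0.9. In other words, the entire gap between the two models comes down to a single misclassified review.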



## Conclusion
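We compared word embeddings (Word2Vec) with sentence embeddings (Sentence-BERT) by using each as input features for a logistic regression classifier on product reviews. On the 10-review test set, the Sentence-BERT features classified every review correctly, while the averaged Word2Vec features misclassified one review (accuracy 1.0 versus 0.9). This is consistent with the general expectation that sentence embeddings capture more contextual and semantic information than averaged word vectors, although a test set this small cannot establish the difference reliably. Keep in mind as well that the Sentence-BERT model is pre-trained, whereas the Word2Vec model here was trained from scratch on this dataset alone. With those caveats noted, you can now present the model and your findings to your CTO.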
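
Since the reports above are based on only 10 test reviews, it can help to attach an uncertainty range to each accuracy before presenting it. The sketch below uses the Wilson score interval, a standard approximation for a binomial proportion. It is an addition for illustration, not part of the original analysis, and the function name `wilson_interval` and the 95% level (`z=1.96`) are choices made here.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate confidence interval for an accuracy of correct/total (Wilson score)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to guard against tiny floating-point overshoot
    return max(0.0, center - half), min(1.0, center + half)

# 9 of 10 correct (Word2Vec) and 10 of 10 correct (Sentence-BERT), as in the reports
for name, correct in [("Word2Vec", 9), ("Sentence-BERT", 10)]:
    low, high = wilson_interval(correct, 10)
    print(f"{name}: accuracy {correct / 10:.1f}, approx. 95% interval [{low:.2f}, {high:.2f}]")
```

The two intervals overlap substantially, so a larger test set or cross-validation would be needed before claiming that one embedding is reliably better on this data.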