# **Sentiment Classification Using Embeddings and Classical ML Models**

This project focuses on sentiment classification using Amazon, IMDb, and Yelp reviews. We preprocess the text, generate embeddings with SentenceTransformer, and train two models (Logistic Regression and Random Forest) to compare performance across two embedding strategies: averaged word-level vectors and whole-sentence embeddings.

In [72]:
from google.colab import userdata

import spacy

import pandas as pd
import numpy as np

from openai import OpenAI

import nltk
from nltk.corpus import stopwords

import re
import string

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet

from sklearn.model_selection import train_test_split
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

#nltk.download('all')
#nltk.download('wordnet')
#nltk.download('averaged_perceptron_tagger')

#nltk.download('punkt')
#nltk.download('stopwords')


# **Setup and Data Preparation**



In [90]:
#Authenticate with OpenAI

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
client = OpenAI(api_key=OPENAI_API_KEY)




In [75]:
#Load and preprocess Amazon, IMDB, and Yelp datasets

dfa = pd.read_csv('amazon5.txt', sep='\t', names=['review', 'label'], header=None, encoding='utf-8')
dfi = pd.read_csv('imdb5.txt', sep='\t', names=['review', 'label'], header=None, encoding='utf-8')
dfy = pd.read_csv('yelp5.txt', sep='\t', names=['review', 'label'], header=None, encoding='utf-8')

dfi['label'] = dfi['review'].str.split().str[-1]
dfi['review'] = dfi['review'].str.split().str[:-1]
dfi['review'] = dfi['review'].apply(lambda x: ' '.join(x))
df = pd.concat([dfa, dfi, dfy], ignore_index=True)

df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  3000 non-null   object
 1   label   3000 non-null   object
dtypes: object(2)
memory usage: 47.0+ KB
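
The IMDb file is handled differently from the other two: its label is recovered from the last whitespace-separated token of each line. The snippet below is a minimal sketch of that idea in plain Python. The sample lines and the helper name `split_trailing_label` are hypothetical, and unlike the pandas version (which splits and re-joins on single spaces) `rsplit` leaves the internal whitespace of the review untouched.

```python
# Hypothetical sample lines; the real imdb5.txt contents are not shown here.
lines = [
    "A very, very slow-moving film 0",
    "Loved every minute of it 1",
]

def split_trailing_label(line):
    """Split a line into (review, label), using the last token as the label."""
    review, label = line.rsplit(maxsplit=1)
    return review, int(label)

parsed = [split_trailing_label(line) for line in lines]
for review, label in parsed:
    print(label, "|", review)
```

In the notebook the recovered labels stay as strings, which is why the `label` column has dtype `object` until it is cast with `astype(int)` later on.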


In [76]:
df.head()


|   | review | label |
|---|--------|-------|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


# **Text Preprocessing and Tokenization**

In [77]:
# Text cleaning and lemmatization functions

negwords = [ 'no', 'nor', 'not', 'ain', 'aren', "aren't", 'don', "don't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
mystopwords = list(set(stopwords.words('english')) - set(negwords))
wnl = WordNetLemmatizer()

def get_wordnet_pos(tag):
  """
  Maps a part-of-speech tag to a WordNet-compatible POS tag.

  Args:
      tag (str): POS tag from nltk.pos_tag.

  Returns:
      str: WordNet-compatible POS tag (e.g., wordnet.ADJ).
  """

  if tag.startswith('J'):
    return wordnet.ADJ
  elif tag.startswith('V'):
    return wordnet.VERB
  elif tag.startswith('N'):
    return wordnet.NOUN
  elif tag.startswith('R'):
    return wordnet.ADV
  else:
    return wordnet.NOUN

def clean_tok(doc):
  """
  Cleans and tokenizes a text string. Replaces non-letter characters with
  spaces, lowercases, removes stopwords (keeping negation words) and
  single-character tokens, normalizes repeated characters, then POS-tags
  and lemmatizes.

  Args:
      doc (str): Raw text to process.

  Returns:
      list: List of cleaned and lemmatized tokens.
  """
  # Replace anything that is not a letter or whitespace with a space
  doc = re.sub(r'[^A-Za-z\s]', ' ', doc)
  tokens = [word.lower() for word in doc.split()]
  tokens = [word for word in tokens if word not in mystopwords]
  tokens = [word for word in tokens if len(word) > 1]

  # Collapse runs of three or more identical characters to two (e.g., "sooo" -> "soo")
  tokens = [re.sub(r"(.)\1{2,}", r"\1\1", word) for word in tokens]

  # POS-tag the tokens so the lemmatizer can use the right word class
  tagged_tokens = pos_tag(tokens)

  tokens = [wnl.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_tokens]
  return tokens


In [78]:
# Preview of cleaned and tokenized reviews

Y = df.label
Y = Y.astype(int)
X = df.review

Xclean = [clean_tok(x) for x in X]
for x in Xclean[0:5]:
  print(x)

['no', 'way', 'plug', 'u', 'unless', 'go', 'converter']
['good', 'case', 'excellent', 'value']
['great', 'jawbone']
['tie', 'charger', 'conversation', 'last', 'minute', 'major', 'problem']
['mic', 'great']
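
To make the regex steps of `clean_tok` easier to follow, here is a small sketch that reproduces only those steps with the standard library. The helper name `normalize` and the sample sentence are illustrative, and stopword removal, POS tagging, and lemmatization are deliberately left out because they need the NLTK data files.

```python
import re

def normalize(text):
    """Regex-only part of the cleaning: strip non-letters, lowercase,
    drop one-character tokens, and squeeze repeated characters."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    tokens = [w.lower() for w in text.split()]
    tokens = [w for w in tokens if len(w) > 1]
    return [re.sub(r"(.)\1{2,}", r"\1\1", w) for w in tokens]

print(normalize("I don't like it, it is sooooo slow!!!"))
# ['don', 'like', 'it', 'it', 'is', 'soo', 'slow']
```

Because the apostrophe is replaced by a space, "don't" becomes "don" plus a one-character "t" that is then dropped. This is why the negation list needs the bare forms such as 'don' and 'isn': the contracted entries like "don't" can never match after this step.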


# **Train/Validation/Test Split and Dictionary Creation**

In [79]:
# Split data into train, validation, and test sets

x_train, x_test, y_train, y_test = train_test_split(Xclean, Y, test_size=0.3, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=0.5, random_state=42)

print('X,y Train:', len(x_train), len(y_train))
print('X,y Val:', len(x_val), len(y_val))
print('X,y Test', len(x_test), len(y_test))

x_train[0:5]

X,y Train: 2100 2100
X,y Val: 450 450
X,y Test 450 450


[['clip', 'belt', 'deffinitely', 'make', 'feel', 'like', 'cent', 'come'],
 ['keep', 'good', 'work', 'amazon'],
 ['don', 'many', 'word', 'say', 'place', 'everything', 'pretty', 'well'],
 ['disappointed', 'battery'],
 ['film',
  'try',
  'serious',
  'sophisticated',
  'thriller',
  'horror',
  'flick',
  'fail',
  'miserably']]
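
The 2100/450/450 sizes follow from chaining two splits: 30% is held out first, and that held-out part is then halved. The arithmetic can be sketched as below. The function `split_sizes` is hypothetical, and rounding the held-out size up is an assumption about how a fractional `test_size` is resolved (it makes no difference for these numbers).

```python
import math

def split_sizes(n_samples, first_test_size=0.3, second_test_size=0.5):
    """Return (train, val, test) sizes for a two-stage hold-out split."""
    n_holdout = math.ceil(first_test_size * n_samples)  # assumed rounding rule
    n_train = n_samples - n_holdout
    n_test = math.ceil(second_test_size * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

print(split_sizes(3000))  # (2100, 450, 450)
```

This corresponds to a 70/15/15 split of the 3000 reviews.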

In [80]:
# Build vocabulary from training data

vocabulary = Counter()

# Count token frequency in the training set
for sample in x_train:
    vocabulary.update(sample)

# Keep only tokens that appear at least twice
vocabulary = Counter({token: count for token, count in vocabulary.items() if count >= 2})

print('Vocabulary size:')
print('Length of dictionary:', len(vocabulary))
print('\nTop 10 most frequent tokens:')
print(vocabulary.most_common(10))


Vocabulary size:
Length of dictionary: 1475

Top 10 most frequent tokens:
[('not', 208), ('good', 179), ('movie', 159), ('great', 156), ('film', 137), ('phone', 123), ('bad', 108), ('one', 108), ('time', 98), ('make', 95)]


In [81]:
# Filter train, validation, and test sets using the generated vocabulary
# Only retain words that exist in the vocabulary

train_x = [[word for word in sample if word in vocabulary] for sample in x_train]
val_x   = [[word for word in sample if word in vocabulary] for sample in x_val]
test_x  = [[word for word in sample if word in vocabulary] for sample in x_test]

# Preview the first few filtered samples
for sample in train_x[:5]:
    print(sample)


['clip', 'belt', 'make', 'feel', 'like', 'come']
['keep', 'good', 'work', 'amazon']
['don', 'many', 'word', 'say', 'place', 'everything', 'pretty', 'well']
['disappointed', 'battery']
['film', 'try', 'serious', 'thriller', 'horror', 'flick', 'fail']
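
The vocabulary is built from the training split only and then applied to the other splits, so validation and test tokens never influence which words are kept. A toy illustration of the same two steps follows; the token lists are made up for the example.

```python
from collections import Counter

# Hypothetical tokenized samples (not taken from the datasets).
toy_train = [["good", "phone"], ["bad", "phone"], ["good", "battery"]]
toy_val = [["good", "screen"], ["battery", "phone"]]

counts = Counter()
for sample in toy_train:
    counts.update(sample)

# Keep tokens seen at least twice in the training split.
vocab = {tok for tok, n in counts.items() if n >= 2}

# Apply the training vocabulary to another split; unseen or rare tokens are dropped.
filtered_val = [[tok for tok in sample if tok in vocab] for sample in toy_val]

print(sorted(vocab))   # ['good', 'phone']
print(filtered_val)    # [['good'], ['phone']]
```

Note that "battery" is dropped from the validation sample even though it appears in training, because it occurs only once there. The same effect removes 'deffinitely' and 'cent' from the first training review in the output above.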


# **Generate Sentence-Level Embeddings Using Vocabulary-Based Word Vectors**

In [82]:
# Initialize the SentenceTransformer model
model = SentenceTransformer('all-mpnet-base-v2')

# Generate sentence embeddings for the full original dataset (not used in this section; recomputed later for the sentence-embedding experiment)
X_emb = model.encode(X, show_progress_bar=True)

# Create a dictionary of word embeddings for each word in the vocabulary
vector_dict = {}
for word in vocabulary:
    vector_dict[word] = model.encode(word)

print("Vector dictionary size:", len(vector_dict))
print("Vocabulary size:", len(vocabulary))

# Helper function to compute the average embedding of a document
def average_embedding(token_list, embedding_dict):
    """
    Compute the mean embedding vector for a list of tokens using a given embedding dictionary.
    Returns a zero vector if no tokens are found in the dictionary.
    """
    vectors = [embedding_dict[word] for word in token_list if word in embedding_dict]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(next(iter(embedding_dict.values())).shape)

# Compute average embeddings for train, validation, and test sets
trainEmb = np.array([average_embedding(doc, vector_dict) for doc in train_x])
valEmb = np.array([average_embedding(doc, vector_dict) for doc in val_x])
testEmb = np.array([average_embedding(doc, vector_dict) for doc in test_x])



Vector dictionary size: 1475
Vocabulary size: 1475


In [83]:
#Check the shape of the embedding matrices
print("Train Embeddings:", trainEmb.shape)
print("Validation Embeddings:", valEmb.shape)
print("Test Embeddings:", testEmb.shape)


Train Embeddings: (2100, 768)
Validation Embeddings: (450, 768)
Test Embeddings: (450, 768)
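
The averaging step can be checked by hand on a tiny example. The sketch below uses made-up 3-dimensional vectors in place of the 768-dimensional ones, and the helper `mean_vector` is a simplified stand-in for `average_embedding` that takes the dimension explicitly.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for the real embeddings.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "phone": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros when none are known."""
    found = [vectors[t] for t in tokens if t in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

a = mean_vector(["good", "phone", "unknown"], toy_vectors, dim=3)
b = mean_vector(["unknown"], toy_vectors, dim=3)
c = mean_vector(["phone", "good"], toy_vectors, dim=3)

print(a)  # [2. 1. 1.]
print(b)  # [0. 0. 0.]
print(np.array_equal(a, c))  # True
```

The last line shows one limitation of this strategy: the average ignores word order, so two reviews containing the same tokens in a different order get the same vector.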


# **Train and Evaluate Classification Models**


In [84]:
# Logistic Regression

logistic_model = LogisticRegression(
    penalty='l1',
    solver='liblinear',
    max_iter=4000,
    C=0.5,
    random_state=1
)

logistic_model.fit(trainEmb, y_train)

print('LR: Train Accuracy: %.2f%%' % (100 * logistic_model.score(trainEmb, y_train)))
print('LR: Validation Accuracy: %.2f%%' % (100 * logistic_model.score(valEmb, y_val)))


LR: Train Accuracy: 80.33%
LR: Validation Accuracy: 78.67%


In [85]:
# Random Forest Classifier

rf_model = RandomForestClassifier(
    n_estimators=100,
    criterion="gini",
    max_depth=3,
    min_samples_split=15,
    min_samples_leaf=2,
    max_features="sqrt",
    bootstrap=True,
    n_jobs=-1,
    max_leaf_nodes=100,
    random_state=5
)

rf_model.fit(trainEmb, y_train)

print('RF: Train Accuracy: %.2f%%' % (100 * rf_model.score(trainEmb, y_train)))
print('RF: Validation Accuracy: %.2f%%' % (100 * rf_model.score(valEmb, y_val)))

# Note: The Random Forest shows mild overfitting: training accuracy (85.62%) exceeds
# validation accuracy (80.89%) by about 4.7 percentage points. Validation accuracy is
# still slightly above Logistic Regression's 78.67%.


RF: Train Accuracy: 85.62%
RF: Validation Accuracy: 80.89%


# **Model Evaluation on Test Set**

In [86]:
from sklearn.metrics import classification_report, confusion_matrix

# Evaluate Logistic Regression on test set
print("🔹 Logistic Regression - Test Evaluation")
y_pred_lr = logistic_model.predict(testEmb)
print("Classification Report:")
print(classification_report(y_test, y_pred_lr))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))

# Evaluate Random Forest on test set
print("\n🔹 Random Forest - Test Evaluation")
y_pred_rf = rf_model.predict(testEmb)
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))


🔹 Logistic Regression - Test Evaluation
Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.82      0.81       217
           1       0.83      0.79      0.81       233

    accuracy                           0.81       450
   macro avg       0.81      0.81      0.81       450
weighted avg       0.81      0.81      0.81       450

Confusion Matrix:
[[179  38]
 [ 48 185]]

🔹 Random Forest - Test Evaluation
Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.85      0.80       217
           1       0.84      0.74      0.79       233

    accuracy                           0.80       450
   macro avg       0.80      0.80      0.80       450
weighted avg       0.80      0.80      0.80       450

Confusion Matrix:
[[185  32]
 [ 60 173]]
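
The figures in each classification report can be re-derived from the corresponding confusion matrix, where rows are true labels and columns are predicted labels (scikit-learn's convention). The sketch below does this for the Logistic Regression matrix reported above; the helper `per_class_metrics` is written for illustration and is not part of scikit-learn.

```python
def per_class_metrics(cm):
    """Return ({class: (precision, recall)}, accuracy) for a square confusion matrix."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n_classes))
    metrics = {}
    for k in range(n_classes):
        predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
        actual_k = sum(cm[k])                                   # row sum
        metrics[k] = (cm[k][k] / predicted_k, cm[k][k] / actual_k)
    return metrics, correct / total

lr_cm = [[179, 38], [48, 185]]  # Logistic Regression, test set
metrics, accuracy = per_class_metrics(lr_cm)
for label, (precision, recall) in metrics.items():
    print(label, round(precision, 2), round(recall, 2))
print("accuracy", round(accuracy, 2))
# 0 0.79 0.82
# 1 0.83 0.79
# accuracy 0.81
```

Running the same function on the Random Forest matrix `[[185, 32], [60, 173]]` gives an accuracy of 358/450, about 0.80, with the lower recall for class 1 (173/233, about 0.74) coming from the 60 positive reviews predicted as negative.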


# **Generate Sentence Embeddings and Split Dataset**

In [87]:
# Generate sentence embeddings using SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')
X_emb = model.encode(X, show_progress_bar=True)

# Split into train, validation, and test sets (stratified)
X_train, X_temp, y_train, y_temp = train_test_split(X_emb, Y, test_size=0.3, random_state=42, stratify=Y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Ensure correct types
y_train = y_train.astype(int)
y_val = y_val.astype(int)
y_test = y_test.astype(int)
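
Unlike the first experiment, this split passes `stratify`, which keeps the class proportions the same in every part. The idea can be sketched in plain Python as follows. This is a simplified illustration with hypothetical names and toy labels, not scikit-learn's implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(test_fraction * len(indices))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical balanced labels: 10 negatives and 10 positives.
labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts[0], test_counts[1])  # 3 3
```

With stratification each class contributes the same number of held-out samples, which matches the 225/225 supports in the validation reports for this experiment. The unstratified split used earlier produced 217/233 on its test set.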



In [88]:
# Train Logistic Regression model
logistic_model = LogisticRegression(
    penalty='l2',
    solver='liblinear',
    max_iter=4000,
    C=5,
    random_state=1
)

logistic_model.fit(X_train, y_train)
y_pred_lr = logistic_model.predict(X_val)

print("🔹 Logistic Regression Results")
print("Train Accuracy: %.2f%%" % (100 * logistic_model.score(X_train, y_train)))
print("Validation Accuracy: %.2f%%" % (100 * logistic_model.score(X_val, y_val)))
print("\nClassification Report (Validation):")
print(classification_report(y_val, y_pred_lr))

🔹 Logistic Regression Results
Train Accuracy: 96.57%
Validation Accuracy: 95.78%

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       225
           1       0.96      0.96      0.96       225

    accuracy                           0.96       450
   macro avg       0.96      0.96      0.96       450
weighted avg       0.96      0.96      0.96       450



In [89]:
# Train Random Forest model
rf_model = RandomForestClassifier(
    n_estimators=200,
    criterion="gini",
    max_depth=4,
    min_samples_split=15,
    min_samples_leaf=2,
    max_features="sqrt",
    bootstrap=True,
    n_jobs=-1,
    max_leaf_nodes=100,
    random_state=5
)

rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_val)

print("🔹 Random Forest Results")
print("Train Accuracy: %.2f%%" % (100 * rf_model.score(X_train, y_train)))
print("Validation Accuracy: %.2f%%" % (100 * rf_model.score(X_val, y_val)))
print("\nClassification Report (Validation):")
print(classification_report(y_val, y_pred_rf))


🔹 Random Forest Results
Train Accuracy: 95.43%
Validation Accuracy: 93.33%

Classification Report (Validation):
              precision    recall  f1-score   support

           0       0.94      0.92      0.93       225
           1       0.93      0.94      0.93       225

    accuracy                           0.93       450
   macro avg       0.93      0.93      0.93       450
weighted avg       0.93      0.93      0.93       450



# **Conclusion**

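The two embedding strategies ranked the models differently. With averaged word-level vectors the models were roughly on par: Random Forest led on validation (80.89% vs. 78.67%), while Logistic Regression was marginally ahead on the test set (0.81 vs. 0.80 accuracy). With whole-sentence embeddings, Logistic Regression outperformed Random Forest on validation (95.78% vs. 93.33%), and both models gained roughly 12 to 17 percentage points over the averaged-vector approach. The sentence-embedding models were not evaluated on the test split, and the two experiments used different splits (only the second is stratified), so the comparison is indicative rather than exact. Overall, pre-trained embeddings combined with classical models proved effective for sentiment analysis.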
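
For reference, the validation accuracies printed in the cells above can be tabulated and compared directly. The snippet below only rearranges those reported numbers; the strategy labels are shorthand chosen here, and because the two experiments used different splits the differences should be read as indicative.

```python
# Validation accuracies (%) as reported in the cell outputs.
val_acc = {
    ("averaged word vectors", "Logistic Regression"): 78.67,
    ("averaged word vectors", "Random Forest"): 80.89,
    ("sentence embeddings", "Logistic Regression"): 95.78,
    ("sentence embeddings", "Random Forest"): 93.33,
}

strategies = ["averaged word vectors", "sentence embeddings"]
models = ["Logistic Regression", "Random Forest"]

# Gain in percentage points from switching to sentence embeddings.
gains = {
    m: round(val_acc[("sentence embeddings", m)] - val_acc[("averaged word vectors", m)], 2)
    for m in models
}

# Best model on validation for each embedding strategy.
best = {s: max(models, key=lambda m: val_acc[(s, m)]) for s in strategies}

print(gains)  # {'Logistic Regression': 17.11, 'Random Forest': 12.44}
print(best)
```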