<a href="https://colab.research.google.com/github/andr3w1699/HumanLanguageTechnologyProject/blob/main/Text%20Classification%20with%20Word%20Embeddings%20and%20Traditional%20ML%20Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Classification with Neural Text Representation on Amazon Reviews
## Static Embedding + Classification Head
The goal of this notebook is to represent each review in the dataset as a fixed-dimensional embedding vector, obtained by pooling the embeddings of the tokens that make up the review. For example, we can take the average of all token embeddings, or take the component-wise maximum across them. Whichever pooling is used, we get a single embedding vector that summarizes the whole review. This differs from the other proposed approach, where a sequence of embedding vectors was created, one per token, and then fed to a model for sequential data such as an RNN.

This gives us a dataset of (review embedding, sentiment label) pairs. We want to train a machine learning model (MLP, SVM, ...) on it to solve the sentiment classification task.
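
As a small illustration of the two pooling strategies described above, the sketch below pools three made-up 3-dimensional token vectors. The values and the token names in the comments are illustrative assumptions, not real GloVe vectors.

```python
import numpy as np

# Toy 3-dimensional embeddings for a three-token review (made-up values).
token_vectors = np.array([
    [1.0, -2.0, 0.5],    # hypothetical vector for "great"
    [3.0, 1.0, -1.5],    # hypothetical vector for "sound"
    [-1.0, 4.0, 1.0],    # hypothetical vector for "quality"
])

# Average pooling: mean of each component over all tokens.
avg_pooled = token_vectors.mean(axis=0)   # values: 1.0, 1.0, 0.0

# Max pooling: largest value of each component over all tokens.
max_pooled = token_vectors.max(axis=0)    # values: 3.0, 4.0, 1.0

print("avg:", avg_pooled)
print("max:", max_pooled)
```

Both results have the same dimensionality as a single token embedding, regardless of how many tokens the review contains.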

In [1]:
!pip install -q gdown

In [2]:
import tarfile
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
import re
import string
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB

In [3]:
file_id = '0Bz8a_Dbh9QhbZVhsUnRWRDhETzA'
output_name = 'amazon_review_full_csv.tar.gz'

!gdown --id {file_id} -O {output_name}

Downloading...
From (original): https://drive.google.com/uc?id=0Bz8a_Dbh9QhbZVhsUnRWRDhETzA
From (redirected): https://drive.google.com/uc?id=0Bz8a_Dbh9QhbZVhsUnRWRDhETzA&confirm=t&uuid=28a90df5-fefd-4822-9cab-e75495c4138c
To: /content/amazon_review_full_csv.tar.gz
100% 644M/644M [00:14<00:00, 43.8MB/s]


In [4]:
with tarfile.open(output_name, "r:gz") as tar:
    tar.extractall("Dataset")

In [5]:
pd.set_option('display.max_colwidth', None)

df_train = pd.read_csv(
    './Dataset/amazon_review_full_csv/train.csv',
    header=None,
    names=['label', 'title', 'text'],
    quotechar='"',
    doublequote=True,
    escapechar='\\',
    engine='python',
    encoding='utf-8',
    on_bad_lines='skip'
)

df_train.head()

index,label,title,text
0,3,more like funchuck,"Gave this to my dad for a gag gift after directing ""Nunsense,"" he got a reall kick out of it!"
1,5,Inspiring,"I hope a lot of people hear this cd. We need more strong and positive vibes like this. Great vocals, fresh tunes, cross-cultural happiness. Her blues is from the gut. The pop sounds are catchy and mature."
2,5,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."
3,4,Chrono Cross OST,"The music of Yasunori Misuda is without question my close second below the great Nobuo Uematsu.Chrono Cross OST is a wonderful creation filled with rich orchestra and synthesized sounds. While ambiance is one of the music's major factors, yet at times it's very uplifting and vigorous. Some of my favourite tracks include; ""Scars Left by Time, The Girl who Stole the Stars, and Another World""."
4,5,Too good to be true,Probably the greatest soundtrack in history! Usually it's better to have played the game first but this is so enjoyable anyway! I worked so hard getting this soundtrack and after spending [money] to get it it was really worth every penny!! Get this OST! it's amazing! The first few tracks will have you dancing around with delight (especially Scars Left by Time)!! BUY IT NOW!!


In [6]:
df_sample = df_train.sample(n=100_000, random_state=42)

In [7]:
df_sample = df_sample[df_sample['label'] != 3].copy()

df_sample['sentiment'] = df_sample['label'].apply(lambda x: 1 if x > 3 else 0)

df_sample['review'] = df_sample['title'].fillna('') + ' ' + df_sample['text'].fillna('')

In [8]:
df_sample.head()

index,label,title,text,sentiment,review
2344762,2,Not I was hoping for,"Don't expect these to be magical.I am pretty easy on tools, but these aren't that tough.On one stuck bolt, the first one I used broke without even giving up a fight. The 2nd one grabbed well, bu then it bent like a piece of licorice.I would recommend these for very light-duty work.Fortunately Amazon was amazing in the return process.",0,"Not I was hoping for Don't expect these to be magical.I am pretty easy on tools, but these aren't that tough.On one stuck bolt, the first one I used broke without even giving up a fight. The 2nd one grabbed well, bu then it bent like a piece of licorice.I would recommend these for very light-duty work.Fortunately Amazon was amazing in the return process."
736127,2,roof bag,This bag held up good however it is not at all waterproof we had to dry all our clothes when we arrived atOur destination,0,roof bag This bag held up good however it is not at all waterproof we had to dry all our clothes when we arrived atOur destination
1295906,5,Suspensful,"Truly after reading the reviews written by previous readers about this book, i just had to read the book. I was intrigued ultimately about the premise of the story, a bengal tiger lose in north Georgia, AWESOME. Reading the first half of the story i found comparisons to that of the movie ""Jaws"", mystique, suspense, and great story telling both depicted by Warner and Speilber. The second half of the story then wandered a bit from the roots and went into a metaphorical shift to ones manhood. But i do truly wish many to read this story which i couldnt put down!!! Also Jim Grahams character in this story is truly charasmatic and geniune!! READ IT",1,"Suspensful Truly after reading the reviews written by previous readers about this book, i just had to read the book. I was intrigued ultimately about the premise of the story, a bengal tiger lose in north Georgia, AWESOME. Reading the first half of the story i found comparisons to that of the movie ""Jaws"", mystique, suspense, and great story telling both depicted by Warner and Speilber. The second half of the story then wandered a bit from the roots and went into a metaphorical shift to ones manhood. But i do truly wish many to read this story which i couldnt put down!!! Also Jim Grahams character in this story is truly charasmatic and geniune!! READ IT"
2790144,2,Received Broken,"When my son received this gift, the bowl was broken. I told him to call the company, but he hasn't done that yet, so I don't know what the resolution is.",0,"Received Broken When my son received this gift, the bowl was broken. I told him to call the company, but he hasn't done that yet, so I don't know what the resolution is."
1089436,1,"If you are computer literate just a little bit, do not read",As you may have already noticed from the title of this review this is one of the worst techno-thrillers that one could choose for reading and you would be really annoyed of this dilettante writing with regard to the author's knowledge of cryptography and computers. Having in mind the popularity of Dan Brown as one of the best selling authors out there this book is a complete waste of time. Sorry to say. I give this book 1 out 5. Unfortunately there is no 0 star rating.,0,"If you are computer literate just a little bit, do not read As you may have already noticed from the title of this review this is one of the worst techno-thrillers that one could choose for reading and you would be really annoyed of this dilettante writing with regard to the author's knowledge of cryptography and computers. Having in mind the popularity of Dan Brown as one of the best selling authors out there this book is a complete waste of time. Sorry to say. I give this book 1 out 5. Unfortunately there is no 0 star rating."


# Text processing

In [9]:
texts = df_sample['review'].astype(str).tolist()
labels = df_sample['sentiment'].tolist()

In [10]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [11]:
def preprocess_text(text):
    """Lowercase a review, strip non-alphanumeric characters, and tokenize it."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # replace every non-alphanumeric character with a space
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if len(t) > 1]  # drop single-character tokens
    return tokens

tokenized_texts = [preprocess_text(t) for t in texts]
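
To see what the cleaning steps do to a review, here is a minimal sketch on one made-up sentence. The helper `clean_and_split` is hypothetical: it repeats the same lowercasing, regex, and length filter as `preprocess_text`, but uses `str.split()` as a stand-in for `word_tokenize` so that it runs without the NLTK `punkt_tab` download. The two tokenizers may differ on some inputs.

```python
import re

def clean_and_split(text):
    # Same cleaning as preprocess_text; str.split() replaces word_tokenize
    # here (an assumption made only to keep the example dependency-free).
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return [t for t in text.split() if len(t) > 1]

sample = "Don't buy it!! 5/5? No."
tokens = clean_and_split(sample)
print(tokens)  # ['don', 'buy', 'it', 'no']
```

Note that the contraction "don't" is reduced to "don" and the rating "5/5" disappears entirely, because the apostrophe and slash become spaces and the resulting single-character pieces are filtered out. This may matter for sentiment, since negations and numeric ratings carry signal.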

# Load and Build the GloVe Embeddings

In [12]:
glove_path = "glove.6B.zip"

if not os.path.isfile(glove_path):
    !wget -c http://nlp.stanford.edu/data/glove.6B.zip
else:
    print("GloVe file already exists. Skipping download.")

GloVe file already exists. Skipping download.


In [13]:
if not os.path.isfile("glove.6B.100d.txt"):
    !unzip -q glove.6B.zip glove.6B.100d.txt
    print("Unzipped glove.6B.100d.txt")
else:
    print("glove.6B.100d.txt already exists. Skipping unzip.")

glove.6B.100d.txt already exists. Skipping unzip.


In [14]:
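def load_glove_embeddings(file_path, embedding_dim=100):
    """Load GloVe vectors from a text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line is: word followed by embedding_dim space-separated floats
            parts = line.strip().split()
            word = parts[0]
            vector = np.array(parts[1:], dtype='float32')
            if vector.shape[0] != embedding_dim:
                continue  # skip malformed lines whose length does not match embedding_dim
            embeddings[word] = vector
    return embeddings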

embedding_dim = 100
glove_path = 'glove.6B.100d.txt'
glove_embeddings = load_glove_embeddings(glove_path, embedding_dim)

In [15]:
def embed_review(tokens, embeddings, dim=100, method='avg'):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    vecs = np.array(vecs)
    if method == 'avg':
        return vecs.mean(axis=0)
    elif method == 'max':
        return vecs.max(axis=0)
    else:
        raise ValueError("Pooling method must be 'avg' or 'max'")

embedding_dim = 100
X_avg = np.array([embed_review(t, glove_embeddings, embedding_dim, 'avg') for t in tokenized_texts])
X_max = np.array([embed_review(t, glove_embeddings, embedding_dim, 'max') for t in tokenized_texts])
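
Because `embed_review` silently skips tokens that are missing from the embedding dictionary and returns a zero vector when no token is found, it can be useful to check how often that happens. The sketch below is one possible diagnostic. `glove_coverage` is a hypothetical helper, and the toy dictionary and reviews are made up for illustration.

```python
def glove_coverage(tokenized_texts, embeddings):
    """Return (fraction of tokens found in embeddings, number of reviews with no known token)."""
    total_tokens = 0
    covered_tokens = 0
    empty_reviews = 0
    for tokens in tokenized_texts:
        hits = sum(1 for t in tokens if t in embeddings)
        total_tokens += len(tokens)
        covered_tokens += hits
        if hits == 0:
            empty_reviews += 1  # this review would be embedded as a zero vector
    coverage = covered_tokens / total_tokens if total_tokens else 0.0
    return coverage, empty_reviews

# Toy stand-ins for glove_embeddings and tokenized_texts (made-up data).
toy_embeddings = {"good": [0.1, 0.2], "bad": [-0.1, -0.2]}
toy_reviews = [["good", "product"], ["zzzz"], ["bad", "bad", "good"]]

coverage, n_empty = glove_coverage(toy_reviews, toy_embeddings)
print(f"token coverage: {coverage:.2%}, reviews with no known token: {n_empty}")
```

In the notebook, the same call could be made as `glove_coverage(tokenized_texts, glove_embeddings)`.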

In [16]:
# Encode labels
le = LabelEncoder()
y = le.fit_transform(labels)

# Train/Test split (a single call, so both pooled feature sets share exactly the same rows)
X_train_avg, X_test_avg, X_train_max, X_test_max, y_train, y_test = train_test_split(
    X_avg, X_max, y, test_size=0.2, random_state=42, stratify=y
)

In [17]:
# SVM with Average Pooling
svm_avg = SVC()
svm_avg.fit(X_train_avg, y_train)
preds_svm_avg = svm_avg.predict(X_test_avg)
print("SVM (Avg Pooling):\n", classification_report(y_test, preds_svm_avg))

SVM (Avg Pooling):
               precision    recall  f1-score   support

           0       0.77      0.82      0.79      8039
           1       0.81      0.76      0.78      7985

    accuracy                           0.79     16024
   macro avg       0.79      0.79      0.79     16024
weighted avg       0.79      0.79      0.79     16024



In [19]:
# SVM with Max Pooling
svm_max = SVC()
svm_max.fit(X_train_max, y_train)
preds_svm_max = svm_max.predict(X_test_max)
print("SVM (Max Pooling):\n", classification_report(y_test, preds_svm_max))

SVM (Max Pooling):
               precision    recall  f1-score   support

           0       0.71      0.72      0.72      8039
           1       0.72      0.70      0.71      7985

    accuracy                           0.71     16024
   macro avg       0.71      0.71      0.71     16024
weighted avg       0.71      0.71      0.71     16024



In [20]:
# MLP with Average Pooling
mlp_avg = MLPClassifier(hidden_layer_sizes=(128,), max_iter=200)
mlp_avg.fit(X_train_avg, y_train)
preds_mlp_avg = mlp_avg.predict(X_test_avg)
print("MLP (Avg Pooling):\n", classification_report(y_test, preds_mlp_avg))


MLP (Avg Pooling):
               precision    recall  f1-score   support

           0       0.78      0.83      0.81      8039
           1       0.82      0.77      0.79      7985

    accuracy                           0.80     16024
   macro avg       0.80      0.80      0.80     16024
weighted avg       0.80      0.80      0.80     16024





In [22]:

# MLP with Max Pooling
mlp_max = MLPClassifier(hidden_layer_sizes=(128,), max_iter=200)
mlp_max.fit(X_train_max, y_train)
preds_mlp_max = mlp_max.predict(X_test_max)
print("MLP (Max Pooling):\n", classification_report(y_test, preds_mlp_max))




MLP (Max Pooling):
               precision    recall  f1-score   support

           0       0.67      0.77      0.71      8039
           1       0.72      0.61      0.66      7985

    accuracy                           0.69     16024
   macro avg       0.69      0.69      0.69     16024
weighted avg       0.69      0.69      0.69     16024



In [23]:
# Logistic Regression with Average Pooling
logreg_avg = LogisticRegression(max_iter=200)
logreg_avg.fit(X_train_avg, y_train)
preds_logreg_avg = logreg_avg.predict(X_test_avg)
print("Logistic Regression (Avg Pooling):\n", classification_report(y_test, preds_logreg_avg))

Logistic Regression (Avg Pooling):
               precision    recall  f1-score   support

           0       0.77      0.80      0.79      8039
           1       0.79      0.76      0.78      7985

    accuracy                           0.78     16024
   macro avg       0.78      0.78      0.78     16024
weighted avg       0.78      0.78      0.78     16024



In [24]:
# Logistic Regression with Max Pooling
logreg_max = LogisticRegression(max_iter=200)
logreg_max.fit(X_train_max, y_train)
preds_logreg_max = logreg_max.predict(X_test_max)
print("Logistic Regression (Max Pooling):\n", classification_report(y_test, preds_logreg_max))

Logistic Regression (Max Pooling):
               precision    recall  f1-score   support

           0       0.68      0.69      0.68      8039
           1       0.68      0.67      0.68      7985

    accuracy                           0.68     16024
   macro avg       0.68      0.68      0.68     16024
weighted avg       0.68      0.68      0.68     16024



In [25]:
# Random Forest with Average Pooling
rf_avg = RandomForestClassifier(n_estimators=100, random_state=42)
rf_avg.fit(X_train_avg, y_train)
preds_rf_avg = rf_avg.predict(X_test_avg)
print("Random Forest (Avg Pooling):\n", classification_report(y_test, preds_rf_avg))

Random Forest (Avg Pooling):
               precision    recall  f1-score   support

           0       0.74      0.78      0.76      8039
           1       0.76      0.72      0.74      7985

    accuracy                           0.75     16024
   macro avg       0.75      0.75      0.75     16024
weighted avg       0.75      0.75      0.75     16024



In [27]:
# Random Forest with Max Pooling
rf_max = RandomForestClassifier(n_estimators=100, random_state=42)
rf_max.fit(X_train_max, y_train)
preds_rf_max = rf_max.predict(X_test_max)
print("Random Forest (Max Pooling):\n", classification_report(y_test, preds_rf_max))

Random Forest (Max Pooling):
               precision    recall  f1-score   support

           0       0.71      0.73      0.72      8039
           1       0.72      0.70      0.71      7985

    accuracy                           0.72     16024
   macro avg       0.72      0.72      0.72     16024
weighted avg       0.72      0.72      0.72     16024



In [28]:
# XGBoost with Average Pooling
xgb_avg = XGBClassifier(random_state=42)
xgb_avg.fit(X_train_avg, y_train)
preds_xgb_avg = xgb_avg.predict(X_test_avg)
print("XGBoost (Avg Pooling):\n", classification_report(y_test, preds_xgb_avg))

XGBoost (Avg Pooling):
               precision    recall  f1-score   support

           0       0.77      0.80      0.78      8039
           1       0.79      0.76      0.78      7985

    accuracy                           0.78     16024
   macro avg       0.78      0.78      0.78     16024
weighted avg       0.78      0.78      0.78     16024



In [29]:
# XGBoost with Max Pooling
xgb_max = XGBClassifier(random_state=42)
xgb_max.fit(X_train_max, y_train)
preds_xgb_max = xgb_max.predict(X_test_max)
print("XGBoost (Max Pooling):\n", classification_report(y_test, preds_xgb_max))

XGBoost (Max Pooling):
               precision    recall  f1-score   support

           0       0.78      0.78      0.78      8039
           1       0.78      0.77      0.78      7985

    accuracy                           0.78     16024
   macro avg       0.78      0.78      0.78     16024
weighted avg       0.78      0.78      0.78     16024



In [30]:

# KNN with Average Pooling
knn_avg = KNeighborsClassifier(n_neighbors=5)
knn_avg.fit(X_train_avg, y_train)
preds_knn_avg = knn_avg.predict(X_test_avg)
print("KNN (Avg Pooling):\n", classification_report(y_test, preds_knn_avg))


KNN (Avg Pooling):
               precision    recall  f1-score   support

           0       0.68      0.75      0.71      8039
           1       0.72      0.65      0.68      7985

    accuracy                           0.70     16024
   macro avg       0.70      0.70      0.70     16024
weighted avg       0.70      0.70      0.70     16024



In [31]:
# KNN with Max Pooling
knn_max = KNeighborsClassifier(n_neighbors=5)
knn_max.fit(X_train_max, y_train)
preds_knn_max = knn_max.predict(X_test_max)
print("KNN (Max Pooling):\n", classification_report(y_test, preds_knn_max))

KNN (Max Pooling):
               precision    recall  f1-score   support

           0       0.61      0.64      0.62      8039
           1       0.62      0.58      0.60      7985

    accuracy                           0.61     16024
   macro avg       0.61      0.61      0.61     16024
weighted avg       0.61      0.61      0.61     16024
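
To compare the models reported so far at a glance, the sketch below collects the test accuracies from the classification reports above into one table. The numbers are copied from those reports. The dictionary layout and variable names are illustrative choices, not part of the original notebook.

```python
# Test-set accuracies copied from the classification reports above.
reported_accuracy = {
    "SVM": {"avg": 0.79, "max": 0.71},
    "MLP": {"avg": 0.80, "max": 0.69},
    "Logistic Regression": {"avg": 0.78, "max": 0.68},
    "Random Forest": {"avg": 0.75, "max": 0.72},
    "XGBoost": {"avg": 0.78, "max": 0.78},
    "KNN": {"avg": 0.70, "max": 0.61},
}

# Sort models by their accuracy with average pooling, best first.
rows = sorted(reported_accuracy.items(), key=lambda kv: kv[1]["avg"], reverse=True)

print(f"{'model':<22}{'avg':>6}{'max':>6}{'diff':>7}")
for name, acc in rows:
    diff = acc["avg"] - acc["max"]  # gain of average pooling over max pooling
    print(f"{name:<22}{acc['avg']:>6.2f}{acc['max']:>6.2f}{diff:>+7.2f}")
```

On these reported numbers, average pooling is at least as accurate as max pooling for every model, with XGBoost the only tie, and the MLP with average pooling has the highest accuracy (0.80). Since the MLP runs do not fix a random seed, small differences between models should be read with caution.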

