# Comparing traditional word embeddings to attention-based embeddings

As we have discussed, deep learning can be thought of as a two-part process in which the first part extracts features from the input. In this lab, we'll see that even with a simple model in the second part, improving the quality of feature extraction can dramatically improve performance.

We will compare two distinct approaches to embedding: the classic GloVe embeddings and newer, attention-based embeddings. For each, we will train a simple logistic regression model to classify movie reviews as positive or negative, and then compare the performance of the two models.

# Setup

We will use the `sentence-transformers` library to get transformer embeddings, the `gensim` library to load GloVe embeddings, and the `scikit-learn` library to train our models. We also install `openpyxl` so that pandas can read the `.xlsx` dataset.

In [None]:
!pip install -U -q sentence-transformers gensim scikit-learn openpyxl

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_excel('https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k/raw/refs/heads/master/train.xlsx')

In [None]:
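# Keep a random subset of 1,000 reviews; the fixed seed makes the sample reproducible
df = df.sample(1000, random_state=42)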

# Preprocessing

We will preprocess the data by converting the text to lowercase and removing stopwords, punctuation, and numbers. Gensim can help us do this easily.
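
To make these steps concrete, here is a rough standard-library approximation of what the filters do to a single review. It is only a sketch: `TOY_STOPWORDS` and `toy_preprocess` are hypothetical names, and the stopword set is a tiny stand-in for gensim's much longer built-in list.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set (gensim's real list is far longer)
TOY_STOPWORDS = {"the", "a", "was", "and", "it", "of", "i"}

def toy_preprocess(text):
    """Approximate the lab's filters: lowercase, drop digits and punctuation, drop stopwords."""
    text = text.lower()
    # Remove runs of digits
    text = re.sub(r"[0-9]+", "", text)
    # Replace each punctuation character with a space
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # split() also collapses repeated whitespace
    return [tok for tok in text.split() if tok not in TOY_STOPWORDS]

tokens = toy_preprocess("The acting was GREAT, and I watched it 3 times!")
print(tokens)
```

As in the lab, the result is a list of tokens, which can be joined back into a single string with `' '.join(tokens)`.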

In [None]:
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_multiple_whitespaces, strip_numeric

# Preprocess the text
CUSTOM_FILTERS = [lambda x: x.lower(), strip_multiple_whitespaces, strip_numeric, strip_punctuation, remove_stopwords]
df['cleaned_review'] = df['Reviews'].astype(str).apply(lambda x: preprocess_string(x, CUSTOM_FILTERS))
df.sample(5)

In [None]:
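# Encode the labels as integers: positive = 1, negative = 0
df['Sentiment'] = df['Sentiment'].map({'pos': 1, 'neg': 0})
# preprocess_string returns a list of tokens; join them back into one string per review
df['cleaned_review'] = df['cleaned_review'].apply(lambda x: ' '.join(x))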

In [None]:
df.sample(5)

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_review'], df['Sentiment'], test_size=0.2, random_state=42)

# GloVe Embeddings

We will use the `gensim` library to load the GloVe embeddings. We will use the 100-dimensional vectors trained on the Wikipedia 2014 + Gigaword 5 corpus, because the file is fairly small (128MB) and offers a good balance between size and quality. GloVe is similar to Word2Vec, but it learns its vectors from global word co-occurrence counts rather than by predicting words within a local context window. In both cases, each word gets a single fixed vector, regardless of the sentence it appears in.
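
Word vectors are usually compared by cosine similarity, which is also the measure gensim's `most_similar` ranks by. The sketch below computes it by hand on made-up 3-dimensional vectors. The `toy_vectors` values and the `toy_most_similar` helper are hypothetical and chosen only for illustration; they are not taken from GloVe.

```python
import numpy as np

# Hypothetical toy vectors; the real GloVe vectors used in this lab have 100 dimensions
toy_vectors = {
    "good": np.array([0.9, 0.1, 0.3]),
    "great": np.array([0.8, 0.2, 0.4]),
    "awful": np.array([-0.7, 0.1, 0.2]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, -1 = opposite."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_most_similar(word):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = {
        other: cosine_similarity(toy_vectors[word], vec)
        for other, vec in toy_vectors.items()
        if other != word
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = toy_most_similar("good")
print(ranking)
```

With these toy values, "great" points in nearly the same direction as "good" and ranks first, while "awful" points away from it and gets a negative score.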

In [None]:
from gensim import downloader

# Load the GloVe embeddings (downloaded and cached on first use)
glove_vectors = downloader.load('glove-wiki-gigaword-100')

In [None]:
glove_vectors.most_similar('good')

In [None]:
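import numpy as np


# Represent each review as the average of its words' GloVe vectors

def get_glove_embedding(text):
    # Each review is a space-joined string, so split it back into words;
    # iterating over the string directly would yield single characters
    embeddings = [glove_vectors[word] for word in text.split() if word in glove_vectors]
    if len(embeddings) == 0:
        # No known words: return a zero vector so the classifier still gets numeric input
        return np.zeros(glove_vectors.vector_size)
    # Average the embeddings
    return np.mean(embeddings, axis=0)

# Get the embeddings for the training and testing sets
X_train_glove = X_train.apply(get_glove_embedding)
X_test_glove = X_test.apply(get_glove_embedding)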
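
One consequence of averaging is worth seeing directly: the mean ignores word order, so two reviews that contain the same words in a different order get identical vectors. Here is a minimal sketch with made-up 2-dimensional vectors. The `toy_vectors` values and the `average_embedding` helper are hypothetical and are not part of gensim.

```python
import numpy as np

# Hypothetical toy vectors, for illustration only
toy_vectors = {
    "not": np.array([-1.0, 0.5]),
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-0.8, -0.2]),
    "just": np.array([0.1, 0.3]),
}

def average_embedding(text):
    """Average the toy vectors of the known words in a space-separated text."""
    known = [toy_vectors[w] for w in text.split() if w in toy_vectors]
    return np.mean(known, axis=0)

a = average_embedding("not good just bad")
b = average_embedding("not bad just good")

# The two phrases mean opposite things but contain the same words
print(a, b, np.allclose(a, b))
```

This is one reason averaged static embeddings can struggle with sentiment, where negation and word order often carry the meaning.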

# Transformer Embeddings

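We will use the `sentence-transformers` library to get our transformer embeddings. We will use the `paraphrase-MiniLM-L3-v2` model, which is one of the smallest (and thus fastest) models available for the library. Unlike the GloVe approach, the model returns one vector per review directly, so no manual averaging is needed.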
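
What sets attention-based embeddings apart is that attention mixes each token's vector with those of its neighbours, so the same word ends up with a different representation in different contexts. The sketch below shows this with a single scaled dot-product self-attention step in NumPy. It is heavily simplified and is not how `sentence-transformers` is implemented: real models add learned query/key/value projections, multiple heads and layers, and positional information, and the `toy` vectors here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

def self_attention(X):
    """One scaled dot-product self-attention step with identity projections.

    X has shape (num_tokens, dim). Each output row is a weighted average of
    all input rows, weighted by similarity to that row's token.
    """
    dim = X.shape[-1]
    scores = X @ X.T / np.sqrt(dim)      # pairwise similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ X

# Hypothetical static vectors, for illustration only
toy = {
    "not": np.array([-1.0, 0.5]),
    "very": np.array([0.2, 1.0]),
    "good": np.array([1.0, 0.0]),
}

out_not_good = self_attention(np.stack([toy["not"], toy["good"]]))
out_very_good = self_attention(np.stack([toy["very"], toy["good"]]))

# "good" is the second token in both phrases, but its output vector differs
good_after_not = out_not_good[1]
good_after_very = out_very_good[1]
print(good_after_not, good_after_very)
```

A static embedding would assign "good" the same vector in both phrases; after the attention step, the two vectors differ because each has absorbed part of its neighbour.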

In [None]:
from sentence_transformers import SentenceTransformer

# Load the transformer model
# float16 reduces memory use; drop model_kwargs if your setup does not support half precision
print('Loading transformer model...')
transformer = SentenceTransformer('paraphrase-MiniLM-L3-v2', model_kwargs={'torch_dtype': 'float16'})

# Get the embeddings for the training and testing sets
print('Getting embeddings for the training set...')
X_train_bert = transformer.encode(X_train.tolist())
print('Getting embeddings for the testing set...')
X_test_bert = transformer.encode(X_test.tolist())

# Training the Model

We will use a simple logistic regression model to classify the reviews as positive or negative. We will train one model on the GloVe embeddings and another on the transformer embeddings, and then compare the performance of the two models.
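
The comparison relies on scikit-learn's `classification_report`, which lists precision, recall, and F1 for each class. As a reminder of what those numbers mean, here is a small hand computation for the positive class. The `y_true` and `y_pred` lists are made-up labels, not results from this lab.

```python
# Made-up labels for illustration only (1 = positive review, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # positives correctly predicted
fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives predicted as positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives predicted as negative

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(t == p for t, p in pairs) / len(pairs)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```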

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train the model using the GloVe embeddings
glove_lr = LogisticRegression()
glove_lr.fit(X_train_glove.tolist(), y_train)

# Train the model using the transformer embeddings
trfrm_lr = LogisticRegression()
trfrm_lr.fit(X_train_bert, y_train)

# Evaluate the models: overall accuracy first, then per-class metrics
glove_score = glove_lr.score(X_test_glove.tolist(), y_test)
bert_score = trfrm_lr.score(X_test_bert, y_test)
print(f"GloVe accuracy: {glove_score:.3f}")
print(f"Transformer accuracy: {bert_score:.3f}")

print("--- GloVe Embeddings ---")
print(classification_report(y_test, glove_lr.predict(X_test_glove.tolist())))
print("--- Transformer Embeddings ---")
print(classification_report(y_test, trfrm_lr.predict(X_test_bert)))