## Embeddings and Sentence Classification 

In [8]:
# Install third-party packages first so the imports below succeed
!pip install gpt4all
!pip install torch==2.1.0

import re

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import gensim.downloader as api
from gensim.models.word2vec import Word2Vec
from gpt4all import Embed4All
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn

# Download the NLTK resources used for text cleaning
nltk.download('stopwords')
nltk.download('wordnet')




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Exploring Embeddings

Put simply, Embeddings are fixed-size **dense** vector representations of tokens in natural language. This means you can represent words as vectors, sentences as vectors, even other entities like entire graphs as vectors.

So what really makes them different from something like One-Hot vectors?

What's special is that they have semantic meaning baked into them: words used in similar contexts end up with similar vectors. This means you can model relationships between entities in text, which itself leads to a lot of fun applications. Virtually all modern NLP architectures make use of Embeddings in some way.
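To make the contrast with One-Hot vectors concrete, here is a minimal sketch. The three-word vocabulary and the dense vectors are made up for illustration (they are not learned Embeddings); the point is only that distinct One-Hot vectors never overlap, while dense vectors can encode degrees of relatedness.

```python
def dot(u, v):
    """Dot product of two equal-length lists of numbers."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy vocabulary, for illustration only
vocab = ["king", "queen", "banana"]

# One-hot: one dimension per word, with a single 1 at the word's index
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense: hand-picked 2-dimensional vectors (made-up numbers, not learned)
dense = {"king": [0.9, 0.1], "queen": [0.8, 0.2], "banana": [0.1, 0.9]}

one_hot_overlap = dot(one_hot["king"], one_hot["queen"])
dense_related = dot(dense["king"], dense["queen"])
dense_unrelated = dot(dense["king"], dense["banana"])

print(one_hot_overlap)                  # 0.0: distinct one-hot vectors are orthogonal
print(dense_related > dense_unrelated)  # True
```

With One-Hot vectors every pair of different words looks equally unrelated, and the vector length grows with the vocabulary. Dense vectors keep a fixed size and leave room for "closer" and "further".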

We will mostly rely on *pretrained* Embeddings, i.e. Embeddings that have already been trained on a large corpus of text. Training high-quality Embeddings from scratch is computationally expensive, and we don't have the resources to do so. Fortunately, some good samaritans have already done this for us, and we can use their publicly available Embeddings for our own tasks. For the word-level demo below, note that the cell actually fits a small Word2Vec model on the `text8` corpus (downloaded through gensim) rather than loading ready-made vectors.

In [9]:
corpus = api.load('text8')
w2vmodel = Word2Vec(corpus)

print("Done loading word2vec model!")

Done loading word2vec model!


Now that we have the Embeddings, we can create an Embedding **layer** in PyTorch, `nn.Embedding`, that will perform the lookup from word index to vector for us.

Note that the model in the following cell has a given **vocab size** and **embedding dimension**. The vocab size is the number of distinct words the Embeddings cover: some sets of Embeddings are defined for a large vocabulary, while others (often older ones) cover a smaller one. The embedding dimension is the number of *features* learned for each word, which is what any further processing builds on.
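Conceptually, an Embedding layer is a lookup table: a matrix with one row per vocabulary word and one column per feature. The plain-Python sketch below mimics that lookup; the vocabulary, the numbers, and the helper name `embed` are hypothetical and stand in for what `nn.Embedding` does on tensors.

```python
# Hypothetical 4-word vocabulary with 3-dimensional vectors (made-up numbers)
key_to_index = {"the": 0, "king": 1, "queen": 2, "zero": 3}
weights = [
    [0.1, 0.0, 0.2],  # row 0 -> "the"
    [0.7, 0.3, 0.5],  # row 1 -> "king"
    [0.6, 0.4, 0.5],  # row 2 -> "queen"
    [0.0, 0.9, 0.1],  # row 3 -> "zero"
]

def embed(words):
    """Map each word to its index, then return the matching rows."""
    indices = [key_to_index[word] for word in words]
    return [weights[i] for i in indices]

vocab_size = len(weights)        # number of rows
embedding_dim = len(weights[0])  # number of columns
batch = embed(["king", "queen"])

print(vocab_size, embedding_dim)  # 4 3
print(batch)                      # [[0.7, 0.3, 0.5], [0.6, 0.4, 0.5]]
```

A word that is missing from `key_to_index` raises a `KeyError` in this sketch, which is one practical reason the vocab size matters.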

In [10]:
# Define embedding layer using gensim
embedding_layer = nn.Embedding.from_pretrained(torch.FloatTensor(w2vmodel.wv.vectors))

# Get some information from the w2vmodel
print(f"Vocab size: {len(w2vmodel.wv.key_to_index)}")

print(f"Some of the words in the vocabulary:\n{list(w2vmodel.wv.key_to_index.keys())[:10]}")

print(f"Embedding dimension: {w2vmodel.wv.vectors.shape[1]}")

Vocab size: 71290
Some of the words in the vocabulary:
['the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero', 'nine', 'two']
Embedding dimension: 100


Now, for a demonstration, we pick two words, turn them into numbers (encoding them via their index in the vocab), and pass them through the Embedding layer.

Note how the resulting Embeddings both have the same shape: 1 word, with 100 elements in its vector.

In [11]:
# Take two words and get their embeddings
word1 = "king"
word2 = "queen"

def word2vec(word):
    return embedding_layer(torch.LongTensor([w2vmodel.wv.key_to_index[word]]))

king_embedding = word2vec(word1)
queen_embedding = word2vec(word2)

print(f"Embedding Shape for '{word1}': {king_embedding.shape}")
print(f"Embedding Shape for '{word2}': {queen_embedding.shape}")

Embedding Shape for 'king': torch.Size([1, 100])
Embedding Shape for 'queen': torch.Size([1, 100])


When we have vectors whose scale is arbitrary, one nice way to measure how *similar* they are is with the Cosine Similarity measure.


$$ \text{Cosine Similarity}(\mathbf{u},\mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|} $$


We can apply this idea to our Embeddings. To see how "similar" two words are to the model, we generate their Embeddings and take the Cosine Similarity between them. The result is a number between -1 and 1 (just like the range of the cosine function): values close to 1 mean the vectors point in nearly the same direction, values close to 0 mean the words are not similar, and negative values mean the vectors point in opposing directions.
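The three reference cases of the formula can be checked by hand. The sketch below uses plain Python lists and made-up 2-dimensional vectors; it also shows why the measure suits vectors of arbitrary scale, since stretching a vector does not change the result. Floating-point rounding means a value that should be exactly 1 can come out a hair above or below it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists of numbers."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)

# Same direction, different lengths: similarity is (approximately) 1
same_direction = cosine([1.0, 2.0], [3.0, 6.0])

# Perpendicular vectors: similarity is 0
orthogonal = cosine([1.0, 0.0], [0.0, 5.0])

# Opposite directions: similarity is (approximately) -1
opposite = cosine([1.0, 2.0], [-1.0, -2.0])

print(math.isclose(same_direction, 1.0))  # True
print(orthogonal)                         # 0.0
print(math.isclose(opposite, -1.0))       # True
```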

In [12]:
def cosine_similarity(vec1, vec2):
    '''
    Computes the cosine similarity between two vectors
    '''
    dot_product = torch.dot(torch.squeeze(vec1), torch.squeeze(vec2))
    norm_vec1 = torch.norm(vec1)
    norm_vec2 = torch.norm(vec2)

    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity.item()

def compute_word_similarity(word1, word2):
    '''
    Takes in two words, computes their embeddings and returns the cosine similarity
    '''

    with torch.no_grad():
        embedding_word1 = embedding_layer(torch.tensor([w2vmodel.wv.key_to_index[word1]]))
        embedding_word2 = embedding_layer(torch.tensor([w2vmodel.wv.key_to_index[word2]]))
        similarity = cosine_similarity(embedding_word1, embedding_word2)

    return similarity

word1 = 'king'
word2 = 'queen'
word3 = 'king'
print(f"Similarity between '{word1}' and '{word2}': {compute_word_similarity(word1, word2)}")
print(f"Similarity between '{word1}' and '{word3}': {compute_word_similarity(word1, word3)}")

Similarity between 'king' and 'queen': 0.699151337146759
Similarity between 'king' and 'king': 1.0000001192092896


In [13]:
del embedding_layer

## Sentence Classification with Sentence Embeddings

Now let's move on to an actual application: classifying whether a tweet is about a real disaster or not. As you can imagine, this could be a valuable model when monitoring social media for disaster relief efforts.

Since we are using Sentence Embeddings, we want something that takes in a sequence of words and outputs a single fixed-size vector. For this task, we will make use of a pretrained embedding model via the `gpt4all` library.

This library lets us generate pretrained Embeddings for sentences, which we can use as **features** for any classifier of our choice.
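To illustrate what "a sequence of words in, one fixed-size vector out" means, here is a deliberately simple baseline: averaging word vectors. This is an assumption made for illustration and not how `gpt4all` computes its Embeddings; the word vectors and the helper name `mean_pool` are made up.

```python
# Hypothetical 3-dimensional word vectors (made-up numbers, for illustration)
word_vectors = {
    "forest": [0.2, 0.8, 0.0],
    "fire": [0.9, 0.1, 0.3],
    "near": [0.1, 0.1, 0.1],
    "town": [0.0, 0.6, 0.6],
}

def mean_pool(sentence, dim=3):
    """Average the vectors of the known words in a sentence."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    # zip(*vectors) groups the values of each dimension together
    return [sum(column) / len(vectors) for column in zip(*vectors)]

shorter = mean_pool("forest fire")
longer = mean_pool("Forest fire near town")
unknown = mean_pool("zzz")

# Sentences of different lengths all map to vectors of the same size
print(len(shorter), len(longer), len(unknown))  # 3 3 3
```

Because every sentence ends up with the same number of features, the vectors can be stacked into a matrix and handed to a standard classifier.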

In [14]:

df = pd.read_csv("/content/disaster_tweets.csv")
df = df[["text", "target"]]


X = df["text"]
y = df["target"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print("train set shape:", X_train.shape, y_train.shape)
print("validation set shape:", X_val.shape, y_val.shape)

# print(X_train.shape, X_val.shape)

train set shape: (6090,) (6090,)
validation set shape: (1523,) (1523,)


Before jumping straight to Embeddings, since our data is sourced from the cesspool that is Twitter, we should probably do some cleaning. This can involve the removal of URLs, punctuation, numbers that don't provide any meaning, stopwords, and so on.
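The order of the cleaning steps can change the result. The sketch below uses a tiny, hypothetical stopword set (not the NLTK list) and two made-up helpers to show one such interaction: stripping punctuation first turns "don't" into "dont", which a stopword list containing "don't" no longer catches.

```python
import re

# Tiny illustrative stopword set (an assumption, not the NLTK list)
STOP_WORDS = {"the", "is", "a", "don't"}

def strip_punctuation(text):
    """Remove every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)

def drop_stopwords(text):
    """Remove words that appear in STOP_WORDS."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

tweet = "don't panic the fire is out"

punct_first = drop_stopwords(strip_punctuation(tweet))
stop_first = strip_punctuation(drop_stopwords(tweet))

print(punct_first)  # dont panic fire out
print(stop_first)   # panic fire out
```

Neither order is wrong in itself; the point is to pick one deliberately and apply the same pipeline at training and prediction time.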

Once the text is cleaned, we get to the fun part: creating our Embeddings!

The embedding functionality in `gpt4all` makes use of a model called [Sentence-BERT](https://arxiv.org/abs/1908.10084). This is a Transformer-based model that has been trained on a large corpus of text, and is able to generate high-quality Sentence Embeddings for us.

In [33]:
# TODO: Clean the sentences (5 marks)


# TODO: Fill out the following functions, adding more if desired

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def lowercase(txt):
    return txt.lower()

def remove_punctuation(txt):
    # Drop everything that is not a word character or whitespace
    return re.sub(r'[^\w\s]', '', txt)

def remove_stopwords(txt):
    filtered_words = [word for word in txt.split() if word.lower() not in STOP_WORDS]
    return ' '.join(filtered_words)

def remove_numbers(txt):
    return re.sub(r'\d+', '', txt)

def remove_url(txt):
    return re.sub(r'http\S+', '', txt)

def normalize_sentence(txt):
    '''
    Aggregates all the above functions to normalize/clean a sentence
    '''
    txt = lowercase(txt)
    txt = remove_url(txt)
    txt = remove_punctuation(txt)
    txt = remove_stopwords(txt)
    txt = remove_numbers(txt)
    txt = txt.strip()
    # Replace any remaining characters outside a-z, A-Z, 0-9 and whitespace
    # (e.g. underscores, accented letters) with spaces. This runs after
    # strip(), so it can leave extra spaces behind.
    txt = re.sub(r'[^a-zA-Z0-9\s]', ' ', txt)

    return txt

X_train_cleaned = X_train.apply(normalize_sentence)
X_val_cleaned = X_val.apply(normalize_sentence)

# Drop very short sentences; one mask per split keeps text and labels aligned
min_sentence_length = 20
train_mask = X_train_cleaned.apply(len) >= min_sentence_length
val_mask = X_val_cleaned.apply(len) >= min_sentence_length
X_train_filtered = X_train_cleaned[train_mask]
y_train_filtered = y_train[train_mask]
X_val_filtered = X_val_cleaned[val_mask]
y_val_filtered = y_val[val_mask]


print("train set shape after cleaning:", X_train_filtered.shape, y_train_filtered.shape)
print("validation set shape after cleaning:", X_val_filtered.shape, y_val_filtered.shape)
print(X_train_filtered[:10], y_train_filtered[:10])
print(X_val_filtered[:10], y_val_filtered[:10])

# Sanity checks on the cleaned data
print("Text data type:", X_train_cleaned.dtype)
print("Labels data type:", y_train_filtered.dtype)
print("NaN values in text data:", X_train_cleaned.isnull().any())
print("NaN values in labels:", y_train_filtered.isnull().any())
print("Special characters in text data:", X_train_cleaned[X_train_cleaned.str.contains(r'[^a-zA-Z0-9\s]')])
# str.contains searches the whole string; str.match would only test the start
print("Leading/trailing spaces in text data:", X_train_cleaned[X_train_cleaned.str.contains(r'^\s|\s$')])


train set shape after cleaning: (5872,) (5872,)
validation set shape after cleaning: (1471,) (1471,)
4996    courageous honest analysis need use atomic bom...
3263    zachzaidman thescore wld b shame golf cart bec...
4907    tell barackobama rescind medals honor given us...
2855    worried ca drought might affect extreme weathe...
4716    youngheroesid lava blast amp power red panther...
7538    wreckage conclusively confirmed mh malaysia pm...
3172    builder dental emergency ruined plan emotional...
3932    bmx issues areal flood advisory shelby al till...
5833       wisenews chinas stock market crash gems rubble
7173    robertoneill getting hit foul ball sitting har...
Name: text, dtype: object 4996    1
3263    0
4907    1
2855    1
4716    0
7538    1
3172    1
3932    1
5833    1
7173    0
Name: target, dtype: int64
2644            new weapon cause unimaginable destruction
2227    famping things gishwhes got soaked deluge goin...
5448    dt georgegalloway rt gallowaymayor   the c

In [16]:

feature_extractor = Embed4All()

example_embedding = feature_extractor.embed(X_train_filtered.values[0])
print(f"Output structure: {type(example_embedding)}")
print(example_embedding)

X_train_embeddings = [feature_extractor.embed(sentence) for sentence in X_train_filtered.values]


X_val_embeddings = [feature_extractor.embed(sentence) for sentence in X_val_filtered.values]


y_train_labels = y_train_filtered.values
y_val_labels = y_val_filtered.values


100%|██████████| 45.9M/45.9M [00:03<00:00, 13.2MiB/s]


Output structure: <class 'list'>
[-0.04516274109482765, 0.12774068117141724, -0.03049377351999283, 0.000782765622716397, 0.013275389559566975, 0.016861621290445328, 0.055261604487895966, 0.008617403917014599, -0.02639164589345455, 0.007780713029205799, 0.042924538254737854, -0.01940971612930298, 0.03689299523830414, 0.03972916677594185, 0.011344856582581997, 0.01799238659441471, -0.02071196399629116, 0.016547061502933502, -0.010986067354679108, -0.02933647856116295, -0.07392047345638275, 0.046137116849422455, 0.07390392571687698, -0.002076385309919715, -0.014989830553531647, 0.014138161204755306, 0.06163441017270088, 0.040662702172994614, -0.012586784549057484, -0.01801145263016224, 0.0339195653796196, 0.046742022037506104, 0.099168561398983, 0.0014147365000098944, 0.04102880880236626, 0.007911625318229198, 0.016119474545121193, 0.02185426838696003, 0.01848200522363186, -0.09288961440324783, -0.14541377127170563, 0.0007321782759390771, 0.022364992648363113, 0.025120049715042114, -0.015

In [17]:
print(f"X_train_embeddings - type: {type(X_train_embeddings)}, shape: {len(X_train_embeddings)}")
print(f"X_val_embeddings - type: {type(X_val_embeddings)}, shape: {len(X_val_embeddings)}")

if X_train_embeddings:
    print(f"Element 0 - type: {type(X_train_embeddings[0])}, shape: {len(X_train_embeddings[0])}")

print(f"y_train_labels - type: {type(y_train_labels)}, shape: {y_train_labels.shape}")
print(f"y_val_labels - type: {type(y_val_labels)}, shape: {y_val_labels.shape}")

X_train_embeddings - type: <class 'list'>, shape: 5872
X_val_embeddings - type: <class 'list'>, shape: 1471
Element 0 - type: <class 'list'>, shape: 384
y_train_labels - type: <class 'numpy.ndarray'>, shape: (5872,)
y_val_labels - type: <class 'numpy.ndarray'>, shape: (1471,)


Now with our Embeddings ready, we can move on to the actual classification task.


In [45]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
logreg_classifier = LogisticRegression(random_state=42)
logreg_classifier.fit(X_train_embeddings, y_train_labels)
y_val_pred_logreg = logreg_classifier.predict(X_val_embeddings)

accuracy_logreg = accuracy_score(y_val_labels, y_val_pred_logreg)
print(f"validation accuracy (Logistic Regression): {accuracy_logreg:.4f}")

print("classification report (Logistic Regression):")
print(classification_report(y_val_labels, y_val_pred_logreg))

validation accuracy (Logistic Regression): 0.7967
classification report (Logistic Regression):
              precision    recall  f1-score   support

           0       0.81      0.83      0.82       829
           1       0.78      0.75      0.76       642

    accuracy                           0.80      1471
   macro avg       0.79      0.79      0.79      1471
weighted avg       0.80      0.80      0.80      1471
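The f1-score column in the report can be reproduced from the precision and recall columns, since F1 is their harmonic mean. The sketch below plugs in the rounded values printed above, so it only approximates the computation on the raw predictions; the helper name `f1_from` is made up for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall values taken from the report above
f1_class0 = f1_from(0.81, 0.83)
f1_class1 = f1_from(0.78, 0.75)

# Macro average: plain mean over classes
macro_f1 = (f1_class0 + f1_class1) / 2

# Weighted average: mean weighted by each class's support (829 and 642)
weighted_f1 = (829 * f1_class0 + 642 * f1_class1) / (829 + 642)

print(round(f1_class0, 2), round(f1_class1, 2))  # 0.82 0.76
print(round(macro_f1, 2))                        # 0.79
print(round(weighted_f1, 2))                     # 0.8
```

The lower recall for class 1 (0.75) means roughly a quarter of the disaster tweets in the validation set are missed, which matters more than overall accuracy if the goal is disaster monitoring.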



In [46]:
def predict(sentence, clf):
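    '''
    Takes in a sentence and returns the predicted class along with the
    probability of class 1 (a real disaster)
    '''

    # Apply the same cleaning and embedding steps used for the training data
    cleaned_sentence = normalize_sentence(sentence)
    encoded_sentence = feature_extractor.embed(cleaned_sentence)
    # scikit-learn expects a 2-D array of shape (n_samples, n_features)
    encoded_sentence = np.array(encoded_sentence).reshape(1, -1)

    predicted_class = clf.predict(encoded_sentence)[0]
    # Column 1 of predict_proba holds the probability of class 1
    probability = clf.predict_proba(encoded_sentence)[:, 1][0]

    return predicted_class, probability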

sentence1 = "I love the sunny weather today."
sentence2 = "There's heavy rain and thunder outside."

class1, prob1 = predict(sentence1, logreg_classifier)
class2, prob2 = predict(sentence2, logreg_classifier)

print(f"sentence 1: predicted class - {class1}, probability - {prob1}")
print(f"sentence 2: predicted class - {class2}, probability - {prob2}")


sentence 1: predicted class - 0, probability - 0.4323131968062236
sentence 2: predicted class - 1, probability - 0.8999811074198404
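The printed probability is the model's estimate for class 1, which is why sentence 1 is assigned class 0 even though its value is 0.43: it falls below the 0.5 cut-off. The sketch below shows that relationship for a binary logistic regression using made-up weights and two features (the real model has 384); the names `sigmoid` and `classify` are hypothetical helpers, not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, weights, bias, threshold=0.5):
    """Return (predicted class, probability of class 1)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    predicted_class = 1 if probability > threshold else 0
    return predicted_class, probability

# Made-up parameters for illustration only
weights = [2.0, -1.0]
bias = 0.0

class_a, prob_a = classify([1.0, 0.0], weights, bias)  # score = 2.0
class_b, prob_b = classify([0.0, 1.0], weights, bias)  # score = -1.0

print(class_a, prob_a > 0.5)  # 1 True
print(class_b, prob_b < 0.5)  # 0 True
```

Lowering `threshold` would flag more tweets as disasters, at the cost of more false alarms.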
