# UofT FASE ML Bootcamp
#### Friday June 14, 2024
####  Word Embeddings - Properties, Meaning and Training - Lab 1, Day 5
#### Teaching team: Eldan Cohen, Alex Olson, Nakul Upadhya, Hriday Chheda
##### Based on CARTE-DSI ML Bootcamp 2023 notebook by Prof. Jonathan Rose

This lab engages you in the properties, meaning, viewing and training of word embeddings (also called word vectors). The specific learning objectives in this assignment are:

1.   To learn word embedding properties, and use them in simple ways.
2.   (optional) To translate vectors into understandable categories of meaning.
3.   To understand how embeddings are created, using the Skip Gram method.

---





# 1. Experimenting and Understanding Word Embedding/Vectors
# Using the GloVe Embeddings


Word embeddings (also known as word vectors) are a way to encode the meaning of words into a set of numbers.

These embeddings are created by training a neural network model using many examples of the use of language.  These examples could be the whole of Wikipedia or a large collection of news articles.

To start, we will explore a set of word embeddings that someone else took the time and computational power to create. One of the most commonly used sets of pre-trained word embeddings is the **GloVe embeddings**.

## GloVe Embeddings

You can read about the GloVe embeddings here: https://nlp.stanford.edu/projects/glove/, and read the original paper describing how they work here: https://nlp.stanford.edu/pubs/glove.pdf.

There are several variations of GloVe embeddings. They differ in the text used to train the embedding, and the *size* of the embeddings.

Throughout this lab we'll use a package called `torchtext`, which is part of the PyTorch project.

We'll begin by loading a set of GloVe embeddings. The first time you run the code below, it will download a large file (862MB) containing the embeddings.

In [None]:
# Import the required libraries
import torch
import torchtext
import pandas as pd
torchtext.disable_torchtext_deprecation_warning()
from torchtext.vocab import GloVe

In [None]:
# The first time you run this, it will download a ~862MB file
glove = GloVe(name="6B", # trained on Wikipedia 2014 + Gigaword 5 (6 billion tokens)
              dim=50)  # embedding size = 50

We can use the loaded glove embeddings to look up the embeddings of individual words.
For example, let's look at what the embedding of the word "apple" looks like:

In [None]:
glove['apple']

As we can see from the output above, the embedding of a given word is a torch tensor of shape `(50,)`. We don't know what each individual number means, but we do know that the embeddings have properties that can be observed. For example, *distances between embeddings* are meaningful.

## Measuring Distance

Let's consider one specific measure of distance between two embedding vectors, the **Euclidean distance**. The Euclidean distance between two vectors $x = [x_1, x_2, ... x_n]$ and
$y = [y_1, y_2, ... y_n]$ is just the 2-norm of their difference $x - y$, that is,
$\sqrt{\sum_i (x_i - y_i)^2}$.
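
As a quick check of the formula, here is a minimal sketch in plain Python. The two three-element vectors are made up for illustration and are not real embeddings; the point is only to show the sum of squared differences being computed step by step.

```python
import math

# Two small made-up vectors (not real embeddings)
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

# Euclidean distance: square root of the sum of squared differences
# Differences are (-3, -4, 0), so the sum of squares is 9 + 16 + 0 = 25
manual = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# The standard library's math.dist computes the same quantity
builtin = math.dist(x, y)

print(manual, builtin)
```

Both values come out as 5.0, the length of the difference vector.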

The PyTorch function `torch.norm` computes the 2-norm of a vector for us, so we
can compute the Euclidean distance between two vectors like this:

In [None]:
x = glove['cat']
y = glove['dog']
torch.norm(y - x)

In [None]:
a = glove['apple']
b = glove['orange']
torch.norm(b - a)

In [None]:
torch.norm(glove['good'] - glove['bad'])

In [None]:
torch.norm(glove['good'] - glove['water'])

In [None]:
torch.norm(glove['good'] - glove['well'])

In [None]:
torch.norm(glove['good'] - glove['perfect'])

## Cosine Similarity

An alternative and more commonly used measure is the **Cosine Similarity**. The cosine similarity measures the *angle* between two vectors, and has the property that it only considers the *direction* of the vectors, not their magnitudes. It is computed as follows for two vectors A and B:


![picture](https://drive.google.com/uc?id=1hSaQRBjH828lx1xozJCA4F0ZhiX2S0Xt)

In [None]:
#consider two vectors x and y
#unsqueeze is used because cosine similarity wants at least 2-D inputs
x = torch.tensor([1., 1., 1.]).unsqueeze(0)
y = torch.tensor([2., 2., 2.]).unsqueeze(0)

# Calculate the cosine similarity between x and y
# Expect the cosine similarity to be 1.0 since x and y are in the same direction
torch.cosine_similarity(x, y)

The cosine similarity is actually a *similarity* measure rather than a *distance* measure, and gives a result between -1 and 1. Thus, the larger the similarity (the closer to 1), the "closer in meaning" the word embeddings are to each other.
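
To make the "direction only" property concrete, the sketch below computes cosine similarity by hand in plain Python on made-up 2-D vectors. The helper name `cosine_similarity` and the vectors `u`, `v`, `w` are illustrative assumptions, not part of the lab code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

u = [3.0, 4.0]
v = [30.0, 40.0]   # same direction as u, ten times longer
w = [-4.0, 3.0]    # perpendicular to u

same_direction = cosine_similarity(u, v)   # ignores the difference in length
perpendicular = cosine_similarity(u, w)    # dot product is zero
distance_uv = math.dist(u, v)              # Euclidean distance does see the length

print(same_direction, perpendicular, distance_uv)
```

Here `u` and `v` have a cosine similarity of 1 even though they are 45 units apart in Euclidean distance, which is one reason the two measures can rank neighbours differently.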

In [None]:
z = torch.tensor([-1., -1., -1.]).unsqueeze(0)

# Calculate the cosine similarity between x and z
# Expect the cosine similarity to be -1.0 since x and z point in the opposite "direction"
torch.cosine_similarity(x, z)

In [None]:
x = glove['cat']
y = glove['dog']
torch.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0))

In [None]:
a = glove['apple']
b = glove['banana']
torch.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))

In [None]:
torch.cosine_similarity(glove['good'].unsqueeze(0),
                        glove['bad'].unsqueeze(0))

In [None]:
torch.cosine_similarity(glove['good'].unsqueeze(0),
                        glove['water'].unsqueeze(0))

In [None]:
torch.cosine_similarity(glove['good'].unsqueeze(0),
                        glove['well'].unsqueeze(0))

In [None]:
torch.cosine_similarity(glove['good'].unsqueeze(0),
                        glove['perfect'].unsqueeze(0))

In [None]:
torch.cosine_similarity(glove['watermelon'].unsqueeze(0),
                        glove['aeroplane'].unsqueeze(0))

Note: with its default arguments, `torch.cosine_similarity` expects inputs with at least two dimensions, so we use `unsqueeze(0)` to add a leading dimension of size 1, as illustrated in more detail below.

In [None]:
x = glove['good']
print(x.shape) # [50]
y = x.unsqueeze(0) # [1, 50]
print(y.shape)

## Word Similarity

Now that we have notions of distance and similarity in our embedding space, we can talk about words that are "close" to each other in the embedding space. For now, let's use Euclidean distances to look at how close various words are to the word "cat".

In [None]:
word = 'cat'
other = ['pet', 'dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    dist = torch.norm(glove[word] - glove[w]) # euclidean distance
    print(w, "\t%5.2f" % float(dist))

Let's do the same thing with cosine similarity:

In [None]:
word = 'cat'
other = ['pet', 'dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    sim = torch.cosine_similarity(glove[word].unsqueeze(0), glove[w].unsqueeze(0)) # cosine similarity (higher = closer)
    print(w, "\t%5.2f" % float(sim))

We can look through the entire **vocabulary** for words that are closest to a point in the embedding space -- for example, we can look for words that are closest to another word such as "cat".

In [None]:
def print_closest_words(vec, n=5):
    dists = torch.norm(glove.vectors - vec, dim=1)     # compute distances to all words
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1]) # sort by distance
    closest_words = []
    # lst[0] is skipped: when vec is the embedding of a vocabulary word,
    # lst[0] is that word itself (distance 0). For any other query point,
    # this drops the single nearest word.
    for idx, difference in lst[1:n+1]:                         # take the next n
        print(glove.itos[idx], "\t%5.2f" % difference)
        closest_words.append(glove.itos[idx])
    return closest_words

closest_words_to_cat = print_closest_words(glove["cat"], n=10)

In [None]:
closest_words_to_dog = print_closest_words(glove['dog'])

In [None]:
closest_words_to_nurse = print_closest_words(glove['nurse'])

In [None]:
closest_words_to_computer = print_closest_words(glove['computer'])

In [None]:
#You can also try printing closest words to any other words of your choice here:



We could also look at which words are closest to the midpoints of two words:

In [None]:
closest_to_mid_1 = print_closest_words((glove['happy'] + glove['sad']) / 2)

In [None]:
closest_to_mid_2 = print_closest_words((glove['lake'] + glove['building']) / 2)

In [None]:
closest_to_mid_3 = print_closest_words((glove['bravo'] + glove['michael']) / 2)

## 1.1
1.1.1 Write a new function, similar to `print_closest_words`, called `print_closest_cosine_words`, that prints out the N most similar words (where N is an input parameter) using cosine similarity rather than Euclidean distance.

The documentation for the [sorted](https://python-reference.readthedocs.io/en/latest/docs/functions/sorted.html) function in Python might help.



In [None]:
def print_closest_cosine_words(vec, n=5):
  # TODO
  sims =   # compute similarities to all words
  lst =  # sort by similarity (descending order, remember higher similarity score, closer the word)
  closest_words = []
  # take the top n
  # TODO
  return closest_words

1.1.2 Create a table that lists, in order, the 10 most cosine-similar words to the word 'dog' alongside the 10 closest words by Euclidean distance.

In [None]:
closest_euclidean_words = print_closest_words(glove['dog'], n=10)
print("\n")
closest_cosine_words = print_closest_cosine_words(glove['dog'], n=10)
print("\n")
table = pd.DataFrame()
table["Euclidean"] = closest_euclidean_words
table["Cosine"] = closest_cosine_words
print(table)

In [None]:
# Compute the same table for word "computer"
# TODO

1.1.3 Looking at the two lists, does one of the metrics (cosine similarity or Euclidean distance) seem to be better than the other?

TODO


## 1.2 Analogies

One surprising aspect of word embeddings is that the *directions* in the embedding space can be meaningful. For example, some analogy-like relationships like this tend to hold:

$$ king - man + woman \approx queen $$

Analogies show us how relationships between pairs of words are captured in the learned vectors.
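
The arithmetic behind such an analogy can be sketched with hand-made vectors. In the toy example below, the 2-D "embeddings" are invented so that the first axis loosely stands for royalty and the second for gender. Real GloVe dimensions have no such clean interpretation, so this only illustrates the vector arithmetic.

```python
import math

# Hand-made 2-D "embeddings" (an assumption for illustration):
# axis 0 ~ royalty, axis 1 ~ gender
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

# king - man + woman, computed element by element
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Nearest toy word to the resulting point, by Euclidean distance
nearest = min(toy, key=lambda word: math.dist(toy[word], query))

print(query, nearest)  # [1.0, -1.0] queen
```

Subtracting `man` and adding `woman` flips only the gender axis, so the result lands exactly on the toy vector for `queen`. With real embeddings the match is only approximate.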

In [None]:
print_closest_words(glove['king'] - glove['man'] + glove['woman'])

The top results include a reasonable answer, "queen", as well as the name of the Queen of England.

We can flip the analogy around and it works:

In [None]:
_ = print_closest_words(glove['queen'] - glove['woman'] + glove['man'])

In [None]:
_ = print_closest_words(glove['king'] - glove['prince'] + glove['princess'])

In [None]:
_ = print_closest_words(glove['uncle'] - glove['man'] + glove['woman'])

In [None]:
_ = print_closest_words(glove['grandmother'] - glove['mother'] + glove['father'])

In [None]:
_ = print_closest_words(glove['old'] - glove['young'] + glove['father'])

We can also move an embedding towards the direction of "goodness" or "badness":

In [None]:
_ = print_closest_words(glove['good'] - glove['bad'] + glove['programmer'])

In [None]:
_ = print_closest_words(glove['bad'] - glove['good'] + glove['programmer'])

1.2.1 Consider now the word pair relationships given in Figure 1 below, which comes from Table 1 of the paper by Mikolov et al. [[link](https://arxiv.org/abs/1301.3781)]. Choose one of these relationships, but not one of the ones already shown above, and report which one you chose. Write and run code that will generate the second word given the first word. Generate 10 more examples of the same relationship from 10 other words, and comment on the quality of the results.

![picture](https://drive.google.com/uc?id=1O7Zizu63jj5aoZkGkK0sz93CZSEsBDuW)



In [None]:
# TODO
# Choose one of the relationships from the table above and generate 10 examples



## 1.3 Change Embedding Dimension
Now we change the embedding dimension (also known as the vector size) from 50 to 300 and re-run all the examples from above, including the new cosine similarity function. Answer the following questions:
1.   How does the Euclidean distance change between the various words when switching from d=50 to d=300?
2.   How does the cosine similarity change?
3.   Does the ordering of nearness change?
4.   Is it clear that the larger size vectors give better results - why or why not?



In [None]:
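# The ~862MB glove.6B archive downloaded earlier should already contain the 300-d vectors,
# so a fresh download is only expected if the cached file is missing
glove = GloVe(name="6B", # trained on Wikipedia 2014 + Gigaword 5 (6 billion tokens)
              dim=300)  # embedding size = 300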

Let's check the Euclidean distances for an embedding dimension of 300:

In [None]:
x = glove['cat']
y = glove['dog']
torch.norm(y - x)

In [None]:
a = glove['apple']
b = glove['orange']
torch.norm(b - a)

In [None]:
torch.norm(glove['good'] - glove['bad'])

In [None]:
torch.norm(glove['good'] - glove['water'])

In [None]:
torch.norm(glove['good'] - glove['well'])

In [None]:
torch.norm(glove['good'] - glove['perfect'])

1.3.1 Compare to the Euclidean distances from above (embedding dimension of 50) and answer question 1. How does the Euclidean distance change between the various words when switching from d=50 to d=300?

Answer: TODO

Next, let's look at cosine similarity with an embedding dimension of 300:

In [None]:
x = glove['cat']
y = glove['dog']
torch.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0))

In [None]:
a = glove['apple']
b = glove['banana']
torch.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))

In [None]:
torch.cosine_similarity(glove['good'].unsqueeze(0),
                        glove['bad'].unsqueeze(0))

In [None]:
torch.cosine_similarity(glove['good'].unsqueeze(0),
                        glove['water'].unsqueeze(0))

In [None]:
torch.cosine_similarity(glove['good'].unsqueeze(0),
                        glove['well'].unsqueeze(0))

In [None]:
torch.cosine_similarity(glove['good'].unsqueeze(0),
                        glove['perfect'].unsqueeze(0))

In [None]:
torch.cosine_similarity(glove['watermelon'].unsqueeze(0),
                        glove['aeroplane'].unsqueeze(0))

1.3.2 Compare to the cosine similarities from above (embedding dimension of 50) and answer question 2. How does the cosine similarity change when switching from d=50 to d=300?

Answer: TODO

Next, we will look at the nearness of words

In [None]:
closest_euclidean_words = print_closest_words(glove['dog'])
closest_cosine_words = print_closest_cosine_words(glove['dog'])
table = pd.DataFrame()
table["Euclidean"] = closest_euclidean_words
table["Cosine"] = closest_cosine_words
print(table)

1.3.3 Compare to the nearest words from above (embedding dimension of 50) and answer question 3. Does the ordering of the nearest words change?

Answer: TODO

1.3.4 Is it clear that the larger size vectors give better results - why or why not?

Answer: TODO

# 2. (Optional): Computing Meaning from Word Embeddings

Note: Attempt this section after you have finished the rest of the lab!

Now that we’ve seen some of the power of word embeddings, we can also feel the frustration that the individual elements/numbers in each word vector do not have a meaning that can be interpreted or understood by humans. It would have been preferable for each position in the vector to correspond to a specific axis of meaning that we can understand based on our ability to comprehend language.

For example, a position might give the "amount" that the word relates to *colour* or *temperature* or *politics*. This is not the case, because the numbers are the result of an optimization process that does not drive each vector element toward human-understandable meaning.

We can, however, make use of the methods shown in Section 1 above to measure the amount of meaning in specific categories of our choosing, such as colour. Suppose that we want to know how much a particular word/embedding relates to colour. One way to measure that could be to determine the cosine similarity between the word embedding for colour and the word of interest. We might expect that a word like ‘sky’ or ‘grass’ might have elements of colour in it, and that ‘purple’ would have more. However, it may also be true that there are multiple meanings to a single word, such as ‘colour’, and so it might be better to define a category of meaning by using several words that, all together, define it with more precision.

For example, a way to define a category such as colour would be to use that word itself, together with several examples, such as ‘red’, ‘green’, ‘blue’, ‘yellow.’ Then, to measure the “amount” of colour in a specific word (like ‘sky’) you could compute the average cosine similarity between sky and each of the words in the category. Alternatively, you could average the vectors of all the words in the category, and compute the cosine similarity between the embedding of sky and that average embedding. In this section, use the d=50 GloVe embeddings that you used in Section 1.



In [None]:
# Load GloVe embeddings using torchtext
glove = GloVe(name='6B', dim=50)
embedding_size = 50  # Size of GloVe embeddings

## 2.1
Write a PyTorch-based function called `compare_words_to_category` that takes as input:
* The meaning category given by a set of words (as discussed above) that describe the category, and
* A given word to ‘measure’ against that category.


The function should compute the cosine similarity of the given word to the category in two ways:

(a) By averaging the cosine similarity of the given word with every word in the category, and

(b) By computing the cosine similarity of the word with the average embedding of all of the words in the category.
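
In symbols (the notation here is introduced only for this note and is not from the original lab): let $w$ be the embedding of the given word and $c_1, \dots, c_K$ the embeddings of the $K$ category words, with $\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$. The two measures can then be written as:

$$ s_a = \frac{1}{K} \sum_{k=1}^{K} \cos(w, c_k) \qquad\qquad s_b = \cos\left(w, \; \frac{1}{K} \sum_{k=1}^{K} c_k\right) $$

Method (a) averages $K$ separate similarities, while method (b) averages the vectors first and then computes a single similarity.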

In [None]:
# Function to convert a word to its embedding
def word_to_embedding(word, glove):
    if word in glove.stoi:
        return glove[word]
    else:
        return torch.zeros(embedding_size)

def compare_words_to_category(word, category_words, glove):
    word_embedding = word_to_embedding(word, glove).unsqueeze(0)
    category_embeddings = torch.stack([word_to_embedding(w, glove) for w in category_words])

    # Method (a): Average cosine similarity of the given word with every word in the category
    cosine_similarities = # TODO
    avg_cosine_similarity = cosine_similarities.mean().item()

    # Method (b): Cosine similarity of the word with the average embedding of the category words
    avg_category_embedding = # TODO
    avg_category_cosine_similarity = torch.cosine_similarity(word_embedding, avg_category_embedding).item()

    return avg_cosine_similarity, avg_category_cosine_similarity

## 2.2
Let’s define the colour meaning category using these words: “colour”, “red”, “green”, “blue”, “yellow.” Compute the similarity (using both methods (a) and (b) above) for each of these words: “greenhouse”, “sky”, “grass”, “purple”, “scissors”, “microphone”, “president” and present them in a table.

In [None]:
category_words = ["colour", "red", "green", "blue", "yellow"]
words_to_measure = ["greenhouse", "sky", "grass", "purple", "scissors",
                    "microphone", "president"]

results = []
for word in words_to_measure:
    avg_cosine_similarity, avg_category_cosine_similarity = # TODO
    results.append((word, avg_cosine_similarity, avg_category_cosine_similarity))

# Create a DataFrame to present the results in a table
df_results = pd.DataFrame(results, columns=["Word", "Avg Cosine Similarity", "Cosine Similarity with Avg Embedding"])
print(df_results)

2.2.1 Do the results for each method make sense? Why or why not? What is the apparent difference between methods (a) and (b)?

Answer: TODO

# 3. Training A Word Embedding Using the Skip-Gram Method on a Small Corpus

So far in this notebook we've used the pre-trained GloVe embeddings. The lecture this morning described the Skip Gram method of training word embeddings. In this section you are going to review code to use that method to train a very small embedding, for a very small vocabulary on a very small corpus of text. The goal is to gain some insight into the general notion of how embeddings are produced. The corpus you are going to use is in the file SmallSimpleCorpus.txt, and was also shown in the lecture.

NOTE: First we need to upload the file SmallSimpleCorpus.txt to the Colab environment. Download the data files from the lab_5_1 folder on the GitHub page, then click the folder icon on the left-hand side of this page and use the "upload to session storage" button to upload the data files to the Colab session.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import spacy

In [None]:
# Load SpaCy's English language model
nlp = spacy.load("en_core_web_sm")

First, read the file SmallSimpleCorpus.txt so that you see what the sequence of sentences is. Recalling the notion “you shall know a word by the company it keeps,” find three pairs of words that this corpus implies have similar or related meanings. For example, ‘he’ and ‘she’ are one such pair – which you cannot use in your answer!

Answer: TODO

In [None]:
# Upload the file to colab and read in the corpus
with open('./SmallSimpleCorpus.txt', 'r') as file:
    corpus = file.read()

# Preprocess the text
def prepare_texts(corpus):
    doc = nlp(corpus)
    lemmas = [token.lemma_ for token in doc if token.is_alpha]
    return lemmas

The `prepare_texts` function in the code given to you fulfills several key functions in text processing, somewhat simplified for this simple corpus. Rather than performing full tokenization, it only lemmatizes the corpus,
which means converting words to their root form - for example the word “holds” becomes “hold”, whereas the word “hold” itself stays the same.
The `prepare_texts` function performs lemmatization using the [spaCy](https://spacy.io/models/en) library.
Review the code of `prepare_texts` to make sure you understand what it is doing. Review the code that reads the corpus SmallSimpleCorpus.txt, and run `prepare_texts` on it to return the text (lemmas) that will be used next.


In [None]:
# Lemmatize the corpus and create the vocabulary
lemmas = #TODO
vocab = sorted(set(lemmas)) # sorted so that the word-to-index mapping is the same on every run
v2i = {v: i for i, v in enumerate(vocab)} # dictionary to lookup word to index
i2v = {i: v for v, i in v2i.items()} # dictionary to lookup index to word

* Check that the vocabulary size is 11.
* Which is the most frequent word in the corpus, and the least frequent word?
* What purpose do the v2i and i2v dictionaries serve?

In [None]:
#TODO


The function `tokenize_and_preprocess_text` takes the lemmatized small corpus as input, along with `v2i` (which serves as a simple, lemma-based tokenizer) and a window size `window`. Its output is the Skip-Gram training dataset for this corpus: pairs of words in the corpus that “belong” together, in the Skip-Gram sense.
That is, for every word in the corpus a set of training examples is generated with that word serving as the (target) input to the predictor,
and each word that fits within a window of size `window` centred on that word serving as a “context” word to be predicted.
The words are expressed as tokens (numbers).
Add a little code so that you can see the dataset that is produced.
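
To get a feel for what such a dataset looks like, here is a minimal sketch on a made-up four-word sentence (not taken from SmallSimpleCorpus.txt). The helper `skipgram_pairs` is hypothetical: it follows the same windowing rule as described above, but keeps the words as strings instead of converting them to indices so that the pairs are easy to read.

```python
def skipgram_pairs(tokens, window=3):
    """Return (target, context) pairs; `window` counts the target itself."""
    half = window // 2
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - half)                 # clip the window at the start
        hi = min(len(tokens), i + half + 1)   # clip the window at the end
        for j in range(lo, hi):
            if j != i:                        # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["i", "hold", "a", "dog"]   # made-up example sentence
pairs = skipgram_pairs(sentence, window=3)
print(pairs)
# [('i', 'hold'), ('hold', 'i'), ('hold', 'a'), ('a', 'hold'), ('a', 'dog'), ('dog', 'a')]
```

With `window=3` each word is paired with at most one neighbour on each side. With `window=5` it would be paired with up to two neighbours on each side, giving 10 pairs for this sentence.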

In [None]:
# Tokenize and preprocess the text
def tokenize_and_preprocess_text(lemmas, v2i, window=3):
    data = []
    for i in range(len(lemmas)):
        target = v2i[lemmas[i]]
        context = []
        for j in range(i - window // 2, i + window // 2 + 1):
            if j != i and j >= 0 and j < len(lemmas):
                context.append(v2i[lemmas[j]])
        for c in context:
            data.append((target, c))
    return data

In [None]:
# Create the Skip gram dataset with window size of 5
window_size = # TODO
data = tokenize_and_preprocess_text(lemmas, v2i, window_size)


Review the code in `Word2VecModel`. Part of this model ultimately provides the trained embeddings/vectors, and you can see these are defined and initialized to random numbers in the line `self.embedding = torch.nn.Parameter(torch.rand(vocab_size, embedding_size))`.
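
Indexing the embedding matrix with a token number simply selects one row. The plain-Python sketch below, using a made-up 3-word vocabulary with 2-D vectors, shows that this row lookup gives the same result as multiplying a one-hot vector by the matrix, which is the view of the embedding layer often used when describing Skip-Gram.

```python
# A made-up 3-word vocabulary with 2-D embeddings, one row per token index
embedding = [
    [0.1, 0.9],   # token 0
    [0.5, 0.5],   # token 1
    [0.8, 0.2],   # token 2
]

token = 2

# Direct row lookup, analogous to indexing the embedding parameter by token
looked_up = embedding[token]

# The same result via a one-hot vector multiplied by the matrix
one_hot = [1.0 if i == token else 0.0 for i in range(len(embedding))]
via_matmul = [
    sum(one_hot[row] * embedding[row][col] for row in range(len(embedding)))
    for col in range(len(embedding[0]))
]

print(looked_up, via_matmul)  # both are [0.8, 0.2]
```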

In [None]:
# The Word2Vec model
class Word2VecModel(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super(Word2VecModel, self).__init__()
        self.embedding = torch.nn.Parameter(torch.rand(
            vocab_size, embedding_size))
        self.fc = nn.Linear(embedding_size, vocab_size)

    def forward(self, x):
        x = self.embedding[x]
        x = self.fc(x)
        return x

In [None]:
# Set the vocab size
vocab_size = # TODO
embedding_size = 2 # Size of the embedding vector
# Initialize the Word2Vec model
model = # TODO

Review the code for training the model. It uses the cross-entropy loss function described in the lecture, a batch size of 4, and 50 epochs of training. It uses the Adam optimizer with a learning rate of 0.001. The window size of 5 was set earlier, when the dataset was created.
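
As a reminder of what the cross-entropy loss measures, the sketch below computes it by hand for a single example: apply a softmax to the model's output scores, then take the negative log of the probability assigned to the true context word. The scores are made up, and `cross_entropy` is a hypothetical helper written in plain Python rather than the PyTorch implementation.

```python
import math

def cross_entropy(logits, label):
    """-log(softmax(logits)[label]) for one example."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]   # subtract max for numerical stability
    prob_of_label = exps[label] / sum(exps)
    return -math.log(prob_of_label)

# Made-up scores over a 3-word vocabulary
logits = [2.0, 0.0, 0.0]

loss_if_correct = cross_entropy(logits, label=0)   # true word has the highest score
loss_if_wrong = cross_entropy(logits, label=1)     # true word has a low score
uniform_loss = cross_entropy([0.0, 0.0, 0.0], label=0)  # no preference: ln(3)

print(round(loss_if_correct, 4), round(loss_if_wrong, 4), round(uniform_loss, 4))
```

The loss is small when the true word gets a high score and large otherwise. A model with no preference among $V$ words has a loss of $\ln V$, which for the 11-word vocabulary here is about 2.40, a useful reference point when reading the per-epoch losses.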

In [None]:
# Training the model
def train_word2vec(model, data, epochs=50, batch_size=4, learning_rate=0.001):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        np.random.shuffle(data)
        losses = []
        for i in range(0, len(data), batch_size):
            batch = data[i:i+batch_size]
            inputs, labels = zip(*batch)
            inputs = torch.tensor(inputs, dtype=torch.long)
            labels = torch.tensor(labels, dtype=torch.long)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            losses.append(loss.item())

        print(f'Epoch {epoch+1}, Loss: {np.mean(losses):.4f}')

train_word2vec(model, data)

Run the code below that displays each of the embeddings in a 2-dimensional plot using Matplotlib.


In [None]:
# Visualize the embeddings
import matplotlib.pyplot as plt

def plot_embeddings(model, i2v):
    embeddings = model.embedding.data.numpy()
    plt.figure(figsize=(10, 10))
    for i, word in i2v.items():
        x, y = embeddings[i]
        plt.scatter(x, y)
        plt.text(x + 0.02, y + 0.02, word, fontsize=12)
    plt.show()

plot_embeddings(model, i2v)

Do the results make sense, and confirm your choices from part 1 of this Section?
Answer: TODO

What would happen when the window size is too large?
Answer: TODO

At what value would window become too large for this corpus?
Answer: TODO

# 4. Training A Single-Neuron Classifier to Determine if a Sentence is Objective or Subjective

The purpose of this exercise is to review the code for training a simple network (just a single neuron) to determine if a sentence is objective or subjective.

In [None]:
!pip install gradio

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
import spacy
from torchtext.vocab import GloVe
import gradio as gr
import matplotlib.pyplot as plt  # used by the loss and accuracy plots below

In [None]:
# Upload the file data.tsv to Colab, then load the data
df = pd.read_csv('data.tsv', sep='\t', header=None, names=['sentence', 'label'])
df = df.loc[1:]  # keep rows from index 1 onward (drops the first row)

Take a quick look at the file data.tsv to see which sentences were labelled subjective (1) and which objective (0). (The 1’s are in the first half of the file)

Read through each of the code blocks, getting a rough sense of what is going on by reading the comments.
You will see code that splits the dataset in the tab-separated file data.tsv into training, validation and test sets.
Look most closely at the code block that defines the `ClassifierModel` class, where you can see the `torch.nn.Linear` class being used to instantiate a single neuron with `embedding_size` inputs and just 1 output.
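
The overall idea of this section can be sketched in a few lines of plain Python: average the word vectors of a sentence into one sentence vector, then pass that vector through a single neuron (a weighted sum plus a bias, followed by a sigmoid). The vectors, weights and bias below are made up for illustration, and `average` and `neuron` are hypothetical helpers, not the lab's PyTorch code.

```python
import math

def average(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def neuron(x, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, then a sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Made-up 2-D word vectors for a three-word sentence
word_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
sentence_vector = average(word_vectors)        # [1.0, 1.0]

# Made-up parameters; in the lab these are learned during training
weights, bias = [2.0, -1.0], 0.0
probability = neuron(sentence_vector, weights, bias)
label = "Subjective" if probability > 0.5 else "Objective"

print(sentence_vector, round(probability, 4), label)
```

The sigmoid output is read as the probability of the positive class (1 = subjective in data.tsv), and the 0.5 threshold turns it into a label, mirroring the thresholding used in the evaluation code.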

In [None]:
# Load SpaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Load GloVe embeddings using torchtext
glove = GloVe(name='6B', dim=50)
embedding_size = 50  # Size of GloVe embeddings

# Function to convert a sentence to an embedding by averaging word embeddings
def sentence_to_embedding(sentence, glove):
    tokens = [token.text.lower() for token in nlp(sentence) if token.is_alpha]  # glove.6B vocabulary is lowercase
    embeddings = [glove[token] for token in tokens if token in glove.stoi]
    if embeddings:
        return torch.mean(torch.stack(embeddings), dim=0)
    else:
        return torch.zeros(embedding_size)

# Convert all sentences to embeddings at once
def convert_sentences_to_embeddings(sentences, glove):
    embeddings = [sentence_to_embedding(sentence, glove) for sentence in sentences]
    return torch.stack(embeddings)

# Splitting the data into train, validation, and test sets
train_sentences, test_sentences, train_labels, test_labels = train_test_split(
    df['sentence'], df['label'], test_size=0.2, random_state=42)
train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    train_sentences, train_labels, test_size=0.125, random_state=42)  # 0.125 * 0.8 = 0.1

train_embeddings = convert_sentences_to_embeddings(train_sentences, glove)
val_embeddings = convert_sentences_to_embeddings(val_sentences, glove)
test_embeddings = convert_sentences_to_embeddings(test_sentences, glove)

In [None]:
# Custom dataset class
class TextDataset(Dataset):
    def __init__(self, embeddings, labels):
        self.embeddings = embeddings
        self.labels = labels

    def __len__(self):
        return len(self.embeddings)

    def __getitem__(self, idx):
        embedding = self.embeddings[idx]
        label = self.labels[idx]
        return embedding, torch.tensor(float(label), dtype=torch.float32)

# Convert to dataset
train_dataset = TextDataset(train_embeddings, train_labels.tolist())
val_dataset = TextDataset(val_embeddings, val_labels.tolist())
test_dataset = TextDataset(test_embeddings, test_labels.tolist())

# Create data loaders
batch_size = 128
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [None]:
# Define the classifier model
class ClassifierModel(nn.Module):
    def __init__(self, input_size):
        super(ClassifierModel, self).__init__()
        self.linear = nn.Linear(input_size, 1)

    def forward(self, x):
        x = self.linear(x)
        return torch.sigmoid(x)


In [None]:
#Initialize the model
model = ClassifierModel(embedding_size)

# Loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training the model
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=10):
    train_losses = []
    val_losses = []
    val_accuracies = []

    for epoch in range(num_epochs):
        model.train()
        batch_train_losses = []
        for embeddings, labels in train_loader:
            labels = labels.view(-1, 1)

            optimizer.zero_grad()
            outputs = model(embeddings)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            batch_train_losses.append(loss.item())

        model.eval()
        batch_val_losses = []
        correct = 0
        total = 0
        with torch.no_grad():
            for embeddings, labels in val_loader:
                labels = labels.view(-1, 1)

                outputs = model(embeddings)
                loss = criterion(outputs, labels)

                batch_val_losses.append(loss.item())

                predicted = (outputs > 0.5).float()
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        train_loss = np.mean(batch_train_losses)
        val_loss = np.mean(batch_val_losses)
        val_accuracy = correct / total

        train_losses.append(train_loss)
        val_losses.append(val_loss)
        val_accuracies.append(val_accuracy)

        print(f'Epoch {epoch+1}, Training Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}')

    return train_losses, val_losses, val_accuracies

num_epochs = 20
train_losses, val_losses, val_accuracies = train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs)

# Plotting training and validation losses
plt.figure(figsize=(5, 2.5))
plt.plot(range(num_epochs), train_losses, label='Training Loss')
plt.plot(range(num_epochs), val_losses, label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Plotting validation accuracy
plt.figure(figsize=(5, 2.5))
plt.plot(range(num_epochs), val_accuracies, label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

# Define the function to evaluate the model
def evaluate_model(model, test_loader):
    model.eval()
    test_losses = []
    correct = 0
    total = 0
    with torch.no_grad():
        for embeddings, labels in test_loader:
            labels = labels.view(-1, 1)

            outputs = model(embeddings)
            loss = criterion(outputs, labels)
            test_losses.append(loss.item())

            predicted = (outputs > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = correct / total
    print(f'Test Loss: {np.mean(test_losses):.4f}, Accuracy: {accuracy:.4f}')

print("\n")
evaluate_model(model, test_loader)

Run the remaining code, which uses the gradio package (installed above) to set up an easy-to-use interface to the trained model, in which you can type any sentence and have the model decide if it is subjective or objective. To use the interface, just type in the sentence and click the classify button. You may find that it is better at classifying longer sentences, as that is the nature of the dataset it was trained on.

In [None]:
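# Define the Gradio interface
def classify_sentence(sentence):
    embedding = sentence_to_embedding(sentence, glove)
    embedding = embedding.unsqueeze(0)  # add a batch dimension: [1, embedding_size]
    model.eval()
    with torch.no_grad():  # inference only, no gradients needed
        output = model(embedding)
    label = 'Subjective' if output.item() > 0.5 else 'Objective'
    return label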

gr.Interface(fn=classify_sentence, inputs="text", outputs="text").launch()