# Word Embeddings

In [1]:
%pip install datasets

import numpy as np

import gensim.downloader
from gensim.models import Word2Vec
from gensim.utils import tokenize, simple_preprocess

import nltk
from nltk.corpus import brown, stopwords

from tqdm import tqdm
from datasets import load_dataset
from typing import Any

%pip install torchmetrics
import torch 
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchmetrics.classification import BinaryAccuracy

Note: you may need to restart the kernel to use updated packages.




## Exercise 1: Training a Word2Vec model

### Preparing the Corpus and Stopwords

To train our own **Word2Vec embeddings**, we first need a text corpus.  
Here we’ll use the **Brown Corpus**, a classic collection of English texts available through NLTK.  

- We download the `brown` corpus and the list of common English **stopwords**.  
- Stopwords (like *the*, *and*, *is*) carry little semantic meaning, so we’ll filter them out before training.  

This ensures our Word2Vec model focuses on more informative words when learning embeddings.


In [2]:
nltk.download("brown")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\malik\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\malik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Exploring the Brown Corpus

Before training Word2Vec, let’s take a look at the data we’ll be working with.  

- The **Brown Corpus** is divided into categories (e.g., news, fiction, humor).  
- We print the available categories and the total number of sentences in the corpus.  
- For illustration, we also display the first few sentences so we can see how the raw data looks.  

This step helps us understand the **structure and style of the text** before preprocessing and training embeddings.


In [3]:
# Show some info about the corpus
print("Categories:", brown.categories())
print("Total sentences:", len(brown.sents()))

# Use every sentence in the corpus (all categories)
dataset = brown.sents()

for i in range(3):
    print(dataset[i])

Categories: ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
Total sentences: 57340
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.']
['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'to', 'investigate', 'reports', 'of', 'possi

### Cleaning and Preprocessing the Text

Raw sentences from the Brown Corpus need to be **preprocessed** before they can be used for training Word2Vec.  

Steps applied:  
1. **Join and tokenize** each sentence into words.  
2. **Lowercasing & deaccenting**: `simple_preprocess` lowercases the text and drops punctuation while tokenizing; `deacc=True` additionally strips accent marks.  
3. **Filter short words** (`min_len=2` keeps words with at least 2 characters).  
4. **Remove stopwords** (like *the*, *and*, *is*) to focus on meaningful content.  

The result is a list of cleaned, tokenized sentences.  
We print a few examples to see how the preprocessing transforms the text.


In [14]:
cleaned_sentences = []
for sentence in dataset:
    # Re-join the tokens so simple_preprocess can lowercase and re-tokenize them
    text = " ".join(sentence)
    tokens = simple_preprocess(text, deacc=True, min_len=2)
    # Drop stopwords
    cleaned_sentences.append([token for token in tokens if token not in stop_words])

print(cleaned_sentences[:20])

[['fulton', 'county', 'grand', 'jury', 'said', 'friday', 'investigation', 'atlanta', 'recent', 'primary', 'election', 'produced', 'evidence', 'irregularities', 'took', 'place'], ['jury', 'said', 'term', 'end', 'presentments', 'city', 'executive', 'committee', 'charge', 'election', 'deserves', 'praise', 'thanks', 'city', 'atlanta', 'manner', 'election', 'conducted'], ['september', 'october', 'term', 'jury', 'charged', 'fulton', 'superior', 'court', 'judge', 'durwood', 'pye', 'investigate', 'reports', 'possible', 'irregularities', 'hard', 'fought', 'primary', 'mayor', 'nominate', 'ivan', 'allen', 'jr'], ['relative', 'handful', 'reports', 'received', 'jury', 'said', 'considering', 'widespread', 'interest', 'election', 'number', 'voters', 'size', 'city'], ['jury', 'said', 'find', 'many', 'georgia', 'registration', 'election', 'laws', 'outmoded', 'inadequate', 'often', 'ambiguous'], ['recommended', 'fulton', 'legislators', 'act', 'laws', 'studied', 'revised', 'end', 'modernizing', 'improvin

### Training a Word2Vec Model

Now we train our own **Word2Vec embeddings** on the Brown Corpus.  

Key parameters:  
- `vector_size=50`: each word is represented as a 50-dimensional vector.  
- `window=3`: the model considers up to 3 words to the left and right of each target word as context.  
- `min_count=1`: keep all words (even rare ones).  
- `sg=1`: use the **skip-gram** approach, which works well for smaller datasets.  
- `epochs=20`: make multiple passes over the data to learn stronger embeddings.  

After training, we can:  
- Inspect the learned vector for a specific word (here, `"jury"`).  
- Find its **nearest neighbors** in the embedding space, i.e., words with similar meanings or usage contexts.  

This demonstrates how Word2Vec captures **semantic similarity** directly from the text we trained on.


In [None]:
model = Word2Vec(...
)

word = "jury"
# Inspect a word vector
print(f"Vector for '{word}':")
print(model.wv[word][:10])   # show first 10 dimensions

# Check nearest neighbors
print(f"\nMost similar to '{word}':")
print(model.wv.most_similar(word, topn=5))
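
To make the `window` and `sg=1` parameters more concrete, here is a minimal pure-Python sketch of how skip-gram turns a sentence into (target, context) training pairs. The helper `skipgram_pairs` is hypothetical and written only for illustration; gensim's actual implementation differs in details (for example, it may sample a smaller effective window and it trains with negative sampling by default).

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 3) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Toy sentence made of words from the cleaned corpus (illustration only)
sentence = ["jury", "said", "election", "produced", "evidence"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

With `window=1` each inner word contributes two pairs and each edge word one. A larger window yields more pairs per word and tends to capture broader, more topical context.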

## Exercise 2: Using GloVe Embeddings to train a classifier

### Loading Pre-trained Word Embeddings

Instead of learning word vectors from scratch, we can use **pre-trained embeddings** that capture semantic relationships between words.  
Here we download the **GloVe embeddings** (Global Vectors for Word Representation) trained on a large Wikipedia + Gigaword corpus.  

- Each word is mapped to a **100-dimensional vector**.  
- Words that appear in similar contexts (e.g., *king* and *queen*) will have vectors that are close to each other in this space.  

These embeddings will serve as the foundation for representing text in our binary classification task.


In [None]:
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-100')
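
"Close to each other" is usually measured with cosine similarity. The sketch below computes it with NumPy on made-up 3-dimensional vectors; the numbers in `toy_vectors` are hypothetical and are not taken from GloVe.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical vectors, invented for illustration only
toy_vectors = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "banana": np.array([0.1, 0.0, 0.9]),
}

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(f"king/queen: {sim_royal:.3f}, king/banana: {sim_fruit:.3f}")
```

Once the real vectors are loaded, `glove_vectors.similarity("king", "queen")` and `glove_vectors.most_similar("king", topn=5)` should give the corresponding values for the pre-trained embeddings.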

### Loading and Preparing the Dataset

For this example, we’ll use the **Sentiment140 Twitter dataset**, which contains tweets labeled for **sentiment (positive or negative)**.  
This is a common benchmark dataset for text classification tasks.  

- We load the dataset using the 🤗 **`datasets`** library.  
- To keep the demo lightweight and fast, we only take a **subset** of the data:  
  - 50,000 tweets for training  
  - 10,000 tweets for testing  

Finally, we print the dataset sizes to confirm our selection.


In [None]:
ds = load_dataset("adilbekovich/Sentiment140Twitter")

train = ds["train"].select(range(50_000))
test = ds["test"].select(range(10_000))
print(f"Training set size: {len(train)}")
print(f"Test set size: {len(test)}")

### Converting Tweets into Embeddings

Machine learning models can’t directly process raw text, so we need to convert each tweet into a **numeric vector**.  

We define a function `sentence_embedding` that:  
1. **Tokenizes** the sentence into words (removing punctuation and making everything lowercase).  
2. Looks up the **GloVe vector** for each word.  
3. Computes the **average** of all word vectors to create a single fixed-length representation of the entire sentence.  
   - If a sentence has no known words, we assign a zero vector.  

Then we:  
- Apply this function to every tweet in the train and test sets.  
- Store the result in a new column called `"embeddings"`.  
- Format the dataset so that `"embeddings"` and `"label"` are ready to be used as **PyTorch tensors** for training a classifier.


In [None]:
def sentence_embedding(sentence: str, model: Any) -> np.ndarray:
    # TODO: implement this function
    ...
    
train = ...  # TODO: Map the training set to add embeddings
test = ...  # TODO: Map the test set to add embeddings

train.set_format(type="torch", columns=["embeddings", "label"])
test.set_format(type="torch", columns=["embeddings", "label"])
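
As a hint for the averaging step, here is one possible sketch that uses a plain dictionary as a stand-in for the GloVe model. The function name `average_embedding`, the regex tokenizer, and the 2-dimensional `toy` vectors are all assumptions made for illustration; the exercise itself may use gensim's tokenizer and the real `glove_vectors`.

```python
import re
from typing import Mapping

import numpy as np


def average_embedding(sentence: str, vectors: Mapping[str, np.ndarray], dim: int) -> np.ndarray:
    """Average the vectors of all known words; fall back to zeros if none are known."""
    # Simple stand-in tokenizer: lowercase, keep runs of letters, digits and apostrophes
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)


# Hypothetical 2-dimensional vocabulary (illustration only)
toy = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}

emb = average_embedding("Good movie!!!", toy, dim=2)
empty = average_embedding("???", toy, dim=2)
print(emb, empty)
```

gensim's `KeyedVectors` supports the same `word in vectors` and `vectors[word]` operations and exposes the dimension as `vector_size`, so the same logic should carry over. When mapping the function over the dataset, the name of the text column is something to check against the dataset itself.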

### Building a Simple Neural Network Classifier

Now that we have numeric embeddings for each tweet, we can train a **neural network** to classify sentiment.  

We define a PyTorch model `SentimentClassifier` with the following structure:  

1. **Input layer**: takes in the 100-dimensional GloVe embedding for each tweet.  
2. **Hidden layer**: a fully connected layer with 256 units and a **ReLU activation**, which introduces non-linearity.  
3. **Output layer**: a single neuron that predicts the probability of the tweet being **positive** (values between 0 and 1).  
   - We use a **sigmoid activation** to squash the output.  

This simple feed-forward network can learn basic sentiment patterns from our averaged word embeddings, although averaging discards word order.


In [None]:
class SentimentClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        # TODO: Define the layers and the activation functions
        ...
    
    def forward(self, x):
        # TODO: Implement the forward pass
        ...

model = ... # TODO: Initialize the model with the correct input dimension (hint: extract dimension from glove_vectors)
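
The three-part structure described above could look roughly like the following sketch. It is one possible layout rather than the reference solution: the class name `SketchSentimentClassifier`, the use of `nn.Sequential`, and the final `squeeze` are choices made here for illustration.

```python
import torch
from torch import nn


class SketchSentimentClassifier(nn.Module):
    """Input -> Linear(256) -> ReLU -> Linear(1) -> Sigmoid."""

    def __init__(self, input_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Drop the trailing dimension so the output shape is (batch,), like the labels
        return self.net(x).squeeze(-1)


sketch_model = SketchSentimentClassifier(input_dim=100)
dummy_batch = torch.randn(4, 100)  # 4 fake tweets, 100 dimensions each
probs = sketch_model(dummy_batch)
print(probs.shape)
```

In the notebook, the input dimension can be read from the loaded embeddings (for gensim `KeyedVectors`, the `vector_size` attribute) instead of being hard-coded.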


### Setting Up Training Parameters

Before training the model, we need to define some key components:  

- **Batch size (64)**: the number of samples processed at once before updating the model’s parameters.  
- **Epochs (30)**: how many times the model will see the entire training dataset.  
- **Loss function**: we use **Binary Cross-Entropy Loss (`BCELoss`)**, which is standard for binary classification tasks.  
- **Optimizer**: we use **Stochastic Gradient Descent (SGD)** with a learning rate of 0.01 to update model weights during training.  

We also wrap our dataset into **DataLoaders**:  
- `train_dataloader`: feeds batches of tweets into the model, shuffling to avoid order bias.  
- `test_dataloader`: used for evaluation (no shuffling needed).  

This setup prepares us for the training loop.


In [None]:
BATCH_SIZE = 64
EPOCHS = 30
loss_fn = nn.BCELoss()
acc_fn = BinaryAccuracy()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

train_dataloader = DataLoader(train, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = DataLoader(test, batch_size=BATCH_SIZE)
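
To see what binary cross-entropy measures, the sketch below computes it by hand in plain Python. The function `binary_cross_entropy` is written for illustration only; `nn.BCELoss` computes the same mean quantity but may handle numerical edge cases differently.

```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of -(y * log(p) + (1 - y) * log(1 - p)) over all samples."""
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Same labels, once with confident correct predictions and once with confident wrong ones
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(f"right: {confident_right:.4f}, wrong: {confident_wrong:.4f}")
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong, which is the signal SGD uses to adjust the weights.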

### Training the Classifier

Now we train our sentiment classifier using the training data.  

For each **epoch** (full pass over the training set):  
1. **Batching**: the `DataLoader` gives us a batch of tweet embeddings and labels.  
2. **Forward pass**: the embeddings are passed through the model to get predictions.  
3. **Loss calculation**: compare predictions with the true labels using **binary cross-entropy**.  
4. **Backward pass**: compute gradients of the loss with respect to model parameters.  
5. **Optimizer step**: update model weights using **SGD**.  
6. **Repeat** for all batches in the epoch.  

At the end of each epoch, we print the **average training loss**, which shows how well the model is learning over time.


In [None]:
model.train()
for epoch in range(EPOCHS):
    epoch_loss = 0
    for batch in train_dataloader:

        # Get the inputs and labels from the batch
        inputs = batch['embeddings']
        labels = batch['label']

        # Forward pass
        outputs = ... # TODO: Get model outputs

        # Compute the loss
        loss = ... # TODO: Compute the loss
        epoch_loss += loss.item()  # accumulate a Python float, not a tensor attached to the graph

        # Backward pass and optimization
        # TODO: Zero gradients, perform backward pass, and update weights
        ...

    print(f"Epoch {epoch+1}/{EPOCHS}, Loss: {epoch_loss/len(train_dataloader)}")
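
The six steps above can be sketched end to end on synthetic data. Everything in this block (the random features, the tiny `toy_model`, the learning rate of 0.1) is an assumption chosen so the example runs in a few seconds; it is not the notebook's dataset or configuration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data: the label is 1 when the first feature is positive
features = torch.randn(256, 10)
targets = (features[:, 0] > 0).float()  # BCELoss expects float targets
loader = DataLoader(TensorDataset(features, targets), batch_size=32, shuffle=True)

toy_model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
toy_loss_fn = nn.BCELoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

epoch_losses = []
toy_model.train()
for epoch in range(20):
    running = 0.0
    for inputs, labels in loader:
        outputs = toy_model(inputs).squeeze(-1)  # forward pass, shape (batch,)
        loss = toy_loss_fn(outputs, labels)      # loss calculation
        toy_optimizer.zero_grad()                # clear gradients from the previous step
        loss.backward()                          # backward pass
        toy_optimizer.step()                     # optimizer step
        running += loss.item()
    epoch_losses.append(running / len(loader))

print(f"first epoch: {epoch_losses[0]:.4f}, last epoch: {epoch_losses[-1]:.4f}")
```

Two details tend to matter when adapting this to the tweet data: the labels usually need to be converted to float, and the output shape has to match the label shape before calling the loss.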

### Evaluating the Model on Test Data

After training, we switch the model to **evaluation mode** (`model.eval()`) to test its performance.  

For each batch in the test set:  
1. **Forward pass**: compute predictions for the embeddings.  
2. **Loss calculation**: measure how far predictions are from the true labels.  
3. **Thresholding**: since outputs are probabilities (0–1), we assign  
   - `1` if prediction > 0.5 (positive sentiment)  
   - `0` otherwise (negative sentiment).  
4. **Accuracy**: compare predictions with labels to compute the percentage of correct classifications.  

Finally, we average the results across all batches and print the **test accuracy**, which tells us how well the model generalizes to unseen tweets.


In [None]:
batch_loss = 0
batch_acc = 0

model.eval()
with torch.no_grad():  # gradients are not needed during evaluation
    for batch in test_dataloader:
        # TODO: Load the inputs and the labels from the batch, then run the forward pass, compute loss, predictions and accuracy
        # Get the inputs and labels
        inputs = ...
        labels = ...

        # Run the forward pass
        outputs = ...

        # Compute the loss and the thresholded predictions
        test_loss = ...
        predictions = ...

        # Compute the batch accuracy
        acc = ...

        batch_loss += test_loss.item()
        batch_acc += acc.item()

test_loss = batch_loss / len(test_dataloader)
test_acc = batch_acc / len(test_dataloader)

print(f'Test Acc: {test_acc*100:.2f}%')
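
The thresholding and accuracy steps can be checked by hand on a few made-up values. The probabilities and labels below are hypothetical; the notebook's `acc_fn` (`BinaryAccuracy`) is expected to perform an equivalent comparison.

```python
import torch

# Hypothetical model outputs and ground-truth labels for four tweets
probabilities = torch.tensor([0.92, 0.40, 0.51, 0.08])
true_labels = torch.tensor([1, 0, 0, 0])

# Thresholding: 1 if the probability is above 0.5, else 0
predicted = (probabilities > 0.5).long()

# Accuracy: fraction of predictions that match the labels
accuracy = (predicted == true_labels).float().mean().item()
print(predicted.tolist(), accuracy)
```

Here the third tweet is misclassified because its probability sits just above the threshold, so three of four predictions are correct.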

## Exercise 3: Multiple-Choice Questions

### 3.1. Purpose of Word Embeddings

What is the main purpose of word embeddings in NLP?

A. To convert words into high-dimensional one-hot vectors<br>
B. To map words into continuous vector spaces that capture semantic meaning<br>
C. To remove stopwords from text before processing<br>
D. To reduce the training time of convolutional networks<br>

**Answer:** 

---

### 3.2. One-Hot vs. Embeddings

Compared to one-hot encoding, word embeddings:

A. Have the same dimensionality as the vocabulary size<br>
B. Provide dense, low-dimensional representations that capture similarities<br>
C. Are always manually designed by experts<br>
D. Cannot be trained with neural networks<br>

**Answer:** 

---

### 3.3. Word2Vec Models

The Skip-gram model in Word2Vec is designed to:

A. Predict the context words given a target word<br>
B. Predict the target word given the context words<br>
C. Cluster words into fixed categories<br>
D. Remove rare words from the corpus<br>

**Answer:** 

---

### 3.4. Embedding Matrix Shape

In a neural network with vocabulary size $V$ and embedding dimension $d$, the embedding matrix has shape:

A. $(d \times V)$<br>
B. $(V \times d)$<br>
C. $(V \times V)$<br>
D. $(d \times d)$<br>

**Answer:** 

---

### 3.5. Semantic Relationships

Word embeddings can capture analogies such as:

A. king – man + woman ≈ queen<br>
B. dog – cat + car ≈ airplane<br>
C. apple – red + fast ≈ running<br>
D. chair – table + sky ≈ cloud<br>

**Answer:** 

---

### 3.6. Contextual vs. Static Embeddings

How do contextual embeddings (e.g., BERT) differ from static embeddings (e.g., Word2Vec)?

A. They assign the same vector to a word regardless of context<br>
B. They assign different vectors to a word depending on its context<br>
C. They are always lower-dimensional than static embeddings<br>
D. They do not require pretraining on large corpora<br>

**Answer:** 

---

### 3.7. Sparse vs. Dense Representations

Compared to Bag-of-Words (BoW) vectors, neural word embeddings are:

A. Higher dimensional and sparse<br>
B. Always binary representations<br>
C. Lower dimensional and sparse<br>
D. Lower dimensional and dense<br>

**Answer:** 

---

### 3.8. Semantic Similarity

What is the main advantage of word embeddings over one-hot encodings?

A. They guarantee perfect accuracy in classification tasks<br>
B. They eliminate the need for training neural networks<br>
C. They capture semantic similarity between words in vector space<br>
D. They automatically remove stopwords from text<br>

**Answer:** 

---

### 3.9. CBOW vs. Skip-gram

The Continuous Bag of Words (CBOW) model aims to:

A. Predict the target word given its surrounding context words<br>
B. Assign unique one-hot vectors to words<br>
C. Predict the context words given a target word<br>
D. Cluster words into topics using SVD<br>

**Answer:** 

---

### 3.10. Distributional Semantics

The idea that “you shall know a word by the company it keeps” refers to:

A. Overfitting in NLP models<br>
B. Context-based learning of embeddings<br>
C. Sentence segmentation<br>
D. Stopword removal<br>

**Answer:** 