# Hands-On 2 - Word Embeddings and Downstream Tasks


In [None]:
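!pip install nltk
!pip install gensim
!pip install scikit-learn  # the PyPI package is "scikit-learn"; "sklearn" is only the import name
!pip install datasets  # Hugging Face datasets, used by load_dataset below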

## Setup

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import numpy as np
import re
from collections import defaultdict
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from datasets import load_dataset

# import stopwords
from nltk.corpus import stopwords

%matplotlib inline

## 1. Introduction to Word Embeddings

### Why Neural Networks for NLP?

Neural networks have revolutionized NLP due to their ability to learn complex patterns and relationships in language data. Here's why they are particularly well-suited for NLP tasks:

1. **Handling Sequential Data:** Text is inherently sequential, and neural networks, especially recurrent neural networks (RNNs) and transformers, excel at processing sequential information. They can capture the dependencies and context between words in a sentence or document.
2. **Learning Complex Representations:** Word embeddings, generated using neural network-based techniques like Word2Vec and GloVe, capture semantic relationships between words. These representations are far richer than traditional one-hot encodings, allowing neural networks to better understand the meaning of text.
3. **Generalization:** Neural networks can generalize well to unseen data, making them robust for a wide range of NLP tasks. They learn underlying linguistic patterns that can be applied to new and diverse text inputs.
4. **Adaptability:** Neural network architectures can be adapted and fine-tuned for specific downstream tasks, such as text classification, sentiment analysis, machine translation, and question answering. This flexibility makes them a powerful tool for various NLP applications.
5. **Continuous Improvement:** With the availability of large datasets and advancements in deep learning techniques, neural networks continue to improve their performance on NLP tasks, pushing the boundaries of what's possible in natural language understanding.

By leveraging the power of neural networks, we can develop sophisticated models that can effectively process, analyze, and generate human language. This has led to significant advancements in various NLP applications, including chatbots, virtual assistants, and text summarization tools.

In this hands-on session, we'll explore how neural networks are used to create word embeddings and how these embeddings can be utilized for various downstream tasks.

### Why Word Embeddings?
In NLP, word embedding refers to the process of transforming words into a mathematical format that neural networks can understand. In this hands-on session we will explore the pros and cons of the basic use of neural networks in NLP. This process paved the way for modern LLMs and transformer architectures.
<center>
    <img src="./schema.png" width="1000" height="700"/>
</center>



### One-hot Encoding


In [None]:
# Sample vocabulary  (obtained from tokenization)
vocab = [
    "king",
    "queen",
    "man",
    "woman",
    "apple",
    "orange",
    "banana",
    "grape",
    "lion",
    "tiger",
    "dog",
    "cat",
    "car",
    "bike",
    "train",
    "plane",
]


In [None]:
# One-hot encoding
one_hot_encodings = {
    word: [1 if i == idx else 0 for i in range(len(vocab))]
    for idx, word in enumerate(vocab)
}


In [None]:
# Display one-hot encodings
for word, encoding in one_hot_encodings.items():
    print(f"{word}: {encoding}")


In [None]:
### Exercise: Explore Distances in One-Hot Encoding

#  Calculate the Euclidean distance between "king" and "queen" in the one-hot encoding.
def euclidean_distance(vec1, vec2):
    return sum((x - y) ** 2 for x, y in zip(vec1, vec2)) ** 0.5


In [None]:
king = one_hot_encodings["king"]
queen = one_hot_encodings["queen"]


In [None]:
for word in vocab:
    print(
        f"Euclidean distance between 'king' and '{word}': {euclidean_distance(king, one_hot_encodings[word])}"
    )

# Discuss: Does this distance reflect the relationship between the words?
# No. The distance between "king" and "queen" is sqrt(2) ≈ 1.41, which is the same as the distance between "king" and "apple" or "king" and "lion". This is because the one-hot encoding treats all words as independent and does not capture any relationships between them.
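
Another way to see the limitation discussed above is to check that any two distinct one-hot vectors are orthogonal. The following is a minimal sketch in plain Python. The names `toy_vocab`, `one_hot`, `dot`, and `euclidean` are hypothetical helpers introduced only for this illustration and are not part of the notebook.

```python
import math

# Hypothetical mini-vocabulary, used only for this illustration
toy_vocab = ["king", "queen", "apple"]


def one_hot(word, vocabulary):
    """Return the one-hot vector of `word` as a list of 0/1 values."""
    return [1 if w == word else 0 for w in vocabulary]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


king_oh = one_hot("king", toy_vocab)
queen_oh = one_hot("queen", toy_vocab)
apple_oh = one_hot("apple", toy_vocab)

# Distinct one-hot vectors never share a non-zero position, so the dot product is 0
print(dot(king_oh, queen_oh), dot(king_oh, apple_oh))

# Every pair of distinct words is therefore at the same distance
print(euclidean(king_oh, queen_oh), euclidean(king_oh, apple_oh))
```

Because the dot product is zero for every pair of different words, the cosine similarity is zero as well, no matter how related the words are.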


## 2. N-Grams and Contextual Representations
### Distributed Representations and Co-Occurrence
Word embeddings leverage co-occurrence patterns to create dense, low-dimensional vectors that capture word relationships. This idea, often called Firth's hypothesis or the distributional hypothesis (**"You shall know a word by the company it keeps"**), forms the foundation of distributed word embeddings. Co-occurrences are closely related to n-grams, which are sequences of n words that appear together in a text. The co-occurrence matrix is a square matrix where the rows and columns represent words, and the cell values indicate how often two words appear together in a given context window. By analyzing these co-occurrence patterns, we can learn meaningful representations of words that capture their semantic relationships.

N-grams were popularized by the Google N-gram dataset, which contains n-grams extracted from a large corpus of text. These n-grams have been used in various NLP tasks, such as language modeling, machine translation, and sentiment analysis. By analyzing the co-occurrence patterns of words, we can extract valuable insights about their meanings and relationships. 
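
To make the notion of an n-gram concrete, here is a small sketch that extracts and counts bigrams from a toy sentence. The helper `extract_ngrams` and the sentence in `toy_tokens` are assumptions made for this example. Real pipelines would use a proper tokenizer instead of whitespace splitting.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy sentence, whitespace-tokenized for simplicity
toy_tokens = "the king and the queen and the king".split()

# Count how often each bigram (n = 2) occurs
bigram_counts = Counter(extract_ngrams(toy_tokens, 2))
for bigram, count in bigram_counts.most_common(3):
    print(bigram, count)
```

Changing `n` gives unigrams, trigrams, and so on. A sequence shorter than `n` simply yields an empty list.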

__Resources:__

- [Google N-gram](https://research.google/blog/all-our-n-gram-are-belong-to-you/)

- [The Zipf Mystery](https://www.youtube.com/watch?v=fCn8zs912OE)

In [None]:
# Example corpus
corpus = ["king and queen", "man and woman", "apple and fruit"]


In [None]:
# Build a co-occurrence matrix (simple example, ignoring NLP processing for simplicity)
from collections import defaultdict
import numpy as np

vocab_set = set(word for sentence in corpus for word in sentence.split())
vocab_list = sorted(vocab_set)  # sorted so the row/column order is the same on every run
vocab_dict = {word: idx for idx, word in enumerate(vocab_list)}


In [None]:
co_occurrence_matrix = np.zeros((len(vocab_list), len(vocab_list)))


In [None]:
# Fill co-occurrence matrix
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(i + 1, len(words)):
            co_occurrence_matrix[vocab_dict[word]][vocab_dict[words[j]]] += 1
            co_occurrence_matrix[vocab_dict[words[j]]][vocab_dict[word]] += 1


In [None]:
print("Vocabulary:", vocab_list)
print("Co-occurrence Matrix:\n", co_occurrence_matrix)


In [None]:
### Exercise: Examine the Co-Occurrence Matrix
# Choose two words and inspect their co-occurrence counts. Are words with similar contexts close in the matrix?

word1 = "king"
word2 = "queen"
idx1 = vocab_dict[word1]
idx2 = vocab_dict[word2]
print(
    f"Co-occurrence count between '{word1}' and '{word2}': {co_occurrence_matrix[idx1][idx2]}"
)


In [None]:
idx1 = vocab_dict["king"]
idx2 = vocab_dict["apple"]

print(
    f"Co-occurrence count between 'king' and 'apple': {co_occurrence_matrix[idx1][idx2]}"
)


## 3. Dummy Word Embeddings
### Word Embeddings with Neural Networks
Let's start by creating dummy word embeddings using a simple neural network architecture. We'll use a Word2Vec-like model to learn embeddings for a small vocabulary of words. This model will be trained on a synthetic dataset to demonstrate the process of generating word embeddings using neural networks.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define vocabulary and toy dataset
vocab = ["king", "queen", "man", "woman", "apple", "fruit"]
vocab_size = len(vocab)
word_to_ix = {word: i for i, word in enumerate(vocab)}


In [None]:
# Dummy dataset of (input_word, target_word) pairs
data = [("king", "queen"), ("man", "woman"), ("apple", "fruit")]

In [None]:
# Define the model: a single embedding layer (a trainable lookup table of word vectors)
class WordEmbedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(WordEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, word):
        return self.embedding(word)


In [None]:
# Initialize the model, loss, and optimizer
embedding_dim = 100
model = WordEmbedding(vocab_size, embedding_dim)
loss_function = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)


In [None]:
# Train the model
for epoch in range(3000):
    total_loss = 0
    for context, target in data:
        context_idx = torch.tensor([word_to_ix[context]], dtype=torch.long)
        target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long)

        model.zero_grad()
        context_embedding = model(context_idx)
        target_embedding = model(target_idx)

        loss = loss_function(context_embedding, target_embedding)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss}")


## 4. Evaluating Word Embeddings
The quality of embeddings can be tested by checking if they reflect relationships in the vector space, such as "king - man + woman ≈ queen".
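
The analogy test can be sketched as a nearest-neighbour search: compute `vec(a) - vec(b) + vec(c)` and look for the closest remaining word by cosine similarity. The example below uses hand-made 2D vectors in `toy_vectors` (first coordinate loosely "royalty", second loosely "gender"). These values are invented for illustration and are not learned embeddings. The helper names `cosine` and `solve_analogy` are also hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors (NOT learned): [royalty, gender]
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.2],
}


def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# king - man + woman
print(solve_analogy("king", "man", "woman", toy_vectors))
```

Excluding the three query words from the candidates is a common convention, since the query vector often stays very close to one of them.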


In [None]:
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

In [None]:
# Retrieve embeddings
king_vec = model.embedding(torch.tensor(word_to_ix["king"])).detach().numpy()
queen_vec = model.embedding(torch.tensor(word_to_ix["queen"])).detach().numpy()
# Cosine similarity
print("Cosine similarity (king, queen):", cosine_similarity(king_vec, queen_vec))


In [None]:
man_vec = model.embedding(torch.tensor(word_to_ix["man"])).detach().numpy()
woman_vec = model.embedding(torch.tensor(word_to_ix["woman"])).detach().numpy()
print("Cosine similarity (man, woman):", cosine_similarity(man_vec, woman_vec))


In [None]:
print("Cosine similarity (king, woman):", cosine_similarity(king_vec, woman_vec))

## 5. Pre-trained Word2Vec Embeddings
Pre-trained embeddings like Word2Vec, GloVe, and FastText are available for various languages and domains. These embeddings capture rich semantic relationships and can be used in downstream tasks without the need for training from scratch.

The ground-breaking Word2Vec model, developed by Tomas Mikolov and colleagues at Google, was one of the first neural network-based methods for learning word embeddings. Word2Vec uses a shallow neural network to learn word representations from large text corpora. The model is trained to predict the context words given a target word (Skip-gram model) or predict the target word given the context words (Continuous Bag of Words model). The resulting word embeddings capture semantic relationships between words and are widely used in NLP tasks.

The model was important because it showed that neural networks could learn meaningful representations of words from raw text data. It also helped establish the value of pre-trained representations in NLP tasks: nowadays, most state-of-the-art models build on some form of pre-training.

[Word2Vec Paper](https://arxiv.org/abs/1301.3781)

### 5.1. Exploring Pre-trained Word2Vec Embeddings

In [None]:
# Import necessary libraries
import gensim.downloader as api  # For loading pretrained Word2Vec embeddings
import nltk  # Natural Language Toolkit (NLTK) for NLP tasks
from nltk.corpus import (
    stopwords,
)  # For removing common words that don't add much meaning
from datasets import load_dataset  # Hugging Face library to load datasets
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For plotting
from sklearn.decomposition import PCA  # For dimensionality reduction
from sklearn.ensemble import RandomForestClassifier  # Model for sentiment analysis
from sklearn.metrics import (
    accuracy_score,
    classification_report,
)  # For evaluating model performance
from sklearn.model_selection import train_test_split  # For splitting the data


In [None]:

# Download and set up necessary nltk datasets
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")  # tokenizer data expected by newer NLTK releases

# 1. Load Pretrained Word2Vec Model Using Gensim
# Here, we'll use the pretrained 'word2vec-google-news-300' embeddings from Gensim.
print("Loading pretrained Word2Vec model...")
word2vec_model = api.load("word2vec-google-news-300")
print("Word2Vec model loaded.")


In [None]:
word = "happy"
num_similar = 10

# Find words most similar to the given word based on cosine similarity
similar_words = word2vec_model.most_similar(word, topn=num_similar)
similar_words = [(word, 1.0)] + similar_words  # Include the original word itself

# Collect embeddings of the word and its most similar words
embeddings = np.array([word2vec_model[w[0]] for w in similar_words])
words = [w[0] for w in similar_words]


In [None]:
# Reduce dimensionality to 2D for visualization
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Plot the embeddings in 2D space
plt.figure(figsize=(8, 6))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], marker="o", color="blue")

# Annotate each point with the corresponding word
# Use a separate loop variable so `word` (the query word) is not overwritten
for i, label in enumerate(words):
    plt.annotate(label, xy=(embeddings_2d[i, 0], embeddings_2d[i, 1]), fontsize=12)

plt.title(f"2D Visualization of '{word}' and its Most Similar Words")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()


Let's now explore a famous property of Word2Vec embeddings: the linear relationships between words. Many semantic relations correspond to roughly constant offsets in the vector space, so adding and subtracting word vectors yields a vector close to the embedding of a related word. For example:
$$
\mathrm{embedding(\text{king})} + \mathrm{embedding(\text{woman})} - \mathrm{embedding(\text{man})} \approx \mathrm{embedding(\text{queen})}
$$

In [None]:
word2vec_model["king"]

In [None]:
# Check the linear relationship between word embeddings (king + woman - man = queen)
queen_vec = word2vec_model["queen"]
king_vec = word2vec_model["king"]
man_vec = word2vec_model["man"]
woman_vec = word2vec_model["woman"]


In [None]:
linear_queen = king_vec + woman_vec - man_vec

In [None]:
# Calculate cosine similarity between the expected and calculated vectors
cosine_similarity_queen = cosine_similarity(queen_vec, linear_queen)
print("Cosine similarity (queen, king + woman - man):", cosine_similarity_queen)

### 5.2. Using Pre-trained Word2Vec Embeddings in Downstream Tasks

In [None]:
# 2. Sentiment Analysis Using Word Embeddings and IMDB Dataset from Hugging Face
# We'll use the IMDB movie reviews dataset provided by Hugging Face, which contains 50,000 movie reviews labeled
# as either 'positive' or 'negative'.

# Load the IMDB dataset from Hugging Face
print("Loading IMDB dataset...")
imdb_dataset = load_dataset("imdb")
print("IMDB dataset loaded.")


In [None]:
# Preprocess the dataset: tokenize reviews, remove stop words, and average word vectors to get sentence embeddings
stop_words = set(stopwords.words("english"))


def preprocess_review(review):
    """
    Convert a review to a single vector by averaging the Word2Vec embeddings of the words in the review.

    Parameters:
    review (str): The review text.

    Returns:
    numpy.ndarray: The averaged Word2Vec vector for the review.
    """
    # Tokenize the review into words and filter out stopwords
    words = nltk.word_tokenize(review.lower())
    word_vectors = [
        word2vec_model[word]
        for word in words
        if word in word2vec_model and word not in stop_words
    ]

    # Return the mean of all word vectors in the review to get a single vector representation for the review
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    else:
        # If no valid word vectors are found, return a zero vector (300-dimensional)
        return np.zeros(word2vec_model.vector_size)


In [None]:
# Prepare the dataset by extracting features and labels
def prepare_data(dataset):
    """
    Prepare the dataset by converting reviews to word vector averages and extracting labels.

    Parameters:
    dataset (Dataset): Hugging Face Dataset containing IMDB reviews.

    Returns:
    np.ndarray, np.ndarray: Arrays of review vectors and labels.
    """
    reviews = []
    labels = []

    for item in dataset:
        # Preprocess each review and add to the list
        review_vector = preprocess_review(item["text"])
        reviews.append(review_vector)
        labels.append(1 if item["label"] == 1 else 0)  # 1 for positive, 0 for negative

    return np.array(reviews), np.array(labels)


In [None]:
# Split dataset into train and test sets and prepare data
train_data = imdb_dataset["train"]
test_data = imdb_dataset["test"]

X_train, y_train = prepare_data(train_data)
X_test, y_test = prepare_data(test_data)


In [None]:
# Train a classifier: Random Forest Classifier
print("\nTraining sentiment analysis model...")
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X_train, y_train)
print("Model trained.")


In [None]:
# Evaluate the model on the test set
print("\nEvaluating model performance...")
y_pred = classifier.predict(X_test)

# Calculate accuracy and print a classification report
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred, target_names=["negative", "positive"]))


## 6. Implement your own Word2Vec model

**!! Disable Copilot for this section !!**

Now it's time to implement your own Word2Vec model. The needed steps are:
1. Data Preprocessing: First, we need to preprocess the text data by tokenizing the sentences and creating a vocabulary. 
2. Data Generation: Next, we generate training data for the Word2Vec model using the skip-gram approach.
3. Model Architecture: We define a neural network architecture to learn word embeddings from the training data.
4. Training: We train the Word2Vec model using the training data and optimize it using backpropagation.

In [None]:
# First let's load the data
data = load_dataset("Daniele/dante-corpus", split="train[:100]")
print(data)

### Step 1: Data Preprocessing
Let's start by preprocessing the text input to create a vocabulary. You have to implement the following steps:
- Tokenize the sentences into words.
- Populate a vocabulary with unique words from the text data.
- Assign an index to each word in the vocabulary.
- Store all the tokenized sentences as a list of lists of word indices.

In [None]:
word_to_ix = {}  # Dictionary to convert words to indices for example {'king': 0, 'queen': 1, ...}
ix_to_word = {}  # Dictionary to convert indices to words for example {0: 'king', 1: 'queen', ...}
tokenized_dataset = []  # List to store the tokenized sentences, each as a list of word indices


# Your code here

print("Vocabulary size:", len(word_to_ix))

Now we have to generate the dataset for training. The dataset should be a list of tuples, where each tuple pairs a target word with one of its context words. The context words are the words that appear within a window of size 2 on either side of the target word. So if the sentence is __"the quick brown fox jumps over the lazy dog"__ and the target word is "fox", the context words are ["quick", "brown", "jumps", "over"]. Other examples are:

<center>
    <img src="./context_window.png"/>
</center>

In [None]:
context_window = 2

train_dataset = []
# your code here


# Split the dataset into training and validation
train_dataset, val_dataset = train_test_split(train_dataset, test_size=0.1)


Now implement the function to map any word to one-hot encoding:

In [None]:
# one-hot encode the words
# You can choose to one-hot encode words.
vocab_size = len(word_to_ix)


def one_hot_encoding(word: str) -> torch.Tensor:
    pass  # your code here


def one_hot_decoding(one_hot: torch.Tensor) -> str:
    pass  # your code here


It's now time to implement the model. The model is a very simple neural network with two linear layers. The input is the one-hot encoding of the target word and the output is the one-hot encoding of the context word:
<center>
    <img src="./word2vec.png"/>
</center>

In [None]:
import torch
import torch.nn as nn


class SkipGramModel(nn.Module):
    pass
    # your code here


Now let's define the loss function. We can use:
1. Full softmax
2. CrossEntropyLoss
3. Noise Contrastive Estimation (NCE) Loss
4. Negative Sampling (in the original Word2Vec paper)

For now, we will just use the CrossEntropyLoss. If you want to try the other loss functions, you can check this [blog](https://lilianweng.github.io/posts/2017-10-15-word-embedding/#loss-functions).
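
To build some intuition for what the cross-entropy loss measures, the sketch below computes it by hand in plain Python: the raw scores (logits) are turned into probabilities with a softmax, and the loss is the negative log-probability of the correct class. PyTorch's `nn.CrossEntropyLoss` likewise expects raw logits and applies the log-softmax internally. The helper names and the values in `toy_logits` are assumptions made for this illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])


# Hypothetical scores over a 3-word vocabulary
toy_logits = [2.0, 0.5, -1.0]

print(softmax(toy_logits))
# Low loss when the correct word has the highest score ...
print(cross_entropy(toy_logits, 0))
# ... and a higher loss when it has the lowest score
print(cross_entropy(toy_logits, 2))
```

With a vocabulary of size V and uniform scores, the loss equals log(V), which is a useful sanity check for the first training steps of an untrained model.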

In [None]:
loss_function = nn.CrossEntropyLoss()

In [None]:
# define the model and the optimizer
# (the model must be created first, so the optimizer receives its parameters)
model = SkipGramModel(vocab_size, 30)

optimizer = optim.SGD(model.parameters(), lr=0.01)

Now implement the training loop. The training loop should:
1. Get the target and context words from the dataset
2. Get the one-hot encoding of the target word
3. Get the one-hot encoding of the context word
4. Feed the target word to the model
5. Compute the loss
6. Optimize the model using backpropagation

In [None]:
from tqdm import tqdm

# train the model
epochs = 300
losses = []
for epoch in tqdm(range(epochs)):
    pass  # your code here

In [None]:
# Plot the loss
import matplotlib.pyplot as plt

plt.plot(losses)