# Notebook 06: Word Embeddings - The Building Blocks of NLP

**Week 3-4: Deep Learning & NLP Foundations**  
**Gen AI Masters Program**

---

## 🎯 Objectives

Welcome to the world of word embeddings! Before a neural network can understand text, we need to convert words into numbers. But how we do this is critically important. This notebook explores the evolution of representing text, from simple but flawed methods to the powerful vector representations that underpin modern NLP.

By the end of this session, you will understand:

1.  **The "Why":** The limitations of basic text representations like one-hot encoding.
2.  **The "How":** The theory behind distributional semantics: the idea that "a word is characterized by the company it keeps."
3.  **The Classics:** How to build foundational embedding models like **Word2Vec** from scratch.
4.  **The Powerhouses:** How to leverage pre-trained embeddings like **GloVe** and contextual embeddings from **Transformers (BERT)**.
5.  **The Application:** How to visualize and interpret these embeddings to reveal semantic relationships in text data.

**Estimated Time:** 3 hours

---

## Why are Embeddings So Important?

At their core, **embeddings are numerical representations of words (or sentences) in a low-dimensional vector space.** A good embedding captures the semantic meaning of a word, placing it in the vector space such that similar words are close to each other.

For example, the vectors for "cat" and "kitten" should be very close, while the vectors for "cat" and "car" should be far apart. This allows neural networks to learn patterns and relationships based on meaning, not just arbitrary token IDs. They are the fundamental input layer for nearly all deep learning models that work with text.

---

### 📜 Agenda

1.  **Part 1: The Problem with Numbers - One-Hot Encoding**
    *   Representing words as sparse, high-dimensional vectors.
    *   Demonstrating their key limitation: no notion of similarity.
2.  **Part 2: A Better Way - Distributional Semantics & Co-occurrence Matrices**
    *   Building a co-occurrence matrix from a corpus.
    *   Using dimensionality reduction (PCA) to create dense embeddings.
3.  **Part 3: Learning Embeddings - Word2Vec (Skip-Gram) from Scratch**
    *   Implementing the Skip-Gram with Negative Sampling algorithm in PyTorch.
    *   Training our own embeddings on a custom corpus.
    *   Visualizing the learned vector space with t-SNE.
4.  **Part 4: Using Pre-trained Embeddings - GloVe**
    *   Loading and using powerful, pre-trained word vectors from `torchtext`.
5.  **Part 5: The State of the Art - Contextual Embeddings with Transformers**
    *   Understanding the difference between static (Word2Vec, GloVe) and contextual embeddings (BERT).
    *   Using a pre-trained Transformer to generate sentence embeddings that understand context.

---

In [None]:
# --- Basic Setup and Imports ---
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

import math
from collections import Counter
import re # For tokenization

# --- Plotting and Device Configuration ---
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

# Set a seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Set the device (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'✅ Using device: {device}')
print(f'PyTorch version: {torch.__version__}')

## 🧠 Part 1: One-Hot Encoding - A Flawed Start

The most basic way to convert a word into a vector is **one-hot encoding**.

Imagine we have a vocabulary of size `V`. For any given word, we create a vector of length `V` that is all zeros, except for a single `1` at the index corresponding to that word.

**Example Vocabulary:** `['temperature', 'vibration', 'pressure']` (Size V=3)
*   `temperature` -> `[1, 0, 0]`
*   `vibration`   -> `[0, 1, 0]`
*   `pressure`    -> `[0, 0, 1]`

**Key Characteristics:**
*   **Sparse:** The vectors are mostly zeros.
*   **High-Dimensional:** The vector length equals the vocabulary size, which can be huge (e.g., >50,000).
*   **Orthogonal:** The dot product of any two different one-hot vectors is always 0. This is their biggest flaw: it implies that there is **no shared similarity** between any two words. The model has no way of knowing that "vibration" is more similar to "pressure" than it is to "apple".
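
As a quick check of the orthogonality point, the sketch below computes the dot product for *every* pair of words at once. It assumes the three-word example vocabulary above and uses the rows of an identity matrix as the one-hot vectors; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical three-word vocabulary, matching the example above.
words = ['temperature', 'vibration', 'pressure']

# Row i of the identity matrix is the one-hot vector for words[i].
one_hot = np.eye(len(words))

# Entry (i, j) is the dot product between the vectors for words[i] and words[j].
pairwise_dots = one_hot @ one_hot.T

print(pairwise_dots)
```

Every off-diagonal entry is 0, so each word is "similar" only to itself, no matter which words are in the vocabulary.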

Let's see this in action.

In [None]:
# Define a small, relevant vocabulary
vocabulary = ['temperature', 'vibration', 'pressure', 'coolant', 'bearing', 'leakage']
vocab_size = len(vocabulary)

# Create a mapping from word to its index
word_to_idx = {word: i for i, word in enumerate(vocabulary)}
print(f"Vocabulary mapping: {word_to_idx}\n")

def one_hot_encode(word, word_to_idx):
    """Creates a one-hot vector for a given word."""
    # Create a vector of zeros with length equal to the vocabulary size
    one_hot_vector = np.zeros(len(word_to_idx))
    # Get the index of the word
    idx = word_to_idx[word]
    # Set the element at that index to 1
    one_hot_vector[idx] = 1
    return one_hot_vector

# --- Demonstrate the flaw ---
# Encode two words
hot_temp = one_hot_encode('temperature', word_to_idx)
hot_vibration = one_hot_encode('vibration', word_to_idx)

print(f"One-hot vector for 'temperature': {hot_temp}")
print(f"One-hot vector for 'vibration':   {hot_vibration}")

# Calculate the dot product. For one-hot vectors, this is a proxy for similarity.
# A dot product of 0 means the vectors are orthogonal (completely unrelated).
similarity_score = np.dot(hot_temp, hot_vibration)
print(f"\nDot product (similarity) between 'temperature' and 'vibration': {similarity_score}")
print("❌ This is meaningless! The model can't learn that these concepts might be related.")

## 🧠 Part 2: Distributional Semantics & Co-occurrence Matrices

To overcome the limitations of one-hot encoding, we turn to the **Distributional Hypothesis**: "a word is characterized by the company it keeps."

This means we can infer a word's meaning by looking at the words that frequently appear near it. A simple way to capture this is with a **co-occurrence matrix**.

**How it works:**
1.  **Define a Context Window:** Choose a window size (e.g., 2 words to the left and 2 to the right).
2.  **Slide and Count:** Slide this window across your entire text corpus. For each word in the center of the window, count which other words appear inside its window.
3.  **Build the Matrix:** Create a square matrix of size `V x V` (where V is your vocabulary size). The entry `(word1, word2)` in the matrix stores the number of times `word2` appeared in the context window of `word1`.

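The rows of this matrix can be considered our first attempt at meaningful word vectors! They are no longer orthogonal; words that appear in similar contexts will have similar-looking rows. They are still sparse and as long as the vocabulary, which is the problem we tackle with dimensionality reduction later in this part.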
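
To make the "similar contexts, similar rows" idea concrete, here is a minimal sketch on a made-up two-sentence corpus (not the maintenance reports used in the next cell). The corpus, the window size of 1, and the helper `cosine` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy corpus: 'cat' and 'dog' never co-occur, but share contexts.
docs = [
    ['the', 'cat', 'drinks', 'milk'],
    ['the', 'dog', 'drinks', 'water'],
]
toy_vocab = sorted({w for doc in docs for w in doc})
toy_idx = {w: i for i, w in enumerate(toy_vocab)}

# Count neighbours within a window of 1 word on each side.
window = 1
counts = np.zeros((len(toy_vocab), len(toy_vocab)))
for doc in docs:
    for i, target in enumerate(doc):
        for j in range(max(i - window, 0), min(i + window + 1, len(doc))):
            if j != i:
                counts[toy_idx[target], toy_idx[doc[j]]] += 1

def cosine(a, b):
    """Cosine similarity between two non-zero count vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, dog, milk = (counts[toy_idx[w]] for w in ('cat', 'dog', 'milk'))
print(f"cat vs dog:  {cosine(cat, dog):.2f}")
print(f"cat vs milk: {cosine(cat, milk):.2f}")
```

Although "cat" and "dog" never appear in the same sentence, their rows match because both sit between "the" and "drinks". One-hot vectors would have scored this pair as 0.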

Let's build one from a small corpus of synthetic maintenance reports.

In [None]:
# A small corpus of text data
corpus = [
    'High temperature spike detected in the main furnace chamber.',
    'Vibration analysis shows an increase near the primary bearing housing.',
    'A sudden pressure drop indicates a coolant circulation issue.',
    'Coolant leakage was observed directly below the hydraulic press.',
    'The main bearing is overheating, which threatens a production halt.',
    'A sensor outage is causing a data gap in the plant historian.',
    'The hydraulic pump is showing signs of cavitation, producing audible noise.',
    'The lubrication schedule was missed for the gearbox assembly.',
    'Conveyor belt torque variation is impacting the line speed.',
    'An unexpected voltage surge has tripped the main safety relay.'
]

def tokenize_corpus(corpus):
    """A simple tokenizer that converts to lowercase and removes punctuation."""
    tokenized_docs = []
    for doc in corpus:
        # Remove punctuation and split by space
        tokens = re.sub(r'[^\w\s]', '', doc.lower()).split()
        tokenized_docs.append(tokens)
    return tokenized_docs

# --- Build the Vocabulary and Co-occurrence Matrix ---
tokenized_corpus = tokenize_corpus(corpus)
# Create a vocabulary of all unique words
vocab = sorted(list(set(word for doc in tokenized_corpus for word in doc)))
vocab_size = len(vocab)
word_to_idx = {word: i for i, word in enumerate(vocab)}

# Initialize a V x V matrix of zeros
co_occurrence_matrix = np.zeros((vocab_size, vocab_size), dtype=np.float32)

# Define the context window size
window_size = 2

# Populate the matrix
for doc in tokenized_corpus:
    for i, target_word in enumerate(doc):
        target_idx = word_to_idx[target_word]
        # Define the start and end of the context window
        start = max(i - window_size, 0)
        end = min(i + window_size + 1, len(doc))
        
        # Iterate through the context window
        for j in range(start, end):
            if i == j:
                continue # Don't count the target word itself
            context_word = doc[j]
            context_idx = word_to_idx[context_word]
            # Increment the count for the co-occurring pair
            co_occurrence_matrix[target_idx, context_idx] += 1

# --- Visualize the Matrix ---
# For better visualization, let's wrap it in a pandas DataFrame
co_occurrence_df = pd.DataFrame(co_occurrence_matrix, index=vocab, columns=vocab)

plt.figure(figsize=(14, 12))
sns.heatmap(co_occurrence_df, cmap='viridis', annot=False)
plt.title('Word Co-occurrence Matrix (Window Size = 2)', fontsize=16, fontweight='bold')
plt.xlabel('Context Words')
plt.ylabel('Target Words')
plt.tight_layout()
plt.show()

# Example: Find co-occurrences for the word 'bearing'
print("Co-occurrence vector for 'bearing':")
print(co_occurrence_df.loc['bearing'][co_occurrence_df.loc['bearing'] > 0])

### From Co-occurrence to Embeddings: Dimensionality Reduction with PCA

The rows of our co-occurrence matrix capture context, but they are still very high-dimensional (equal to the vocabulary size) and mostly zeros. This makes them computationally expensive and prone to noise.

A common technique to make these vectors more manageable and robust is to reduce their dimensionality using methods like **Principal Component Analysis (PCA)**. PCA finds the directions of maximum variance in the data and projects the data onto a new, lower-dimensional subspace.

These lower-dimensional vectors are our first real "embeddings"! Let's project our `V x V` matrix down to `V x 2` so we can visualize the word relationships in a 2D plot. Words that appeared in similar contexts should now cluster together.
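
For intuition about what PCA does under the hood, the sketch below reproduces the projection with plain NumPy: centre the data, take an SVD, and keep the first two directions. The random matrix `X` is a hypothetical stand-in for a small co-occurrence matrix, and the result should agree with a library PCA only up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a co-occurrence matrix: 6 "words", 6 count features.
X = rng.integers(0, 5, size=(6, 6)).astype(float)

# 1. Centre each column so the components describe variance, not the mean.
X_centered = X - X.mean(axis=0)

# 2. SVD of the centred data; rows of Vt are the principal directions,
#    ordered by decreasing singular value (i.e. decreasing variance).
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Project onto the first two directions to get 2-D coordinates.
coords_2d = X_centered @ Vt[:2].T

# Fraction of total variance captured by each component.
explained = (S ** 2) / np.sum(S ** 2)
print(coords_2d.shape, explained[:2])
```

The `explained` values give a rough sense of how much structure survives the squeeze down to two dimensions; if they are small, the 2D plot is hiding most of the variation.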

In [None]:
# Initialize PCA to reduce dimensions to 2
pca = PCA(n_components=2, random_state=42)

# Fit PCA on the co-occurrence matrix and transform it
# The rows of the matrix are our high-dimensional vectors
embeddings_2d = pca.fit_transform(co_occurrence_matrix)

# --- Visualize the 2D Embeddings ---
plt.figure(figsize=(12, 10))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], color='skyblue', s=50)

# Annotate each point with its corresponding word
for i, word in enumerate(vocab):
    plt.text(
        embeddings_2d[i, 0] + 0.03, # Add a small offset for readability
        embeddings_2d[i, 1] + 0.03,
        word,
        fontsize=10
    )

plt.title('2D Word Embeddings via PCA on Co-occurrence Matrix', fontsize=16, fontweight='bold')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.axhline(0, color='grey', linewidth=0.5)
plt.axvline(0, color='grey', linewidth=0.5)
plt.tight_layout()
plt.show()

# --- Interpretation ---
# Look for clusters! For example, 'bearing' and 'housing' might be close,
# as might 'coolant', 'leakage', and 'pressure'. This shows that even
# this simple method can start to capture semantic relationships.

## 🧠 Part 3: Learning Embeddings - Word2Vec (Skip-Gram) from Scratch

While co-occurrence matrices are intuitive, they have drawbacks:
*   They can become very large and sparse for big vocabularies.
*   The resulting embeddings are not always optimal for downstream tasks.

A more powerful approach is to **learn** the embeddings directly by training a neural network. **Word2Vec** is a family of models that does exactly this. We will implement the **Skip-Gram** variant.

### The Skip-Gram Model

The core idea of Skip-Gram is the inverse of our co-occurrence counting: instead of counting which words appear next to a target word, we train a model to **predict the context words given a target word**.

**The Process:**
1.  **Generate Training Data:** We slide a window over our corpus and create `(target, context)` word pairs. For a window size of 2, the sentence "pressure drop indicates a coolant issue" would yield pairs like `(indicates, pressure)`, `(indicates, drop)`, `(indicates, a)`, `(indicates, coolant)`.
2.  **Define the Model:** We create a simple neural network with one hidden layer. The weights of this hidden layer will become our word embeddings!
    *   The input is a one-hot encoded vector of the target word.
    *   The output is a probability distribution over the entire vocabulary, representing the likelihood of each word being a context word.
3.  **The "Negative Sampling" Trick:** Training a model to predict over the entire vocabulary (which could be 50,000+ words) is extremely slow. **Negative Sampling** provides a clever and efficient approximation. Instead of updating weights for all words, for each `(target, context)` pair (a "positive" sample), we randomly draw a few "negative" samples: words picked from the vocabulary that are very unlikely to be in the target's context. The model's task then becomes much simpler: predict `1` for the true context word and `0` for the random negative words.
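
Step 2 says the hidden-layer weights *become* the embeddings. The sketch below shows why, under assumed toy sizes (`V`, `D`) and random weights: multiplying a one-hot input by the weight matrix simply picks out one row of it.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 3                       # hypothetical vocabulary size and embedding size
W = rng.normal(size=(V, D))       # hidden-layer weights: one row per word

word_index = 2
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

# Multiplying by a one-hot vector selects exactly one row of W ...
via_matmul = one_hot @ W
# ... so the same result is available by indexing, with no multiplication.
via_lookup = W[word_index]

print(np.allclose(via_matmul, via_lookup))
```

This is why the model cell later in this notebook uses `nn.Embedding`, a lookup table indexed by word ID, rather than building one-hot inputs explicitly.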

Let's build it step-by-step. First, we'll create a PyTorch `Dataset` to generate our `(target, context)` pairs.

In [None]:
class SkipGramDataset(Dataset):
    """
    A PyTorch Dataset to generate (target, context) pairs for Skip-Gram training.
    """
    def __init__(self, tokenized_docs, word_to_idx, window_size=2):
        """
        Args:
            tokenized_docs (list of list of str): The corpus, already tokenized.
            word_to_idx (dict): Mapping from words to their integer indices.
            window_size (int): The size of the context window (words to the left and right).
        """
        self.pairs = []
        # Iterate through each document in the corpus
        for doc in tokenized_docs:
            # Convert the document's words to their corresponding indices
            indexed_doc = [word_to_idx[word] for word in doc if word in word_to_idx]
            
            # Iterate through each word in the indexed document to treat it as a target
            for i, target_word_idx in enumerate(indexed_doc):
                # Define the start and end of the context window
                start = max(i - window_size, 0)
                end = min(i + window_size + 1, len(indexed_doc))
                
                # Iterate through the context window
                for j in range(start, end):
                    if i == j:
                        continue # The target word is not its own context
                    context_word_idx = indexed_doc[j]
                    # Add the (target, context) pair to our list
                    self.pairs.append((target_word_idx, context_word_idx))

    def __len__(self):
        """Returns the total number of (target, context) pairs."""
        return len(self.pairs)

    def __getitem__(self, idx):
        """
        Retrieves a single (target, context) pair by its index.
        The DataLoader will use this to create batches.
        """
        target, context = self.pairs[idx]
        return torch.tensor(target, dtype=torch.long), torch.tensor(context, dtype=torch.long)

# --- Create the Dataset and DataLoader ---
# We'll use the same tokenized corpus and vocabulary from Part 2
skipgram_dataset = SkipGramDataset(tokenized_corpus, word_to_idx, window_size=2)
# The DataLoader will handle batching and shuffling for us during training
skipgram_loader = DataLoader(skipgram_dataset, batch_size=128, shuffle=True)

print(f"✅ Created Skip-Gram dataset with {len(skipgram_dataset)} (target, context) pairs.")

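# Let's inspect a few pairs
idx_to_word = {i: word for word, i in word_to_idx.items()}  # Reverse mapping: index -> word
print("\n--- Sample Pairs ---")
for i in range(5):
    target_idx, context_idx = skipgram_dataset[i]
    # .item() converts the 0-d tensors returned by the dataset into plain ints
    target_word = idx_to_word[target_idx.item()]
    context_word = idx_to_word[context_idx.item()]
    print(f"Pair {i+1}: (target='{target_word}', context='{context_word}')")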

### The Skip-Gram Model Architecture

Our model is surprisingly simple. It consists of two embedding layers:

1.  **Target Embeddings:** This is an embedding matrix where each row corresponds to the vector representation of a word when it's the *target* word. This is the matrix we'll ultimately use as our word embeddings.
2.  **Context Embeddings:** A second embedding matrix where each row corresponds to a word's vector when it's a *context* word.

**Forward Pass Logic (with Negative Sampling):**

For a given `(target, positive_context)` pair:
1.  Look up the embedding for the `target` word from the `target_embeddings` matrix.
2.  Look up the embedding for the `positive_context` word from the `context_embeddings` matrix.
3.  Compute the dot product of these two vectors. A high dot product means they are similar. We want to maximize this, so we pass it through a sigmoid function and aim for a target of `1`.
4.  Randomly sample a few `negative_context` words from the vocabulary.
5.  Look up their embeddings from the `context_embeddings` matrix.
6.  Compute the dot product between the `target` embedding and each of the `negative_context` embeddings. We want to minimize these, so we pass them through a sigmoid and aim for a target of `0`.
7.  The total loss is the sum of the losses from the positive and negative samples.

This process efficiently trains the model to push the vectors of true context pairs closer together while pushing the vectors of random pairs apart.
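
The seven steps above can be traced for a single `(target, positive_context)` pair in a few lines of NumPy. This is a sketch with random stand-in vectors and assumed sizes, not the training code itself; it also checks that the loss is the same thing as binary cross-entropy with labels `1` and `0`.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

rng = np.random.default_rng(0)
embed_dim, num_negatives = 4, 3                    # hypothetical sizes

target_vec = rng.normal(size=embed_dim)            # step 1
positive_vec = rng.normal(size=embed_dim)          # step 2
negative_vecs = rng.normal(size=(num_negatives, embed_dim))  # steps 4-5

positive_score = target_vec @ positive_vec         # step 3
negative_scores = negative_vecs @ target_vec       # step 6

# Step 7: reward a high positive score and low negative scores.
loss = -(log_sigmoid(positive_score) + log_sigmoid(-negative_scores).sum())

# The same number, written as binary cross-entropy with labels 1 and 0.
scores = np.concatenate([[positive_score], negative_scores])
labels = np.array([1.0] + [0.0] * num_negatives)
probs = 1.0 / (1.0 + np.exp(-scores))
bce = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).sum()

print(f"negative-sampling loss: {loss:.4f}, BCE form: {bce:.4f}")
```

Note that the training cell in this notebook takes a mean over the negative samples rather than a sum, which changes the relative weighting of the two terms but not the direction in which the vectors are pushed.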

In [None]:
class SkipGramModel(nn.Module):
    """
    A PyTorch implementation of the Skip-Gram model with Negative Sampling.
    """
    def __init__(self, vocab_size, embed_dim):
        """
        Args:
            vocab_size (int): The total number of unique words in the vocabulary.
            embed_dim (int): The desired dimensionality of the word embeddings.
        """
        super().__init__()
        # The embedding layer for target words (the one we'll keep)
        self.target_embeddings = nn.Embedding(vocab_size, embed_dim)
        # The embedding layer for context words
        self.context_embeddings = nn.Embedding(vocab_size, embed_dim)

    def forward(self, target, context):
        """
        This forward pass is designed for calculating the loss.
        It computes the dot product between target and context embeddings.
        """
        # Get the embedding vectors for the batch of target and context words
        target_embeds = self.target_embeddings(target)   # Shape: (batch_size, embed_dim)
        context_embeds = self.context_embeddings(context) # Shape: (batch_size, embed_dim)
        
        # Compute the dot product between target and context vectors
        # We use element-wise multiplication and sum along the dimension of the embeddings
        scores = torch.sum(target_embeds * context_embeds, dim=1)
        return scores

def train_skipgram(model, loader, optimizer, epochs=100, num_negative_samples=5):
    """
    The main training loop for the Skip-Gram model using negative sampling.
    """
    # For negative sampling, we need a "noise distribution" to sample from.
    # Words that appear more frequently should be sampled more often.
    # The formula uses a power of 0.75 to smooth the distribution slightly.
    word_counts = Counter(word for doc in tokenized_corpus for word in doc)
    freqs = [word_counts[word] for word in vocab]
    freqs_tensor = torch.tensor(freqs, dtype=torch.float32)
    noise_dist = freqs_tensor.pow(0.75)
    noise_dist /= torch.sum(noise_dist)

    loss_history = []
    model.train() # Set the model to training mode
    
    print("🔄 Starting Word2Vec (Skip-Gram) training...")
    for epoch in range(1, epochs + 1):
        total_loss = 0
        for target_words, context_words in loader:
            # Move data to the selected device
            target_words = target_words.to(device)
            context_words = context_words.to(device)
            
            optimizer.zero_grad()

            # --- 1. Positive Loss ---
            # Get the scores for the true (positive) context pairs
            positive_scores = model(target_words, context_words)
            # We want the model to predict 1 for these, so we use log-sigmoid
            positive_loss = F.logsigmoid(positive_scores).mean()

            # --- 2. Negative Loss ---
            # Sample random negative words from the noise distribution
            batch_size = target_words.shape[0]
            negative_words = torch.multinomial(
                noise_dist, batch_size * num_negative_samples, replacement=True
            ).to(device)
            
            # Get the scores for the negative samples
            # We need to reshape the target words to match the number of negative samples
            target_for_neg = target_words.repeat_interleave(num_negative_samples)
            negative_scores = model(target_for_neg, negative_words)
            # We want the model to predict 0 for these, so we use log-sigmoid on the negative scores
            negative_loss = F.logsigmoid(-negative_scores).mean()

            # --- 3. Total Loss ---
            # The goal is to maximize positive scores and minimize negative scores.
            # Maximizing log(sigmoid(x)) + log(sigmoid(-y)) is equivalent to minimizing -(...).
            loss = -(positive_loss + negative_loss)
            
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(loader)
        loss_history.append(avg_loss)
        if epoch % 20 == 0 or epoch == 1:
            print(f"Epoch {epoch:03d}/{epochs} | Average Loss: {avg_loss:.4f}")
            
    print("✅ Training complete!")
    return loss_history

# --- Hyperparameters and Training ---
EMBED_DIM = 50 # The dimensionality of our learned embeddings
LEARNING_RATE = 0.01
EPOCHS = 200

# Initialize the model and optimizer
skipgram_model = SkipGramModel(vocab_size, EMBED_DIM).to(device)
optimizer = optim.Adam(skipgram_model.parameters(), lr=LEARNING_RATE)

# Run the training
loss_history = train_skipgram(skipgram_model, skipgram_loader, optimizer, epochs=EPOCHS)

In [None]:
# Plot the training loss over epochs
plt.figure(figsize=(10, 5))
plt.plot(loss_history, color='deepskyblue', linewidth=2)
plt.title('Word2Vec (Skip-Gram) Training Loss', fontsize=16, fontweight='bold')
plt.xlabel('Epoch')
plt.ylabel('Average Negative Sampling Loss')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

### Visualizing the Learned Embeddings with t-SNE

Now for the exciting part! After training, the `target_embeddings` layer in our model contains the learned word vectors. Just like we did with PCA, we can use a dimensionality reduction technique to visualize them.

**t-SNE (t-Distributed Stochastic Neighbor Embedding)** is a more sophisticated technique than PCA, particularly well-suited for visualizing high-dimensional datasets in low dimensions (like 2D or 3D). It works by preserving local similarities, so points that are close in the high-dimensional space are mapped to points that are close in the low-dimensional space.

Let's extract the weights from our model and plot them. We should see meaningful clusters emerge.

In [None]:
# Extract the learned embedding vectors from the model
# We use .detach() to remove them from the computation graph
learned_embeddings = skipgram_model.target_embeddings.weight.detach().cpu().numpy()

# Use t-SNE to reduce the 50-dimensional embeddings to 2 dimensions
# Perplexity is a key hyperparameter; it's roughly the number of close neighbors each point has.
tsne = TSNE(n_components=2, random_state=42, perplexity=10)
embeddings_2d_tsne = tsne.fit_transform(learned_embeddings)

# --- Visualize the 2D t-SNE Embeddings ---
plt.figure(figsize=(12, 10))
plt.scatter(embeddings_2d_tsne[:, 0], embeddings_2d_tsne[:, 1], color='coral', s=50)

# Annotate each point with its corresponding word
for i, word in enumerate(vocab):
    plt.text(
        embeddings_2d_tsne[i, 0] + 0.05,
        embeddings_2d_tsne[i, 1] + 0.05,
        word,
        fontsize=10
    )

plt.title('2D t-SNE Visualization of Learned Word2Vec Embeddings', fontsize=16, fontweight='bold')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

# --- Interpretation ---
# Compare this plot to the PCA plot and look for groups of related words.
# 'hydraulic', 'pump', 'press' might form a group.
# 'bearing', 'housing', 'overheating' might form another.
# With only ten short sentences the layout will be noisy, and t-SNE distances between
# far-apart groups are not meaningful, so treat any pattern as suggestive rather than conclusive.

### Querying for Semantic Similarity

The ultimate test of our embeddings is to see if they can find words with similar meanings. We can do this by calculating the **cosine similarity** between the vector of a query word and all other vectors in the vocabulary.

**Cosine Similarity** measures the cosine of the angle between two vectors.
*   A value of `1` means the vectors point in the exact same direction (maximum similarity).
*   A value of `0` means they are orthogonal.
*   A value of `-1` means they point in opposite directions.

It's the standard way to measure similarity between embedding vectors.
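
The three reference values above are easy to verify by hand. The sketch below uses a small helper, `cosine_similarity` (a name chosen here for illustration), on 2D vectors where the angle is obvious.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

base = np.array([1.0, 0.0])

same_direction = cosine_similarity(base, np.array([2.0, 0.0]))
orthogonal = cosine_similarity(base, np.array([0.0, 3.0]))
opposite = cosine_similarity(base, np.array([-4.0, 0.0]))

print(same_direction, orthogonal, opposite)  # prints: 1.0 0.0 -1.0
```

The comparison vectors have different lengths (2, 3, and 4), yet the scores are exactly 1, 0, and -1: cosine similarity depends only on direction, not on magnitude.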

In [None]:
def find_most_similar(query_word, embeddings, word_to_idx, top_k=5):
    """
    Finds the most similar words to a query word based on cosine similarity.
    """
    if query_word not in word_to_idx:
        print(f"Error: '{query_word}' not in vocabulary.")
        return

    # Get the embedding vector for the query word
    query_vector = embeddings[word_to_idx[query_word]]
    
    # Calculate cosine similarity between the query vector and all other vectors
    # Cosine Similarity = (A . B) / (||A|| * ||B||)
    dot_products = np.dot(embeddings, query_vector)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vector)
    similarities = dot_products / norms
    
    # Get the indices of the top_k most similar words (excluding the query word itself)
    # We use argsort to get the indices that would sort the array in ascending order,
    # then we take the last few indices for descending order.
    top_indices = np.argsort(similarities)[-top_k-1:-1][::-1]
    
    print(f"🔎 Words most similar to '{query_word}':")
    for i in top_indices:
        word = list(word_to_idx.keys())[i]
        similarity = similarities[i]
        print(f"  - {word:15s} (Similarity: {similarity:.3f})")

# --- Perform Similarity Queries ---
find_most_similar('vibration', learned_embeddings, word_to_idx)
print("-" * 30)
find_most_similar('coolant', learned_embeddings, word_to_idx)
print("-" * 30)
find_most_similar('hydraulic', learned_embeddings, word_to_idx)

## 🧠 Part 4: Using Pre-trained Embeddings (GloVe)

Training your own embeddings is great for domain-specific tasks, but it requires a large corpus of text. For general-purpose language understanding, it's often better to start with **pre-trained embeddings** that have been trained on massive datasets (such as Wikipedia and large web crawls).

**GloVe (Global Vectors for Word Representation)** is another popular model for learning embeddings. While Word2Vec is a "predictive" model, GloVe is a "count-based" model that is trained directly on the statistics of a global co-occurrence matrix.
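
To give a feel for what "trained directly on co-occurrence statistics" means, here is a rough sketch of the per-pair loss described in the GloVe paper: a weighted squared error between a dot product (plus biases) and the log of the co-occurrence count. The function names are illustrative, and `x_max=100` and `alpha=0.75` are the defaults reported in the paper. The pre-trained vectors loaded in the next cell were produced by the authors' own training code, not by this snippet.

```python
import numpy as np

def glove_weight(count, x_max=100.0, alpha=0.75):
    """Down-weights rare pairs and caps the influence of very frequent ones."""
    return np.where(count < x_max, (count / x_max) ** alpha, 1.0)

def glove_pair_loss(w_i, w_j, b_i, b_j, count):
    """Weighted squared error between the model score and log(count)."""
    prediction = w_i @ w_j + b_i + b_j
    return glove_weight(count) * (prediction - np.log(count)) ** 2

rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=4), rng.normal(size=4)   # hypothetical 4-d vectors
print(glove_pair_loss(w_i, w_j, 0.0, 0.0, count=12.0))
```

Contrast this with Skip-Gram: there is no sliding-window prediction step at training time, only a regression onto counts that were collected once, up front.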

The `torchtext` library makes it easy to download and use pre-trained GloVe vectors. We can load them and see how they represent our vocabulary, even without training on our specific corpus.

**Note:** This requires an internet connection to download the GloVe vectors the first time you run it. The `6B` download is a zip archive of roughly 800 MB.

In [None]:
# This cell is optional and requires an internet connection.
# If you have issues, you can skip to the next section.
try:
    from torchtext.vocab import GloVe
    
    # Load the pre-trained GloVe embeddings.
    # '6B' means it was trained on 6 billion tokens.
    # 'dim=100' means we'll get 100-dimensional vectors.
    glove = GloVe(name='6B', dim=100)
    print("✅ Successfully loaded pre-trained GloVe (6B, 100d) vectors.")

    # Let's get the GloVe vector for a word
    word = 'pressure'
    glove_vector = glove[word]
    print(f"\nGloVe vector for '{word}':")
    print(glove_vector)
    print(f"Vector dimension: {glove_vector.shape[0]}")

    # Find words similar to 'pressure' using GloVe
    def find_glove_similar(query_word, glove_model, top_k=5):
        if query_word not in glove_model.stoi:
            print(f"'{query_word}' not in GloVe vocabulary.")
            return
        
        query_vector = glove_model[query_word]
        
        # Calculate cosine similarity against the entire GloVe vocabulary in one
        # vectorized call (a Python loop over ~400,000 words would be very slow).
        # glove_model.vectors has shape (vocab_size, dim); the query broadcasts across rows.
        similarities = F.cosine_similarity(glove_model.vectors, query_vector.unsqueeze(0), dim=1)
        
        # Take the top_k + 1 best matches; the best match is the query word itself
        top_sims, top_indices = torch.topk(similarities, top_k + 1)
        
        print(f"\n🔎 Words most similar to '{query_word}' in GloVe:")
        for sim, idx in zip(top_sims[1:].tolist(), top_indices[1:].tolist()):
            print(f"  - {glove_model.itos[idx]:15s} (Similarity: {sim:.3f})")

    find_glove_similar('pressure', glove)

except ImportError:
    print("⚠️ torchtext not found. Skipping GloVe section.")
    print("Install it with: pip install torchtext")
except Exception as e:
    print(f"Could not download GloVe vectors. Skipping section. Error: {e}")

## 🧠 Part 5: Contextual Embeddings with Transformers (BERT)

Word2Vec and GloVe produce **static embeddings**. This means that a word like "pressure" has the *exact same* vector regardless of its context.
*   "The tire **pressure** is low."
*   "He felt the **pressure** of the deadline."

Clearly, the meaning is different in each sentence. This is a major limitation that static embeddings cannot overcome.

**Contextual embeddings**, generated by models like **BERT (Bidirectional Encoder Representations from Transformers)**, solve this problem. BERT processes the entire sentence at once, and the embedding it generates for a word is a function of all the other words in that sentence.

This allows the model to create dynamic, context-aware word vectors. We'll use the popular `transformers` library from HuggingFace to easily load a pre-trained BERT model and see this in action.

**Note:** This section requires the `transformers` library (`pip install transformers`) and an internet connection to download the model.

In [None]:
# This cell is optional and requires the transformers library and an internet connection.
try:
    from transformers import AutoTokenizer, AutoModel

    # Load a pre-trained tokenizer and model.
    # 'distilbert-base-uncased' is a smaller, faster version of BERT.
    model_name = 'distilbert-base-uncased'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device)
    print(f"✅ Successfully loaded pre-trained '{model_name}' model.")

    # --- Compare embeddings for the same word in different contexts ---
    sentence1 = "The tire pressure is low."
    sentence2 = "He felt the pressure of the deadline."

    def get_word_embedding(sentence, word, tokenizer, model):
        """
        Gets the contextual embedding for a specific word in a sentence.
        """
        # Tokenize the sentence
        inputs = tokenizer(sentence, return_tensors='pt').to(device)
        tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
        
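        # Find the index of the first occurrence of our target word.
        # The tokenizer is uncased, so tokens are lowercase; lowercase the query to match.
        try:
            word_index = tokens.index(word.lower())
        except ValueError:
            print(f"Word '{word}' not found as a single token in the sentence.")
            return None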
            
        # Get the model's output (the hidden states)
        with torch.no_grad():
            outputs = model(**inputs)
            last_hidden_states = outputs.last_hidden_state

        # Extract the embedding for our target word
        word_embedding = last_hidden_states[0, word_index, :].cpu()
        return word_embedding

    # Get the two different embeddings for "pressure"
    embedding1 = get_word_embedding(sentence1, 'pressure', tokenizer, model)
    embedding2 = get_word_embedding(sentence2, 'pressure', tokenizer, model)

    if embedding1 is not None and embedding2 is not None:
        # Calculate the cosine similarity between the two vectors
        similarity = F.cosine_similarity(embedding1.unsqueeze(0), embedding2.unsqueeze(0)).item()

        print(f"\n--- Contextual Embedding Comparison for the word 'pressure' ---")
        print(f"Sentence 1: '{sentence1}'")
        print(f"Sentence 2: '{sentence2}'")
        print(f"\nCosine Similarity between the two embeddings: {similarity:.3f}")

        if similarity < 0.9:
            print("\n✅ The embeddings are different! The model has captured the different contexts.")
        else:
            print("\n🤔 The embeddings are very similar. This can happen with some models/sentences.")

except ImportError:
    print("⚠️ transformers library not found. Skipping BERT section.")
    print("Install it with: pip install transformers")
except Exception as e:
    print(f"Could not download/run BERT model. Skipping section. Error: {e}")
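
The helper above gives up when the target word is split into several subword tokens. One common workaround, sketched here under stated assumptions, is to average the vectors of the word's pieces. The functions `find_wordpiece_span` and `pool_word_embedding` are hypothetical helpers, and the token list and hidden states below are made-up stand-ins so the example runs without downloading a model; the actual split of any word depends on the tokenizer's vocabulary.

```python
import numpy as np

def find_wordpiece_span(tokens, word_pieces):
    """Return (start, end) where the list `word_pieces` occurs in `tokens`, else None."""
    n = len(word_pieces)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == word_pieces:
            return start, start + n
    return None

def pool_word_embedding(tokens, hidden_states, word_pieces):
    """Mean-pool the hidden states of all pieces that make up one word."""
    span = find_wordpiece_span(tokens, word_pieces)
    if span is None:
        return None
    start, end = span
    return hidden_states[start:end].mean(axis=0)

# Made-up tokenisation and hidden states (one 3-dim row per token).
tokens = ["[CLS]", "em", "##bed", "##ding", "##s", "matter", "[SEP]"]
hidden_states = np.arange(len(tokens) * 3, dtype=float).reshape(len(tokens), 3)

vec = pool_word_embedding(tokens, hidden_states, ["em", "##bed", "##ding", "##s"])
print(vec)  # [7.5 8.5 9.5]
```

With the objects from the cell above, the pieces could come from `tokenizer.tokenize(word)` and the hidden states from `outputs.last_hidden_state[0]`. Mean pooling is only one option; taking the first piece's vector is another common choice.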

## 🎉 Summary & Key Takeaways

Excellent work! You've journeyed from the most basic form of word representation to the state-of-the-art, building a foundational understanding of how modern NLP models process language.

### The Evolution of Embeddings:

1.  **One-Hot Encoding:**
    *   **Concept:** A sparse vector with a single `1`.
    *   **Flaw:** Assumes all words are completely unrelated (orthogonal vectors). Useless for capturing meaning.

2.  **Co-occurrence Matrices:**
    *   **Concept:** Count how often words appear near each other. The rows of this matrix are simple embeddings.
    *   **Improvement:** Captures the distributional hypothesis ("you are known by the company you keep"). Similar words have similar vectors.
    *   **Flaw:** High-dimensional, sparse, and doesn't scale well.

3.  **Learned Static Embeddings (Word2Vec, GloVe):**
    *   **Concept:** *Learn* low-dimensional, dense embeddings from a corpus, either by training a shallow neural network (Word2Vec) or by fitting vectors to global co-occurrence statistics (GloVe).
    *   **Improvement:** Computationally efficient, dense, and captures complex semantic relationships (e.g., analogies like "king - man + woman ≈ queen").
    *   **Flaw:** Static: each word has a single vector, regardless of context.

4.  **Contextual Embeddings (BERT, Transformers):**
    *   **Concept:** Generate embeddings for a word based on the entire sentence it appears in.
    *   **Improvement:** Dynamic and context-aware. Addresses the problem of words with multiple meanings (polysemy).
    *   **Current State-of-the-Art:** Transformer-based contextual representations underpin modern Large Language Models.
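
As a quick recap of the first two steps in this evolution, the sketch below contrasts one-hot vectors with co-occurrence rows on a tiny, made-up corpus. The corpus and the choice of the whole sentence as the context window are assumptions for illustration, not the settings used earlier in the notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical three-sentence corpus.
corpus = [
    "the cat drinks milk",
    "the kitten drinks milk",
    "the car needs fuel",
]
sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# One-hot: each word is a distinct row of the identity matrix.
one_hot = np.eye(len(vocab))

# Co-occurrence: count every other word in the same sentence (assumed window).
cooc = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j, c in enumerate(s):
            if i != j:
                cooc[idx[w], idx[c]] += 1

sim_one_hot = cosine(one_hot[idx["cat"]], one_hot[idx["kitten"]])
sim_cat_kitten = cosine(cooc[idx["cat"]], cooc[idx["kitten"]])
sim_cat_car = cosine(cooc[idx["cat"]], cooc[idx["car"]])

print(f"one-hot  cat vs kitten: {sim_one_hot:.3f}")     # 0.000
print(f"co-occur cat vs kitten: {sim_cat_kitten:.3f}")  # 1.000
print(f"co-occur cat vs car:    {sim_cat_car:.3f}")     # 0.333
```

One-hot vectors give zero similarity for every pair of distinct words, while the co-occurrence rows already place "cat" closer to "kitten" than to "car", at the cost of one dimension per vocabulary word.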

### What's Next?

Now that you understand how to turn text into meaningful numbers, you're ready to build powerful NLP models. In the next notebook, we'll dive into the **Hugging Face Transformers library**, the go-to toolkit for applying pre-trained models like BERT to tasks such as sentiment analysis and question answering.

<div align="center">
<b>You've mastered the building blocks. Now, let's build something powerful. 🚀</b>
</div>