# 🧲 Notebook 06: Embeddings & Vector Semantics

**Week 3-4: Deep Learning & NLP Foundations**  
**Gen AI Masters Program**

---

## 📋 Objectives

By the end of this notebook, you will master:
1. ✅ One-hot vs dense embeddings
2. ✅ Co-occurrence matrices and dimensionality reduction
3. ✅ Training Word2Vec (skip-gram) from scratch in PyTorch
4. ✅ Visualizing manufacturing vocabulary in embedding space
5. ✅ Using pre-trained GloVe embeddings
6. ✅ Extracting contextual embeddings with Transformers

**Estimated Time:** 4 hours

---

## 🧠 Why Embeddings Matter

Embeddings map discrete tokens to continuous vectors that capture:
- Semantic similarity ("temperature" ↔ "heat")
- Relationships ("pump" ↔ "hydraulic")
- Manufacturing context ("vibration" ↔ "bearing")

Embeddings power attention, transformers, and downstream generative models.

In [None]:
# Imports
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.logsigmoid in the negative-sampling loss
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

import math
from collections import Counter, defaultdict
from typing import List, Tuple

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'✅ Using device: {device}')
print(f'PyTorch version: {torch.__version__}')

## 1️⃣ One-Hot vs Dense Representations

### One-Hot Encoding
- Vector of vocabulary size
- Exactly one "1"; rest are zeros
- No notion of similarity

### Dense Embeddings
- Low-dimensional vectors (e.g., 100)
- Learned from data
- Capture semantics

In [None]:
manufacturing_vocab = ['temperature', 'vibration', 'pressure', 'coolant', 'bearing', 'leakage']
vocab_to_idx = {word: idx for idx, word in enumerate(manufacturing_vocab)}

def one_hot(word):
    vec = np.zeros(len(manufacturing_vocab))
    vec[vocab_to_idx[word]] = 1
    return vec

hot_temp = one_hot('temperature')
hot_vibration = one_hot('vibration')

print('One-hot for temperature:', hot_temp)
print('Dot product (temperature · vibration):', np.dot(hot_temp, hot_vibration))

### Learned Embeddings
We'll create a toy embedding matrix (randomly initialized here, so not yet trained) and compare two of its vectors with cosine similarity.

In [None]:
embedding_dim = 4
embeddings = nn.Embedding(len(manufacturing_vocab), embedding_dim)
torch.nn.init.xavier_uniform_(embeddings.weight)

temp_vec = embeddings(torch.tensor(vocab_to_idx['temperature'])).detach().numpy()
vibration_vec = embeddings(torch.tensor(vocab_to_idx['vibration'])).detach().numpy()
print('Embedding temperature:', np.round(temp_vec, 3))
print('Embedding vibration   :', np.round(vibration_vec, 3))
print('Cosine similarity     :', np.dot(temp_vec, vibration_vec) / (np.linalg.norm(temp_vec)*np.linalg.norm(vibration_vec)))

## 2️⃣ Co-occurrence Matrix

We'll build a co-occurrence matrix from synthetic maintenance reports.

In [None]:
maintenance_corpus = [
    'temperature spike detected in furnace chamber',
    'vibration increase near main bearing housing',
    'pressure drop indicates coolant circulation issue',
    'coolant leakage observed below hydraulic press',
    'bearing overheating threatens production halt',
    'sensor outage causing data gap in historian',
    'hydraulic pump cavitation producing audible noise',
    'lubrication schedule missed for gear assembly',
    'conveyor torque variation impacts line speed',
    'unexpected voltage surge tripped safety relay'
]

window_size = 2
tokenized_docs = [doc.lower().split() for doc in maintenance_corpus]
vocab = sorted({word for doc in tokenized_docs for word in doc})
idx = {word: i for i, word in enumerate(vocab)}
co_matrix = np.zeros((len(vocab), len(vocab)), dtype=np.float32)

for doc in tokenized_docs:
    for i, word in enumerate(doc):
        word_idx = idx[word]
        start = max(i - window_size, 0)
        end = min(i + window_size + 1, len(doc))
        for j in range(start, end):
            if i != j:
                context_idx = idx[doc[j]]
                co_matrix[word_idx, context_idx] += 1

co_df = pd.DataFrame(co_matrix, index=vocab, columns=vocab)
plt.figure(figsize=(12, 10))
sns.heatmap(co_df, cmap='Blues')
plt.title('Co-occurrence Matrix (Window=2)', fontweight='bold')
plt.tight_layout()
plt.show()
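
The counting loop above can be wrapped in a small function and sanity-checked. The sketch below is a minimal, self-contained version on a two-sentence toy corpus; the names `build_cooccurrence`, `toy_docs`, and `toy_counts` are illustrative and not part of the notebook. It shows one property worth verifying: because the window extends equally to the left and right, the resulting matrix should be symmetric.

```python
import numpy as np

def build_cooccurrence(docs, window=2):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.float32)
    for doc in docs:
        for i, word in enumerate(doc):
            lo, hi = max(i - window, 0), min(i + window + 1, len(doc))
            for j in range(lo, hi):
                if i != j:
                    counts[index[word], index[doc[j]]] += 1
    return vocab, index, counts

# Hypothetical toy corpus (already tokenized)
toy_docs = [
    "coolant leakage observed".split(),
    "coolant pressure drop".split(),
]
toy_vocab, toy_index, toy_counts = build_cooccurrence(toy_docs, window=1)

# A symmetric window yields a symmetric matrix: count(a, b) == count(b, a).
is_symmetric = bool(np.allclose(toy_counts, toy_counts.T))
print("symmetric:", is_symmetric)
print("coolant/leakage:", toy_counts[toy_index["coolant"], toy_index["leakage"]])
```

Each row of the matrix can be read as a sparse, high-dimensional vector for that word, which is what the dimensionality reduction step compresses.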

### Dimensionality Reduction (PCA)

In [None]:
pca = PCA(n_components=2)
co_embeddings = pca.fit_transform(co_matrix)

plt.figure(figsize=(10, 8))
for word, (x, y) in zip(vocab, co_embeddings):
    plt.scatter(x, y, color='steelblue')
    plt.text(x + 0.02, y + 0.02, word, fontsize=10)
plt.title('PCA Projection of Co-occurrence Embeddings')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
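
A 2-D projection discards information, so it helps to know how much variance the two plotted components retain. The sketch below reproduces the PCA projection with a plain NumPy SVD on a random stand-in matrix; `pca_project` and `toy_matrix` are hypothetical names used only for illustration. With scikit-learn, the same ratios are available from the fitted object's `explained_variance_ratio_` attribute.

```python
import numpy as np

def pca_project(matrix, n_components=2):
    """Project rows onto the top principal components via SVD."""
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions; s holds the singular values.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return projected, explained[:n_components]

# Random count-like matrix standing in for a co-occurrence matrix
rng = np.random.default_rng(0)
toy_matrix = rng.poisson(1.0, size=(8, 8)).astype(np.float64)

coords, explained_ratio = pca_project(toy_matrix)
print("projected shape:", coords.shape)
print("variance kept by 2 components:", explained_ratio.sum())
```

If the two components keep only a small share of the variance, distances in the scatter plot should be interpreted with caution.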

## 3️⃣ Training Word2Vec (Skip-Gram) from Scratch

We'll train on the maintenance corpus to learn domain-specific embeddings.

In [None]:
class SkipGramDataset(Dataset):
    def __init__(self, tokenized_docs: List[List[str]], vocab_to_idx: dict, window_size: int = 2):
        self.pairs = []
        for doc in tokenized_docs:
            indexed = [vocab_to_idx[word] for word in doc]
            for i, center in enumerate(indexed):
                start = max(i - window_size, 0)
                end = min(i + window_size + 1, len(indexed))
                for j in range(start, end):
                    if i != j:
                        context = indexed[j]
                        self.pairs.append((center, context))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return torch.tensor(self.pairs[idx][0]), torch.tensor(self.pairs[idx][1])


skipgram_dataset = SkipGramDataset(tokenized_docs, idx, window_size=2)
skipgram_loader = DataLoader(skipgram_dataset, batch_size=64, shuffle=True)
print(f'Total skip-gram pairs: {len(skipgram_dataset)}')
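
To make the pair generation concrete, the sketch below applies the same windowing logic to a single sentence from the corpus and keeps the words as strings instead of indices. The function name `skipgram_pairs` is illustrative and not part of the notebook.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(i - window_size, 0)
        end = min(i + window_size + 1, len(tokens))
        for j in range(start, end):
            if i != j:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "vibration increase near main bearing housing".split()
pairs = skipgram_pairs(sentence, window_size=2)

print("number of pairs:", len(pairs))
for center, context in pairs[:5]:
    print(f"{center} -> {context}")
```

Tokens near the sentence boundary produce fewer pairs than tokens in the middle, because part of their window falls outside the sentence.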

### Skip-Gram Model

In [None]:
class SkipGramModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.target_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embed_dim)

    def forward(self, center_words, context_words):
        center_vecs = self.target_embeddings(center_words)
        context_vecs = self.context_embeddings(context_words)
        scores = torch.sum(center_vecs * context_vecs, dim=1)
        return scores


EMBED_DIM = 100
model = SkipGramModel(len(vocab), EMBED_DIM).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# No built-in criterion is needed: training uses the custom negative-sampling loss defined below.

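# Negative sampling: build the noise distribution from corpus token counts.
# (vocab holds unique words, so counting within vocab would always give 1.)
token_counts = Counter(word for doc in tokenized_docs for word in doc)
word_freqs = np.array([token_counts[word] for word in vocab], dtype=np.float32)
word_freqs = word_freqs / word_freqs.sum()
# Raise unigram frequencies to the 0.75 power and renormalize
noise_distribution = torch.tensor(word_freqs ** 0.75 / np.sum(word_freqs ** 0.75))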

def negative_sampling_loss(model, center_words, pos_context, neg_sample_size=5):
    center_embeds = model.target_embeddings(center_words)
    pos_embeds = model.context_embeddings(pos_context)

    pos_scores = torch.sum(center_embeds * pos_embeds, dim=1)
    pos_loss = F.logsigmoid(pos_scores)

    batch_size = center_words.size(0)
    neg_words = torch.multinomial(noise_distribution, batch_size * neg_sample_size, replacement=True).to(device)
    neg_words = neg_words.view(batch_size, neg_sample_size)
    neg_embeds = model.context_embeddings(neg_words)

    # squeeze only the last dim so a batch of size 1 keeps its batch dimension
    neg_scores = torch.bmm(neg_embeds, center_embeds.unsqueeze(2)).squeeze(2)
    neg_loss = F.logsigmoid(-neg_scores).sum(dim=1)

    return -(pos_loss + neg_loss).mean()


EPOCHS = 200
loss_history = []
print('🔄 Training Word2Vec skip-gram embeddings...')
for epoch in range(1, EPOCHS + 1):
    epoch_loss = 0
    for centers, contexts in skipgram_loader:
        centers = centers.to(device)
        contexts = contexts.to(device)

        loss = negative_sampling_loss(model, centers, contexts, neg_sample_size=10)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * centers.size(0)

    avg_loss = epoch_loss / len(skipgram_dataset)
    loss_history.append(avg_loss)
    if epoch % 40 == 0 or epoch == 1:
        print(f'Epoch {epoch:03d} | Loss: {avg_loss:.4f}')

print('✅ Training complete!')
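
The noise distribution used for negative sampling raises unigram frequencies to the 0.75 power before renormalizing. The sketch below, on made-up token counts, illustrates the effect: frequent words are sampled somewhat less often than their raw frequency suggests, and rare words somewhat more often. The helper name and the counts are assumptions for illustration only.

```python
import numpy as np

def noise_distribution(counts, power=0.75):
    """Unigram distribution raised to `power` and renormalized."""
    counts = np.asarray(counts, dtype=np.float64)
    weights = counts ** power
    return weights / weights.sum()

toy_counts = np.array([100, 10, 1])  # hypothetical token counts
unigram = toy_counts / toy_counts.sum()
smoothed = noise_distribution(toy_counts)

print("unigram :", np.round(unigram, 3))
print("smoothed:", np.round(smoothed, 3))
```

On a corpus as small as the one in this notebook most words occur once, so the smoothed distribution is close to uniform; the exponent matters more on larger corpora with skewed frequencies.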

In [None]:
plt.figure(figsize=(12, 4))
plt.plot(loss_history, color='blue', linewidth=2)
plt.title('Skip-Gram Training Loss', fontweight='bold')
plt.xlabel('Epoch')
plt.ylabel('Negative Sampling Loss')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Visualizing Learned Embeddings

In [None]:
with torch.no_grad():
    word_vectors = model.target_embeddings.weight.cpu().numpy()

two_d = TSNE(n_components=2, random_state=42, perplexity=5).fit_transform(word_vectors)

plt.figure(figsize=(10, 8))
for word, (x, y) in zip(vocab, two_d):
    plt.scatter(x, y, color='darkorange')
    plt.text(x + 0.05, y + 0.05, word, fontsize=10)
plt.title('t-SNE Visualization of Manufacturing Word2Vec Embeddings')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Similarity Queries

In [None]:
def most_similar(word: str, top_k: int = 5):
    if word not in idx:
        raise ValueError('Word not in vocabulary')
    vector = word_vectors[idx[word]]
    similarities = []
    for other_word in vocab:
        sim = np.dot(vector, word_vectors[idx[other_word]]) / (np.linalg.norm(vector) * np.linalg.norm(word_vectors[idx[other_word]]) + 1e-9)
        similarities.append((other_word, sim))
    similarities = sorted(similarities, key=lambda x: x[1], reverse=True)
    return similarities[1:top_k+1]

print("Top neighbors for 'vibration':")
for word, score in most_similar('vibration'):
    print(f'  {word:15s} -> {score:.3f}')
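
For larger vocabularies, the per-word Python loop can be replaced by a single matrix-vector product. The following is a minimal sketch of that idea on toy 2-D vectors; `top_k_similar` and `toy_vectors` are hypothetical names, and the query is excluded explicitly instead of relying on it sorting first.

```python
import numpy as np

def top_k_similar(query_idx, vectors, k=3):
    """Return (index, cosine) pairs for the k nearest rows, excluding the query."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-9, None)
    sims = unit @ unit[query_idx]      # cosine similarity to every row at once
    sims[query_idx] = -np.inf          # never return the query itself
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
neighbors = top_k_similar(0, toy_vectors, k=2)
print(neighbors)
```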

## 4️⃣ Using Pre-trained GloVe Embeddings

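GloVe (Global Vectors) learns word vectors from corpus-wide co-occurrence counts, in contrast to skip-gram, which learns from local context windows one pair at a time.

The cell below shows how to load pre-trained vectors with `torchtext`, which downloads them on first use (internet required). It is commented out so the notebook still runs offline.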

In [None]:
# OPTIONAL: Uncomment to download pre-trained GloVe vectors (requires internet)
# from torchtext.vocab import GloVe
# glove = GloVe(name='6B', dim=100)  # 100-dimensional embeddings
#
# def glove_vector(word: str):
#     return glove[word]
#
# manufacturing_terms = ['sensor', 'factory', 'hydraulic', 'temperature', 'safety', 'robot']
# for term in manufacturing_terms:
#     vec = glove_vector(term)
#     print(f'{term:12s} | vector norm: {vec.norm().item():.4f}')
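
If `torchtext` is not available, pre-trained GloVe vectors can also be read directly, since they are distributed as plain text with one token per line followed by its space-separated float values. The sketch below parses that format from an in-memory sample; the vectors are made up and 3-dimensional, and `load_glove_text` is a hypothetical helper, not a library function. For a real file such as `glove.6B.100d.txt`, an opened file handle would be passed in place of the `StringIO` object.

```python
import io
import numpy as np

def load_glove_text(handle, wanted=None):
    """Parse GloVe-style lines into a {word: vector} dict."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if wanted is not None and word not in wanted:
            continue  # skip words we do not need to save memory
        vectors[word] = np.array([float(v) for v in values], dtype=np.float32)
    return vectors

# Made-up 3-dimensional vectors standing in for a real GloVe file
sample = io.StringIO(
    "sensor 0.1 0.2 0.3\n"
    "factory -0.4 0.5 0.6\n"
    "robot 0.7 -0.8 0.9\n"
)
glove_subset = load_glove_text(sample, wanted={"sensor", "robot"})
print(sorted(glove_subset))
print(glove_subset["robot"])
```

Filtering with `wanted` keeps only the domain terms of interest, which matters because the full vocabulary contains hundreds of thousands of entries.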

### Projecting GloVe Embeddings (If Available)

In [None]:
# OPTIONAL: Visualize subset of GloVe vectors
# subset = torch.stack([glove[word] for word in manufacturing_terms])
# # perplexity must be smaller than the number of samples (6 terms here)
# tsne_glove = TSNE(n_components=2, random_state=42, perplexity=3).fit_transform(subset.numpy())
#
# plt.figure(figsize=(8, 6))
# for (x, y), word in zip(tsne_glove, manufacturing_terms):
#     plt.scatter(x, y, color='teal')
#     plt.text(x + 0.02, y + 0.02, word, fontsize=12)
# plt.title('t-SNE of GloVe Embeddings (Manufacturing Terms)')
# plt.grid(True, alpha=0.3)
# plt.tight_layout()
# plt.show()

## 5️⃣ Contextual Embeddings with Transformers

Static embeddings assign one vector per word. Contextual embeddings (BERT, RoBERTa, DistilBERT) change with sentence context.

We'll generate sentence embeddings for maintenance logs using HuggingFace Transformers.

In [None]:
# OPTIONAL: Requires transformers library and internet to download the model
# from transformers import AutoTokenizer, AutoModel
#
# model_name = 'distilbert-base-uncased'
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# bert_model = AutoModel.from_pretrained(model_name).to(device)
#
# maintenance_sentences = [
#     'Hydraulic pump failure triggered an emergency shutdown',
#     'Sensor drift detected during calibration routine',
#     'Operators reported excessive vibration near cutter assembly',
#     'Thermal runaway prevented by automatic coolant injection',
#     'Preventive maintenance completed without incidents'
# ]
#
# def encode_sentences(sentences: List[str]):
#     encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to(device)
#     with torch.no_grad():
#         outputs = bert_model(**encoded)
#     # Use CLS token (first token) as sentence embedding
#     cls_embeddings = outputs.last_hidden_state[:, 0, :]
#     return cls_embeddings.cpu().numpy()
#
# contextual_vectors = encode_sentences(maintenance_sentences)
# # perplexity must be smaller than the number of samples (5 sentences here)
# tsne_contextual = TSNE(n_components=2, random_state=42, perplexity=2).fit_transform(contextual_vectors)
#
# plt.figure(figsize=(9, 6))
# for (x, y), sentence in zip(tsne_contextual, maintenance_sentences):
#     plt.scatter(x, y, color='crimson')
#     plt.text(x + 0.02, y + 0.02, sentence[:30] + '...', fontsize=9)
# plt.title('Contextual Embeddings (DistilBERT CLS)')
# plt.grid(True, alpha=0.3)
# plt.tight_layout()
# plt.show()
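
The cell above takes the first-token (CLS) vector as the sentence embedding. A common alternative is mean pooling: averaging the token vectors while ignoring padded positions. The sketch below demonstrates the masking arithmetic in NumPy on dummy arrays shaped like `outputs.last_hidden_state` and the tokenizer's `attention_mask`; in the notebook these would be PyTorch tensors, and `mean_pool` is an illustrative helper name.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    token_counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / token_counts

# Dummy batch: 2 sentences, 3 token slots, hidden size 2
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],  # third slot is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])
pooled = mean_pool(hidden, mask)
print(pooled)
```

Without the mask, the padded slot in the first sentence would pull its average toward the padding vector, which is why the mask is applied before summing.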

### Semantic Similarity with Contextual Embeddings

In [None]:
# OPTIONAL: Compute cosine similarity between sentence embeddings
# from sklearn.metrics.pairwise import cosine_similarity
# sim_matrix = cosine_similarity(contextual_vectors)
#
# plt.figure(figsize=(6, 5))
# sns.heatmap(sim_matrix, annot=True, cmap='Purples', xticklabels=False, yticklabels=False)
# plt.title('Contextual Embedding Similarity (DistilBERT)')
# plt.tight_layout()
# plt.show()

## 6️⃣ Embedding Best Practices

1. **Domain adaptation**: Fine-tune embeddings on manufacturing corpora
2. **Out-of-vocabulary (OOV)**: Use subword tokenization (BPE, WordPiece)
3. **Dimensionality**: Balance expressiveness vs compute (128-768 for transformers)
4. **Normalization**: Normalize vectors for cosine comparisons
5. **Storage**: Persist embeddings for downstream predictive maintenance
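
Points 4 and 5 can be sketched together: normalize the vectors once, then persist them alongside their vocabulary. This is a minimal sketch under stated assumptions; the words, vectors, and the file name `embeddings.npz` are made up, and NumPy's `.npz` format is one convenient option among several.

```python
import os
import tempfile
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-9, None)

words = np.array(["bearing", "coolant", "vibration"])
raw = np.array([[3.0, 4.0], [0.0, 2.0], [6.0, 8.0]])  # made-up embeddings
unit = l2_normalize(raw)

# After normalization, a plain dot product equals cosine similarity.
cosine_bearing_vibration = float(unit[0] @ unit[2])
print("cosine(bearing, vibration):", cosine_bearing_vibration)

# Persist words and vectors together, then reload them.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npz")  # hypothetical file name
    np.savez(path, words=words, vectors=unit)
    with np.load(path) as restored:
        restored_words = restored["words"].tolist()
        restored_vectors = restored["vectors"]

print(restored_words)
```

Storing the vocabulary next to the matrix keeps row order and word identity in sync when a downstream model loads the vectors.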

## 🎉 Summary

Awesome work embedding knowledge into vectors!

### Key Concepts
- ✅ One-hot vs dense embeddings
- ✅ Co-occurrence matrices & PCA
- ✅ Training Word2Vec (skip-gram with negative sampling)
- ✅ Visualizing manufacturing semantics
- ✅ Loading pre-trained GloVe vectors
- ✅ Extracting contextual embeddings with transformers

### What You Built
1. 🧮 Co-occurrence matrix for maintenance corpora
2. 🧠 Skip-gram Word2Vec trainer in PyTorch
3. 🔍 t-SNE visualization of manufacturing vocabulary
4. 🧱 Cosine similarity explorer
5. 🧰 Optional hooks into GloVe & DistilBERT

### Manufacturing Insights
- Group similar failure terms for root-cause libraries
- Use embeddings to cluster maintenance logs and highlight emerging issues
- Initialize downstream classifiers with domain-specific vectors

### Next Steps
Advance to **Notebook 07: HuggingFace Transformers** to integrate these embeddings into production workflows.

<div align="center">
<b>Embeddings secured! Onward to HuggingFace pipelines. 🤝🚀</b>
</div>