# 🧮 Week 5-6 · Notebook 09 · A Deep Dive into Vector Embeddings

**Module:** LLMs, Prompt Engineering & RAG  
**Project:** Build the Knowledge Core for the Manufacturing Copilot

---

Vector embeddings are the foundation of modern semantic search and Retrieval-Augmented Generation (RAG). They are numerical representations of text that capture its semantic meaning. In this notebook, we'll explore what embeddings are, how to create them, and why they are the key to making our Manufacturing Copilot's knowledge base searchable.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
1. ✅ **Explain Vector Embeddings:** Describe what embeddings are and how they capture the meaning of text.
2. ✅ **Generate and Compare Embeddings:** Use a Hugging Face model to create embeddings and measure their similarity.
3. ✅ **Visualize Embeddings:** Use dimensionality reduction (PCA) to visualize the relationships between documents in 2D space.
4. ✅ **Evaluate Embedding Quality:** Understand the importance of evaluating embeddings and implement a basic recall metric.

## ⚙️ Setup: Installing Libraries

We'll need `sentence-transformers` to create embeddings, `scikit-learn` for dimensionality reduction (PCA), and `matplotlib` for plotting.

In [None]:
# !pip install -q sentence-transformers pandas numpy scikit-learn matplotlib

from sentence_transformers import SentenceTransformer, util
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

## 🧠 What are Vector Embeddings?

Imagine a giant library where books are organized not by title, but by their *content*. Books about similar topics are placed close together. Vector embeddings do this for text. An embedding model converts a piece of text (a sentence, a paragraph, or a whole document) into a list of numbers (a vector).

The magic is that texts with similar meanings will have vectors that are "close" to each other in this high-dimensional space. This is what allows us to find relevant documents for a user's query.

Let's create some embeddings for a few sample manufacturing documents.

In [None]:
# A more realistic set of documents for our knowledge base
documents = pd.DataFrame([
    {"label": "maintenance", "text": "The hydraulic press requires an oil change every 2000 hours of operation to prevent wear."},
    {"label": "maintenance", "text": "Weekly inspection of the conveyor belt's tension is mandatory for all shift supervisors."},
    {"label": "incident", "text": "Incident Report #451: CNC machine #3 stalled due to a coolant leak. Downtime was 35 minutes."},
    {"label": "incident", "text": "Incident Report #452: Operator noticed unusual vibrations from the main compressor unit."},
    {"label": "safety", "text": "Safety Alert: Always perform lockout-tagout before servicing any machinery with moving parts."},
    {"label": "quality", "text": "Quality Control: Part #A-103 is showing surface defects. Calibrate the vision system immediately."},
])

# Load a pre-trained model from Hugging Face
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Generate embeddings for our documents
embeddings = model.encode(documents.text.tolist(), convert_to_tensor=True)

print(f"Generated {embeddings.shape[0]} embeddings, each with {embeddings.shape[1]} dimensions.")
documents

## 📏 Measuring Similarity: Cosine Similarity

How do we measure the "closeness" of two vectors? The most common method is **Cosine Similarity**. It measures the cosine of the angle between two vectors. 

- A value of **1** means the vectors are identical in orientation (very similar).
- A value of **0** means they are orthogonal (unrelated).
- A value of **-1** means they are opposite (dissimilar).
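
To make the three cases above concrete, here is a minimal from-scratch sketch using plain Python. The function name `cosine_similarity` and the toy 2-dimensional vectors are illustrative assumptions, not part of any library; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors chosen to hit the three reference cases
v = [1.0, 2.0]
same_direction = cosine_similarity(v, [2.0, 4.0])   # parallel, different length
orthogonal = cosine_similarity(v, [-2.0, 1.0])      # perpendicular
opposite = cosine_similarity(v, [-1.0, -2.0])       # reversed direction

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that the first pair scores 1 even though the vectors differ in length: cosine similarity depends only on direction, not magnitude.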

Let's calculate the similarity between all of our documents.

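In [None]:
# Calculate the cosine similarity matrix
similarity_matrix = util.cos_sim(embeddings, embeddings).cpu().numpy()

# Styler needs unique row/column labels, so prefix each label with its document index
doc_names = [f"{i}: {label}" for i, label in enumerate(documents["label"])]
sim_df = pd.DataFrame(similarity_matrix, index=doc_names, columns=doc_names)

print("--- Cosine Similarity Matrix ---")
# Highlight off-diagonal values > 0.5 (the diagonal compares each document with itself)
# Note: Styler.map requires pandas >= 2.1; on older versions use Styler.applymap
sim_df.style.map(lambda x: "background-color: yellow" if 0.5 < x < 0.999 else "")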

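**Interpretation:** Look for the highlighted cells. Documents that share a topic, such as the two `incident` reports, should score higher with each other than with unrelated documents such as the `safety` alert. Exact values depend on the model, and a pair can fall below the 0.5 highlight threshold even when it is the closest match, so compare scores within a row rather than reading them as absolute. Ranking documents by these scores is the core of semantic search.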
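
Turning similarity scores into search results is a matter of sorting. The sketch below shows one way to do it with NumPy; the helper `top_k` and the 3-dimensional toy vectors are hypothetical stand-ins for real embeddings, not part of `sentence-transformers`.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k documents most similar to the query."""
    # Normalize so that a dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical 3-dimensional "embeddings" (real ones have hundreds of dimensions)
toy_docs = np.array([
    [0.9, 0.1, 0.0],  # doc 0
    [0.8, 0.3, 0.1],  # doc 1
    [0.0, 0.2, 0.9],  # doc 2
])
toy_query = np.array([1.0, 0.2, 0.0])

results = top_k(toy_query, toy_docs, k=2)
print(results)
```

Documents 0 and 1 point in roughly the same direction as the query, so they are returned ahead of document 2.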

## 🎨 Visualizing Embeddings with PCA

It's hard to visualize vectors with 384 dimensions. We can use a technique called **Principal Component Analysis (PCA)** to reduce the dimensions to 2, allowing us to plot them on a graph. This helps build intuition about how the model is organizing the data.

In [None]:
# Reduce dimensions to 2D using PCA
pca = PCA(n_components=2)
coords = pca.fit_transform(embeddings.cpu().numpy())

documents['pca_x'] = coords[:, 0]
documents['pca_y'] = coords[:, 1]

# Plot the 2D coordinates
plt.figure(figsize=(10, 8))
for label, group in documents.groupby('label'):
    plt.scatter(group.pca_x, group.pca_y, label=label, alpha=0.7)

# Annotate points
for i, row in documents.iterrows():
    plt.text(row.pca_x + 0.01, row.pca_y, f"Doc {i}", fontsize=9)

plt.title("2D Visualization of Document Embeddings using PCA")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.grid(True)
plt.show()

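**Interpretation:** Documents with related content should appear near each other in the plot, for example the two `incident` reports. Treat the picture as a rough guide: with only six points and two of 384 dimensions, PCA discards much of the structure, so documents that look close (or far apart) in 2D may not be so in the full space. A vector store does not use this projection; it runs a nearest-neighbor search over the full-dimensional vectors.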
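
When reading a 2D projection, it helps to know how much of the original variance the two components retain. The sketch below computes this with NumPy via SVD on random data; the matrix `X` is a hypothetical stand-in for an embedding matrix (6 rows, 384 columns), not the notebook's actual embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for an embedding matrix: 6 "documents", 384 dimensions
X = rng.normal(size=(6, 384))

# PCA via SVD of the mean-centered data
X_centered = X - X.mean(axis=0)
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)

# Each component's share of the total variance
explained_ratio = singular_values**2 / np.sum(singular_values**2)

retained_2d = explained_ratio[:2].sum()
print(f"Variance retained by the first 2 components: {retained_2d:.1%}")
```

For a fitted scikit-learn `PCA` object, the same per-component shares are available as `explained_variance_ratio_`. A low retained share is a signal to interpret distances in the plot cautiously.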

## 🧪 Evaluating Embedding Quality

Not all embedding models are created equal. For a real-world application, you must evaluate how well your chosen model performs on *your* data. A common metric is **Recall@k**.

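**Recall@k:** For a given question, does the correct (relevant) document appear in the top `k` retrieved results? If yes, the score for that question is 1, otherwise 0. The reported Recall@k is the average of these scores over all test questions. (This simple form assumes each question has exactly one relevant document.)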
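
The definition can be sketched independently of any embedding model. In the example below, the function names and the `rankings` data are hypothetical: each entry pairs a best-first list of retrieved document ids with the id of the one relevant document.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """Return 1 if relevant_id is among the first k ranked ids, else 0."""
    return int(relevant_id in ranked_ids[:k])

# Hypothetical rankings: (document ids ordered best-first, id of the relevant document)
rankings = {
    "q1": ([0, 3, 1], 0),  # relevant doc ranked 1st
    "q2": ([4, 2, 5], 2),  # relevant doc ranked 2nd
    "q3": ([1, 0, 5], 4),  # relevant doc not retrieved at all
}

def mean_recall(k):
    """Average the per-question 0/1 scores over all questions."""
    hits = [recall_at_k(ranked, relevant, k) for ranked, relevant in rankings.values()]
    return sum(hits) / len(hits)

recall_1 = mean_recall(1)
recall_2 = mean_recall(2)
print(f"Recall@1 = {recall_1:.2f}, Recall@2 = {recall_2:.2f}")
# prints: Recall@1 = 0.33, Recall@2 = 0.67
```

Recall@k can only stay the same or increase as `k` grows, which is why it is usually reported at several values of `k`.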

Let's run a simple evaluation.

In [None]:
# Define a few test questions and the ID of the document that should answer them
eval_questions = {
    "How often to change hydraulic oil?": 0, # Should match doc 0
    "What happened to the CNC machine?": 2, # Should match doc 2
    "What is the rule for servicing machines?": 4 # Should match doc 4
}

def evaluate_recall(k=1):
    correct = 0
    for question, doc_id in eval_questions.items():
        # Embed the question
        query_embedding = model.encode(question, convert_to_tensor=True)
        
        # Find the top_k most similar documents in our corpus
        hits = util.semantic_search(query_embedding, embeddings, top_k=k)[0]
        
        # Check if the correct document ID is in the retrieved hits
        retrieved_ids = [hit['corpus_id'] for hit in hits]
        if doc_id in retrieved_ids:
            correct += 1
            
    return correct / len(eval_questions)

recall_at_1 = evaluate_recall(k=1)
recall_at_2 = evaluate_recall(k=2)

print(f"Recall@1: {recall_at_1:.2%}") # Is the correct doc the #1 result?
print(f"Recall@2: {recall_at_2:.2%}") # Is the correct doc in the top 2 results?

## ✅ Congratulations! You've Completed the LLM & RAG Module!

This notebook concludes our deep dive into the core components of modern LLM applications. You now understand:

1.  **LLMs and Transformers:** The fundamental models that power generative AI.
2.  **Prompt Engineering:** How to guide LLMs to produce desired outputs.
3.  **Few-Shot Learning:** How to provide examples to improve performance on specific tasks.
4.  **Retrieval-Augmented Generation (RAG):** The architecture for connecting LLMs to external knowledge.
5.  **Vector Embeddings:** The technology that makes retrieval possible.

You have all the conceptual tools needed to build the knowledge core for the Manufacturing Copilot. In the next module, we will explore how to productionize these ideas and build a complete, end-to-end application.