# 🧮 Week 5-6 · Notebook 09 · Vector Embeddings

Understand how embeddings translate manufacturing text into math, evaluate model choices, and wire them into semantic search + RAG pipelines.

## 🎯 Learning Objectives
- Explain how embeddings capture semantics for maintenance, quality, and safety text.
- Compare open-source embedding models on manufacturing corpora.
- Visualize similarity relationships to spot clusters and outliers.
- Persist embeddings in vector stores and optimize search parameters.
- Design evaluation harnesses for recall@k, precision, and latency.

## 🧠 Concept Recap
- Embeddings map text into high-dimensional vectors where proximity ≈ semantic similarity.
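- Cosine similarity compares only the angle between two vectors; the dot product also grows with vector magnitude, so the two rank results identically only when vectors are normalized to unit length.
- Domain-specific embeddings (e.g., fine-tuned on maintenance logs) represent plant jargon and abbreviations that general-purpose models handle poorly.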
- Hybrid search pairs dense vectors with keywords for precision + recall.
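
The difference between cosine similarity and the dot product can be checked on toy vectors. The following is a minimal sketch in plain Python; the vectors `v`, `w`, and `u` are made up for illustration and are not real embeddings.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

v = [1.0, 2.0, 3.0]
w = [2.0, 4.0, 6.0]   # same direction as v, twice the length
u = [3.0, -1.0, 0.5]  # a different direction

print(round(cosine(v, w), 6))  # 1.0: direction is identical, length is ignored
print(dot(v, v), dot(v, w))    # 14.0 28.0: the dot product doubles with the length
print(round(cosine(v, u), 3))  # well below 1: the directions differ
```

If the vector store normalizes embeddings to unit length, both measures produce the same ranking, which is why many pipelines normalize at encode time.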

## 🏭 Manufacturing Use Cases
| Scenario | Embedding Application |
| --- | --- |
| Maintenance ticket routing | Match new incidents to historical fixes |
| Spare parts search | Map component descriptions to supplier catalogs |
| Shift summaries | Cluster similar incidents for reporting |
| Quality deviations | Retrieve similar NCRs for containment plans |
| EHS knowledge base | Surface relevant policies for incidents |


In [None]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

samples = pd.DataFrame([
    {"text": "Change filter on paint booth every 2 shifts.", "label": "maintenance"},
    {"text": "Replace hydraulic oil in press 12 quarterly.", "label": "maintenance"},
    {"text": "Inspect conveyor belt tension weekly.", "label": "maintenance"},
    {"text": "Calibrate torque wrench before use.", "label": "quality"},
    {"text": "Supplier shipped incorrect fasteners for line 8.", "label": "supply"},
    {"text": "Record OEE drop to 71% after unplanned downtime.", "label": "operations"},
])

embeddings = model.encode(samples.text.tolist(), convert_to_tensor=True)
embeddings.shape

In [None]:
similarity_matrix = util.cos_sim(embeddings, embeddings).cpu().numpy()
sim_df = pd.DataFrame(similarity_matrix, columns=samples.text, index=samples.text)
sim_df.round(3)

## 🔍 Interpreting Similarities
- Values close to 1 indicate strong semantic overlap (e.g., maintenance tasks).
- Cross-domain pairs (maintenance vs. supply) show lower similarity, helpful for routing.
- Use a threshold (e.g., ≥0.6, tuned per model and corpus) to filter relevant neighbors in retrieval.
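
Threshold filtering can be sketched as follows. The similarity values and short texts below are invented for illustration, and `neighbors_above` is a hypothetical helper, not part of any library; in the notebook you would pass the rows of `similarity_matrix` instead.

```python
# Toy similarity matrix; the values are made up for illustration.
texts = ["change filter", "replace oil", "wrong fasteners"]
sims = [
    [1.00, 0.72, 0.18],
    [0.72, 1.00, 0.21],
    [0.18, 0.21, 1.00],
]

def neighbors_above(sims, texts, threshold=0.6):
    """Map each text to its neighbors with score >= threshold, excluding itself."""
    result = {}
    for i, row in enumerate(sims):
        hits = [(texts[j], s) for j, s in enumerate(row) if j != i and s >= threshold]
        # Highest-scoring neighbors first.
        result[texts[i]] = sorted(hits, key=lambda pair: pair[1], reverse=True)
    return result

neighbors = neighbors_above(sims, texts)
print(neighbors["change filter"])    # [('replace oil', 0.72)]
print(neighbors["wrong fasteners"])  # []
```

An empty neighbor list is a useful signal in its own right: it suggests the incident has no close historical match and may need manual routing.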

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
coords = pca.fit_transform(embeddings.cpu().numpy())

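plt.figure(figsize=(6, 6))
# One scatter call per label so each category gets a single color and legend entry.
for label, idx in samples.groupby("label").indices.items():
    plt.scatter(coords[idx, 0], coords[idx, 1], label=label)
# Annotate every point with its label.
for (x, y), label in zip(coords, samples.label):
    plt.text(x + 0.02, y + 0.02, label, fontsize=9)
plt.title("Embedding PCA Projection")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.grid(True)
plt.show()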

## 🧪 Model Comparison Cheatsheet
| Model | Dim | Strengths | Notes |
| --- | --- | --- | --- |
| `all-MiniLM-L6-v2` | 384 | Fast, lightweight, English-focused | Great baseline for prototypes |
| `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` | 768 | Multilingual support | Higher latency |
| `intfloat/multilingual-e5-large` | 1024 | Strong recall | Needs more memory |
| Custom fine-tuned model | varies | Captures plant-specific jargon | Requires labelled pairs |

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

store = Chroma.from_texts(samples.text.tolist(), HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"))
results = store.similarity_search("How do we maintain the press?", k=2)
[(r.page_content, r.metadata) for r in results]

## 📏 Evaluation Metrics
| Metric | Definition | Target |
| --- | --- | --- |
| Recall@k | Fraction of relevant docs retrieved in the top k | ≥ 0.85 |
| MRR | Mean reciprocal rank | ≥ 0.75 |
| Latency | Average retrieval time | < 200 ms |
| Memory footprint | Model + index size | Fits GPU/CPU budget |
| Drift | Cosine distance shift over time | < 10% monthly |

Log these metrics per deployment to build confidence in retrieval performance.
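
As a minimal sketch of the first two metrics, the functions below compute recall@k and MRR from ranked document ids. The ids and the `runs` data are hypothetical, and the helper names (`recall_at_k_from_ids`, `reciprocal_rank`) are chosen for this example only.

```python
def recall_at_k_from_ids(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical retrieval runs: (ranked result ids, ground-truth relevant ids).
runs = [
    (["d2", "d1", "d5"], ["d1"]),        # relevant doc found at rank 2
    (["d3", "d4", "d9"], ["d3", "d7"]),  # one of two relevant docs found at rank 1
]

mean_recall = sum(recall_at_k_from_ids(r, rel, k=3) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(mean_recall, mrr)  # 0.75 0.75
```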

In [None]:
eval_questions = pd.DataFrame([
    {"question": "When do we change hydraulic oil?", "relevant": samples.text[1]},
    {"question": "How often do we inspect conveyor belt tension?", "relevant": samples.text[2]},
])

def recall_at_k(question: str, relevant: str, k: int = 3) -> float:
    hits = store.similarity_search(question, k=k)
    return 1.0 if any(hit.page_content == relevant for hit in hits) else 0.0

eval_questions["recall@3"] = eval_questions.apply(lambda row: recall_at_k(row.question, row.relevant), axis=1)
eval_questions

## 🛠️ Best Practices
- Normalize text (units, casing) before embedding to reduce noise.
- Store metadata like equipment type, shift, language for filtered search.
- Periodically re-embed documents after SOP updates.
- Version embedding models and vector indexes.
- Combine dense + keyword filters for high-precision tasks.
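
The normalization step might look like the sketch below. The `UNIT_MAP` entries and the `normalize` helper are assumptions for illustration; a real plant would need its own abbreviation list and probably more careful handling of part numbers.

```python
import re

# Hypothetical unit map; extend it with the abbreviations used in your plant.
UNIT_MAP = {"nm": "newton meter", "psi": "pounds per square inch", "hrs": "hours"}

def normalize(text: str) -> str:
    """Lowercase, separate numbers from units, expand known units, collapse spaces."""
    text = text.lower().strip()
    # Split numbers glued to units, e.g. "45nm" -> "45 nm".
    text = re.sub(r"(\d)([a-z]+)\b", r"\1 \2", text)

    # Expand a unit abbreviation only when it directly follows a number.
    def expand(match):
        unit = match.group(2)
        return f"{match.group(1)} {UNIT_MAP.get(unit, unit)}"

    text = re.sub(r"(\d+(?:\.\d+)?) ([a-z]+)\b", expand, text)
    return re.sub(r"\s+", " ", text)

print(normalize("Torque bolts to 45Nm  after 8 HRS"))
# torque bolts to 45 newton meter after 8 hours
```

Apply the same function to documents at index time and to queries at search time; otherwise the two sides are embedded from differently shaped text.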

## 🧪 Lab Assignment
1. Collect 200 historical tickets and preprocess (normalization, unit expansion).
2. Benchmark at least three embedding models for recall@5 and latency.
3. Visualize clusters (PCA/UMAP) and highlight edge cases.
4. Document hardware requirements and cost for production deployment.
5. Publish evaluation report + recommendations for the RAG team.

## ✅ Checklist
- [ ] Embedding model selected and documented
- [ ] Preprocessing pipeline implemented
- [ ] Evaluation metrics computed and logged
- [ ] Vector store integration tested
- [ ] Governance plan for re-embedding schedule

## 📚 References
- SentenceTransformers documentation
- FAISS cookbook
- *Evaluating Embeddings in Industrial NLP* (2024)
- Week 08 RAG Implementation notebook

## Embedding Stores
- **Chroma / FAISS**: prototypes and small-scale deployments.
- **Milvus / Pinecone**: scalable options (Milvus is open source and can be self-hosted; Pinecone is a managed service).
- **Elasticsearch / OpenSearch**: hybrid dense + keyword search.
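
One common way to merge a dense ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique, not the configuration of any specific engine; the document ids are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each doc scores the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_hits = ["d1", "d3", "d2"]    # hypothetical vector-search ranking
keyword_hits = ["d3", "d4", "d1"]  # hypothetical keyword (BM25-style) ranking

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)  # ['d3', 'd1', 'd4', 'd2']
```

Documents that appear in both lists (`d3`, `d1`) rise to the top, which is the behavior hybrid search relies on for exact part numbers plus paraphrased descriptions.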

## Exercise
Benchmark three embedding models on your maintenance FAQ set. Compare recall@5 and latency.
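
A latency harness for this exercise could start from the sketch below. `benchmark` and `fake_encode` are hypothetical names; the stand-in encoder exists only so the code runs without downloading a model, and in the notebook you would pass something like `model.encode` for each candidate model.

```python
import statistics
import time

def benchmark(encode, queries, repeats=5):
    """Return mean and p95 latency in milliseconds for calling encode per query."""
    timings = []
    for _ in range(repeats):
        for q in queries:
            start = time.perf_counter()
            encode(q)
            timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    # Nearest-rank style p95, clamped to the last element for small samples.
    p95 = timings[min(len(timings) - 1, int(0.95 * len(timings)))]
    return {"mean_ms": statistics.mean(timings), "p95_ms": p95, "n": len(timings)}

# Stand-in encoder so the sketch runs without a model; replace with model.encode.
def fake_encode(text):
    return [float(ord(c)) for c in text]

queries = ["When do we change hydraulic oil?", "How do we maintain the press?"]
report = benchmark(fake_encode, queries)
print(report)
```

Run one warm-up call per model before timing, since the first call often includes model loading or cache initialization and would skew the mean.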