# Lab 17: Embeddings & Vectors Explained

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab17_embeddings_vectors.ipynb)

Understand how AI "understands" meaning through vector representations.

## Learning Objectives
- Understand what embeddings are and why they matter
- Create and visualize text embeddings
- Measure similarity between security concepts
- Build a simple semantic search system

**Next:** Lab 06 (Security RAG)

In [None]:
#@title Install dependencies (Colab only)
#@markdown Run this cell to install required packages in Colab

%pip install -q sentence-transformers scikit-learn numpy matplotlib

In [None]:
import numpy as np
from typing import List, Tuple
from sklearn.metrics.pairwise import cosine_similarity

print("✅ Libraries loaded!")

## Step 1: Understanding Vectors

Before we use real embeddings, let's understand vectors with simple examples.

In [None]:
# Simple example: Representing security concepts as 2D vectors
# (Real embeddings typically have 384-1536 dimensions, but the principle is the same)

# Let's say our two dimensions are:
# Dimension 1: How "attack-related" (0 = defensive, 1 = offensive)
# Dimension 2: How "network-related" (0 = host, 1 = network)

security_concepts = {
    "malware":       [0.9, 0.2],  # Offensive, host-focused
    "ransomware":    [0.95, 0.1], # Very offensive, host-focused
    "phishing":      [0.8, 0.6],  # Offensive, some network
    "firewall":      [0.1, 0.9],  # Defensive, network-focused
    "antivirus":     [0.1, 0.2],  # Defensive, host-focused
    "IDS":           [0.2, 0.95], # Defensive, network-focused
    "C2_traffic":    [0.9, 0.9],  # Offensive, network-focused
}

print("Security Concepts as 2D Vectors:")
print("-" * 50)
for concept, vector in security_concepts.items():
    print(f"  {concept:15} → {vector}")

## Step 2: Measuring Similarity with Cosine Similarity

Cosine similarity is the cosine of the angle between two vectors. It depends only on direction, not on vector length:
- **1.0** = Identical direction (same meaning)
- **0.0** = Perpendicular (unrelated)
- **-1.0** = Opposite direction (opposite meaning)
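
The formula behind these values is `cos(θ) = (a · b) / (‖a‖ ‖b‖)`. Here is a minimal pure-Python sketch of it. The helper name `cosine` is just for illustration; the lab itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel, different lengths
perpendicular = cosine([1.0, 0.0], [0.0, 1.0])    # 90 degrees apart
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # pointing opposite ways

print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))
```

Note that `[1, 2]` and `[2, 4]` score as identical even though one is twice as long: only the direction counts.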

In [None]:
# Calculate similarity between security concepts

def get_similarity(concept1: str, concept2: str) -> float:
    """Calculate cosine similarity between two concepts."""
    vec1 = np.array([security_concepts[concept1]])
    vec2 = np.array([security_concepts[concept2]])
    return cosine_similarity(vec1, vec2)[0][0]

# Compare some pairs
comparisons = [
    ("malware", "ransomware"),     # Should be very similar
    ("firewall", "IDS"),           # Should be similar (both defensive)
    ("malware", "antivirus"),      # Opposite purposes, but both host-focused, so still moderately similar
    ("phishing", "C2_traffic"),    # Both attacks, somewhat similar
]

print("Similarity Comparisons:")
print("-" * 60)
for c1, c2 in comparisons:
    sim = get_similarity(c1, c2)
    interpretation = "Very similar" if sim > 0.8 else "Similar" if sim > 0.5 else "Different"
    print(f"  {c1:15} vs {c2:15} = {sim:.3f}  ({interpretation})")

## Step 3: Using Real Embeddings (sentence-transformers)

Now let's use a real embedding model that captures semantic meaning!

In [None]:
# Load a real embedding model
# all-MiniLM-L6-v2 is fast, free, and good for learning

from sentence_transformers import SentenceTransformer

print("Loading embedding model (may take a moment)...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"✅ Model loaded! Embedding dimension: {model.get_sentence_embedding_dimension()}")

In [None]:
# Security-focused text for embedding comparison
security_texts = [
    "Malware using PowerShell for execution",
    "Ransomware encrypting files on the system",
    "Phishing email with malicious attachment",
    "Attacker stealing passwords from memory",
    "Quarterly financial report for Q4 2024",
    "Team meeting scheduled for Monday",
    "C2 beacon establishing connection",
    "Mimikatz used to dump credentials",
]

# Create embeddings for all texts
print("Creating embeddings for security texts...")
embeddings = model.encode(security_texts)

print(f"\n✅ Created {len(embeddings)} embeddings")
print(f"   Each embedding has {len(embeddings[0])} dimensions")
print(f"   First 5 values of first embedding: {embeddings[0][:5].round(3)}")

## 🔬 What's Actually INSIDE an Embedding?

This is the question everyone wonders about but few explain well!

### The Short Answer

Each dimension captures **some aspect of meaning**, but:
- We don't know exactly what each dimension represents
- The model learned these representations from millions of examples
- They're not human-interpretable like "dimension 1 = attack-related"

### The Longer Answer: How Embeddings Are Trained

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HOW EMBEDDING MODELS LEARN                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  TRAINING DATA: Millions of text examples                                   │
│                                                                             │
│  "The malware encrypted files using AES-256"                               │
│  "Ransomware typically encrypts data and demands payment"                  │
│  "The firewall blocked suspicious outbound traffic"                        │
│  ... millions more ...                                                      │
│                                                                             │
│  TRAINING PROCESS:                                                          │
│  ──────────────────                                                         │
│  The model learns: "malware" often appears near "encrypt", "file",         │
│  "execute", "payload" → these should have SIMILAR vectors                  │
│                                                                             │
│  The model also learns: "malware" rarely appears with "quarterly",          │
│  "meeting", "vacation" → these should have DIFFERENT vectors               │
│                                                                             │
│  RESULT: Similar concepts end up in similar locations in vector space!     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
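
To build intuition for the idea in the box, you can count which words share a sentence and treat each word's row of counts as a crude vector. This is only a toy sketch: the tiny corpus is made up for illustration, and models like all-MiniLM-L6-v2 are trained neural networks, not count tables.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus (illustrative only)
corpus = [
    "malware encrypted files payload",
    "ransomware encrypted files payment",
    "malware executed payload",
    "quarterly meeting report",
    "meeting report vacation",
]

# For each word, count the other words that share a sentence with it
cooccur = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for word in words:
        for other in words:
            if other != word:
                cooccur[word][other] += 1

vocab = sorted(cooccur)

def count_vector(word):
    """One count per vocabulary word: how often it co-occurs with `word`."""
    return [cooccur[word][v] for v in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

sim_related = cosine(count_vector("malware"), count_vector("ransomware"))
sim_unrelated = cosine(count_vector("malware"), count_vector("meeting"))

print(f"malware vs ransomware: {sim_related:.3f}")
print(f"malware vs meeting:    {sim_unrelated:.3f}")
```

"malware" and "ransomware" never appear in the same sentence here, yet they score as related because they share neighbors like "encrypted" and "files". "malware" and "meeting" share none, so they score zero.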

### Analogy: The Map of Meaning

Think of embeddings as GPS coordinates on a **map of meaning**:

```
            Network-focused
                  ↑
   🛡️ Firewall    │    🔥 DDoS attack
   🛡️ IDS         │    🔥 C2 traffic
                  │
 Defensive ←──────┼──────→ Offensive
                  │
   🛡️ Antivirus   │    🔥 Ransomware
                  │    🔥 Malware
                  ↓
             Host-focused
```

Real embeddings have 384+ dimensions, which may capture aspects such as:
- Topic (security vs business vs personal)
- Sentiment (threat vs benign)
- Specificity (general vs technical)
- Entity types (tool vs technique vs actor)
- And hundreds more nuances we can't name!

### What The 384 Dimensions Might Capture

| Dimension Range | Possible Meaning (speculative) |
|-----------------|-------------------------------|
| Dims 1-50 | Basic syntax and structure |
| Dims 51-150 | Topic and domain (security, finance, etc.) |
| Dims 151-250 | Sentiment and intent |
| Dims 251-350 | Entity relationships |
| Dims 351-384 | Fine-grained distinctions |

**Important**: This table is an illustration, not a measurement. The dimensions are learned, not designed, and in practice each aspect of meaning is spread across many dimensions rather than confined to a neat range.

### Why This Matters for Security

The magic is that **you don't need to understand every dimension** - you just need to know that:

1. Similar threats → similar embeddings → easy to find
2. Novel threats → unusual embeddings → anomaly detection
3. Related IOCs → close embeddings → threat correlation
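
Point 2 can be sketched as a nearest-neighbor check: if a new item's best cosine similarity to everything already seen is low, flag it. The 2D vectors, the `known` dictionary, the helper names, and the 0.9 threshold are all assumptions for illustration. With real embeddings the threshold would need tuning on your own data.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical embeddings of previously seen events (toy 2D values)
known = {
    "ransomware": [0.95, 0.10],
    "malware": [0.90, 0.20],
    "firewall alert": [0.10, 0.90],
}

def nearest(vector, known_vectors):
    """Return (name, similarity) for the most similar known item."""
    return max(
        ((name, cosine(vector, v)) for name, v in known_vectors.items()),
        key=lambda pair: pair[1],
    )

def is_anomalous(vector, known_vectors, threshold=0.9):
    """Flag a vector whose best match falls below the threshold."""
    _, best_similarity = nearest(vector, known_vectors)
    return best_similarity < threshold

familiar = [0.92, 0.15]   # points almost the same way as malware/ransomware
novel = [0.60, 0.60]      # sits between the known clusters

for label, vec in [("familiar", familiar), ("novel", novel)]:
    name, sim = nearest(vec, known)
    print(f"{label:9} nearest={name:15} sim={sim:.3f} anomalous={is_anomalous(vec, known)}")
```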

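## Step 4: Comparing Real Embeddings

Compare pairs of texts with a cosine similarity matrix to see which ones the model considers related.

In [None]: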
# Calculate similarity matrix
similarity_matrix = cosine_similarity(embeddings)

print("Similarity Matrix (security vs business texts):")
print("=" * 70)

# Show select comparisons
test_pairs = [
    (0, 1),  # Malware vs Ransomware
    (3, 7),  # Password stealing vs Mimikatz (same concept!)
    (0, 4),  # Malware vs Quarterly report (unrelated)
    (2, 6),  # Phishing vs C2 beacon (both attacks)
    (4, 5),  # Business texts (both non-security)
]

for i, j in test_pairs:
    sim = similarity_matrix[i][j]
    interpretation = "🟢 Similar" if sim > 0.5 else "🔴 Different"
    print(f"{interpretation} ({sim:.3f})")
    print(f"   '{security_texts[i][:40]}...'")
    print(f"   '{security_texts[j][:40]}...'")
    print()

## Step 5: Building Semantic Search

The real power: find related content by meaning, not just keywords!

In [None]:
def semantic_search(query: str, documents: List[str], top_k: int = 3) -> List[Tuple[str, float]]:
    """
    Find most similar documents to a query using embeddings.

    This is the foundation of RAG (Retrieval Augmented Generation)!

    Args:
        query: Search query
        documents: List of documents to search
        top_k: Number of results to return

    Returns:
        List of (document, similarity) tuples
    """
    # Embed the query
    query_embedding = model.encode([query])

    # Embed all documents
    doc_embeddings = model.encode(documents)

    # Calculate similarities
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

    # Get top-k indices
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Cast NumPy scalars to plain floats to match the declared return type
    return [(documents[i], float(similarities[i])) for i in top_indices]

# Test semantic search
threat_docs = [
    "The attacker used Mimikatz to extract credentials from LSASS memory",
    "Ransomware encrypted all files with .locked extension",
    "Phishing email contained a malicious Word document with macros",
    "Lateral movement detected via PsExec to multiple hosts",
    "C2 communication established over DNS tunneling",
    "Quarterly financial results exceeded expectations",
]

print("🔍 SEMANTIC SEARCH DEMO")
print("=" * 60)

query = "attacker stealing passwords"
print(f"\nQuery: '{query}'")
print("-" * 60)

results = semantic_search(query, threat_docs)
for i, (doc, score) in enumerate(results, 1):
    print(f"{i}. ({score:.3f}) {doc[:60]}...")
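
`semantic_search` re-embeds every document on each call. That is fine for six documents but wasteful for a larger corpus. A common pattern is to embed documents once, store unit-length vectors, and score each query with a dot product, which equals cosine similarity for unit vectors. The sketch below uses hand-written 3D vectors in place of `model.encode` output so it runs on its own. The `VectorIndex` class and its method names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

class VectorIndex:
    """Stores documents alongside pre-normalized vectors (hypothetical helper)."""

    def __init__(self) -> None:
        self.docs: List[str] = []
        self.vectors: List[List[float]] = []

    def add(self, doc: str, vector: Sequence[float]) -> None:
        self.docs.append(doc)
        self.vectors.append(normalize(vector))  # normalize once, at insert time

    def search(self, query_vector: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
        q = normalize(query_vector)
        # Dot product of unit vectors == cosine similarity
        scored = [
            (doc, sum(a * b for a, b in zip(q, vec)))
            for doc, vec in zip(self.docs, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

# Toy stand-ins for real embeddings (made-up values)
index = VectorIndex()
index.add("credential dumping", [0.9, 0.1, 0.0])
index.add("ransomware encryption", [0.6, 0.7, 0.1])
index.add("financial results", [0.0, 0.1, 0.9])

results = index.search([1.0, 0.2, 0.0], top_k=2)
for doc, score in results:
    print(f"({score:.3f}) {doc}")
```

With sentence-transformers, the same idea means calling `model.encode(documents)` once and reusing the result across queries. Passing `normalize_embeddings=True` to `encode` returns unit-length vectors directly.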

## 🎉 Key Takeaways

1. **Embeddings capture meaning** - Similar text → similar vectors
2. **Cosine similarity** - Standard way to compare embeddings (ranges from -1 to 1; text embeddings mostly score between 0 and 1)
3. **Semantic search** - Find by meaning, not exact words
4. **Dimension matters** - More dimensions can capture more nuance, but cost more memory and compute
5. **Foundation for RAG** - Embeddings power retrieval in RAG systems
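
The cost side of takeaway 4 is easy to estimate: stored as 32-bit floats, each embedding takes 4 bytes per dimension. A rough sketch follows; the one-million-document corpus is an assumed example size, and `index_size_mb` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4

def index_size_mb(num_vectors: int, dims: int) -> float:
    """Approximate raw storage for float32 embeddings, in MB (10^6 bytes)."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1_000_000

small = index_size_mb(1_000_000, 384)    # 384 dimensions, as in all-MiniLM-L6-v2
large = index_size_mb(1_000_000, 1536)   # a 1536-dimension model

print(f"384 dims:  {small:,.0f} MB")
print(f"1536 dims: {large:,.0f} MB")
```

Four times the dimensions means four times the storage, and roughly four times the arithmetic per similarity comparison.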

### Similarity Score Guide

Rough rules of thumb; useful thresholds vary by model and domain:

```
1.0  = Identical meaning
0.8+ = Very similar (synonyms, same topic)
0.5-0.8 = Related
0.3-0.5 = Loosely related
<0.3 = Unrelated
```
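
The guide can be turned into a small helper for labeling search results. The function name `describe_similarity` is hypothetical, and the cutoffs simply mirror the guide above.

```python
def describe_similarity(score: float) -> str:
    """Map a cosine similarity score to the rough labels in the guide."""
    if score >= 0.8:
        return "Very similar"
    if score >= 0.5:
        return "Related"
    if score >= 0.3:
        return "Loosely related"
    return "Unrelated"

labels = [describe_similarity(s) for s in (0.92, 0.61, 0.35, 0.05)]
print(labels)
```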

## Next Steps

- **Lab 06**: Build a full RAG system with ChromaDB
- **Lab 07**: Use embeddings to find similar malware patterns
- **Lab 16**: Use embeddings for threat actor clustering