# Lab 04: ML Concepts Primer

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab04_ml_concepts.ipynb)

Understand machine learning theory before coding. No coding required: just run the cells and focus on the concepts!

## Learning Objectives
- Supervised vs unsupervised learning
- Features and labels
- Training, validation, and testing
- Evaluation metrics

**Next:** Lab 31 (Prompt Engineering) or Lab 21 (Hello World ML)

## 1. What is Machine Learning?

Machine learning means teaching computers to learn patterns from data instead of following explicitly programmed rules.

**Traditional Programming:**
```
Rules + Data → Program → Output
```

**Machine Learning:**
```
Data + Expected Output → ML Algorithm → Model (learned rules)
```
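
To make the contrast concrete, here is a minimal sketch (not part of the original lab). The thresholds in `rule_based`, the six toy emails, and the `[url_count, urgent_words]` feature layout are all made up for illustration. The learned model is scikit-learn's `DecisionTreeClassifier`, picked only because its learned rules can be printed with `export_text`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text


# Traditional programming: a human writes the rule by hand.
def rule_based(url_count, urgent_words):
    # Hypothetical, hand-picked thresholds
    if url_count >= 3 and urgent_words >= 2:
        return "phishing"
    return "legitimate"


# Machine learning: the algorithm derives rules from data + expected output.
X_toy = [[5, 3], [1, 0], [8, 5], [0, 1], [6, 4], [1, 1]]  # [url_count, urgent_words]
y_toy = ["phishing", "legitimate", "phishing", "legitimate", "phishing", "legitimate"]

tree = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

print("Hand-written rule:", rule_based(7, 4))
print("Learned model:    ", tree.predict([[7, 4]])[0])
print(export_text(tree, feature_names=["url_count", "urgent_words"]))
```

Both approaches agree on this example, but only the second one would adapt if you retrained it on new labeled emails.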

## 2. Supervised Learning

Learning from labeled examples - **like studying with an answer key!**

### 📚 The Metaphor: Learning with a Teacher

Think of supervised learning like a student studying for an exam with a study guide that has **questions AND answers**:

```
┌─────────────────────────────────────────────────────────────────┐
│  SUPERVISED LEARNING = Learning with an Answer Key              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Student sees:                        Teacher provides:        │
│   ┌──────────────┐                    ┌──────────────┐         │
│   │ Email text   │ ─────────────────► │ "phishing"   │         │
│   │ with 5 URLs  │                    │              │         │
│   └──────────────┘                    └──────────────┘         │
│                                                                 │
│   ┌──────────────┐                    ┌──────────────┐         │
│   │ Email from   │ ─────────────────► │ "legitimate" │         │
│   │ known sender │                    │              │         │
│   └──────────────┘                    └──────────────┘         │
│                                                                 │
│   After seeing 1000s of examples, the student learns:          │
│   "Lots of URLs + urgency = probably phishing!"                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### 🎯 Real-World Analogy

| Scenario | Input (Features) | Output (Label) |
|----------|------------------|----------------|
| Medical diagnosis | Symptoms, test results | Disease name |
| Spam filter | Email text, sender | Spam / Not spam |
| Malware detection | File behavior, imports | Malware / Benign |
| Fraud detection | Transaction details | Fraudulent / Legitimate |

**The key insight**: Someone had to manually label the training data first. In security, this often means analysts spending hours classifying threats!

In [None]:
# Visualization of supervised learning
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
np.random.seed(42)
legitimate = np.random.randn(50, 2) + [2, 2]
phishing = np.random.randn(50, 2) + [5, 5]

plt.figure(figsize=(8, 6))
plt.scatter(legitimate[:, 0], legitimate[:, 1], c="green", label="Legitimate", alpha=0.7)
plt.scatter(phishing[:, 0], phishing[:, 1], c="red", label="Phishing", alpha=0.7)
plt.xlabel("Feature 1 (e.g., URL count)")
plt.ylabel("Feature 2 (e.g., urgency words)")
plt.title("Supervised Learning: Labeled Data")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
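
The plot only shows the labeled data. As a rough sketch of the actual "learning" step, the cell below fits a classifier to similar synthetic points and asks it about two new emails. The choice of `LogisticRegression`, the 0/1 label encoding, and the two `new_emails` feature vectors are assumptions made for illustration, not something the lab prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic labeled data: two blobs, as in the scatter plot
legit_pts = rng.normal(loc=[2, 2], scale=1.0, size=(50, 2))
phish_pts = rng.normal(loc=[5, 5], scale=1.0, size=(50, 2))

X_sup = np.vstack([legit_pts, phish_pts])
y_sup = np.array([0] * 50 + [1] * 50)  # assumed encoding: 0 = legitimate, 1 = phishing

# "Studying with the answer key": fit on features AND labels
clf = LogisticRegression().fit(X_sup, y_sup)

# Two made-up new emails, described by the same two features
new_emails = np.array([[1.0, 1.5], [6.0, 5.5]])
print("Predicted labels:", clf.predict(new_emails))
print("Class probabilities:\n", clf.predict_proba(new_emails).round(2))
```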

## 3. Unsupervised Learning

Finding patterns without labels - **like sorting a messy closet!**

### 📚 The Metaphor: Organizing Without Instructions

Imagine you're handed a box of 1000 random objects and told: "Sort these into groups." No one tells you what the groups should be - you just find natural groupings yourself:

```
┌─────────────────────────────────────────────────────────────────┐
│  UNSUPERVISED LEARNING = Organizing Without Instructions        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Before (raw data):           After (discovered clusters):    │
│   ┌─────────────────┐          ┌─────────────────┐             │
│   │ 🎾⚽🏀⚾🎱📱     │          │ Sports:         │             │
│   │ 💻🖱️📺🏈🎿🎮     │    →     │ 🎾⚽🏀⚾🏈🎿     │             │
│   │ 🖨️📻📱🎿🖥️      │          │                 │             │
│   └─────────────────┘          │ Electronics:    │             │
│                                │ 📱💻🖱️📺🎮🖨️📻🖥️ │             │
│                                └─────────────────┘             │
│                                                                 │
│   The algorithm separated sports gear from electronics         │
│   without labels; naming the groups is left to a human.        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### 🔍 Supervised vs Unsupervised: Side by Side

| Aspect | Supervised | Unsupervised |
|--------|------------|--------------|
| **Analogy** | Learning with answer key | Exploring without instructions |
| **Data needed** | Features + Labels | Features only |
| **Goal** | Predict known categories | Discover hidden patterns |
| **Example** | "Is this malware?" | "What groups exist in my data?" |
| **Use when** | You know what to look for | You don't know what's there |

### 🎯 Security Use Cases for Unsupervised Learning

1. **Threat Hunting**: Group similar network connections → find anomalies
2. **Malware Families**: Cluster samples → discover new variants
3. **User Behavior**: Group activity patterns → detect compromised accounts
4. **Log Analysis**: Find unusual log patterns you didn't know to look for

In [None]:
from sklearn.cluster import KMeans

# Generate unlabeled data
np.random.seed(42)
cluster1 = np.random.randn(30, 2) + [0, 0]
cluster2 = np.random.randn(30, 2) + [4, 4]
cluster3 = np.random.randn(30, 2) + [0, 4]
data = np.vstack([cluster1, cluster2, cluster3])

# Apply clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(data)

plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap="viridis", alpha=0.7)
plt.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    c="red",
    marker="X",
    s=200,
    label="Centers",
)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Unsupervised Learning: Clustering")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
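
Clustering can also support the "find anomalies" use cases listed above. One common approach, sketched here under stated assumptions, is to learn clusters from baseline activity and then rank observations by their distance to the nearest cluster center. The synthetic "traffic", the single injected outlier, and the choice to review the top three are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Hypothetical baseline activity: two tight groups of "normal" connections
baseline = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(30, 2)),
])
odd_point = np.array([[10.0, -5.0]])  # made-up anomalous connection
observed = np.vstack([baseline, odd_point])  # the outlier is row 60

# Learn what "normal" looks like from the baseline only
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(baseline)

# transform() gives the distance to every center; keep the nearest one
nearest_dist = km.transform(observed).min(axis=1)

# Rank by distance and hand the top few to an analyst
suspects = np.argsort(nearest_dist)[::-1][:3]
print("Rows to review first:", suspects)
print("Their distances:", nearest_dist[suspects].round(2))
```

The ranking is a triage aid, not a verdict: the rows after the first one are ordinary points that merely sit at the edge of their cluster.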

## 4. Features and Labels

**Features (X):** Input variables the model learns from
- Email: word count, URL count, sender domain
- Malware: file size, imports, entropy
- Network: bytes sent, packet count, duration

**Labels (y):** What we want to predict
- Classification: phishing/legitimate, malware/benign
- Regression: threat score (0-10)

In [None]:
import pandas as pd

# Example feature table
data = {
    "email_length": [150, 500, 200, 1000],
    "url_count": [5, 1, 8, 0],
    "urgent_words": [3, 0, 5, 1],
    "label": ["phishing", "legitimate", "phishing", "legitimate"],
}

df = pd.DataFrame(data)
print("Features (X) and Labels (y):")
print(df)
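
Most libraries expect features and labels as separate objects. A minimal sketch of that split, using the same four example emails (the name `emails` and the 0/1 encoding of the label are assumptions chosen for illustration):

```python
import pandas as pd

emails = pd.DataFrame({
    "email_length": [150, 500, 200, 1000],
    "url_count": [5, 1, 8, 0],
    "urgent_words": [3, 0, 5, 1],
    "label": ["phishing", "legitimate", "phishing", "legitimate"],
})

# Features (X): every column except the label
X_feat = emails.drop(columns=["label"])

# Labels (y): the column we want to predict, encoded as numbers
y_lab = emails["label"].map({"legitimate": 0, "phishing": 1})

print(X_feat.shape)    # (4, 3)
print(y_lab.tolist())  # [1, 0, 1, 0]
```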

## 5. Train/Test Split

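**Why split data?**
- Training set: The model learns from this
- Test set: Held back so the model can be evaluated on unseen data
- Validation set (optional third split): Used to tune model settings so the test set stays untouched
- Reveals overfitting (memorizing instead of learning): a model that only memorized scores far worse on the test set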

In [None]:
from sklearn.model_selection import train_test_split

# Generate sample data
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
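
The learning objectives also mention validation. One way to get a three-way split, sketched here as an assumption rather than the lab's prescribed method, is to call `train_test_split` twice. The 60/20/20 proportions are a common but arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_all = rng.normal(size=(100, 2))
y_all = (X_all[:, 0] + X_all[:, 1] > 0).astype(int)

# Step 1: hold out 20% as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Tune on the validation set as often as you like; touch the test set only once, at the very end.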

### ⚠️ The Danger of Overfitting: Memorizing vs Learning

**Overfitting** is when a model memorizes the training data instead of learning general patterns. It's like a student who memorizes test answers without understanding the concepts.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    OVERFITTING: The Core Problem                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   GOOD MODEL (Generalization)         BAD MODEL (Overfitting)               │
│   ─────────────────────────           ────────────────────────              │
│                                                                             │
│   Training: 95% accuracy              Training: 99.9% accuracy              │
│   Test:     93% accuracy              Test:     60% accuracy                │
│                                                                             │
│   ✅ Similar performance =            ❌ Huge gap = PROBLEM!                │
│      learned real patterns                memorized training data           │
│                                                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   📚 ANALOGY: The Cheating Student                                          │
│                                                                             │
│   Imagine two students studying for a phishing detection test:              │
│                                                                             │
│   Student A (Good model):                                                   │
│   "I notice phishing emails often have urgency, suspicious links,          │
│    and grammar mistakes. I'll look for those patterns."                    │
│   → Does well on practice tests AND the real exam ✅                       │
│                                                                             │
│   Student B (Overfitting):                                                  │
│   "I'll memorize every exact practice question word-for-word."             │
│   → Perfect on practice tests, fails the real exam ❌                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### 🎯 Why This Matters for Security

In security ML, overfitting is **dangerous**:

| Scenario | What happens with overfitting |
|----------|------------------------------|
| Malware detection | Model memorizes training malware signatures, misses new variants |
| Phishing detection | Model memorizes specific phishing emails, misses new campaigns |
| Intrusion detection | Model learns training attack patterns, misses novel attacks |

**The safeguard**: Always evaluate on data the model has never seen (the test set), so memorization shows up before deployment!

In [None]:
# Visual demonstration of overfitting vs good fit
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

np.random.seed(42)

# Generate some noisy data (imagine: file_size vs threat_score)
X_demo = np.linspace(0, 10, 15).reshape(-1, 1)
y_demo = 2 * np.sin(X_demo).ravel() + np.random.normal(0, 0.5, 15)  # True pattern + noise

# Dense grid of x values used to draw each model's prediction curve
X_test_demo = np.linspace(0, 10, 100).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Model 1: Underfitting (too simple - straight line)
model_simple = LinearRegression()
model_simple.fit(X_demo, y_demo)
y_simple = model_simple.predict(X_test_demo)

axes[0].scatter(X_demo, y_demo, c='blue', s=60, label='Training data', zorder=5)
axes[0].plot(X_test_demo, y_simple, 'r-', linewidth=2, label='Model prediction')
axes[0].set_title('❌ UNDERFITTING\n(Too simple - misses pattern)', fontsize=11)
axes[0].set_xlabel('Feature (e.g., file size)')
axes[0].set_ylabel('Target (e.g., threat score)')
axes[0].legend(fontsize=8)
axes[0].grid(True, alpha=0.3)

# Model 2: Good fit (just right)
model_good = make_pipeline(PolynomialFeatures(3), LinearRegression())
model_good.fit(X_demo, y_demo)
y_good = model_good.predict(X_test_demo)

axes[1].scatter(X_demo, y_demo, c='blue', s=60, label='Training data', zorder=5)
axes[1].plot(X_test_demo, y_good, 'g-', linewidth=2, label='Model prediction')
axes[1].set_title('✅ GOOD FIT\n(Captures pattern, ignores noise)', fontsize=11)
axes[1].set_xlabel('Feature (e.g., file size)')
axes[1].legend(fontsize=8)
axes[1].grid(True, alpha=0.3)

# Model 3: Overfitting (too complex - memorizes noise)
model_overfit = make_pipeline(PolynomialFeatures(12), LinearRegression())
model_overfit.fit(X_demo, y_demo)
y_overfit = model_overfit.predict(X_test_demo)

axes[2].scatter(X_demo, y_demo, c='blue', s=60, label='Training data', zorder=5)
axes[2].plot(X_test_demo, y_overfit, 'orange', linewidth=2, label='Model prediction')
axes[2].set_title('❌ OVERFITTING\n(Memorizes noise, fails on new data)', fontsize=11)
axes[2].set_xlabel('Feature (e.g., file size)')
axes[2].legend(fontsize=8)
axes[2].grid(True, alpha=0.3)
axes[2].set_ylim(-4, 4)  # Limit y-axis to show the wild oscillations

plt.tight_layout()
plt.suptitle('The Goldilocks Problem: Finding the Right Model Complexity', y=1.02, fontsize=13)
plt.show()

print("\n📖 What You're Seeing:")
print("=" * 60)
print("  • Blue dots: Training data (what the model learned from)")
print("  • Lines: What the model predicts")
print()
print("  LEFT (Underfitting): Model too simple")
print("    → Misses the underlying pattern entirely")
print("    → Bad on training data AND test data")
print()
print("  MIDDLE (Good Fit): Model complexity just right")
print("    → Captures the real pattern")
print("    → Ignores random noise in training data")
print("    → Works well on NEW data!")
print()
print("  RIGHT (Overfitting): Model too complex")
print("    → Bends to chase individual training points (including noise!)")
print("    → Wild predictions between training points")
print("    → Fails badly on new data")
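
The plots show overfitting visually; the train/test gap can also be measured. The sketch below refits the same three polynomial degrees and compares mean squared error on the training points against fresh points drawn from the same noisy sine curve. The 200 held-out points and the helper `noisy_sine` are assumptions added for illustration, and the exact numbers depend on the random seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)


def noisy_sine(x):
    """True pattern (2*sin) plus random noise, as in the demo above."""
    return 2 * np.sin(x) + rng.normal(0, 0.5, size=x.shape)


x_tr = np.linspace(0, 10, 15)
x_te = rng.uniform(0, 10, 200)  # fresh points the models never saw
y_tr, y_te = noisy_sine(x_tr), noisy_sine(x_te)

results = {}
for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr.reshape(-1, 1), y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(x_tr.reshape(-1, 1)))
    test_mse = mean_squared_error(y_te, model.predict(x_te.reshape(-1, 1)))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:>2}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Look at the gap between the two columns rather than at either number alone: a training error far below the test error is the numeric signature of memorization.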

## 6. Evaluation Metrics

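For classification (TP = true positives, FP = false positives, FN = false negatives, TN = true negatives):
- **Accuracy**: Fraction of all predictions that are correct: (TP + TN) / total
- **Precision**: Of predicted positives, how many are correct? TP / (TP + FP)
- **Recall**: Of actual positives, how many did we find? TP / (TP + FN)
- **F1 Score**: Harmonic mean of precision and recall, high only when both are high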

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# Example predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # Actual labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]  # Model predictions

print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=["Benign", "Malicious"]))
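
To see where the numbers in the report come from, the same example can be worked by hand. This sketch counts true/false positives and negatives directly and applies the standard definitions; the variable names are only illustrative, and class 1 ("Malicious") is treated as the positive class.

```python
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # Actual labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]  # Model predictions

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # malicious, caught
fp = sum(t == 0 and p == 1 for t, p in pairs)  # benign, flagged (false alarm)
fn = sum(t == 1 and p == 0 for t, p in pairs)  # malicious, missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign, left alone

acc = (tp + tn) / len(pairs)
prec = tp / (tp + fp)
rec = tp / (tp + fn)
f1 = 2 * prec * rec / (prec + rec)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=4 FP=1 FN=1 TN=4
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# accuracy=0.80 precision=0.80 recall=0.80 f1=0.80
```

These match the "Malicious" row of the classification report, since one threat was missed (FN) and one benign sample was flagged (FP).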

## 7. Security Context: Why Metrics Matter

**High Recall needed when:**
- Missing a threat is costly (malware detection)
- Better to have false alarms than miss attacks

**High Precision needed when:**
- False positives are expensive (blocking legitimate users)
- Alert fatigue is a concern

In [None]:
# Visualization of the precision vs recall tradeoff (hand-picked illustrative values, not model output)
thresholds = np.linspace(0.1, 0.9, 9)
precision = [0.5, 0.55, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.98]
recall = [0.98, 0.95, 0.9, 0.85, 0.75, 0.65, 0.5, 0.35, 0.2]

plt.figure(figsize=(8, 6))
plt.plot(thresholds, precision, "b-o", label="Precision")
plt.plot(thresholds, recall, "r-o", label="Recall")
plt.xlabel("Detection Threshold")
plt.ylabel("Score")
plt.title("Precision vs Recall Tradeoff")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
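
The curves above are drawn from illustrative numbers. The same tradeoff can be reproduced from detector scores by sweeping the threshold, as in this sketch. The score distributions are synthetic and entirely made up; a real detector's scores would come from something like `predict_proba`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)

# Hypothetical detector scores: malicious samples tend to score higher
labels_true = np.array([0] * 200 + [1] * 200)
scores = np.concatenate([
    rng.normal(0.35, 0.15, 200),  # benign
    rng.normal(0.65, 0.15, 200),  # malicious
])

recalls = []
for threshold in (0.3, 0.5, 0.7):
    flagged = (scores >= threshold).astype(int)  # alert when score crosses threshold
    p = precision_score(labels_true, flagged, zero_division=0)
    r = recall_score(labels_true, flagged)
    recalls.append(r)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold can only shrink the set of flagged samples, so recall never goes up; precision usually rises in exchange, which is the tradeoff to weigh against the security context.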

## Key Takeaways

1. **Supervised**: Learn from labeled examples
2. **Unsupervised**: Find patterns without labels
3. **Features**: Input data the model uses
4. **Train/Test Split**: Evaluate on unseen data
5. **Metrics**: Choose based on security context

## Next Steps
- **Lab 31**: Prompt Engineering Mastery
- **Lab 29**: Build your first classifier!