# Tutorial 01: Clarifying Requirements for ML Systems

Welcome to Tutorial 01 in our ML System Design series! This tutorial focuses on the critical first step in any ML system design: **clarifying requirements**.

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Ask** clarifying questions across the six requirement categories
2. **Translate** business objectives to technical requirements
3. **Document** system constraints and scale requirements
4. **Create** comprehensive requirements specifications

---

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dataclasses import dataclass, field
from typing import List, Dict
from enum import Enum
from datetime import datetime
from IPython.display import display  # explicit import so display() also works outside a notebook

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
print("Setup complete!")

## Why Requirements Matter

**The Problem with Skipping Requirements:**
- Building the wrong system
- Over-engineering or under-engineering
- Missing critical constraints

**Benefits of Proper Requirements Gathering:**
- Clear success criteria
- Appropriate system complexity
- Feasibility assessment

In [None]:
# Visualization: relative cost of fixing issues at each stage (illustrative values, not measured data)
fig, ax = plt.subplots(figsize=(10, 5))

stages = ['Requirements', 'Design', 'Development', 'Testing', 'Production']
relative_cost = [1, 5, 10, 50, 200]
colors = plt.cm.Reds(np.linspace(0.3, 0.9, len(stages)))

ax.bar(stages, relative_cost, color=colors)
ax.set_ylabel('Relative Cost to Fix Issues')
ax.set_title('Cost of Fixing Issues by Stage')
ax.set_yscale('log')

for i, cost in enumerate(relative_cost):
    ax.text(i, cost * 1.2, f'{cost}x', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

---

## 1. Business Objective Questions

The first category focuses on understanding the **WHY** - what is the business trying to achieve?

In [None]:
# Business objective questions framework
business_questions = {
    "Primary Objective": [
        "What is the main business goal this ML system should achieve?",
        "How does this align with the company's overall strategy?",
        "What happens if we don't build this system?",
    ],
    "Success Metrics": [
        "How will we measure success?",
        "What's the current baseline for these metrics?",
        "What improvement would be considered a success?",
    ],
    "Trade-offs": [
        "Revenue vs user experience - which is prioritized?",
        "Short-term gains vs long-term user retention?",
        "Accuracy vs speed - what's more important?",
    ]
}

print("Business Objective Questions Framework")
print("=" * 50)
for category, questions in business_questions.items():
    print(f"\n{category}:")
    for i, q in enumerate(questions, 1):
        print(f"   {i}. {q}")

In [None]:
# Mapping Business Objectives to ML Metrics
objective_mapping = pd.DataFrame({
    'Business Objective': ['Increase Revenue', 'Improve Engagement', 'Reduce Churn', 'Improve Safety'],
    'Proxy Metrics': ['Conversion rate, GMV', 'DAU, Session time', 'Retention rate', 'False negative rate'],
    'ML Target': ['Purchase probability', 'Engagement score', 'Churn probability', 'Risk score'],
    'Example System': ['Product recommendation', 'Content recommendation', 'Churn prediction', 'Content moderation']
})

print("Business Objective to ML Metric Mapping:")
display(objective_mapping)
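
The Success Metrics questions (current baseline, improvement that counts as success) can be written down as numbers rather than prose. The following is a minimal sketch, assuming a hypothetical `SuccessMetric` class, a metric where higher is better, and made-up baseline values:

```python
from dataclasses import dataclass

@dataclass
class SuccessMetric:
    """Hypothetical record of one success metric (illustrative only)."""
    name: str
    baseline: float      # current value of the metric
    target_lift: float   # desired relative improvement, e.g. 0.15 for +15%

    @property
    def target(self) -> float:
        # Target value implied by the baseline and the relative lift
        return self.baseline * (1 + self.target_lift)

    def is_met(self, observed: float) -> bool:
        # Assumes higher is better: success means reaching or exceeding the target
        return observed >= self.target

# Made-up example: click-through rate with a 4% baseline and a +15% goal
ctr = SuccessMetric(name="CTR", baseline=0.040, target_lift=0.15)
print(f"{ctr.name}: baseline={ctr.baseline:.3f}, target={ctr.target:.3f}")
print("Met at 0.045?", ctr.is_met(0.045))
print("Met at 0.047?", ctr.is_met(0.047))
```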

---

## 2. Feature Requirements

Understand **WHAT** features the system needs to support.

In [None]:
@dataclass
class FeatureRequirement:
    """Represents a feature requirement for an ML system"""
    name: str
    description: str
    priority: str  # 'P0' (must-have), 'P1' (should-have), 'P2' (nice-to-have)
    user_interaction: str
    ml_implication: str

# Example features for a recommendation system
features = [
    FeatureRequirement("Personalized Feed", "Show personalized content on home page", "P0",
                      "User opens app", "Need user embeddings, real-time serving"),
    FeatureRequirement("Like/Dislike", "Allow explicit feedback", "P0",
                      "User clicks buttons", "Explicit feedback signals"),
    FeatureRequirement("Similar Items", "Show similar content", "P1",
                      "User views item", "Item similarity model needed"),
]

features_df = pd.DataFrame([{'Feature': f.name, 'Priority': f.priority, 
                             'ML Implication': f.ml_implication} for f in features])
print("Feature Requirements:")
display(features_df)
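
Priorities become useful once they are used to cut scope. As a small sketch (the grouping helper is an illustration, not part of the tutorial's classes), the feature names above can be grouped by priority so the P0 set is visible at a glance:

```python
from collections import defaultdict

# (name, priority) pairs copied from the feature list above
feature_priorities = [
    ("Personalized Feed", "P0"),
    ("Like/Dislike", "P0"),
    ("Similar Items", "P1"),
]

# Group feature names by priority level
by_priority = defaultdict(list)
for name, priority in feature_priorities:
    by_priority[priority].append(name)

# P0 features form the minimum scope the first version must support
for priority in sorted(by_priority):
    print(f"{priority}: {', '.join(by_priority[priority])}")
```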

---

## 3. Data Questions

Understanding your data is crucial - you can't build ML without data!

In [None]:
data_questions = {
    "Data Availability": [
        "What data do we currently have?",
        "What data can we collect in the future?",
        "Are there data access restrictions?",
    ],
    "Data Labeling": [
        "Is the data labeled? If not, how can we get labels?",
        "Can we use implicit labels? (e.g., clicks as positive signals)",
        "What's the labeling cost and timeline?",
    ],
    "Data Volume": [
        "How much data do we have? (rows, size in GB/TB)",
        "What's the data growth rate?",
        "How far back do we have historical data?",
    ]
}

print("Data Questions to Ask")
print("=" * 50)
for category, questions in data_questions.items():
    print(f"\n{category}:")
    for i, q in enumerate(questions, 1):
        print(f"   {i}. {q}")
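
The implicit-label question ("clicks as positive signals") can be made concrete with a tiny example. This is a sketch under stated assumptions: the log format, the event names, and the 10-second watch threshold are all hypothetical choices for illustration.

```python
# Hypothetical interaction log: (user_id, item_id, event, watch_seconds)
interaction_log = [
    ("u1", "i1", "click", 120),
    ("u1", "i2", "impression", 0),
    ("u2", "i1", "click", 3),
    ("u2", "i3", "impression", 0),
]

MIN_WATCH_SECONDS = 10  # assumed threshold to filter out accidental clicks

def implicit_label(event: str, watch_seconds: int) -> int:
    """Return 1 for an engaged click, 0 otherwise."""
    return int(event == "click" and watch_seconds >= MIN_WATCH_SECONDS)

# Attach a 0/1 label to every logged interaction
labeled = [
    (user, item, implicit_label(event, secs))
    for user, item, event, secs in interaction_log
]
positive_rate = sum(label for _, _, label in labeled) / len(labeled)

print(labeled)
print(f"Positive rate: {positive_rate:.0%}")
```

Filtering short clicks is one possible way to reduce label noise; the right threshold depends on the product and should itself be a clarifying question.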

In [None]:
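# Data Volume Estimation (back-of-the-envelope)
def estimate_data_volume(users, items, interactions_per_day, days=365):
    """Estimate raw storage in GiB. interactions_per_day is per user; record sizes are rough assumptions."""
    bytes_per_interaction = 50   # e.g. IDs, timestamp, event type
    bytes_per_user = 1024        # ~1 KB of profile data per user
    bytes_per_item = 5 * 1024    # ~5 KB of metadata per item
    total_interactions = users * interactions_per_day * days
    interaction_gb = (total_interactions * bytes_per_interaction) / (1024**3)
    user_gb = (users * bytes_per_user) / (1024**3)
    item_gb = (items * bytes_per_item) / (1024**3)
    return {'interactions': total_interactions, 'total_gb': interaction_gb + user_gb + item_gb}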

scenarios = [
    {"name": "Startup", "users": 10_000, "items": 1_000, "interactions": 5},
    {"name": "Growth", "users": 1_000_000, "items": 100_000, "interactions": 10},
    {"name": "Scale", "users": 100_000_000, "items": 1_000_000, "interactions": 20},
]

results = []
for s in scenarios:
    est = estimate_data_volume(s["users"], s["items"], s["interactions"])
    results.append({'Scenario': s['name'], 'Users': f"{s['users']:,}", 
                   'Total Data (GB)': f"{est['total_gb']:,.1f}"})

print("Data Volume Estimation:")
display(pd.DataFrame(results))
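
The growth-rate question matters because storage compounds over time. A minimal sketch, assuming a constant monthly growth rate (real growth is rarely this smooth) and made-up starting numbers:

```python
def project_storage_gb(current_gb: float, monthly_growth: float, months: int) -> float:
    """Project storage assuming a constant compound monthly growth rate."""
    return current_gb * (1 + monthly_growth) ** months

# Made-up inputs: 500 GB today, growing 10% per month
for months in (6, 12, 24):
    projected = project_storage_gb(500, 0.10, months)
    print(f"After {months:>2} months: {projected:,.0f} GB")
```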

---

## 4. Constraint Questions

Understanding constraints helps define what's feasible.

In [None]:
@dataclass
class SystemConstraint:
    category: str
    constraint: str
    requirement: str
    impact: str

constraints = [
    SystemConstraint("Deployment", "Cloud vs On-Device", "Cloud preferred", "On-device would need model compression"),
    SystemConstraint("Performance", "Latency", "p99 < 200ms", "Need caching, simple models"),
    SystemConstraint("Performance", "Throughput", "10K RPS", "Horizontal scaling needed"),
    SystemConstraint("Resources", "Budget", "$50K/month", "Limits model complexity"),
]

constraints_df = pd.DataFrame([{'Category': c.category, 'Constraint': c.constraint,
                                'Requirement': c.requirement, 'Impact': c.impact} for c in constraints])
print("System Constraints:")
display(constraints_df)
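
A requirement such as "p99 < 200ms" is only useful if it can be checked against measurements. The sketch below uses a simple nearest-rank percentile and synthetic latencies; the `percentile` helper and the log-normal parameters are assumptions made for illustration, not measurements from a real service.

```python
import math
import random

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic latencies in milliseconds (log-normal, seeded for repeatability)
random.seed(0)
latencies_ms = [random.lognormvariate(3.5, 0.6) for _ in range(10_000)]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
P99_BUDGET_MS = 200  # from the "p99 < 200ms" requirement above

print(f"p50 = {p50:.1f} ms, p99 = {p99:.1f} ms")
print("Meets p99 budget:", p99 < P99_BUDGET_MS)
```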

In [None]:
# Trade-off Radar Chart
def plot_tradeoff_radar(requirements, title):
    categories = list(requirements.keys())
    values = list(requirements.values()) + [list(requirements.values())[0]]
    angles = [n / float(len(categories)) * 2 * np.pi for n in range(len(categories))]
    angles += angles[:1]
    
    fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
    ax.plot(angles, values, 'o-', linewidth=2)
    ax.fill(angles, values, alpha=0.25)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(categories)
    ax.set_ylim(0, 10)
    ax.set_title(title, fontweight='bold', pad=20)
    return fig

reqs = {'Latency': 9, 'Accuracy': 7, 'Personalization': 9, 'Scalability': 8,
        'Explainability': 4, 'Freshness': 6, 'Diversity': 7, 'Cost': 6}

fig = plot_tradeoff_radar(reqs, "Requirement Priorities (Higher = More Important)")
plt.show()

---

## 5. Scale Questions

Scale determines architecture decisions, infrastructure needs, and cost.

In [None]:
scale_questions = {
    "User Scale": ["How many total/daily active users?", "Expected growth rate?"],
    "Item Scale": ["How many items in catalog?", "How frequently are new items added?"],
    "Request Scale": ["Peak requests per second?", "How does traffic vary?"],
}

print("Scale Questions:")
for category, questions in scale_questions.items():
    print(f"\n{category}:")
    for q in questions:
        print(f"  - {q}")
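
When nobody can answer "Peak requests per second?" directly, it can be estimated from daily active users. This is a back-of-the-envelope sketch; the `peak_to_average` multiplier of 3 is an assumed default, and the example inputs are made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def estimate_peak_rps(daily_active_users, requests_per_user_per_day, peak_to_average=3.0):
    """Return (average_rps, peak_rps); peak_to_average is an assumed traffic multiplier."""
    average_rps = daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY
    return average_rps, average_rps * peak_to_average

# Made-up inputs: 10M daily active users, 20 requests each per day
avg_rps, peak_rps = estimate_peak_rps(
    daily_active_users=10_000_000, requests_per_user_per_day=20
)
print(f"Average: {avg_rps:,.0f} RPS, peak: {peak_rps:,.0f} RPS")
```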

In [None]:
# Scale Tiers
scale_tiers = pd.DataFrame({
    'Tier': ['Startup', 'Growth', 'Scale', 'Hyperscale'],
    'Users': ['<100K', '100K-1M', '1M-100M', '>100M'],
    'Architecture': ['Monolithic', 'Microservices', 'Distributed', 'Global multi-region'],
    'Model Complexity': ['Simple (sklearn)', 'Moderate (XGBoost)', 'Complex (DL)', 'Ensemble'],
    'Monthly Cost': ['$100-1K', '$1K-10K', '$10K-100K', '$100K+']
})

print("Scale Tiers and Implications:")
display(scale_tiers)
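
The tier table can be turned into a lookup so that a user count maps directly to a tier. A minimal sketch: the boundaries come from the 'Users' column above, while the choice to place boundary values in the higher tier is an assumption, since the table does not say.

```python
import bisect

# Tier boundaries taken from the 'Users' column above
TIER_BOUNDS = [100_000, 1_000_000, 100_000_000]
TIER_NAMES = ["Startup", "Growth", "Scale", "Hyperscale"]

def scale_tier(num_users: int) -> str:
    """Map a user count to a tier; exact boundary values go to the higher tier (assumption)."""
    return TIER_NAMES[bisect.bisect_right(TIER_BOUNDS, num_users)]

for n in (10_000, 500_000, 50_000_000, 2_000_000_000):
    print(f"{n:>13,} users -> {scale_tier(n)}")
```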

---

## 6. Performance Questions

In [None]:
performance_questions = {
    "Latency": ["Acceptable response time (p50, p99)?", "Real-time processing needed?"],
    "Freshness": ["How fresh do recommendations need to be?", "Can we use pre-computed results?"],
    "Accuracy": ["What accuracy level is acceptable?", "Cost of wrong predictions?"],
    "Reliability": ["Required uptime (99.9%, 99.99%)?", "Fallback mechanisms needed?"]
}

print("Performance Questions:")
for category, questions in performance_questions.items():
    print(f"\n{category}:")
    for q in questions:
        print(f"  - {q}")
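
Uptime targets are easier to discuss once they are converted into a downtime budget. The arithmetic is sketched below, assuming a 30-day month for simplicity:

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month

def downtime_budget_minutes(availability_pct: float) -> float:
    """Allowed downtime per 30-day month for a given availability target."""
    return MINUTES_PER_30_DAYS * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime -> {budget:.1f} min/month of downtime")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the target should be agreed on before choosing fallback mechanisms.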

In [None]:
# Processing Mode Comparison
processing = pd.DataFrame({
    'Aspect': ['Latency', 'Freshness', 'Cost', 'Use Cases'],
    'Real-Time': ['Milliseconds', 'Immediate', 'Higher', 'Recommendations, Fraud'],
    'Batch': ['Hours', 'Periodic', 'Lower', 'Email campaigns, Retraining'],
    'Hybrid': ['Varies', 'Mixed', 'Optimized', 'Pre-compute + real-time rank']
})

print("Processing Mode Comparison:")
display(processing)
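
The "Pre-compute + real-time rank" pattern in the Hybrid column can be sketched in a few lines. Everything here is hypothetical: the candidate table, the scores, the topic labels, and the `boost` value stand in for whatever a batch job and an online ranker would actually produce.

```python
# Batch step (assumed to run offline, e.g. nightly): candidates and base scores
precomputed_candidates = {"user_1": ["video_a", "video_b", "video_c"]}
base_scores = {"video_a": 0.70, "video_b": 0.65, "video_c": 0.40}  # made-up numbers
item_topics = {"video_a": "sports", "video_b": "cooking", "video_c": "cooking"}

def rank_for_request(user_id, session_topic, boost=0.2):
    """Real-time step: boost candidates matching the current session topic, then sort."""
    candidates = precomputed_candidates.get(user_id, [])  # unknown users get an empty list

    def score(item):
        bonus = boost if item_topics[item] == session_topic else 0.0
        return base_scores[item] + bonus

    return sorted(candidates, key=score, reverse=True)

print(rank_for_request("user_1", "cooking"))
print(rank_for_request("user_1", "sports"))
```

The expensive work (candidate generation) happens in batch, while the request path only applies a cheap adjustment using fresh session context.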

---

## 7. Complete Requirements Template

In [None]:
@dataclass
class MLSystemRequirements:
    """Complete requirements documentation"""
    system_name: str
    business_objective: str = ""
    success_metrics: List[str] = field(default_factory=list)
    features: List[Dict] = field(default_factory=list)
    data_sources: List[Dict] = field(default_factory=list)
    deployment_type: str = ""
    latency_requirement: str = ""
    num_users: int = 0
    num_items: int = 0
    requests_per_second: float = 0
    
    def summary(self):
        print(f"\n{'='*60}")
        print(f"ML SYSTEM REQUIREMENTS: {self.system_name}")
        print(f"{'='*60}")
        print(f"\nObjective: {self.business_objective}")
        print(f"\nSuccess Metrics:")
        for m in self.success_metrics:
            print(f"  - {m}")
        print(f"\nScale: {self.num_users:,} users, {self.num_items:,} items, {self.requests_per_second:,} RPS")
        print(f"Deployment: {self.deployment_type}")
        print(f"Latency: {self.latency_requirement}")

# Example
reqs = MLSystemRequirements(
    system_name="Video Recommendation System",
    business_objective="Maximize watch time and retention",
    success_metrics=["Increase watch time by 20%", "Improve CTR by 15%"],
    deployment_type="Cloud + Edge caching",
    latency_requirement="p50 < 50ms, p99 < 200ms",
    num_users=50_000_000,
    num_items=10_000_000,
    requests_per_second=50_000
)

reqs.summary()
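
A template also makes it easy to see which questions are still open. The sketch below uses a hypothetical, trimmed-down `RequirementsDraft` class (not the full template above) and treats any empty or zero field as "not yet clarified", which is an assumption that would misfire for fields where zero is a valid answer.

```python
from dataclasses import dataclass, field, fields
from typing import List

@dataclass
class RequirementsDraft:
    """Hypothetical, trimmed-down stand-in for the requirements template."""
    system_name: str
    business_objective: str = ""
    success_metrics: List[str] = field(default_factory=list)
    latency_requirement: str = ""
    num_users: int = 0

def missing_fields(spec) -> List[str]:
    """List fields still holding an empty or zero value, i.e. not yet clarified."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]

draft = RequirementsDraft(system_name="Video Recommendation System", num_users=50_000_000)
print("Still to clarify:", missing_fields(draft))
```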

---

## 8. Requirements Checklist

In [None]:
checklist = """
ML SYSTEM REQUIREMENTS CHECKLIST
================================

BUSINESS REQUIREMENTS:
[ ] Primary business objective defined
[ ] Success metrics identified with baselines
[ ] Target improvement quantified
[ ] Trade-offs discussed and prioritized

FEATURE REQUIREMENTS:
[ ] User interactions documented
[ ] Content/item types specified
[ ] Personalization level defined
[ ] Priority (P0/P1/P2) assigned

DATA REQUIREMENTS:
[ ] Data sources inventoried
[ ] Data volume estimated
[ ] Labeling strategy defined
[ ] Data quality assessed

CONSTRAINTS:
[ ] Deployment type specified
[ ] Latency requirements defined
[ ] Budget constraints documented
[ ] Team capacity assessed

SCALE:
[ ] User scale documented
[ ] Item scale documented
[ ] Request volume estimated
[ ] Growth projections made

PERFORMANCE:
[ ] Accuracy targets set
[ ] Freshness requirements defined
[ ] Availability targets set
[ ] Fallback strategy planned
"""

print(checklist)

---

## Hands-On Exercise

Practice gathering requirements for a content moderation system.

In [None]:
# TODO: Complete this exercise
# Create requirements for a content moderation system

# content_moderation = MLSystemRequirements(
#     system_name="Content Moderation System",
#     business_objective="...",
#     success_metrics=["..."],
#     deployment_type="...",
#     latency_requirement="...",
#     num_users=...,
#     requests_per_second=...
# )
# content_moderation.summary()

print("Hints for Content Moderation System:")
print("- Business goal: User safety and platform trust")
print("- Key metric: False negative rate (missing harmful content)")
print("- Constraint: Very low latency (content goes live quickly, so checks must be fast)")
print("- Trade-off: Precision vs Recall (over-moderation vs under-moderation)")
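
To reason about the precision vs. recall hint, it helps to compute both from confusion counts. This is a sketch with made-up counts for two hypothetical decision thresholds; "positive" here means "harmful content".

```python
def moderation_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and false negative rate for the 'harmful' class."""
    flagged = tp + fp   # everything the model flagged
    harmful = tp + fn   # everything that was actually harmful
    precision = tp / flagged if flagged else 0.0
    recall = tp / harmful if harmful else 0.0
    fnr = fn / harmful if harmful else 0.0
    return {"precision": precision, "recall": recall, "false_negative_rate": fnr}

# Made-up confusion counts at two decision thresholds
operating_points = {
    "aggressive (low threshold)": moderation_metrics(tp=90, fp=60, fn=10),
    "conservative (high threshold)": moderation_metrics(tp=60, fp=5, fn=40),
}
for name, m in operating_points.items():
    print(f"{name}: precision={m['precision']:.2f}, recall={m['recall']:.2f}, "
          f"FNR={m['false_negative_rate']:.2f}")
```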

---

## Summary

### Key Takeaways

1. **Requirements First**: Always gather requirements before designing the system

2. **Six Categories of Questions**:
   - Business objectives
   - Feature requirements
   - Data requirements
   - Constraints
   - Scale
   - Performance

3. **Document Everything**: Use structured templates to capture requirements

4. **Clarify Trade-offs**: Understand what can be sacrificed for what

### Next Steps

In the next tutorial, we'll learn how to **Frame ML Problems** - translating these requirements into concrete ML objectives.

---

In [None]:
# Quick Quiz
quiz = [
    ("As a rough rule of thumb, what share of project time should go to requirements?", "B",
     ["A) 5%", "B) 15-25%", "C) 50%", "D) 75%"]),
    ("Which constraint is typically most important for recommendation systems?", "A",
     ["A) Latency", "B) Storage", "C) Team size", "D) None"]),
    ("What's the best labeling strategy when explicit labels are expensive?", "B",
     ["A) Skip labeling", "B) Use implicit signals", "C) Random labels", "D) Expert labels only"])
]

print("Quick Self-Assessment")
print("="*40)
for i, (q, a, opts) in enumerate(quiz, 1):
    print(f"\nQ{i}: {q}")
    for opt in opts:
        print(f"   {opt}")

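print("\n" + "="*40)
# Build the answer key from the quiz data so it cannot drift out of sync
print("Answers: " + ", ".join(f"{i}-{a}" for i, (_, a, _) in enumerate(quiz, 1)))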