# 02 - Clarifying Requirements (Meta Interview Frame)

---

## What the Chapter Says

The first step in the ML System Design framework is **Clarifying Requirements**. The chapter specifies these exact question categories:

1. **Business objective** - What is the company trying to achieve?
2. **Features system supports** - e.g., like/dislike as labels
3. **Data** - sources, size, labeled or not?
4. **Constraints** - compute, cloud vs on-device, improve automatically?
5. **Scale** - users, items, growth expectations
6. **Performance** - real-time? latency vs accuracy tradeoffs?
7. **Privacy/Ethics** - additional considerations

**Key instruction**: Align scope with interviewer + write requirements down.

---

## Meta Interview Signal

| Level | Expectations |
|-------|-------------|
| **E5** | Asks structured clarifying questions. Covers all categories. Writes down requirements. Doesn't jump to solutions. |
| **E6** | Probes deeper on scale ("how many events per second?"). Anticipates constraints not mentioned. Identifies ambiguities in business objective. Asks about failure tolerance and iteration cadence. |

---

## The Question Categories Framework

Use this exact structure in every ML System Design interview:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# Create the framework table (must match chapter)
categories = pd.DataFrame({
    'Category': [
        'Business Objective',
        'Features System Supports',
        'Data',
        'Constraints',
        'Scale',
        'Performance',
        'Privacy/Ethics'
    ],
    'Example Questions': [
        'What metric defines success? Revenue? Engagement? Safety?',
        'What actions can users take? Like/dislike? Comments? Shares?',
        'What data exists? How much? Is it labeled? Fresh or historical?',
        'Cloud or on-device? GPU available? Auto-improve or static model?',
        'How many users? Items? Requests per second? Growth rate?',
        'Real-time needed? Max latency? Accuracy vs speed tradeoff?',
        'Sensitive data? Anonymization needed? Bias concerns?'
    ],
    'Why It Matters': [
        'Defines the ML objective and success criteria',
        'Determines what labels are available naturally',
        'Shapes feature engineering and model choice',
        "Limits solution space (on-device rules out large, compute-heavy models)",
        'Affects architecture: batch vs streaming, sharding',
        'Sets the latency budget, which bounds model complexity',
        'May require differential privacy, fairness constraints'
    ]
})

print("=" * 80)
print("CLARIFYING REQUIREMENTS FRAMEWORK (from Chapter)")
print("=" * 80)
print(categories.to_string(index=False))

In [None]:
# Visual diagram of the categories
fig, ax = plt.subplots(figsize=(12, 8))
ax.axis('off')
ax.set_title('Clarifying Requirements: Question Categories', fontsize=14, fontweight='bold')

categories_viz = [
    ('Business\nObjective', '#BBDEFB', 0, 6),
    ('Features\nSupported', '#C8E6C9', 2, 6),
    ('Data', '#FFF9C4', 4, 6),
    ('Constraints', '#FFCCBC', 6, 6),
    ('Scale', '#E1BEE7', 1, 3.5),
    ('Performance', '#B2DFDB', 3, 3.5),
    ('Privacy/\nEthics', '#F8BBD9', 5, 3.5),
]

for (label, color, x, y) in categories_viz:
    rect = mpatches.FancyBboxPatch((x, y), 1.8, 1.5, boxstyle='round,pad=0.1',
                                    facecolor=color, edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    ax.text(x + 0.9, y + 0.75, label, ha='center', va='center', fontsize=10, fontweight='bold')

# Central box
rect = mpatches.FancyBboxPatch((2.5, 0.5), 3, 1.5, boxstyle='round,pad=0.1',
                                facecolor='white', edgecolor='red', linewidth=3)
ax.add_patch(rect)
ax.text(4, 1.25, 'WRITE\nREQUIREMENTS\nDOWN', ha='center', va='center', 
        fontsize=11, fontweight='bold', color='red')

# Arrows
arrow_props = dict(arrowstyle='->', color='gray', lw=1.5)
for (_, _, x, y) in categories_viz:
    ax.annotate('', xy=(4, 2), xytext=(x + 0.9, y), arrowprops=arrow_props)

ax.set_xlim(-0.5, 8.5)
ax.set_ylim(0, 8)
plt.tight_layout()
plt.show()

---

## Hands-On: Simulated Interview Scenarios

Let's practice the clarifying requirements step for multiple scenarios. These mirror the chapter's examples.

In [None]:
# Define interview scenarios from the chapter
scenarios = {
    'Event Ticket Selling App': {
        'business_objective': 'Increase ticket sales',
        'features_supported': ['event search', 'recommendations', 'like/save events', 'purchase'],
        'data': {
            'sources': ['user profiles', 'event catalog', 'purchase history', 'browsing logs'],
            'size': '10M users, 500K events, 50M interactions/month',
            'labeled': 'Yes - purchases are natural labels'
        },
        'constraints': {
            'compute': 'Cloud-based, GPU available for training',
            'auto_improve': 'Yes, retrain weekly'
        },
        'scale': {
            'users': '10M',
            'items': '500K events',
            'growth': '20% YoY'
        },
        'performance': {
            'latency': '< 100ms for recommendations',
            'real_time': 'Yes for search, batch OK for email campaigns'
        },
        'privacy_ethics': 'Location data sensitive, GDPR compliance needed'
    },
    'Video Streaming App': {
        'business_objective': 'Increase engagement (watch time)',
        'features_supported': ['video feed', 'like/dislike', 'watch history', 'follow creators'],
        'data': {
            'sources': ['video metadata', 'user interactions', 'watch sessions', 'creator info'],
            'size': '100M users, 10M videos, 1B watch events/day',
            'labeled': 'Yes - watch time is natural label'
        },
        'constraints': {
            'compute': 'Massive cloud infra, distributed training',
            'auto_improve': 'Continuous learning, multiple times per day'
        },
        'scale': {
            'users': '100M',
            'items': '10M videos',
            'growth': '50% YoY'
        },
        'performance': {
            'latency': '< 50ms for feed ranking',
            'real_time': 'Yes - immediate personalization'
        },
        'privacy_ethics': 'Content safety critical, age-appropriate filtering'
    },
    'Harmful Content Detection': {
        'business_objective': 'Improve platform safety',
        'features_supported': ['post creation', 'reporting', 'appeals'],
        'data': {
            'sources': ['post content (text/image/video)', 'user reports', 'moderator decisions'],
            'size': '1B posts/day, 1M flagged posts/day',
            'labeled': 'Partially - moderator labels expensive, user reports noisy'
        },
        'constraints': {
            'compute': 'Must run on every post, massive scale',
            'auto_improve': 'Yes, new harmful patterns emerge constantly'
        },
        'scale': {
            'users': '1B',
            'items': '1B posts/day',
            'growth': '10% YoY'
        },
        'performance': {
            'latency': '< 200ms per post',
            'real_time': 'Yes - must classify before post is shown'
        },
        'privacy_ethics': 'False positives = censorship concerns, false negatives = safety risk'
    }
}

# Display one scenario in detail
import json
print("=" * 80)
print("SCENARIO: Event Ticket Selling App")
print("=" * 80)
print(json.dumps(scenarios['Event Ticket Selling App'], indent=2))

In [None]:
# Create a requirements template class
class RequirementsTemplate:
    """Template for documenting ML system requirements (chapter-aligned)"""
    
    def __init__(self, system_name):
        self.system_name = system_name
        self.requirements = {
            'business_objective': None,
            'features_supported': [],
            'data': {
                'sources': [],
                'size': None,
                'labeled': None
            },
            'constraints': {
                'compute': None,
                'deployment': None,  # cloud vs on-device
                'auto_improve': None
            },
            'scale': {
                'users': None,
                'items': None,
                'qps': None,
                'growth': None
            },
            'performance': {
                'latency_requirement': None,
                'real_time': None,
                'accuracy_threshold': None
            },
            'privacy_ethics': {
                'sensitive_data': None,
                'anonymization': None,
                'bias_concerns': None
            }
        }
    
    def set_business_objective(self, objective):
        self.requirements['business_objective'] = objective
        return self
    
    def set_features(self, features):
        self.requirements['features_supported'] = features
        return self
    
    def set_data(self, sources, size, labeled):
        self.requirements['data'] = {'sources': sources, 'size': size, 'labeled': labeled}
        return self
    
    def set_constraints(self, compute, deployment, auto_improve):
        self.requirements['constraints'] = {
            'compute': compute, 'deployment': deployment, 'auto_improve': auto_improve
        }
        return self
    
    def set_scale(self, users, items, qps, growth):
        self.requirements['scale'] = {
            'users': users, 'items': items, 'qps': qps, 'growth': growth
        }
        return self
    
    def set_performance(self, latency, real_time, accuracy):
        self.requirements['performance'] = {
            'latency_requirement': latency, 'real_time': real_time, 'accuracy_threshold': accuracy
        }
        return self
    
    def set_privacy_ethics(self, sensitive, anonymization, bias):
        self.requirements['privacy_ethics'] = {
            'sensitive_data': sensitive, 'anonymization': anonymization, 'bias_concerns': bias
        }
        return self
    
    def summarize(self):
        print(f"\n{'='*60}")
        print(f"REQUIREMENTS DOCUMENT: {self.system_name}")
        print(f"{'='*60}")
        print(json.dumps(self.requirements, indent=2))
        return self.requirements

# Example: Fill out requirements for Friend Recommendation
friend_rec = RequirementsTemplate('Friend Recommendation System')
friend_rec.set_business_objective('Increase network growth (maximize formed connections)')
friend_rec.set_features(['friend suggestions', 'mutual friends display', 'accept/ignore actions'])
friend_rec.set_data(
    sources=['social graph', 'user profiles', 'interaction history', 'mutual connections'],
    size='500M users, 100B edges in graph',
    labeled='Yes - accepted requests are positive labels'
)
friend_rec.set_constraints(
    compute='Cloud, distributed graph processing',
    deployment='Cloud',
    auto_improve='Daily retraining'
)
friend_rec.set_scale(
    users='500M',
    items='500M potential friends per user',
    qps='100K friend suggestion requests/sec',
    growth='10% YoY'
)
friend_rec.set_performance(
    latency='< 200ms for top-10 suggestions',
    real_time='Batch pre-computation acceptable',
    accuracy='Precision@10 > 0.3'
)
friend_rec.set_privacy_ethics(
    sensitive='Social connections are sensitive',
    anonymization='Do not reveal why a suggestion was made (could expose private signals)',
    bias='Avoid filter bubbles, ensure diverse suggestions'
)
friend_rec.summarize();
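
When practicing with a template like this, it can help to check which fields are still unset before moving on to design. The sketch below is an illustration rather than part of the chapter: `find_unfilled` is a hypothetical helper, and it assumes the same nested-dict layout as `RequirementsTemplate.requirements`.

```python
def find_unfilled(requirements, prefix=""):
    """Return dotted paths of fields that are still None or an empty list.

    Hypothetical helper (not from the chapter); assumes a nested dict
    shaped like RequirementsTemplate.requirements.
    """
    missing = []
    for key, value in requirements.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections such as 'data' or 'scale'
            missing.extend(find_unfilled(value, prefix=f"{path}."))
        elif value is None or value == []:
            missing.append(path)
    return missing


# Small stand-in dict with the same shape as the template above
partial = {
    "business_objective": "Increase network growth",
    "features_supported": [],
    "scale": {"users": "500M", "items": None, "qps": None, "growth": "10% YoY"},
}
print(find_unfilled(partial))
```

Each returned path is a question that has not been asked yet, so the list doubles as a prompt for what to clarify next.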

---

## Practice: Structured Question Asking

Here's how to structure your clarifying questions in the interview:

In [None]:
# Question templates for each category
question_templates = {
    'Business Objective': [
        "What is the primary business metric we're trying to optimize?",
        "How is success currently measured for this product?",
        "Are there secondary objectives we should consider?",
        "What's the current baseline performance?"
    ],
    'Features Supported': [
        "What user actions are available in the product?",
        "Which actions can serve as implicit labels (e.g., like/dislike)?",
        "Are there any upcoming features that might affect this?",
        "How do users currently interact with the system?"
    ],
    'Data': [
        "What data sources are available?",
        "How much historical data do we have?",
        "Is the data labeled? If so, how were labels obtained?",
        "How fresh is the data? Real-time or batch?",
        "What's the data quality like? Missing values? Noise?"
    ],
    'Constraints': [
        "Is this cloud-based, or does it need to run on-device?",
        "What compute resources are available for training?",
        "What compute resources are available for inference?",
        "Should the model improve automatically over time?",
        "Are there budget constraints?"
    ],
    'Scale': [
        "How many users does the system serve?",
        "How many items (videos/posts/events) are in the catalog?",
        "What's the expected queries per second (QPS)?",
        "What's the expected growth rate?"
    ],
    'Performance': [
        "What's the maximum acceptable latency?",
        "Does the system need real-time predictions?",
        "What's the tradeoff between latency and accuracy?",
        "Are there specific accuracy thresholds we need to meet?"
    ],
    'Privacy/Ethics': [
        "Is any of the data sensitive (PII, health, financial)?",
        "Are there anonymization requirements?",
        "Are there concerns about bias in the model?",
        "Are there regulatory requirements (GDPR, etc.)?",
        "What happens if the model makes a mistake? (Impact analysis)"
    ]
}

print("CLARIFYING QUESTION CHEAT SHEET")
print("=" * 60)
for category, questions in question_templates.items():
    print(f"\n{category.upper()}")
    print("-" * 40)
    for q in questions:
        print(f"  - {q}")

---

## Tradeoffs Discussion (Chapter-Aligned)

The chapter emphasizes these tradeoffs during requirements gathering:

In [None]:
tradeoffs = pd.DataFrame({
    'Tradeoff': [
        'Latency vs Accuracy',
        'Cloud vs On-Device',
        'Manual Labels vs Natural Labels',
        'Real-time vs Batch',
        'Precision vs Recall'
    ],
    'Favor the first option when': [
        'User-facing, interactive features',
        'High compute needs, easier updates',
        'When natural labels unavailable/noisy',
        'Time-sensitive predictions',
        'Harmful content (high-stakes false positives)'
    ],
    'Favor the second option when': [
        'Offline batch processing, reports',
        'Privacy-sensitive, no network needed',
        'When user actions = labels (engagement)',
        'Can precompute, cost-sensitive',
        'Fraud detection (catch more, review later)'
    ],
    'Interview Signal': [
        'E5: Knows tradeoff exists. E6: Quantifies (50ms vs 200ms, +2% accuracy)',
        'E5: Can list differences. E6: Discusses hybrid approaches',
        'E5: Understands both. E6: Discusses label quality over time',
        'E5: Knows when each applies. E6: Discusses freshness SLAs',
        'E5: Knows formulas. E6: Ties to business impact'
    ]
})

print("KEY TRADEOFFS TO SURFACE DURING REQUIREMENTS")
print("=" * 100)
print(tradeoffs.to_string(index=False))
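
The precision vs recall row is easier to discuss with numbers in hand. As a hedged sketch, the snippet below computes precision and recall from confusion-matrix counts and attaches a per-error cost. The counts and cost values are made-up illustration values, not figures from the chapter, and `error_cost` is a hypothetical helper.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def error_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of mistakes under assumed per-error costs."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical counts for two operating thresholds of the same classifier
strict = {"tp": 80, "fp": 5, "fn": 20}    # high threshold: fewer false positives
lenient = {"tp": 95, "fp": 40, "fn": 5}   # low threshold: fewer false negatives

for name, c in [("strict", strict), ("lenient", lenient)]:
    p, r = precision_recall(c["tp"], c["fp"], c["fn"])
    # Assumed costs: a false positive costs 10 units, a false negative 1 unit
    cost = error_cost(c["fp"], c["fn"], cost_fp=10, cost_fn=1)
    print(f"{name}: precision={p:.2f} recall={r:.2f} cost={cost}")
```

Swapping the assumed costs, so that a false negative is the expensive error, reverses which threshold looks better. That reversal is the kind of business tie-in the E6 column describes.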

---

## Hands-On: Mock Interview Simulation

In [None]:
class MockInterviewer:
    """Simulates an interviewer for requirements gathering practice"""
    
    def __init__(self, scenario):
        self.scenario = scenario
        self.questions_asked = []
        self.revealed_info = set()
    
    def ask(self, question):
        """Simulate asking the interviewer a question"""
        self.questions_asked.append(question)
        question_lower = question.lower()
        
        # Determine what information to reveal based on question
        response = "Could you be more specific?"
        
        if 'business' in question_lower or 'objective' in question_lower or 'goal' in question_lower:
            response = f"Our main goal is: {self.scenario['business_objective']}"
            self.revealed_info.add('business_objective')
            
        # 'user' is deliberately not a keyword here: it would swallow scale questions such as "how many users?"
        elif 'feature' in question_lower or 'action' in question_lower:
            response = f"Users can: {', '.join(self.scenario['features_supported'])}"
            self.revealed_info.add('features')
            
        elif 'data' in question_lower or 'label' in question_lower:
            response = f"Data sources: {', '.join(self.scenario['data']['sources'])}. Size: {self.scenario['data']['size']}. Labels: {self.scenario['data']['labeled']}"
            self.revealed_info.add('data')
            
        elif 'scale' in question_lower or 'users' in question_lower or 'traffic' in question_lower:
            response = f"Scale: {self.scenario['scale']['users']} users, {self.scenario['scale']['items']} items, growing {self.scenario['scale']['growth']}"
            self.revealed_info.add('scale')
            
        elif 'latency' in question_lower or 'performance' in question_lower or 'speed' in question_lower:
            response = f"Latency requirement: {self.scenario['performance']['latency']}. Real-time: {self.scenario['performance']['real_time']}"
            self.revealed_info.add('performance')
            
        elif 'privacy' in question_lower or 'ethic' in question_lower or 'bias' in question_lower:
            response = f"Privacy/Ethics concerns: {self.scenario['privacy_ethics']}"
            self.revealed_info.add('privacy')
            
        elif 'constrain' in question_lower or 'compute' in question_lower or 'device' in question_lower:
            response = f"Constraints: {self.scenario['constraints']['compute']}. Auto-improve: {self.scenario['constraints']['auto_improve']}"
            self.revealed_info.add('constraints')
        
        print(f"\n[Interviewer]: {response}")
        return response
    
    def score(self):
        """Score the candidate's question coverage"""
        all_categories = {'business_objective', 'features', 'data', 'scale', 'performance', 'privacy', 'constraints'}
        coverage = len(self.revealed_info) / len(all_categories)
        
        print(f"\n{'='*60}")
        print("INTERVIEW SCORE")
        print(f"{'='*60}")
        print(f"Questions asked: {len(self.questions_asked)}")
        print(f"Categories covered: {len(self.revealed_info)}/{len(all_categories)} ({coverage*100:.0f}%)")
        print(f"Covered: {sorted(self.revealed_info)}")
        print(f"Missed: {sorted(all_categories - self.revealed_info)}")
        
        if self.revealed_info == all_categories:
            print("\n[Feedback]: Excellent coverage! You asked about all key areas.")
        elif coverage >= 0.6:
            print("\n[Feedback]: Good coverage, but missed some important areas.")
        else:
            print("\n[Feedback]: Need to ask more structured questions across all categories.")

# Demo
print("MOCK INTERVIEW: Video Streaming Recommendation")
print("="*60)
interviewer = MockInterviewer(scenarios['Video Streaming App'])

# Simulate candidate questions
interviewer.ask("What is the business objective we're trying to optimize?")
interviewer.ask("What user actions and features does the app support?")
interviewer.ask("What data do we have available and is it labeled?")
interviewer.ask("What's the scale - how many users and videos?")
interviewer.ask("What are the latency and performance requirements?")
interviewer.ask("Are there privacy or ethical concerns I should know about?")
interviewer.ask("What are the compute constraints?")

interviewer.score()

---

## Meta Interview Signal (Detailed)

### E5 Answer Expectations

- Asks structured questions across all 7 categories
- Writes down requirements as they're revealed
- Doesn't jump to solutions before understanding the problem
- Clarifies ambiguities ("When you say engagement, do you mean watch time or interactions?")
- Summarizes requirements back to interviewer before proceeding

### E6 Additions

- **Probes deeper**: "You mentioned 100M users - what's the DAU? Peak QPS? Geographic distribution?"
- **Anticipates constraints**: "Since this is real-time video, I assume we need sub-100ms latency for ranking?"
- **Identifies ambiguities**: "Increase engagement could mean watch time, sessions, or shares - which is primary?"
- **Discusses failure tolerance**: "What happens if the recommendation service is down? Fallback to popular content?"
- **Iteration cadence**: "How often do we expect to retrain? Is there a human-in-the-loop for new patterns?"
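
The "probes deeper" questions map onto a quick back-of-envelope estimate. The sketch below is illustrative only: `estimate_qps` is a hypothetical helper, and the DAU ratio, requests per user, and peak factor are assumed placeholder values to be replaced with the interviewer's answers.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_qps(total_users, dau_ratio, requests_per_user_per_day, peak_factor):
    """Return (average_qps, peak_qps) from rough traffic assumptions."""
    dau = total_users * dau_ratio
    average_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY
    return average_qps, average_qps * peak_factor


# Every input except the 100M user count is an assumption for illustration
avg, peak = estimate_qps(
    total_users=100_000_000,
    dau_ratio=0.3,                  # assumed: 30% of users are active daily
    requests_per_user_per_day=20,   # assumed: feed requests per active user
    peak_factor=3,                  # assumed: peak traffic is 3x the average
)
print(f"average ~{avg:,.0f} QPS, peak ~{peak:,.0f} QPS")
```

Stating the assumptions out loud and asking the interviewer to correct them is usually more valuable than the resulting number.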

---

## Key Chapter Instruction: Write Requirements Down

The chapter's key instruction: **align scope with the interviewer and write the requirements down.**

Here's a template to use during interviews:

In [None]:
requirements_template = """
================================================================================
ML SYSTEM REQUIREMENTS DOCUMENT
================================================================================
System: [Name]
Date: [Date]
Interviewer alignment: [ ] Confirmed

--------------------------------------------------------------------------------
1. BUSINESS OBJECTIVE
--------------------------------------------------------------------------------
Primary goal: _______________________________________________
Success metric: _______________________________________________
Current baseline: _______________________________________________

--------------------------------------------------------------------------------
2. FEATURES SUPPORTED  
--------------------------------------------------------------------------------
User actions: _______________________________________________
Available labels: _______________________________________________

--------------------------------------------------------------------------------
3. DATA
--------------------------------------------------------------------------------
Sources: _______________________________________________
Size: _______________________________________________
Labeled: [ ] Yes  [ ] No  [ ] Partial
Freshness: _______________________________________________

--------------------------------------------------------------------------------
4. CONSTRAINTS
--------------------------------------------------------------------------------
Deployment: [ ] Cloud  [ ] On-device  [ ] Hybrid
Compute: _______________________________________________
Auto-improve: [ ] Yes  [ ] No

--------------------------------------------------------------------------------
5. SCALE
--------------------------------------------------------------------------------
Users: _______________________________________________
Items: _______________________________________________
QPS: _______________________________________________
Growth: _______________________________________________

--------------------------------------------------------------------------------
6. PERFORMANCE
--------------------------------------------------------------------------------
Latency: _______________________________________________
Real-time: [ ] Yes  [ ] No (batch OK)
Accuracy vs Latency: _______________________________________________

--------------------------------------------------------------------------------
7. PRIVACY / ETHICS
--------------------------------------------------------------------------------
Sensitive data: _______________________________________________
Anonymization: _______________________________________________
Bias concerns: _______________________________________________

================================================================================
"""

print(requirements_template)

---

## Interview Drills

### Drill 1: Category Recall
From memory, list all 7 clarifying requirement categories. Time yourself - you should do this in under 30 seconds.

### Drill 2: Question Generation
For each category, write down 3 questions without looking at notes. Practice until you can do this fluently.

### Drill 3: Mock Interview
Ask a friend to give you a system design prompt. Spend exactly 5 minutes asking clarifying questions and filling out the requirements template.

### Drill 4: Tradeoff Articulation
For the scenario "Ad Click Prediction", articulate:
- The latency vs accuracy tradeoff
- The precision vs recall tradeoff (what's the cost of false positives vs false negatives?)

### Drill 5: E6 Deep Dive
Take the "Harmful Content Detection" scenario. Go beyond the basic questions:
- What failure modes should you anticipate?
- How would you handle adversarial users trying to evade detection?
- What's the iteration velocity requirement for new harmful content patterns?