# End of week 1 exercise

To demonstrate your familiarity with OpenAI API, and also Ollama, build a tool that takes a technical question,  
and responds with an explanation. This is a tool that you will be able to use yourself during the course!

In [18]:
# imports
import os
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display, update_display

In [19]:
# set up environment & constants
load_dotenv(override=True)
MODEL_GPT, MODEL_LLAMA = 'gpt-4o-mini', 'llama3.2'

MODEL_CONFIG = {
    MODEL_LLAMA: {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
    MODEL_GPT: {"base_url": "https://openrouter.ai/api/v1", "api_key": os.getenv("OPENROUTER_API_KEY")},
}

# Validate OpenRouter key once at startup (Ollama needs no key)
_key = MODEL_CONFIG[MODEL_GPT]["api_key"]
if not (_key and len(_key) > 10):
    print("OpenRouter API key may be missing. Check .env and troubleshooting notebook.")

In [20]:
def get_client(model):
    cfg = MODEL_CONFIG.get(model)
    if not cfg:
        raise ValueError(f"Unknown model: {model}. Use {MODEL_GPT} or {MODEL_LLAMA}")
    return OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])

In [21]:
# Prompts & main helper below

In [22]:
# prompts — System Design Interview Expert

SYSTEM_PROMPT = """
You are a senior staff engineer conducting a system design interview at a top tech company (FAANG-level). Your role is to guide candidates through a structured, realistic system design discussion.

## Your Approach

1. **Requirements Clarification** — Start by clarifying functional and non-functional requirements. Ask about scale (DAU, QPS, storage), consistency needs, latency targets, and key use cases. Make reasonable assumptions when the user doesn't specify.

2. **High-Level Design** — Propose a top-level architecture: clients, load balancers, API servers, core services, databases, caches, message queues. Draw ASCII diagrams when helpful. Identify the main components and data flow.

3. **Deep Dive** — Zoom into 2-3 critical components: data models, sharding strategy, caching layers, replication, or consistency mechanisms. Discuss trade-offs (e.g., consistency vs availability, read vs write optimization).

4. **Scale & Bottlenecks** — Address scalability: horizontal vs vertical scaling, back-of-envelope capacity estimates (storage, bandwidth, QPS). Identify potential bottlenecks and mitigation strategies.

5. **Fault Tolerance & Operations** — Briefly cover failure modes, replication, failover, monitoring, and operational concerns.

## Output Style

- Use clear markdown: headers, bullet points, code blocks for schemas or configs.
- Include simple ASCII diagrams for architecture (e.g., Client → LB → API → DB).
- Be concise but thorough. Prioritize clarity over length.
- When making assumptions, state them explicitly (e.g., "Assuming 10M DAU...").
- Reference real-world patterns: consistent hashing, write-ahead logs, leader election, etc.

## Tone

- Professional and interview-like. Assume the "candidate" (user) is competent and engaged.
- Don't over-explain basics; focus on the non-obvious and trade-off discussions.
"""

# Template: user provides the system design question
USER_PROMPT_TEMPLATE = """
Design {question}

{optional_context}
"""

# Example system design questions you can plug in:
EXAMPLE_QUESTIONS = [
    "Design a URL shortener like bit.ly",
    "Design a rate limiter for an API",
    "Design a distributed cache (like Redis)",
    "Design a chat system (like Slack or WhatsApp)",
    "Design YouTube or Netflix (video streaming)",
    "Design a search autocomplete system",
    "Design a notification system",
]


def build_user_prompt(question: str, context: str = "") -> str:
    ctx = f"Context/Constraints:\n{context}\n\n" if context.strip() else ""
    return USER_PROMPT_TEMPLATE.format(question=question, optional_context=ctx)

In [23]:
def ask_system_design(question: str, model: str = MODEL_GPT, context: str = "", stream: bool = True):
    """Ask a system design question. Streams by default."""
    client = get_client(model)
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": build_user_prompt(question, context)},
    ]
    response = client.chat.completions.create(model=model, messages=messages, stream=stream)
    display_handle = display(Markdown(""), display_id=True)
    if not stream:
        return response.choices[0].message.content
    full = ""
    for chunk in response:
        content = chunk.choices[0].delta.content or ""
        full += content
        update_display(Markdown(full), display_id=display_handle.display_id)
    return full

In [25]:
# Demo: GPT-4o-mini (streaming)
ask_system_design("Design a URL shortener like bit.ly", model=MODEL_GPT)

Let's design a URL shortener service similar to Bit.ly. We will go through the structured approach outlined above.

### 1. Requirements Clarification

#### Functional Requirements:
- Users can submit a long URL and receive a shortened URL.
- Users can retrieve the original URL by using the shortened URL.
- Users can track the number of clicks on each shortened URL (optional).

#### Non-Functional Requirements:
- **Scale**: 
  - Assuming 10 million daily active users (DAU).
  - Assume about 100 million URL shortenings per day.
- **Performance**: 
  - Latency target: <100 ms for redirecting to original URL.
- **Consistency**: 
  - Eventual consistency for analytics and tracking.
- **Availability**: 
  - High availability, aiming for 99.99% uptime.

### 2. High-Level Design

#### Top-Level Architecture
```
Client ---> Load Balancer ---> API Servers ---> URL Shortening Service
                              |
                             \|/
                         Database
                              |
                        Cache Layer
```

#### Components:
- **Client**: Web or mobile application for users to submit and retrieve URLs.
- **Load Balancer**: Distributes incoming requests across multiple API servers for scalability.
- **API Servers**: Handle client requests, manage interactions with the URL Shortening service.
- **URL Shortening Service**: Contains the logic to create short URLs and retrieve original URLs.
- **Database**: Stores the mappings of short URLs to original URLs, along with click analytics (can use SQL or NoSQL).
- **Cache Layer**: To speed up retrieval of frequently accessed short URLs.
  
### 3. Deep Dive

#### Data Model
We can define our main entities as follows:

**URL Mapping Table**
```sql
CREATE TABLE url_mapping (
    id SERIAL PRIMARY KEY,
    original_url TEXT NOT NULL,
    short_url VARCHAR(10) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    total_clicks INT DEFAULT 0
);
```

#### Shortening Logic
- Use a base62 encoding scheme to generate short URLs. For example, each ID (incremental or random) can be converted to a short string using alphanumeric characters (`0-9`, `a-z`, `A-Z`).

#### Caching Strategy
- Use Redis or Memcached to cache recently accessed short URL mappings to reduce database load.

#### Consistency Model
- For tracking clicks, we can use a background job to update the database asynchronously:
  - On each redirect, perform the update of the total clicks in a non-blocking way.

### 4. Scale & Bottlenecks

#### Capacity Estimates
- For 100 million shortenings in a day, that’s approximately 1,157 requests per second (RPS) for create endpoints and similar for retrievals.
  
- **Database Scaling**: Use sharding based on hash of the short URL to distribute load.
- **Read vs Write Optimization**: Focus on optimizing reads via caching layers to reduce database bottlenecks.

#### Potential Bottlenecks
- **Database**: Implement sharding and replication to ensure it can handle high loads.
- **Unique Short URL Generating**: Ensure uniqueness without collisions, possibly via a dedicated service for generating short codes.

### 5. Fault Tolerance & Operations

#### Failure Modes
- Database failure can lead to loss of recent URL mappings or clicks. Implement replication and backups for fault tolerance.
  
#### Monitoring
- Use monitoring tools like Prometheus or Grafana to keep track of request per second, error rates, and click-through rates.
  
#### Operations
- Set up alerting for unusual patterns (e.g., sudden spikes in requests) and ensure a logging strategy for debugging.

### Summary
The URL shortener system has been outlined with appropriate assumptions about scale and functional requirements. Key components include an efficient database schema, caching for high read performance, and handling of click tracking with eventual consistency. After considering scalability and fault tolerance, the architecture should effectively serve a high volume of requests while maintaining performance.

"Let's design a URL shortener service similar to Bit.ly. We will go through the structured approach outlined above.\n\n### 1. Requirements Clarification\n\n#### Functional Requirements:\n- Users can submit a long URL and receive a shortened URL.\n- Users can retrieve the original URL by using the shortened URL.\n- Users can track the number of clicks on each shortened URL (optional).\n\n#### Non-Functional Requirements:\n- **Scale**: \n  - Assuming 10 million daily active users (DAU).\n  - Assume about 100 million URL shortenings per day.\n- **Performance**: \n  - Latency target: <100 ms for redirecting to original URL.\n- **Consistency**: \n  - Eventual consistency for analytics and tracking.\n- **Availability**: \n  - High availability, aiming for 99.99% uptime.\n\n### 2. High-Level Design\n\n#### Top-Level Architecture\n```\nClient ---> Load Balancer ---> API Servers ---> URL Shortening Service\n                              |\n                             \\|/\n                    

In [24]:
# Demo: Llama 3.2 via Ollama (streaming)
ask_system_design("Design a rate limiter for an API", model=MODEL_LLAMA)

# Rate Limiter Design
=====================================

## Overview
---------

This design implements a simple rate limiter for an API to prevent abuse and protect against common web attacks.

### Requirements
--------------

* Handle 100M DAU (Daily Active Users)
* Prevent excessive concurrent requests
* Allow for bursty traffic within acceptable limits
* Provide user-friendly error cases

## Architecture Summary
---------------------

```markdown
+---------------+
|  Client     |
+---------------+
          |
          | HTTP Request
          v
+---------------+
| Load Balancer |
+---------------+
          |
          |  URL/Method
          |  Headers
          v
+---------------+
| API Gateway    |
+---------------+
          |
          |
          +-----+   +-----------+     +--------+     +--------+    
          |               |                  |  API    |       |  Cache  |
+---------------+   +---------------+   | Servers |       |  Services|
|  Rate Limiter  |           |              |  (Rate   |       |  (Cache) |
|  (Backend)     |           |  (Frontend)|  Thrott.|       |  Storage)
+---------------+
```

## Components
-------------

### API Gateway

* Handles incoming HTTP requests from clients via the load balancer.
* Validates request headers and ensures consistent API versioning.

### Rate Limiter (Backend)

* Implemented as a distributed rate limiter using the following strategies:
	1. **Token Bucket**: Allocate limited window-based slots to users; available for immediate usage.
	2. **Fixed Window with Leaky Buckets**: Limit number of requests in each time window, allowing gradual increase due to `leakage`.

### API Server

* Receives validated requests from the rate limiter;
* Serves static resources when cache is present and passes on cached responses else
* Calls external services for dynamic data.

## Implementation Details
---------------------

### Rate Limiter (Backend)

```python
import time

class TokenBucket:
    def __init__(self, max_requests, time_window):
        self.max_requests = max_requests
        self.time_window = time_window
        self.remaining_tokens = max_requests
        self.last_reset_time = time.time()

    def get_token(self):
        now = time.time()
        if now - self.last_reset_time < self.time_window / 2: # If less than half-time window, return existing tokens
            self.remaining_tokens = min(max(0, self.max_requests - (self.remaining_tokens // (self.time_window // 2))), self.max_requests)
        else:
            self.remaining_tokens = max(0, self.max_requests)  # Reset remaining tokens to maximum

        new_token = min(self.max_requests, self.remaining_tokens + 1)
        self.last_reset_time = now
        return new_token - 1

# Rate Limiter Configuration
config = {
    'max_requests': 100,
    'time_window': 60  # minutes

}

# Initialize and run rate limiter
rate_limiter = TokenBucket(config['max_requests'], config['time_window'])
```
### API Server Handling

* Returns cached response with proper cache headers.
* If response is not cached in cache storage, makes request on the load balancer to pass through rate limit before calling to external services.

```
from flask import Flask, jsonify
app = Flask(__name__)

# Load Rate Limiter Configs
config = {
    'max_requests': 500,
    'time_window': 1   # minutes
}

rate_limiter_config: TokenBucket = config


@app.route('/')
def get_index():
    global rate_limiter_config

    if request.headers.get("Authorization"):
        user_id = request.headers["Authorization"].strip().split(":")[-1]
        access_tocken = requests.post(f'{GET_ACCESS_TOKEN_URL}', headers={"Content-Type": "application/json"}, json={'user_id': user_id})

        # If no returned token return an appropriate error response.
        if 'auth_token' in access_tocken:
            response['headers']['token'] = access_tocken.json['_token']
            
    resp  = rate_limiter_config.get_token()
    
     *return jsonify({"success": True}), responses.HTTP_401_UNAUTHORIZED
    
    # ... rest of server code here
```

## Potential Bottlenecks and Improvements
--------------------------------------

*   **Cache**: If cache doesn't become full due to constant increase in access tokens from clients, further improvements would be necessary.
*   **External calls**: Reducing external service call latency can make the system performant for long-running processes or heavy request volume scenarios.
*   **Security Considerations**: Protect sensitive data transmitted during authentication
*   **Additional Features**: Incorporate a retry mechanism when an error occurs to prevent excessive backlogs.



In [None]:
# With optional context (scale, constraints)
ask_system_design(
    "Design a chat system like Slack",
    model=MODEL_GPT,
    context="Scale: 50M DAU, 10B messages/day. Focus on real-time delivery and message ordering."
)