# API Fundamentals - Advanced Retry Strategies

This notebook focuses on two critical production-grade techniques for handling API failures:

1. **Exponential Backoff with Jitter** - Prevents retry storms
2. **Circuit Breaker Pattern** - Fails fast during outages

We'll use the OpenAI API as our running example throughout.

## Learning Goals

After this lesson, you will be able to:

- [ ] Implement exponential backoff with jitter to prevent retry storms
- [ ] Build a circuit breaker to handle API outages gracefully
- [ ] Combine both patterns for production-ready error handling
- [ ] Apply these techniques to OpenAI API calls

## Setup

First, let's import the necessary libraries and set up our OpenAI client.

In [None]:
import os
import time
import random
from openai import OpenAI
from typing import Optional, Callable, Any

# Initialize OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Default model
MODEL = "gpt-4o-mini"

print("✅ Setup complete!")

---

## 1. Exponential Backoff with Jitter

### The Problem: Retry Storms

When many clients fail at the same moment (for example, during an API outage) and all follow the same backoff schedule, their retries stay synchronized:

- All retry after 1 second → API gets hammered again
- All retry after 2 seconds → API gets hammered again
- All retry after 4 seconds → API gets hammered again

This is called a **retry storm**: each synchronized wave of retries hits the API while it is still struggling, which can make the outage worse.

### The Solution: Add Randomness (Jitter)

Jitter adds a random component to each client's wait time, so retries are spread across an interval instead of all arriving at once.
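Written out, the schedule used by the code in this section looks like the following. The symbols are introduced here for illustration only: $b$ is the base delay, $M$ the maximum delay, $n$ the 0-indexed attempt, $d_n$ the capped exponential delay, and $w_n$ the wait actually used.

$$
d_n = \min\bigl(M,\; b \cdot 2^{n}\bigr), \qquad w_n \sim \mathrm{Uniform}(0,\, d_n)
$$

With the notebook's defaults ($b = 1$, $M = 60$), the first five caps are $d_0, \dots, d_4 = 1, 2, 4, 8, 16$ seconds, and the average jittered wait is $d_n / 2$.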

### Basic Exponential Backoff (Without Jitter)

Let's first see what happens without jitter:

In [None]:
def exponential_backoff_no_jitter(attempt, base_delay=1, max_delay=60):
    """Calculate backoff time WITHOUT jitter."""
    delay = base_delay * (2 ** attempt)
    delay = min(delay, max_delay)
    return delay

# Simulate multiple clients retrying
print("Without jitter (all clients retry at same time):")
print("-" * 50)
for attempt in range(5):
    delay = exponential_backoff_no_jitter(attempt)
    print(f"Attempt {attempt}: All clients wait {delay} seconds → All retry together!")
    print(f"  ⚠️  Problem: Retry storm at {delay}s mark")

### Exponential Backoff WITH Jitter

Now let's add jitter to spread out the retries:

In [None]:
def exponential_backoff_with_jitter(attempt, base_delay=1, max_delay=60):
    """
    Calculate backoff time with jitter.
    
    Args:
        attempt: Current retry attempt (0-indexed)
        base_delay: Base delay in seconds
        max_delay: Maximum delay in seconds
    
    Returns:
        Wait time in seconds with jitter applied
    """
    # Calculate exponential delay: base_delay * (2 ^ attempt)
    delay = base_delay * (2 ** attempt)
    
    # Cap at max_delay
    delay = min(delay, max_delay)
    
    # Apply "full jitter": pick a uniformly random wait in [0, delay).
    # This spreads out retries so they don't all happen at once
    jittered_delay = delay * random.random()
    
    return jittered_delay

# Simulate multiple clients retrying with jitter
print("With jitter (clients retry at different times):")
print("-" * 50)
for attempt in range(5):
    # Simulate 3 clients retrying
    delays = [exponential_backoff_with_jitter(attempt) for _ in range(3)]
    print(f"Attempt {attempt}: Clients wait {[f'{d:.2f}' for d in delays]} seconds")
    print("  ✅ Retries spread out - no storm!")
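The function above randomizes over the whole range from 0 to the capped delay, which is often called "full jitter". A common alternative, "equal jitter", keeps half of the delay fixed and randomizes only the other half, so a retry never fires almost immediately. The sketch below is one way to write it; the name `equal_jitter_backoff` and the `rng` parameter are assumptions made for this example.

```python
import random

def equal_jitter_backoff(attempt, base_delay=1.0, max_delay=60.0, rng=random):
    """Equal jitter: half the capped delay is fixed, the other half is random."""
    capped = min(base_delay * (2 ** attempt), max_delay)
    half = capped / 2
    # The result always lies between capped/2 and capped
    return half + rng.uniform(0, half)

for attempt in range(5):
    print(f"Attempt {attempt}: wait {equal_jitter_backoff(attempt):.2f}s")
```

The trade-off is a guaranteed minimum wait at the cost of a narrower spread than full jitter.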

### Visual Comparison

Let's visualize the difference:

In [None]:
import matplotlib.pyplot as plt

# Simulate 10 clients retrying
attempts = list(range(5))
num_clients = 10

# Without jitter
no_jitter_times = []
for attempt in attempts:
    delay = exponential_backoff_no_jitter(attempt)
    no_jitter_times.extend([delay] * num_clients)

# With jitter
jitter_times = []
for attempt in attempts:
    delays = [exponential_backoff_with_jitter(attempt) for _ in range(num_clients)]
    jitter_times.extend(delays)

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Without jitter
ax1.hist(no_jitter_times, bins=20, edgecolor='black', alpha=0.7, color='red')
ax1.set_title('Without Jitter (Retry Storm!)\nAll clients retry at same times', fontsize=12)
ax1.set_xlabel('Wait Time (seconds)')
ax1.set_ylabel('Number of Retries')
ax1.grid(True, alpha=0.3)

# With jitter
ax2.hist(jitter_times, bins=20, edgecolor='black', alpha=0.7, color='green')
ax2.set_title('With Jitter (Spread Out)\nRetries distributed over time', fontsize=12)
ax2.set_xlabel('Wait Time (seconds)')
ax2.set_ylabel('Number of Retries')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Notice how jitter spreads out the retries, preventing storms!")
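The same comparison can be made numerically rather than visually. The sketch below counts how many clients fall into the busiest 100 ms window for a single retry attempt. The helper names `backoff` and `peak_collisions`, the bucket size, and the seed are all assumptions chosen for this illustration.

```python
import random
from collections import Counter

def backoff(attempt, base_delay=1.0, max_delay=60.0, jitter=True, rng=random):
    """Capped exponential delay, optionally with full jitter."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    return delay * rng.random() if jitter else delay

def peak_collisions(num_clients, attempt, jitter, bucket=0.1, seed=0):
    """Return the number of clients in the most crowded time bucket."""
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    waits = [backoff(attempt, jitter=jitter, rng=rng) for _ in range(num_clients)]
    buckets = Counter(int(w / bucket) for w in waits)
    return max(buckets.values())

no_jitter_peak = peak_collisions(100, attempt=3, jitter=False)
jitter_peak = peak_collisions(100, attempt=3, jitter=True)
print(f"Busiest 100 ms window without jitter: {no_jitter_peak} clients")
print(f"Busiest 100 ms window with jitter:    {jitter_peak} clients")
```

Without jitter every client lands in the same window, so the peak equals the number of clients. With jitter the peak should be a small fraction of that.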

### Implementing Jittered Retry with OpenAI

Now let's create a practical wrapper for OpenAI API calls with jittered retry:

In [None]:
def call_openai_with_jittered_retry(
    prompt: str,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0
) -> str:
    """
    Call OpenAI API with exponential backoff and jitter.
    
    Args:
        prompt: The prompt to send to OpenAI
        max_retries: Maximum number of retry attempts
        base_delay: Base delay for exponential backoff
        max_delay: Maximum delay between retries
    
    Returns:
        The response content from OpenAI
    """
    for attempt in range(max_retries + 1):
        try:
            response = client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        
        except Exception as e:
            if attempt == max_retries:
                print(f"❌ Failed after {max_retries} retries: {e}")
                raise
            
            # Calculate wait time with jitter
            wait_time = exponential_backoff_with_jitter(attempt, base_delay, max_delay)
            
            print(f"⚠️  Attempt {attempt + 1} failed: {type(e).__name__}")
            print(f"   Retrying in {wait_time:.2f} seconds...")
            
            time.sleep(wait_time)

print("✅ Jittered retry function ready!")
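The wrapper above retries on any `Exception`, which also retries errors that will never succeed, such as an invalid API key. A possible refinement is to retry only errors known to be transient. The sketch below uses two hypothetical exception classes, `TransientError` and `PermanentError`, so that it runs without network access. With the OpenAI SDK you would pass its own classes instead (for example `openai.RateLimitError` or `openai.APIConnectionError`); which ones to treat as retryable is an assumption you should check against your SDK version.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical stand-in for a retryable error (rate limit, timeout)."""

class PermanentError(Exception):
    """Hypothetical stand-in for a non-retryable error (bad credentials)."""

def retry_transient(func, max_retries=3, base_delay=1.0, max_delay=60.0,
                    retryable=(TransientError,), sleep=time.sleep):
    """Retry func with full jitter, but only for exceptions listed in retryable."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retryable:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(base_delay * (2 ** attempt), max_delay)
            sleep(delay * random.random())
        # Any other exception type is not caught and propagates immediately
```

Passing `sleep` as a parameter is a design choice made here so the function can be exercised in tests without real waiting.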

### Testing the Jittered Retry

Let's test it with a real OpenAI call:

In [None]:
# Test with a simple prompt
prompt = "Say hello in one sentence."

try:
    response = call_openai_with_jittered_retry(prompt, max_retries=3)
    print("\n✅ Success!")
    print(f"Response: {response}")
except Exception as e:
    print(f"\n❌ Error: {e}")

### Simulating Failures to See Jitter in Action

Let's create a mock function that simulates failures to see how jitter works:

In [None]:
class SimulatedAPIFailure:
    """Simulate API failures for testing retry logic."""
    
    def __init__(self, fail_count=2):
        self.fail_count = fail_count
        self.call_count = 0
    
    def call(self):
        """Simulate an API call that fails a few times then succeeds."""
        self.call_count += 1
        if self.call_count <= self.fail_count:
            raise Exception(f"Simulated API failure (attempt {self.call_count})")
        return "Success! API call worked."

def call_with_jittered_retry(func, max_retries=3):
    """Generic retry wrapper with jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries:
                raise
            
            wait_time = exponential_backoff_with_jitter(attempt)
            print(f"  Attempt {attempt + 1} failed: {e}")
            print(f"  ⏳ Waiting {wait_time:.2f} seconds (with jitter)...")
            time.sleep(wait_time)

# Test with simulated failures
print("Testing jittered retry with simulated failures:")
print("=" * 60)

simulator = SimulatedAPIFailure(fail_count=2)

start_time = time.time()
result = call_with_jittered_retry(simulator.call, max_retries=3)
elapsed = time.time() - start_time

print(f"\n✅ {result}")
print(f"⏱️  Total time: {elapsed:.2f} seconds")
print("\n💡 Notice the random wait times - this prevents retry storms!")

---

## 2. Circuit Breaker Pattern

The circuit breaker prevents your app from repeatedly calling a failing service. Think of it like an electrical circuit breaker that "trips" when there's too much load.

### Three States:

1. **Closed** (Normal): Requests go through normally
2. **Open** (Tripped): Too many failures, stop sending requests (fail fast)
3. **Half-Open** (Testing): After a timeout, try one request to see if the service has recovered
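The transitions between these three states can be written down as a small lookup table. This is only a sketch of the state machine, not the implementation built later in this notebook; the event names such as `"threshold_reached"` are hypothetical labels chosen for this example.

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

# (current state, event) -> next state
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,    # too many failures
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,   # time to probe the service
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED, # service recovered
    (State.HALF_OPEN, "probe_failed"): State.OPEN,      # still down, wait again
}

def next_state(state, event):
    """Return the next state, or stay in the current one for unlisted events."""
    return TRANSITIONS.get((state, event), state)

state = State.CLOSED
for event in ["threshold_reached", "timeout_elapsed", "probe_succeeded"]:
    state = next_state(state, event)
    print(f"{event} -> {state.value}")
```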

### Circuit Breaker Implementation

Let's build a circuit breaker class:

In [None]:
class CircuitBreaker:
    """Simple circuit breaker implementation."""
    
    def __init__(self, failure_threshold=5, timeout=60):
        """
        Initialize circuit breaker.
        
        Args:
            failure_threshold: Number of failures before opening circuit
            timeout: Seconds to wait before trying again (half-open state)
        """
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'closed'  # closed, open, half_open
    
    def call(self, func):
        """
        Call a function through the circuit breaker.
        
        Args:
            func: Function to call
            
        Returns:
            Function result
        """
        # Check if we should try again after being open
        if self.state == 'open':
            if time.time() - self.last_failure_time >= self.timeout:
                print("🔄 Circuit half-open, trying one request...")
                self.state = 'half_open'
            else:
                remaining = self.timeout - (time.time() - self.last_failure_time)
                raise Exception(
                    f"Circuit breaker is OPEN - failing fast. "
                    f"Try again in {remaining:.1f} seconds."
                )
        
        try:
            result = func()
            
            # Success! Reset the circuit breaker
            if self.state == 'half_open':
                print("✅ Request succeeded, closing circuit")
            
            self.failure_count = 0
            self.state = 'closed'
            return result
            
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()
            
            if self.failure_count >= self.failure_threshold:
                print(f"\n🔴 Circuit breaker OPEN after {self.failure_count} failures")
                print(f"   State: {self.state} → OPEN")
                print(f"   Will try again in {self.timeout} seconds")
                self.state = 'open'
            
            raise
    
    def get_status(self):
        """Get current circuit breaker status."""
        return {
            'state': self.state,
            'failure_count': self.failure_count,
            'threshold': self.failure_threshold
        }

print("✅ Circuit breaker class ready!")

### Testing the Circuit Breaker

Let's test it with simulated failures:

In [None]:
class FailingService:
    """A service that fails multiple times then recovers."""
    
    def __init__(self, fail_until=5):
        self.call_count = 0
        self.fail_until = fail_until
    
    def call(self):
        """Simulate a service call."""
        self.call_count += 1
        if self.call_count <= self.fail_until:
            raise Exception(f"Service failure (call {self.call_count})")
        return f"Success! (call {self.call_count})"

# Create circuit breaker and service.
# The timeout is kept short so the demo reaches the half-open state, and the
# service recovers after 3 failures so the probe request can succeed.
breaker = CircuitBreaker(failure_threshold=3, timeout=1.2)
service = FailingService(fail_until=3)

print("Testing circuit breaker with failing service:")
print("=" * 60)

# Make several calls
for i in range(8):
    print(f"\n--- Request {i+1} ---")
    status = breaker.get_status()
    print(f"Circuit state: {status['state']}, Failures: {status['failure_count']}/{status['threshold']}")
    
    try:
        result = breaker.call(service.call)
        print(f"✅ {result}")
    except Exception as e:
        print(f"❌ {e}")
    
    time.sleep(0.5)  # Small delay between calls

print("\n" + "=" * 60)
print("💡 Notice how the circuit opens after 3 failures, fails fast until")
print("   the timeout passes, then closes again once a probe request succeeds!")
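Demos and tests that wait on a real timeout are slow and timing-sensitive. One option is to pass the clock in as a parameter so a test can advance time instantly. The sketch below uses `TinyBreaker`, `CircuitOpenError`, and `FakeClock`, all hypothetical names for a reduced variant written for this example; it is not a drop-in replacement for the `CircuitBreaker` class above.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class TinyBreaker:
    def __init__(self, failure_threshold=3, timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.clock = clock          # injectable time source
        self.failure_count = 0
        self.opened_at = None
        self.state = "closed"

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.timeout:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half_open"  # timeout elapsed: allow one probe
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            # A failed probe, or too many failures, (re)opens the circuit
            if self.state == "half_open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failure_count = 0
        self.state = "closed"
        return result

class FakeClock:
    """Manually advanced clock for tests."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def advance(self, seconds):
        self.now += seconds

def boom():
    raise RuntimeError("service down")

clock = FakeClock()
breaker = TinyBreaker(failure_threshold=2, timeout=30, clock=clock)
for _ in range(2):
    try:
        breaker.call(boom)
    except RuntimeError:
        pass
print(breaker.state)        # open
clock.advance(31)           # skip past the timeout without sleeping
print(breaker.call(lambda: "ok"), breaker.state)  # ok closed
```

Using a dedicated exception type for rejected calls also lets callers tell "the circuit rejected this" apart from "the service failed".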

### Circuit Breaker with OpenAI

Now let's integrate the circuit breaker with OpenAI API calls:

In [None]:
def call_openai_with_circuit_breaker(
    prompt: str,
    circuit_breaker: CircuitBreaker
) -> str:
    """
    Call OpenAI API through a circuit breaker.
    
    Args:
        prompt: The prompt to send to OpenAI
        circuit_breaker: CircuitBreaker instance
    
    Returns:
        The response content from OpenAI
    """
    def api_call():
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    
    return circuit_breaker.call(api_call)

# Create a circuit breaker for OpenAI
openai_breaker = CircuitBreaker(failure_threshold=3, timeout=30)

print("✅ OpenAI circuit breaker wrapper ready!")
print(f"\nCircuit breaker status: {openai_breaker.get_status()}")

### Testing OpenAI with Circuit Breaker

Let's test it (this will work if you have a valid API key):

In [None]:
# Test with a simple prompt
prompt = "Say hello in one sentence."

try:
    response = call_openai_with_circuit_breaker(prompt, openai_breaker)
    print("\n✅ Success!")
    print(f"Response: {response}")
    print(f"\nCircuit status: {openai_breaker.get_status()}")
except Exception as e:
    print(f"\n❌ Error: {e}")
    print(f"Circuit status: {openai_breaker.get_status()}")

---

## 3. Combining Both Patterns

For production systems, you'll want to combine both patterns:

- **Circuit breaker** to fail fast during outages
- **Jittered retry** for transient failures when circuit is closed

Let's build a complete solution:

In [None]:
class ProductionAPIClient:
    """Production-ready API client with circuit breaker and jittered retry."""
    
    def __init__(
        self,
        failure_threshold=5,
        circuit_timeout=60,
        max_retries=3,
        base_delay=1.0,
        max_delay=60.0
    ):
        self.circuit_breaker = CircuitBreaker(failure_threshold, circuit_timeout)
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay
    
    def call(self, prompt: str) -> str:
        """
        Call OpenAI API with circuit breaker and jittered retry.
        
        Args:
            prompt: The prompt to send to OpenAI
        
        Returns:
            The response content from OpenAI
        """
        def api_call():
            return client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": prompt}]
            ).choices[0].message.content
        
        # Retry transient failures with jitter, but stop as soon as the
        # circuit opens so we fail fast instead of sleeping through an outage.
        for attempt in range(self.max_retries + 1):
            try:
                return self.circuit_breaker.call(api_call)
            except Exception as e:
                # Circuit is open (it just tripped, or it rejected this call):
                # fail fast, don't retry
                if self.circuit_breaker.state == 'open':
                    raise Exception(f"Circuit breaker is open: {e}") from e
                
                # Out of retries: surface the last error
                if attempt == self.max_retries:
                    raise
                
                # Circuit is still closed, so treat this as a transient failure
                wait_time = exponential_backoff_with_jitter(
                    attempt, self.base_delay, self.max_delay
                )
                
                print(f"  ⚠️  Retry {attempt + 1}/{self.max_retries} in {wait_time:.2f}s...")
                time.sleep(wait_time)
    
    def get_status(self):
        """Get current status of circuit breaker."""
        return self.circuit_breaker.get_status()

# Create production client
production_client = ProductionAPIClient(
    failure_threshold=3,
    circuit_timeout=30,
    max_retries=3
)

print("✅ Production API client ready!")
print(f"\nStatus: {production_client.get_status()}")

### Testing the Combined Solution

Let's test the complete solution:

In [None]:
# Test with a real prompt
prompt = "Explain what a circuit breaker is in one sentence."

try:
    response = production_client.call(prompt)
    print("\n✅ Success!")
    print(f"Response: {response}")
    print(f"\nCircuit status: {production_client.get_status()}")
except Exception as e:
    print(f"\n❌ Error: {e}")
    print(f"Circuit status: {production_client.get_status()}")

---

## Summary

You've learned two critical production-grade techniques:

### Exponential Backoff with Jitter
- ✅ Prevents retry storms by spreading out retry attempts
- ✅ Uses random jitter to avoid synchronized retries
- ✅ Essential for handling transient failures

### Circuit Breaker Pattern
- ✅ Fails fast when service is down
- ✅ Three states: Closed → Open → Half-Open
- ✅ Prevents cascading failures

### Combined Approach
- ✅ Circuit breaker for outages (fail fast)
- ✅ Jittered retry for transient failures (when circuit is closed)
- ✅ Production-ready error handling

**Key Takeaways:**

1. **Jitter is essential** - Without it, retry storms can make outages worse
2. **Circuit breakers protect your app** - Fail fast instead of waiting
3. **Combine both** - Use circuit breakers for outages, jittered retry for transient failures
4. **Start simple** - These patterns are powerful but don't over-engineer

**Next Steps:**
- Practice implementing these patterns in your own projects
- Experiment with different thresholds and timeouts
- Monitor your circuit breaker states in production
- Consider adding metrics/logging to track retry patterns
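
As a starting point for the monitoring and logging suggestions, state changes can be recorded through a small helper that the breaker calls whenever its state changes. `TransitionLog` and the logger name below are hypothetical, chosen for this sketch; a real system would more likely forward these events to its existing metrics pipeline.

```python
import logging

logger = logging.getLogger("circuit_breaker")

class TransitionLog:
    """Records circuit state changes (hypothetical helper for illustration)."""

    def __init__(self):
        self.history = []  # list of (old_state, new_state, failure_count)

    def record(self, old_state, new_state, failure_count):
        if old_state == new_state:
            return  # nothing changed, nothing to log
        self.history.append((old_state, new_state, failure_count))
        logger.warning("circuit %s -> %s (failures=%d)",
                       old_state, new_state, failure_count)

log = TransitionLog()
log.record("closed", "open", 3)
log.record("open", "open", 3)       # ignored: no transition
log.record("open", "half_open", 3)
print(log.history)
```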

## Exercises

Try these exercises to solidify your understanding:

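1. **Adjust Jitter Parameters**: The `exponential_backoff_with_jitter` function already uses full jitter (a random wait between 0 and the capped delay). Modify it to use a different strategy (e.g., equal jitter or decorrelated jitter) and compare the spread of wait times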

2. **Circuit Breaker Monitoring**: Add logging/metrics to track when the circuit opens and closes

3. **Rate Limit Integration**: Combine rate limiting with the circuit breaker to handle 429 errors

4. **Batch Processing**: Use the production client to process multiple prompts concurrently with proper error handling
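
For the batch-processing exercise, one possible starting point is a thread pool that collects a result or an error per prompt, so a single failure does not abort the whole batch. In the sketch below, `process_batch` and `call_fn` are hypothetical names; `call_fn` stands in for something like `production_client.call`. Note that the `CircuitBreaker` class in this notebook updates its counters without a lock, so sharing one instance across threads would need extra synchronization (for example a `threading.Lock`).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_batch(prompts, call_fn, max_workers=4):
    """Run call_fn on every prompt concurrently; return (status, value) per prompt."""
    results = [None] * len(prompts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the position of its prompt
        futures = {pool.submit(call_fn, p): i for i, p in enumerate(prompts)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = ("ok", future.result())
            except Exception as exc:
                results[index] = ("error", str(exc))
    return results

def fake_call(prompt):
    """Stand-in for a real API call, used so the example runs offline."""
    if prompt == "bad":
        raise ValueError("simulated failure")
    return prompt.upper()

for status, value in process_batch(["hello", "bad", "world"], fake_call):
    print(status, value)
```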