# Lab-2.4 Part 1: Production Architecture Design

## Objectives
- Design a production-grade LLM service architecture
- Plan resource requirements and capacity
- Select an appropriate technology stack
- Estimate costs and performance

## Estimated Time: 60-120 minutes

---
## 1. Production Requirements Analysis

In [None]:
# Production requirements specification
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from typing import Dict, List, Optional

print("Production Architecture Design Tool")
print("=" * 50)

In [None]:
@dataclass
class ProductionRequirements:
    """
    Production requirements specification
    """
    # Traffic requirements
    peak_requests_per_second: int
    average_requests_per_second: int
    daily_request_volume: int
    
    # Performance requirements
    max_latency_p95: float  # milliseconds
    max_latency_p99: float  # milliseconds
    target_availability: float  # 99.9% = 0.999
    
    # Business requirements
    budget_monthly_usd: float
    geographic_regions: List[str]
    compliance_requirements: List[str]
    
    # Model requirements
    model_size: str  # "7B", "13B", "70B"
    max_context_length: int
    average_response_tokens: int

# Example production requirements
requirements = ProductionRequirements(
    peak_requests_per_second=1000,
    average_requests_per_second=200,
    daily_request_volume=17_280_000,  # 200 RPS * 86400 seconds
    
    max_latency_p95=500.0,  # 500ms
    max_latency_p99=1000.0,  # 1s
    target_availability=0.999,  # 99.9%
    
    budget_monthly_usd=50_000,
    geographic_regions=["us-east-1", "eu-west-1", "ap-southeast-1"],
    compliance_requirements=["GDPR", "SOC2", "HIPAA"],
    
    model_size="7B",
    max_context_length=4096,
    average_response_tokens=150
)

print("Production Requirements:")
print(f"Peak RPS: {requirements.peak_requests_per_second:,}")
print(f"Daily volume: {requirements.daily_request_volume:,} requests")
print(f"P95 latency: {requirements.max_latency_p95}ms")
print(f"Availability: {requirements.target_availability*100}%")
print(f"Budget: ${requirements.budget_monthly_usd:,}/month")
print(f"Model: {requirements.model_size}")
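
The traffic and availability figures above are related to each other, and they can drift out of sync when edited by hand. The following is a minimal sketch of a consistency check, not part of the lab code: `check_traffic_consistency` and `downtime_budget_minutes` are hypothetical helper names, and the 30-day month is an assumption chosen to match the cost section.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumed 30-day month

def check_traffic_consistency(peak_rps: int, average_rps: int, daily_volume: int) -> dict:
    """Hypothetical helper: cross-check the three traffic figures."""
    expected_daily = average_rps * SECONDS_PER_DAY
    return {
        "expected_daily_volume": expected_daily,
        "daily_volume_matches": daily_volume == expected_daily,
        "peak_to_average_ratio": peak_rps / average_rps,
    }

def downtime_budget_minutes(target_availability: float) -> float:
    """Hypothetical helper: allowed downtime per 30 days for an availability target."""
    return (1 - target_availability) * MINUTES_PER_30_DAYS

# Values copied from the example requirements above
report = check_traffic_consistency(peak_rps=1000, average_rps=200, daily_volume=17_280_000)
budget = downtime_budget_minutes(0.999)

print(f"Daily volume consistent: {report['daily_volume_matches']}")
print(f"Peak-to-average ratio: {report['peak_to_average_ratio']:.1f}x")
print(f"Downtime budget: {budget:.1f} minutes per 30 days")
```

A high peak-to-average ratio is a hint that a statically sized fleet will sit idle for much of the day, which matters for the capacity planning later in this lab.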

---
## 2. Resource Estimation

In [None]:
class ResourceCalculator:
    """
    Calculate required resources for production deployment
    """
    
    def __init__(self):
        # Model resource requirements (per instance)
        self.model_specs = {
            "7B": {
                "gpu_memory_gb": 16,
                "cpu_cores": 8,
                "ram_gb": 32,
                "throughput_tokens_per_sec": 2000,
                "concurrent_requests": 32
            },
            "13B": {
                "gpu_memory_gb": 32,
                "cpu_cores": 16,
                "ram_gb": 64,
                "throughput_tokens_per_sec": 1500,
                "concurrent_requests": 24
            },
            "70B": {
                "gpu_memory_gb": 160,  # Multi-GPU setup
                "cpu_cores": 32,
                "ram_gb": 128,
                "throughput_tokens_per_sec": 800,
                "concurrent_requests": 16
            }
        }
    
    def calculate_required_instances(self, requirements: ProductionRequirements) -> Dict:
        """
        Calculate number of instances needed
        """
        model_spec = self.model_specs[requirements.model_size]
        
        # Calculate tokens per second needed
        tokens_per_request = requirements.average_response_tokens
        total_tokens_per_sec = requirements.peak_requests_per_second * tokens_per_request
        
        # Calculate instances needed based on throughput
        instances_for_throughput = np.ceil(
            total_tokens_per_sec / model_spec["throughput_tokens_per_sec"]
        )
        
        # Calculate instances needed based on concurrent requests
        # (dividing RPS by slots assumes each request holds a slot for ~1 second)
        instances_for_concurrency = np.ceil(
            requirements.peak_requests_per_second / model_spec["concurrent_requests"]
        )
        
        # Take the maximum
        min_instances = max(instances_for_throughput, instances_for_concurrency)
        
        # Add buffer for availability (20% overhead), rounding up so the
        # buffer is never lost to truncation
        recommended_instances = int(np.ceil(min_instances * 1.2))
        
        return {
            "min_instances": int(min_instances),
            "recommended_instances": recommended_instances,
            "tokens_per_sec_needed": int(total_tokens_per_sec),
            "tokens_per_sec_capacity": int(recommended_instances * model_spec["throughput_tokens_per_sec"]),
            "utilization_percentage": (min_instances / recommended_instances) * 100
        }
    
    def calculate_total_resources(self, requirements: ProductionRequirements) -> Dict:
        """
        Calculate total resource requirements
        """
        model_spec = self.model_specs[requirements.model_size]
        sizing = self.calculate_required_instances(requirements)
        instances = sizing["recommended_instances"]
        
        return {
            "instances": instances,
            "total_gpu_memory_gb": instances * model_spec["gpu_memory_gb"],
            "total_cpu_cores": instances * model_spec["cpu_cores"],
            "total_ram_gb": instances * model_spec["ram_gb"],
            "total_throughput_tokens_per_sec": instances * model_spec["throughput_tokens_per_sec"],
            "total_concurrent_requests": instances * model_spec["concurrent_requests"]
        }

calculator = ResourceCalculator()
sizing = calculator.calculate_required_instances(requirements)
resources = calculator.calculate_total_resources(requirements)

print("Resource Estimation:")
print(f"Minimum instances: {sizing['min_instances']}")
print(f"Recommended instances: {sizing['recommended_instances']}")
print(f"Expected utilization: {sizing['utilization_percentage']:.1f}%")
print()
print("Total Resources:")
print(f"GPU Memory: {resources['total_gpu_memory_gb']} GB")
print(f"CPU Cores: {resources['total_cpu_cores']}")
print(f"RAM: {resources['total_ram_gb']} GB")
print(f"Throughput Capacity: {resources['total_throughput_tokens_per_sec']:,} tokens/sec")
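
The concurrency check in `calculate_required_instances` divides peak RPS by the per-instance concurrent-request limit, which is only accurate if each request occupies a slot for about one second. Little's law (in-flight requests = arrival rate × average time in system) makes that dependency explicit. The sketch below is illustrative only: the function name is hypothetical and the latencies are assumed values, not measurements.

```python
import math

def instances_for_concurrency(peak_rps: float, avg_latency_s: float,
                              slots_per_instance: int) -> int:
    """Little's law: in-flight requests = arrival rate * average time in system."""
    in_flight = peak_rps * avg_latency_s
    return math.ceil(in_flight / slots_per_instance)

# 7B spec from the table above: 32 concurrent requests per instance
for latency_s in (0.5, 1.0, 2.0):  # assumed average latencies, for illustration
    n = instances_for_concurrency(1000, latency_s, 32)
    print(f"avg latency {latency_s:.1f}s: {n} instances needed for concurrency")
```

With these example figures the throughput constraint still dominates, but the concurrency requirement grows linearly with latency, so it is worth re-running with measured latencies once they are available.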

---
## 3. Architecture Patterns

In [None]:
class ArchitecturePattern:
    """
    Different production architecture patterns
    """
    
    @staticmethod
    def single_node_pattern():
        return {
            "name": "Single Node",
            "description": "Single server with a multi-GPU setup",
            "pros": [
                "Simple deployment",
                "Low latency (no network hops)",
                "Easy debugging",
                "Cost effective for small scale"
            ],
            "cons": [
                "Single point of failure",
                "Limited scalability",
                "No geographic distribution",
                "Hardware limitations"
            ],
            "use_cases": [
                "Development/testing",
                "< 100 RPS",
                "Internal tools",
                "Proof of concept"
            ],
            "max_rps": 100
        }
    
    @staticmethod
    def horizontal_scaling_pattern():
        return {
            "name": "Horizontal Scaling",
            "description": "Multiple identical nodes behind load balancer",
            "pros": [
                "High availability",
                "Linear scalability",
                "Rolling updates",
                "Fault tolerance"
            ],
            "cons": [
                "More complex deployment",
                "Load balancer overhead",
                "State management issues",
                "Higher costs"
            ],
            "use_cases": [
                "Production services",
                "100-1000 RPS",
                "Customer-facing APIs",
                "24/7 availability"
            ],
            "max_rps": 1000
        }
    
    @staticmethod
    def microservices_pattern():
        return {
            "name": "Microservices",
            "description": "Separate services for different functions",
            "pros": [
                "Independent scaling",
                "Technology diversity",
                "Team autonomy",
                "Fault isolation"
            ],
            "cons": [
                "Complex networking",
                "Service discovery overhead",
                "Distributed tracing needed",
                "Higher operational complexity"
            ],
            "use_cases": [
                "Large organizations",
                "> 1000 RPS",
                "Multiple model types",
                "Different SLA requirements"
            ],
            "max_rps": 10000
        }
    
    @staticmethod
    def multi_region_pattern():
        return {
            "name": "Multi-Region",
            "description": "Deployed across multiple geographic regions",
            "pros": [
                "Global low latency",
                "Disaster recovery",
                "Compliance (data locality)",
                "Load distribution"
            ],
            "cons": [
                "Complex routing",
                "Data synchronization",
                "Higher costs",
                "Regulatory complexity"
            ],
            "use_cases": [
                "Global services",
                "Compliance requirements",
                "Mission critical systems",
                "Enterprise customers"
            ],
            "max_rps": 50000
        }

# Analyze which pattern fits our requirements
patterns = [
    ArchitecturePattern.single_node_pattern(),
    ArchitecturePattern.horizontal_scaling_pattern(),
    ArchitecturePattern.microservices_pattern(),
    ArchitecturePattern.multi_region_pattern()
]

print("Architecture Pattern Analysis:")
print(f"Required RPS: {requirements.peak_requests_per_second}")
print(f"Regions: {len(requirements.geographic_regions)}")
print(f"Compliance: {requirements.compliance_requirements}")
print()

suitable_patterns = []
for pattern in patterns:
    if pattern["max_rps"] >= requirements.peak_requests_per_second:
        suitable_patterns.append(pattern)
        print(f"✅ {pattern['name']}: {pattern['description']}")
        print(f"   Max RPS: {pattern['max_rps']}")
        print(f"   Use cases: {', '.join(pattern['use_cases'])}")
        print()
    else:
        print(f"❌ {pattern['name']}: Insufficient capacity ({pattern['max_rps']} < {requirements.peak_requests_per_second})")

# Recommend based on requirements
if len(requirements.geographic_regions) > 1:
    recommended = "Multi-Region"
elif requirements.peak_requests_per_second > 500:
    recommended = "Microservices"
elif requirements.peak_requests_per_second > 100:
    recommended = "Horizontal Scaling"
else:
    recommended = "Single Node"

print(f"🎯 Recommended Pattern: {recommended}")

---
## 4. Technology Stack Selection

In [None]:
class TechnologyStack:
    """
    Production technology stack options
    """
    
    @staticmethod
    def get_inference_engines():
        return {
            "vLLM": {
                "description": "High-performance inference with PagedAttention",
                "pros": ["Highest throughput", "Memory efficient", "Easy to use"],
                "cons": ["GPU only", "Limited model support"],
                "best_for": "High-throughput production services",
                "maturity": "Stable"
            },
            "TensorRT-LLM": {
                "description": "NVIDIA optimized inference engine",
                "pros": ["Extreme performance", "Advanced optimizations", "Enterprise support"],
                "cons": ["Complex setup", "NVIDIA GPU only", "Steep learning curve"],
                "best_for": "Maximum performance with NVIDIA hardware",
                "maturity": "Production ready"
            },
            "TGI": {
                "description": "HuggingFace Text Generation Inference",
                "pros": ["HF integration", "Wide model support", "Good documentation"],
                "cons": ["Lower performance", "Resource intensive"],
                "best_for": "Quick deployment with HuggingFace models",
                "maturity": "Stable"
            }
        }
    
    @staticmethod
    def get_container_platforms():
        return {
            "Kubernetes": {
                "description": "Industry standard container orchestration",
                "pros": ["Mature ecosystem", "Auto-scaling", "Service discovery", "Multi-cloud"],
                "cons": ["Complex", "Resource overhead", "Learning curve"],
                "best_for": "Production at scale",
                "cost_factor": 1.2
            },
            "Docker Swarm": {
                "description": "Simple Docker-native orchestration",
                "pros": ["Simple", "Docker native", "Low overhead"],
                "cons": ["Limited features", "Smaller ecosystem"],
                "best_for": "Small to medium deployments",
                "cost_factor": 1.0
            },
            "Nomad": {
                "description": "HashiCorp's orchestrator",
                "pros": ["Simple", "Flexible", "Multi-workload"],
                "cons": ["Smaller ecosystem", "Less mature"],
                "best_for": "HashiCorp stack environments",
                "cost_factor": 1.1
            }
        }
    
    @staticmethod
    def get_monitoring_stacks():
        return {
            "Prometheus + Grafana": {
                "description": "Open source monitoring stack",
                "pros": ["Free", "Powerful", "Large community", "Kubernetes native"],
                "cons": ["Setup complexity", "Storage scaling"],
                "cost_monthly": 0
            },
            "DataDog": {
                "description": "SaaS monitoring platform",
                "pros": ["Easy setup", "Rich features", "Great UI", "APM included"],
                "cons": ["Expensive", "Vendor lock-in"],
                "cost_monthly": 2000
            },
            "New Relic": {
                "description": "Application performance monitoring",
                "pros": ["APM focus", "AI insights", "Good alerting"],
                "cons": ["Expensive", "Complex pricing"],
                "cost_monthly": 1500
            }
        }

# Analyze technology choices
tech_stack = TechnologyStack()

print("Technology Stack Analysis:")
print("=" * 40)

# Inference Engine Selection
engines = tech_stack.get_inference_engines()
print("\n🚀 Inference Engines:")
for name, details in engines.items():
    print(f"\n{name}: {details['description']}")
    print(f"  Best for: {details['best_for']}")
    print(f"  Maturity: {details['maturity']}")

# Recommend inference engine
if requirements.peak_requests_per_second > 500:
    recommended_engine = "vLLM"  # High throughput
elif requirements.budget_monthly_usd > 30000:  # Assumed threshold: high budget = likely NVIDIA hardware
    recommended_engine = "TensorRT-LLM"  # Maximum performance
else:
    recommended_engine = "vLLM"  # Best balance

print(f"\n🎯 Recommended Inference Engine: {recommended_engine}")

# Container Platform Selection
platforms = tech_stack.get_container_platforms()
print("\n🐳 Container Platforms:")
for name, details in platforms.items():
    example_monthly_cost = details['cost_factor'] * 10000  # Example base cost of $10,000
    print(f"\n{name}: {details['description']}")
    print(f"  Best for: {details['best_for']}")
    print(f"  Cost factor: {details['cost_factor']}x (~${example_monthly_cost:,.0f}/month on the example base)")

# Recommend platform
if requirements.peak_requests_per_second > 1000 or len(requirements.geographic_regions) > 1:
    recommended_platform = "Kubernetes"
elif requirements.budget_monthly_usd < 20000:
    recommended_platform = "Docker Swarm"
else:
    recommended_platform = "Kubernetes"

print(f"\n🎯 Recommended Platform: {recommended_platform}")

# Monitoring Stack Selection
monitoring = tech_stack.get_monitoring_stacks()
print("\n📊 Monitoring Stacks:")
for name, details in monitoring.items():
    print(f"\n{name}: {details['description']}")
    print(f"  Monthly cost: ${details['cost_monthly']}")

# Recommend monitoring
if requirements.budget_monthly_usd > 30000:
    recommended_monitoring = "DataDog"
else:
    recommended_monitoring = "Prometheus + Grafana"

print(f"\n🎯 Recommended Monitoring: {recommended_monitoring}")

---
## 5. Cost Estimation

In [None]:
class CostCalculator:
    """
    Calculate production deployment costs
    """
    
    def __init__(self):
        # AWS pricing (approximate, varies by region)
        self.instance_pricing = {
            "g5.xlarge": {  # 1 x A10G (24GB)
                "gpu_memory_gb": 24,
                "hourly_on_demand": 1.006,
                "hourly_spot": 0.30,  # ~70% savings
                "vcpu": 4,
                "ram_gb": 16
            },
            "g5.2xlarge": {  # 1 x A10G (24GB)
                "gpu_memory_gb": 24,
                "hourly_on_demand": 1.212,
                "hourly_spot": 0.36,
                "vcpu": 8,
                "ram_gb": 32
            },
            "g5.4xlarge": {  # 1 x A10G (24GB)
                "gpu_memory_gb": 24,
                "hourly_on_demand": 1.624,
                "hourly_spot": 0.49,
                "vcpu": 16,
                "ram_gb": 64
            },
            "p3.2xlarge": {  # 1 x V100 (16GB)
                "gpu_memory_gb": 16,
                "hourly_on_demand": 3.06,
                "hourly_spot": 0.92,
                "vcpu": 8,
                "ram_gb": 61
            },
            "p4d.2xlarge": {  # Illustrative single A100 (40GB) entry; AWS offers p4d only as p4d.24xlarge (8 x A100)
                "gpu_memory_gb": 40,
                "hourly_on_demand": 3.83,
                "hourly_spot": 1.15,
                "vcpu": 8,
                "ram_gb": 96
            }
        }
    
    def select_optimal_instance(self, gpu_memory_needed: int) -> str:
        """
        Select the most cost-effective instance type
        """
        suitable_instances = [
            (name, spec) for name, spec in self.instance_pricing.items()
            if spec['gpu_memory_gb'] >= gpu_memory_needed
        ]
        
        if not suitable_instances:
            return "p4d.2xlarge"  # Fallback to largest
        
        # Sort by hourly spot price: each model replica needs one instance,
        # so the cheapest instance that fits the model wins
        sorted_instances = sorted(
            suitable_instances,
            key=lambda x: x[1]['hourly_spot']
        )
        
        return sorted_instances[0][0]
    
    def calculate_monthly_cost(self, requirements: ProductionRequirements, 
                             resources: Dict, use_spot: bool = True) -> Dict:
        """
        Calculate total monthly costs
        """
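        # Select instance type (look up the model spec directly instead of
        # relying on the notebook-level `calculator` variable)
        gpu_per_instance = ResourceCalculator().model_specs[requirements.model_size]["gpu_memory_gb"]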
        instance_type = self.select_optimal_instance(gpu_per_instance)
        instance_spec = self.instance_pricing[instance_type]
        
        # Calculate compute costs
        hourly_rate = instance_spec['hourly_spot'] if use_spot else instance_spec['hourly_on_demand']
        hours_per_month = 24 * 30  # 720 hours
        
        compute_cost_per_instance = hourly_rate * hours_per_month
        total_compute_cost = compute_cost_per_instance * resources['instances']
        
        # Additional costs
        load_balancer_cost = 25 * len(requirements.geographic_regions)  # ALB cost
        storage_cost = 100  # EBS for logs, configs
        data_transfer_cost = 200  # Estimated
        monitoring_cost = 200 if "Prometheus" in recommended_monitoring else 2000
        
        total_cost = (
            total_compute_cost + 
            load_balancer_cost + 
            storage_cost + 
            data_transfer_cost + 
            monitoring_cost
        )
        
        return {
            "instance_type": instance_type,
            "instances": resources['instances'],
            "hourly_rate_per_instance": hourly_rate,
            "compute_cost_monthly": total_compute_cost,
            "load_balancer_cost": load_balancer_cost,
            "storage_cost": storage_cost,
            "data_transfer_cost": data_transfer_cost,
            "monitoring_cost": monitoring_cost,
            "total_monthly_cost": total_cost,
            "cost_per_request": total_cost / requirements.daily_request_volume / 30,
            "budget_utilization": (total_cost / requirements.budget_monthly_usd) * 100
        }

# Calculate costs
cost_calc = CostCalculator()

# Compare spot vs on-demand
cost_spot = cost_calc.calculate_monthly_cost(requirements, resources, use_spot=True)
cost_on_demand = cost_calc.calculate_monthly_cost(requirements, resources, use_spot=False)

print("Cost Analysis:")
print("=" * 40)

print(f"\n📊 Resource Requirements:")
print(f"Instances needed: {resources['instances']}")
print(f"Instance type: {cost_spot['instance_type']}")
print(f"Total GPU memory: {resources['total_gpu_memory_gb']} GB")

print(f"\n💰 Cost Comparison:")
print(f"\nOn-Demand Pricing:")
print(f"  Hourly per instance: ${cost_on_demand['hourly_rate_per_instance']:.3f}")
print(f"  Monthly total: ${cost_on_demand['total_monthly_cost']:,.0f}")
print(f"  Budget utilization: {cost_on_demand['budget_utilization']:.1f}%")

print(f"\nSpot Instance Pricing:")
print(f"  Hourly per instance: ${cost_spot['hourly_rate_per_instance']:.3f}")
print(f"  Monthly total: ${cost_spot['total_monthly_cost']:,.0f}")
print(f"  Budget utilization: {cost_spot['budget_utilization']:.1f}%")
print(f"  Savings: ${cost_on_demand['total_monthly_cost'] - cost_spot['total_monthly_cost']:,.0f} ({(1 - cost_spot['total_monthly_cost']/cost_on_demand['total_monthly_cost'])*100:.0f}%)")

print(f"\n📈 Cost Breakdown (Spot):")
print(f"  Compute: ${cost_spot['compute_cost_monthly']:,.0f} ({cost_spot['compute_cost_monthly']/cost_spot['total_monthly_cost']*100:.0f}%)")
print(f"  Load balancer: ${cost_spot['load_balancer_cost']:,.0f}")
print(f"  Storage: ${cost_spot['storage_cost']:,.0f}")
print(f"  Data transfer: ${cost_spot['data_transfer_cost']:,.0f}")
print(f"  Monitoring: ${cost_spot['monitoring_cost']:,.0f}")

print(f"\n🎯 Cost per request: ${cost_spot['cost_per_request']:.6f}")

# Check if within budget
if cost_spot['budget_utilization'] <= 100:
    print(f"\n✅ Within budget (${requirements.budget_monthly_usd:,})")
else:
    print(f"\n❌ Over budget by ${cost_spot['total_monthly_cost'] - requirements.budget_monthly_usd:,.0f}")
    print("   Consider: a smaller model, fewer instances, or a cheaper instance type")
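
The comparison above treats the fleet as either all spot or all on-demand. Because spot capacity can be reclaimed, a 99.9% availability target may call for an on-demand baseline with spot on top. The sketch below estimates compute cost for such a mix. It assumes the 90 `g5.xlarge` instances that the example requirements work out to, uses the prices from the table above, and introduces `blended_compute_cost` and the on-demand fractions purely for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # same 720-hour month used by CostCalculator

def blended_compute_cost(instances: int, on_demand_fraction: float,
                         hourly_on_demand: float, hourly_spot: float) -> float:
    """Monthly compute cost when a share of the fleet runs on-demand."""
    on_demand_count = round(instances * on_demand_fraction)
    spot_count = instances - on_demand_count
    hourly_total = on_demand_count * hourly_on_demand + spot_count * hourly_spot
    return hourly_total * HOURS_PER_MONTH

# Assumed fleet: 90 x g5.xlarge at the table prices above
for fraction in (0.0, 0.3, 1.0):  # illustrative on-demand shares
    cost = blended_compute_cost(90, fraction, hourly_on_demand=1.006, hourly_spot=0.30)
    print(f"{fraction:.0%} on-demand: ${cost:,.0f}/month compute")
```

The figures cover compute only; load balancer, storage, data transfer, and monitoring costs would be added on top, as in `calculate_monthly_cost`.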

---
## 6. Capacity Planning

In [None]:
# Traffic pattern analysis
import matplotlib.pyplot as plt
import numpy as np

# Simulate daily traffic pattern
hours = np.arange(24)
base_traffic = requirements.average_requests_per_second

# Typical business hours pattern
traffic_multiplier = np.array([
    0.1, 0.05, 0.05, 0.05, 0.1, 0.2,   # 00-05: Very low
    0.4, 0.7, 1.2, 1.5, 1.8, 2.0,     # 06-11: Morning ramp
    1.9, 1.7, 2.2, 2.5, 2.3, 2.0,     # 12-17: Peak hours
    1.5, 1.2, 0.8, 0.5, 0.3, 0.2      # 18-23: Evening decline
])

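# Note: this simulated curve peaks at 2.5x the average RPS, which can be
# lower than the peak_requests_per_second design figure
daily_traffic = base_traffic * traffic_multiplier
peak_hour_traffic = np.max(daily_traffic)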

# Plot traffic pattern
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Daily traffic pattern
ax1.plot(hours, daily_traffic, marker='o', linewidth=2)
ax1.axhline(y=requirements.peak_requests_per_second, color='r', linestyle='--',
           label=f'Peak requirement: {requirements.peak_requests_per_second} RPS')
ax1.axhline(y=peak_hour_traffic, color='orange', linestyle='--', 
           label=f'Peak hour: {peak_hour_traffic:.0f} RPS')
ax1.set_xlabel('Hour of Day')
ax1.set_ylabel('Requests per Second')
ax1.set_title('Daily Traffic Pattern')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Resource utilization
capacity = resources['total_throughput_tokens_per_sec'] / requirements.average_response_tokens
utilization = daily_traffic / capacity * 100

ax2.plot(hours, utilization, marker='s', linewidth=2, color='green')
ax2.axhline(y=80, color='orange', linestyle='--', label='Target utilization: 80%')
ax2.axhline(y=100, color='red', linestyle='--', label='Max capacity: 100%')
ax2.set_xlabel('Hour of Day')
ax2.set_ylabel('Resource Utilization (%)')
ax2.set_title('Resource Utilization Pattern')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Capacity Planning Analysis:")
print(f"Peak hour traffic: {peak_hour_traffic:.0f} RPS")
print(f"Provisioned capacity: {capacity:.0f} RPS")
print(f"Peak utilization: {np.max(utilization):.1f}%")
print(f"Average utilization: {np.mean(utilization):.1f}%")

if np.max(utilization) > 90:
    print("\n⚠️  Warning: Peak utilization > 90%, consider auto-scaling")
elif np.max(utilization) < 60:
    print("\n💡 Suggestion: Low utilization, consider cost optimization")
else:
    print("\n✅ Good capacity planning")
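
One way to act on a low-utilization result is to scale the fleet with the hourly traffic instead of provisioning for the peak all day. The sketch below is a rough estimate under stated assumptions: it reuses the simulated traffic multipliers, the 7B throughput figure from Section 2, and the 80% utilization target drawn in the plot. The floor of two instances is a hypothetical redundancy choice, and scaling lag is ignored.

```python
import numpy as np

# Same hourly multipliers as the simulated traffic pattern above
traffic_multiplier = np.array([
    0.1, 0.05, 0.05, 0.05, 0.1, 0.2,
    0.4, 0.7, 1.2, 1.5, 1.8, 2.0,
    1.9, 1.7, 2.2, 2.5, 2.3, 2.0,
    1.5, 1.2, 0.8, 0.5, 0.3, 0.2,
])

average_rps = 200
rps_per_instance = 2000 / 150   # 7B tokens/sec divided by average response tokens
target_utilization = 0.8        # matches the 80% target line in the plot
min_instances = 2               # assumed floor for redundancy

hourly_rps = average_rps * traffic_multiplier
needed = np.ceil(hourly_rps / (rps_per_instance * target_utilization)).astype(int)
instances_per_hour = np.maximum(needed, min_instances)

# Instance-hours per day: fixed fleet sized for the busiest hour vs. hourly scaling
static_instance_hours = int(instances_per_hour.max()) * 24
scaled_instance_hours = int(instances_per_hour.sum())
saving = 1 - scaled_instance_hours / static_instance_hours

print(f"Busiest hour needs {instances_per_hour.max()} instances")
print(f"Static fleet: {static_instance_hours} instance-hours/day")
print(f"Hourly scaling: {scaled_instance_hours} instance-hours/day ({saving:.0%} fewer)")
```

The estimate is sized against the simulated pattern rather than the stated design peak, so a real policy would also need headroom for bursts and for instance start-up time.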

---
## 7. Architecture Summary

In [None]:
# Generate architecture summary
architecture_summary = f"""
Production Architecture Summary
===============================

🎯 Requirements:
  • Peak traffic: {requirements.peak_requests_per_second:,} RPS
  • Latency SLA: P95 < {requirements.max_latency_p95}ms
  • Availability: {requirements.target_availability*100}%
  • Regions: {len(requirements.geographic_regions)}
  • Model: {requirements.model_size}

🏗️  Recommended Architecture:
  • Pattern: {recommended}
  • Inference Engine: {recommended_engine}
  • Container Platform: {recommended_platform}
  • Monitoring: {recommended_monitoring}

💻 Resource Requirements:
  • Instances: {resources['instances']}
  • Instance type: {cost_spot['instance_type']}
  • Total GPU memory: {resources['total_gpu_memory_gb']} GB
  • Total CPU cores: {resources['total_cpu_cores']}
  • Total RAM: {resources['total_ram_gb']} GB

💰 Cost Estimate (Spot pricing):
  • Monthly cost: ${cost_spot['total_monthly_cost']:,.0f}
  • Budget utilization: {cost_spot['budget_utilization']:.1f}%
  • Cost per request: ${cost_spot['cost_per_request']:.6f}
  • Savings vs on-demand: {(1 - cost_spot['total_monthly_cost']/cost_on_demand['total_monthly_cost'])*100:.0f}%

📈 Performance Capacity:
  • Throughput: {resources['total_throughput_tokens_per_sec']:,} tokens/sec
  • Concurrent requests: {resources['total_concurrent_requests']}
  • Peak utilization: {np.max(utilization):.1f}%
"""

print(architecture_summary)

# Save architecture plan to file
with open('architecture_plan.txt', 'w') as f:
    f.write(architecture_summary)

print("\n✅ Architecture plan saved to 'architecture_plan.txt'")

---
## Summary

✅ **Completed**:
1. Analyzed production requirements
2. Calculated resource needs
3. Evaluated architecture patterns
4. Selected optimal technology stack
5. Estimated costs and capacity
6. Generated architecture plan

📊 **Key Decisions**:
- Architecture pattern based on scale and requirements
- Cost-optimized instance selection
- Technology stack for production readiness
- Capacity planning for traffic patterns

➡️ **Next**: In `02-Deployment_Implementation.ipynb`, we'll implement:
- Docker optimization and builds
- Kubernetes deployment configs
- Auto-scaling setup
- CI/CD pipeline design

---
## Exercises

1. **Modify requirements**: Change peak RPS to 5000 and see how architecture recommendations change
2. **Cost optimization**: Compare different instance types and spot vs on-demand pricing
3. **Multi-region**: Add more regions and analyze the cost impact
4. **Model scaling**: Compare resource needs for 7B vs 13B vs 70B models