# Day 33: Deploying a Local Model Server - Part 1

In this notebook, we'll set up a local model server using vLLM, one of the most popular frameworks for efficient LLM serving. We'll deploy a model and create a simple API endpoint for text generation.

## Overview

1. Setup and dependencies
2. Understanding vLLM architecture
3. Setting up a local model server
4. Creating a FastAPI server for model serving
5. Measuring server performance
6. Stopping the server

## 1. Setup and Dependencies

Let's start by installing the necessary packages. vLLM requires PyTorch and several other dependencies.

In [None]:
!pip install -q vllm transformers fastapi uvicorn pydantic requests

In [None]:
import time
import torch

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Check vLLM installation
try:
    import vllm
    print(f"vLLM version: {vllm.__version__}")
except ImportError:
    print("vLLM not installed properly. Please check installation.")

## 2. Understanding vLLM Architecture

vLLM is a high-performance library for LLM inference and serving. It introduces several key optimizations:

1. **PagedAttention**: Memory-efficient KV cache management using a paging mechanism
2. **Continuous Batching**: Dynamic handling of requests for optimal throughput
3. **Tensor Parallelism**: Distributing model weights across multiple GPUs
4. **Optimized CUDA Kernels**: Efficient implementation of key operations

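Together, these optimizations let vLLM keep more requests in flight on the same GPU memory, which translates into higher throughput than serving stacks that reserve a contiguous KV cache per request and batch requests statically.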
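To make the tensor parallelism item above more concrete, here is a minimal pure-Python sketch of one common way to shard a linear layer: splitting the weight matrix by output columns, computing each slice independently, and concatenating the partial results. The helper names (`matmul`, `split_columns`) and the numbers are made up for illustration, and this is not claimed to be the exact layout vLLM uses.

```python
def matmul(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in rows, d_out columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Give each shard a contiguous slice of the output columns."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


x = [1.0, 2.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]

# Each "device" holds one shard and computes its part of the output.
shards = split_columns(w, num_shards=2)
partial_outputs = [matmul(x, shard) for shard in shards]

# Concatenating the partial outputs recovers the full result.
gathered = [value for part in partial_outputs for value in part]
print("sharded:", gathered)
print("unsharded:", matmul(x, w))
```

In the real library, the number of GPUs is chosen with the `tensor_parallel_size` argument of `LLM`; the sharding itself is handled internally.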

### 2.1 PagedAttention

PagedAttention is the core innovation in vLLM. It applies virtual memory concepts to KV caching:

1. Divides the KV cache into fixed-size blocks or "pages"
2. Allocates pages non-contiguously in physical memory
3. Uses a block table to map logical positions to physical memory locations

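Because blocks are allocated on demand as a sequence grows, memory is no longer reserved up front for the maximum sequence length, and unused space is limited to the last, partially filled block of each sequence. This reduces fragmentation and leaves room for more concurrent sequences.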
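The three steps above can be sketched with a toy allocator. This is a simplified illustration, not vLLM's implementation: the class names (`ToyBlockAllocator`, `ToySequence`) and the block size of 4 tokens are assumptions chosen to keep the example small.

```python
BLOCK_SIZE = 4  # tokens per KV block (illustrative value)


class ToyBlockAllocator:
    """Hands out physical block ids from a free list."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def free(self, block_id):
        self.free_blocks.append(block_id)


class ToySequence:
    """Tracks one request's block table: logical block index -> physical block id."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, position):
        # Translate a logical token position into (physical block, offset).
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def release(self):
        # Return all blocks to the free list when the request finishes.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table = []
        self.num_tokens = 0


allocator = ToyBlockAllocator(num_blocks=8)
seq_a, seq_b = ToySequence(allocator), ToySequence(allocator)

# Interleave growth so the two sequences end up with non-contiguous blocks.
for _ in range(5):
    seq_a.append_token()
    seq_b.append_token()

print("seq_a block table:", seq_a.block_table)
print("seq_b block table:", seq_b.block_table)
print("seq_a token 4 lives at:", seq_a.locate(4))
```

Each sequence holds five tokens but occupies only two blocks, and neither sequence's blocks are adjacent in the physical pool. The block table is what lets attention find the right block for any logical position.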

### 2.2 Continuous Batching

Continuous batching allows vLLM to dynamically handle requests:

1. New requests can join existing batches at any time
2. Completed requests can leave batches without disrupting others
3. Batches are processed efficiently regardless of varying sequence lengths

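Because free slots are refilled as soon as a request finishes, the GPU spends less time idle waiting for the longest request in a batch, which improves utilization and throughput.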
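To see why this matters, the following toy simulation counts decoding steps under static batching and under continuous batching. It is a rough sketch under simplifying assumptions: every step produces exactly one token per running request, prefill cost is ignored, and the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Finished requests free their slot and waiting requests join immediately."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Fill free slots from the waiting queue before each decoding step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decoding step produces one token for every running request.
        running = [remaining - 1 for remaining in running]
        # Completed requests leave the batch without stalling the others.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


lengths = [2, 8, 3, 8, 2, 2]  # tokens to generate per request (made-up numbers)
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print("static batching steps:", static_steps)
print("continuous batching steps:", continuous_steps)
```

With static batching, short requests sit idle in their slot until the longest request in the batch completes. With continuous batching, the slot is reused right away, so the same work finishes in fewer steps.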

## 3. Setting Up a Local Model Server

Let's set up a local model server using vLLM. We'll use a smaller model for demonstration purposes.

In [None]:
# Define the model to use
model_name = "facebook/opt-350m"  # Using a smaller model for demonstration

# Check if we're running in a notebook or script
try:
    get_ipython
    is_notebook = True
except NameError:
    is_notebook = False

print(f"Running in {'notebook' if is_notebook else 'script'} mode")

### 3.1 Loading a Model with vLLM

First, let's load a model using vLLM's `LLM` class to understand the basic usage.

In [None]:
from vllm import LLM, SamplingParams

# Initialize the model
try:
    print(f"Loading model: {model_name}")
    llm = LLM(model=model_name)
    print("Model loaded successfully")
except Exception as e:
    print(f"Error loading model: {e}")
    print("Falling back to Hugging Face Transformers for demonstration")
    # If vLLM cannot load the model (e.g. no supported GPU), load it with Transformers instead
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

### 3.2 Basic Inference with vLLM

Let's try a simple inference with the loaded model.

In [None]:
# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100
)

# Test prompt
prompt = "Artificial intelligence will transform the future by"

# Generate text
try:
    outputs = llm.generate([prompt], sampling_params)
    
    # Print the generated text
    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {prompt}{generated_text}")
except Exception as e:
    print(f"Error during generation: {e}")
    print("Falling back to standard Hugging Face generation for demonstration")
    
    # Fallback to standard generation if vLLM fails
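    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,  # pass input_ids together with the attention mask
            max_new_tokens=100,  # matches max_tokens above; max_length would also count the prompt
            do_sample=True,
            temperature=0.7,
            top_p=0.95
        )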
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Generated text: {generated_text}")

## 4. Creating a FastAPI Server for Model Serving

Now, let's create a FastAPI server to serve our model. We'll define the API endpoints and request/response models.

In [None]:
# Create a file for our FastAPI server
server_code = """
import os
import time
from typing import List, Optional, Dict, Any

from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field

from vllm import LLM, SamplingParams

# Define the model name
MODEL_NAME = "facebook/opt-350m"  # Using a smaller model for demonstration

# Initialize FastAPI app
app = FastAPI(title="vLLM API Server")

# Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Initialize the model
print(f"Loading model: {MODEL_NAME}")
llm = LLM(model=MODEL_NAME)
print("Model loaded successfully")

# Define request and response models
class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = Field(default=100, ge=1, le=1024)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    top_p: float = Field(default=0.95, ge=0.0, le=1.0)
    n: int = Field(default=1, ge=1, le=5)

class GenerationResponse(BaseModel):
    text: str  # first completion, kept for clients that request n=1
    texts: List[str]  # all n completions
    usage: Dict[str, int]
    request_time: float

# Define API endpoints
@app.get("/")
async def root():
    return {"message": "vLLM API Server is running", "model": MODEL_NAME}

@app.post("/generate", response_model=GenerationResponse)
async def generate(request: GenerationRequest):
    # Note: llm.generate is a blocking call, so this handler serves one request at a time
    try:
        start_time = time.perf_counter()
        
        # Set up sampling parameters
        sampling_params = SamplingParams(
            temperature=request.temperature,
            top_p=request.top_p,
            max_tokens=request.max_tokens,
            n=request.n
        )
        
        # Generate text; one prompt in, so one RequestOutput holding n completions
        result = llm.generate([request.prompt], sampling_params)[0]
        generated_texts = [completion.text for completion in result.outputs]
        
        # Count tokens from the token IDs vLLM returns instead of splitting on whitespace
        prompt_tokens = len(result.prompt_token_ids)
        completion_tokens = sum(len(completion.token_ids) for completion in result.outputs)
        total_tokens = prompt_tokens + completion_tokens
        
        # Calculate request time
        request_time = time.perf_counter() - start_time
        
        return GenerationResponse(
            text=generated_texts[0],
            texts=generated_texts,
            usage={
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": total_tokens
            },
            request_time=request_time
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy"}

# Run the server
if __name__ == "__main__":
    import uvicorn
    # 0.0.0.0 listens on all network interfaces; use 127.0.0.1 to accept local connections only
    uvicorn.run(app, host="0.0.0.0", port=8000)
"""

# Write the server code to a file
with open("vllm_server.py", "w") as f:
    f.write(server_code)

print("Server code written to vllm_server.py")

### 4.1 Running the FastAPI Server

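Now, let's run the FastAPI server. When this code runs as a script, it launches the server as a background process. In a notebook, it prints the command to run in a separate terminal instead.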

In [None]:
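# Run the server in the background
import subprocess
import sys
import time
import urllib.request

def run_server(timeout=120):
    print("Starting vLLM server...")
    try:
        # Use the current interpreter so the server runs in the same environment
        server_process = subprocess.Popen([sys.executable, "vllm_server.py"])
    except Exception as e:
        print(f"Error starting server: {e}")
        return None

    # Model loading can take a while, so poll /health instead of sleeping for a fixed time
    print("Server started. Waiting for it to initialize...")
    deadline = time.time() + timeout
    while time.time() < deadline:
        if server_process.poll() is not None:
            print("Server process exited during startup.")
            return None
        try:
            with urllib.request.urlopen("http://localhost:8000/health", timeout=2):
                print("Server is ready.")
                return server_process
        except OSError:
            time.sleep(1)  # not accepting connections yet; try again
    print("Server did not become ready in time.")
    return server_process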

# Only run the server if we're in a script, not in a notebook
if not is_notebook:
    server_process = run_server()
else:
    print("In notebook mode. To run the server, execute the following in a terminal:")
    print("python vllm_server.py")

### 4.2 Testing the API

Let's test our API by sending requests to it.

In [None]:
import requests
import json

def test_api(prompt="Artificial intelligence will transform the future by"):
    try:
        # Define the API endpoint
        url = "http://localhost:8000/generate"
        
        # Define the request payload
        payload = {
            "prompt": prompt,
            "max_tokens": 100,
            "temperature": 0.7,
            "top_p": 0.95,
            "n": 1
        }
        
        # Send the request
        response = requests.post(url, json=payload)
        
        # Check if the request was successful
        if response.status_code == 200:
            result = response.json()
            print(f"Generated text: {prompt}{result['text']}")
            print(f"\nUsage: {result['usage']}")
            print(f"Request time: {result['request_time']:.2f} seconds")
        else:
            print(f"Error: {response.status_code} - {response.text}")
    except Exception as e:
        print(f"Error testing API: {e}")
        print("Make sure the server is running. If you're in a notebook, run the server in a separate terminal.")

# Test the API
# Note: This will only work if the server is running
test_api()

## 5. Measuring Server Performance

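Let's measure the performance of our server by sending several requests one after another and computing throughput from the token counts the server reports. Because the requests are sent sequentially, the result reflects single-request speed rather than the throughput achievable with concurrent requests.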
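As a worked example of the arithmetic involved, the sketch below turns a list of per-request latencies and token counts into throughput and latency figures. The helper name `summarize_benchmark` and the sample numbers are hypothetical; they are not measurements from this server.

```python
import statistics


def summarize_benchmark(latencies, token_counts, wall_time):
    """Summarize per-request latencies (seconds) and token counts."""
    if not latencies or wall_time <= 0:
        raise ValueError("need at least one request and a positive wall time")
    ordered = sorted(latencies)
    # Nearest-rank 95th percentile, using integer arithmetic for the ceiling.
    p95_index = max(0, -(-95 * len(ordered) // 100) - 1)
    return {
        "requests_per_second": len(latencies) / wall_time,
        "tokens_per_second": sum(token_counts) / wall_time,
        "mean_latency": statistics.mean(latencies),
        "median_latency": statistics.median(latencies),
        "p95_latency": ordered[p95_index],
    }


# Made-up sample: four requests that finished within two seconds of wall time.
summary = summarize_benchmark(
    latencies=[0.5, 0.25, 1.0, 0.25],
    token_counts=[50, 40, 60, 50],
    wall_time=2.0,
)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Reporting the median and a high percentile alongside the mean helps show whether a few slow requests dominate the average.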

In [None]:
def benchmark_api(num_requests=5):
    try:
        # Define the API endpoint
        url = "http://localhost:8000/generate"
        
        # Define test prompts
        prompts = [
            "Artificial intelligence will transform healthcare by",
            "The future of renewable energy depends on",
            "Space exploration in the next decade will focus on",
            "Quantum computing offers advantages such as",
            "The most significant ethical concerns in technology are"
        ]
        
        # Send requests and measure time
        start_time = time.time()
        total_tokens = 0
        
        for i in range(num_requests):
            prompt = prompts[i % len(prompts)]
            
            # Define the request payload
            payload = {
                "prompt": prompt,
                "max_tokens": 50,
                "temperature": 0.7,
                "top_p": 0.95,
                "n": 1
            }
            
            # Send the request
            response = requests.post(url, json=payload)
            
            # Check if the request was successful
            if response.status_code == 200:
                result = response.json()
                total_tokens += result["usage"]["total_tokens"]
                print(f"Request {i+1}/{num_requests} completed in {result['request_time']:.2f} seconds")
            else:
                print(f"Error in request {i+1}: {response.status_code} - {response.text}")
        
        # Calculate total time and throughput
        total_time = time.time() - start_time
        throughput = total_tokens / total_time
        
        print(f"\nBenchmark Results:")
        print(f"Total requests: {num_requests}")
        print(f"Total tokens: {total_tokens}")
        print(f"Total time: {total_time:.2f} seconds")
        print(f"Throughput: {throughput:.2f} tokens per second")
        
    except Exception as e:
        print(f"Error benchmarking API: {e}")
        print("Make sure the server is running. If you're in a notebook, run the server in a separate terminal.")

# Benchmark the API
# Note: This will only work if the server is running
benchmark_api(num_requests=5)

## 6. Stopping the Server

If the server was started by the code in section 4.1 (script mode), stop it when you're done. If you launched it manually in a terminal, stop it there with Ctrl+C.

In [None]:
def stop_server(server_process):
    if server_process:
        print("Stopping server...")
        server_process.terminate()
        server_process.wait()
        print("Server stopped.")

# Only stop the server if we started it
if not is_notebook and 'server_process' in locals():
    stop_server(server_process)

## Conclusion

In this notebook, we've explored how to set up a local model server using vLLM. We've created a FastAPI server that exposes an API endpoint for text generation and tested its performance.

Key takeaways:

1. vLLM provides significant performance improvements for LLM serving through PagedAttention and continuous batching
2. Setting up a local model server is straightforward with FastAPI and vLLM
3. The API can be easily tested and benchmarked to measure performance

In the next part, we'll extend our server to support streaming generation and explore more advanced features of vLLM.