# Day 33: Text Generation Inference (TGI) - Part 3

Text Generation Inference (TGI) is Hugging Face's toolkit for deploying and serving LLMs in production. Its HTTP router and launcher are written in Rust, and it supports features such as quantization, token streaming, and tensor parallelism.

## Overview
1. TGI setup and installation
2. TGI Docker setup
3. TGI client implementation
4. Testing the TGI server
5. TGI performance comparison
6. TGI configuration options

## 1. TGI Setup and Installation

TGI is typically deployed using Docker for easy setup and dependency management.

In [None]:
# Check if Docker is available
import subprocess
import os

def check_docker():
    try:
        result = subprocess.run(['docker', '--version'], capture_output=True, text=True)
        if result.returncode == 0:
            print(f"Docker available: {result.stdout.strip()}")
            return True
        else:
            print("Docker not available")
            return False
    except FileNotFoundError:
        print("Docker not installed")
        return False

docker_available = check_docker()

## 2. TGI Docker Setup

We'll write a shell script that runs the TGI Docker image with a small model for demonstration.

In [None]:
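# Create TGI Docker run script.
# A raw string keeps the bash line-continuation backslashes intact, and the
# shebang has to be the very first line of the file to take effect.
tgi_script = r"""#!/bin/bash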

# TGI Docker run script
MODEL_NAME="microsoft/DialoGPT-small"
PORT=8080

echo "Starting TGI server with model: $MODEL_NAME"
echo "Server will be available at: http://localhost:$PORT"

docker run --gpus all \
    --shm-size 1g \
    -p $PORT:80 \
    -v $HOME/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 \
    --model-id $MODEL_NAME \
    --num-shard 1 \
    --port 80 \
    --quantize bitsandbytes-nf4
"""

# Write the script to a file
with open("run_tgi.sh", "w") as f:
    f.write(tgi_script)

# Make it executable
os.chmod("run_tgi.sh", 0o755)

print("TGI run script created: run_tgi.sh")
print("To run TGI server, execute: ./run_tgi.sh")
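
The same launch command can also be assembled as an argument list in Python, which avoids shell quoting and makes individual flags easy to toggle. The sketch below is only an illustration: `build_tgi_command` and its parameters are hypothetical names, while the image tag and flags mirror the script above.

```python
import os
import shlex

def build_tgi_command(model_id, host_port=8080, quantize=None,
                      image="ghcr.io/huggingface/text-generation-inference:1.4"):
    """Return the `docker run` argument list for a TGI container (hypothetical helper)."""
    cache_dir = os.path.expanduser("~/.cache/huggingface")
    cmd = [
        "docker", "run", "--gpus", "all",
        "--shm-size", "1g",
        "-p", f"{host_port}:80",       # host port -> container port 80
        "-v", f"{cache_dir}:/data",    # reuse the local Hugging Face cache
        image,
        "--model-id", model_id,
        "--num-shard", "1",
        "--port", "80",
    ]
    if quantize:
        cmd += ["--quantize", quantize]
    return cmd

cmd = build_tgi_command("microsoft/DialoGPT-small", quantize="bitsandbytes-nf4")
print(shlex.join(cmd))  # shell-safe rendering, handy for logging
```

The returned list can be passed directly to `subprocess.run(cmd)` or `subprocess.Popen(cmd)` without `shell=True`.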

## 3. TGI Client Implementation

Let's create a simple client to interact with the TGI server.

In [None]:
import requests
import json
import time

class TGIClient:
    def __init__(self, base_url="http://localhost:8080"):
        self.base_url = base_url
    
    def generate(self, prompt, max_new_tokens=50, temperature=0.7, do_sample=True):
        """Generate text using TGI server."""
        url = f"{self.base_url}/generate"
        
        payload = {
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_new_tokens,
                "temperature": temperature,
                "do_sample": do_sample,
                "return_full_text": False
            }
        }
        
        try:
            # A timeout keeps the call from hanging forever if the server stalls
            response = requests.post(url, json=payload, timeout=60)
            if response.status_code == 200:
                result = response.json()
                return result["generated_text"]
            else:
                return f"Error: {response.status_code} - {response.text}"
        except Exception as e:
            return f"Connection error: {e}"
    
    def generate_stream(self, prompt, max_new_tokens=50, temperature=0.7):
        """Generate text with streaming using TGI server."""
        url = f"{self.base_url}/generate_stream"
        
        payload = {
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_new_tokens,
                "temperature": temperature,
                "do_sample": True
            }
        }
        
        try:
            # A timeout keeps the call from hanging forever if the server stalls
            response = requests.post(url, json=payload, stream=True, timeout=60)
            if response.status_code == 200:
                for line in response.iter_lines():
                    if line:
                        line = line.decode('utf-8')
                        if line.startswith('data:'):
                            data = line[5:].strip()
                            if data and data != '[DONE]':
                                try:
                                    token_data = json.loads(data)
                                    yield token_data.get('token', {}).get('text', '')
                                except json.JSONDecodeError:
                                    continue
            else:
                yield f"Error: {response.status_code} - {response.text}"
        except Exception as e:
            yield f"Connection error: {e}"
    
    def health_check(self):
        """Check if TGI server is healthy."""
        try:
            response = requests.get(f"{self.base_url}/health", timeout=5)
            return response.status_code == 200
        except requests.RequestException:
            # Connection refused, timeout, DNS failure, etc.
            return False

# Initialize client
tgi_client = TGIClient()
print("TGI client initialized")
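
To make the streaming logic easier to follow, here is the per-line parsing step pulled out into a standalone function. It is a sketch that assumes each server-sent event is a `data:` line carrying a JSON object with a `token` field, as the client above expects; the `special` flag used to skip control tokens and the sample lines are assumptions written by hand, not captured server output.

```python
import json

def parse_sse_line(raw_line):
    """Return the token text carried by one SSE line, or None if there is none."""
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None  # blank separator lines and other SSE fields
    payload = line[len("data:"):].strip()
    if not payload:
        return None
    try:
        event = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    token = event.get("token") or {}
    if token.get("special"):
        return None  # assumed flag for control tokens such as end-of-sequence
    return token.get("text")

# Hand-written sample lines in the assumed event shape
sample = [
    b'data:{"token":{"id":1,"text":"Hello","special":false},"generated_text":null}',
    b'',
    b'data:{"token":{"id":2,"text":" world","special":false},"generated_text":null}',
    b'data:{"token":{"id":3,"text":"<eos>","special":true},"generated_text":"Hello world"}',
]
text = "".join(t for t in map(parse_sse_line, sample) if t is not None)
print(text)  # Hello world
```

Keeping the parser separate from the HTTP call makes it possible to exercise it without a running server.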

## 4. Testing TGI Server

Let's test our TGI client (requires TGI server to be running).

In [None]:
def test_tgi_server():
    """Test TGI server functionality."""
    
    # Check server health
    print("Checking TGI server health...")
    if not tgi_client.health_check():
        print("❌ TGI server is not running or not healthy")
        print("To start TGI server, run: ./run_tgi.sh")
        return
    
    print("✅ TGI server is healthy")
    
    # Test basic generation
    print("\n=== Testing Basic Generation ===")
    prompt = "The future of artificial intelligence is"
    
    start_time = time.time()
    result = tgi_client.generate(prompt, max_new_tokens=30)
    generation_time = time.time() - start_time
    
    print(f"Prompt: {prompt}")
    print(f"Generated: {result}")
    print(f"Generation time: {generation_time:.2f} seconds")
    
    # Test streaming generation
    print("\n=== Testing Streaming Generation ===")
    prompt = "Machine learning will transform"
    print(f"Prompt: {prompt}")
    print("Streaming response: ", end="", flush=True)
    
    start_time = time.time()
    # No artificial delay here: a sleep per token would inflate the measured time
    for token in tgi_client.generate_stream(prompt, max_new_tokens=20):
        print(token, end="", flush=True)
    
    streaming_time = time.time() - start_time
    print(f"\nStreaming time: {streaming_time:.2f} seconds")

# Run the test (only if server is available)
test_tgi_server()
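
Because model loading can take a while after the container starts, a single health check right after launch may fail even though the server is on its way up. One option is to poll until the server reports healthy. The helper below is a minimal sketch: `wait_until_ready` is a hypothetical name, and a stand-in callable replaces `tgi_client.health_check` so the example runs without a server.

```python
import time

def wait_until_ready(is_ready, timeout_s=300.0, interval_s=2.0, sleep=time.sleep):
    """Poll `is_ready()` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)

# Stand-in for tgi_client.health_check: "not ready" twice, then "ready"
responses = iter([False, False, True])
ready = wait_until_ready(lambda: next(responses), timeout_s=5, interval_s=0.01)
print(ready)  # True
```

With a real server, the call would be `wait_until_ready(tgi_client.health_check)` before running `test_tgi_server()`.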

## 5. TGI Performance Comparison

Let's create a simple benchmark that measures TGI request latency and throughput, which gives a baseline for comparing against other serving stacks.

In [None]:
def benchmark_tgi(num_requests=5):
    """Benchmark TGI server performance."""
    
    if not tgi_client.health_check():
        print("TGI server not available for benchmarking")
        return
    
    prompts = [
        "Artificial intelligence will",
        "The benefits of renewable energy include",
        "Space exploration helps us",
        "Quantum computing enables",
        "The future of healthcare involves"
    ]
    
    print(f"Benchmarking TGI with {num_requests} requests...")
    
    total_time = 0
    total_tokens = 0
    successful = 0
    
    for i in range(num_requests):
        prompt = prompts[i % len(prompts)]
        
        start_time = time.time()
        result = tgi_client.generate(prompt, max_new_tokens=25)
        request_time = time.time() - start_time
        
        # generate() reports failures as strings with one of these prefixes
        if not result.startswith(("Error", "Connection error")):
            # Whitespace-separated words: a rough proxy for the real token count
            tokens = len(result.split())
            total_tokens += tokens
            total_time += request_time
            successful += 1
            
            print(f"Request {i+1}: {request_time:.2f}s, ~{tokens} tokens")
        else:
            print(f"Request {i+1}: {result}")
    
    if successful > 0 and total_time > 0:
        # Average over successful requests only, since failures add no time
        avg_time = total_time / successful
        throughput = total_tokens / total_time
        
        print(f"\nBenchmark Results:")
        print(f"Successful requests: {successful}/{num_requests}")
        print(f"Average time per request: {avg_time:.2f}s")
        print(f"Total tokens generated (approx.): {total_tokens}")
        print(f"Throughput: {throughput:.2f} tokens/second (approx.)")

# Run benchmark
benchmark_tgi()
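
The benchmark above sends requests one at a time, so the server never has more than one request in flight. TGI batches concurrent requests together (continuous batching), so throughput under load is usually measured with parallel clients. The following is a rough sketch of that idea using a thread pool: `run_concurrent` and `fake_generate` are hypothetical names, and `fake_generate` stands in for `tgi_client.generate` so the example runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(generate_fn, prompts, max_workers=4):
    """Call generate_fn on every prompt in parallel.

    Returns (results, per-request latencies, total wall-clock time).
    """
    def timed_call(prompt):
        start = time.perf_counter()
        text = generate_fn(prompt)
        return text, time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(timed_call, prompts))  # preserves prompt order
    wall_time = time.perf_counter() - wall_start

    results = [text for text, _ in outcomes]
    latencies = [latency for _, latency in outcomes]
    return results, latencies, wall_time

def fake_generate(prompt):
    time.sleep(0.05)  # simulated server latency
    return prompt + " ..."

prompts = [
    "Artificial intelligence will",
    "Quantum computing enables",
    "Space exploration helps us",
    "The future of healthcare involves",
]
results, latencies, wall_time = run_concurrent(fake_generate, prompts)
print(f"{len(results)} requests in {wall_time:.2f}s "
      f"(sum of latencies: {sum(latencies):.2f}s)")
```

Against a real server, passing `tgi_client.generate` as `generate_fn` and comparing wall-clock time with the sequential run gives a first impression of how much batching helps.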

## 6. TGI Configuration Options

TGI supports various configuration options for optimization:

In [None]:
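# Advanced TGI configuration script.
# A raw string keeps the bash line-continuation backslashes intact, and the
# shebang has to be the very first line of the file to take effect.
advanced_tgi_script = r"""#!/bin/bash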

# Advanced TGI configuration
MODEL_NAME="microsoft/DialoGPT-medium"
PORT=8080
MAX_CONCURRENT_REQUESTS=128
MAX_BEST_OF=4
MAX_STOP_SEQUENCES=4
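# DialoGPT is GPT-2 based with a 1024-token context window, so the total
# budget (prompt + generated tokens) should stay within that limit.
MAX_INPUT_LENGTH=512
MAX_TOTAL_TOKENS=1024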

echo "Starting TGI server with advanced configuration..."

docker run --gpus all \
    --shm-size 1g \
    -p $PORT:80 \
    -v $HOME/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 \
    --model-id $MODEL_NAME \
    --num-shard 1 \
    --port 80 \
    --max-concurrent-requests $MAX_CONCURRENT_REQUESTS \
    --max-best-of $MAX_BEST_OF \
    --max-stop-sequences $MAX_STOP_SEQUENCES \
    --max-input-length $MAX_INPUT_LENGTH \
    --max-total-tokens $MAX_TOTAL_TOKENS \
    --quantize bitsandbytes-nf4 \
    --trust-remote-code
"""

# Write advanced script
with open("run_tgi_advanced.sh", "w") as f:
    f.write(advanced_tgi_script)

os.chmod("run_tgi_advanced.sh", 0o755)

print("Advanced TGI configuration created: run_tgi_advanced.sh")
print("\nKey TGI Configuration Options:")
print("- --quantize: Enable quantization (bitsandbytes, gptq)")
print("- --num-shard: Number of GPU shards for tensor parallelism")
print("- --max-concurrent-requests: Maximum concurrent requests")
print("- --max-input-length: Maximum input sequence length")
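print("- --max-total-tokens: Maximum prompt + generated tokens per request")
print("- --trust-remote-code: Allow custom model code from the Hub to run (use only with trusted repositories)")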
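
Together, `--max-input-length` and `--max-total-tokens` define a per-request token budget: the prompt may use up to the input limit, and prompt plus generated tokens have to fit within the total limit. A client can check this before sending a request. The helper below is a sketch under those assumptions: `clamp_max_new_tokens` is a hypothetical name, the limits are illustrative, and the prompt's token count is assumed to be computed elsewhere (for example with the model's tokenizer).

```python
def clamp_max_new_tokens(input_tokens, requested_new_tokens,
                         max_input_length, max_total_tokens):
    """Return the largest max_new_tokens that fits the server's token budget."""
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    if input_tokens > max_input_length:
        raise ValueError(
            f"prompt has {input_tokens} tokens, limit is {max_input_length}"
        )
    room = max_total_tokens - input_tokens  # tokens left for generation
    return min(requested_new_tokens, room)

# Illustrative limits: 100 input tokens, 150 total tokens
print(clamp_max_new_tokens(90, 100, max_input_length=100, max_total_tokens=150))  # 60
print(clamp_max_new_tokens(10, 20, max_input_length=100, max_total_tokens=150))   # 20
```

Clamping on the client side turns what would otherwise be a validation error from the server into a shorter, successful generation.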

## Conclusion

Text Generation Inference (TGI) provides:

1. **High Performance**: Rust router with optimized inference kernels
2. **Easy Deployment**: Docker-based deployment
3. **Advanced Features**: Quantization, streaming, tensor parallelism
4. **Hugging Face Integration**: Seamless model loading from the Hugging Face Hub

TGI is ideal for production deployments requiring high throughput and easy scaling.

## Next Steps

To run TGI:
1. Execute `./run_tgi.sh` to start the server
2. Wait for model loading to complete
3. Run the test functions above to verify functionality

In the next notebook, we'll explore TensorRT-LLM for NVIDIA GPU optimization.