# Day 35: Telemetry Integration for LLM Services - Part 5

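This notebook combines logging, metrics, and tracing into one telemetry system, so that every log line carries the trace and span IDs of the request that produced it and metrics are recorded in the same code path.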

## Overview
1. Unified telemetry configuration
2. Instrumented service with all three pillars
3. Correlation between logs, metrics, and traces
4. Production-ready telemetry setup

In [None]:
# Install required packages
!pip install -q structlog prometheus-client opentelemetry-api opentelemetry-sdk

In [None]:
import time
import uuid
import json
import logging
import structlog
from typing import Dict, Any

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import Status, StatusCode

from prometheus_client import Counter, Histogram

## 1. Unified Telemetry Setup

Configure logging, metrics, and tracing in a single setup.

In [None]:
class TelemetrySystem:
    """Unified telemetry system for LLM services."""
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        self.logger = self._setup_logging()
        self.tracer = self._setup_tracing()
        self.metrics = self._setup_metrics()
        print("Telemetry system initialized")
    
    def _setup_logging(self):
        """Configure structured logging."""
        structlog.configure(
            processors=[
                structlog.stdlib.filter_by_level,
                structlog.processors.TimeStamper(fmt="iso"),
                self._add_trace_info,  # Add trace/span IDs
                structlog.processors.JSONRenderer()
            ],
            logger_factory=structlog.stdlib.LoggerFactory(),
            wrapper_class=structlog.stdlib.BoundLogger,
        )
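        # force=True replaces any handlers a hosted notebook may have pre-installed
        # on the root logger; without it basicConfig can silently do nothing.
        logging.basicConfig(level=logging.INFO, format='%(message)s', force=True)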
        return structlog.get_logger(self.service_name)
    
    def _setup_tracing(self):
        """Configure OpenTelemetry tracing."""
        trace.set_tracer_provider(TracerProvider())
        trace.get_tracer_provider().add_span_processor(
            SimpleSpanProcessor(ConsoleSpanExporter())
        )
        return trace.get_tracer(self.service_name)
    
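    def _setup_metrics(self):
        """Configure Prometheus metrics.

        The metrics register in the default global registry, so creating a
        second TelemetrySystem in the same kernel (for example by re-running
        this cell) raises a duplicated-timeseries ValueError. Restart the
        kernel before re-running.
        """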
        return {
            'requests_total': Counter(
                'llm_requests_total', 'Total requests', ["model", "status"]
            ),
            'request_duration': Histogram(
                'llm_request_duration_seconds', 'Request duration', ["model"]
            )
        }
    
    def _add_trace_info(self, logger, method_name, event_dict):
        """Add trace and span IDs to logs when a span is active."""
        span_context = trace.get_current_span().get_span_context()
        # is_valid is False when no span is active, so logs emitted outside a
        # trace are left unchanged.
        if span_context.is_valid:
            event_dict['trace_id'] = format(span_context.trace_id, '032x')
            event_dict['span_id'] = format(span_context.span_id, '016x')
        return event_dict

# Initialize telemetry system
telemetry = TelemetrySystem("llm-service")
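
The `_add_trace_info` processor renders the trace ID as 32 hex characters and the span ID as 16, the same widths the W3C Trace Context `traceparent` header uses. The stand-alone sketch below shows that formatting with the standard library only. `format_ids` and `build_traceparent` are hypothetical helper names, not part of OpenTelemetry, and the IDs are arbitrary example values.

```python
def format_ids(trace_id: int, span_id: int) -> dict:
    """Render integer IDs as fixed-width lowercase hex strings."""
    return {
        "trace_id": format(trace_id, "032x"),  # 128-bit ID -> 32 hex chars
        "span_id": format(span_id, "016x"),    # 64-bit ID -> 16 hex chars
    }


def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Assemble a traceparent-style value: version-traceid-spanid-flags."""
    ids = format_ids(trace_id, span_id)
    flags = "01" if sampled else "00"
    return f"00-{ids['trace_id']}-{ids['span_id']}-{flags}"


# Arbitrary example values; small integers are zero-padded on the left.
ids = format_ids(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)
print(ids["trace_id"])
print(ids["span_id"])
print(build_traceparent(0x1, 0x2))
```

Fixed-width rendering matters for correlation: a log search for a trace ID only matches if the log and the tracing backend pad the value the same way.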

## 2. Fully Instrumented Service

Create a service that uses all three observability pillars.

In [None]:
class FullyInstrumentedService:
    """LLM service with integrated telemetry."""
    
    def __init__(self, telemetry_system: TelemetrySystem):
        self.telemetry = telemetry_system
        self.model_name = "gpt-3.5-turbo"
    
    def generate(self, prompt: str) -> Dict[str, Any]:
        """Generate text with integrated telemetry."""
        
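        # Start trace span. Automatic exception recording is turned off because
        # the except block below records the exception and sets the status
        # itself; leaving both on would record each exception twice.
        with self.telemetry.tracer.start_as_current_span(
            "generate_request",
            attributes={"model": self.model_name, "prompt_length": len(prompt)},
            record_exception=False,
            set_status_on_exception=False,
        ) as span: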
            
            start_time = time.time()
            request_id = str(uuid.uuid4())
            
            # Bind context to logger
            log = self.telemetry.logger.bind(request_id=request_id, model=self.model_name)
            
            try:
                # Log request start
                log.info("Request received", prompt=prompt)
                
                # Simulate work
                if not prompt.strip():
                    raise ValueError("Prompt is empty")
                
                time.sleep(0.2)  # Simulate processing
                generated_text = f"Response for: {prompt[:20]}..."
                
                # Record success metrics
                self.telemetry.metrics['requests_total'].labels(
                    model=self.model_name, status='success'
                ).inc()
                
                # Log success
                log.info("Request successful", generated_text=generated_text)
                span.set_status(Status(StatusCode.OK))
                
                return {'text': generated_text}
                
            except Exception as e:
                # Record error metrics
                self.telemetry.metrics['requests_total'].labels(
                    model=self.model_name, status='error'
                ).inc()
                
                # Log error
                log.error("Request failed", error=str(e))
                span.record_exception(e)
                span.set_status(Status(StatusCode.ERROR, str(e)))
                
                raise
            
            finally:
                # Record request duration
                duration = time.time() - start_time
                self.telemetry.metrics['request_duration'].labels(
                    model=self.model_name
                ).observe(duration)

# Initialize service
service = FullyInstrumentedService(telemetry)
print("Fully instrumented service initialized")
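
`generate` follows a try/except/finally shape: count the outcome in `try` or `except`, and record the duration in `finally` so that failed requests are timed too. A minimal, library-free sketch of that pattern follows. The `InMemoryMetrics` class, the `instrumented` decorator, and the `echo` function are hypothetical stand-ins for the Prometheus metrics and the service, chosen so the example runs without extra packages.

```python
import time
from collections import defaultdict
from functools import wraps


class InMemoryMetrics:
    """Hypothetical stand-in for a Prometheus counter and histogram."""

    def __init__(self):
        self.requests = defaultdict(int)  # request count keyed by status
        self.durations = []               # observed durations in seconds


def instrumented(metrics: InMemoryMetrics):
    """Decorator that counts outcomes and always records the duration."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()  # monotonic clock, suited to intervals
            try:
                result = func(*args, **kwargs)
                metrics.requests["success"] += 1
                return result
            except Exception:
                metrics.requests["error"] += 1
                raise  # re-raise so callers still see the failure
            finally:
                # Runs on both paths, so failed requests are timed as well
                metrics.durations.append(time.perf_counter() - start)
        return wrapper
    return decorator


demo_metrics = InMemoryMetrics()


@instrumented(demo_metrics)
def echo(prompt: str) -> str:
    if not prompt.strip():
        raise ValueError("Prompt is empty")
    return f"Response for: {prompt[:20]}..."


echo("What is observability?")
try:
    echo("")  # empty prompt takes the error path
except ValueError:
    pass

print(dict(demo_metrics.requests), len(demo_metrics.durations))
```

Putting the pattern in a decorator keeps the bookkeeping out of the business logic, at the cost of hiding per-request labels such as the model name, which the class-based version above passes explicitly.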

## 3. Testing the Integrated System

Generate requests and observe the correlated telemetry output.

In [None]:
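import io

def test_telemetry_integration():
    """Test the integrated telemetry system."""
    
    # Capture log output. The service logs through the stdlib logging module,
    # which writes to stderr by default, so redirecting sys.stdout would miss
    # it. Attach a temporary handler to the root logger instead.
    log_capture = io.StringIO()
    capture_handler = logging.StreamHandler(log_capture)
    capture_handler.setFormatter(logging.Formatter('%(message)s'))
    root_logger = logging.getLogger()
    root_logger.addHandler(capture_handler)
    
    try:
        # Test successful request
        print("--- Testing Successful Request ---")
        try:
            result = service.generate("What is observability?")
            print(f"Success: {result}")
        except Exception as e:
            print(f"Error: {e}")
        
        # Test error request
        print("\n--- Testing Error Request ---")
        try:
            service.generate("")  # Empty prompt
        except Exception as e:
            print(f"Caught expected error: {e}")
    finally:
        # Always detach the handler, even if something unexpected is raised
        root_logger.removeHandler(capture_handler)
    
    # Get the captured logs
    log_output = log_capture.getvalue()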
    
    # Print captured logs
    print("\n--- Captured Logs ---")
    print(log_output)
    
    # Analyze logs for correlation
    print("--- Log Correlation Analysis ---")
    trace_ids = set()
    for line in log_output.strip().split('\n'):
        try:
            log_entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip lines that are not JSON
        if isinstance(log_entry, dict) and 'trace_id' in log_entry:
            trace_ids.add(log_entry['trace_id'])
    
    print(f"Found {len(trace_ids)} unique trace IDs in logs")
    for trace_id in trace_ids:
        print(f"  Trace ID: {trace_id}")
        print("    Logs for this trace:")
        for line in log_output.strip().split('\n'):
            if trace_id in line:
                print(f"      {line}")

# Run the test
test_telemetry_integration()
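
Correlation amounts to collecting every log entry that shares a trace ID. The sketch below parses each line once and groups the entries by `trace_id`, rather than searching the raw text for each ID. `group_logs_by_trace` is a hypothetical helper, and the sample lines are made up to mirror the JSON shape the logger emits.

```python
import json
from collections import defaultdict


def group_logs_by_trace(log_text: str) -> dict:
    """Group JSON log entries by trace_id, skipping non-JSON lines."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. plain print() output mixed into the stream
        if isinstance(entry, dict) and "trace_id" in entry:
            grouped[entry["trace_id"]].append(entry)
    return dict(grouped)


# Made-up sample lines for illustration only
sample = "\n".join([
    '{"event": "Request received", "trace_id": "aaa"}',
    '{"event": "Request successful", "trace_id": "aaa"}',
    "--- not a JSON line ---",
    '{"event": "Request failed", "trace_id": "bbb"}',
])

for trace_id, entries in group_logs_by_trace(sample).items():
    print(trace_id, [entry["event"] for entry in entries])
```

Matching on the parsed field also avoids false positives when a trace ID happens to appear inside another field, such as a logged prompt.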

## 4. Production Telemetry Architecture

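A typical production setup ships each signal through its own agent and collector into storage, then queries all three from one observability front end. The diagram below sketches that flow.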

In [None]:
# Production architecture diagram (Mermaid)
from IPython.display import display, Markdown

mermaid_diagram = """
graph TD
    A[LLM Service] -->|Logs| B[Log Agent]
    A -->|Metrics| C[Metrics Agent]
    A -->|Traces| D[Trace Agent]
    
    B --> E[Log Aggregator e.g., Fluentd]
    C --> F[Metrics Collector e.g., Prometheus]
    D --> G[Trace Collector e.g., Jaeger/OTel Collector]
    
    E --> H[Log Storage e.g., Elasticsearch]
    F --> I[Metrics Storage e.g., Prometheus TSDB]
    G --> J[Trace Storage e.g., Jaeger Storage]
    
    H --> K[Observability Platform e.g., Grafana/Kibana]
    I --> K
    J --> K
    
    K --> L[Dashboards]
    K --> M[Alerting]
    K --> N[Analysis]
"""

display(Markdown(f"```mermaid\n{mermaid_diagram}```"))

## Conclusion

Integrating logging, metrics, and tracing provides comprehensive observability:

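1. **Unified View**: Trace and span IDs in every log line tie the logs to the trace that produced them
2. **Root Cause Analysis**: Start from a failed span and pull the log lines with the same trace ID to find the cause
3. **Performance Optimization**: Use the request-duration histogram to spot slow requests, then inspect their traces
4. **Proactive Monitoring**: Alert on metrics such as error rate and latency, then follow trace IDs to investigate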

**Best Practices**:
- Use a unified telemetry system like OpenTelemetry
- Include correlation IDs (trace/span IDs) in all telemetry data
- Set up dashboards that combine logs, metrics, and traces
- Automate instrumentation where possible
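
As one way to apply the correlation-ID practice in code that does not run inside an OpenTelemetry span, the sketch below uses only the standard library: a `contextvars.ContextVar` holds the current ID and a `logging.Filter` stamps it onto each record. The names `correlation_id`, `CorrelationFilter`, and `handle_request` are hypothetical, and a real service would more likely take the ID from the active trace.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID for the current request; "-" means no request is active
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True  # never drop the record


buffer = io.StringIO()  # in-memory sink so the output is easy to inspect
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())

demo_logger = logging.getLogger("correlation-demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo out of the root logger


def handle_request(prompt: str) -> str:
    """Set an ID for one request, log with it, then restore the old value."""
    request_id = uuid.uuid4().hex
    token = correlation_id.set(request_id)
    try:
        demo_logger.info("Request received")
        demo_logger.info("Request successful")
        return request_id
    finally:
        correlation_id.reset(token)


request_id = handle_request("What is observability?")
print(buffer.getvalue())
```

A context variable is used instead of a global so that concurrent requests, whether in threads or asyncio tasks, each see their own ID.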

This completes our exploration of observability for LLM services.