# RefLex LLM - Complete OpenAI Integration Guide

RefLex LLM is an OpenAI API fallback system that automatically switches between OpenAI, Azure OpenAI, and local AI when endpoints become unavailable. It provides failover while maintaining OpenAI API compatibility. The module is primarily intended for testing and CI runs, where local execution may be slower but is also less expensive. In the future, with the possibility of spinning up a load-balanced RefLex Kubernetes cluster, RefLex could be shaped into a failsafe mechanism.

## What is RefLex LLM?

RefLex LLM acts as an intelligent middleware layer between your application and various AI providers. When your primary OpenAI endpoint fails due to rate limits, outages, or network issues, RefLex automatically detects the failure and routes your requests to alternative providers without any code changes required.

## Key Features

- **Automatic Provider Selection**:  
Intelligently chooses between OpenAI, Azure OpenAI, and local Ollama based on availability and your preferences
- **Docker Integration**:  
Automatically manages local AI containers with zero configuration
- **OpenAI Compatibility**:  
Drop-in replacement for the OpenAI Python client with an identical API
- **Model Mapping**:  
Automatically maps OpenAI model names to equivalent local models
- **Configuration Management**:  
Supports file-based configuration for different environments
- **Health Monitoring**:  
Continuous health checking and automatic recovery
- **Performance Optimization**:  
Caches configurations and maintains persistent connections

## Installation and Setup

RefLex requires Python 3.8+ and Docker for local AI capabilities. The installation pulls in all necessary dependencies, including the OpenAI client, the Docker SDK, and configuration management tools.

In [None]:
!pip install reflex-llms numpy

## Provider Resolution and Basic Usage

RefLex automatically detects which AI providers are available and selects the best option based on your preference order. The system performs intelligent health checks by testing each provider in sequence and uses the first one that responds successfully.

### How Provider Testing Works

The provider resolution process involves several sophisticated steps:

1. **OpenAI Testing**: Makes test HTTP requests to api.openai.com or your custom endpoint, checking for valid API responses (200 or 401 status codes indicate a working endpoint)

2. **Azure Testing**: If Azure credentials are configured, tests the Azure OpenAI endpoint for accessibility and valid authentication

3. **RefLex Local**: Automatically starts Docker containers if needed, manages Ollama installation, and verifies local model availability

4. **Caching**: Successful configurations are cached to avoid repeated health checks and improve performance

The system is designed to be resilient and will automatically retry failed providers and handle network timeouts gracefully.

In [None]:
from reflex_llms import (
    get_openai_client, 
    get_selected_provider,
    get_module_status,
    is_using_reflex
)

# Configure client with provider preferences
client = get_openai_client(
    preference_order=["openai", "reflex"],
    openai_base_url="https://wrong.address.com/v1",  # Force fallback for demo
    timeout=5.0
)

# Display system status
status = get_module_status()
print(f"Selected provider: {get_selected_provider()}")
print(f"Using local RefLex: {is_using_reflex()}")
print(f"Config cached: {status['has_cached_config']}")
print(f"RefLex server running: {status['reflex_server_running']}")

## Chat Completions with Automatic Failover

RefLex provides identical OpenAI API functionality regardless of the underlying provider. All standard OpenAI parameters work seamlessly, including temperature, max_tokens, system messages, and advanced features like function calling. The client automatically handles provider differences behind the scenes, ensuring your application code remains unchanged.

### Response Handling and Metadata

When you make requests through RefLex, you receive standard OpenAI response objects with additional metadata about which provider was used. This transparency allows you to monitor provider usage patterns and optimize your configuration accordingly.

In [None]:
# Import display utilities
from utils import display_message, display_stream, display_embeddings

# Standard chat completion
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain how RefLex LLM works in simple terms."}],
    max_tokens=150
)

# Display formatted response
display_message(response, as_markdown=True)

print(f"Model used: {response.model}")
print(f"Tokens: {response.usage.total_tokens if response.usage else 'Unknown'}")
print(f"Provider: {get_selected_provider()}")

## Model Management and Automatic Mapping

One of RefLex's most powerful features is its intelligent model mapping system. When using the RefLex provider, requests for OpenAI models like "gpt-3.5-turbo" are automatically routed to compatible local models such as "llama3.2:3b". This mapping is configurable and can be customized based on your specific needs.
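
Conceptually, the mapping behaves like a dictionary lookup with user overrides applied on top. The sketch below assumes that shape; `MODEL_MAPPING` and `map_model` are hypothetical names, and the two example pairs are taken from models mentioned in this guide.

```python
from typing import Dict, Optional

# Illustrative defaults; the helper below is hypothetical, not library code.
MODEL_MAPPING: Dict[str, str] = {
    "gpt-3.5-turbo": "llama3.2:3b",
    "text-embedding-ada-002": "nomic-embed-text",
}


def map_model(requested: str, overrides: Optional[Dict[str, str]] = None) -> str:
    """Translate an OpenAI model name to a local one.

    Names without a mapping are passed through unchanged, so a caller can
    also request a local model directly.
    """
    mapping = {**MODEL_MAPPING, **(overrides or {})}
    return mapping.get(requested, requested)


print(map_model("gpt-3.5-turbo"))
print(map_model("gpt-3.5-turbo", {"gpt-3.5-turbo": "llama3.2:1b"}))
print(map_model("llama3.2:3b"))
```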

### Model Categories and Organization

RefLex organizes available models into logical categories to help you understand what's available and choose the right model for your task. The system automatically handles model downloads, updates, and lifecycle management.

In [None]:
# List and categorize available models
models = client.models.list()

chat_models = []
embedding_models = []
reasoning_models = []

for model in models.data:
    model_id = model.id
    if "embedding" in model_id:
        embedding_models.append(model_id)
    elif any(x in model_id for x in ["o1", "o3", "o4", "reasoning"]):
        reasoning_models.append(model_id)
    elif any(x in model_id for x in ["gpt", "llama", "gemma"]):
        chat_models.append(model_id)

print(f"Available models ({len(models.data)} total):")
print(f"Chat models: {len(chat_models)}")
print(f"Reasoning models: {len(reasoning_models)}")
print(f"Embedding models: {len(embedding_models)}")

# Show sample models
print(f"\nSample chat models: {sorted(chat_models)[:3]}")
print(f"Sample reasoning models: {sorted(reasoning_models)[:3]}")
print(f"Sample embedding models: {sorted(embedding_models)[:3]}")

## Working with Embeddings

RefLex fully supports OpenAI's embeddings API through local models, providing significant cost savings for applications that process large amounts of text. The text-embedding models are automatically mapped to compatible local alternatives like nomic-embed-text, which often provide comparable quality to OpenAI's models.

### Embedding Quality and Performance

Local embedding models can process text without network latency and API rate limits, making them ideal for batch processing, real-time applications, and privacy-sensitive use cases where data cannot leave your infrastructure.

In [None]:
# Create embeddings
embedding_response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="RefLex LLM provides seamless fallback between OpenAI and local AI models."
)

# Display embedding information using utility function
display_embeddings(embedding_response, show_stats=True)
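
Embedding vectors are typically compared with cosine similarity. In the standard OpenAI response shape the vector is available as `embedding_response.data[0].embedding`. The sketch below uses small hand-written vectors instead, so that it runs without any provider; `cosine_similarity` is an illustrative helper, not part of RefLex.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), round(orthogonal, 6))
```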

## Real-time Streaming Responses

RefLex fully supports OpenAI's streaming API, enabling real-time response generation that's essential for interactive applications like chatbots, coding assistants, and live content generation. The streaming functionality works identically across all providers, ensuring a consistent user experience regardless of which backend is serving the request.

### Streaming Benefits and Use Cases

Streaming is particularly valuable for longer responses, where users can start reading while the response is still being generated. This improves perceived performance and user engagement, which matters most for applications with real-time chat interfaces or interactive content generation.

In [None]:
# Create streaming request
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a brief explanation of AI failover systems."}],
    max_tokens=200,
    stream=True,
    temperature=0.7
)

# Display streaming response
full_response = display_stream(stream)
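
`display_stream` is a notebook helper whose source is not shown here. As a rough sketch of what consuming a stream by hand looks like, the loop below reads `chunk.choices[0].delta.content` from each chunk, which is the shape used by the OpenAI Python client. `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks exist only so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Print streamed text as it arrives and return the full response."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:  # content is None on role-only and final chunks
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streamed chat completion chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [fake_chunk("Fail"), fake_chunk("over"), fake_chunk(None)]
full_text = collect_stream(demo_stream)
```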

## Advanced Reasoning Models

RefLex provides access to specialized reasoning models through the o1, o3, and o4 series. These models are specifically designed for complex problem-solving, mathematical reasoning, and step-by-step analytical tasks. They excel at breaking down complex problems, showing their work, and providing detailed explanations of their reasoning process.

### When to Use Reasoning Models

Reasoning models are particularly effective for mathematical problems, logical puzzles, code debugging, system design questions, and any task that benefits from explicit step-by-step thinking. They typically use lower temperature settings for more focused and consistent reasoning.

In [None]:
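# Test reasoning model with markdown streaming
# Note: OpenAI's hosted o-series models expect `max_completion_tokens` instead of
# `max_tokens` and reject a non-default `temperature`; the parameters below assume
# the request is served by a mapped local model.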
reasoning_stream = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "A company has 3 servers handling 100 users each. Load increases 50% monthly. How many servers needed after 6 months?"}],
    max_tokens=300,
    stream=True,
    temperature=0.1
)

# Stream with markdown formatting
reasoning_response = display_stream(reasoning_stream, as_markdown=True)
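
To sanity-check the model's answer, the arithmetic behind the prompt can be worked out directly. The sketch assumes that "load" means user count, that the 50% increase compounds monthly, and that each server still handles 100 users.

```python
import math

servers = 3
users_per_server = 100
monthly_growth = 1.5  # a 50% increase per month
months = 6

users_now = servers * users_per_server               # 300
users_later = users_now * monthly_growth ** months   # 300 * 1.5**6
servers_needed = math.ceil(users_later / users_per_server)

print(f"Users after {months} months: {users_later:.1f}")  # 3417.2
print(f"Servers needed: {servers_needed}")                # 35
```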

## Server Management and Infrastructure Control

When using the RefLex provider, you gain access to powerful server management capabilities that allow you to monitor, control, and configure your local AI infrastructure. This includes container lifecycle management, model deployment, health monitoring, and resource optimization.

### Server Access and Monitoring

The server management interface provides real-time visibility into your local AI infrastructure, including container status, model availability, resource usage, and performance metrics. This transparency is crucial for production deployments where you need to ensure reliable service availability.

In [None]:
from reflex_llms import get_reflex_server

# Access RefLex server instance
server = get_reflex_server()

if server:
    print(f"API URL: {server.api_url}")
    print(f"OpenAI Compatible URL: {server.openai_compatible_url}")
    print(f"Host: {server.host}")
    print(f"Port: {server.port}")
    print(f"Container: {server.container_name}")
    print(f"Running: {server.is_running}")
    print(f"Healthy: {server.is_healthy}")
    
    # Detailed status
    status = server.get_status()
    print(f"Setup complete: {status.get('setup_complete', False)}")
    print(f"Total models: {status.get('total_models', 0)}")
    print(f"OpenAI models: {len(status.get('openai_compatible_models', []))}")
    
else:
    print(f"Not using RefLex server. Current provider: {get_selected_provider()}")

## Configuration Management and Customization

RefLex supports sophisticated configuration management through JSON files that allow you to customize every aspect of the system's behavior. This includes provider preferences, timeout settings, model mappings, server configurations, and environment-specific overrides.

### Configuration Structure and Hierarchy

The configuration system follows a clear hierarchy where function parameters override environment variables, which override configuration file settings. This flexibility allows you to maintain base configurations in files while providing runtime overrides for specific deployments.

Configuration files support:
- Provider preference orders for different environments
- Custom API endpoints and authentication settings
- RefLex server container and deployment configurations
- Model mapping customizations for specific use cases
- Performance tuning parameters like timeouts and retry logic
- Environment-specific overrides for development, staging, and production
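
The precedence described above (function parameters over environment variables over file settings) can be sketched as a layered merge. This is an illustration only: `resolve_settings` is a hypothetical helper and `REFLEX_TIMEOUT` is a made-up variable name, not a documented RefLex setting.

```python
import json
from typing import Any, Dict, Mapping, Optional


def resolve_settings(
    file_settings: Mapping[str, Any],
    env: Mapping[str, str],
    overrides: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge settings so that overrides > environment > file."""
    settings = dict(file_settings)  # lowest priority

    # REFLEX_TIMEOUT is a hypothetical variable name used for illustration.
    if "REFLEX_TIMEOUT" in env:
        settings["timeout"] = float(env["REFLEX_TIMEOUT"])

    for key, value in (overrides or {}).items():
        if value is not None:  # None means "not passed by the caller"
            settings[key] = value
    return settings


file_settings = json.loads(
    '{"preference_order": ["openai", "reflex"], "timeout": 10.0}'
)
merged = resolve_settings(
    file_settings,
    env={"REFLEX_TIMEOUT": "5"},
    overrides={"preference_order": ["reflex"], "timeout": None},
)
print(merged)
```

Here the file supplies both values, the environment replaces the timeout, and the explicit argument replaces the preference order.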

In [None]:
from utils import display_json
display_json("reflex.json")

In [None]:
# Load client from file
client = get_openai_client(
    preference_order=["openai", "reflex"],
    openai_base_url="https://wrong.address.com/v1",
    from_file=True,
)

## Performance Optimization and Production Deployment

RefLex is designed for production use with comprehensive performance optimization features and enterprise-grade reliability. The system includes intelligent caching, connection pooling, health monitoring, and automatic recovery mechanisms.

### Performance Optimization Strategies

- **Provider Selection Optimization**: Place the fastest and most reliable providers first in your preference order to minimize latency
- **Configuration Caching**: Configurations are automatically cached to avoid repeated health checks and improve response times
- **Model Selection Strategy**: Choose appropriate models for specific tasks: smaller models for simple tasks, specialized models for complex reasoning
- **Local Server Optimization**: Use minimal_setup=True for faster startup times during development
- **Timeout Configuration**: Set appropriate timeouts based on your application's latency requirements

### Production Deployment Considerations

- **Security Best Practices**: Store API keys securely in environment variables, use HTTPS for all communications, implement proper authentication
- **Reliability and High Availability**: Configure multiple fallback providers, implement health checks and alerting, use persistent storage for model data
- **Monitoring and Observability**: Track provider selection patterns, monitor response times and error rates, implement comprehensive logging
- **Scaling and Resource Management**: Use container orchestration for multiple instances, implement horizontal pod autoscaling, plan for model storage requirements
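
One lightweight way to track provider selection, response times, and error rates is a small in-process collector. The `ProviderStats` class below is a hypothetical sketch, not a RefLex feature; a production setup would more likely export these numbers to an existing metrics system.

```python
import time
from collections import Counter, defaultdict
from typing import Callable, Dict, List


class ProviderStats:
    """Collect per-provider request counts, error counts, and latencies."""

    def __init__(self) -> None:
        self.requests: Counter = Counter()
        self.errors: Counter = Counter()
        self.latencies: Dict[str, List[float]] = defaultdict(list)

    def record(self, provider: str, call: Callable[[], object]) -> object:
        """Run `call`, attributing its outcome and duration to `provider`."""
        start = time.perf_counter()
        self.requests[provider] += 1
        try:
            return call()
        except Exception:
            self.errors[provider] += 1
            raise
        finally:
            self.latencies[provider].append(time.perf_counter() - start)

    def error_rate(self, provider: str) -> float:
        total = self.requests[provider]
        return self.errors[provider] / total if total else 0.0


stats = ProviderStats()
stats.record("reflex", lambda: "ok")
try:
    stats.record("openai", lambda: 1 / 0)  # simulated failure
except ZeroDivisionError:
    pass
print(dict(stats.requests), stats.error_rate("openai"))
```

In this notebook, the provider name could come from `get_selected_provider()` and the callable could wrap a `client.chat.completions.create(...)` request.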

In [None]:
# Performance optimization example
optimized_config = {
    "preference_order": ["openai", "reflex"],  # Skip unused providers
    "timeout": 3.0,  # Faster timeout
    "reflex_server": {
        "model_mappings": {
            "minimal_setup": True,  # Faster startup
            "minimal_model_mapping": {
                "gpt-3.5-turbo": "llama3.2:1b",  # Smaller, faster model
                "gpt-4o-mini": "llama3.2:1b"
            }
        }
    }
}

import json
print("Optimized Configuration:")
print(json.dumps(optimized_config, indent=2))

# Environment-based configuration
import os
environment = os.getenv('ENVIRONMENT', 'development')

if environment == 'development':
    preference = ["reflex", "openai"]
elif environment == 'production':
    preference = ["openai", "azure"]
else:
    preference = ["reflex"]

print(f"Environment: {environment}")
print(f"Provider preference: {preference}")

## Error Handling and System Diagnostics

RefLex provides robust error handling and recovery mechanisms designed to handle real-world deployment scenarios. The system gracefully handles common issues like Docker unavailability, network connectivity problems, model unavailability, and port conflicts.

### Common Scenarios and Automatic Recovery

- **Docker Not Running**: System gracefully falls back to cloud providers without interruption
- **Network Connectivity Issues**: Automatic retries with configurable timeouts and exponential backoff
- **Model Unavailability**: Intelligent model mapping with automatic downloading and version management
- **Port Conflicts**: Automatic port management, container cleanup, and conflict resolution
- **Provider Rate Limiting**: Automatic failover to alternative providers when limits are reached
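
Retrying with exponential backoff, as mentioned for network issues above, follows a simple pattern. The sketch below shows the general technique rather than RefLex's internal retry logic; `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument is injectable so the example runs without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on OSError, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("max_attempts must be at least 1")


attempts = {"count": 0}
waits = []


def flaky() -> str:
    """Fail twice with a network-style error, then succeed."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network error")
    return "ok"


result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```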

### System Diagnostics and Troubleshooting

The diagnostic system provides comprehensive visibility into system state, configuration status, and provider health. This information is essential for troubleshooting issues and optimizing performance in production environments.

In [None]:
from reflex_llms import clear_cache, stop_reflex_server

# System diagnostics
final_status = get_module_status()
print(f"Provider: {final_status['selected_provider']}")
print(f"RefLex available: {final_status['reflex_server_running']}")
print(f"Configuration cached: {final_status['has_cached_config']}")

# Environment checks
import os
print(f"OpenAI API Key: {'Set' if os.getenv('OPENAI_API_KEY') else 'Not Set'}")
print(f"Azure endpoint: {'Set' if os.getenv('AZURE_OPENAI_ENDPOINT') else 'Not Set'}")

# Cache management
print("Use clear_cache() to force provider re-resolution")
print("Use stop_reflex_server() to clean up local resources")

## Summary and Best Practices

RefLex LLM provides a comprehensive solution for building reliable AI-powered applications with automatic failover capabilities. The system is designed to handle the complexities of multi-provider AI deployment while maintaining the simplicity of the standard OpenAI API.

### Core Advantages and Value Proposition

- **Reliability and High Availability**: Automatic failover ensures applications remain operational even during provider outages or rate limiting
- **Cost Efficiency**: Local models significantly reduce API costs for development, testing, and batch processing workloads
- **API Compatibility**: Drop-in replacement for the OpenAI Python client, with no application code changes required
- **Operational Flexibility**: Support for multiple providers and deployment configurations from development to enterprise scale
- **Developer Experience**: Minimal configuration required with intelligent defaults and comprehensive documentation

### Implementation Best Practices

- **Configuration Management**: Use configuration files for environment-specific settings and maintain clear separation between development, staging, and production configurations
- **Testing and Validation**: Regularly test failover scenarios in staging environments to ensure smooth operation during actual outages
- **Monitoring and Observability**: Implement comprehensive monitoring of provider selection patterns, response times, and error rates
- **Performance Optimization**: Monitor and optimize provider preferences, model selections, and timeout configurations based on actual usage patterns
- **Security and Compliance**: Follow security best practices for credential management, network access control, and data handling

RefLex LLM seamlessly bridges cloud and local AI infrastructure, ensuring applications remain operational and cost-effective across diverse deployment scenarios. The automatic failover and intelligent model mapping capabilities make it an ideal foundation for production AI applications that require both high availability and operational flexibility.