# MLflow AI Gateway: Open, Unified AI Gateway with Built-In Intelligence

MLflow AI Gateway gives you a single interface for OpenAI, Anthropic, Google, Mistral, and more. It delivers the core gateway features developers expect and integrates MLflow’s GenAI capabilities, such as tracing, prompt versioning, and optimization.

This makes MLflow AI Gateway more than a gateway. It is the open Intelligent Control Plane for GenAI, helping teams continuously improve quality, cost, and efficiency using real production data.

## Key Capabilities

- **Unified API** — Same interface for all providers  
- **OpenAI-Compatible** — Drop-in replacement  
- **MLflow-Integrated** — Auto tracing, metrics, and logging  
- **Prompt Versioning and Optimization** — Git-like registration, reuse and optimization
- **Cost and Token Tracking** — Built in  
- **Production Primitives** — Retries, fallbacks, routing  


## Installation & Setup

In [1]:
# Install mlflowlite
%pip install -e . --force-reinstall

Obtaining file:///Users/ahmed.bilal/Desktop/gateway-oss
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting mlflow>=2.10.0 (from mlflowlite==0.1.0)
  Using cached mlflow-3.6.0-py3-none-any.whl.metadata (31 kB)
Collecting litellm>=1.30.0 (from mlflowlite==0.1.0)
  Using cached litellm-1.79.3-py3-none-any.whl.metadata (30 kB)
Collecting pydantic>=2.0.0 (from mlflowlite==0.1.0)
  Using cached pydantic-2.12.4-py3-none-any.whl.metadata (89 kB)
Collecting python-dotenv>=1.0.0 (from mlflowlite==0.1.0)
  Using cached python_dotenv-1.2.1-py3-none-any.whl.metadata (25 kB)
Collecting openai>=1.12.0 (from mlflowlite==0.1.0)
  Using cached openai-2.7.2-py3-none-any.whl.metadata (29 kB)
Collecting anthropic>=0.18.0 (from mlflowlite==0.1.0)
  Using cached anthropic-0.72.1-py3-none-any.whl.

In [None]:
# Import mlflowlite
import mlflow
import mlflowlite as mla
from datetime import datetime

# Configure MLflow tracking
mlflow.set_tracking_uri("sqlite:///mlflow.db")

# Create unique experiment name for this demo session
experiment_name = f"mlflowlite_demo_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
mlflow.set_experiment(experiment_name)

print("✅ mlflowlite ready!")
print(f"   📊 Tracking: sqlite:///mlflow.db")
print(f"   🧪 Experiment: {experiment_name}")
print("   🌐 UI: http://localhost:5000")  # served once you run: mlflow ui --backend-store-uri sqlite:///mlflow.db


# Part 1: Unified API

One API that works with **every provider**.

## Your First Call

In [3]:
# Call any model with the same API
response = mla.completion(
    model="claude-haiku-4-5-20251001",
    messages=[{"role": "user", "content": "Say hello in 3 words"}]
)

print("✅ Response received!\n")
print(f"Model:   {response.model}")
print(f"Content: {response.content}")
print(f"Tokens:  {response.usage['total_tokens']}")
print(f"Cost:    ${response.cost:.4f}")

✅ Response received!

Model:   claude-haiku-4-5-20251001
Content: Hello, world friend.
Tokens:  22
Cost:    $0.0002


## Switch Providers

Change the model; everything else stays the same.

In [4]:
# Test different providers with SAME code
models = [
    "claude-haiku-4-5-20251001",   # Anthropic
    # "gpt-5-mini",                # OpenAI
    # "gemini-1.5-flash",          # Google
]

print("Model                          | Response | Cost")
print("-" * 55)

for model in models:
    try:
        resp = mla.completion(
            model=model,
            messages=[{"role": "user", "content": "Hello"}]
        )
        content = resp.content[:10]
        print(f"{model:<30} | {content:<8} | ${resp.cost:.4f}")
    except Exception as exc:  # e.g. a missing API key for this provider
        print(f"{model:<30} | Skipped ({type(exc).__name__})")

print("\n✅ Same API, different providers!")

Model                          | Response | Cost
-------------------------------------------------------
claude-haiku-4-5-20251001      | Hello! 👋 H | $0.0002

✅ Same API, different providers!
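One way a unified call can choose a provider from the model name is a simple prefix lookup. The sketch below is illustrative only: `PROVIDER_PREFIXES` and `resolve_provider` are hypothetical names, and mlflowlite's actual routing is not shown in this notebook and may work differently.

```python
# Hypothetical prefix table; mlflowlite's real routing logic may differ.
PROVIDER_PREFIXES = {
    "claude-": "anthropic",
    "gpt-": "openai",
    "gemini-": "google",
}


def resolve_provider(model: str) -> str:
    """Return the provider whose prefix matches the model name."""
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return provider
    raise ValueError(f"Unknown provider for model: {model!r}")


for name in ["claude-haiku-4-5-20251001", "gpt-5-mini", "gemini-1.5-flash"]:
    print(f"{name:<30} -> {resolve_provider(name)}")
```

Because the caller only ever passes a model string, adding a provider in a design like this means adding one table entry rather than changing call sites.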


# Part 2: Automatic Tracking

Every call is tracked: cost, latency, and tokens.

In [None]:
# View tracking info
print("📊 Every call logged automatically with full metrics:\n")
print(f"   Cost:    ${response.cost:.4f}")
print(f"   Latency: {response.latency:.2f}s")
print(f"   Tokens:  {response.usage['total_tokens']}")

# Show trace ID for MLflow UI
if hasattr(response, "trace_id") and response.trace_id != "no_trace":
    print(f"\n🔍 🔗 Trace: {response.trace_url or response.trace_id}")


📊 Automatic Tracking:

   Cost:    $0.0002
   Latency: 1.78s
   Tokens:  22

💡 Every call logged automatically with full metrics

🔍 🔗 Trace: http://localhost:5000/#/experiments/4/runs/c8d71a98d6654560b67ae4e7a1852b1b
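To make the three tracked numbers concrete, here is a minimal sketch of how a wrapper could measure latency and derive cost from token usage. Everything in it is an assumption for illustration: `track`, `TrackedCall`, `fake_model`, and the per-token price are hypothetical and are not mlflowlite's implementation or a real provider rate.

```python
import time
from dataclasses import dataclass


@dataclass
class TrackedCall:
    latency: float     # seconds spent inside the wrapped call
    total_tokens: int  # tokens reported by the wrapped call
    cost: float        # USD, derived from total_tokens


def track(fn, price_per_1k_tokens: float):
    """Wrap fn so each call returns (text, TrackedCall).

    fn must return (text, total_tokens). The price is an illustrative
    assumption, not a real provider rate.
    """
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        text, total_tokens = fn(*args, **kwargs)
        latency = time.perf_counter() - start
        cost = total_tokens / 1000 * price_per_1k_tokens
        return text, TrackedCall(latency, total_tokens, cost)
    return wrapper


def fake_model(prompt: str):
    # Stand-in for a provider call: upper-cases the prompt, counts words as tokens.
    return prompt.upper(), len(prompt.split())


tracked = track(fake_model, price_per_1k_tokens=0.01)
text, info = tracked("say hello in three words")
print(f"Tokens: {info.total_tokens}  Cost: ${info.cost:.6f}  Latency: {info.latency:.4f}s")
```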


# Part 3: Prompt Versioning

Git-like version control for your prompts

## Register + Use in One Call

No separate registration needed!

In [17]:
# 🎯 Register + Use in ONE call!
response = mla.completion(
    model="claude-haiku-4-5-20251001",
    prompt_id="sentiment",
    prompt_template="Classify sentiment of this text: {{text}}",  # Auto-registers!
    prompt_variables={"text": "Amazing product!"}
)

print("\n✅ Prompt auto-registered to MLflow!")

✅ Registered prompt 'sentiment_prompt' version 11 in MLflow
   🔗 http://localhost:5000/#/prompts/sentiment_prompt

✅ Prompt auto-registered to MLflow!


## Reuse Registered Prompt

The second call doesn't need `prompt_template` because the prompt is already registered.

In [None]:
# Reuse - no template needed!
response2 = mla.completion(
    model="claude-haiku-4-5-20251001",
    prompt_id="sentiment",
    prompt_variables={"text": "Terrible experience"}
)

print(f"Response: {response2.choices[0]['message']['content']}")

print("\n✅ Loaded from registry automatically!")

Response: # Sentiment Classification

**Sentiment: Negative** 😞

**Confidence: Very High**

The text "Terrible experience" expresses clear dissatisfaction and disappointment. The word "terrible" is a strong negative descriptor that indicates a very poor experience.
Tokens:   78

✅ Loaded from registry automatically!
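The register-then-reuse flow can be pictured with a small in-memory registry. This is a sketch under stated assumptions, not how MLflow stores prompts: `PromptRegistry` and `render` are hypothetical names, and skipping a new version when the template is unchanged is a design choice of this sketch only.

```python
import re


class PromptRegistry:
    """In-memory stand-in for a versioned prompt store (hypothetical)."""

    def __init__(self):
        self._versions = {}  # prompt_id -> list of templates, oldest first

    def register(self, prompt_id: str, template: str) -> int:
        """Store a template and return its 1-based version number."""
        versions = self._versions.setdefault(prompt_id, [])
        # Design choice of this sketch: an unchanged template keeps its version.
        if not versions or versions[-1] != template:
            versions.append(template)
        return len(versions)

    def load(self, prompt_id: str, version=None) -> str:
        """Return the requested version, or the latest when version is None."""
        versions = self._versions[prompt_id]
        return versions[-1] if version is None else versions[version - 1]


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), template)


registry = PromptRegistry()
v1 = registry.register("sentiment", "Classify sentiment of this text: {{text}}")
prompt = render(registry.load("sentiment"), {"text": "Terrible experience"})
print(v1, prompt)
```

Keeping every version addressable by number is what makes it possible to pin a known-good prompt or roll back after a bad edit.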


# Part 4: Model Migration

Use the unified API to safely migrate between models

## Auto-Rewrite Prompts for Cost Reduction

Migrate from expensive models (Claude Sonnet) to cheaper ones (Claude Haiku) while maintaining quality.

Uses MLflow's [`optimize_prompts()`](https://mlflow.org/docs/latest/genai/prompt-registry/rewrite-prompts/) API.

In [8]:
# Step 1: Start with expensive model (Claude Sonnet)

# Register base prompt for sentiment classification
response = mla.completion(
    model="claude-sonnet-4-5-20250929",
    prompt_id="sentiment_classifier",
    prompt_template="Classify sentiment. Answer 'positive', 'negative', or 'neutral'.\n\nText: {{text}}",
    prompt_variables={"text": "This product is amazing!"}
)

print(f"Response: {response.choices[0]['message']['content']}")
print(f"Cost: ${response.cost:.4f}")
print(f"\n💰 Claude Sonnet is expensive but accurate")

✅ Registered prompt 'sentiment_classifier_prompt' version 26 in MLflow
   🔗 http://localhost:5000/#/prompts/sentiment_classifier_prompt
Response: positive
Cost: $0.0004

💰 Claude Sonnet is expensive but accurate


## Collect Training Data

In [None]:
# Step 2: Collect outputs from expensive model
training_inputs = [
    "This movie was absolutely fantastic!",
    "Terrible service and cold food.",
    "It was okay, nothing special.",
    "Complete waste of money!",
    "Best experience ever!",
    "Works as described. No complaints.",
    "This exceeded my expectations!",
    "Worst customer support ever.",
    "Fine for the price.",
    "Truly wonderful product!"
]

# Collect outputs and costs from Claude Sonnet
# Traces are automatically captured by mla.completion()
Sonnet_outputs = []
Sonnet_total_cost = 0

print("Collecting training data from Claude Sonnet...\n")
for text in training_inputs:
    result = mla.completion(
        model="claude-sonnet-4-5-20250929",
        prompt_id="sentiment_migrate",
        prompt_variables={"text": text}
    )
    
    output = result.choices[0]['message']['content'].lower()
    Sonnet_outputs.append(output)
    Sonnet_total_cost += result.cost
    
    #print(f"{text[:40]:40} → {output}")

print(f"\n💰 Claude Sonnet Total Cost: ${Sonnet_total_cost:.4f}")
print("✅ Traces automatically captured by MLflow")

Collecting training data from Claude Sonnet...

This movie was absolutely fantastic!     → **positive**

the text expresses clear enthusiasm and praise using strongly positive language like "absolutely fantastic!"
Terrible service and cold food.          → **negative**

this text expresses clear dissatisfaction with both the service quality ("terrible") and the food condition ("cold"), indicating a negative experience.
It was okay, nothing special.            → **neutral** (leaning slightly negative)

this text expresses a lukewarm, indifferent sentiment. the phrase "it was okay" suggests mediocrity, and "nothing special" reinforces that there was nothing particularly noteworthy or impressive about the experience.

💰 Claude Sonnet Total Cost: $0.0021
✅ Traces automatically captured by MLflow


## Create Dataset from Collected Outputs

In [None]:
# Step 3: Create the training dataset from the collected Sonnet outputs
import pandas as pd

# Convert to DataFrame format
dataset_df = pd.DataFrame([
    {
        "inputs": {"text": text},
        "outputs": output,
        "expectations": {"expected_response": output}  # Required by Equivalence scorer
    }
    for text, output in zip(training_inputs, Sonnet_outputs)
])

# MLflow optimize_prompts() accepts DataFrames directly, so no create_dataset() call is needed
print(f"✅ Created dataset with {len(dataset_df)} examples")
print('   Structure: {"inputs": {"text": ...}, "outputs": ...}')

✅ Created dataset with 3 examples
   Structure: {"inputs": {"text": ...}, "outputs": ...}


## Test Cheaper Model (Before Optimization)

In [None]:
# Step 4: Test cheap model (before optimization)
haiku_before_cost = 0
mismatches_before = 0

print("Testing Claude Haiku with original prompt...\n")
for i, text in enumerate(training_inputs[:5]):  # Test on subset
    result = mla.completion(
        model="claude-haiku-4-5-20251001",
        prompt_id="sentiment_migrate",
        prompt_variables={"text": text}
    )
    
    haiku_output = result.choices[0]['message']['content'].lower()
    haiku_before_cost += result.cost
    
    match = "✅" if haiku_output == Sonnet_outputs[i] else "⚠️ "
    if match == "⚠️ ":
        mismatches_before += 1
    
    #print(f"{match} {text[:40]:40} → {haiku_output}")

print(f"\n⚠️  Quality Issues: {mismatches_before}/5 mismatches")
print(f"💰 Cost: ${haiku_before_cost:.4f}")

Testing Claude Haiku with original prompt...

⚠️  This movie was absolutely fantastic!     → **positive**

the text expresses a clearly positive sentiment. the word "absolutely fantastic" is strong praise, indicating the speaker enjoyed the movie very much.
⚠️  Terrible service and cold food.          → **negative**

this text expresses clear dissatisfaction with both the service quality and food temperature, using the negative descriptor "terrible" and implying the food was unacceptable by being cold.
⚠️  It was okay, nothing special.            → **neutral**

the text expresses a lukewarm, middling opinion. phrases like "okay" and "nothing special" indicate the person found the experience acceptable but unremarkable—neither particularly good nor bad. this is characteristic of neutral sentiment.

⚠️  Quality Issues: 3/5 mismatches
💰 Cost: $0.0023
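The comparison above uses exact string equality, so two responses that agree on the label but word their explanation differently still count as a mismatch. One possible alternative is to compare only the extracted label. The sketch below assumes the label appears as a plain word in the response; `extract_label` is a hypothetical helper, not part of mlflowlite, and taking the first label found is a simplification that can misread text such as "not positive".

```python
import re

LABELS = ("positive", "negative", "neutral")
_LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b")


def extract_label(response: str):
    """Return the first sentiment label found in the response, or None."""
    match = _LABEL_RE.search(response.lower())
    return match.group(1) if match else None


# Shortened versions of the two models' answers to the same input.
sonnet = "**positive**\n\nthe text expresses clear enthusiasm and praise"
haiku = "**positive**\n\nthe text expresses a clearly positive sentiment."

print("Exact match:", sonnet == haiku)
print("Label match:", extract_label(sonnet) == extract_label(haiku))
```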


## Optimize Prompt for Cheaper Model

Uses MLflow's `optimize_prompts()` to rewrite the prompt

In [12]:
# Step 5: Optimize the prompt in one call
optimized_prompt = mla.optimize_prompt(
    prompt_template="Classify the sentiment of this text as positive, negative, or neutral: {{text}}",
    dataset=dataset_df,
    model_from="claude-sonnet-4-5-20250929",  # Expensive model
    model_to="claude-haiku-4-5-20251001",      # Cheap model
    prompt_id="sentiment_migrate",
    max_iterations=2,
    save_optimized=True  # Auto-saves as 'sentiment_migrate_optimized'
)

print(f"\n✅ Done! Optimized prompt saved.")
print(f"\n💡 Use it: mla.completion(model='claude-haiku-4-5-20251001', prompt_id='sentiment_migrate_optimized', ...)")


🔄 Optimizing prompt for migration
   From: claude-sonnet-4-5-20250929
   To:   claude-haiku-4-5-20251001
   Iterations: 6

   Running 6 optimization iterations...
   (Total metric calls: 18)

Iteration 0: Base program full valset score: 0.0
Iteration 1: Selected program 0 score: 0.0
Iteration 1: Proposed new text for sentiment_migrate: Classify the sentiment of this text as positive, negative, or neutral: {{text}}

IMPORTANT TECHNICAL REQUIREMENTS:
- Before performing sentiment classification, ensure any active MLflow runs are properly ended by calling mlflow.end_run() to avoid "Run already active" errors
- If you encounter an MLflow run conflict, end the current run before starting a new one
- Handle MLflow run state management properly to prevent execution failures

OUTPUT FORMAT:
Your response should be formatted as follows:
1. Start with the sentiment classification in bold: **positive**, **negative**, or **neutral**
2. If the sentiment leans between categories, you can indicate th

## Test with Optimized Prompt

In [None]:
# Step 6: Test cheap model with optimized prompt
haiku_after_cost = 0
mismatches_after = 0

# Test on same subset
for i, text in enumerate(training_inputs[:5]):
    result = mla.completion(
        model="claude-haiku-4-5-20251001",
        prompt_id="sentiment_migrate_optimized",  # saved by optimize_prompt(save_optimized=True)
        prompt_variables={"text": text}
    )
    
    haiku_output = result.choices[0]['message']['content'].lower()
    haiku_after_cost += result.cost
    
    match = "✅" if haiku_output == Sonnet_outputs[i] else "⚠️ "
    if match == "⚠️ ":
        mismatches_after += 1
    
    #print(f"{match} {text[:40]:40} → {haiku_output}")

print(f"\n📊 Quality after optimization: {mismatches_after}/5 mismatches (was {mismatches_before}/5)")
print(f"💰 Cost: ${haiku_after_cost:.4f}")

⚠️  This movie was absolutely fantastic!     → the sentiment of this text is **positive**.

**reasoning:**
the text contains clear indicators of positive sentiment:
- the word "absolutely" is an intensifier that strengthens the positive evaluation
- "fantastic" is an explicitly positive adjective expressing high praise and enthusiasm
- the exclamation point at the end conveys excitement and strong emotional approval

these elements together demonstrate a favorable opinion and enthusiastic endorsement of the movie.
⚠️  Terrible service and cold food.          → the sentiment of this text is **negative**.

**reasoning:**

this text contains clear indicators of dissatisfaction and complaint:

- **"terrible"** - this is a strongly negative descriptor that explicitly expresses unfavorable judgment
- **"cold food"** - this describes a specific problem with the meal quality, indicating a service failure

the combination of these two complaints—poor service quality and inadequate food temperat

## Cost & Quality Summary

In [None]:
# Summary
print("="*60)
print("📊 MODEL MIGRATION RESULTS")
print("="*60)
print(f"\n💰 COST COMPARISON:")
# Note: Sonnet_total_cost covers every training input, while the Haiku totals cover only the first five.
print(f"   Claude Sonnet (original):        ${Sonnet_total_cost:.4f}")
print(f"   Claude Haiku (before optimize):  ${haiku_before_cost:.4f}")
print(f"   Claude Haiku (after optimize):   ${haiku_after_cost:.4f}")

savings = Sonnet_total_cost - haiku_after_cost
savings_pct = (savings / Sonnet_total_cost) * 100
print(f"\n   💸 Savings: ${savings:.4f} ({savings_pct:.1f}%)")

print(f"\n✅ QUALITY:")
print(f"   Before optimization: {mismatches_before} mismatches")
print(f"   After optimization:  {mismatches_after} mismatches")

print(f"\n🎯 KEY TAKEAWAY:")
# Only claim success when the run actually saved money without losing quality.
if savings > 0 and mismatches_after <= mismatches_before:
    print(f"   Optimized prompt maintains quality at {savings_pct:.1f}% lower cost!")
else:
    print(f"   No savings in this run: cost changed by {-savings_pct:+.1f}% with {mismatches_after} mismatches.")
print("="*60)

print(f"\n📚 Learn more: https://mlflow.org/docs/latest/genai/prompt-registry/rewrite-prompts/")

📊 MODEL MIGRATION RESULTS

💰 COST COMPARISON:
   Claude Sonnet (original):        $0.0021
   Claude Haiku (before optimize):  $0.0023
   Claude Haiku (after optimize):   $0.0127

   💸 Savings: $-0.0106 (-512.0%)

✅ QUALITY:
   Before optimization: 3 mismatches
   After optimization:  3 mismatches

🎯 KEY TAKEAWAY:
   No savings in this run: cost changed by +512.0% with 3 mismatches.

📚 Learn more: https://mlflow.org/docs/latest/genai/prompt-registry/rewrite-prompts/
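The savings figure is a relative change, (baseline - candidate) / baseline × 100, so a negative value means the candidate cost more than the baseline. Totals are only comparable when they cover the same number of calls, so the sketch below normalizes to cost per call first. The helper names are hypothetical and the numbers are illustrative, not taken from the run above.

```python
def per_call_cost(total_cost: float, num_calls: int) -> float:
    """Average cost of one call, so totals over different batch sizes can be compared."""
    if num_calls <= 0:
        raise ValueError("num_calls must be positive")
    return total_cost / num_calls


def savings_percent(baseline: float, candidate: float) -> float:
    """Percent saved relative to baseline; negative means the candidate costs more."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline * 100


# Illustrative numbers only (not taken from the run above).
baseline = per_call_cost(0.0100, num_calls=10)  # expensive model, 10 calls
candidate = per_call_cost(0.0020, num_calls=5)  # cheap model, 5 calls
print(f"Savings per call: {savings_percent(baseline, candidate):.1f}%")
```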


# Part 5: Advanced Features

## A/B Testing

In [None]:
from mlflowlite.routing import create_ab_test

# Create A/B test
ab_test = create_ab_test(
    name="model_test",
    variants={
        'prod': {'model': 'claude-sonnet-4-5-20250929'},
        'new':  {'model': 'claude-haiku-4-5-20251001'}
    },
    split=[0.5, 0.5]  # 50/50 split for demo
)

# Test with simple inputs
for text in ["Great!", "Bad", "Okay"]:
    variant, resp = ab_test.run(
        messages=[{"role": "user", "content": text}]
    )
    print(f"{variant:6s} | {text:6s} | ${resp.cost:.4f}")

ab_test.print_report()
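A traffic split like `split=[0.5, 0.5]` can be implemented as a weighted random choice per request. The sketch below shows that idea with the standard library only; `pick_variant` is a hypothetical helper and is not claimed to be how `create_ab_test` assigns variants.

```python
import random
from collections import Counter


def pick_variant(variants, weights, rng):
    """Choose one variant name with probability proportional to its weight."""
    if len(variants) != len(weights):
        raise ValueError("variants and weights must have the same length")
    return rng.choices(variants, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
counts = Counter(pick_variant(["prod", "new"], [0.5, 0.5], rng) for _ in range(1000))
print(counts)
```

With only a handful of requests, as in the three-input demo above, the observed split can be far from 50/50; the proportions only settle as the number of requests grows.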


## Reliability: Retry + Fallback

In [16]:
# Production-ready with reliability
resp = mla.completion(
    model="claude-haiku-4-5-20251001",
    messages=[{"role": "user", "content": "Hello!"}],
    timeout=30,
    max_retries=5,
    fallback_models=["claude-sonnet-4-5-20250929", "gpt-3.5-turbo"]
)

print(f"✅ Response: {resp.content}")
print(f"   Model: {resp.model}")
print("\n💡 Automatic retry and fallback on failures")

✅ Response: Hello! 👋 How's it going? Is there anything I can help you with today?
   Model: claude-haiku-4-5-20251001

💡 Automatic retry and fallback on failures
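The call above sets `max_retries` and `fallback_models`, but the notebook does not show how they are applied. A common pattern, sketched here under that caveat, is to retry each model with exponential backoff before moving to the next one in the list. `call_with_fallback` and `flaky` are hypothetical names, and the model names are placeholders.

```python
import time


def call_with_fallback(call, models, max_retries=2, base_delay=0.1, sleep=time.sleep):
    """Try each model in order, retrying a failing model before moving on.

    `call(model)` performs the request and raises on failure.
    Returns (model_used, result), or raises RuntimeError if every model fails.
    """
    last_error = None
    for model in models:
        for attempt in range(max_retries + 1):
            try:
                return model, call(model)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
                if attempt < max_retries:
                    sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("all models failed") from last_error


def flaky(model):
    # Stand-in provider call: the primary model always times out.
    if model == "primary-model":
        raise TimeoutError("simulated timeout")
    return f"hello from {model}"


used, result = call_with_fallback(
    flaky, ["primary-model", "backup-model"], sleep=lambda seconds: None
)
print(used, result)
```

Passing `sleep` as a parameter keeps the sketch fast to run and easy to test; production code would normally leave the default in place.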


# 🎯 End of Notebook