# CacheSaver — Multi-Provider Examples

This notebook shows how to use CacheSaver with each supported LLM provider.
CacheSaver acts as a drop-in wrapper: you keep calling the provider API you
already know, while caching, deduplication, and batching happen behind the scenes.

**Providers covered:**
1. OpenAI
2. Together AI
3. Anthropic (Claude)
4. Google Gemini
5. Hugging Face
6. vLLM
7. OpenRouter
8. Groq
9. HF Transformers (local)
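
To make the terms caching and deduplication concrete, here is a minimal standard-library sketch. It is not CacheSaver's implementation: `CachedDeduped` and `fake_llm` are hypothetical names, and the cache is a plain in-memory dict keyed by the prompt string.

```python
import asyncio


class CachedDeduped:
    """Wrap an async function with a result cache and in-flight deduplication."""

    def __init__(self, fn):
        self._fn = fn
        self._cache = {}      # key -> finished result
        self._inflight = {}   # key -> Task for a request that is still running
        self.backend_calls = 0

    async def __call__(self, key):
        # Caching: a finished result is returned without touching the backend.
        if key in self._cache:
            return self._cache[key]
        # Deduplication: concurrent identical requests share one backend call.
        if key not in self._inflight:
            self.backend_calls += 1
            self._inflight[key] = asyncio.create_task(self._fn(key))
        task = self._inflight[key]
        try:
            result = await task
        finally:
            self._inflight.pop(key, None)
        self._cache[key] = result
        return result


async def fake_llm(prompt):
    # Stand-in for a real provider call.
    await asyncio.sleep(0.01)
    return f"answer to {prompt!r}"


async def demo():
    llm = CachedDeduped(fake_llm)
    # Three identical concurrent requests trigger a single backend call.
    first = await asyncio.gather(llm("city?"), llm("city?"), llm("city?"))
    # A later identical request is served from the cache.
    again = await llm("city?")
    return first, again, llm.backend_calls
```

In a notebook cell, `await demo()` runs the sketch; in a plain script, use `asyncio.run(demo())`.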

In [None]:
import logging

# Only show CRITICAL asyncio log records to keep the notebook output clean.
logging.getLogger("asyncio").setLevel(logging.CRITICAL)

---
## 1. OpenAI

The OpenAI wrapper is a drop-in replacement for `openai.AsyncOpenAI`.
It mirrors the `client.chat.completions.create(...)` interface exactly,
including support for the `n` parameter to request multiple completions.

Set the `OPENAI_API_KEY` environment variable before running.
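
Every provider section asks for an environment variable to be set first. A small check like the one below can fail early with a clear message instead of an authentication error later on. `require_env` is a hypothetical helper for this notebook, not part of CacheSaver or any provider SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook."
        )
    return value
```

For example, calling `require_env("OPENAI_API_KEY")` at the top of this section raises immediately if the key is missing.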

### Standard OpenAI SDK

In [None]:
from openai import AsyncOpenAI

client = AsyncOpenAI()  # uses OPENAI_API_KEY env var

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    n=3
)
print("OpenAI SDK:", [c.message.content for c in response.choices])

### CacheSaver OpenAI Wrapper

In [None]:
from cachesaver.models.openai import AsyncOpenAI

client = AsyncOpenAI(namespace="openai_demo", cachedir="./cache")

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    n=3
)
print("CacheSaver OpenAI:", [c.message.content for c in response.choices])

In [None]:
# Run the same call again — results come from cache (no API call)
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    n=3
)
print("Cached:", [c.message.content for c in response.choices])
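
One way to check that the repeated call was served from the cache is to time it. The sketch below is a generic timing helper built on the standard library. `timed` and `fake_call` are hypothetical names, and `fake_call` only simulates latency.

```python
import asyncio
import time


async def timed(make_call):
    """Await make_call() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await make_call()
    return result, time.perf_counter() - start


async def fake_call():
    # Stand-in for a real API request; the sleep simulates network latency.
    await asyncio.sleep(0.05)
    return "Lisbon"
```

In the notebook, pass a lambda that performs the `client.chat.completions.create` call from the cell above and compare the elapsed time of the first and second run. A cache hit should take far less time than a network round trip.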

---
## 2. Together AI

Together AI uses an OpenAI-compatible API, so the interface is identical to the OpenAI wrapper's.

Set the `TOGETHER_API_KEY` environment variable before running.

### Standard Together SDK

In [None]:
from together import AsyncTogether

client = AsyncTogether()  # uses TOGETHER_API_KEY env var

response = await client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Together SDK:", response.choices[0].message.content)

### CacheSaver Together Wrapper

In [None]:
from cachesaver.models.together import AsyncTogether

client = AsyncTogether(namespace="together_demo", cachedir="./cache")

response = await client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("CacheSaver Together:", response.choices[0].message.content)

In [None]:
# Cached call
response = await client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Cached:", response.choices[0].message.content)
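
The cached cells only hit the cache because the request is identical to the first one. This notebook does not show how CacheSaver decides that two requests match. One common approach, assumed here purely for illustration, is to hash the namespace, model, messages, and remaining parameters into a single key. `cache_key` is a hypothetical function, not CacheSaver's API.

```python
import hashlib
import json


def cache_key(namespace, model, messages, **params):
    """Derive a deterministic key from everything that identifies a request."""
    payload = {
        "namespace": namespace,
        "model": model,
        "messages": messages,
        "params": params,
    }
    # sort_keys makes the key independent of keyword-argument order.
    encoded = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

Under this scheme, changing the prompt, the model, any sampling parameter, or the namespace produces a different key and therefore a cache miss.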

---
## 3. Anthropic (Claude)

The Anthropic wrapper mirrors the `client.messages.create(...)` interface.
The Anthropic Messages API has no `n` parameter, so for `n > 1` CacheSaver
makes `n` separate API calls concurrently and returns the results as a list.

Set the `ANTHROPIC_API_KEY` environment variable before running.

### Standard Anthropic SDK

In [None]:
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # uses ANTHROPIC_API_KEY env var

message = await client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=50,
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Anthropic SDK:", message.content[0].text)

### CacheSaver Anthropic Wrapper

In [None]:
from cachesaver.models.anthropic import AsyncAnthropic

client = AsyncAnthropic(namespace="anthropic_demo", cachedir="./cache")

# n=1: returns the Message object directly
message = await client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=50,
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("CacheSaver Anthropic:", message.content[0].text)

In [None]:
# Cached call
message = await client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=50,
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Cached:", message.content[0].text)

In [None]:
# n>1: CacheSaver makes 3 concurrent API calls and returns a list
messages = await client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=50,
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    n=3
)
print("CacheSaver Anthropic (n=3):", [m.content[0].text for m in messages])
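
The fan-out described in this section can be sketched with `asyncio.gather`. This is an illustration of the idea, not CacheSaver's code: `sample_n` is a hypothetical helper, and `fake_call` inside `demo` stands in for a single `client.messages.create` request.

```python
import asyncio


async def sample_n(call_once, n):
    """Run call_once() n times concurrently.

    Returns the single result for n == 1 and a list of results otherwise.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return await call_once()
    return list(await asyncio.gather(*(call_once() for _ in range(n))))


async def demo():
    active = 0
    peak = 0

    async def fake_call():
        # Track how many calls are in flight at the same time.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return "Oslo"

    results = await sample_n(fake_call, 3)
    return results, peak
```

`demo()` returns the three results together with the peak number of simultaneous calls, which shows that the requests overlap instead of running one after another.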

---
## 4. Google Gemini

The Gemini wrapper mirrors the `client.models.generate_content(...)` interface
from the `google-genai` SDK. As with Claude, `n > 1` is handled by making
concurrent calls.

Set the `GOOGLE_API_KEY` environment variable before running.

### Standard Google GenAI SDK

In [None]:
from google import genai

client = genai.Client()  # uses GOOGLE_API_KEY env var

response = await client.aio.models.generate_content(
    model="gemini-2.0-flash",
    contents="Name a random city (only the name).",
)
print("Gemini SDK:", response.text)

### CacheSaver Gemini Wrapper

In [None]:
from cachesaver.models.gemini import AsyncGemini

client = AsyncGemini(namespace="gemini_demo", cachedir="./cache")

# n=1: returns the response object directly
response = await client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Name a random pet (only the name).",
)
print("CacheSaver Gemini:", response.text)

In [None]:
# Cached call
client = AsyncGemini(namespace="gemini_demo", cachedir="./cache")

response = await client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Name a random pet (only the name).",
)
print("Cached:", response.text)

In [None]:
# n>1: CacheSaver makes 3 concurrent API calls; the results are read from `responses.candidates`
responses = await client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Name a random city (only the name).",
    n=3
)
print("CacheSaver Gemini (n=3):", [c.content.parts[0].text for c in responses.candidates])

---
## 5. Hugging Face

The Hugging Face wrapper uses the `huggingface_hub` Inference Client.
It mirrors the OpenAI-compatible `client.chat.completions.create(...)` interface,
and supports the `n` parameter.

Set the `HF_TOKEN` environment variable before running.

### Standard Hugging Face SDK

In [3]:
from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient()  # uses HF_TOKEN env var

response = await client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    max_tokens=50,
    n=2
)
print("HuggingFace SDK:", response.choices[0].message.content)

HuggingFace SDK: Perth.


### CacheSaver Hugging Face Wrapper

In [None]:
from cachesaver.models.huggingface import AsyncHuggingFace

client = AsyncHuggingFace(namespace="hf_demo", cachedir="./cache")

response = await client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    max_tokens=50,
)
print("CacheSaver HuggingFace:", response.choices[0].message.content)

In [None]:
# Cached call
response = await client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    max_tokens=50,
)
print("Cached:", response.choices[0].message.content)

---
## 6. vLLM

vLLM exposes an OpenAI-compatible API server. The CacheSaver wrapper uses the
OpenAI client under the hood, pointed at your vLLM server URL.
No API key is required: the key defaults to `"EMPTY"`.

Start a vLLM server first:
```bash
vllm serve meta-llama/Llama-3-8B-Instruct
```
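
Before running the cells below, it can help to confirm that the server is reachable and to see which model name it expects. The sketch queries the OpenAI-compatible `/v1/models` endpoint using only the standard library. `parse_model_ids` and `list_served_models` are hypothetical helper names, and the default URL assumes the server runs locally on port 8000 as in the examples that follow.

```python
import json
import urllib.request


def parse_model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url="http://localhost:8000/v1", timeout=5.0):
    """Ask the server which models it serves.

    Raises urllib.error.URLError if the server is unreachable.
    """
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

The IDs returned by `list_served_models()` are the values to pass as `model=` in the requests.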

### Standard vLLM Usage (via OpenAI SDK)

In [None]:
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = await client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("vLLM SDK:", response.choices[0].message.content)

### CacheSaver vLLM Wrapper

In [None]:
from cachesaver.models.vllm import AsyncVLLM

client = AsyncVLLM(
    namespace="vllm_demo",
    cachedir="./cache",
    base_url="http://localhost:8000/v1",
)

response = await client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("CacheSaver vLLM:", response.choices[0].message.content)

In [None]:
# Cached call
response = await client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Cached:", response.choices[0].message.content)

---
## 7. OpenRouter

OpenRouter provides a unified API to 300+ models. It uses the OpenAI-compatible
format, so the interface is identical to OpenAI's. The wrapper's `base_url`
defaults to `https://openrouter.ai/api/v1`.

Set the `OPENROUTER_API_KEY` environment variable before running.

### Standard OpenRouter Usage (via OpenAI SDK)

In [None]:
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)

response = await client.chat.completions.create(
    model="meta-llama/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("OpenRouter SDK:", response.choices[0].message.content)

### CacheSaver OpenRouter Wrapper

In [None]:
from cachesaver.models.openrouter import AsyncOpenRouter

client = AsyncOpenRouter(
    namespace="openrouter_demo",
    cachedir="./cache",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)

response = await client.chat.completions.create(
    model="meta-llama/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("CacheSaver OpenRouter:", response.choices[0].message.content)

In [None]:
client = AsyncOpenRouter(
    namespace="openrouter_demo",
    cachedir="./cache",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)

# Cached call
response = await client.chat.completions.create(
    model="meta-llama/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Cached:", response.choices[0].message.content)

---
## 8. Groq

Groq provides low-latency inference for open models. The wrapper mirrors the
`client.chat.completions.create(...)` interface from the `groq` SDK.

Set the `GROQ_API_KEY` environment variable before running.

### Standard Groq SDK

In [None]:
from groq import AsyncGroq

client = AsyncGroq()  # uses GROQ_API_KEY env var

response = await client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Groq SDK:", response.choices[0].message.content)

### CacheSaver Groq Wrapper

In [None]:
from cachesaver.models.groq import AsyncGroq

client = AsyncGroq(namespace="groq_demo", cachedir="./cache")

response = await client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("CacheSaver Groq:", response.choices[0].message.content)

In [None]:
client = AsyncGroq(namespace="groq_demo", cachedir="./cache")

# Cached call
response = await client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Cached:", response.choices[0].message.content)

---
## 9. HF Transformers (Local Inference)

The HF Transformers wrapper runs models locally on your own hardware using
the `transformers` library. Unlike the cloud providers above, this uses the
**LocalAPI** pipeline (Cache → Batcher → Model) optimized for GPU batching.

No API key needed — just install the optional dependencies:
```bash
pip install "cachesaver[transformers]"
```
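
The Cache, Batcher, Model pipeline named above can be pictured with a short synchronous sketch. It is a simplified illustration under stated assumptions, not the LocalAPI implementation: `generate_with_cache` and `fake_model` are hypothetical names, the cache is a plain dict, and identical prompts are treated as interchangeable.

```python
def generate_with_cache(prompts, model_fn, cache, batch_size=4):
    """Answer prompts, sending only cache misses to model_fn in batches."""
    # Cache stage: keep the unique prompts that still need the model.
    misses = list(dict.fromkeys(p for p in prompts if p not in cache))
    # Batcher stage: slice the misses into chunks of at most batch_size.
    for start in range(0, len(misses), batch_size):
        batch = misses[start:start + batch_size]
        # Model stage: one call per batch, with outputs aligned to inputs.
        for prompt, output in zip(batch, model_fn(batch)):
            cache[prompt] = output
    return [cache[p] for p in prompts]


batch_sizes = []


def fake_model(batch):
    # Stand-in for a batched generate call; records each batch size.
    batch_sizes.append(len(batch))
    return [p.upper() for p in batch]


cache = {}
prompts = ["a", "b", "c", "d", "e", "a"]
outputs = generate_with_cache(prompts, fake_model, cache, batch_size=4)
```

With five unique prompts and `batch_size=4`, the stand-in model is called twice, once with four prompts and once with one. Repeating the call with the same `cache` does not reach the model at all.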

### Async HF Transformers

In [None]:
from cachesaver.models.transformers import AsyncHFTransformers

client = AsyncHFTransformers(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    namespace="hf_transformers_demo",
    cachedir="./cache",
    batch_size=4,
)

response = await client.chat.completions.create(
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    max_new_tokens=50,
)
print("CacheSaver HF Transformers:", response)

In [None]:
# Cached call — no GPU inference needed
response = await client.chat.completions.create(
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    max_new_tokens=50,
)
print("Cached:", response)

### Sync HF Transformers

In [None]:
from cachesaver.models.transformers import HFTransformers

client = HFTransformers(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    namespace="hf_transformers_sync_demo",
    cachedir="./cache",
    batch_size=4,
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    max_new_tokens=50,
)
print("Sync HF Transformers:", response)

---
## Sync Usage

Each provider also offers a synchronous wrapper for use outside async contexts.
These wrappers use `nest_asyncio` under the hood, so they can also be called
where an event loop is already running, such as in a notebook.

In [None]:
import os

from cachesaver.models.openai import OpenAI
from cachesaver.models.anthropic import Anthropic
from cachesaver.models.gemini import Gemini
from cachesaver.models.huggingface import HuggingFace
from cachesaver.models.vllm import VLLM
from cachesaver.models.openrouter import OpenRouter
from cachesaver.models.groq import Groq
from cachesaver.models.transformers import HFTransformers

# Sync OpenAI
client = OpenAI(namespace="sync_demo", cachedir="./cache")
response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Sync OpenAI:", response.choices[0].message.content)

# Sync Anthropic
client = Anthropic(namespace="sync_demo", cachedir="./cache")
message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=50,
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Sync Anthropic:", message.content[0].text)

# Sync Gemini
client = Gemini(namespace="sync_demo", cachedir="./cache")
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Name a random city (only the name).",
)
print("Sync Gemini:", response.text)

# Sync HuggingFace
client = HuggingFace(namespace="sync_demo", cachedir="./cache")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    max_tokens=50,
)
print("Sync HuggingFace:", response.choices[0].message.content)

# Sync vLLM
client = VLLM(namespace="sync_demo", cachedir="./cache", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Sync vLLM:", response.choices[0].message.content)

# Sync OpenRouter
client = OpenRouter(namespace="sync_demo", cachedir="./cache", api_key=os.environ.get("OPENROUTER_API_KEY"))
response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Sync OpenRouter:", response.choices[0].message.content)

# Sync Groq
client = Groq(namespace="sync_demo", cachedir="./cache")
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
)
print("Sync Groq:", response.choices[0].message.content)

# Sync HF Transformers
client = HFTransformers(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    namespace="sync_demo",
    cachedir="./cache",
    batch_size=4,
)
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Name a random city (only the name)."}],
    max_new_tokens=50,
)
print("Sync HF Transformers:", response)
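
The sync wrappers above have to drive async code from a blocking call. The sketch below shows the general shape of such a facade using only the standard library. It deliberately uses a different technique from the one this notebook describes: where an event loop is already running, it runs the coroutine in a worker thread instead of patching the loop with `nest_asyncio`. `run_sync` is a hypothetical helper, not part of CacheSaver.

```python
import asyncio
import threading


def run_sync(coro):
    """Run a coroutine to completion from synchronous code and return its result."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running (for example in a notebook), so run the
    # coroutine on a fresh loop in a worker thread and wait for it.
    outcome = {}

    def worker():
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

The worker-thread variant blocks the calling thread until the result is ready, which is the behaviour a synchronous client is expected to have.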