# CacheSaver Examples

*🧠 Exploring CacheSaver: Smarter Caching for LLM Workflows*

In this notebook, we explore **CacheSaver**. CacheSaver lets you **reuse previous results**, **reduce redundant computation**, and **ensure reproducibility** across repeated runs with minimal setup.

We will show different examples and use cases demonstrating how CacheSaver speeds up repeated or overlapping reasoning steps, makes results deterministic within namespaces, and works seamlessly with any existing LLM client.


In [None]:
import logging
logging.getLogger("asyncio").setLevel(logging.CRITICAL)

## ⚙️ The first universal client-side optimization framework

In this section, we will demonstrate how **CacheSaver** works seamlessly across different LLM providers.
We will run the **same prompt** against three backends (**OpenAI**, **Together AI**, and **vLLM**). Each cell shows both the plain import and the CacheSaver import; the rest of the code is identical either way.

This experiment shows that CacheSaver's caching and deterministic behavior are **backend-agnostic**: it can wrap any compatible client with no changes to your core logic.
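
To make the idea of backend-agnostic caching concrete, here is a minimal sketch. It is not CacheSaver's actual implementation. It assumes a hypothetical `CachedCompletions` helper that wraps any async `create(**kwargs)` callable and keys responses by a hash of the request arguments; `fake_backend` is a stand-in for a real provider call.

```python
import asyncio
import hashlib
import json


class CachedCompletions:
    """Toy cache around any async `create(**kwargs)` callable (hypothetical helper)."""

    def __init__(self, create_fn):
        self._create = create_fn   # the wrapped backend call
        self._store = {}           # request key -> cached response
        self.backend_calls = 0     # number of requests that reached the backend

    @staticmethod
    def _key(kwargs):
        # Canonical JSON so that keyword order does not change the key.
        payload = json.dumps(kwargs, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    async def create(self, **kwargs):
        key = self._key(kwargs)
        if key not in self._store:
            # Cache miss: forward the request and remember the response.
            self.backend_calls += 1
            self._store[key] = await self._create(**kwargs)
        return self._store[key]


async def fake_backend(**kwargs):
    # Stand-in for a real provider; echoes the last user message.
    return f"echo: {kwargs['messages'][-1]['content']}"


async def demo():
    cached = CachedCompletions(fake_backend)
    request = {"model": "demo-model", "messages": [{"role": "user", "content": "hi"}]}
    first = await cached.create(**request)
    second = await cached.create(**request)  # served from the cache
    return first, second, cached.backend_calls


first, second, backend_calls = asyncio.run(demo())
print(first, second, backend_calls)
```

Because the wrapper depends only on the call signature, the same pattern could sit in front of any client method that accepts keyword arguments. Inside a notebook, use `await demo()` instead of `asyncio.run(demo())`, since an event loop is already running.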

In [None]:
prompt = "Write one short proverb about patience (only the proverb, no explanation)."

### OpenAI

In [None]:
# Swap a single import to enable CacheSaver; the rest of the cell is unchanged.
# from openai import AsyncOpenAI                 # Without CacheSaver
from cachesaver.models.openai import AsyncOpenAI # With CacheSaver

client = AsyncOpenAI()
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)

### Together AI

In [None]:
# Swap a single import to enable CacheSaver; the rest of the cell is unchanged.
# from together import AsyncTogether                 # Without CacheSaver
from cachesaver.models.together import AsyncTogether # With CacheSaver

client = AsyncTogether()
response = await client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)

### vLLM

To use CacheSaver with vLLM **or any other service that implements the OpenAI API protocol**, pass the same `base_url` argument that you would pass to the regular OpenAI client. CacheSaver is fully compatible with any OpenAI-style API endpoint, so no extra configuration is required.


In [None]:
# Swap a single import to enable CacheSaver; the rest of the cell is unchanged.
# from openai import AsyncOpenAI                 # Without CacheSaver
from cachesaver.models.openai import AsyncOpenAI # With CacheSaver


openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = AsyncOpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,  # the OpenAI client expects `base_url`, not `api_base`
)

response = await client.chat.completions.create(
    model="openai-community/gpt2",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)

## 🌱 Bringing the power of seeding to LLM inference

In this section, we compare **vanilla OpenAI inference** with **CacheSaver-wrapped inference** to illustrate how caching and deterministic namespaces work in practice.


### Example without CacheSaver
Without CacheSaver, every request to the API returns freshly sampled completions, even if the prompt and model are identical. As you vary `n` (the number of completions), you will see a different set of cities each time.

In [None]:
from openai import AsyncOpenAI

prompt = "Give me the name of a random city from all over the world (only the name, no other text)."
for n in [3, 2, 5]:
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "user", "content": prompt}
        ],
        n=n
    )
    print(f"Asking for {n} output samples: ", [choice.message.content for choice in response.choices])

Asking for 3 output samples:  ['Kyoto', 'Ljubljana', 'Lagos']
Asking for 2 output samples:  ['Lagos', 'Cairo']
Asking for 5 output samples:  ['Lagos', 'Lagos', 'Lagos', 'Lagos', 'Kyoto']


### Example with CacheSaver
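With CacheSaver, however, the same prompt and namespace replay the same sequence of samples, regardless of the sampling count or run order. In the output below, asking for 2 samples returns the first two of the 3 already generated, and asking for 5 extends that list with two new ones. CacheSaver acts as a **reproducible seeding layer** that remembers previous outputs, so repeated inference becomes deterministic and cache-efficient.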


In [None]:
from cachesaver.models.openai import AsyncOpenAI

prompt = "Give me the name of a random city from all over the world (only the name, no other text)."
for n in [3, 2, 5]:
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "user", "content": prompt}
        ],
        n=n
    )
    print(f"Asking for {n} output samples: ", [choice.message.content for choice in response.choices])


Asking for 3 output samples:  ['Bogotá', 'Lagos', 'Kyoto']
Asking for 2 output samples:  ['Bogotá', 'Lagos']
Asking for 5 output samples:  ['Bogotá', 'Lagos', 'Kyoto', 'Kyoto', 'Lagos']
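
One way to picture the prefix behavior in the output above is a per-namespace list of samples that only grows. The sketch below is an illustration under that assumption, not CacheSaver's real mechanism: `SampleCache`, its `get` method, and the `"demo"` namespace are hypothetical names, and a random choice from a small list stands in for the model.

```python
import random


class SampleCache:
    """Toy per-namespace sample store (hypothetical; not CacheSaver's real code)."""

    def __init__(self, sampler):
        self._sampler = sampler  # callable(prompt) -> one fresh sample
        self._samples = {}       # (namespace, prompt) -> samples seen so far

    def get(self, namespace, prompt, n):
        seen = self._samples.setdefault((namespace, prompt), [])
        # Generate only the missing samples; earlier ones are replayed as-is.
        while len(seen) < n:
            seen.append(self._sampler(prompt))
        return seen[:n]


rng = random.Random()  # deliberately unseeded: the backend itself stays random
cities = ["Kyoto", "Lagos", "Cairo", "Lima", "Oslo"]
cache = SampleCache(lambda prompt: rng.choice(cities))

three = cache.get("demo", "random city", 3)
two = cache.get("demo", "random city", 2)   # first two of `three`
five = cache.get("demo", "random city", 5)  # `three` plus two new samples
print(three, two, five)
```

The individual cities differ from run to run because the stand-in sampler is unseeded, but within one cache the smaller request is always a prefix of the larger one.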
