# 1. How to cache LLM responses

The LangChain library provides a caching mechanism for responses from LLMs (Large Language Models) such as OpenAI's models. Caching stores the responses to previous queries so that repeated queries don’t trigger redundant API calls, which lowers cost and speeds up your application. The example in this section walks through it step by step.

Why Use Caching?
* Cost Reduction: If you send the same prompt repeatedly, caching avoids the repeated API calls and so reduces cost.

* Speed Improvement: Cached responses are returned almost instantly, which speeds up your application since no new API call is made for a cached query.

In [5]:
import os
from getpass import getpass

# Prompt for the OpenAI API key if it is not already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass()

In [6]:
from langchain_core.globals import set_llm_cache
from langchain_openai import OpenAI

# To make the caching really obvious, let's use a slower and older model.
# Caching supports newer chat models as well.
llm = OpenAI(model="gpt-3.5-turbo-instruct", n=2, best_of=2)

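OpenAI is initialized with the model gpt-3.5-turbo-instruct.

Parameters:
* n=2: The number of completions returned for each prompt.
* best_of=2: The number of candidate completions generated server-side; the API returns the n best of them. With n=2 and best_of=2, both candidates are returned, and invoke returns the text of the first one.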

In [7]:
%%time
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")

CPU times: total: 0 ns
Wall time: 2.8 s


"\n\nWhy couldn't the bicycle stand up by itself?\nBecause it was two-tired."

* InMemoryCache is a simple in-memory caching mechanism (all data is stored in memory).
* set_llm_cache(InMemoryCache()) configures the cache to be used by the LLM during queries.

The invoke method sends a request to the LLM.
* Since this is the first query, it takes longer because the response is not yet in the cache.
* The CPU and wall times are reported by %%time. The wall time shows how long it took to get the response: about 2.8 seconds in this run.

In [8]:
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")

CPU times: total: 0 ns
Wall time: 1 ms


"\n\nWhy couldn't the bicycle stand up by itself?\nBecause it was two-tired."

The exact same request is made again.
* This time, because the response was cached, the retrieval is almost instantaneous (about 1 millisecond in this run).
* The cached response is reused, so no new API call is made.

# Key Components
* set_llm_cache(InMemoryCache()): This is the command that enables caching. It uses the InMemoryCache class, which stores the data in memory.
* llm.invoke("Tell me a joke"): This method sends a prompt to the LLM. The response is cached the first time it's called, and subsequent calls use the cached response if the prompt and the model configuration are identical.
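
To make the lookup behavior concrete, here is a minimal sketch of an in-memory cache keyed on the prompt together with a string describing the model settings. It is not LangChain's implementation: `TinyPromptCache`, `fake_model`, and the settings strings are hypothetical names used only for illustration, and `fake_model` stands in for a real API call.

```python
from typing import Callable, Dict, List, Tuple


class TinyPromptCache:
    """Illustrative in-memory cache keyed on (prompt, model settings)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def get_or_call(
        self, prompt: str, settings: str, call_model: Callable[[str], str]
    ) -> str:
        key = (prompt, settings)
        if key not in self._store:
            # Cache miss: make one model call and remember the answer.
            self._store[key] = call_model(prompt)
        # Cache hit (or freshly stored value): no further model call needed.
        return self._store[key]


calls: List[str] = []


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real API call; records each invocation.
    calls.append(prompt)
    return f"response to: {prompt}"


cache = TinyPromptCache()
first = cache.get_or_call("Tell me a joke", "model=demo,n=2", fake_model)
second = cache.get_or_call("Tell me a joke", "model=demo,n=2", fake_model)
# Same prompt but different settings: treated as a separate entry.
third = cache.get_or_call("Tell me a joke", "model=demo,n=1", fake_model)

print(f"model calls made: {len(calls)}")
```

The second call is served from the dictionary, so only the first and third calls reach `fake_model`. Keying on the settings as well as the prompt prevents a response generated with one configuration from being returned for another.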

# 2. How to stream responses from an LLM

To stream responses from an LLM in LangChain, you can use several methods depending on your scenario:

* 1 Synchronous Token-by-Token Streaming: 

The stream method can be used for synchronous streaming, which returns each part of the response as it’s generated. Below is an example:

In [9]:
from langchain_openai import OpenAI

# Initialize the LLM
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)

# Use the stream method for synchronous streaming
for chunk in llm.stream("Write me a 1 verse song about sparkling water."):
    print(chunk, end="|", flush=True)




|Spark|ling| water|,| oh| so| clear
|Bubbles dancing|,| without| fear|
|Refreshing| taste,| a| delight|
|Spark|ling| water|,| my| thirst|'s| delight||

Here, each chunk is a part of the response, making it easy to see the streaming behavior in real time.
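
In practice you often want to display chunks as they arrive and also keep the full text. The sketch below shows that pattern with a plain Python generator; `fake_stream` is a hypothetical stand-in for `llm.stream(...)` so the example runs without an API key, and `consume` is an illustrative helper, not part of LangChain.

```python
from typing import Iterator, List


def fake_stream(text: str) -> Iterator[str]:
    # Hypothetical stand-in for llm.stream(...): yields the text word by word,
    # keeping the leading space on every word after the first.
    for i, word in enumerate(text.split(" ")):
        yield word if i == 0 else " " + word


def consume(chunks: Iterator[str]) -> str:
    """Print each chunk as it arrives and return the reassembled text."""
    parts: List[str] = []
    for chunk in chunks:
        print(chunk, end="|", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = consume(fake_stream("Sparkling water, oh so clear"))
```

With a real model, the same `consume` loop should work as long as each chunk is a string, which is the case for the `OpenAI` LLM class used in this section.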



* 2 Asynchronous Token-by-Token Streaming:

Use astream for asynchronous streaming when your code runs inside an event loop, such as an async web application or a notebook:

In [10]:
from langchain_openai import OpenAI

# Initialize the LLM
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)

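# Use the astream method for async streaming.
# Top-level `async for` works in a notebook; in a plain script,
# put this loop inside an `async def` and run it with asyncio.run().
async for chunk in llm.astream("Write me a 1 verse song about sparkling water."):
    print(chunk, end="|", flush=True)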




|Spark|ling| water|,| oh| so| clear
|Bubbles dancing|,| without| fear|
|Refreshing| taste,| a| pure| delight|
|Spark|ling| water|,| my| thirst|'s| delight||
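
The cell above relies on the notebook's support for top-level `async for`. In a regular Python script the loop has to live inside a coroutine that is started with `asyncio.run`. The sketch below shows that structure; `fake_astream` is a hypothetical async generator standing in for `llm.astream(...)` so the example runs without an API key.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_astream(text: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for llm.astream(...): yields one word at a time.
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word


async def main() -> List[str]:
    chunks: List[str] = []
    async for chunk in fake_astream("Bubbles dancing without fear"):
        print(chunk, end="|", flush=True)
        chunks.append(chunk)
    print()
    return chunks


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
chunks = asyncio.run(main())
```

To use a real model, replace the `fake_astream(...)` call inside `main` with `llm.astream(...)`.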

* 3 Asynchronous Event Streaming: 

astream_events provides more granular control when dealing with complex workflows that involve multiple steps (e.g., agents or chains). It emits structured events describing the generation process, such as the start of an LLM run, each streamed chunk, and the end of the run.

In [11]:
from langchain_openai import OpenAI

# Initialize the LLM
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)

idx = 0

# Stream events asynchronously
async for event in llm.astream_events(
    "Write me a 1 verse song about goldfish on the moon", version="v1"
):
    idx += 1
    if idx >= 5:  # Limit the number of events displayed
        print("...Truncated")
        break
    print(event)


{'event': 'on_llm_start', 'run_id': '3829f406-5169-4542-8baa-72f18f0c47c1', 'name': 'OpenAI', 'tags': [], 'metadata': {}, 'data': {'input': 'Write me a 1 verse song about goldfish on the moon'}, 'parent_ids': []}
{'event': 'on_llm_stream', 'run_id': '3829f406-5169-4542-8baa-72f18f0c47c1', 'tags': [], 'metadata': {}, 'name': 'OpenAI', 'data': {'chunk': '\n\n'}, 'parent_ids': []}
{'event': 'on_llm_stream', 'run_id': '3829f406-5169-4542-8baa-72f18f0c47c1', 'tags': [], 'metadata': {}, 'name': 'OpenAI', 'data': {'chunk': 'Sw'}, 'parent_ids': []}
{'event': 'on_llm_stream', 'run_id': '3829f406-5169-4542-8baa-72f18f0c47c1', 'tags': [], 'metadata': {}, 'name': 'OpenAI', 'data': {'chunk': 'imming'}, 'parent_ids': []}
...Truncated
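
Because each event is a dictionary with an `event` type and a `data` payload, you can filter on the type to pick out only what you need. The sketch below rebuilds the generated text from `on_llm_stream` events. The sample events are trimmed copies of the output above (fields such as `run_id` are omitted), and `collect_stream_text` is a hypothetical helper, not a LangChain function.

```python
from typing import Any, Dict, Iterable, List

# Trimmed copies of the events printed above; only the fields used here are kept.
sample_events: List[Dict[str, Any]] = [
    {
        "event": "on_llm_start",
        "name": "OpenAI",
        "data": {"input": "Write me a 1 verse song about goldfish on the moon"},
    },
    {"event": "on_llm_stream", "name": "OpenAI", "data": {"chunk": "\n\n"}},
    {"event": "on_llm_stream", "name": "OpenAI", "data": {"chunk": "Sw"}},
    {"event": "on_llm_stream", "name": "OpenAI", "data": {"chunk": "imming"}},
]


def collect_stream_text(events: Iterable[Dict[str, Any]]) -> str:
    """Concatenate the chunks carried by on_llm_stream events, ignoring others."""
    parts: List[str] = []
    for event in events:
        if event["event"] == "on_llm_stream":
            parts.append(event["data"]["chunk"])
    return "".join(parts)


text = collect_stream_text(sample_events)
print(repr(text))
```

The same filter can be applied inside the `async for event in llm.astream_events(...)` loop, assuming the events keep the shape shown in the output above.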


Use Cases:

* Sync Streaming: Ideal for simple applications where blocking behavior is acceptable.
* Async Streaming: Great for responsive apps, UIs, or web services that need non-blocking behavior.
* Async Event Streaming: Useful in complex workflows that require tracking multiple steps, maintaining states, or debugging with detailed event logs.
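
The non-blocking behavior mentioned for async streaming can be illustrated by consuming two streams at once with `asyncio.gather`. This is a sketch under stated assumptions: `fake_astream` is a hypothetical stand-in for `llm.astream(...)`, and `consume` is an illustrative helper that logs which stream each chunk came from.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_astream(words: List[str]) -> AsyncIterator[str]:
    # Hypothetical stand-in for llm.astream(...): yields one word at a time.
    for word in words:
        await asyncio.sleep(0)  # lets other tasks run between chunks
        yield word


async def consume(name: str, words: List[str], log: List[Tuple[str, str]]) -> int:
    """Read one stream to the end, recording (stream name, chunk) pairs."""
    count = 0
    async for chunk in fake_astream(words):
        log.append((name, chunk))
        count += 1
    return count


async def main() -> Tuple[List[int], List[Tuple[str, str]]]:
    log: List[Tuple[str, str]] = []
    # Both consumers run concurrently on the same event loop.
    counts = await asyncio.gather(
        consume("A", ["goldfish", "on", "the", "moon"], log),
        consume("B", ["sparkling", "water"], log),
    )
    return list(counts), log


counts, log = asyncio.run(main())
print(counts)
print(log)
```

While one stream is waiting for its next chunk, the event loop is free to advance the other, so entries from both streams can appear mixed together in `log`. A synchronous loop would instead finish the first stream before starting the second.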