# LLM Ops Visibility and Caching Strategies

# 🏗️ Build

You will build an application that leverages a visibility tool (Weights and Biases Prompts) and prompt caching.

# 🚢 Ship

You will ship that application to a Hugging Face space.

# 🚀 Share

Create a social media post explaining or showcasing the power of prompt caching and visibility tooling in your LLM Ops stack.

## Visibility Tools

A key part of LLM Ops is having a visibility platform where you can track, trace, and collect prompt and user data.

Let's take a look at it in this notebook!

As always, we'll want to start with our dependencies. 

In [1]:
!pip install -q -U "wandb>=0.15.4" "langchain>=0.0.218"

Getting started with Weights and Biases Prompts can be as easy as setting the `LANGCHAIN_WANDB_TRACING` environment variable to `true`!

In [2]:
import os

os.environ["LANGCHAIN_WANDB_TRACING"] = "true"
os.environ["WANDB_PROJECT"] = "langchain-testing"
os.environ["WANDB_NOTEBOOK_NAME"] = "./wandb_notebook.ipynb"

In [3]:
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [4]:
os.environ["WANDB_API_KEY"] = getpass.getpass("Enter your WandB API key: ")

Now we can set up our simple application!

We're going to create an agent with the following characteristics:

1. `ChatOpenAI`: powered by `gpt-3.5-turbo` (the cell below uses the `gpt-3.5-turbo-16k` variant), with `temperature` set to `0` to reduce creativity
2. `arxiv` tool
3. `ZERO_SHOT_REACT_DESCRIPTION` agent

Please refer to the following documentation if you get stuck:

- [ChatOpenAI](https://api.python.langchain.com/en/latest/chat_models/langchain.chat_models.openai.ChatOpenAI.html)
- [load_tools](https://api.python.langchain.com/en/latest/agents/langchain.agents.load_tools.load_tools.html)
- [initialize_agent](https://api.python.langchain.com/en/latest/agents/langchain.agents.initialize.initialize_agent.html)
- [AgentType](https://api.python.langchain.com/en/latest/agents/langchain.agents.agent_types.AgentType.html)

In [5]:
!pip install -q -U arxiv

In [7]:
from langchain.chat_models import ChatOpenAI
from langchain.agents import load_tools, initialize_agent, AgentType

llm = ChatOpenAI(
    model='gpt-3.5-turbo-16k',
    temperature=0
)

tools = load_tools(
    ["arxiv"],
)

agent_chain = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    handle_parsing_errors=True,
)

In [8]:
agent_chain.run("What is QLoRA?")

wandb: Streaming LangChain activity to W&B at https://wandb.ai/cmagganas/langchain-testing/runs/f2q3m7vs
wandb: `WandbTracer` is currently in beta.
wandb: Please report any issues to https://github.com/wandb/wandb/issues with the tag `langchain`.

> Entering new AgentExecutor chain...
I'm not familiar with QLoRA, so I should search for it on arxiv.org to gather more information.
Action: arxiv
Action Input: QLoRA
Observation: Published: 2023-05-23
Title: QLoRA: Efficient Finetuning of Quantized LLMs
Authors: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
Summary: We present QLoRA, an efficient finetuning approach that reduces memory usage
enough to finetune a 65B parameter model on a single 48GB GPU while preserving
full 16-bit finetuning task performance. QLoRA backpropagates gradients through
a frozen, 4-bit quantized pretrained language model into Low Rank
Adapters~(LoRA). Our best model family, which we name Guanaco, outperforms all
previous openly released models on the Vicuna benchmark, reaching 99.3% of the
performance level of ChatGPT while only requiring 24 hours of finetuning on a
single GPU. QLoRA introduces a number of innovations to save memory without
sacrifici

'QLoRA is an efficient finetuning approach that reduces memory usage in pretrained language models and has achieved state-of-the-art results in chatbot performance.'
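The trace above follows the ReAct text format: a thought, then an `Action:` line naming the tool and an `Action Input:` line with its argument. As a rough illustration of how such text can be turned into a tool call, here is a minimal sketch. The `parse_react_step` helper and its regular expression are hypothetical and written for this example; they are not LangChain's own output parser.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for illustration only; LangChain's real parser differs.
ACTION_PATTERN = re.compile(
    r"Action:\s*(?P<tool>.+?)\s*\nAction Input:\s*(?P<tool_input>.*)",
    re.DOTALL,
)


def parse_react_step(llm_text: str) -> Optional[Tuple[str, str]]:
    """Return (tool, tool_input) if the text requests a tool call, else None."""
    match = ACTION_PATTERN.search(llm_text)
    if match is None:
        return None
    tool = match.group("tool").strip()
    # Drop surrounding whitespace and optional double quotes around the input.
    tool_input = match.group("tool_input").strip().strip('"')
    return tool, tool_input


step = (
    "I'm not familiar with QLoRA, so I should search for it on arxiv.org "
    "to gather more information.\n"
    "Action: arxiv\n"
    "Action Input: QLoRA"
)
parsed = parse_react_step(step)
print(parsed)
```

Text with no `Action:` line (for example a final answer) makes the helper return `None`, which is where an agent loop would stop calling tools.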

Alright, now that we have some outputs - let's see what Weights and Biases was able to do!

### Exploring WandB Outputs

First things first, we'll want to head to our WandB home page and find our projects!

You'll navigate to `wandb.ai/{YOUR_USERNAME_HERE}` - and then click the `Projects` tab.

![image](https://i.imgur.com/mplxa4p.png)

Now we can head into our project, which should be named `langchain-testing`:

![image](https://i.imgur.com/Q4AU0NC.png)

Explore all the tools made available to you through the Prompt Workspace!

Let's try another prompt and see what happens!

In [9]:
agent_chain.run("What is LLM Ops?")

> Entering new AgentExecutor chain...
I'm not sure what LLM Ops is. I should search for it on arxiv.org to find more information.
Action: arxiv
Action Input: LLM Ops
Observation: Published: 2018-06-18
Title: Exciting LLM Geometries
Authors: Robert de Mello Koch, Jia-Hui Huang, Laila Tribelhorn
Summary: We study excitations of LLM geometries. These geometries arise from the
backreaction of a condensate of giant gravitons. Excitations of the condensed
branes are open strings, which give rise to an emergent Yang-Mills theory at
low energy. We study the dynamics of the planar limit of these emergent gauge
theories, accumulating evidence that they are planar ${\cal N}=4$ super
Yang-Mills. There are three observations supporting this conclusion: (i) we
argue for an isomorphism between the planar Hilbert space of the original
${\cal N}=4$ super Yang-Mills and the planar Hilbert space of the emergent
gauge theory, (ii) we argue that the OPE coefficients 

'The meaning of LLM Ops is ambiguous and needs to be clarified. It could refer to "Exciting LLM Geometries" in the field of physics or "Large Language Models" in the field of natural language processing.'

As we can see, repeated calls will continue to add more information to our `langchain-testing` project!

![image](https://i.imgur.com/Xze6jNE.png)
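Conceptually, every call adds one more trace record to the project. The sketch below shows that idea in plain Python, assuming a trace record only needs the prompt, the output, and the latency. `SimpleTracer`, `TraceRecord`, and `fake_llm` are hypothetical names made up for this example; this is not how `WandbTracer` is implemented.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TraceRecord:
    prompt: str
    output: str
    latency_s: float


@dataclass
class SimpleTracer:
    """Hypothetical in-process tracer that collects one record per call."""

    records: List[TraceRecord] = field(default_factory=list)

    def trace(self, fn: Callable[[str], str]) -> Callable[[str], str]:
        def wrapped(prompt: str) -> str:
            start = time.perf_counter()
            output = fn(prompt)
            elapsed = time.perf_counter() - start
            self.records.append(TraceRecord(prompt, output, elapsed))
            return output

        return wrapped


tracer = SimpleTracer()


@tracer.trace
def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the example runs offline.
    return f"echo: {prompt}"


fake_llm("What is QLoRA?")
fake_llm("What is LLM Ops?")
print(len(tracer.records))
```

A hosted visibility tool adds storage, a UI, and nested spans for agents and tools on top of this basic record-per-call idea.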

Let's make a slightly more complex application by adding a Prompt Cache!

## Prompt Caching

### Adding A Prompt Cache

The basic idea of prompt caching is to avoid calling the LLM for prompts we have already seen.

Similar to cached embeddings, the idea is simple:

- Keep track of all the input/output pairs
- If a user query is (in the case of semantic similarity caches) close enough to a previous prompt contained in the cache, return the output associated with that pair
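The two steps above can be sketched as a toy semantic-similarity cache. This is a minimal illustration under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, and `ToySemanticCache` and its `0.8` threshold are hypothetical choices for this example, not part of any library.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySemanticCache:
    """Hypothetical cache: returns a stored output for a close-enough prompt."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def update(self, prompt: str, output: str) -> None:
        # Keep track of the input/output pair.
        self.entries.append((embed(prompt), output))

    def lookup(self, prompt: str) -> Optional[str]:
        query = embed(prompt)
        best_score, best_output = 0.0, None
        for vector, output in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_output = score, output
        # Only return a cached output when the best match clears the threshold.
        return best_output if best_score >= self.threshold else None


semantic_cache = ToySemanticCache()
semantic_cache.update(
    "What is Retrieval Augmented Generation?",
    "RAG combines retrieval with text generation.",
)
print(semantic_cache.lookup("what is retrieval augmented generation"))
print(semantic_cache.lookup("How do I bake bread?"))
```

The threshold is the main design knob: set it too low and unrelated prompts receive stale answers, set it too high and the cache behaves like an exact-match cache.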

### Initializing a Prompt Cache

There are many tools you can use to implement a prompt cache, from a "build it yourself" VectorStore implementation, to Redis, to custom libraries. Each solution has upsides and downsides.

Let's look at the Redis-backed Cache vs. `InMemoryCache` as an example:

Redis Cache

| Pros  | Cons  |
|---|---|
| Managed and Robust  | Expensive to Host  |
| Integrations on all Major Cloud Platforms  | Non-trivial to Integrate |
| Easily Scalable  | Does not have a ChatModel implementation |

`InMemoryCache`

| Pros  | Cons  |
|---|---|
| Easily implemented  | Consumes potentially precious memory |
| Completely Cloud Agnostic  | Does not offer inter-session caching |

For the sake of ease of use - and to allow functionality with our `ChatOpenAI` model - we'll leverage `InMemoryCache`.

We need to set our `langchain.llm_cache` to use the `InMemoryCache`.

- [`InMemoryCache`](https://api.python.langchain.com/en/latest/cache/langchain.cache.InMemoryCache.html)

In [10]:
import langchain
from langchain.cache import InMemoryCache

langchain.llm_cache = InMemoryCache()

One more important fact about the `InMemoryCache` is that it is what's called an "exact-match" cache - meaning it will only trigger when the user query is *exactly* represented in the cache. 

This is a safer cache, as we can guarantee the user's query exactly matches a previous query and we don't have to worry about edge cases where semantic similarity might fail. It does, however, reduce the chance of hitting the cache.
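To make the exact-match behaviour concrete, here is a minimal dict-backed sketch. `ExactMatchCache` and `fake_llm` are hypothetical names for this example only. LangChain's `InMemoryCache` also takes the model configuration into account when building its key, which this sketch ignores.

```python
from typing import Callable, Dict


class ExactMatchCache:
    """Hypothetical dict-backed cache keyed on the exact prompt string."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, prompt: str, llm: Callable[[str], str]) -> str:
        if prompt in self._store:
            self.hits += 1
            return self._store[prompt]
        # Cache miss: call the model and remember the answer.
        self.misses += 1
        output = llm(prompt)
        self._store[prompt] = output
        return output


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the example runs offline.
    return f"answer to: {prompt}"


exact_cache = ExactMatchCache()
exact_cache.get_or_call("What is Retrieval Augmented Generation?", fake_llm)
exact_cache.get_or_call("What is Retrieval Augmented Generation?", fake_llm)  # identical string: hit
exact_cache.get_or_call("what is retrieval augmented generation?", fake_llm)  # casing differs: miss
print(exact_cache.hits, exact_cache.misses)
```

Even a change in casing produces a miss, which is the trade-off described above: no false positives, but fewer hits.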

We could leverage tools like `GPTCache` or `RedisSemanticCache` (for non-chat model implementations) to get a "semantic similarity" cache, if desired! Note that LangChain's plain `RedisCache` is itself an exact-match cache.

In [11]:
%%time
agent_chain.run("What is Retrieval Augmented Generation?")

> Entering new AgentExecutor chain...
I should use the arxiv tool to search for articles on "Retrieval Augmented Generation" to find a definition or explanation.
Action: arxiv
Action Input: "Retrieval Augmented Generation"
Observation: Published: 2022-02-13
Title: A Survey on Retrieval-Augmented Text Generation
Authors: Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu
Summary: Recently, retrieval-augmented text generation attracted increasing attention
of the computational linguistics community. Compared with conventional
generation models, retrieval-augmented text generation has remarkable
advantages and particularly has achieved state-of-the-art performance in many
NLP tasks. This paper aims to conduct a survey about retrieval-augmented text
generation. It firstly highlights the generic paradigm of retrieval-augmented
generation, and then it reviews notable approaches according to different tasks
including dialogue response generation, mach

'Retrieval Augmented Generation refers to a technique that combines retrieval of information from external knowledge resources with text generation models. It has been applied in various natural language processing tasks, including dialogue response generation, machine translation, and composition assistance. The technique aims to enhance the contextual understanding of language models, reduce hallucination, and improve the quality of generated text. There are different approaches and methods proposed in the literature, such as retrieve-and-generate setups and active retrieval augmented generation. The effectiveness of retrieval augmented generation has been demonstrated in several studies, achieving state-of-the-art performance in various tasks.'

Okay, that's great! Working as expected - let's take a look at the output of our `ChatOpenAI` module in our Weights and Biases project:

```
{ "token_usage": { "prompt_tokens": 1057, "completion_tokens": 130, "total_tokens": 1187 }, "model_name": "gpt-3.5-turbo-0613" }
```

As you can see, we used `1187` total tokens, and the request took ~8s.
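The token counts are what drive cost, so it can help to turn the logged usage into a rough estimate. The sketch below parses the payload shown above and checks that the totals add up. The per-token prices are hypothetical placeholders, not OpenAI's actual rates; substitute the current pricing for your model.

```python
import json

raw_usage = (
    '{ "token_usage": { "prompt_tokens": 1057, "completion_tokens": 130, '
    '"total_tokens": 1187 }, "model_name": "gpt-3.5-turbo-0613" }'
)
usage = json.loads(raw_usage)["token_usage"]

# Hypothetical USD prices per 1,000 tokens; replace with your provider's real rates.
PRICE_PER_1K_PROMPT = 0.003
PRICE_PER_1K_COMPLETION = 0.004


def estimate_cost(token_usage: dict) -> float:
    """Estimate the request cost from prompt and completion token counts."""
    prompt_cost = token_usage["prompt_tokens"] / 1000 * PRICE_PER_1K_PROMPT
    completion_cost = token_usage["completion_tokens"] / 1000 * PRICE_PER_1K_COMPLETION
    return prompt_cost + completion_cost


# Sanity check: prompt + completion tokens should equal the reported total.
tokens_add_up = (
    usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
)
estimated_cost = estimate_cost(usage)
print(tokens_add_up, f"{estimated_cost:.6f}")
```

A cache hit skips the model call, so the tokens (and the cost) for that request drop to zero.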

Let's look at the full output of our Weights and Biases project:

![image](https://i.imgur.com/cU8NuDK.png)

Let's try the same request again and see what happens this time!

In [12]:
%%time
agent_chain.run("What is Retrieval Augmented Generation?")

> Entering new AgentExecutor chain...
I should use the arxiv tool to search for articles on "Retrieval Augmented Generation" to find a definition or explanation.
Action: arxiv
Action Input: "Retrieval Augmented Generation"
Observation: Published: 2022-02-13
Title: A Survey on Retrieval-Augmented Text Generation
Authors: Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu
Summary: Recently, retrieval-augmented text generation attracted increasing attention
of the computational linguistics community. Compared with conventional
generation models, retrieval-augmented text generation has remarkable
advantages and particularly has achieved state-of-the-art performance in many
NLP tasks. This paper aims to conduct a survey about retrieval-augmented text
generation. It firstly highlights the generic paradigm of retrieval-augmented
generation, and then it reviews notable approaches according to different tasks
including dialogue response generation, mach

'Retrieval Augmented Generation refers to a technique that combines retrieval of information from external knowledge resources with text generation models. It has been applied in various natural language processing tasks, including dialogue response generation, machine translation, and composition assistance. The technique aims to enhance the contextual understanding of language models, reduce hallucination, and improve the quality of generated text. There are different approaches and methods proposed in the literature, such as retrieve-and-generate setups and active retrieval augmented generation. The effectiveness of retrieval augmented generation has been demonstrated in several studies, achieving state-of-the-art performance in various tasks.'

Right away, we can see that the chain only took ~0.3s, very promising! Let's check in WandB!

This time, we cannot find information about token usage in Weights and Biases because we never actually needed to hit OpenAI's endpoint. 

Let's look at the Weights and Biases project output:

![image](https://i.imgur.com/UjPsC6x.png)

As you can see, we bypass the LLM calls entirely and return the previously cached result directly! (The agent itself still runs, as the `AgentExecutor` output above shows.)
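If you want to reproduce the latency comparison without spending tokens, a stub can stand in for the model. This is a minimal sketch assuming a 50 ms simulated model call; `slow_llm` and `timed` are hypothetical helpers, and `functools.lru_cache` plays the role of the exact-match cache rather than LangChain's `InMemoryCache`.

```python
import time
from functools import lru_cache
from typing import Tuple

calls = {"count": 0}  # how many times the "model" was actually invoked


@lru_cache(maxsize=None)
def slow_llm(prompt: str) -> str:
    calls["count"] += 1
    time.sleep(0.05)  # simulated network and generation latency
    return f"answer to: {prompt}"


def timed(prompt: str) -> Tuple[str, float]:
    """Return the output and the wall-clock time the call took."""
    start = time.perf_counter()
    output = slow_llm(prompt)
    return output, time.perf_counter() - start


first_out, first_s = timed("What is Retrieval Augmented Generation?")
second_out, second_s = timed("What is Retrieval Augmented Generation?")
print(f"first call: {first_s:.4f}s, second call: {second_s:.6f}s")
```

The second call returns the same answer without invoking the stub again, mirroring the drop from ~8s to ~0.3s seen above.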

# Task

Your task is to add both a prompt cache and visibility tooling to the application from any of your previous assignments, wrap it up in a Chainlit application, and host it on a Hugging Face Space (or EC2)!