# Compare Anthropic Claude 3.5 with and without caching

Anthropic has announced a new feature called Prompt Caching. It lets you cache a prompt prefix (for example, a long system prompt) so that later requests reusing the same prefix can read it from the cache, significantly reducing both latency and cost.

This notebook will compare the performance of Anthropic Claude 3.5 with and without caching using Weights & Biases Weave. 

The benefits of prompt caching grow with context length, and since Anthropic supports context windows of up to 200K tokens, we test with a large amount of context loaded from transcripts of the ThursdAI.news podcast.
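
To make the expected savings concrete, here is a rough sketch of the input cost of re-sending the same long context over several requests, with and without caching. It uses the Claude 3.5 Sonnet per-million-token prices from the pricing dict defined later in this notebook. The function name `estimate_input_cost` and the 150,000-token context are illustrative assumptions, and the model deliberately ignores output tokens and the growing conversation history.

```python
# Illustrative only: USD per token for Claude 3.5 Sonnet,
# taken from the pricing dict used later in this notebook.
BASE_INPUT = 3.00 / 1_000_000
CACHE_WRITE = 3.75 / 1_000_000
CACHE_HIT = 0.30 / 1_000_000


def estimate_input_cost(context_tokens: int, turns: int, cached: bool) -> float:
    """Rough input cost of sending the same context for `turns` requests.

    Simplifying assumptions: the context is identical on every turn, the
    cache does not expire between turns, and per-turn question tokens and
    output tokens are ignored.
    """
    if turns <= 0:
        return 0.0
    if not cached:
        # Every request pays the full base input price for the context.
        return context_tokens * BASE_INPUT * turns
    # The first request writes the cache; each later request reads from it.
    return context_tokens * (CACHE_WRITE + CACHE_HIT * (turns - 1))


context = 150_000  # hypothetical context size, in tokens
for turns in (1, 2, 4):
    uncached = estimate_input_cost(context, turns, cached=False)
    cached = estimate_input_cost(context, turns, cached=True)
    print(f"{turns} turn(s): uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Under these assumptions a single request costs more with caching, because cache writes carry a premium over base input, and caching starts to pay off from the second request onward.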


In [None]:
# Install the required packages (the Anthropic client is created in the next cell).
print('⏳ Installing packages')
%pip install -q weave set-env-colab-kaggle-dotenv tqdm ipywidgets requests anthropic
print('✅ Packages installed')

In [None]:
import glob
import json
import os

import anthropic
import requests
import weave
from set_env import set_env
from tqdm.notebook import tqdm_notebook as tqdm

# Load the API keys into the environment before creating the client,
# since anthropic.Anthropic() reads ANTHROPIC_API_KEY when it is constructed.
set_env("ANTHROPIC_API_KEY")
set_env("WANDB_API_KEY")

client = anthropic.Anthropic()

weave.init('compare-claude-caching')

FAST_MODEL = "claude-3-haiku-20240307"
SMART_MODEL = "claude-3-5-sonnet-20240620"

In [None]:
import os
import glob

@weave.op()
def formatted_system_message_with_transcripts():
    transcripts = []
    for file_path in glob.glob('./data/*.md'):
        with open(file_path, 'r', encoding='utf-8') as file:
            file_name = os.path.basename(file_path)
            content = file.read()
            transcripts.append(f"<Transcript: {file_name}>\n\n{content}\n\n</Transcript {file_name}>")
    
    all_transcripts = "\n".join(transcripts)
    
    formatted_prompt = f"""
You will be analyzing multiple podcast transcripts from different shows. The transcripts include dates the shows were recorded, timestamps, speaker labels, and chapter annotations (marked with ##). Your task is to perform a comprehensive analysis across these transcripts, taking into account the date of each transcript (found in the transcript name) and the content, answering any questions that are asked. 

Here are the transcripts you will be analyzing:

<transcripts>
{all_transcripts}
</transcripts>

Please follow these steps when answering questions about the transcripts:
1. Notify the user that your knowledge is limited to the transcripts you've been given. If the answer is not in those transcripts, say that you don't have this answer from the transcripts.
2. Refuse to answer general questions that don't pertain to the transcripts or topics in them.
3. For each question, provide the answer with additional context such as relevant timestamps, quotes with speakers, a summary of topics, whether this topic was discussed before or after in the show, etc.
4. If asked about a summary of topics, use your best judgement to summarize the topics based on chapters and the TL;DR sections that the host provides at the beginning of each show.
5. The show is hosted by Alex Volkov from Weights & Biases.
6. Transcripts can be imperfect: sometimes the speaker labels are not exactly on time or some nouns are not transcribed correctly. Take this into account.
7. Your users may be familiar with the show and will use this to look up information for editing and quoting, so do not give out false information and always ground the information given with file names, timestamps and speaker labels.
8. Return responses in Markdown. Format quotes with `>` and timestamps as inline code, and make the answer readable, like a blog post.

"""
    return formatted_prompt

# print(formatted_system_message_with_transcripts())


In [None]:
@weave.op
def calculate_price(response):
    # Define the pricing structure based on Claude response object usage 
    pricing = {
        "claude-3-5-sonnet": {
            "base_input": 3.00 / 1_000_000,
            "cache_write": 3.75 / 1_000_000,
            "cache_hit": 0.30 / 1_000_000,
            "output": 15.00 / 1_000_000
        },
        "claude-3-haiku": {
            "base_input": 0.25 / 1_000_000,
            "cache_write": 0.30 / 1_000_000,
            "cache_hit": 0.03 / 1_000_000,
            "output": 1.25 / 1_000_000
        }
    }

    model = response.model
    usage = response.usage

    if model.startswith('claude-3-5-sonnet'):
        model_key = "claude-3-5-sonnet"
    elif model.startswith('claude-3-haiku'):
        model_key = "claude-3-haiku"
    else:
        raise ValueError("Unsupported model type")
    

    # Calculate the cost. The cache fields can be missing or None (for example on
    # non-caching responses), so treat them as 0 in that case.
    cache_creation_tokens = getattr(usage, 'cache_creation_input_tokens', None) or 0
    cache_read_tokens = getattr(usage, 'cache_read_input_tokens', None) or 0

    cache_creation_cost = cache_creation_tokens * pricing[model_key]["cache_write"]
    cache_read_cost = cache_read_tokens * pricing[model_key]["cache_hit"]
    input_cost = usage.input_tokens * pricing[model_key]["base_input"]
    output_cost = usage.output_tokens * pricing[model_key]["output"]
    total_cost = cache_creation_cost + cache_read_cost + input_cost + output_cost

    resp_dict = {
        "total_cost": total_cost,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "cache_creation_input_tokens": cache_creation_tokens,
        "cache_read_input_tokens": cache_read_tokens
    }

    return resp_dict

In [None]:
@weave.op()
def get_claude_response_with_caching(conversation_history, user_message, model=FAST_MODEL):
    # Append the new user message to the conversation history
    conversation_history.append({
        "role": "user",
        "content": user_message
    })
    
    response = client.beta.prompt_caching.messages.create(
        model=model,
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": "You are an AI assistant tasked with analyzing podcast transcripts and answering questions based on this content."
            },
            {
                "type": "text",
                "text": formatted_system_message_with_transcripts(),
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=conversation_history
    )
    
    # Append the assistant's response to the conversation history

    conversation_history.append({
        "role": response.role,
        "content": response.content[0].text
    })
    price = calculate_price(response)
    print(f"Total cost: ${price['total_cost']} for {price['input_tokens']} input tokens and {price['output_tokens']} output tokens with {price['cache_creation_input_tokens']} cache creation input tokens and {price['cache_read_input_tokens']} cache read input tokens")
    print(response)
    return {"response_text": response.content[0].text, "conversation_history": conversation_history, "price": price}

# # Example usage (the function returns a dict, not a tuple)
# conversation_history = []
# user_message = "Summarize Meta announcements and quotes from speakers about the major Meta announcements."
# result = get_claude_response_with_caching(conversation_history, user_message)
# conversation_history = result["conversation_history"]

@weave.op
def get_claude_response(conversation_history, user_message, model=FAST_MODEL):
    conversation_history.append({
        "role": "user",
        "content": user_message
    })
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        system= [
            {
                "type": "text",
                "text": "You are an AI assistant tasked with analyzing podcast transcripts and answering questions based on this content."
            },
            {
                "type": "text",
                "text": formatted_system_message_with_transcripts(),
            }
        ],
        messages=conversation_history
    )
    price = calculate_price(response)
    print(f"Total cost: ${price['total_cost']} for {price['input_tokens']} input tokens and {price['output_tokens']} output tokens with {price['cache_creation_input_tokens']} cache creation input tokens and {price['cache_read_input_tokens']} cache read input tokens")
    print(response)
    conversation_history.append({
        "role": response.role,
        "content": response.content[0].text
    })
    return {"response_text": response.content[0].text, "conversation_history": conversation_history, "price": price}




In [None]:
conversation_history = []
user_message = "Tell me more about Llama 3.1 and what folks said about it! Include quotes from the speakers and summaries of the transcripts."
# The helper returns a dict, so read the fields by key instead of unpacking.
result = get_claude_response_with_caching(conversation_history, user_message)
conversation_history = result["conversation_history"]

user_message = "What else did Meta announce that's not related to LLMs?"
result = get_claude_response_with_caching(conversation_history, user_message)

print(result["response_text"])

In [61]:
import os
os.environ["WEAVE_PARALLELISM"] = "1"

In [66]:
my_transcript_questions = [
        {
            "question": "Summarize the major LLM related announcements from Meta across transcripts, in chronological order using transcript names and dates, for each announcement include quotes from the speakers 1-2 sentences each, and summarize the announcement in 1-2 sentences",
            "rubric": "Answer must include Llama 3.1 405B, and the fact that it beats GPT-4o in a bunch of metrics"
        },
        {
            "question": "What else did Meta announce that's not related to LLMs?",
            "rubric": "Answer must include SAM 2 and Joseph and Skalski quotes on how awesome it is"
        },
        {
            "question": "Which companies provide caching for their requests? And why is caching a big deal?",
            "rubric": "Answer must include Google and DeepSeek and include quotes from the speakers"
        },
        {
            "question": "What was the model that does very well on vision tasks and is around 8B parameters called?",
            "rubric": "Answer must include MiniCPM by OpenBMB"
        },
        {
            "question": "Taking into account that later episodes have newer information, which model is the best for coding according to speakers?",
            "rubric": "Answer might include DeepSeek coder and Claude"
        },
        {
            "question": "What about this strawberry thing? Summarize the saga.",
            "rubric": "Answer may include that OpenAI has project Strawberry, that there were leaks related to it, and that Sam Altman added fuel to the fire by responding on Twitter."
        },
        {
            "question": "What does Alex think about Gemini 1.5 and when did it update?",
            "rubric": "Answer must include that Alex has used Gemini 1.5 experimental to help draft the ThursdAI episodes and likes it."
        }
    ]

# Since this isn't a standard evaluation, each row defines how many conversation "turns" (questions from the list above) to run, so we can compare latency, cost and total token counts.
evaluation_dataset = [
    {
        "number_of_questions": 2
    },
    {
        "number_of_questions": 4
    }
]

In [67]:
from weave.flow.scorer import Scorer
from weave import WeaveList
from typing import Any, Optional

class ClaudePriceScorer(Scorer):

    @weave.op()
    async def score(self, model_output: Optional[dict], number_of_questions: int) -> Any:
        """Report the token usage and price for a single evaluation row.
           Args:
            - model_output: the dict returned by the model function being evaluated
            - number_of_questions: the number of conversation turns, as defined in the dataset
           Returns:
            - single dict {metric name: single evaluation value}"""

        # the keys are the column names displayed in Weave
        return {
            "input_tokens": model_output.get("input_tokens"), 
            "output_tokens": model_output.get("output_tokens"),
            "cache_creation_input_tokens": model_output.get("cache_creation_input_tokens"),
            "cache_read_input_tokens": model_output.get("cache_read_input_tokens"),
            "price": model_output.get("price")
        }

    @weave.op()
    def summarize(self, score_rows: WeaveList) -> Optional[dict]:
        """Aggregate all the scores that are calculated for each row by the scoring function.
           Args:
            - score_rows: a WeaveList object, nested dict of metrics and scores
           Returns:
            - nested dict with the same structure as the input"""
        
        # if nothing is provided the weave.flow.scorer.auto_summarize function is used
        # return auto_summarize(score_rows)

        total_cost = sum(row.get("price", 0) for row in score_rows)
        total_input_tokens = sum(row.get("input_tokens", 0) for row in score_rows)
        total_output_tokens = sum(row.get("output_tokens", 0) for row in score_rows)
        cache_creation_input_tokens = sum(row.get("cache_creation_input_tokens", 0) for row in score_rows)
        cache_read_input_tokens = sum(row.get("cache_read_input_tokens", 0) for row in score_rows)

        return {
            "Total Cost": f"${total_cost:.2f}",
            "Total Input Tokens": total_input_tokens,
            "Total Output Tokens": total_output_tokens,
            "Total Cached Created Input Tokens ": cache_creation_input_tokens,
            "Total Cached Read Input Tokens ": cache_read_input_tokens
        }

In [68]:
# Four evaluated functions, each wrapped with weave.op: Haiku without cache, Haiku with cache, Sonnet without cache and Sonnet with cache.
# Each one calls claude_requests_wrapper with the matching model and cache setting.
from time import sleep

@weave.op
def claude_requests_wrapper(number_of_questions: int, model=FAST_MODEL, cached=False) -> dict:
    conversation_history = []
    total_input_tokens = 0
    total_output_tokens = 0
    total_price = 0
    total_cache_creation_input_tokens = 0
    total_cache_read_input_tokens = 0
    
    for i in range(number_of_questions):
        user_message = my_transcript_questions[i]["question"]
        if cached:
            model_response =  get_claude_response_with_caching(conversation_history, user_message, model=model)
        else:
            model_response =  get_claude_response(conversation_history, user_message, model=model)
        conversation_history = model_response.get('conversation_history', [])

        total_input_tokens += model_response["price"]["input_tokens"]
        total_output_tokens += model_response["price"]["output_tokens"]
        total_price += model_response["price"]["total_cost"]
        total_cache_creation_input_tokens += model_response["price"]["cache_creation_input_tokens"]
        total_cache_read_input_tokens += model_response["price"]["cache_read_input_tokens"]
    
    return {
        "number_of_follow_up_questions": number_of_questions,
        "response_text": model_response.get('response_text', ''),
        "input_tokens": total_input_tokens,
        "output_tokens": total_output_tokens,
        "cache_creation_input_tokens": total_cache_creation_input_tokens,
        "cache_read_input_tokens": total_cache_read_input_tokens,
        "price": total_price
    }

@weave.op
def haiku_cached(number_of_questions: int) -> dict:
    print(f"Running {number_of_questions} questions with Haiku cached")
    return claude_requests_wrapper(number_of_questions, model=FAST_MODEL, cached=True)

@weave.op
def haiku_uncached(number_of_questions: int) -> dict:
    return claude_requests_wrapper(number_of_questions, model=FAST_MODEL, cached=False)

@weave.op
def sonnet_cached(number_of_questions: int) -> dict:
    return claude_requests_wrapper(number_of_questions, model=SMART_MODEL, cached=True)

@weave.op
def sonnet_uncached(number_of_questions: int) -> dict:
    return claude_requests_wrapper(number_of_questions, model=SMART_MODEL, cached=False)
 

haiku_uncached_eval = weave.Evaluation(dataset=evaluation_dataset, scorers=[ClaudePriceScorer()], name="Uncached Haiku")
haiku_cached_eval = weave.Evaluation(dataset=evaluation_dataset, scorers=[ClaudePriceScorer()], name="Cached Haiku")
sonnet_cached_eval = weave.Evaluation(dataset=evaluation_dataset, scorers=[ClaudePriceScorer()], name="Cached Sonnet")
sonnet_uncached_eval = weave.Evaluation(dataset=evaluation_dataset, scorers=[ClaudePriceScorer()], name="Uncached Sonnet")


claude_haiku_cached_eval_results = await haiku_cached_eval.evaluate(haiku_cached)
sleep(120)
claude_haiku_uncached_eval_results = await haiku_uncached_eval.evaluate(haiku_uncached)
sleep(120)
claude_sonnet_cached_eval_results = await sonnet_cached_eval.evaluate(sonnet_cached)
sleep(120)
claude_sonnet_uncached_eval_results = await sonnet_uncached_eval.evaluate(sonnet_uncached)
sleep(120)


Running 2 questions with Haiku cached
Total cost: $0.04573185 for 63 input tokens and 426 output tokens with 150612 cache creation input tokens and 0 cache read input tokens
PromptCachingBetaMessage(id='msg_015zSQ6rYoEVNRPUT722gs1V', content=[TextBlock(text='Based on the transcripts provided, here is a summary of the major LLM-related announcements from Meta in chronological order:\n\nFrom "ThursdAI - July 4th.md":\n> **Meta Releases Llama 3.1, a 405B Parameter Model**\n> Alex Volkov mentions that Meta released Llama 3.1, a 405 billion parameter model that beats GPT-4 on multiple benchmarks. Junyang Lin from the Qwent team comments that the 405B model is "a big leap" and the 70B and 8B models also saw significant improvements.\n\nFrom "ThursdAI - July 25th.md": \n> **Meta Licenses Llama 3.1 for Synthetic Data Generation and Distillation**\n> Alex Volkov discusses how Meta updated the license for Llama 3.1 to allow for synthetic data creation and distillation. Junyang and LDJ comment th

Total cost: $0.00503286 for 63 input tokens and 399 output tokens with 0 cache creation input tokens and 150612 cache read input tokens
PromptCachingBetaMessage(id='msg_01XdL3Bz2vdncyXnu7ryxqSC', content=[TextBlock(text='Based on the transcripts provided, here is a summary of the major LLM-related announcements from Meta in chronological order:\n\nFrom the transcript "ThursdAI - July 25th.md":\n> "Meta released, and gave us three new models in open weights, including an incredibly detailed paper, just an incredibly detailed paper full of, full of nuggets, full of information. I honestly, We\'ll not be able to relay most of it still, because I\'m still like,we did multiple paper cup."\n\nMeta released three new LLM models under the OpenWeights initiative, including a 405 billion parameter model, a 70 billion parameter model, and an 8 billion parameter model. These models were reported to outperform GPT-4 on various benchmarks.\n\nFrom the transcript "ThursdAI - July 4th.md":\n> "This we

Total cost: $0.0382525 for 150675 input tokens and 467 output tokens with 0 cache creation input tokens and 0 cache read input tokens
Message(id='msg_01Us6FdrWaJb4X1d9NXSC4QF', content=[TextBlock(text='Certainly, I\'ll summarize the major LLM-related announcements from Meta across the provided transcripts in chronological order:\n\nFrom the transcript "ThursdAI - July 4th.md":\n> **Alex Volkov:** This week is the 52nd consecutive week that I\'ve been publishing ThursdAI on Substack as a podcast. And many of you specifically liked ThursdAI, this version, like listening to ThursdAI while driving, like listening to it, I don\'t know, later.\n\nThis introduction highlights that ThursdAI has been running for over a year as a weekly podcast.\n\nFrom the transcript "ThursdAI - July 25th.md":\n> **Alex Volkov:** Meta released, and gave us three new models in open weights, including an incredibly detailed paper, full of nuggets, full of information. We got three new models, 405 billion paramete

Total cost: $0.03847875 for 150675 input tokens and 648 output tokens with 0 cache creation input tokens and 0 cache read input tokens
Message(id='msg_01FWkUEKgPfX3KhmfaB7qtLd', content=[TextBlock(text='Certainly, I\'ll summarize the major LLM-related announcements from Meta across the provided transcripts in chronological order. My knowledge is limited to the information contained in these transcripts, so I may not have all the details. I\'ll do my best to provide a helpful summary without reproducing copyrighted material.\n\nFrom the transcript "ThursdAI - July 4th":\nThe transcript mentions that Meta (referred to as "Facebook" in the transcript) released the Segment Anything Model (SAM) in March 2023. The guests Piotr Skalski and Joseph Nelson discuss how SAM is able to segment objects in images and videos with a single click, and how this can be useful for various applications:\n\n> "So instead of going through the tedious process of drawing all the masks, you just click on an obje

Total cost: $0.5715840000000001 for 63 input tokens and 440 output tokens with 150612 cache creation input tokens and 0 cache read input tokens
PromptCachingBetaMessage(id='msg_01CVKeJB4TKkgwh7sCa5A2jq', content=[TextBlock(text='Certainly! I\'ll summarize the major LLM-related announcements from Meta across the transcripts in chronological order. I\'ll include quotes from speakers and provide a brief summary for each announcement. Please note that my knowledge is based solely on the transcripts provided.\n\n1. July 25, 2023 - Meta Llama 3.1 Release (ThursdAI - July 25th.md)\n\nMeta released Llama 3.1, including three new models: 405B, 70B, and 8B parameters.\n\n> "Meta released, and gave us three new models in open weights, including an incredibly detailed paper, just an incredibly detailed paper full of, full of nuggets, full of information."\n\nAlex Volkov summarizes the release, highlighting that the 405B parameter model beats GPT-4 on multiple benchmarks and has a 128K context wind

Total cost: $0.0515976 for 63 input tokens and 415 output tokens with 0 cache creation input tokens and 150612 cache read input tokens
PromptCachingBetaMessage(id='msg_01V8RbTJkdVB2pbNpdnnjwYe', content=[TextBlock(text='Certainly! I\'ll summarize the major LLM-related announcements from Meta across the transcripts in chronological order. I\'ll include quotes from speakers and summarize each announcement briefly. My knowledge is based solely on the transcripts provided, so I\'ll focus on the information available in those.\n\n1. Meta Llama 3.1 (July 25th, 2023)\n\nIn the transcript "ThursdAI - July 25th.md", Meta announced the release of Llama 3.1, including three new models: 8B, 70B, and 405B parameters.\n\n> "Meta released, and gave us three new models in open weights, including an incredibly detailed paper, just an incredibly detailed paper full of, full of nuggets, full of information."\n\nThe 405B parameter model showed impressive performance, beating GPT-4 on multiple benchmarks. 

Total cost: $0.459255 for 150675 input tokens and 482 output tokens with 0 cache creation input tokens and 0 cache read input tokens
Message(id='msg_01SWdc7wRYTgUJR2kdRccVFM', content=[TextBlock(text='Certainly! I\'ll summarize the major LLM-related announcements from Meta across the transcripts in chronological order. I\'ll include quotes from speakers and provide a brief summary for each announcement. Please note that my knowledge is based solely on the transcripts provided.\n\n1. Meta LLAMA 3.1 (July 25th, 2023)\nFrom "ThursdAI - July 25th.md":\n\n> "Meta released, and gave us three new models in open weights, including an incredibly detailed paper, just an incredibly detailed paper full of, full of nuggets, full of information."\n\nMeta released LLAMA 3.1 with three new models: 405 billion parameters, 70 billion parameters, and 8 billion parameters. The 405B model beats GPT-4 on multiple benchmarks and has a 128K context window.\n\n2. Meta Segment Anything 2 (SAM2) (August 1st, 202

Total cost: $0.46245 for 150675 input tokens and 695 output tokens with 0 cache creation input tokens and 0 cache read input tokens
Message(id='msg_01YGCHpsw2ZZ97BkVEyXm4He', content=[TextBlock(text='Certainly! I\'ll summarize the major LLM-related announcements from Meta across the transcripts in chronological order. I\'ll include quotes from speakers and provide a brief summary for each announcement. Please note that my knowledge is based solely on the transcripts provided.\n\n1. Meta Llama 3 (April 2023)\nTranscript: ThursdAI - July 25th.md\n\n> "Yeah, I wanted to say, yes, the original release was actually a pre release as Mark said, because, he wanted to put out the 3 before it was really ready, so that was the thing, and I was pretty disappointed with 3 because of the small context, only 8k, I mean it is what, was double what we had before." - Wolfram Ravenwolf\n\nMeta released Llama 3 with an 8k context window, which was an improvement but still disappointing to some users due t
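
As a quick sanity check on the logs above, the sketch below recomputes three of the printed Haiku costs from their token counts and compares a cache-read request against an uncached one. The token counts and prices are copied from this notebook; the helper name `request_cost` is a hypothetical name used only for this illustration.

```python
# USD per token for Claude 3 Haiku, copied from the pricing dict in calculate_price.
HAIKU_PRICES = {
    "base_input": 0.25 / 1_000_000,
    "cache_write": 0.30 / 1_000_000,
    "cache_hit": 0.03 / 1_000_000,
    "output": 1.25 / 1_000_000,
}


def request_cost(input_tokens, output_tokens, cache_write_tokens=0, cache_read_tokens=0):
    """Recompute the cost of one request from its token counts (hypothetical helper)."""
    return (
        input_tokens * HAIKU_PRICES["base_input"]
        + output_tokens * HAIKU_PRICES["output"]
        + cache_write_tokens * HAIKU_PRICES["cache_write"]
        + cache_read_tokens * HAIKU_PRICES["cache_hit"]
    )


# Token counts copied from the printed Haiku logs.
cache_write_cost = request_cost(63, 426, cache_write_tokens=150_612)
cache_read_cost = request_cost(63, 399, cache_read_tokens=150_612)
uncached_cost = request_cost(150_675, 467)

print(f"cache write: ${cache_write_cost:.8f}")
print(f"cache read:  ${cache_read_cost:.8f}")
print(f"uncached:    ${uncached_cost:.8f}")
print(f"uncached / cache read: {uncached_cost / cache_read_cost:.1f}x")
```

The recomputed values agree with the logged totals ($0.04573185, $0.00503286 and $0.0382525). The requests produced different numbers of output tokens, so the ratio is only a rough indication of the saving on a cache hit.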