# Retrieval Augmented Generation (Basic w/Evals)

LLMs excel at a wide range of tasks but struggle with queries specific to your business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG lets the LLM draw on your internal knowledge bases or customer support documents, which significantly improves its ability to answer domain-specific questions. Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial and legal analysis, and much more.

In this guide, we'll demonstrate how to build and optimize a RAG system using the Anthropic documentation as our knowledge base. We'll walk you through:

1. Generating embeddings with the `intfloat/multilingual-e5-large-instruct` model, where input is truncated to at most 512 tokens
2. Storing and searching those embeddings with an in-memory vector database class from Anthropic
3. Building a robust evaluation suite. We'll go beyond 'vibes'-based evals and show you how to measure the retrieval pipeline and end-to-end performance independently
4. Implementing advanced techniques to improve RAG, including summary indexing and re-ranking with Claude.

Through a series of targeted improvements, we achieved significant performance gains on the following metrics compared to a basic RAG pipeline (we'll explain what all these metrics *mean* in a bit)

## Table of Contents

1) Setup
2) Level 1 - Basic RAG
3) Building an Evaluation System

## Setup

We'll need a few libraries and models:

1. `intfloat/multilingual-e5-large-instruct` to generate high-quality embeddings
2. `openai`, the LLM client used for (1) generation and (2) judging answers
3. `pandas`, `numpy`, `matplotlib`, and `scikit-learn` for data manipulation and visualization


In [1]:
## silent setup (-q)
!pip install openai -q
!pip install pandas -q
!pip install numpy -q
!pip install matplotlib -q
!pip install seaborn -q
!pip install -U scikit-learn -q
!pip install sentence-transformers -q
!pip install pyyaml -q

In [None]:
# model configuration
embedding_model = "intfloat/multilingual-e5-large-instruct"
generation_model = "gpt-4o-mini"
judge_model = "gpt-4o-mini"

In [10]:
import os
import getpass
from openai import OpenAI
OPENAI_API_KEY = getpass.getpass("Enter OpenAI API key")
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
# print(os.environ.get("OPENAI_API_KEY"))
client = OpenAI()

Enter OpenAI API key ········


### Download the Embeddings model and run a quick test

In [None]:
from sentence_transformers import SentenceTransformer

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, '南瓜的家常做法')
]
# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"
]
input_texts = queries + documents

model = SentenceTransformer(embedding_model)

embeddings = model.encode(input_texts, convert_to_tensor=True, normalize_embeddings=True)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[91.92853546142578, 67.5802993774414], [70.38143157958984, 92.13307189941406]]



[[91.92853546142578, 67.58030700683594], [70.38142395019531, 92.1330795288086]]
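
The test above wraps each query in an `Instruct: ... Query: ...` prefix, while the search code later in this guide encodes the raw query string. If you want search-time queries to follow the same format, one option is a small wrapper like the sketch below. `RETRIEVAL_TASK` and `format_search_query` are hypothetical names that are not part of the original notebook, and whether the prefix improves retrieval on your data is something to measure rather than assume.

```python
# Sketch: apply the "instruct" format to queries only; documents stay unprefixed.
# RETRIEVAL_TASK and format_search_query are illustrative names (assumptions).

RETRIEVAL_TASK = "Given a web search query, retrieve relevant passages that answer the query"


def get_detailed_instruct(task_description: str, query: str) -> str:
    # Same helper as in the test cell above.
    return f"Instruct: {task_description}\nQuery: {query}"


def format_search_query(query: str, task: str = RETRIEVAL_TASK) -> str:
    """Return the text to hand to the embedding model for a search query."""
    return get_detailed_instruct(task, query.strip())


formatted = format_search_query("  how do I stream responses?  ")
print(formatted)
```

The returned string would be passed to `model.encode` in place of the raw query.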


### Initialize a Vector DB Class

In this example, we're using an in-memory vector DB, but for a production application, you may want to use a hosted solution. 

In [12]:
import os
import pickle
import json
import numpy as np

class VectorDB:
    def __init__(self, name, api_key=None):
        self.name = name
        self.embeddings = []
        self.metadata = []
        self.query_cache = {}
        self.db_path = f"./data/{name}/vector_db.pkl"

    def load_vec_db_in_memory(self, data):
        if self.embeddings and self.metadata:
            print("Vector database is already loaded. Skipping data loading.")
            return
        if os.path.exists(self.db_path):
            print("Loading vector database from disk.")
            self.load_vec_db()
            return

        texts = [f"Heading: {item['chunk_heading']}\n\n Chunk Text:{item['text']}" for item in data]
        self._embed_and_store(texts, data)
        self.save_db()
        print("Vector database loaded and saved.")

    def _embed_and_store(self, texts, data):
        batch_size = 128
        result = [
            model.encode(texts[i : i + batch_size])
            for i in range(0, len(texts), batch_size)
        ]
        self.embeddings = [embedding for batch in result for embedding in batch]
        self.metadata = data

    def search(self, query, k=5, similarity_threshold=0.75):
        if query in self.query_cache:
            query_embedding = self.query_cache[query]
        else:
            query_embedding = model.encode(query)
            self.query_cache[query] = query_embedding

        if not self.embeddings:
            raise ValueError("No data loaded in the vector database.")

        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[::-1]
        top_examples = []
        
        for idx in top_indices:
            if similarities[idx] >= similarity_threshold:
                example = {
                    "metadata": self.metadata[idx],
                    "similarity": similarities[idx],
                }
                top_examples.append(example)
                
                if len(top_examples) >= k:
                    break
        # self.save_db()
        return top_examples

    def save_db(self):
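        data = {
            "embeddings": self.embeddings,
            "metadata": self.metadata,
            # Cached query embeddings are NumPy arrays, which json cannot serialize,
            # so convert them to plain lists before dumping.
            "query_cache": json.dumps(
                {q: np.asarray(e).tolist() for q, e in self.query_cache.items()}
            ),
        }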
        os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
        with open(self.db_path, "wb") as file:
            pickle.dump(data, file)

    def load_vec_db(self):
        if not os.path.exists(self.db_path):
            raise ValueError("Vector database file not found. Use load_vec_db_in_memory to create a new database.")
        with open(self.db_path, "rb") as file:
            data = pickle.load(file)
        self.embeddings = data["embeddings"]
        self.metadata = data["metadata"]
        self.query_cache = json.loads(data["query_cache"])
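
To make the ranking logic in `search` easier to follow, here is a minimal, dependency-free sketch of the same idea on toy 2-D vectors: score every stored vector against the query, sort by score, and keep at most `k` results above a threshold. The class above uses `np.dot`, which equals cosine similarity only when the vectors are unit-length; this sketch normalizes explicitly so that assumption is visible. `cosine_similarity`, `top_k`, and the toy vectors are illustrative, not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5, similarity_threshold=0.75):
    """Return up to k (index, similarity) pairs at or above the threshold, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored if score >= similarity_threshold][:k]


# Toy "embeddings": only the first two point in roughly the query's direction.
docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k([1.0, 0.0], docs, k=3)
print(hits)
```

Because of the threshold, a query can return fewer than `k` chunks, or none at all, which is worth keeping in mind when reading the precision numbers later.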

## Level 1 - Basic RAG

To get started, we'll set up a basic RAG pipeline using a bare-bones approach, often called 'Naive RAG' in the industry. A basic RAG pipeline includes the following steps:

0) Pick a prompt (there's more than one to try out)

1) Chunk documents by heading, so that each chunk contains only the content under one subheading

2) Embed each chunk

3) Use cosine similarity to retrieve the chunks most relevant to the query, and answer the query from them
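
The documentation used in this guide is already chunked in `data/anthropic_docs.json`, so the notebook does not include the chunking step itself. As a rough illustration of step 1, the sketch below splits a markdown string into one chunk per heading. `chunk_by_heading` is a hypothetical helper, the output keys `chunk_heading` and `text` simply mirror the fields the vector DB class reads, and the approach assumes headings are `#`-style lines that never appear inside code blocks.

```python
import re


def chunk_by_heading(markdown_text: str):
    """Split markdown into one chunk per heading, holding only that heading's own text."""
    chunks = []
    heading, lines = None, []
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the previous chunk, if there was one.
            if heading is not None:
                chunks.append({"chunk_heading": heading, "text": "\n".join(lines).strip()})
            heading, lines = match.group(2).strip(), []
        elif heading is not None:
            lines.append(line)
        # Text before the first heading is ignored in this sketch.
    if heading is not None:
        chunks.append({"chunk_heading": heading, "text": "\n".join(lines).strip()})
    return chunks


sample = "# Embeddings\nIntro text.\n## Before implementing\nPick a provider.\n"
for chunk in chunk_by_heading(sample):
    print(chunk)
```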

In [13]:
# pick out a prompt
import yaml

def read_prompts(filename):
    with open(filename, 'r') as file:
        data = yaml.safe_load(file)
        
    prompts = [entry["prompt"] for entry in data]
    return prompts


filename = "./prompts/prompts.yaml"
prompts = read_prompts(filename)
    
for i, prompt in enumerate(prompts, start=0):
    print(f"Prompt {i}:\n{prompt}\n")


Prompt 0:
You have been tasked with helping us to answer the following query: 
<query>
{query}
</query>
You have access to the following documents which are meant to provide context as you answer the query:
<documents>
{context}
</documents>
Please remain faithful to the underlying context, and only deviate from it if you are 100% sure that you know the answer already. 
Answer the question now, and avoid providing preamble such as 'Here is the answer', etc


Prompt 1:
You have been tasked with helping us to answer the following query: 
<query>
{query}
</query>
You have access to the following documents which are meant to provide context as you answer the query:
<documents>
{context}
</documents>
Please remain absolutely faithful to the underlying context, and do not deviate from it at all.
If you do not find the answer, say, "The context does not have the answer," 
Answer the question now, and avoid providing preamble such as 'Here is the answer', etc
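
Both prompts use `{query}` and `{context}` placeholders, so a loaded prompt can be filled with `str.format`. The sketch below shows one way to do that; `PROMPT_TEMPLATE` is a shortened stand-in for an entry from `prompts.yaml`, and `fill_prompt` is a hypothetical helper rather than a function from the notebook.

```python
# Sketch: fill a prompt template that uses {query} and {context} placeholders.
# PROMPT_TEMPLATE is a shortened stand-in for an entry loaded from prompts.yaml.
PROMPT_TEMPLATE = (
    "You have been tasked with helping us to answer the following query:\n"
    "<query>\n{query}\n</query>\n"
    "You have access to the following documents:\n"
    "<documents>\n{context}\n</documents>\n"
    "If you do not find the answer, say, \"The context does not have the answer,\"\n"
)


def fill_prompt(template: str, query: str, context: str) -> str:
    """Substitute the query and the retrieved context into the template."""
    # str.format does not re-parse substituted values, so braces inside the
    # retrieved context (for example {{variable}}) pass through unchanged.
    return template.format(query=query, context=context)


filled = fill_prompt(
    PROMPT_TEMPLATE,
    query="What is RAG?",
    context="Use the double brace syntax, e.g. {{variable}}.",
)
print(filled)
```

One caveat: if a template itself contains literal braces outside the two placeholders, they would need to be doubled for `str.format` to accept it.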




In [None]:
import json
import matplotlib.pyplot as plt
import xml.etree.ElementTree as ET
from tqdm import tqdm
import logging
from typing import Callable, List, Dict, Any, Tuple, Set

def retrieve_similar(query, db):
    results = db.search(query, k=3)
    context = ""
    for result in results:
        chunk = result['metadata']
        context += f"\n{chunk['text']}\n"
    return results, context

def construct_prompt(query, context):
    # query = "How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool"
    
    prompt = f"""
    You have been tasked with helping us to answer the following query: 
    <query>
    {query}
    </query>
    You have access to the following documents which are meant to provide context as you answer the query:
    <documents>
    {context}
    </documents>
    Please remain faithful to the underlying context, and only deviate from it if you are 100% sure that you know the answer already. 
    Answer the question now, and avoid providing preamble such as 'Here is the answer', etc
    """

    # prompt = prompts[1]
    return prompt

def answer_query_from_context(query, db):
    _, context = retrieve_similar(query, db)  # k=3 similar
    completion = client.chat.completions.create(
        model=generation_model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": construct_prompt(query, context)}
        ],
        temperature=0.2
    )
    return completion.choices[0].message.content

logging.basicConfig(filename="log.log",
                    filemode='a',
                    format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
                    datefmt='%H:%M:%S',
                    level=logging.INFO)

# Load the evaluation dataset
with open('evaluation/docs_evaluation_dataset.json', 'r') as f:
    eval_data = json.load(f)

# Load the Anthropic documentation
with open('data/anthropic_docs.json', 'r') as f:
    anthropic_docs = json.load(f)

# Initialize the VectorDB
db = VectorDB("anthropic_docs")
db.load_vec_db_in_memory(anthropic_docs)

# test
query = "What embeddings provider does Anthropic recommend for customized domain-specific models, and what capabilities does this provider offer?"
test_results, test_contexts = retrieve_similar(query, db)
print(f'Test contexts:\n{test_contexts}')
print(f'Test Answer:\n{answer_query_from_context(query, db)}')

Loading vector database from disk.



Test contexts:

How to get embeddings with Anthropic


Anthropic does not offer its own embedding model. One embeddings provider that has a wide variety of options and capabilities encompassing all of the above considerations is Voyage AI.
Voyage AI makes state-of-the-art embedding models and offers customized models for specific industry domains such as finance and healthcare, or bespoke fine-tuned models for individual customers.
The rest of this guide is for Voyage AI, but we encourage you to assess a variety of embeddings vendors to find the best fit for your specific use case.


Before implementing embeddings


When selecting an embeddings provider, there are several factors you can consider depending on your needs and preferences:
Dataset size & domain specificity: size of the model training dataset and its relevance to the domain you want to embed. Larger or more domain-specific data generally produces better in-domain embeddings
Inference performance: embedding lookup speed and

## Eval Setup

When evaluating RAG applications, it's critical to evaluate the performance of the retrieval system and the end-to-end system separately.

We synthetically generated an evaluation dataset consisting of 100 samples which include the following:
- A question
- Chunks from our docs which are relevant to that question. This is what we expect our retrieval system to retrieve when the question is asked
- A correct answer to the question.

This is a relatively challenging dataset. Some of our questions require synthesis across more than one chunk to be answered correctly, so it's important that our system can load more than one chunk at a time. You can inspect the dataset by opening `evaluation/docs_evaluation_dataset.json`.

Run the next cell to see a preview of the dataset.

In [36]:
#previewing our eval dataset
import json

def preview_json(file_path, num_items=4):
    try:
        with open(file_path, 'r') as file:
            data = json.load(file)
            
        if isinstance(data, list):
            preview_data = data[:num_items]
        elif isinstance(data, dict):
            preview_data = dict(list(data.items())[:num_items])
        else:
            print(f"Unexpected data type: {type(data)}. Cannot preview.")
            return
        
        print(f"Preview of the first {num_items} items from {file_path}:")
        print(json.dumps(preview_data, indent=2))
        print(f"\nTotal number of items: {len(data)}")
        
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except json.JSONDecodeError:
        print(f"Invalid JSON in file: {file_path}")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

preview_json('evaluation/docs_evaluation_dataset.json')

Preview of the first 4 items from evaluation/docs_evaluation_dataset.json:
[
  {
    "id": "efc09699",
    "question": "How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?",
    "correct_chunks": [
      "https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases",
      "https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases"
    ],
    "correct_answer": "To create multiple test cases in the Anthropic Evaluation tool, click the 'Add Test Case' button, fill in values for each variable in your prompt, and repeat the process to create additional test case scenarios."
  },
  {
    "id": "1305ea00",
    "question": "What embeddings provider does Anthropic recommend for customized domain-specific models, and what capabilities does this provider offer?",
    "correct_chunks": [
      "https://docs.anthropic.com/en/docs/build-with-claude/embeddings#before-implementing-embeddings",
      "h

## Defining Our Metric Calculation Functions
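
Before the functions themselves, it helps to pin down what each retrieval metric measures. Precision is the fraction of retrieved chunks that are correct, recall is the fraction of correct chunks that were retrieved, reciprocal rank is 1 divided by the position of the first correct chunk (MRR is its mean over all questions), and F1 is the harmonic mean of precision and recall. The worked example below uses made-up chunk links; `precision_recall_rr` is an illustrative helper, not one of the notebook's functions.

```python
def precision_recall_rr(retrieved, correct):
    """Per-question precision, recall, and reciprocal rank."""
    hits = len(set(retrieved) & correct)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    reciprocal_rank = 0.0
    for rank, link in enumerate(retrieved, start=1):
        if link in correct:
            reciprocal_rank = 1 / rank  # only the first correct hit counts
            break
    return precision, recall, reciprocal_rank


# Hypothetical question: 3 chunks retrieved, 2 chunks expected, 1 overlap at rank 2.
retrieved = ["doc#b", "doc#a", "doc#x"]
correct = {"doc#a", "doc#c"}

p, r, rr = precision_recall_rr(retrieved, correct)
f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
print(f"precision={p:.4f} recall={r:.4f} reciprocal_rank={rr:.4f} f1={f1:.4f}")
```

Note that `evaluate_retrieval` below computes F1 from the averaged precision and recall across all questions, not per question as in this example.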

In [None]:
def calculate_mrr(retrieved_links: List[str], correct_links: Set[str]) -> float:
    for i, link in enumerate(retrieved_links, 1):
        if link in correct_links:
            return 1 / i
    return 0

def evaluate_retrieval(retrieval_function: Callable, evaluation_data: List[Dict[str, Any]], db: Any) -> Tuple[float, float, float, float, List[float], List[float], List[float]]:
    precisions = []
    recalls = []
    mrrs = []
    
    for i, item in enumerate(tqdm(evaluation_data, desc="Evaluating Retrieval")):
        try:
            retrieved_chunks, _ = retrieval_function(item['question'], db)
            retrieved_links = [chunk['metadata'].get('chunk_link', chunk['metadata'].get('url', '')) for chunk in retrieved_chunks]
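        except Exception as e:
            logging.error(f"Error in retrieval function: {e}")
            # Record zeros so the per-item lists stay aligned with evaluation_data
            # (the results DataFrame later expects one entry per question).
            precisions.append(0.0)
            recalls.append(0.0)
            mrrs.append(0.0)
            continue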

        correct_links = set(item['correct_chunks'])
        
        true_positives = len(set(retrieved_links) & correct_links)
        precision = true_positives / len(retrieved_links) if retrieved_links else 0
        recall = true_positives / len(correct_links) if correct_links else 0
        mrr = calculate_mrr(retrieved_links, correct_links)
        
        precisions.append(precision)
        recalls.append(recall)
        mrrs.append(mrr)
        
        if (i + 1) % 10 == 0:
            print(f"Processed {i + 1}/{len(evaluation_data)} items. Current Avg Precision: {sum(precisions) / len(precisions):.4f}, Avg Recall: {sum(recalls) / len(recalls):.4f}, Avg MRR: {sum(mrrs) / len(mrrs):.4f}")
    
    avg_precision = sum(precisions) / len(precisions) if precisions else 0
    avg_recall = sum(recalls) / len(recalls) if recalls else 0
    avg_mrr = sum(mrrs) / len(mrrs) if mrrs else 0
    f1 = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall) if (avg_precision + avg_recall) > 0 else 0
    
    return avg_precision, avg_recall, avg_mrr, f1, precisions, recalls, mrrs

def evaluate_end_to_end(answer_query_function, db, eval_data):
    correct_answers = 0
    results = []
    total_questions = len(eval_data)
    
    for i, item in enumerate(tqdm(eval_data, desc="Evaluating End-to-End")):
        query = item['question']
        correct_answer = item['correct_answer']
        generated_answer = answer_query_function(query, db)
        
        comparision_prompt = f"""
        You are an AI assistant tasked with evaluating the correctness of answers to questions about Anthropic's documentation.
        
        Question: {query}
        
        Correct Answer: {correct_answer}
        
        Generated Answer: {generated_answer}
        
        Is the Generated Answer correct based on the Correct Answer? You should pay attention to the substance of the answer, and ignore minute details that may differ. 
        
        Small differences or changes in wording don't matter. If the generated answer and correct answer are saying essentially the same thing then that generated answer should be marked correct. 
        
        However, if there is any critical piece of information which is missing from the generated answer in comparison to the correct answer, then we should mark this as incorrect. 
        
        Finally, if there are any direct contradictions between the correct answer and generated answer, we should deem the generated answer to be incorrect.
        
        Respond in the following XML format (don't prefix with xml):
        <evaluation>
        <content>
        <explanation>Your explanation here</explanation>
        <is_correct>true/false</is_correct>
        </content>
        </evaluation>
        """
        
        try:
            response = client.chat.completions.create(
                model=judge_model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": comparision_prompt}
                ],
                temperature=0.2,
            )
            response_text = str(response.choices[0].message.content)
            print(f'Query:\n{query}')
            print(f'Correct answer:\n{correct_answer}')
            print(f'Generated answer:\n{generated_answer}')
            print(f'Response_text from judge LLM:\n{response_text}')
            
            evaluation = ET.fromstring(response_text)
            is_correct_value = evaluation.find(".//is_correct").text
            
            is_correct = is_correct_value == 'true'
            
            if is_correct:
                correct_answers += 1
            results.append(is_correct)
            
            logging.info(f"Question {i + 1}/{total_questions}: {query}")
            logging.info(f"Correct: {is_correct}")
            logging.info("---")
            
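        except ET.ParseError as e:
            logging.error(f"XML parsing error: {e}")
            # Fall back to a substring check when the judge's reply is not valid XML.
            # Match the filled-in tag so a stray "true" in the explanation does not count.
            is_correct = '<is_correct>true</is_correct>' in response_text.lower()
            if is_correct:
                correct_answers += 1  # keep accuracy consistent with results
            results.append(is_correct)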
        except Exception as e:
            logging.error(f"Unexpected error: {e}")
            results.append(False)
        
        if (i + 1) % 10 == 0:
            current_accuracy = correct_answers / (i + 1)
            print(f"Processed {i + 1}/{total_questions} questions. Current Accuracy: {current_accuracy:.4f}")
        # time.sleep(2)
    accuracy = correct_answers / total_questions
    return accuracy, results
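
The judge's reply is free text, so parsing it is the most fragile part of `evaluate_end_to_end`. The sketch below isolates that step into a hypothetical `parse_judge_verdict` helper that tolerates a markdown code fence around the XML and falls back to a regular expression when the XML is not well formed (for example, when the explanation contains a bare `&`). It assumes the `<is_correct>` tag from the prompt above and is one possible approach, not the notebook's implementation.

```python
import re
import xml.etree.ElementTree as ET

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable


def strip_code_fence(text: str) -> str:
    """Remove a surrounding markdown code fence, if the reply has one."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "xml".
        text = text.split("\n", 1)[1] if "\n" in text else ""
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    return text.strip()


def parse_judge_verdict(response_text: str) -> bool:
    """Return True only when the reply clearly marks the answer as correct."""
    cleaned = strip_code_fence(response_text)
    try:
        node = ET.fromstring(cleaned).find(".//is_correct")
        if node is not None and node.text:
            return node.text.strip().lower() == "true"
    except ET.ParseError:
        pass  # fall through to the regex-based check
    match = re.search(r"<is_correct>\s*(true|false)\s*</is_correct>", cleaned, re.IGNORECASE)
    return bool(match) and match.group(1).lower() == "true"


reply = "<evaluation><content><explanation>Same substance.</explanation><is_correct>true</is_correct></content></evaluation>"
print(parse_judge_verdict(reply))
```

Treating an unparseable reply as incorrect is a conservative choice: it can understate accuracy slightly, but it avoids counting answers the judge never clearly approved.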

## Evaluating Our Base Case

In [38]:
import pandas as pd

avg_precision, avg_recall, avg_mrr, f1, precisions, recalls, mrrs = evaluate_retrieval(retrieve_similar, eval_data, db)
e2e_accuracy, e2e_results = evaluate_end_to_end(answer_query_from_context, db, eval_data)

# Create a DataFrame
df = pd.DataFrame({
    'question': [item['question'] for item in eval_data],
    'retrieval_precision': precisions,
    'retrieval_recall': recalls,
    'retrieval_mrr': mrrs,
    'e2e_correct': e2e_results
})

# Save to CSV
from pathlib import Path
csv_dir = Path('evaluation/csvs')
csv_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing directories
csv_file_name = Path('evaluation_results_detailed.csv')
df.to_csv(csv_dir / csv_file_name, index=False)
print(f"Detailed results saved to {csv_dir / csv_file_name}")

# Print the results
print(f"Average Precision: {avg_precision:.4f}")
print(f"Average Recall: {avg_recall:.4f}")
print(f"Average MRR: {avg_mrr:.4f}")
print(f"Average F1: {f1:.4f}")
print(f"End-to-End Accuracy: {e2e_accuracy:.4f}")

# Save the results to a file
json_dir = Path("evaluation/json_results")
result_file_name = Path("evaluation_results_one.json")
Path(json_dir).mkdir(parents=True, exist_ok=True)
with open(json_dir / result_file_name, 'w') as f:
    json.dump({
        "name": "Basic RAG",
        "average_precision": avg_precision,
        "average_recall": avg_recall,
        "average_f1": f1,
        "average_mrr": avg_mrr,
        "end_to_end_accuracy": e2e_accuracy
    }, f, indent=2)

print(f"Evaluation complete. Results saved to {json_dir / result_file_name}, {csv_dir/ csv_file_name}")

Processed 10/100 items. Current Avg Precision: 0.4333, Avg Recall: 0.7000, Avg MRR: 0.9000
Processed 20/100 items. Current Avg Precision: 0.3333, Avg Recall: 0.5500, Avg MRR: 0.7000
Processed 30/100 items. Current Avg Precision: 0.3778, Avg Recall: 0.6000, Avg MRR: 0.7667
Processed 40/100 items. Current Avg Precision: 0.4083, Avg Recall: 0.6250, Avg MRR: 0.8000
Processed 50/100 items. Current Avg Precision: 0.4067, Avg Recall: 0.6300, Avg MRR: 0.7800
Processed 60/100 items. Current Avg Precision: 0.4056, Avg Recall: 0.6361, Avg MRR: 0.7833
Processed 70/100 items. Current Avg Precision: 0.3952, Avg Recall: 0.6167, Avg MRR: 0.7548
Processed 80/100 items. Current Avg Precision: 0.4208, Avg Recall: 0.6583, Avg MRR: 0.7792
Processed 90/100 items. Current Avg Precision: 0.4185, Avg Recall: 0.6556, Avg MRR: 0.7704
Processed 100/100 items. Current Avg Precision: 0.3933, Avg Recall: 0.6183, Avg MRR: 0.7333



Query:
How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?
Correct answer:
To create multiple test cases in the Anthropic Evaluation tool, click the 'Add Test Case' button, fill in values for each variable in your prompt, and repeat the process to create additional test case scenarios.
Generated answer:
To create multiple test cases for evaluation in the Anthropic Evaluation tool, follow these steps:

1. Access the Evaluation screen, where you will initially see a single row for a test case.
2. Click the ‘Add Test Case’ button to create additional test cases.
3. Fill in the required values for each variable in your prompt for each test case.
4. Repeat the process to create as many scenarios as needed.

Make sure your original prompt includes at least 1-2 dynamic variables using the double brace syntax (e.g., {{variable}}), as this is necessary for creating evaluation test sets. If you update your prompt text, you can re-run the entire evaluation su

Evaluating End-to-End:   2%|▏         | 2/100 [00:06<05:15,  3.22s/it]

Query:
What embeddings provider does Anthropic recommend for customized domain-specific models, and what capabilities does this provider offer?
Correct answer:
Anthropic recommends Voyage AI for embedding models. Voyage AI offers customized models for specific industry domains like finance and healthcare, as well as bespoke fine-tuned models for individual customers. They have a wide variety of options and capabilities.
Generated answer:
Anthropic recommends Voyage AI as the embeddings provider for customized domain-specific models. Voyage AI offers state-of-the-art embedding models and provides customized models tailored for specific industry domains such as finance and healthcare, as well as bespoke fine-tuned models for individual customers.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately reflects the information provided in the correct answer. Both answers mention that Anthropic recommends Voyage AI for embedding models and highlight

Evaluating End-to-End:   3%|▎         | 3/100 [00:11<06:04,  3.76s/it]

Query:
What are some key success metrics to consider when evaluating Claude's performance on a classification task, and how do they relate to choosing the right model to reduce latency?
Correct answer:
When evaluating Claude's performance on a classification task, some key success metrics to consider include accuracy, F1 score, consistency, structure, speed, bias and fairness. Choosing the right model that fits your specific requirements in terms of speed and output quality is a straightforward way to reduce latency and meet the acceptable response time for your use case.
Generated answer:
Key success metrics to consider when evaluating Claude's performance on a classification task include:

1. **Accuracy**: Measures the proportion of correct predictions made by the model. It is calculated as the number of correct predictions divided by the total number of predictions.

2. **F1 Score**: Balances precision and recall, providing a single metric that reflects both false positives and false

Evaluating End-to-End:   4%|▍         | 4/100 [00:14<05:45,  3.60s/it]

Query:
What are two ways that Claude for Sheets can improve prompt engineering workflows compared to using chained prompts?
Correct answer:
Claude for Sheets enables testing prompts across evaluation suites in parallel, which is faster than running chained prompts sequentially. It also excels at office tasks like survey analysis and online data processing that may be more cumbersome with chained prompts.
Generated answer:
Claude for Sheets can improve prompt engineering workflows compared to using chained prompts in two significant ways:

1. **Parallel Testing of Prompts**: Claude for Sheets allows users to test multiple prompts across evaluation suites simultaneously. This capability enables more efficient experimentation and comparison of different prompts, which can lead to quicker iterations and refinements in the prompt engineering process.

2. **Centralized Example Repository**: The Claude for Sheets prompting examples workbench serves as a centralized location for various example

Evaluating End-to-End:   5%|▌         | 5/100 [00:16<04:56,  3.12s/it]

Query:
What happens if a prompt for the Text Completions API is missing the "\n\nHuman:" and "\n\nAssistant:" turns?
Correct answer:
If a prompt for the Text Completions API is missing the required "\n\nHuman:" and "\n\nAssistant:" turns, it will result in an API error.
Generated answer:
If a prompt for the Text Completions API is missing the "\n\nHuman:" and "\n\nAssistant:" turns, it will result in an API error. The API requires these specific markers to properly identify the roles of the participants in the conversation. Without them, the prompt is considered invalid, and the API will not process it.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer is correct as it conveys the same essential information as the correct answer. It states that missing the "\n\nHuman:" and "\n\nAssistant:" turns will result in an API error and explains that these markers are required for the API to identify the roles in the conversation. There are no critical pieces 

Evaluating End-to-End:   6%|▌         | 6/100 [00:20<05:15,  3.36s/it]

Query:
How do the additional tokens required for tool use in Claude API requests impact pricing compared to regular API requests?
Correct answer:
Tool use requests in the Claude API are priced the same as regular API requests, based on the total input and output tokens. However, tool use requests have additional tokens beyond the regular input and output, including the tools parameter, tool use content blocks, tool result content blocks, and a special system prompt that enables tool use, which add to the total tokens and cost.
Generated answer:
The additional tokens required for tool use in Claude API requests impact pricing by increasing the total token count that is used to calculate the cost of the request. Tool use requests are priced the same as regular API requests, based on the total number of input tokens sent to the model (which includes tokens from the tools parameter) and the number of output tokens generated.

The additional tokens come from several sources:
1. The tools par

Evaluating End-to-End:   7%|▋         | 7/100 [00:22<04:20,  2.80s/it]

Query:
When will the new Anthropic Developer Console features that show API usage, billing details, and rate limits be available?
Correct answer:
The new Usage, Cost, and Rate Limits tabs in the Anthropic Developer Console that show API usage, billing details, and current rate limits will be available on June 27th, 2024.
Generated answer:
The new Anthropic Developer Console features that show API usage, billing details, and rate limits will be available on June 27th, 2024.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer states that the new Anthropic Developer Console features will be available on June 27th, 2024, which matches the information provided in the correct answer. There are no critical pieces of information missing, and there are no contradictions. Therefore, the generated answer is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:   8%|▊         | 8/100 [00:25<04:46,  3.11s/it]

Query:
When deciding whether to use chain-of-thought (CoT) for a task, what are two key factors to consider in order to strike the right balance between performance and latency?
Correct answer:
When deciding whether to use CoT, consider if the task requires in-depth thinking that a human would need to work through, and be aware that the increased output length from CoT may impact latency.
Generated answer:
When deciding whether to use chain-of-thought (CoT) for a task, two key factors to consider are:

1. **Complexity of the Task**: CoT should be used for tasks that require in-depth thinking, such as complex math, multi-step analysis, or writing complex documents. If the task is straightforward and does not require extensive reasoning, using CoT may not be necessary and could lead to unnecessary latency.

2. **Impact on Latency**: Increased output length from using CoT can impact latency. It's important to balance the need for thorough reasoning with the acceptable response time for the

Evaluating End-to-End:   9%|▉         | 9/100 [00:30<05:12,  3.43s/it]

Query:
How can I use Claude to more easily digest the content of long PDF documents?
Correct answer:
You can upload PDFs and have Claude summarize their content, making it easier to understand the key points of long documents without having to read through everything.
Generated answer:
To use Claude for more easily digesting the content of long PDF documents, you can follow these steps:

1. **Extract Text from PDF**: First, convert the PDF document into a text format. This can be done using various PDF extraction tools or software that can read and convert PDF files into plain text or Word documents.

2. **Chunk the Text**: Since long documents can be overwhelming, break the extracted text into smaller, manageable sections or chunks. This will make it easier for Claude to process and summarize the content effectively.

3. **Use Claude for Summarization**: Input each chunk of text into Claude and ask it to summarize the content. You can prompt Claude with specific questions about the tex

Evaluating End-to-End:  10%|█         | 10/100 [00:31<04:21,  2.91s/it]

Query:
According to the documentation, where can you view your organization's current API rate limits in the Anthropic Console?
Correct answer:
You can view your organization's current API rate limits in the Rate Limits tab of the Developer Console.
Generated answer:
You can view your organization's current API rate limits in the Rate Limits tab in the Developer Console of the Anthropic Console.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer conveys the same essential information as the correct answer, stating that the API rate limits can be viewed in the Rate Limits tab of the Developer Console within the Anthropic Console. The slight difference in wording does not affect the substance of the answer.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>
Processed 10/100 questions. Current Accuracy: 0.9000


Evaluating End-to-End:  11%|█         | 11/100 [00:38<05:52,  3.96s/it]

Query:
How can we measure the performance of the ticket classification system implemented using Claude beyond just accuracy?
Correct answer:
In addition to accuracy, we can measure the 95th percentile response time and average cost per classification to assess the ticket classification system's performance and production-readiness.
Generated answer:
To measure the performance of the ticket classification system implemented using Claude beyond just accuracy, consider the following metrics and approaches:

1. **Precision, Recall, and F1 Score**: These metrics provide a deeper understanding of the model's performance, especially in multi-class classification scenarios. Precision measures the accuracy of the positive predictions, recall assesses the model's ability to find all relevant instances, and the F1 score balances both precision and recall.

2. **Confusion Matrix**: Analyzing the confusion matrix helps identify specific classes where the model is performing well or poorly. This can 

Evaluating End-to-End:  12%|█▏        | 12/100 [00:41<05:32,  3.78s/it]

Query:
How can you specify a system prompt using the Text Completions API versus the Messages API?
Correct answer:
With the Text Completions API, the system prompt is added as text before the first "\n\nHuman:" turn. With the Messages API, the system prompt is specified using the separate "system" parameter when making the API request.
Generated answer:
To specify a system prompt using the Text Completions API, you include the system prompt as part of the raw string input before the first "\n\nHuman:" turn. For example:

```python
prompt = "Today is January 1, 2024.\n\nHuman: Hello, Claude\n\nAssistant:"
```

In contrast, when using the Messages API, you specify the system prompt with the `system` parameter in the API call. For example:

```python
anthropic.Anthropic().messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system="Today is January 1, 2024.",  # <-- system prompt
    messages=[
        {"role": "user", "content": "Hello, Claude"}
    ]
)
```
Respon

Evaluating End-to-End:  13%|█▎        | 13/100 [00:48<07:03,  4.86s/it]

Query:
How can you combine XML tags with chain of thought reasoning to create high-performance prompts for Claude?
Correct answer:
You can combine XML tags like <thinking> and <answer> with chain of thought reasoning, where Claude explains its step-by-step reasoning process, to create structured, high-performance prompts. For example, you can prompt Claude to show its reasoning by including "Before answering, explain your reasoning step-by-step in <thinking> tags." in the user message or system prompt.
Generated answer:
To combine XML tags with chain of thought reasoning for creating high-performance prompts for Claude, follow these steps:

1. **Identify Subtasks**: Break down your overall task into distinct, sequential subtasks. Each subtask should focus on a specific aspect of the main task.

2. **Structure with XML**: Use XML tags to clearly delineate different parts of your prompt. This helps in organizing the information and ensures that Claude can interpret each section accurately

Evaluating End-to-End:  14%|█▍        | 14/100 [00:52<06:27,  4.50s/it]

Query:
When evaluating the Claude model's performance for ticket routing, what three key metrics are calculated and what are the results for the claude-3-haiku-20240307 model on the 91 test samples?
Correct answer:
When evaluating the claude-3-haiku-20240307 model's performance on the 91 test samples, the three key metrics calculated are accuracy (89.01%), 95th percentile response time (1.61 seconds), and average cost per request routing ($0.0004).
Generated answer:
The three key metrics calculated when evaluating the Claude model's performance for ticket routing are intelligence, latency, and price. For the claude-3-haiku-20240307 model, the results on the 91 test samples are not explicitly provided in the documents. However, the model is noted for being the fastest and most compact, designed for near-instant responsiveness, which suggests it would perform well in terms of latency. The cost for this model is $0.25 per million tokens for input and $1.25 for output, indicating a competit

Evaluating End-to-End:  15%|█▌        | 15/100 [00:55<05:34,  3.94s/it]

Query:
Before starting to engineer and improve a prompt in Claude, what key things does Anthropic recommend you have in place first?
Correct answer:
Before prompt engineering, Anthropic highly recommends having a clear definition of success criteria for your use case, some ways to empirically test against those criteria, and a first draft prompt you want to improve.
Generated answer:
Before starting to engineer and improve a prompt in Claude, Anthropic recommends having the following key things in place:

1. A clear definition of the success criteria for your use case.
2. Some ways to empirically test against those criteria.
3. A first draft prompt that you want to improve.

If you do not have a first draft prompt, it is suggested to use the prompt generator available in the Anthropic Console.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The Generated Answer accurately reflects the key recommendations from the Correct Answer. It includes all the essential elements: 

Evaluating End-to-End:  16%|█▌        | 16/100 [01:01<06:26,  4.60s/it]

Query:
How does the Messages API handle mid-response prompting compared to the Text Completions API?
Correct answer:
The Messages API allows you to continue a response by making the last input message have the "assistant" role, whereas the Text Completions API lets you pre-fill part of Claude's response directly in the prompt string.
Generated answer:
The Messages API handles mid-response prompting by allowing the last input message to have the assistant role, which enables the response to continue from the specified content. For example, if you want to pre-fill part of the assistant's response, you can set the last message as an assistant message with the desired text, and the model will continue from there.

In contrast, the Text Completions API uses raw strings for inputs, where you alternate between "Human:" and "Assistant:" prompts. This means that while you can pre-fill responses in both APIs, the method of doing so differs. The Messages API requires a structured list of messages 

Evaluating End-to-End:  17%|█▋        | 17/100 [01:05<06:07,  4.42s/it]

Query:
How does Claude's response differ when given a role through a system prompt compared to not having a specific role in the financial analysis example?
Correct answer:
When given the role of CFO through a system prompt, Claude provides a much more insightful, structured, and actionable financial analysis compared to not having a specific role. The role-based response breaks down key financial metrics, provides strategic commentary, and makes specific recommendations.
Generated answer:
Claude's response differs significantly when given a specific role through a system prompt compared to not having a defined role. When a role is assigned, such as "a seasoned data scientist at a Fortune 500 company," Claude's analysis is likely to be more accurate, focused, and tailored to the expectations associated with that role. This includes using a professional tone, emphasizing relevant insights, and applying domain-specific knowledge that aligns with the responsibilities of a data scientist.



Evaluating End-to-End:  18%|█▊        | 18/100 [01:18<09:36,  7.03s/it]

Query:
What are some quantitative metrics that can be used to measure the success of a sentiment analysis model, and how might specific targets for those metrics be determined?
Correct answer:
Quantitative metrics for evaluating a sentiment analysis model include task-specific metrics like F1 score, as well as generic metrics like accuracy, precision, and recall. Specific targets should be based on industry benchmarks, prior experiments, AI research, or expert knowledge, and should represent an improvement over the current baseline.
Generated answer:
Quantitative metrics that can be used to measure the success of a sentiment analysis model include:

1. **Accuracy**: The proportion of correct predictions made by the model, calculated as (Number of correct predictions) / (Total number of predictions).

2. **F1 Score**: A measure that balances precision and recall, providing a single score that reflects both the model's ability to correctly identify positive instances and its ability to av

Evaluating End-to-End:  19%|█▉        | 19/100 [01:20<07:24,  5.49s/it]

Query:
What is a power user tip mentioned in the documentation for creating high-performance prompts using XML tags?
Correct answer:
Combining XML tags with other prompt engineering techniques like multishot prompting (using <examples> tags) or chain of thought (using <thinking> and <answer> tags) to create super-structured, high-performance prompts.
Generated answer:
Combine XML tags with other techniques like multishot prompting (<examples>) or chain of thought (<thinking>, <answer>). This creates super-structured, high-performance prompts.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately conveys the main idea of the correct answer, mentioning the combination of XML tags with multishot prompting and chain of thought techniques to create high-performance prompts. It includes the essential elements without omitting critical information, thus maintaining the substance of the original answer.</explanation>
<is_correct>true</is_correct>
</co

Evaluating End-to-End:  20%|██        | 20/100 [01:29<08:43,  6.55s/it]

Query:
How can you use an LLM like Claude to automatically grade the outputs of other LLMs based on a rubric?
Correct answer:
You can use an LLM like Claude to grade the outputs of other LLMs by providing it with the output to grade along with a detailed rubric. Instruct the LLM to think through its reasoning and then output a simple 'correct' or 'incorrect' result based on how well the output matches the criteria in the rubric.
Generated answer:
To use an LLM like Claude to automatically grade the outputs of other LLMs based on a rubric, follow these steps:

1. **Define a Clear Rubric**: Create a detailed rubric that specifies the criteria for grading. For example, you might state that "the answer should always mention 'Acme Inc.' in the first sentence." This clarity helps the LLM understand what to look for in the responses.

2. **Build a Grader Prompt**: Construct a prompt that instructs Claude to evaluate an answer based on the rubric. The prompt should include the rubric and the an

Evaluating End-to-End:  21%|██        | 21/100 [01:36<08:46,  6.67s/it]

Query:
How can you access and deploy Voyage embeddings on AWS Marketplace?
Correct answer:
To access Voyage embeddings on AWS, subscribe to the model package on AWS Marketplace, select the model to deploy, agree to the terms, and copy the Product ARN for your selected region. Then create a JupyterLab space in SageMaker Studio, upload Voyage's notebook, and follow the instructions to deploy the model package using the ARN.
Generated answer:
To access and deploy Voyage embeddings on AWS Marketplace, follow these steps:

1. **Subscribe to the Model Package**:
   - Navigate to the model package listing page on AWS Marketplace and select the Voyage model you wish to deploy.
   - Click on the "Continue to subscribe" button.
   - Review the details on the "Subscribe to this software" page. If you agree with the End-User License Agreement (EULA), pricing, and support terms, click on "Accept Offer".
   - After selecting "Continue to configuration" and choosing a region, you will receive a Produc

Evaluating End-to-End:  22%|██▏       | 22/100 [01:41<08:14,  6.35s/it]

Query:
When using tools just to get Claude to produce JSON output following a particular schema, what key things should you do in terms of tool setup and prompting?
Correct answer:
When using tools to get JSON output, you should provide a single tool, set the tool_choice to explicitly instruct the model to use that tool, and ensure the tool name and description are from the model's perspective since it will pass the input to the tool.
Generated answer:
To effectively set up tools for Claude to produce JSON output following a particular schema, you should focus on the following key aspects:

1. **Define Tools Clearly**: Specify the tools in your API request with clear names, descriptions, and input schemas. Each tool definition should include:
   - **Name**: A unique identifier for the tool.
   - **Description**: A detailed explanation of what the tool does and when it should be used.
   - **Input Schema**: A JSON Schema object that defines the expected parameters for the tool, including

Evaluating End-to-End:  23%|██▎       | 23/100 [01:48<08:11,  6.39s/it]

Query:
What are the key differences between the legacy Claude Instant 1.2 model and the Claude 3 Haiku model in terms of capabilities and performance?
Correct answer:
The Claude 3 Haiku model has vision capabilities, is faster, more performant, and more intelligent than the legacy Claude Instant 1.2 model. Claude 3 Haiku also has more up-to-date training data.
Generated answer:
The key differences between the legacy Claude Instant 1.2 model and the Claude 3 Haiku model in terms of capabilities and performance are as follows:

1. **Intelligence and Performance**: Claude 3 Haiku is part of the Claude 3 family, which is designed to be more intelligent and capable than legacy models like Claude Instant 1.2. The Claude 3 models offer higher levels of performance, fluency, and understanding.

2. **Speed and Latency**: Claude 3 Haiku is noted for being the fastest and most compact model, providing near-instant responsiveness. In contrast, Claude Instant 1.2 is slower compared to the Claude 3 m

Evaluating End-to-End:  24%|██▍       | 24/100 [01:50<06:19,  4.99s/it]

Query:
What is one key benefit of using examples when prompt engineering with Claude?
Correct answer:
One key benefit of using examples in prompts is that they reduce misinterpretation of instructions, leading to more accurate outputs from Claude.
Generated answer:
One key benefit of using examples when prompt engineering with Claude is that they reduce misinterpretation of instructions, leading to increased accuracy in the responses generated.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer conveys the same essential idea as the correct answer, stating that using examples reduces misinterpretation of instructions and leads to increased accuracy in responses. The slight difference in wording does not change the meaning, so the generated answer is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  25%|██▌       | 25/100 [01:52<05:15,  4.21s/it]

Query:
According to the Anthropic documentation, what is one key advantage of using prompt engineering instead of fine-tuning when it comes to adapting an AI model to new domains or tasks?
Correct answer:
Prompt engineering allows you to easily adapt AI models to new domains by providing domain-specific context directly in the prompts, without needing to retrain the model through fine-tuning.
Generated answer:
One key advantage of using prompt engineering instead of fine-tuning is that prompt engineering is far faster and provides nearly instantaneous results, allowing for quick problem-solving. In contrast, fine-tuning can take hours or even days to implement.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer highlights the speed advantage of prompt engineering over fine-tuning, which is a relevant point. However, it does not mention the aspect of providing domain-specific context directly in the prompts, which is a key part of the correct answer. T

Evaluating End-to-End:  26%|██▌       | 26/100 [01:55<04:38,  3.76s/it]

Query:
How can I quickly get started using the Claude for Sheets extension with a pre-made template?
Correct answer:
You can make a copy of Anthropic's provided Claude for Sheets workbook template to quickly get started using the extension with your own work.
Generated answer:
To quickly get started using the Claude for Sheets extension with a pre-made template, you should make a copy of the Claude for Sheets workbook template. This will allow you to begin working with your own Claude for Sheets setup right away. You can find the template and instructions on how to make a copy in the provided resources. Once you have the template, you can start utilizing Claude for various tasks, including prompt engineering and data processing.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly states that you should make a copy of the Claude for Sheets workbook template to get started, which aligns with the correct answer. It also mentions the ability to u

Evaluating End-to-End:  27%|██▋       | 27/100 [01:58<04:35,  3.77s/it]

Query:
How does the "index" field in the "content_block_delta" event relate to the text being streamed in a response?
Correct answer:
The "index" field in each "content_block_delta" event indicates which content block the text delta applies to. Multiple deltas with the same index consecutively stream the text for a single content block in the response.
Generated answer:
The "index" field in the "content_block_delta" event indicates the position of the content block being updated within the overall message structure. Each "content_block_delta" event corresponds to a specific content block, and the "index" helps identify which block is being modified. 

In the context of the text being streamed in a response, the "index" allows the receiving system to correctly place the incoming text updates into the appropriate location within the message. For example, if multiple content blocks are present, each with its own deltas, the "index" ensures that the updates are applied to the correct block,

Evaluating End-to-End:  28%|██▊       | 28/100 [02:02<04:29,  3.74s/it]

Query:
How can you include an image as part of a Claude API request, and what image formats are currently supported?
Correct answer:
To include an image in a Claude API request, provide it as a base64-encoded image in an "image" content block within the "messages" array. The currently supported image formats are JPEG, PNG, GIF, and WebP.
Generated answer:
To include an image as part of a Claude API request, you can upload the image directly in the request. The supported image formats are JPEG, PNG, GIF, and WebP, specifically: 

- image/jpeg
- image/png
- image/gif
- image/webp

Keep in mind that the maximum file size for images uploaded via the API is 5MB. Additionally, you can include up to 20 images in a single request. Note that Claude cannot read image URLs, so images must be uploaded directly rather than linked.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer provides an alternative method for including images in a Claude API request by stati

Evaluating End-to-End:  29%|██▉       | 29/100 [02:06<04:36,  3.90s/it]

Query:
What is the relationship between time to first token (TTFT) and latency when evaluating a language model's performance?
Correct answer:
TTFT is a specific measure of latency that captures the time it takes for a language model to generate the first token of its response after receiving a prompt. It is an important component of a model's overall latency and responsiveness, especially for interactive applications.
Generated answer:
The relationship between Time to First Token (TTFT) and latency in evaluating a language model's performance is that TTFT is a specific component of latency. TTFT measures the time it takes for a model to generate the first token of its output after receiving a prompt, while latency refers to the overall time taken for the model to respond to the prompt, which includes the time taken to process the input and generate the entire output.

A lower TTFT contributes to lower latency, particularly in interactive applications where quick initial feedback is ess

Evaluating End-to-End:  30%|███       | 30/100 [02:13<05:25,  4.65s/it]

Query:
How can providing Claude with examples of handling certain edge cases like implicit requests or emotional prioritization help improve its performance in routing support tickets?
Correct answer:
Providing edge case examples to Claude in the prompt can meaningfully improve its performance in correctly routing support tickets in scenarios where it may otherwise misclassify them, such as implicit requests, emotional prioritization, ambiguous intent vs. routing, or issue prioritization.
Generated answer:
Providing Claude with examples of handling edge cases like implicit requests and emotional prioritization can significantly enhance its performance in routing support tickets by improving its understanding of nuanced customer interactions. 

1. **Implicit Requests**: Customers often communicate their needs indirectly. By training Claude with examples of such implicit requests, it can learn to recognize phrases that suggest a need for information or action, even when not explicitly sta

Evaluating End-to-End:  31%|███       | 31/100 [02:18<05:30,  4.79s/it]

Query:
How does the stop_reason of "tool_use" relate to the overall workflow of integrating external tools with Claude?
Correct answer:
When Claude determines that one of the user-provided tools can help answer the user's query, it constructs a tool use request. This causes the API response to have a stop_reason of "tool_use", signaling Claude's intent to use the tool. The user must then extract the tool input from Claude's request, run the actual tool code client-side, and continue the conversation by sending the tool results back to Claude.
Generated answer:
The stop_reason of "tool_use" indicates that Claude has determined that utilizing an external tool is necessary to effectively respond to the user's query. This is a critical part of the workflow for integrating external tools with Claude, as it signifies that Claude has assessed the user prompt and identified that the available tools can provide the required information or functionality.

In the overall workflow, the process begi

Evaluating End-to-End:  32%|███▏      | 32/100 [02:20<04:38,  4.10s/it]

Query:
According to the documentation, what error event and corresponding HTTP error code may be sent during periods of high usage for the Anthropic API when using streaming responses?
Correct answer:
During periods of high usage, an overloaded_error event may be sent in the event stream, which would normally correspond to an HTTP 529 error code in a non-streaming context.
Generated anser:
During periods of high usage for the Anthropic API when using streaming responses, the error event that may be sent is `overloaded_error`, which corresponds to the HTTP error code 529 in a non-streaming context.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately reflects the information provided in the correct answer. It states that during periods of high usage, the `overloaded_error` event may be sent in the event stream, and it correctly identifies that this corresponds to an HTTP 529 error code in a non-streaming context. There are no critical pieces 

Evaluating End-to-End:  33%|███▎      | 33/100 [02:23<03:57,  3.55s/it]

Query:
What are the two types of deltas that can be contained in a content_block_delta event when streaming responses from the Anthropic API?
Correct answer:
The two types of deltas that can be contained in a content_block_delta event are text_delta and input_json_delta.
Generated anser:
The two types of deltas that can be contained in a content_block_delta event when streaming responses from the Anthropic API are:

1. **input_json_delta** - This delta type corresponds to updates for the input field of the block and consists of partial JSON strings.

2. **(The second type is not explicitly mentioned in the provided documents. The context only refers to input_json_delta as the type of delta for tool_use content blocks.)**
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly identifies one of the delta types as input_json_delta but fails to mention the second type, which is text_delta. Since it does not provide the complete information as state

Evaluating End-to-End:  34%|███▍      | 34/100 [02:25<03:31,  3.20s/it]

Query:
On what date did Claude 3.5 Sonnet and tool use both become generally available across the Anthropic API, Amazon Bedrock, and Google Vertex AI?
Correct answer:
Claude 3.5 Sonnet became generally available across those platforms on June 20th, 2024, while tool use became generally available on May 30th, 2024.
Generated anser:
Claude 3.5 Sonnet and tool use both became generally available across the Anthropic API, Amazon Bedrock, and Google Vertex AI on June 20th, 2024.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer incorrectly states that both Claude 3.5 Sonnet and tool use became generally available on June 20th, 2024. However, according to the correct answer, Claude 3.5 Sonnet became available on June 20th, 2024, while tool use became available earlier on May 30th, 2024. This is a critical piece of information that is missing in the generated answer, making it incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>

Evaluating End-to-End:  35%|███▌      | 35/100 [02:27<03:05,  2.85s/it]

Query:
In what order did Anthropic launch Claude.ai and the Claude iOS app in Canada and Europe?
Correct answer:
Anthropic launched Claude.ai and the Claude iOS app in Europe in May 2024, and then launched them in Canada the following month in June 2024.
Generated anser:
Claude.ai and the Claude iOS app were launched in Canada on June 5th, 2024, and in Europe on May 13th, 2024. Therefore, the order of launch is Europe first, followed by Canada.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly identifies the order of the launches, stating that Claude.ai and the Claude iOS app were launched in Europe first and then in Canada. However, it provides specific dates that differ from the correct answer, which states the launches occurred in May and June 2024 without specific dates. Since the generated answer does not contradict the order of launches and conveys the same essential information, it can be considered correct despite the date discrepa

Evaluating End-to-End:  36%|███▌      | 36/100 [02:32<03:35,  3.37s/it]

Query:
When the API response from Claude has a stop_reason of "tool_use", what does this indicate and what should be done next to continue the conversation?
Correct answer:
A stop_reason of "tool_use" signals that Claude has decided to use a tool and has constructed a formatted tool use request. To continue the conversation, the tool name and input should be extracted from Claude's request, the actual tool code should be executed client-side, and then a new user message containing a tool_result content block should be sent to Claude.
Generated anser:
When the API response from Claude has a stop_reason of "tool_use," it indicates that Claude has determined that a tool is necessary to assist with the user's query. The next steps to continue the conversation are as follows:

1. **Extract Tool Information**: Retrieve the tool name, id, and input from the tool_use block in the response.

2. **Execute the Tool**: Run the corresponding tool in your codebase using the extracted input.

3. **Co

Evaluating End-to-End:  37%|███▋      | 37/100 [02:34<03:10,  3.02s/it]

Query:
What Python libraries are used in the example code snippet for evaluating tone and style in a customer service chatbot?
Correct answer:
The example code snippet for evaluating tone and style in a customer service chatbot uses the anthropic Python library to interact with the Claude AI model.
Generated anser:
The Python library used in the example code snippet for evaluating tone and style in a customer service chatbot is `anthropic`.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly identifies the `anthropic` Python library as the one used in the example code snippet for evaluating tone and style in a customer service chatbot. It captures the essential information provided in the correct answer, which also mentions the interaction with the Claude AI model. Although it does not explicitly mention the Claude AI model, the core information about the library being used is accurate and aligns with the correct answer's substance.</explana

Evaluating End-to-End:  38%|███▊      | 38/100 [02:37<03:04,  2.98s/it]

Query:
What are the two main ways to authenticate when using the Anthropic Python SDK to access Claude models on Amazon Bedrock?
Correct answer:
The two main ways to authenticate are: 1) Directly providing the aws_access_key, aws_secret_key, and optionally aws_session_token, or 2) Using the default AWS credential providers, such as the ~/.aws/credentials file or the AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID environment variables.
Generated anser:
The two main ways to authenticate when using the Anthropic Python SDK to access Claude models on Amazon Bedrock are:

1. By providing AWS access keys directly in the code using `aws_access_key` and `aws_secret_key`.
2. By using the default AWS credential providers, such as those configured in `~/.aws/credentials` or through the environment variables `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID`.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately reflects the two main ways to authenticate when 

Evaluating End-to-End:  39%|███▉      | 39/100 [02:40<03:03,  3.01s/it]

Query:
When deciding whether to implement leak-resistant prompt engineering strategies, what two factors should be considered and balanced?
Correct answer:
When deciding to use leak-resistant prompt engineering, the potential reduction in prompt leaks should be balanced against the risk of degraded model performance due to the added complexity of the prompt.
Generated anser:
When deciding whether to implement leak-resistant prompt engineering strategies, two key factors to consider and balance are:

1. **Complexity of the Prompt**: Implementing leak-resistant strategies can add complexity to the prompt, which may negatively impact the model's performance or the quality of its outputs. It's essential to ensure that the added complexity does not degrade the overall effectiveness of the task.

2. **Risk of Information Leakage**: The need to prevent sensitive or proprietary information from leaking through the model's outputs is crucial. Balancing the necessity of protecting this informati

Evaluating End-to-End:  40%|████      | 40/100 [02:43<03:05,  3.09s/it]

Query:
How can selecting the appropriate Claude model based on your specific requirements help reduce latency in your application?
Correct answer:
Choosing the right Claude model that best fits your needs in terms of speed and output quality is one of the most straightforward ways to reduce latency in your application. Anthropic offers a range of Claude models with different capabilities and performance characteristics to allow you to choose the optimal balance of intelligence, speed, and cost for your use case.
Generated anser:
Selecting the appropriate Claude model based on your specific requirements can significantly reduce latency in your application by allowing you to choose a model that is optimized for speed and performance characteristics that align with your needs. Different models within the Claude family have varying capabilities, and some are designed to deliver faster responses while still maintaining acceptable output quality.

By carefully assessing your use case, you ca

Evaluating End-to-End:  41%|████      | 41/100 [02:47<03:12,  3.26s/it]

Query:
How can you stream responses from the Anthropic API using the Python SDK?
Correct answer:
You can stream responses from the Anthropic API using the Python SDK by using the client.messages.stream() method and iterating over the stream.text_stream attribute in a for loop.
Generated anser:
To stream responses from the Anthropic API using the Python SDK, you can use the following code snippet:

```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
    model="claude-3-5-sonnet-20240620",
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

This code initializes the Anthropic client, sets up a streaming request with specified parameters (like `max_tokens`, `messages`, and `model`), and then iterates over the streamed text responses, printing them in real-time.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generate

Evaluating End-to-End:  42%|████▏     | 42/100 [02:50<03:03,  3.16s/it]

Query:
How can you guide Claude's response by pre-filling part of the response, and what API parameter is used to generate a short response in this case?
Correct answer:
You can shape Claude's response by pre-filling part of it in the last position of the input messages list. To get a short response like a single multiple choice answer, you can set the "max_tokens" parameter to a small value like 1.
Generated anser:
To guide Claude's response by pre-filling part of the response, you can include the desired initial text in the Assistant message. This allows Claude to continue from where the Assistant message leaves off. 

The API parameter used to generate a short response in this case is `max_tokens`, which you can set to limit the length of the response. For example, you might set `max_tokens=50` to encourage a shorter response.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly explains how to guide Claude's response by pre-filling part o

Evaluating End-to-End:  43%|████▎     | 43/100 [02:52<02:53,  3.04s/it]

Query:
What is more important when building an eval set for an AI system - having a larger number of test cases with automated grading, or having fewer high-quality test cases graded by humans?
Correct answer:
When building an eval set, it is better to prioritize having a larger volume of test cases with slightly lower signal automated grading over having fewer questions with high-quality human hand-grading.
Generated anser:
When building an eval set for an AI system, having a larger number of test cases with automated grading is generally more important than having fewer high-quality test cases graded by humans. The principles suggest prioritizing volume over quality, as more questions with slightly lower signal from automated grading can provide a broader assessment of the AI's performance across various scenarios. Automated grading methods are also faster, more reliable, and scalable, making them preferable for evaluating AI systems efficiently. Human grading, while high quality, is

Evaluating End-to-End:  44%|████▍     | 44/100 [02:55<02:46,  2.98s/it]

Query:
What are the two required fields in a content_block_delta event for a text delta type?
Correct answer:
The two required fields in a content_block_delta event for a text delta type are "index" and "delta", where the "delta" field contains a "type" of "text_delta" and the "text" being added.
Generated anser:
The two required fields in a content_block_delta event for a text delta type are "index" and "delta". The "delta" field must contain a "type" field with the value "text_delta" and a "text" field that holds the actual text content.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly identifies the two required fields "index" and "delta" for a content_block_delta event of type text delta. It also accurately states that the "delta" field must contain a "type" field with the value "text_delta" and a "text" field that holds the actual text content. Therefore, it conveys the same essential information as the correct answer without omittin

Evaluating End-to-End:  45%|████▌     | 45/100 [02:58<02:43,  2.98s/it]

Query:
What are two interactive ways to learn how to use Claude's capabilities, such as uploading PDFs and generating embeddings?
Correct answer:
The Anthropic Cookbook provides interactive Jupyter notebooks demonstrating how to upload PDFs, generate embeddings, and more. The Developer Console offers a prompt generator tool for easier, more powerful prompting.
Generated anser:
Two interactive ways to learn how to use Claude's capabilities, such as uploading PDFs and generating embeddings, are:

1. **Prompt Engineering Interactive Tutorial**: This tutorial utilizes Claude for Sheets and provides a hands-on approach to learning prompt engineering techniques. Users can interact with Claude directly, which allows for practical experience in using its capabilities. An API key is required to access this tutorial.

2. **Using Claude’s Vision Capabilities**: You can explore Claude’s vision features by uploading images directly through the chat interface or using the Console Workbench. This int

Evaluating End-to-End:  46%|████▌     | 46/100 [03:03<03:07,  3.47s/it]

Query:
Why does breaking a task into distinct subtasks for chained prompts help improve Claude's accuracy on the overall task?
Correct answer:
Breaking a task into distinct subtasks for chained prompts improves Claude's accuracy because each subtask gets Claude's full attention, reducing errors compared to tackling the entire complex task at once.
Generated anser:
Breaking a task into distinct subtasks for chained prompts helps improve Claude's accuracy on the overall task for several reasons:

1. **Focused Attention**: Each subtask allows Claude to concentrate fully on a specific aspect of the task, which reduces the likelihood of errors that can occur when handling complex, multi-faceted instructions all at once.

2. **Clarity**: Simpler and more defined subtasks lead to clearer instructions and outputs. When tasks are broken down, the expectations for each step are more straightforward, making it easier for Claude to understand and execute them correctly.

3. **Traceability**: By st

Evaluating End-to-End:  47%|████▋     | 47/100 [03:07<03:22,  3.81s/it]

Query:
How does the streaming format for Messages responses differ from Text Completions streaming responses?
Correct answer:
Messages streaming responses can contain multiple content blocks of varying types, making the streaming format more complex compared to Text Completions which only include completion, ping, and error server-sent-events.
Generated anser:
The streaming format for Messages responses differs from Text Completions streaming responses in several key ways:

1. **Complexity of Content**: Messages can contain multiple content blocks of varying types, making their streaming format more complex compared to Text Completions, which primarily involves a single completion response.

2. **Event Structure**: In the streaming format for Messages, the response includes:
   - A `message_start` event.
   - Potentially multiple content blocks, each containing:
     - A `content_block_start` event.
     - Multiple `content_block_delta` events.
     - A `content_block_stop` event.
   -

Evaluating End-to-End:  48%|████▊     | 48/100 [03:10<03:05,  3.57s/it]

Query:
What are two ways to start experimenting with Claude as a user, according to Anthropic's documentation?
Correct answer:
According to the documentation, users can start experimenting with Claude by visiting claude.ai or using Anthropic's web Console.
Generated anser:
Two ways to start experimenting with Claude as a user, according to Anthropic's documentation, are:

1. Follow the Quickstart to make your first API call.
2. Experiment and start building with the Workbench.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The Generated Answer provides two different methods for experimenting with Claude: following the Quickstart to make an API call and using the Workbench. However, these methods do not match the two methods mentioned in the Correct Answer, which are visiting claude.ai and using Anthropic's web Console. Since the Generated Answer does not include the correct methods as specified in the Correct Answer, it is deemed incorrect.</explanation>
<is_correct>

Evaluating End-to-End:  49%|████▉     | 49/100 [03:15<03:17,  3.87s/it]

Query:
How can using chain prompts help reduce errors and inconsistency in complex tasks handled by Claude?
Correct answer:
Chain prompts break complex tasks into smaller subtasks, allowing Claude to give its full attention to each one. This reduces errors and inconsistencies that may occur when trying to handle a complex workflow all at once.
Generated anser:
Using chain prompts can significantly reduce errors and inconsistencies in complex tasks handled by Claude by breaking down the overall task into smaller, manageable subtasks. This approach allows Claude to focus on one specific aspect at a time, which enhances accuracy and minimizes the likelihood of mistakes.

By structuring tasks into simpler subtasks, clarity is improved, as each subtask has clearer instructions and expected outputs. This clarity helps ensure that the responses generated are more consistent and aligned with the intended goals.

Additionally, chain prompts facilitate traceability, making it easier to identify 

Evaluating End-to-End:  50%|█████     | 50/100 [03:17<02:40,  3.21s/it]

Query:
What HTTP status code does an overloaded_error event correspond to in a non-streaming context for the Anthropic API?
Correct answer:
In a non-streaming context, an overloaded_error event would normally correspond to an HTTP 529 status code.
Generated anser:
An overloaded_error event corresponds to an HTTP status code of 529 in a non-streaming context for the Anthropic API.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately states that an overloaded_error event corresponds to an HTTP status code of 529 in a non-streaming context for the Anthropic API, which is consistent with the correct answer. There are no critical pieces of information missing or contradictions present. Therefore, the generated answer is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>
Processed 50/100 questions. Current Accuracy: 0.7600

Evaluating End-to-End:  51%|█████     | 51/100 [03:19<02:24,  2.95s/it]

Query:
What are the two ways to specify the format in which Voyage AI returns embeddings through its HTTP API?
Correct answer:
When making a request to Voyage AI's embedding endpoint, you can either leave the encoding_format parameter unspecified to get the embeddings as lists of floating-point numbers, or set encoding_format to "base64" to get the embeddings compressed to Base64 encodings.
Generated anser:
The two ways to specify the format in which Voyage AI returns embeddings through its HTTP API are:

1. By default, the embeddings are represented as lists of floating-point numbers.
2. By setting the `encoding_format` parameter to "base64", the embeddings are compressed to Base64 encodings.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly identifies the two ways to specify the format for embeddings in Voyage AI's HTTP API. It mentions that by default, embeddings are represented as lists of floating-point numbers and that setting the `e

Evaluating End-to-End:  52%|█████▏    | 52/100 [03:25<02:58,  3.72s/it]

Query:
When streaming API requests that use tools, how are the input JSON deltas for tool_use content blocks sent, and how can they be accumulated and parsed by the client?
Correct answer:
When streaming requests with tool use, the input JSON deltas for tool_use content blocks are sent as partial JSON strings in multiple content_block_delta events. The client can accumulate these partial JSON strings and parse the complete JSON object once a content_block_stop event is received, using a library like Pydantic for partial JSON parsing or helpers provided in Anthropic's SDKs.
Generated anser:
When streaming API requests that use tools, the input JSON deltas for tool_use content blocks are sent as partial JSON strings. Each delta corresponds to updates for the input field of the block, allowing for maximum granularity. The final `tool_use.input` is always an object, while the deltas are emitted as multiple `content_block_delta` events, each containing a `partial_json` field that represents

Evaluating End-to-End:  53%|█████▎    | 53/100 [03:28<02:44,  3.49s/it]

Query:
What are the two interactive prompt engineering tutorials that Anthropic offers, and how do they differ?
Correct answer:
Anthropic offers a GitHub prompting tutorial that covers prompt engineering concepts in-depth with examples, and a lighter-weight Google Sheets prompting tutorial that utilizes Claude for Sheets.
Generated anser:
Anthropic offers two interactive prompt engineering tutorials: one that utilizes Claude for Sheets and another that includes a prompt generator in the Anthropic Console. 

The tutorial using Claude for Sheets focuses on teaching and refining prompt engineering techniques through direct interaction with the Claude model in a spreadsheet environment. Users need an API key to access this tutorial.

The prompt generator, on the other hand, is designed for users who do not have a first draft prompt. It assists in creating initial prompts that can then be improved upon using the techniques learned in the Claude for Sheets tutorial. 

In summary, the Claude 

Evaluating End-to-End:  54%|█████▍    | 54/100 [03:36<03:53,  5.07s/it]

Query:
What are some of the key capabilities that make Claude suitable for enterprise use cases requiring integration with specialized applications and processing of large volumes of sensitive data?
Correct answer:
Claude offers a 200K token context window, tool use for integration into specialized applications, multimodal input capabilities for richer context, and is uniquely positioned to serve high-trust industries processing large volumes of sensitive data with enterprise-grade security and data handling.
Generated anser:
Claude is suitable for enterprise use cases requiring integration with specialized applications and processing of large volumes of sensitive data due to several key capabilities:

1. **Enterprise-Grade Security**: Claude offers SOC II Type 2 certification and HIPAA compliance options, ensuring robust security and data handling for sensitive information.

2. **Trustworthiness**: The model is designed to be resistant to jailbreaks and misuse, with continuous monitor

Evaluating End-to-End:  55%|█████▌    | 55/100 [03:38<03:04,  4.11s/it]

Query:
As of June 2024, in which regions are Anthropic's Claude.ai API and iOS app available?
Correct answer:
As of June 2024, Anthropic's Claude.ai API and iOS app are available in the United States, Canada, and Europe.
Generated anser:
As of June 2024, Anthropic's Claude.ai API and iOS app are available in Canada and Europe.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer is missing the information that the Claude.ai API and iOS app are also available in the United States, which is a critical piece of information compared to the correct answer. Therefore, the generated answer is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>

Evaluating End-to-End:  56%|█████▌    | 56/100 [03:42<02:56,  4.01s/it]

Query:
What are the two main approaches for integrating Claude into a support ticket workflow, and how do they differ in terms of scalability and ease of implementation?
Correct answer:
The two main approaches for integrating Claude into a support ticket workflow are push-based using webhooks, and pull-based. The push-based approach is more web-scalable but requires exposing a public endpoint which has IT security implications. The pull-based approach is easier to implement but makes unnecessary calls to the support ticket system.
Generated anser:
The two main approaches for integrating Claude into a support ticket workflow are the push-based approach and the pull-based approach.

1. **Push-Based Approach**: In this method, the support ticket system (like Zendesk) triggers the integration by sending a webhook event to the routing service whenever a new ticket is created. This approach is more scalable because it allows for real-time processing of tickets as they are created, reducing l

Evaluating End-to-End:  57%|█████▋    | 57/100 [03:44<02:27,  3.43s/it]

Query:
When did Anthropic release a prompt generator tool to help guide Claude in generating high-quality prompts, and through what interface is it available?
Correct answer:
On May 10th, 2024, Anthropic released a prompt generator tool that is available through the Developer Console.
Generated anser:
Anthropic released a prompt generator tool to help guide Claude in generating high-quality prompts on June 20th, 2024. This tool is available through the Developer Console, which allows for easier and more powerful prompting in the browser.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer contains a critical error regarding the release date of the prompt generator tool, stating it was released on June 20th, 2024, while the correct answer states it was released on May 10th, 2024. This discrepancy is significant and constitutes a direct contradiction. Therefore, the generated answer is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
<

Evaluating End-to-End:  58%|█████▊    | 58/100 [03:46<02:09,  3.07s/it]

Query:
Which Claude 3 model provides the best balance of intelligence and speed for high-throughput tasks like sales forecasting and targeted marketing?
Correct answer:
The Claude 3 Sonnet model balances intelligence and speed, making it well-suited for high-throughput tasks like sales forecasting and targeted marketing.
Generated anser:
The Claude 3 Sonnet model provides the best balance of intelligence and speed for high-throughput tasks like sales forecasting and targeted marketing. It is designed for strong utility and is balanced for scaled deployments, making it suitable for quick and accurate performance in these applications.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately reflects the information provided in the correct answer. Both answers state that the Claude 3 Sonnet model provides a good balance of intelligence and speed for high-throughput tasks like sales forecasting and targeted marketing. The additional details in the 

Evaluating End-to-End:  59%|█████▉    | 59/100 [03:49<02:05,  3.07s/it]

Query:
How can you calculate the similarity between two Voyage embedding vectors, and what is this equivalent to since Voyage embeddings are normalized to length 1?
Correct answer:
You can calculate the similarity between two Voyage embedding vectors using the dot product, which is equivalent to cosine similarity since Voyage embeddings are normalized to length 1.
Generated anser:
To calculate the similarity between two Voyage embedding vectors, you can use the dot product of the two vectors. Since Voyage embeddings are normalized to length 1, the cosine similarity between the two vectors is equivalent to their dot product. This means you can compute the similarity using the following code snippet:

```python
import numpy as np

similarity = np.dot(embd1, embd2)
```

In this case, `embd1` and `embd2` are the two embedding vectors you want to compare. The resulting `similarity` value will indicate how similar the two embeddings are, with values closer to 1 indicating greater similarity.

Evaluating End-to-End:  60%|██████    | 60/100 [03:56<02:46,  4.16s/it]

Query:
How can using examples in prompts improve Claude's performance on complex tasks?
Correct answer:
Well-chosen examples in prompts can boost Claude's ability to handle complex tasks by reducing misinterpretation of instructions, enforcing consistent structure and style, and serving as a guide for the desired output.
Generated anser:
Using examples in prompts can significantly enhance Claude's performance on complex tasks in several ways:

1. **Accuracy**: Examples help clarify the instructions, reducing the likelihood of misinterpretation. By providing concrete instances of what is expected, Claude can better understand the nuances of the task.

2. **Consistency**: Including examples establishes a uniform structure and style for the responses. This consistency aids Claude in generating outputs that align more closely with the desired format and content.

3. **Performance Boost**: Well-chosen examples can enhance Claude's ability to tackle complex tasks. They serve as reference poi

Evaluating End-to-End:  61%|██████    | 61/100 [04:02<02:58,  4.59s/it]

Query:
What are the two types of content block deltas that can be emitted when streaming responses with tool use, and what does each delta type contain?
Correct answer:
When streaming responses with tool use, the two types of content block deltas are text deltas and input JSON deltas. Text deltas contain a "text" field with a string of the incrementally generated text. Input JSON deltas contain a "partial_json" field with a string containing part of the JSON object specifying the tool's input.
Generated anser:
The two types of content block deltas that can be emitted when streaming responses with tool use are:

1. **Input JSON delta**: This delta type corresponds to updates for the input field of the content block. It contains a partial JSON string that represents a portion of the input data. The final input is always an object, and the deltas are emitted as multiple events to support finer granularity. An example of this delta type is:
   ```json
   {
       "type": "input_json_delta"

Evaluating End-to-End:  62%|██████▏   | 62/100 [04:04<02:34,  4.07s/it]

Query:
What are two key capabilities of Claude that enable it to build interactive systems and personalized user experiences?
Correct answer:
Claude's question answering and text analysis capabilities enable it to build intelligent, interactive systems like chatbots and personalize user experiences by understanding sentiment and preferences.
Generated answer:
Two key capabilities of Claude that enable it to build interactive systems and personalized user experiences are:

1. **Text and Code Generation**: Claude can summarize text, answer questions, extract data, translate text, and generate code. This capability allows for the creation of tailored interactions in customer-facing applications, such as chatbots and personalized content generation, enhancing user engagement and satisfaction.

2. **Vision Processing**: Claude can process and analyze visual input, generating text and code from images. This ability enables the development of interactive systems that can understand and respond

Evaluating End-to-End:  63%|██████▎   | 63/100 [04:08<02:21,  3.82s/it]

Query:
What are the key event types included in a raw HTTP stream response when using message streaming, and what is the typical order they occur in?
Correct answer:
A raw HTTP stream response includes a message_start event, followed by one or more content blocks (each with a content_block_start, content_block_delta events, and content_block_stop), a message_delta event, and a final message_stop event. Ping events may also be dispersed throughout.
Generated answer:
The key event types included in a raw HTTP stream response when using message streaming, in typical order, are as follows:

1. **message_start**: This event contains a Message object with empty content.
2. **content_block_start**: This event marks the beginning of a content block.
3. **content_block_delta**: There can be one or more of these events, indicating changes within the content block.
4. **content_block_stop**: This event marks the end of the content block.
5. **message_delta**: This event indicates top-level changes

Evaluating End-to-End:  64%|██████▍   | 64/100 [04:11<02:07,  3.55s/it]

Query:
What is the maximum number of images that can be included in a single request using the Anthropic API compared to the claude.ai interface?
Correct answer:
The Messages API allows including up to 20 images per request, while the claude.ai interface has a lower limit of up to 5 images per turn.
Generated answer:
The maximum number of images that can be included in a single request using the Anthropic API is up to 20 images per request, while the claude.ai interface allows for up to 5 images per turn.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately reflects the information provided in the correct answer, stating that the Anthropic API allows up to 20 images per request and the claude.ai interface allows up to 5 images per turn. There are no critical pieces of information missing, and there are no contradictions. Therefore, the generated answer is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  65%|██████▌   | 65/100 [04:13<01:51,  3.19s/it]

Query:
When Claude's response is cut off due to hitting the max_tokens limit and contains an incomplete tool use block, what should you do to get the full tool use?
Correct answer:
If Claude's response hits the max_tokens limit and has an incomplete tool use block, you should retry the request with a higher max_tokens value to get Claude's full response including the complete tool use.
Generated answer:
If Claude's response is cut off due to hitting the max_tokens limit and contains an incomplete tool use block, you should retry the request with a higher max_tokens value to get the full tool use.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately reflects the correct answer by stating that if Claude's response is cut off due to hitting the max_tokens limit and contains an incomplete tool use block, the appropriate action is to retry the request with a higher max_tokens value to obtain the full tool use. There are no critical pieces of infor

Evaluating End-to-End:  66%|██████▌   | 66/100 [04:15<01:38,  2.91s/it]

Query:
What two steps are needed before running a classification evaluation on Claude according to the documentation?
Correct answer:
Before running a classification evaluation on Claude, you need to 1) develop your test cases, and 2) take a look at Anthropic's guide to developing test cases.
Generated answer:
The two steps needed before running a classification evaluation on Claude are:

1. Build a strong input prompt.
2. Develop your test cases.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it omits the step of looking at Anthropic's guide to developing test cases, which is a critical part of the process before running a classification evaluation on Claude. While it includes the step of developing test cases, it introduces a different first step (building a strong input prompt) that is not mentioned in the correct answer. Therefore, the generated answer does not accurately reflect the necessary steps outlined in the correct

Evaluating End-to-End:  67%|██████▋   | 67/100 [04:19<01:40,  3.06s/it]

Query:
How can you use the content parameter in the messages list to influence Claude's response?
Correct answer:
You can provide content in the last position of the messages list, with the "assistant" role, to pre-fill part of Claude's response. This allows you to shape the assistant's output.
Generated answer:
You can use the content parameter in the messages list to influence Claude's response by pre-filling part of the assistant's reply. By including a message with the role of "assistant" and providing specific content, you set the stage for how Claude will continue the conversation. For example, if you want Claude to respond in a particular way, you can craft the assistant's message to lead into the desired response. 

For instance, if you want Claude to discuss a favorite color, you might include a message like: 

```python
{"role": "assistant", "content": "As an AI assistant, I don't have a favorite color, but if I had to pick, it would be green because"}
```

This way, Claude wi

Evaluating End-to-End:  68%|██████▊   | 68/100 [04:21<01:35,  2.98s/it]

Query:
What are two key advantages of prompt engineering over fine-tuning when it comes to model comprehension and general knowledge preservation?
Correct answer:
Compared to fine-tuning, prompt engineering is far more effective at helping models understand and utilize external content like retrieved documents. Prompt engineering also preserves the model's broad general knowledge, while fine-tuning risks catastrophic forgetting where the model loses its general capabilities.
Generated answer:
Two key advantages of prompt engineering over fine-tuning when it comes to model comprehension and general knowledge preservation are:

1. **Comprehension Improvements**: Prompt engineering is more effective at helping models better understand and utilize external content, such as retrieved documents. This allows for enhanced comprehension without the need for extensive retraining.

2. **Preservation of General Knowledge**: Fine-tuning can lead to catastrophic forgetting, where the model loses its 

Evaluating End-to-End:  69%|██████▉   | 69/100 [04:24<01:25,  2.75s/it]

Query:
What are the two main steps to get started with making requests to Claude models on Anthropic's Bedrock API?
Correct answer:
To get started making requests to Claude models on Anthropic's Bedrock API, you need to: 1) Install and configure the AWS CLI, and 2) Install an SDK for accessing Bedrock, such as the Python SDK shown in the example code.
Generated answer:
1. Subscribe to Anthropic models by going to the AWS Console, navigating to Bedrock, and requesting access to Anthropic models. Be aware that availability may vary by region.

2. Install and configure the AWS CLI and an SDK for accessing Bedrock, then set your API key to start making requests to the Claude models.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer includes an additional step of subscribing to Anthropic models via the AWS Console, which is not mentioned in the correct answer. While both answers mention installing and configuring the AWS CLI and an SDK, the correct answer

Evaluating End-to-End:  70%|███████   | 70/100 [04:26<01:21,  2.71s/it]

Query:
How can you check which Claude models are available in a specific AWS region using the AWS CLI?
Correct answer:
You can list the available Claude models in a specific AWS region by running the command `aws bedrock list-foundation-models --region=<region> --by-provider anthropic --query "modelSummaries[*].modelId"`, replacing `<region>` with the desired AWS region such as `us-west-2`.
Generated answer:
To check which Claude models are available in a specific AWS region using the AWS CLI, you can use the following command:

```bash
aws bedrock list-foundation-models --region=<your-region> --by-provider=anthropic --query "modelSummaries[*].modelId"
```

Replace `<your-region>` with the desired AWS region, such as `us-west-2`. This command will list the model IDs of all Claude models provided by Anthropic that are available in that region.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The Generated Answer correctly describes how to check which Claude models are av

Evaluating End-to-End:  71%|███████   | 71/100 [04:29<01:14,  2.58s/it]

Query:
What argument can be passed to the voyageai.Client.embed() method or the Voyage HTTP API to specify whether the input text is a query or a document?
Correct answer:
The input_type argument can be passed with a value of "query" or "document" to specify the type of input text being embedded.
Generated answer:
The argument that can be passed to the `voyageai.Client.embed()` method or the Voyage HTTP API to specify whether the input text is a query or a document is `input_type`. This argument can take the values `"query"` or `"document"`. If not specified, it defaults to `None`.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly identifies the argument `input_type` and states that it can take the values `"query"` or `"document"`. It also mentions the default value of `None`, which is additional information but does not contradict the correct answer. Therefore, the generated answer is essentially saying the same thing as the correct answer

Evaluating End-to-End:  72%|███████▏  | 72/100 [04:34<01:36,  3.44s/it]

Query:
How do the streaming API delta formats differ between tool_use content blocks and text content blocks?
Correct answer:
Tool_use content block deltas contain partial JSON strings for the input field, whereas text content block deltas directly contain the text delta. Tool_use deltas may have delays between streaming events as the model emits one complete key-value pair at a time.
Generated answer:
The streaming API delta formats for tool_use content blocks and text content blocks differ primarily in their structure and the type of data they handle.

1. **Tool_use Content Blocks**:
   - The deltas for tool_use content blocks are designed to update the input field of the block. They consist of partial JSON strings that represent incremental updates to the input data.
   - Each delta is emitted as a `content_block_delta` event with a `delta` type of `input_json_delta`, and it includes a `partial_json` field that contains the partial JSON string.
   - The final input for a tool_use con

Evaluating End-to-End:  73%|███████▎  | 73/100 [04:36<01:23,  3.09s/it]

Query:
What are the image file size limits when uploading images to Claude using the API versus on claude.ai?
Correct answer:
When uploading images to Claude, the API has a maximum file size limit of 5MB per image, while on claude.ai the limit is 10MB per image.
Generated answer:
The image file size limits for uploading images to Claude are as follows:

- **API**: Maximum 5MB per image
- **claude.ai**: Maximum 10MB per image

Images larger than these limits will be rejected and return an error when using the respective interface.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately reflects the information provided in the correct answer regarding the image file size limits for the API and claude.ai. Both answers state that the API has a maximum file size limit of 5MB and claude.ai has a limit of 10MB, and they both mention that images larger than these limits will be rejected. Therefore, the generated answer is correct.</explanation>
<is_corr

Evaluating End-to-End:  74%|███████▍  | 74/100 [04:38<01:13,  2.81s/it]

Query:
What is one key consideration when selecting a Claude model for an enterprise use case that needs low latency?
Correct answer:
When selecting a Claude model for an enterprise use case that requires low latency, it's important to choose the model that best balances speed and output quality based on the specific requirements of the use case.
Generated answer:
One key consideration when selecting a Claude model for an enterprise use case that needs low latency is to choose the appropriate model that balances speed and output quality. Selecting a model specifically designed for faster performance can significantly reduce latency, especially for tasks that do not require in-depth thinking or complex analysis.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer captures the essence of the correct answer by emphasizing the importance of balancing speed and output quality when selecting a Claude model for low latency use cases. It also adds the detail a

Evaluating End-to-End:  75%|███████▌  | 75/100 [04:42<01:13,  2.95s/it]

Query:
What embedding model does Anthropic recommend for code retrieval, and how does its performance compare to alternatives according to Voyage AI?
Correct answer:
For code retrieval, Voyage AI recommends using the voyage-code-2 embedding model, which they claim performs 17% better than alternatives and achieves state-of-the-art results on general-purpose corpora as well.
Generated answer:
Anthropic does not offer its own embedding model but recommends using Voyage AI's embedding models for code retrieval. Specifically, Voyage AI suggests the "voyage-code-2" model, which is optimized for code retrieval and reportedly performs 17% better than alternatives. Additionally, it is state-of-the-art on general-purpose corpora, making it a strong choice for various applications.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately reflects the information from the correct answer. It states that Voyage AI recommends the "voyage-code-2" embedding mode

Evaluating End-to-End:  76%|███████▌  | 76/100 [04:45<01:11,  2.98s/it]

Query:
What are two ways the Anthropic Cookbook can help developers learn to use Anthropic's APIs?
Correct answer:
The Anthropic Cookbook provides interactive Jupyter notebooks that demonstrate how to upload PDFs and work with embeddings to help developers learn to use Anthropic's APIs.
Generated answer:
The Anthropic Cookbook can help developers learn to use Anthropic's APIs in two ways:

1. **Interactive Jupyter Notebooks**: The Cookbook provides interactive Jupyter notebooks that demonstrate various functionalities, such as uploading PDFs and working with embeddings. This hands-on approach allows developers to experiment and see real-time results, enhancing their understanding of how to utilize the APIs effectively.

2. **Practical Examples**: By offering practical examples and use cases within the notebooks, the Cookbook helps developers understand the implementation of different features and best practices when working with the Anthropic APIs, making it easier to apply this knowled

Evaluating End-to-End:  77%|███████▋  | 77/100 [04:49<01:15,  3.27s/it]

Query:
How does the size of the context window impact a language model's ability to utilize retrieval augmented generation (RAG)?
Correct answer:
The size of the context window determines how much retrieved information can be passed to the language model to augment its knowledge when generating a response using RAG. A larger context window allows more relevant retrieved information to be utilized by the model, improving the accuracy and groundedness of the generated text.
Generated answer:
The size of the context window significantly impacts a language model's ability to utilize retrieval augmented generation (RAG). A larger context window allows the model to incorporate more retrieved information from the external knowledge base along with the original query. This enables the model to generate responses that are more coherent and relevant, as it can reference a broader range of context and details.

Conversely, a smaller context window may limit the amount of retrieved information that

Evaluating End-to-End:  78%|███████▊  | 78/100 [04:53<01:16,  3.48s/it]

Query:
How can the Evaluation tool in Anthropic's Claude platform help improve prompts and build more robust AI applications?
Correct answer:
The Evaluation tool helps identify edge cases where prompts might falter, allows rating individual results to determine prompt performance, ensures consistent performance across inputs, and enables prompt refinement for better reliability. Reviewing results across test cases helps spot patterns to make informed adjustments that lead to more robust AI applications.
Generated answer:
The Evaluation tool in Anthropic's Claude platform can significantly enhance prompt development and the robustness of AI applications in several ways:

1. **Identifying Edge Cases**: The tool helps users pinpoint specific scenarios where prompts may not perform as expected, allowing for targeted improvements.

2. **Rating Results**: By enabling users to rate individual outputs, the tool provides insights into where prompts excel or fall short, facilitating a better unde

Evaluating End-to-End:  79%|███████▉  | 79/100 [04:56<01:13,  3.50s/it]

Query:
Which Claude model has the fastest comparative latency according to the comparison tables?
Correct answer:
The Claude 3 Haiku model has the fastest comparative latency
Generated answer:
The Claude model with the fastest comparative latency is the Claude 3 Haiku, which is described as the "fastest and most compact model for near-instant responsiveness."
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly identifies the Claude 3 Haiku model as having the fastest comparative latency, which aligns with the correct answer. It also adds that it is described as the "fastest and most compact model for near-instant responsiveness," which does not contradict the correct answer and provides additional context. Therefore, the generated answer is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  80%|████████  | 80/100 [05:02<01:23,  4.18s/it]

Query:
How can you build up a conversation with multiple turns using the Anthropic Messages API in Python?
Correct answer:
To have a multi-turn conversation using the Anthropic Messages API in Python, send the full conversation history in the messages parameter each time, including any prior user and assistant messages. The API is stateless, so the entire context must be provided with each request.
Generated answer:
To build up a conversation with multiple turns using the Anthropic Messages API in Python, you need to maintain the full conversational history and send it with each API request. Here's how you can do it:

1. **Initialize the conversation**: Start with an initial message from the user.
2. **Add responses**: After receiving a response from the assistant, append it to the conversation history.
3. **Continue the conversation**: For each new user input, include the entire conversation history in the API request.

Here’s an example of how to implement this in Python:

```python
i

Evaluating End-to-End:  81%|████████  | 81/100 [05:08<01:27,  4.59s/it]

Query:
How can using XML tags to provide a specific role or context help improve Claude's analysis of a legal contract compared to not using a role prompt?
Correct answer:
Providing Claude with a specific role, such as being the General Counsel of a company, using XML tags can help it catch critical legal issues and risks in a contract that it might miss without the role context, potentially saving the company millions of dollars.
Generated answer:
Using XML tags to provide a specific role or context can significantly enhance Claude's analysis of a legal contract compared to not using a role prompt in several ways:

1. **Clarity and Structure**: XML tags help clearly separate different parts of the prompt, making it easier for Claude to understand the specific elements of the legal contract that need analysis. This structured approach reduces ambiguity and ensures that Claude focuses on the relevant sections.

2. **Accuracy**: By using XML tags, the risk of misinterpretation is minimize

Evaluating End-to-End:  82%|████████▏ | 82/100 [05:11<01:16,  4.23s/it]

Query:
What are the key differences between how Claude 3 Opus and Claude 3 Sonnet handle missing information when making tool calls?
Correct answer:
When required parameters are missing, Claude 3 Opus is more likely to ask the user for the missing information, while Claude 3 Sonnet is more likely to try to infer reasonable values on its own to proceed with the tool call.
Generated answer:
Claude 3 Opus and Claude 3 Sonnet handle missing information during tool calls differently based on their design and intended use cases. 

Claude 3 Opus is optimized for complex tasks and ambiguous queries. It is more adept at managing multiple tools and is designed to seek clarification when it encounters missing parameters. This means that when Opus identifies that certain information is lacking, it actively prompts for additional details to ensure that the tool call can be executed accurately.

In contrast, Claude 3 Sonnet, while still intelligent, is more balanced in terms of speed and utility. It 

Evaluating End-to-End:  83%|████████▎ | 83/100 [05:17<01:19,  4.69s/it]

Query:
What steps should be taken to ensure a reliable deployment of an automated ticket routing system using Claude into a production environment?
Correct answer:
To ensure a reliable production deployment of Claude for ticket routing, key steps include implementing retry logic to handle errors, conducting thorough staging and load testing, setting up error handling and logging, using a gradual rollout process, providing documentation and training, and establishing monitoring and alerting.
Generated answer:
To ensure a reliable deployment of an automated ticket routing system using Claude into a production environment, follow these steps:

1. **Implement Retry Logic**: Add try/except logic to handle cases where Claude may not return the expected output or is temporarily unavailable. Incorporate back-off logic to retry requests after increasing intervals or adjust the temperature for output variations.

2. **Conduct Thorough Staging Testing**: Test the system extensively in a staging en

Evaluating End-to-End:  84%|████████▍ | 84/100 [05:22<01:18,  4.94s/it]

Query:
How should you evaluate a model's performance on a ticket routing classifier?
Correct answer:
You should evaluate performance in terms of accuracy, cost, and speed.
Generated answer:
To evaluate a model's performance on a ticket routing classifier, you should consider the following factors:

1. **Accuracy**: Measure how well the model correctly classifies tickets into the appropriate categories or teams. This can be done using metrics such as precision, recall, F1 score, and overall accuracy. A confusion matrix can also help visualize the model's performance across different classes.

2. **Cost**: Assess the operational costs associated with using the model, including computational resources, infrastructure, and any potential costs related to errors (e.g., misrouted tickets leading to delays or customer dissatisfaction). It's important to balance the model's performance with its cost-effectiveness.

3. **Speed**: Evaluate the response time of the model in classifying tickets. Thi

Evaluating End-to-End:  85%|████████▌ | 85/100 [05:24<01:01,  4.13s/it]

Query:
What two methods does Anthropic recommend for learning how to prompt engineer with Claude before diving into the techniques?
Correct answer:
Anthropic recommends trying their interactive GitHub prompting tutorial and Google Sheets prompting tutorial to learn prompt engineering concepts before diving into the techniques in the documentation.
Generated answer:
Anthropic recommends two methods for learning how to prompt engineer with Claude before diving into the techniques: 

1. Engaging with the in-depth prompt engineering interactive tutorial that utilizes Claude for Sheets.
2. Exploring the Prompt Library for example prompts to understand effective prompting strategies.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer mentions engaging with an interactive tutorial and exploring a Prompt Library, but it does not specifically mention the interactive GitHub prompting tutorial or the Google Sheets prompting tutorial as recommended by Anthropic. 

Evaluating End-to-End:  86%|████████▌ | 86/100 [05:29<00:59,  4.23s/it]

Query:
What are the key differences between a pretrained large language model and Claude in terms of their training and capabilities?
Correct answer:
Pretrained large language models are trained on unlabeled text data to predict the next word given the previous context, but are not inherently good at answering questions or following instructions without prompt engineering. In contrast, Claude is a large language model that has been further fine-tuned and trained using RLHF to be more helpful, honest, and capable of performing a wider range of useful tasks.
Generated answer:
The key differences between a pretrained large language model (LLM) and Claude primarily lie in their training processes and capabilities. 

1. **Training Process**:
   - Pretrained LLMs undergo an initial phase called pretraining, where they are trained on a large corpus of unlabeled text to predict the next word in a sequence based on the previous context. This phase does not focus on specific tasks like answering 

Evaluating End-to-End:  87%|████████▋ | 87/100 [05:34<00:58,  4.48s/it]

Query:
What are some key advantages of using prompt engineering instead of fine-tuning to adapt a pretrained language model for a specific task or domain?
Correct answer:
Prompt engineering is typically faster, more cost-effective, requires less data and compute resources, and preserves the model's general knowledge compared to fine-tuning. It also allows for greater flexibility, rapid iteration, and transparency.
Generated answer:
Key advantages of using prompt engineering instead of fine-tuning to adapt a pretrained language model for a specific task or domain include:

1. **Resource Efficiency**: Prompt engineering requires only text input, making it much more resource-friendly compared to fine-tuning, which demands high-end GPUs and large memory.

2. **Cost-Effectiveness**: Fine-tuning can incur significant costs, especially for cloud-based AI services. Prompt engineering typically utilizes the base model, which is cheaper.

3. **Maintaining Model Updates**: Fine-tuned models may re

Evaluating End-to-End:  88%|████████▊ | 88/100 [05:37<00:48,  4.05s/it]

Query:
How can you authenticate with GCP before running requests to access Claude models on Vertex AI?
Correct answer:
Before running requests to access Claude models on Vertex AI, you may need to run `gcloud auth application-default login` to authenticate with GCP.
Generated answer:
To authenticate with Google Cloud Platform (GCP) before running requests to access Claude models on Vertex AI, you need to run the following command in your terminal:

```
gcloud auth application-default login
```

This command will allow you to authenticate your application with GCP, enabling you to make requests to the Claude models hosted on Vertex AI. After authenticating, you can proceed to use the appropriate SDK (like the `AnthropicVertex` client) to interact with the models.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly states the command `gcloud auth application-default login` for authenticating with GCP before accessing Claude models on Vertex AI.

Evaluating End-to-End:  89%|████████▉ | 89/100 [05:39<00:38,  3.49s/it]

Query:
What new capabilities and features were introduced by Anthropic on May 10th, 2024 and how do they enable users to create and tailor prompts for specific tasks?
Correct answer:
According to the information provided, on May 10th, 2024, Anthropic introduced a new "Prompt Generator" tool in the Developer Console. This tool is designed to help users guide Claude to generate high-quality prompts tailored to their specific tasks. The text states that the Prompt Generator "makes it easy to guide Claude to generate a high-quality prompts tailored to your specific tasks." This indicates that the Prompt Generator feature provides users with the ability to create customized prompts for Claude, going beyond the standard prompting capabilities. By combining this information with the details about the Claude iOS app and the Claude Team plan released around the same time, we can infer that Anthropic was expanding its platform and tools to provide users with more advanced capabilities for intera

Evaluating End-to-End:  90%|█████████ | 90/100 [05:41<00:30,  3.08s/it]

Query:
On what date did both the Claude 3.5 Sonnet model and the Artifacts feature in Claude.ai become available?
Correct answer:
Both Claude 3.5 Sonnet and the Artifacts feature in Claude.ai became available on June 20th, 2024.
Generated answer:
Both the Claude 3.5 Sonnet model and the Artifacts feature in Claude.ai became available on June 20th, 2024.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer states that both the Claude 3.5 Sonnet model and the Artifacts feature in Claude.ai became available on June 20th, 2024, which is exactly the same information provided in the correct answer. There are no critical pieces of information missing, and there are no contradictions. Therefore, the generated answer is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>
Processed 90/100 questions. Current Accuracy: 0.7667


Evaluating End-to-End:  91%|█████████ | 91/100 [05:43<00:24,  2.73s/it]

Query:
When putting words in Claude's mouth to shape the response, what header and value can you use in the request to limit Claude's response to a single token?
Correct answer:
You can use "max_tokens": 1 in the request to limit Claude's response to a single token when putting words in its mouth.
Generated answer:
To limit Claude's response to a single token, you can use the header `"max_tokens": 1` in your request.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly states that you can use the header "max_tokens": 1 to limit Claude's response to a single token, which is consistent with the correct answer. There are no critical pieces of information missing, and there are no contradictions. Therefore, the generated answer is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  92%|█████████▏| 92/100 [05:46<00:21,  2.68s/it]

Query:
What does the temperature parameter do when working with large language models?
Correct answer:
Temperature is a parameter that controls the randomness of the model during generation
Generated answer:
The temperature parameter controls the randomness of a language model's predictions during text generation. A higher temperature results in more creative and diverse outputs, allowing for variations in phrasing and responses. Conversely, a lower temperature leads to more conservative and deterministic outputs, favoring the most probable choices. By adjusting the temperature, users can encourage the model to explore less common or surprising word choices and sequences, rather than sticking solely to the most likely predictions.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer accurately describes the function of the temperature parameter in controlling the randomness of a language model's predictions during text generation. It elaborates on how h

Evaluating End-to-End:  93%|█████████▎| 93/100 [05:49<00:20,  2.95s/it]

Query:
What are two ways to specify API parameters when calling the Claude API using Claude for Sheets?
Correct answer:
When calling the Claude API using Claude for Sheets, you can specify API parameters in two ways: 1) As additional arguments after the prompt and model in the CLAUDE() function, like =CLAUDE(prompt, model, "max_tokens", 3). 2) By passing in an API key to be used just for a specific cell, like "api_key", "sk-ant-api03-j1W..."
Generated answer:
Two ways to specify API parameters when calling the Claude API using Claude for Sheets are:

1. **Simple Prompt**: You can use a basic prompt by typing in a cell like this: `=CLAUDE("Claude, in one sentence, what's good about the color blue?")`. This will send the prompt without any additional parameters.

2. **Adding Parameters**: You can specify parameters by including them after the initial prompt. For example, you can type in a cell: `=CLAUDE("Hi, Claude!", "claude-3-haiku-20240307", "max_tokens", 3)`. Here, the model is specif

Evaluating End-to-End:  94%|█████████▍| 94/100 [05:54<00:19,  3.33s/it]

Query:
How does prefilling the response with an opening curly brace ({ ) affect Claude's output when extracting structured data from text?
Correct answer:
Prefilling Claude's response with { causes it to skip the preamble explanation and directly output the extracted data as a JSON object, resulting in a more concise response that is easier for programs to parse without additional processing.
Generated answer:
Prefilling the response with an opening curly brace ({) can significantly influence Claude's output when extracting structured data from text. By starting the response with a curly brace, it signals to Claude that the expected output format is likely structured data, such as JSON. This can guide Claude to produce a response that adheres to this format, thereby enhancing the likelihood of receiving well-structured and parsable data.

When Claude recognizes the prefilled curly brace, it may interpret the context as requiring a structured response, which can lead to outputs that are 

Evaluating End-to-End:  95%|█████████▌| 95/100 [06:00<00:20,  4.14s/it]

Query:
What are some helpful resources provided by Anthropic to dive deeper into building with images using Claude?
Correct answer:
Anthropic provides a multimodal cookbook with tips on getting started with images and best practices, as well as API reference documentation for the Messages API that includes example API calls involving images.
Generated answer:
To dive deeper into building with images using Claude, you can explore the following resources provided by Anthropic:

1. **How to Use Vision**: This section explains how to utilize Claude's vision capabilities. You can upload images via:
   - **claude.ai**: Upload an image like a file or drag and drop it directly into the chat window.
   - **The Console Workbench**: If you select a model that accepts images (Claude 3 models only), a button to add images will appear at the top right of every User message block.
   - **API Request**: Refer to the examples provided in the guide for making API requests that include images.

2. **Devel

Evaluating End-to-End:  96%|█████████▌| 96/100 [06:03<00:16,  4.00s/it]

Query:
How do you specify the API key when creating a new Anthropic client in the Python and TypeScript SDK examples?
Correct answer:
In both the Python and TypeScript examples, you can specify the API key as a string parameter when creating a new Anthropic client object. If no API key is provided, it defaults to using the ANTHROPIC_API_KEY environment variable.
Generated answer:
To specify the API key when creating a new Anthropic client in the Python SDK, you can either set the API key in an environment variable named `ANTHROPIC_API_KEY` or supply it directly when initializing the client.

In Python, you can set the API key like this:

```python
import anthropic

client = anthropic.Client(api_key='your-api-key-here')
```

In TypeScript, you would typically set the API key in a similar manner when initializing the client:

```typescript
import { AnthropicClient } from 'anthropic-sdk';

const client = new AnthropicClient({ apiKey: 'your-api-key-here' });
```

Alternatively, for both lan

Evaluating End-to-End:  97%|█████████▋| 97/100 [06:06<00:10,  3.64s/it]

Query:
What are two key benefits of using the Anthropic Evaluation tool when developing prompts for an AI classification application?
Correct answer:
The Evaluation tool helps identify edge cases where the prompt might falter, and ensures consistent performance across a range of test case inputs. This allows you to refine the prompt for better reliability in the AI classification application.
Generated answer:
Two key benefits of using the Anthropic Evaluation tool when developing prompts for an AI classification application are:

1. **Identification of Edge Cases**: The Evaluation tool helps developers identify edge cases where the prompt might not perform well. This is crucial for refining the prompt to ensure it can handle a wide range of inputs effectively.

2. **Performance Consistency**: The tool allows for the rating of individual results, enabling developers to assess the performance of their prompts across different test cases. This ensures consistent performance and reliabilit

Evaluating End-to-End:  98%|█████████▊| 98/100 [06:12<00:08,  4.22s/it]

Query:
What are the key differences between a pretrained language model like Claude's underlying model, and the final version of Claude available through Anthropic's API?
Correct answer:
The pretrained language model that forms Claude's foundation is not inherently good at answering questions or following instructions. To create the helpful, honest and safe Claude assistant available through the API, the pretrained model underwent fine-tuning and reinforcement learning from human feedback (RLHF).
Generated answer:
The key differences between a pretrained language model like Claude's underlying model and the final version of Claude available through Anthropic's API include:

1. **Training and Fine-Tuning**: The pretrained model is typically trained on a broad dataset to develop a general understanding of language. The final version of Claude, such as Claude 3.5 Sonnet, undergoes additional fine-tuning and optimization to improve its performance on specific tasks, making it more effective

Evaluating End-to-End:  99%|█████████▉| 99/100 [06:14<00:03,  3.51s/it]

Query:
What is the IPv6 address range used by Anthropic?
Correct answer:
The IPv6 address range used by Anthropic is 2607:6bc0::/48.
Generated answer:
The IPv6 address range used by Anthropic is 2607:6bc0::/48.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer matches the correct answer exactly, providing the same IPv6 address range used by Anthropic. There are no missing pieces of information or contradictions between the two answers.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End: 100%|██████████| 100/100 [06:16<00:00,  3.77s/it]

Query:
When using the Python SDK to create a message with Claude, what are two ways you can specify your API key?
Correct answer:
When using the Python SDK, you can specify your API key either by passing it as the api_key parameter when initializing the Anthropic client, or by setting it as an environment variable named ANTHROPIC_API_KEY which the client will use by default.
Generated answer:
You can specify your API key in two ways when using the Python SDK to create a message with Claude:

1. Set the API key as an environment variable named `ANTHROPIC_API_KEY`. This allows you to omit the API key from your code.
2. Supply the API key directly when initializing the Anthropic client, like this: `client = anthropic.Anthropic(api_key='your-api-key-here')`.
Response_text from judge LLM:
<evaluation>
<content>
<explanation>The generated answer correctly identifies both ways to specify the API key when using the Python SDK: setting it as an environment variable and supplying it directly when




In [39]:
!cat evaluation/json_results/evaluation_results_one.json 

{
  "name": "Basic RAG",
  "average_precision": 0.3933333333333335,
  "average_recall": 0.6183333333333334,
  "average_f1": 0.48081274025260856,
  "average_mrr": 0.7333333333333334,
  "end_to_end_accuracy": 0.78
}
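
A quick sanity check on the summary above: the reported `average_f1` is consistent with the harmonic mean of `average_precision` and `average_recall`, rather than an average of per-question F1 scores. This is an inference from the numbers, not something the notebook states. The helper name `f1_from` is hypothetical and used only for illustration.

```python
import math

# Values copied from evaluation_results_one.json above.
average_precision = 0.3933333333333335
average_recall = 0.6183333333333334
reported_f1 = 0.48081274025260856


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_from(average_precision, average_recall)
print(f"recomputed F1: {f1:.6f}")
print("matches reported value:", math.isclose(f1, reported_f1, rel_tol=1e-9))
```

Because F1 is derived from the two averages, it may differ from the mean of per-question F1 scores whenever precision and recall vary a lot across questions.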

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
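
The warning above is harmless. It appears because `!cat` forks a subprocess after the Hugging Face tokenizer has already used parallelism. As the message suggests, one way to silence it is to set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in a cell near the top of the notebook, before the embedding model is first used:

```python
import os

# Set before the tokenizer is first used; "false" disables tokenizer
# parallelism, which avoids the fork warning shown above.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```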


In [40]:
!cat evaluation/csvs/evaluation_results_detailed.csv

question,retrieval_precision,retrieval_recall,retrieval_mrr,e2e_correct
How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?,0.3333333333333333,0.5,1.0,True
"What embeddings provider does Anthropic recommend for customized domain-specific models, and what capabilities does this provider offer?",0.6666666666666666,1.0,1.0,True
"What are some key success metrics to consider when evaluating Claude's performance on a classification task, and how do they relate to choosing the right model to reduce latency?",0.6666666666666666,1.0,1.0,True
What are two ways that Claude for Sheets can improve prompt engineering workflows compared to using chained prompts?,0.3333333333333333,0.5,1.0,False
"What happens if a prompt for the Text Completions API is missing the ""\n\nHuman:"" and ""\n\nAssistant:"" turns?",0.6666666666666666,1.0,1.0,True
How do the additional tokens required for tool use in Claude API requests impact pricing compared to regular API requests?,
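
The per-question CSV is easier to inspect with pandas than with `cat`. The sketch below is illustrative only: it inlines the metric values from the five complete rows shown above, with the question text replaced by placeholder IDs (`q1` to `q5`) to keep the example short. In the notebook you would presumably pass the path `evaluation/csvs/evaluation_results_detailed.csv` to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Metric values copied from the first five rows of the CSV output above;
# q1..q5 are placeholder IDs standing in for the full question text.
csv_text = """question,retrieval_precision,retrieval_recall,retrieval_mrr,e2e_correct
q1,0.3333333333333333,0.5,1.0,True
q2,0.6666666666666666,1.0,1.0,True
q3,0.6666666666666666,1.0,1.0,True
q4,0.3333333333333333,0.5,1.0,False
q5,0.6666666666666666,1.0,1.0,True
"""

df = pd.read_csv(io.StringIO(csv_text))

# Average the retrieval metrics and compute end-to-end accuracy.
metric_cols = ["retrieval_precision", "retrieval_recall", "retrieval_mrr"]
summary = df[metric_cols].mean().to_dict()
summary["e2e_accuracy"] = float(df["e2e_correct"].mean())

# Questions the judge marked incorrect are the ones worth reading by hand.
failures = df.loc[~df["e2e_correct"], "question"].tolist()

print(summary)
print("incorrect:", failures)
```

Listing the failed questions alongside their retrieval scores helps separate retrieval misses (low recall) from generation errors (good retrieval, wrong answer).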

