# Retrieval Augmented Generation

LLMs excel at a wide range of tasks but struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables the LLM to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions. Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial and legal analysis, and much more.

In this guide, we'll demonstrate how to build and optimize a RAG system using the Anthropic documentation as our knowledge base. We'll walk you through:

1. Generating embeddings with the `intfloat/multilingual-e5-large-instruct` model, where input is truncated to at most 512 tokens
2. Setting up an in-memory vector database, using a class from Anthropic
3. Building a robust evaluation suite. We'll go beyond 'vibes'-based evals and show you how to measure the retrieval pipeline and end-to-end performance independently
4. Implementing advanced techniques to improve RAG, including summary indexing and re-ranking with Claude

Through a series of targeted improvements, we achieved significant performance gains on the following metrics compared to a basic RAG pipeline (we'll explain what all these metrics *mean* in a bit)

## Table of Contents

1) Setup
2) Level 1 - Basic RAG
3) Building an Evaluation System

## Setup

We'll need a few libraries and models:

1. `intfloat/multilingual-e5-large-instruct`, loaded through `sentence-transformers`, to generate high-quality embeddings
2. `openai`, the client library for the LLM used for generation
3. `pandas`, `numpy`, `matplotlib`, and `scikit-learn` for data manipulation and visualization


In [1]:
# Install dependencies quietly (-q)
!pip install openai -q
!pip install pandas -q
!pip install numpy -q
!pip install matplotlib -q
!pip install seaborn -q
!pip install -U scikit-learn -q
!pip install sentence-transformers -q

In [2]:
import os
import getpass
from openai import OpenAI
OPENAI_API_KEY = getpass.getpass("Enter OpenAI API key")
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
# print(os.environ.get("OPENAI_API_KEY"))
client = OpenAI()

Enter OpenAI API key ········


### Download the Embeddings model and run a quick test

In [3]:
from sentence_transformers import SentenceTransformer

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, '南瓜的家常做法')
]
# No need to add instruction for retrieval documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"
]
input_texts = queries + documents

model = SentenceTransformer('intfloat/multilingual-e5-large-instruct')

embeddings = model.encode(input_texts, convert_to_tensor=True, normalize_embeddings=True)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[91.92853546142578, 67.5802993774414], [70.38143157958984, 92.13307189941406]]


[[91.92853546142578, 67.58030700683594], [70.38142395019531, 92.1330795288086]]
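
Because the embeddings are normalized, each entry of `scores` is a cosine similarity scaled by 100, with one row per query and one column per document. As a quick sanity check, the sketch below copies the printed scores (rounded) and picks the best-matching document for each query; the variable names are illustrative.

```python
import numpy as np

# Scores copied (rounded) from the output above: rows = queries, columns = documents
scores = np.array([
    [91.93, 67.58],
    [70.38, 92.13],
])

# Index of the highest-scoring document for each query
best_doc = scores.argmax(axis=1)
print(best_doc.tolist())  # [0, 1]
```

Each query matches the document written in its own language and on its own topic, which is what we'd hope to see from a multilingual retrieval model.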


### Initialize a Vector DB Class

In this example, we're using an in-memory vector DB, but for a production application, you may want to use a hosted solution. 

In [4]:
import os
import pickle
import json
import numpy as np

class VectorDB:
    def __init__(self, name, api_key=None):
        self.name = name
        self.embeddings = []
        self.metadata = []
        self.query_cache = {}
        self.db_path = f"./data/{name}/vector_db.pkl"

    def load_vec_db_in_memory(self, data):
        if self.embeddings and self.metadata:
            print("Vector database is already loaded. Skipping data loading.")
            return
        if os.path.exists(self.db_path):
            print("Loading vector database from disk.")
            self.load_vec_db()
            return

        texts = [f"Heading: {item['chunk_heading']}\n\n Chunk Text:{item['text']}" for item in data]
        self._embed_and_store(texts, data)
        self.save_db()
        print("Vector database loaded and saved.")

    def _embed_and_store(self, texts, data):
        batch_size = 128
        result = [
            model.encode(texts[i : i + batch_size])
            for i in range(0, len(texts), batch_size)
        ]
        self.embeddings = [embedding for batch in result for embedding in batch]
        self.metadata = data

    def search(self, query, k=5, similarity_threshold=0.75):
        if query in self.query_cache:
            query_embedding = self.query_cache[query]
        else:
            # query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
            query_embedding = model.encode(query)
            self.query_cache[query] = query_embedding

        if not self.embeddings:
            raise ValueError("No data loaded in the vector database.")

        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[::-1]
        top_examples = []
        
        for idx in top_indices:
            if similarities[idx] >= similarity_threshold:
                example = {
                    "metadata": self.metadata[idx],
                    "similarity": similarities[idx],
                }
                top_examples.append(example)
                
                if len(top_examples) >= k:
                    break
        # self.save_db()
        return top_examples

    def save_db(self):
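        data = {
            "embeddings": self.embeddings,
            "metadata": self.metadata,
            # model.encode returns NumPy arrays, which json cannot serialize, so store plain lists
            "query_cache": json.dumps({q: np.asarray(e).tolist() for q, e in self.query_cache.items()}),
        }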
        os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
        with open(self.db_path, "wb") as file:
            pickle.dump(data, file)

    def load_vec_db(self):
        if not os.path.exists(self.db_path):
            raise ValueError("Vector database file not found. Use load_vec_db_in_memory to create a new database.")
        with open(self.db_path, "rb") as file:
            data = pickle.load(file)
        self.embeddings = data["embeddings"]
        self.metadata = data["metadata"]
        self.query_cache = json.loads(data["query_cache"])
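
To see what `search` computes, here is a minimal standalone sketch of the same ranking logic on toy 2-D vectors. It assumes unit-length vectors, so the dot product equals cosine similarity; the `top_k` helper and the toy data are illustrative and not part of the class above.

```python
import numpy as np

def top_k(embeddings, query_embedding, k=2, similarity_threshold=0.75):
    """Return (index, similarity) pairs, best first, at or above the threshold."""
    similarities = np.dot(embeddings, query_embedding)
    order = np.argsort(similarities)[::-1]  # indices sorted by descending similarity
    hits = [
        (int(i), float(similarities[i]))
        for i in order
        if similarities[i] >= similarity_threshold
    ]
    return hits[:k]

# Three toy unit-length "chunk" vectors and one unit-length query vector
chunks = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([0.8, 0.6])

hits = top_k(chunks, query)
print(hits)  # chunk 1 ranks first, then chunk 0; chunk 2 falls below the threshold
```

The threshold matters as much as `k`: with `similarity_threshold=0.75`, a query can return fewer than `k` results, or none at all, if no chunk is similar enough.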

## Level 1 - Basic RAG

To get started, we'll set up a basic RAG pipeline using a bare-bones approach, sometimes called 'Naive RAG'. A basic RAG pipeline includes the following 3 steps:

1) Chunk documents by heading, so each chunk contains only the content under one subheading

2) Embed each chunk

3) Use cosine similarity to retrieve the chunks most relevant to the query, then use them as context to answer it
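
The dataset used in this guide is already chunked, so step 1 is not shown in the code. As a rough illustration of what chunking by heading could look like, here is a minimal sketch that assumes Markdown-style headings. The `chunk_by_heading` helper is hypothetical; only the `chunk_heading` and `text` field names are taken from the dataset.

```python
import re

def chunk_by_heading(markdown_text):
    """Split a Markdown document into chunks, one per heading."""
    chunks = []
    heading, lines = None, []
    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            # A new heading closes the previous chunk, if there was one
            if heading is not None:
                chunks.append({"chunk_heading": heading, "text": "\n".join(lines).strip()})
            heading, lines = match.group(1).strip(), []
        elif heading is not None:
            lines.append(line)
    if heading is not None:
        chunks.append({"chunk_heading": heading, "text": "\n".join(lines).strip()})
    return chunks

doc = "# Setup\nInstall the package.\n\n## Usage\nCall the API.\nRead the response."
chunks = chunk_by_heading(doc)
print([c["chunk_heading"] for c in chunks])  # ['Setup', 'Usage']
```

Text that appears before the first heading is dropped in this sketch; a real pipeline would need to decide how to handle it.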

In [5]:
import json
import matplotlib.pyplot as plt
import xml.etree.ElementTree as ET
from tqdm import tqdm
import logging
from typing import Callable, List, Dict, Any, Tuple, Set


def retrieve_similar(query, db):
    results = db.search(query, k=3)
    context = ""
    for result in results:
        chunk = result['metadata']
        context += f"\n{chunk['text']}\n"
    return results, context

def construct_prompt(query, context):
    # query = "How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool"
    
    prompt = f"""
    You have been tasked with helping us to answer the following query: 
    <query>
    {query}
    </query>
    You have access to the following documents which are meant to provide context as you answer the query:
    <documents>
    {context}
    </documents>
    Please remain faithful to the underlying context, and only deviate from it if you are 100% sure that you know the answer already. 
    Answer the question now, and avoid providing preamble such as 'Here is the answer', etc
    """

    return prompt

def answer_query_from_context(query, context):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": construct_prompt(query, context)
            }
        ],
        temperature=0.2
    )
    return completion.choices[0].message.content

# Load the evaluation dataset
with open('evaluation/docs_evaluation_dataset.json', 'r') as f:
    eval_data = json.load(f)

# Load the Anthropic documentation
with open('data/anthropic_docs.json', 'r') as f:
    anthropic_docs = json.load(f)

# Initialize the VectorDB
db = VectorDB("anthropic_docs")
db.load_vec_db_in_memory(anthropic_docs)

# test
query = "How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?"
context = ""'Creating Test Cases\n\n\nWhen you first access the Evaluation screen, you’ll see a single row:\n\nTo add more test cases:\nClick the ‘Add Test Case’ button.\nFill in values for each variable in your prompt.\nRepeat to create multiple scenarios.\nHere’s an example of a populated Evaluation screen with several test cases:\n\nIf you update your original prompt text, you can re-run the entire eval suite against the new prompt to see how changes affect performance across all test cases.\nIf you update your original prompt text, you can re-run the entire eval suite against the new prompt to see how changes affect performance across all test cases.\n\nIf you update your original prompt text, you can re-run the entire eval suite against the new prompt to see how changes affect performance across all test cases.\nIf you update your original prompt text, you can re-run the entire eval suite against the new prompt to see how changes affect performance across all test cases.\n'
print(retrieve_similar(query, db))
print(answer_query_from_context(query, context))

Vector database loaded and saved.
([{'metadata': {'chunk_link': 'https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases', 'chunk_heading': 'Creating Test Cases', 'text': 'Creating Test Cases\n\n\nWhen you first access the Evaluation screen, you’ll see a single row:\n\nTo add more test cases:\nClick the ‘Add Test Case’ button.\nFill in values for each variable in your prompt.\nRepeat to create multiple scenarios.\nHere’s an example of a populated Evaluation screen with several test cases:\n\nIf you update your original prompt text, you can re-run the entire eval suite against the new prompt to see how changes affect performance across all test cases.\nIf you update your original prompt text, you can re-run the entire eval suite against the new prompt to see how changes affect performance across all test cases.\n\nIf you update your original prompt text, you can re-run the entire eval suite against the new prompt to see how changes affect performance across al

## Eval Setup

When evaluating RAG applications, it's critical to evaluate the performance of the retrieval system and the end-to-end system separately.

We synthetically generated an evaluation dataset consisting of 100 samples which include the following:
- A question
- Chunks from our docs which are relevant to that question. This is what we expect our retrieval system to retrieve when the question is asked
- A correct answer to the question.

This is a relatively challenging dataset. Some of our questions require synthesis between more than one chunk in order to be answered correctly, so it's important that our system can load in more than one chunk at a time. You can inspect the dataset by opening `evaluation/docs_evaluation_dataset.json`.
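
To get a feel for how many questions depend on more than one chunk, a small helper like the one below could be run over the dataset. This is a sketch: `multi_chunk_share` and the toy items are made up for illustration, and only the `correct_chunks` field name comes from the dataset file.

```python
def multi_chunk_share(items):
    """Fraction of items whose answer depends on more than one chunk."""
    if not items:
        return 0.0
    multi = sum(1 for item in items if len(item["correct_chunks"]) > 1)
    return multi / len(items)

# Toy items with placeholder links, standing in for real dataset entries
toy_items = [
    {"question": "q1", "correct_chunks": ["https://example.com/docs#a", "https://example.com/docs#b"]},
    {"question": "q2", "correct_chunks": ["https://example.com/docs#c"]},
]
print(multi_chunk_share(toy_items))  # 0.5
```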

Run the next cell to see a preview of the dataset.

In [6]:
#previewing our eval dataset
import json

def preview_json(file_path, num_items=4):
    try:
        with open(file_path, 'r') as file:
            data = json.load(file)
            
        if isinstance(data, list):
            preview_data = data[:num_items]
        elif isinstance(data, dict):
            preview_data = dict(list(data.items())[:num_items])
        else:
            print(f"Unexpected data type: {type(data)}. Cannot preview.")
            return
        
        print(f"Preview of the first {num_items} items from {file_path}:")
        print(json.dumps(preview_data, indent=2))
        print(f"\nTotal number of items: {len(data)}")
        
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except json.JSONDecodeError:
        print(f"Invalid JSON in file: {file_path}")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

preview_json('evaluation/docs_evaluation_dataset.json')

Preview of the first 4 items from evaluation/docs_evaluation_dataset.json:
[
  {
    "id": "efc09699",
    "question": "How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?",
    "correct_chunks": [
      "https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases",
      "https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases"
    ],
    "correct_answer": "To create multiple test cases in the Anthropic Evaluation tool, click the 'Add Test Case' button, fill in values for each variable in your prompt, and repeat the process to create additional test case scenarios."
  },
  {
    "id": "1305ea00",
    "question": "What embeddings provider does Anthropic recommend for customized domain-specific models, and what capabilities does this provider offer?",
    "correct_chunks": [
      "https://docs.anthropic.com/en/docs/build-with-claude/embeddings#before-implementing-embeddings",
      "h

## Defining Our Metric Calculation Functions
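
Precision is the share of retrieved chunks that are correct, recall is the share of correct chunks that were retrieved, and MRR (mean reciprocal rank) averages 1 / rank of the first correct chunk across questions. As a rough worked example for a single question, with made-up links standing in for real `chunk_link` values:

```python
retrieved = ["doc#x", "doc#a", "doc#b"]  # hypothetical retrieved links, in rank order
correct = {"doc#a", "doc#b"}             # hypothetical ground-truth links

true_positives = len(set(retrieved) & correct)
precision = true_positives / len(retrieved)  # share of retrieved links that are correct
recall = true_positives / len(correct)       # share of correct links that were retrieved

# Reciprocal rank: 1 / position of the first correct link (0 if none is found)
reciprocal_rank = next(
    (1 / rank for rank, link in enumerate(retrieved, 1) if link in correct), 0
)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), reciprocal_rank, round(f1, 4))
```

Here two of the three retrieved links are correct (precision 2/3), both correct links were found (recall 1.0), and the first correct link sits at rank 2 (reciprocal rank 0.5). Note that the evaluation code in this guide computes F1 from the precision and recall averaged across all questions, rather than per question as in this sketch.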

In [7]:
def calculate_mrr(retrieved_links: List[str], correct_links: Set[str]) -> float:
    for i, link in enumerate(retrieved_links, 1):
        if link in correct_links:
            return 1 / i
    return 0

def evaluate_retrieval(retrieval_function: Callable, evaluation_data: List[Dict[str, Any]], db: Any) -> Tuple[float, float, float, float, List[float], List[float], List[float]]:
    precisions = []
    recalls = []
    mrrs = []
    
    for i, item in enumerate(tqdm(evaluation_data, desc="Evaluating Retrieval")):
        try:
            retrieved_chunks, _ = retrieval_function(item['question'], db)
            retrieved_links = [chunk['metadata'].get('chunk_link', chunk['metadata'].get('url', '')) for chunk in retrieved_chunks]
        except Exception as e:
            logging.error(f"Error in retrieval function: {e}")
            continue

        correct_links = set(item['correct_chunks'])
        
        true_positives = len(set(retrieved_links) & correct_links)
        precision = true_positives / len(retrieved_links) if retrieved_links else 0
        recall = true_positives / len(correct_links) if correct_links else 0
        mrr = calculate_mrr(retrieved_links, correct_links)
        
        precisions.append(precision)
        recalls.append(recall)
        mrrs.append(mrr)
        
        if (i + 1) % 10 == 0:
            print(f"Processed {i + 1}/{len(evaluation_data)} items. Current Avg Precision: {sum(precisions) / len(precisions):.4f}, Avg Recall: {sum(recalls) / len(recalls):.4f}, Avg MRR: {sum(mrrs) / len(mrrs):.4f}")
    
    avg_precision = sum(precisions) / len(precisions) if precisions else 0
    avg_recall = sum(recalls) / len(recalls) if recalls else 0
    avg_mrr = sum(mrrs) / len(mrrs) if mrrs else 0
    f1 = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall) if (avg_precision + avg_recall) > 0 else 0
    
    return avg_precision, avg_recall, avg_mrr, f1, precisions, recalls, mrrs

def evaluate_end_to_end(answer_query_function, db, eval_data):
    correct_answers = 0
    results = []
    total_questions = len(eval_data)
    
    for i, item in enumerate(tqdm(eval_data, desc="Evaluating End-to-End")):
        query = item['question']
        correct_answer = item['correct_answer']
        generated_answer = answer_query_function(query, db)
        
        comparision_prompt = f"""
        You are an AI assistant tasked with evaluating the correctness of answers to questions about Anthropic's documentation.
        
        Question: {query}
        
        Correct Answer: {correct_answer}
        
        Generated Answer: {generated_answer}
        
        Is the Generated Answer correct based on the Correct Answer? You should pay attention to the substance of the answer, and ignore minute details that may differ. 
        
        Small differences or changes in wording don't matter. If the generated answer and correct answer are saying essentially the same thing then that generated answer should be marked correct. 
        
        However, if there is any critical piece of information which is missing from the generated answer in comparison to the correct answer, then we should mark this as incorrect. 
        
        Finally, if there are any direct contradictions between the correct answer and generated answer, we should deem the generated answer to be incorrect.
        
        Respond in the following XML format (don't prefix with xml):
        <evaluation>
        <content>
        <explanation>Your explanation here</explanation>
        <is_correct>true/false</is_correct>
        </content>
        </evaluation>
        """
        
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": comparision_prompt}
                ],
                temperature=0.2,
            )
            response_text = str(response.choices[0].message.content)
            print(f'response_text:\n{response_text}')
            
            evaluation = ET.fromstring(response_text)
            is_correct_value = evaluation.find(".//is_correct").text
            
            is_correct = is_correct_value == 'true'
            
            if is_correct:
                correct_answers += 1
            results.append(is_correct)
            
            logging.info(f"Question {i + 1}/{total_questions}: {query}")
            logging.info(f"Correct: {is_correct}")
            logging.info("---")
            
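        except ET.ParseError as e:
            logging.error(f"XML parsing error: {e}")
            # Fall back to the <is_correct> tag itself; a bare 'true' substring check
            # could match words in the explanation. Whitespace is stripped before matching.
            compact = ''.join(response_text.lower().split())
            is_correct = '<is_correct>true</is_correct>' in compact
            if is_correct:
                correct_answers += 1  # count fallback verdicts toward accuracy too
            results.append(is_correct)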
        except Exception as e:
            logging.error(f"Unexpected error: {e}")
            results.append(False)
        
        if (i + 1) % 10 == 0:
            current_accuracy = correct_answers / (i + 1)
            print(f"Processed {i + 1}/{total_questions} questions. Current Accuracy: {current_accuracy:.4f}")
        # time.sleep(2)
    accuracy = correct_answers / total_questions
    return accuracy, results
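
The judge's explanation can itself contain unescaped tags (for example, when it quotes `<thinking>`), which makes the response invalid XML and causes `ET.fromstring` to raise a `ParseError`. One possible alternative, sketched below, is to read the verdict tag directly with a regular expression. `parse_is_correct` is a hypothetical helper, not part of the evaluation code above.

```python
import re

def parse_is_correct(response_text):
    """Return True/False from the judge's <is_correct> tag, or None if the tag is absent."""
    match = re.search(
        r"<is_correct>\s*(true|false)\s*</is_correct>", response_text, re.IGNORECASE
    )
    if match is None:
        return None
    return match.group(1).lower() == "true"

# Unescaped tags inside the explanation make this invalid XML,
# but the verdict tag is still readable
response = (
    "<evaluation><content>"
    "<explanation>The answer omits the <thinking> and <answer> tags.</explanation>"
    "<is_correct>false</is_correct>"
    "</content></evaluation>"
)
print(parse_is_correct(response))  # False
```

Returning `None` when the tag is missing lets the caller distinguish "judged incorrect" from "no verdict found", which a plain boolean cannot express.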

## Evaluating Our Base Case

In [None]:
import pandas as pd

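def answer_query_base(query, db):
    # Retrieve context for the question first. answer_query_from_context expects
    # document text, so passing `db` directly would put the VectorDB object in the prompt.
    _, context = retrieve_similar(query, db)
    return answer_query_from_context(query, context)

avg_precision, avg_recall, avg_mrr, f1, precisions, recalls, mrrs = evaluate_retrieval(retrieve_similar, eval_data, db)
e2e_accuracy, e2e_results = evaluate_end_to_end(answer_query_base, db, eval_data)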

# Create a DataFrame
df = pd.DataFrame({
    'question': [item['question'] for item in eval_data],
    'retrieval_precision': precisions,
    'retrieval_recall': recalls,
    'retrieval_mrr': mrrs,
    'e2e_correct': e2e_results
})

# Save to CSV
from pathlib import Path
csv_dir = Path('evaluation/csvs')
csv_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing directories
csv_file_name = Path('evaluation_results_detailed.csv')
df.to_csv(csv_dir / csv_file_name, index=False)
print(f"Detailed results saved to {csv_dir/ csv_file_name}")

# Print the results
print(f"Average Precision: {avg_precision:.4f}")
print(f"Average Recall: {avg_recall:.4f}")
print(f"Average MRR: {avg_mrr:.4f}")
print(f"Average F1: {f1:.4f}")
print(f"End-to-End Accuracy: {e2e_accuracy:.4f}")

# Save the results to a file
json_dir = Path("evaluation/json_results")
result_file_name = Path("evaluation_results_one.json")
Path(json_dir).mkdir(parents=True, exist_ok=True)
with open(json_dir / result_file_name, 'w') as f:
    json.dump({
        "name": "Basic RAG",
        "average_precision": avg_precision,
        "average_recall": avg_recall,
        "average_f1": f1,
        "average_mrr": avg_mrr,
        "end_to_end_accuracy": e2e_accuracy
    }, f, indent=2)

print(f"Evaluation complete. Results saved to {json_dir / result_file_name}, {csv_dir/ csv_file_name}")

Processed 10/100 items. Current Avg Precision: 0.4333, Avg Recall: 0.7000, Avg MRR: 0.9000
Processed 20/100 items. Current Avg Precision: 0.3333, Avg Recall: 0.5500, Avg MRR: 0.7000
Processed 30/100 items. Current Avg Precision: 0.3778, Avg Recall: 0.6000, Avg MRR: 0.7667
Processed 40/100 items. Current Avg Precision: 0.4083, Avg Recall: 0.6250, Avg MRR: 0.8000
Processed 50/100 items. Current Avg Precision: 0.4067, Avg Recall: 0.6300, Avg MRR: 0.7800
Processed 60/100 items. Current Avg Precision: 0.4056, Avg Recall: 0.6361, Avg MRR: 0.7833
Processed 70/100 items. Current Avg Precision: 0.3952, Avg Recall: 0.6167, Avg MRR: 0.7548
Processed 80/100 items. Current Avg Precision: 0.4208, Avg Recall: 0.6583, Avg MRR: 0.7792
Processed 90/100 items. Current Avg Precision: 0.4185, Avg Recall: 0.6556, Avg MRR: 0.7704
Processed 100/100 items. Current Avg Precision: 0.3933, Avg Recall: 0.6183, Avg MRR: 0.7333



response_text:
<evaluation>
<content>
<explanation>The generated answer provides a comprehensive overview of the process for creating multiple test cases in the Anthropic Evaluation tool, including defining evaluation criteria, creating diverse test cases, formatting them, using batch processing, running evaluations, analyzing results, and iterating based on findings. However, it does not mention the specific action of clicking the 'Add Test Case' button and filling in values for each variable in the prompt, which is a critical piece of information from the correct answer. Therefore, while the generated answer contains relevant information, it lacks the specific steps outlined in the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>



response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it states that Anthropic recommends Pinecone as the embeddings provider, while the correct answer specifies Voyage AI as the recommended provider. This is a critical piece of information that directly contradicts the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>



response_text:
<evaluation>
<content>
<explanation>The generated answer provides a comprehensive list of key success metrics for evaluating Claude's performance on a classification task, including accuracy, precision, recall, F1 score, ROC-AUC, latency, and throughput. It also discusses the importance of balancing these metrics when choosing the right model to reduce latency, which aligns with the correct answer's emphasis on speed and output quality. However, the generated answer does not explicitly mention "consistency," "structure," "bias," and "fairness," which are included in the correct answer. While the generated answer is detailed and covers many important aspects, the omission of these specific metrics means it does not fully align with the correct answer. Therefore, it should be marked as incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>



response_text:
<evaluation>
<content>
<explanation>The generated answer provides valid points about how Claude for Sheets can improve prompt engineering workflows, such as streamlined interaction and enhanced contextual awareness. However, it does not mention the ability to test prompts across evaluation suites in parallel, which is a key aspect of the correct answer. Therefore, while the generated answer contains relevant information, it is missing a critical piece of information regarding the speed advantage of parallel testing over sequential chained prompts. Thus, the generated answer is not fully correct.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>



response_text:
<evaluation>
<content>
<explanation>The generated answer does not mention that the absence of the required "\n\nHuman:" and "\n\nAssistant:" turns will result in an API error, which is a critical piece of information provided in the correct answer. Instead, it focuses on the potential for confusion and lack of clarity in responses, which does not address the specific consequence of an API error. Therefore, the generated answer is missing a key aspect of the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>



response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies that using tools in the Claude API results in additional tokens, which impacts pricing. However, it fails to mention that tool use requests are priced the same as regular API requests based on total input and output tokens, which is a critical piece of information from the correct answer. Therefore, the generated answer is missing important context regarding the pricing structure.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>



response_text:
<evaluation>
<content>
<explanation>The generated answer states that the new features are expected to be available in early 2024, while the correct answer specifies a concrete date of June 27th, 2024. This is a critical piece of information that is missing in the generated answer, as it does not provide the exact date and instead gives a vague timeframe. Therefore, the generated answer is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>



response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies two key factors to consider when deciding whether to use chain-of-thought (CoT): the complexity of the task and latency requirements. It emphasizes the need for in-depth reasoning for complex tasks and acknowledges the potential impact of CoT on latency, which aligns with the correct answer. Therefore, the generated answer captures the essential points and is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>



response_text:
<evaluation>
<content>
<explanation>The generated answer provides a detailed process for using Claude to digest long PDF documents, including steps for text extraction, summarization, and asking specific questions. However, it does not explicitly mention the ability to upload PDFs directly to Claude for summarization, which is a key point in the correct answer. Therefore, while the generated answer contains useful information, it is missing the critical aspect of directly uploading PDFs for summarization.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>



response_text:
<evaluation>
<content>
<explanation>The generated answer states that you can view your organization's current API rate limits in the "Usage" section of the Anthropic Console, while the correct answer specifies that this information is found in the Rate Limits tab of the Developer Console. Since the generated answer refers to a different section ("Usage") instead of the specified "Rate Limits tab," it is missing critical information and does not accurately reflect the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>
Processed 10/100 questions. Current Accuracy: 0.1000



response_text:
<evaluation>
<content>
<explanation>The generated answer provides a comprehensive list of metrics to evaluate the performance of the ticket classification system beyond accuracy, including precision, recall, F1 score, confusion matrix, ROC-AUC score, log loss, class distribution analysis, cross-validation, error analysis, and user feedback. However, it does not mention the 95th percentile response time and average cost per classification, which are specifically highlighted in the correct answer as important metrics for assessing production-readiness. Therefore, the generated answer is missing critical information that is present in the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>



response_text:
<evaluation>
<content>
<explanation>The generated answer correctly describes how to specify a system prompt using both the Text Completions API and the Messages API. It mentions that the Text Completions API uses a single input string for the prompt, which aligns with the correct answer's description of adding the system prompt before the first "\n\nHuman:" turn. Additionally, it accurately states that the Messages API uses a structured array of messages with a role of "system" for the prompt. Therefore, the generated answer captures the essential information and is consistent with the correct answer.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


ERROR:root:XML parsing error: mismatched tag: line 3, column 600

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a detailed explanation of how to use XML tags to structure prompts for Claude, emphasizing the importance of defining the structure, encouraging step-by-step reasoning, using attributes for additional context, and iterating to refine prompts. However, it lacks the specific mention of using <thinking> and <answer> tags as highlighted in the correct answer, which is a critical piece of information for combining XML tags with chain of thought reasoning. Therefore, the generated answer is missing essential details that are present in the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  14%|█▍        | 14/100 [00:48<05:31,  3.85s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it mentions precision and recall as key metrics instead of the accuracy, 95th percentile response time, and average cost per request routing, which are the metrics specified in the correct answer. Additionally, it does not provide the specific results for accuracy, precision, and recall, which are critical pieces of information that were included in the correct answer. Therefore, the generated answer does not align with the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  15%|█▌        | 15/100 [00:51<04:47,  3.39s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer includes some relevant points such as understanding goals and objectives, defining a target audience, and having evaluation criteria. However, it misses the specific mention of having a clear definition of success criteria, ways to empirically test against those criteria, and a first draft prompt to improve, which are critical components outlined in the correct answer. Therefore, the generated answer is not fully aligned with the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  16%|█▌        | 16/100 [00:55<05:16,  3.77s/it]

response_text:
<evaluation>
<content>
<explanation>The Generated Answer correctly describes the differences between the Messages API and the Text Completions API regarding mid-response prompting. It accurately states that the Messages API allows for interactive dialogues and can adjust its output based on user input, while the Text Completions API processes the input at once without mid-response adjustments. However, it does not explicitly mention the specific mechanism of how the Messages API allows for continuation of responses by using the "assistant" role, nor does it mention the ability to pre-fill part of Claude's response in the Text Completions API. These details are critical to fully understanding the differences. Therefore, while the essence of the answer is correct, it lacks important specifics, leading to a conclusion of incorrectness.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  17%|█▋        | 17/100 [01:00<05:28,  3.96s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies that Claude's response is tailored based on the assigned role, emphasizing the depth and specificity expected from a role-based prompt. It also notes that without a specific role, the response may be more general and less focused. However, it lacks the specific mention of the structured, actionable nature of the analysis and the specific recommendations that the correct answer highlights. While the generated answer captures the essence of the difference, it does not fully convey the level of insight and actionable recommendations that the correct answer specifies. Therefore, it is missing critical information regarding the structured and actionable aspects of the role-based response.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  18%|█▊        | 18/100 [01:05<06:07,  4.48s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies several quantitative metrics for evaluating a sentiment analysis model, including accuracy, precision, recall, F1 score, AUC-ROC, and the confusion matrix. It also discusses how to determine specific targets for these metrics, aligning well with the correct answer's emphasis on industry benchmarks, prior experiments, and expert knowledge. Therefore, the generated answer captures the essential points and is consistent with the correct answer.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  19%|█▉        | 19/100 [01:07<04:58,  3.69s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer states that it does not have access to the specific documentation and cannot provide a power user tip, which is a critical piece of information missing from the correct answer. The correct answer provides a specific tip about combining XML tags with other prompt engineering techniques, which is not addressed in the generated answer. Therefore, the generated answer is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  20%|██        | 20/100 [01:12<05:22,  4.03s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a comprehensive and detailed approach to using an LLM like Claude to grade the outputs of other LLMs based on a rubric. It includes defining the grading rubric, inputting outputs, prompt design, batch processing, analyzing results, iterative improvement, and human oversight. These steps align well with the essence of the correct answer, which emphasizes using Claude to evaluate outputs against a rubric and providing a simple 'correct' or 'incorrect' result. Although the generated answer is more detailed, it does not omit any critical information and effectively captures the main idea of the correct answer. Therefore, it can be considered correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>
Processed 20/100 questions. Current Accuracy: 0.2000


Evaluating End-to-End:  21%|██        | 21/100 [01:16<05:20,  4.06s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a general overview of accessing and deploying Voyage embeddings on AWS Marketplace, including steps like visiting the marketplace, selecting the product, subscribing, launching, configuring deployment, and accessing the embeddings. However, it lacks specific details mentioned in the correct answer, such as copying the Product ARN, creating a JupyterLab space in SageMaker Studio, and uploading Voyage's notebook. These are critical steps for deploying the model package that are not addressed in the generated answer. Therefore, it is missing essential information.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  22%|██▏       | 22/100 [01:22<05:51,  4.51s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a comprehensive approach to obtaining JSON output from Claude, including defining the schema, using structured prompts, providing examples, and validating the output. However, it lacks the specific instruction to provide a single tool and to set the tool_choice explicitly, which are key elements mentioned in the correct answer. Therefore, it is missing critical information and should be marked as incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  23%|██▎       | 23/100 [01:26<05:31,  4.30s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer accurately describes the key differences between the Claude Instant 1.2 model and the Claude 3 Haiku model, including improvements in model architecture, natural language understanding, response quality, performance metrics, adaptability, and training data. It aligns with the correct answer's points about performance, intelligence, and up-to-date training data. However, it does not explicitly mention the vision capabilities of Claude 3 Haiku, which is a critical piece of information present in the correct answer. Therefore, while the generated answer is largely correct, it is missing this important detail.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  24%|██▍       | 24/100 [01:27<04:29,  3.55s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies that using examples helps clarify the desired output format and context, which aligns with the correct answer's emphasis on reducing misinterpretation of instructions. Both answers highlight the benefit of achieving more accurate outputs from Claude. Therefore, the generated answer is essentially saying the same thing as the correct answer.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  25%|██▌       | 25/100 [01:29<03:43,  2.98s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies that prompt engineering allows for quicker adaptations to new domains or tasks without extensive retraining, which aligns with the key advantage mentioned in the correct answer. It emphasizes the efficiency and reduced computational resources needed compared to fine-tuning, which is consistent with the essence of the correct answer. Therefore, the generated answer is correct as it conveys the same fundamental idea.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  26%|██▌       | 26/100 [01:33<03:54,  3.17s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a detailed step-by-step process for getting started with the Claude for Sheets extension, including installing the extension, accessing it, choosing a template, customizing it, and utilizing its features. However, it does not explicitly mention making a copy of the provided Claude for Sheets workbook template, which is a critical piece of information in the correct answer. Therefore, the generated answer is missing this key detail and does not align fully with the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  27%|██▋       | 27/100 [01:39<04:53,  4.01s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer captures the general idea that the "index" field relates to the order of the text being streamed, but it lacks the specific detail that multiple deltas with the same index consecutively stream the text for a single content block. This critical piece of information is essential for understanding how the "index" functions in the context of streaming responses. Therefore, the generated answer is not fully correct.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  28%|██▊       | 28/100 [01:41<04:06,  3.42s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is mostly correct, but it is missing the specific requirement to include the image as a base64-encoded image in an "image" content block within the "messages" array. Additionally, it does not mention the WebP format as a supported image format, which is included in the correct answer. Therefore, it lacks critical information and is not fully accurate.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  29%|██▉       | 29/100 [01:44<04:03,  3.43s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer accurately describes the relationship between time to first token (TTFT) and latency, emphasizing that TTFT is a measure of the time taken to generate the first token after a prompt, which is indeed a specific aspect of latency. It also discusses the implications of TTFT and latency on user experience, aligning with the correct answer's focus on TTFT as an important component of overall latency and responsiveness. Therefore, the generated answer is correct and captures the essential points made in the correct answer.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  30%|███       | 30/100 [01:48<04:05,  3.51s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly captures the essence of the correct answer by explaining how providing examples of handling edge cases like implicit requests and emotional prioritization can enhance Claude's performance in routing support tickets. It discusses the importance of recognizing implicit requests and emotional cues, which aligns with the points made in the correct answer. Both answers emphasize the improvement in ticket categorization and prioritization, as well as the overall customer experience. Therefore, the generated answer is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>
Processed 30/100 questions. Current Accuracy: 0.2667


Evaluating End-to-End:  31%|███       | 31/100 [01:52<04:11,  3.65s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly describes the significance of the "tool_use" stop_reason in the context of integrating external tools with Claude. It emphasizes the enhancement of Claude's capabilities through the use of external tools and the transition from internal processing to tool utilization. However, it lacks specific details about the process that follows the "tool_use" stop_reason, such as the need for the user to extract tool input, run the tool code client-side, and send the results back to Claude. This omission is critical as it describes the complete workflow and interaction required after the "tool_use" signal. Therefore, the generated answer is not fully correct.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  32%|███▏      | 32/100 [01:54<03:43,  3.29s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it states that the error event is "rate_limit_exceeded" with an HTTP error code of 429, while the correct answer specifies that the error event is "overloaded_error" with an HTTP error code of 529. This is a critical piece of information that is missing in the generated answer, and there is a direct contradiction between the two answers regarding the specific error event and corresponding HTTP error code.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  33%|███▎      | 33/100 [01:56<03:03,  2.74s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer incorrectly states that the two types of deltas are "insert" and "delete," while the correct answer specifies "text_delta" and "input_json_delta." Since the generated answer does not match the correct answer and presents different terms, it is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  34%|███▍      | 34/100 [01:57<02:41,  2.44s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it states that both Claude 3.5 Sonnet and tool use became generally available on March 13, 2024, while the correct answer specifies that Claude 3.5 Sonnet became available on June 20, 2024, and tool use on May 30, 2024. This is a direct contradiction regarding the dates of availability.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  35%|███▌      | 35/100 [01:59<02:29,  2.30s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it does not specify the order of the launches in terms of the timeline (May 2024 for Europe and June 2024 for Canada) and implies a different sequence by stating that Claude.ai was launched first without clarifying the specific regions or months. The correct answer clearly states the order and timing of the launches, which the generated answer fails to capture accurately.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  36%|███▌      | 36/100 [02:03<02:45,  2.59s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer captures the essence of the correct answer by explaining that a stop_reason of "tool_use" indicates the model's need to use an external tool to continue the conversation. However, it lacks specific details about extracting the tool name and input from Claude's request, executing the tool code client-side, and sending a new user message with the tool result back to Claude. These critical steps are essential for properly continuing the conversation after a "tool_use" stop reason. Therefore, the generated answer is missing important information and should be marked as incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  37%|███▋      | 37/100 [02:05<02:30,  2.39s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it does not mention the anthropic Python library, which is specifically stated in the correct answer as the library used to interact with the Claude AI model. Instead, it lists other libraries that are not mentioned in the correct answer, which could lead to confusion about the specific tools used in the example code snippet.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  38%|███▊      | 38/100 [02:07<02:31,  2.45s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it omits the option of using the default AWS credential providers, such as the ~/.aws/credentials file or the AWS environment variables (AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID), which is a critical piece of information mentioned in the correct answer. Additionally, while it mentions AWS IAM roles, it does not explicitly state that this is an alternative to providing access keys directly, which is part of the correct answer's context.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  39%|███▉      | 39/100 [02:10<02:37,  2.59s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies two important factors: security and privacy, and usability and performance. However, it does not explicitly mention the potential reduction in prompt leaks or the risk of degraded model performance due to added complexity, which are critical components of the correct answer. Therefore, while the generated answer addresses relevant themes, it lacks the specific balance between prompt leaks and model performance degradation that is central to the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  40%|████      | 40/100 [02:14<02:54,  2.91s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly explains that selecting the appropriate Claude model can help reduce latency by optimizing for specific tasks and balancing performance and speed. It mentions the impact of model size and architecture on processing speed, which aligns with the correct answer's emphasis on choosing the right model for speed and output quality. Both answers convey the importance of matching the model's capabilities to the application's needs to minimize latency. Therefore, the generated answer is essentially saying the same thing as the correct answer.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>
Processed 40/100 questions. Current Accuracy: 0.2250


Evaluating End-to-End:  41%|████      | 41/100 [02:17<02:55,  2.98s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a correct method for streaming responses from the Anthropic API using the Python SDK. It mentions using the `stream=True` parameter when making a request, which aligns with the correct answer's emphasis on streaming responses. Both answers convey the essential idea of iterating over the response to handle streamed data. Therefore, the generated answer is correct as it captures the necessary information for streaming responses.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  42%|████▏     | 42/100 [02:19<02:36,  2.69s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly mentions using the "max_tokens" parameter to limit the length of the output, which aligns with the correct answer. However, it fails to mention that the pre-filled part of the response should be placed in the last position of the input messages list, which is a critical piece of information missing from the generated answer. Therefore, the generated answer is not fully correct.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  43%|████▎     | 43/100 [02:24<03:17,  3.47s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer presents a nuanced view on the importance of test cases with automated grading versus those graded by humans, discussing the benefits of both approaches. However, it ultimately contradicts the correct answer, which clearly states that it is better to prioritize a larger volume of test cases with automated grading over fewer high-quality human-graded cases. The generated answer suggests that the importance depends on specific goals and implies a balance between the two, which is not aligned with the correct answer's emphasis on prioritizing automated grading. Therefore, the generated answer is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  44%|████▍     | 44/100 [02:26<02:41,  2.88s/it]

response_text:
<evaluation>
<content>
<explanation>The Generated Answer is incorrect because it states that the required fields are "text" and "timestamp," while the Correct Answer specifies "index" and "delta" as the required fields. This is a critical piece of information that is missing in the Generated Answer, making it incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


ERROR:root:XML parsing error: not well-formed (invalid token): line 3, column 119
Evaluating End-to-End:  45%|████▌     | 45/100 [02:28<02:29,  2.71s/it]

response_text:
<evaluation>
<content>
<explanation>The Generated Answer provides alternative interactive learning methods (interactive tutorials and live Q&A sessions) that are not mentioned in the Correct Answer. However, it fails to mention the specific resources highlighted in the Correct Answer, such as the Anthropic Cookbook and the Developer Console. Since the Generated Answer does not include critical pieces of information regarding the specific tools available for learning about Claude's capabilities, it cannot be considered correct.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>
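
A note on the two `XML parsing error` lines in the log above: they appear to be triggered by judge explanations that contain literal markup characters, such as the unescaped `<thinking>` and `<answer>` tags (reported as `mismatched tag`) and the bare `&` in "Q&A" (reported as `not well-formed (invalid token)`). The notebook's own parsing code is not shown in this output, so the following is only a minimal sketch of one way to tolerate such responses. It tries strict XML parsing first and falls back to a regular expression. The function name `parse_evaluation` is hypothetical and is not taken from the notebook.

```python
import re
import xml.etree.ElementTree as ET


def parse_evaluation(response_text):
    """Return (explanation, is_correct) from a judge response.

    Hypothetical helper: strict XML parsing first, then a regex fallback
    for responses whose explanation contains unescaped '<' or '&'.
    is_correct is None when no verdict can be found.
    """
    try:
        root = ET.fromstring(response_text.strip())
        exp_node = root.find(".//explanation")
        verdict = root.findtext(".//is_correct")
        if exp_node is not None and verdict is not None:
            # itertext() keeps text that follows any nested child elements
            explanation = "".join(exp_node.itertext()).strip()
            return explanation, verdict.strip().lower() == "true"
    except ET.ParseError:
        pass  # malformed XML: fall through to the regex fallback

    exp = re.search(r"<explanation>(.*?)</explanation>", response_text, re.DOTALL)
    ver = re.search(
        r"<is_correct>\s*(true|false)\s*</is_correct>", response_text, re.IGNORECASE
    )
    explanation = exp.group(1).strip() if exp else None
    is_correct = ver.group(1).lower() == "true" if ver else None
    return explanation, is_correct
```

With a fallback like this, a malformed response can still contribute its verdict instead of being dropped or counted as a failure. In this run both affected responses were judged `false` anyway, so the reported accuracy would likely not change.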


Evaluating End-to-End:  46%|████▌     | 46/100 [02:35<03:30,  3.89s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a comprehensive explanation of why breaking a task into subtasks improves Claude's accuracy, covering aspects such as clarity, incremental problem solving, error isolation, structured reasoning, and enhanced contextual understanding. These points align well with the essence of the correct answer, which emphasizes that focusing on distinct subtasks allows Claude to give full attention to each part, thereby reducing errors. Since the generated answer captures the main idea and expands on it without contradicting the correct answer, it can be considered correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  47%|████▋     | 47/100 [02:38<03:19,  3.77s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies that Messages streaming responses are structured for interactive dialogue and include metadata, while Text Completions streaming responses are more linear and focused on generating text completions. However, it does not explicitly mention that Messages responses can contain multiple content blocks of varying types, which is a key aspect of the correct answer. Therefore, while the generated answer captures the essence of the differences, it lacks a critical piece of information regarding the complexity of the Messages streaming format compared to Text Completions. Thus, it should be marked as incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  48%|████▊     | 48/100 [02:40<02:46,  3.20s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides two ways to experiment with Claude: using the Claude API and accessing Claude through a web-based interface. However, it does not mention visiting claude.ai or using Anthropic's web Console, which are specifically stated in the correct answer. Therefore, the generated answer is missing critical information and is not fully aligned with the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  49%|████▉     | 49/100 [02:43<02:42,  3.19s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer accurately captures the essence of the correct answer by explaining how chain prompts break down complex tasks into smaller components, allowing for focused attention and reducing errors. It also mentions the iterative refinement process, which enhances the quality of the output, aligning with the idea of minimizing inconsistencies. Overall, the generated answer conveys the same fundamental concepts as the correct answer without omitting any critical information.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  50%|█████     | 50/100 [02:45<02:17,  2.75s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer states that the overloaded_error event corresponds to the HTTP status code 429, while the correct answer specifies that it corresponds to HTTP 529. Since these two status codes are different and the generated answer contradicts the correct answer, it is deemed incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>
Processed 50/100 questions. Current Accuracy: 0.2400


Evaluating End-to-End:  51%|█████     | 51/100 [02:47<02:05,  2.57s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it does not mention the specific ways to specify the format of embeddings as described in the correct answer. The correct answer specifies that you can either leave the encoding_format parameter unspecified or set it to "base64", while the generated answer refers to using query parameters and the `Accept` header, which does not align with the details provided in the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  52%|█████▏    | 52/100 [02:51<02:16,  2.85s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer does not accurately capture the specifics of how the input JSON deltas for tool_use content blocks are sent and parsed. It mentions incremental updates and maintaining local state, but it fails to specify that the deltas are sent as partial JSON strings in multiple content_block_delta events and that the complete JSON object is parsed after receiving a content_block_stop event. Additionally, it does not mention the use of libraries like Pydantic or helpers provided in Anthropic's SDKs for parsing. These omissions are critical to understanding the correct implementation, making the generated answer incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  53%|█████▎    | 53/100 [02:53<02:14,  2.87s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer incorrectly names the tutorials as "Prompting 101" and "Prompting 102," whereas the correct answer specifies a GitHub prompting tutorial and a Google Sheets prompting tutorial. Additionally, the generated answer does not mention the specific platforms or tools used in the tutorials, which is a critical piece of information. Therefore, the generated answer is not correct.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  54%|█████▍    | 54/100 [02:58<02:37,  3.42s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a comprehensive overview of Claude's capabilities for enterprise use, including integration features, data privacy, scalability, customizability, advanced processing capabilities, user-friendly interface, and real-time processing. However, it fails to mention the specific 200K token context window and multimodal input capabilities that are highlighted in the correct answer. These are key aspects that contribute to Claude's suitability for high-trust industries and processing large volumes of sensitive data. Therefore, the generated answer is missing critical information and should be marked as incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  55%|█████▌    | 55/100 [03:00<02:11,  2.93s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer does not provide any information regarding the specific regions where Anthropic's Claude.ai API and iOS app are available as of June 2024. It states a lack of access to this information and suggests consulting the official website instead. This is a critical omission, as the correct answer specifies that the services are available in the United States, Canada, and Europe. Therefore, the generated answer is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  56%|█████▌    | 56/100 [03:05<02:39,  3.64s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer introduces two different approaches (API Integration and Chatbot Integration) that are not mentioned in the correct answer, which specifies push-based and pull-based approaches. While the generated answer discusses scalability and ease of implementation, it does not align with the specific approaches outlined in the correct answer. Therefore, it lacks critical information and does not accurately reflect the content of the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  57%|█████▋    | 57/100 [03:07<02:13,  3.11s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it states that the prompt generator tool was released on March 1, 2024, while the correct answer states the release date is May 10, 2024. Additionally, the generated answer mentions the tool is available through the Claude interface, whereas the correct answer specifies it is available through the Developer Console. These discrepancies indicate that the generated answer does not align with the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  58%|█████▊    | 58/100 [03:10<02:12,  3.16s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer does not explicitly identify the Claude 3 Sonnet model as the one that balances intelligence and speed for high-throughput tasks like sales forecasting and targeted marketing. Instead, it discusses the general trade-offs between speed and intelligence without naming the specific model that provides the best balance. This omission of the critical piece of information makes the generated answer incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  59%|█████▉    | 59/100 [03:13<02:04,  3.04s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly states that the similarity between two Voyage embedding vectors can be calculated using the cosine similarity measure, which is equivalent to the dot product of the two vectors since they are normalized to length 1. It also explains the meaning of the cosine similarity values, which adds clarity. Therefore, the generated answer is consistent with the correct answer and contains all the necessary information.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  60%|██████    | 60/100 [03:15<01:50,  2.77s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer accurately reflects the key points made in the correct answer. It discusses how examples in prompts provide clear context, demonstrate expected formats, and help reduce ambiguity, which aligns with the correct answer's emphasis on reducing misinterpretation and enforcing consistency. Both answers highlight that examples serve as a guide for the desired output and improve Claude's ability to handle complex tasks. Therefore, the generated answer is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>
Processed 60/100 questions. Current Accuracy: 0.2333


Evaluating End-to-End:  61%|██████    | 61/100 [03:19<02:04,  3.18s/it]

response_text:
<evaluation>
<content>
<explanation>The Generated Answer is incorrect because it describes two types of deltas that are not mentioned in the Correct Answer. The Correct Answer specifies "text deltas" and "input JSON deltas," while the Generated Answer refers to "Content Block Delta" and "Tool Use Delta," which do not match. Additionally, the details provided in the Generated Answer do not align with the specific fields mentioned in the Correct Answer, such as the "text" field and "partial_json" field. Therefore, there is a critical piece of information missing and a contradiction in the types of deltas described.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  62%|██████▏   | 62/100 [03:22<01:50,  2.92s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies two key capabilities of Claude: advanced natural language understanding and the ability to learn from user interactions. These capabilities align with the correct answer's emphasis on question answering and text analysis, as both involve understanding language and adapting to user preferences. Therefore, the generated answer captures the essence of the correct answer without missing any critical information or introducing contradictions.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  63%|██████▎   | 63/100 [03:27<02:09,  3.50s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer does not accurately reflect the key event types included in a raw HTTP stream response when using message streaming as outlined in the correct answer. It mentions connection establishment, response headers, data chunks, end of stream, and error handling, which are not the specific event types described in the correct answer. The correct answer specifies events like message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop, which are critical to understanding the streaming process. Therefore, the generated answer is missing key information and does not align with the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  64%|██████▍   | 64/100 [03:28<01:48,  3.01s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer does not provide any specific information about the maximum number of images that can be included in a single request using the Anthropic API or the claude.ai interface. It states that it does not have access to those details, which is a critical piece of information missing compared to the correct answer. Therefore, the generated answer is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  65%|██████▌   | 65/100 [03:32<01:48,  3.11s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer suggests prompting Claude again to continue from where it left off, which is a valid approach. However, it does not mention the critical step of increasing the max_tokens value, which is essential to ensure that the full response can be received. Therefore, the generated answer is missing a key piece of information compared to the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  66%|██████▌   | 66/100 [03:35<01:42,  3.03s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it does not mention the critical steps of developing test cases and reviewing Anthropic's guide to developing test cases. Instead, it refers to preparing the dataset and defining evaluation metrics, which are not the same as the steps outlined in the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  67%|██████▋   | 67/100 [03:37<01:32,  2.80s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a broader context on how to influence Claude's response by discussing the use of context, tone, and specific instructions. However, it does not explicitly mention that the content parameter should be in the last position of the messages list with the "assistant" role to pre-fill part of Claude's response, which is a critical piece of information from the correct answer. Therefore, the generated answer is missing essential details that are present in the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  68%|██████▊   | 68/100 [03:41<01:39,  3.12s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies the preservation of general knowledge as an advantage of prompt engineering over fine-tuning, which aligns with the correct answer. However, it introduces a new point about flexibility and adaptability, which is not mentioned in the correct answer. The correct answer specifically emphasizes the effectiveness of prompt engineering in helping models understand and utilize external content, which is not addressed in the generated answer. Therefore, while the generated answer contains valid points, it misses a critical aspect of the correct answer regarding the utilization of external content, leading to a lack of completeness. Thus, the generated answer is not fully correct.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  69%|██████▉   | 69/100 [03:45<01:50,  3.56s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a general overview of the steps to get started with the Bedrock API, mentioning setting up the environment and making API requests. However, it lacks specific details about installing and configuring the AWS CLI and the need to install an SDK, such as the Python SDK, which are critical steps mentioned in the correct answer. Therefore, the generated answer is missing important information and is not fully aligned with the correct answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


ERROR:root:XML parsing error: mismatched tag: line 3, column 573
Evaluating End-to-End:  70%|███████   | 70/100 [03:49<01:47,  3.58s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer is incorrect because it suggests using the `aws sagemaker list-models` command, which is not the correct command for listing Claude models. The correct command is `aws bedrock list-foundation-models --region=<region> --by-provider anthropic --query "modelSummaries[*].modelId"`, which specifically targets the Bedrock service for listing foundation models provided by Anthropic. The generated answer does not mention the Bedrock service and could lead to confusion about how to correctly check for Claude models in a specific AWS region.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>
Processed 70/100 questions. Current Accuracy: 0.2143


Evaluating End-to-End:  71%|███████   | 71/100 [03:51<01:31,  3.16s/it]

response_text:
<evaluation>
<content>
<explanation>The Generated Answer accurately conveys the information provided in the Correct Answer. Both answers state that the `input_type` argument can be used to specify whether the input text is a "query" or a "document." There are no critical pieces of information missing, and there are no contradictions between the two answers. Therefore, the Generated Answer is correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  72%|███████▏  | 72/100 [03:56<01:43,  3.70s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a general overview of the differences between tool_use content blocks and text content blocks, focusing on their structure and the type of data they encapsulate. However, it fails to mention the specific detail that tool_use content block deltas contain partial JSON strings for the input field and that there may be delays between streaming events as the model emits one complete key-value pair at a time. These critical pieces of information are essential to understanding the differences in delta formats. Therefore, the generated answer is missing key information and should be marked as incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  73%|███████▎  | 73/100 [03:58<01:22,  3.07s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer does not provide the specific image file size limits for the API and claude.ai, which are critical pieces of information included in the correct answer. Instead, it suggests that the details are not provided in the documents, which is misleading. Therefore, the generated answer is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  74%|███████▍  | 74/100 [04:00<01:10,  2.72s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly identifies that the model's size and complexity are important factors for achieving low latency, which aligns with the correct answer's emphasis on balancing speed and output quality. Both answers highlight the need to consider the specific requirements of the use case, although the generated answer focuses more on model size. Overall, the essence of both answers is similar, and the generated answer does not omit any critical information. Therefore, it can be considered correct.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  75%|███████▌  | 75/100 [04:02<01:02,  2.49s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer incorrectly states that Anthropic recommends the "Claude" embedding model for code retrieval, whereas the correct answer specifies the "voyage-code-2" embedding model. Additionally, the generated answer does not mention the specific performance improvement of 17% over alternatives, which is a critical piece of information. Therefore, the generated answer is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  76%|███████▌  | 76/100 [04:05<01:10,  2.92s/it]

response_text:
<evaluation>
<content>
<explanation>The Generated Answer provides two valid ways the Anthropic Cookbook can help developers learn to use Anthropic's APIs: step-by-step tutorials and code examples. However, it does not mention the interactive Jupyter notebooks or the specific use case of uploading PDFs and working with embeddings, which are key aspects of the Correct Answer. Therefore, the Generated Answer is missing critical information and is not fully aligned with the Correct Answer.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  77%|███████▋  | 77/100 [04:10<01:15,  3.29s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer accurately reflects the key points made in the correct answer regarding the impact of the context window size on retrieval augmented generation (RAG). Both answers emphasize that a larger context window allows for more retrieved information to be incorporated, which enhances the model's ability to generate accurate and relevant responses. The generated answer also discusses the limitations of a smaller context window, which aligns with the correct answer's implication about improved performance with a larger context window. Therefore, the generated answer is correct as it conveys the same essential information without any critical omissions or contradictions.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  78%|███████▊  | 78/100 [04:13<01:13,  3.32s/it]

response_text:
<evaluation>
<content>
<explanation>The Generated Answer captures the essence of the Correct Answer by discussing the Evaluation tool's role in assessing prompt effectiveness, refining prompts, and improving user experience. It mentions the iterative process of prompt development and the importance of aligning prompts with desired outcomes. However, it lacks specific references to identifying edge cases, ensuring consistent performance across inputs, and reviewing results across test cases to spot patterns, which are critical elements mentioned in the Correct Answer. Therefore, while the Generated Answer is largely correct, it misses some key aspects that are essential for a complete understanding.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  79%|███████▉  | 79/100 [04:15<01:00,  2.88s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer does not provide the specific information that the Claude 3 Haiku model has the fastest comparative latency, which is a critical piece of information present in the correct answer. Instead, it states a lack of access to the comparison tables and does not affirmatively identify any model. Therefore, the generated answer is incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  80%|████████  | 80/100 [04:21<01:17,  3.86s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly describes the process of maintaining a multi-turn conversation using the Anthropic Messages API in Python. It emphasizes the need to keep a list of messages representing the conversation history, which aligns with the correct answer's point about sending the full conversation history with each request. Both answers convey that the API is stateless and that the entire context must be provided with each request. Therefore, the generated answer is correct and captures the essential information from the correct answer.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>
Processed 80/100 questions. Current Accuracy: 0.2375


Evaluating End-to-End:  81%|████████  | 81/100 [04:24<01:07,  3.56s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer correctly explains how using XML tags to provide a specific role or context can enhance Claude's analysis of a legal contract. It emphasizes the importance of focusing Claude's attention on relevant aspects of the contract and how this targeted approach can lead to better risk assessments and informed decision-making. While the generated answer does not explicitly mention the potential financial savings (millions of dollars) highlighted in the correct answer, it captures the essence of the benefits of using a role prompt. Therefore, it is considered correct as it conveys the same fundamental idea without critical omissions or contradictions.</explanation>
<is_correct>true</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  82%|████████▏ | 82/100 [04:27<01:01,  3.41s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer misrepresents the approaches of Claude 3 Opus and Claude 3 Sonnet. It states that Claude 3 Opus adopts a proactive approach by inferring missing information, which contradicts the correct answer that states Opus is more likely to ask the user for missing information. Additionally, the generated answer suggests that Sonnet is conservative and may request clarification, which aligns with the correct answer but does not accurately reflect the emphasis on inference for Opus. Therefore, the generated answer is incorrect as it contains critical misinterpretations of both models' behaviors regarding missing information.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>


Evaluating End-to-End:  83%|████████▎ | 83/100 [04:32<01:06,  3.90s/it]

response_text:
<evaluation>
<content>
<explanation>The generated answer provides a comprehensive list of steps for deploying an automated ticket routing system using Claude, covering aspects such as requirements gathering, system design, data preparation, model training, testing, performance evaluation, monitoring, deployment strategy, user training, feedback loop, maintenance plan, and compliance and security. However, it lacks specific mention of implementing retry logic, error handling, and a gradual rollout process, which are critical components highlighted in the correct answer. Therefore, while the generated answer is thorough, it misses key elements necessary for ensuring reliability in deployment, making it incorrect.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>
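
The `XML parsing error: mismatched tag` logged at question 70 most likely comes from the judge's explanation quoting `--region=<region>`, which a strict XML parser reads as an unclosed tag. Below is a minimal sketch of a more tolerant parser that falls back to regular expressions when strict parsing fails. The function name `parse_judge_response` and the shortened sample reply are hypothetical; this is not the notebook's own parsing code.

```python
import re
import xml.etree.ElementTree as ET
from typing import Tuple


def parse_judge_response(response_text: str) -> Tuple[bool, str]:
    """Return (is_correct, explanation) from the judge's XML-like reply."""
    try:
        # Strict path: works when the reply is well-formed XML.
        root = ET.fromstring(response_text.strip())
        verdict = root.findtext(".//is_correct", default="")
        explanation = root.findtext(".//explanation", default="")
    except ET.ParseError:
        # Fallback path: the explanation contains raw '<' or '>' characters,
        # so pull out the two fields with non-greedy regex matches instead.
        verdict_match = re.search(
            r"<is_correct>\s*(.*?)\s*</is_correct>", response_text, re.DOTALL
        )
        explanation_match = re.search(
            r"<explanation>(.*?)</explanation>", response_text, re.DOTALL
        )
        verdict = verdict_match.group(1) if verdict_match else ""
        explanation = explanation_match.group(1) if explanation_match else ""
    return verdict.strip().lower() == "true", explanation.strip()


# Hypothetical, shortened reply that reproduces the failure mode.
sample_reply = """<evaluation>
<content>
<explanation>The correct command uses --region=<region> for Bedrock.</explanation>
<is_correct>false</is_correct>
</content>
</evaluation>"""

is_correct, explanation = parse_judge_response(sample_reply)
print(is_correct, "|", explanation)
```

A missing or unparseable verdict is treated as `False` here, so a malformed reply can only lower the measured accuracy, never inflate it.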


In [27]:
!cat evaluation/json_results/evaluation_results_one.json 

{
  "name": "Basic RAG",
  "average_precision": 0.3933333333333335,
  "average_recall": 0.6183333333333334,
  "average_f1": 0.48081274025260856,
  "average_mrr": 0.7333333333333334,
  "end_to_end_accuracy": 0.27
}
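
A quick arithmetic check on these numbers: the reported `average_f1` matches the harmonic mean of `average_precision` and `average_recall`. That suggests F1 is computed from the averaged metrics rather than averaged per question, though this is an inference from the printed values and not something the output states. The variable names below are our own.

```python
# Values copied from evaluation_results_one.json above.
average_precision = 0.3933333333333335
average_recall = 0.6183333333333334
reported_f1 = 0.48081274025260856

# F1 is the harmonic mean of precision and recall.
f1_from_averages = (
    2 * average_precision * average_recall / (average_precision + average_recall)
)

print(f"F1 from averages: {f1_from_averages:.6f}")
print(f"Reported F1:      {reported_f1:.6f}")
```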

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
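
The tokenizers warning above is harmless, but it recurs every time a `!` shell command forks the kernel process. As the warning itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly silences it. A minimal sketch, assuming it is placed in the first cell, before the embedding model is loaded:

```python
import os

# Disable tokenizer parallelism so that forking the process
# (e.g. via `!cat ...`) no longer triggers the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```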


In [28]:
!cat evaluation/csvs/evaluation_results_detailed.csv

question,retrieval_precision,retrieval_recall,retrieval_mrr,e2e_correct
How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?,0.3333333333333333,0.5,1.0,False
"What embeddings provider does Anthropic recommend for customized domain-specific models, and what capabilities does this provider offer?",0.6666666666666666,1.0,1.0,False
"What are some key success metrics to consider when evaluating Claude's performance on a classification task, and how do they relate to choosing the right model to reduce latency?",0.6666666666666666,1.0,1.0,False
What are two ways that Claude for Sheets can improve prompt engineering workflows compared to using chained prompts?,0.3333333333333333,0.5,1.0,False
"What happens if a prompt for the Text Completions API is missing the ""\n\nHuman:"" and ""\n\nAssistant:"" turns?",0.6666666666666666,1.0,1.0,False
How do the additional tokens required for tool use in Claude API requests impact pricing compared to regular API reques

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
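
The detailed CSV is most useful for separating retrieval failures from generation failures. The sketch below uses only the standard library and three rows copied from the output above. It flags questions where retrieval recall was perfect but the end-to-end answer was still marked incorrect, which points at the generation step and not the retriever. In the notebook you would read `evaluation/csvs/evaluation_results_detailed.csv` directly; the inline sample and variable names here are our own.

```python
import csv
import io

# Three rows copied from evaluation_results_detailed.csv above.
sample_csv = '''
question,retrieval_precision,retrieval_recall,retrieval_mrr,e2e_correct
How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?,0.3333333333333333,0.5,1.0,False
"What embeddings provider does Anthropic recommend for customized domain-specific models, and what capabilities does this provider offer?",0.6666666666666666,1.0,1.0,False
What are two ways that Claude for Sheets can improve prompt engineering workflows compared to using chained prompts?,0.3333333333333333,0.5,1.0,False
'''.strip()

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Retrieval found every relevant chunk, yet the final answer was wrong.
retrieval_ok_but_wrong = [
    row["question"]
    for row in rows
    if float(row["retrieval_recall"]) == 1.0 and row["e2e_correct"] == "False"
]

mean_recall = sum(float(row["retrieval_recall"]) for row in rows) / len(rows)

print(f"Mean recall over sample: {mean_recall:.4f}")
for question in retrieval_ok_but_wrong:
    print("Retrieved correctly but answered wrong:", question)
```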
