### Load Questions from Question Lists (File)

In [1]:
%load_ext autoreload
%autoreload 2
%pip install ipywidgets




In [16]:

qa_file = 'output/bedrock-ug_sample_questions.jsonl'
document_name = 'bedrock-ug'
chunk_size = 1000
use_contextual = True

index_name = f"{'contextual_' if use_contextual else ''}{document_name}_{chunk_size}"
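The question file named in `qa_file` is read line by line as JSONL later in this notebook. As a minimal sketch of that loading step, the helper below parses a JSONL stream, skips blank lines, and checks for the three keys the evaluation loop reads (`question`, `ground_truth`, `question_type`). The function name `load_questions`, the `REQUIRED_KEYS` constant, and the sample records are hypothetical and not part of the original notebook.

```python
import io
import json

# Keys the evaluation loop in this notebook reads from each record.
REQUIRED_KEYS = {"question", "ground_truth", "question_type"}


def load_questions(stream):
    """Parse a JSONL stream into a list of dicts, skipping blank lines."""
    questions = []
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate empty lines, e.g. a trailing newline
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        questions.append(record)
    return questions


# Hypothetical sample data standing in for the real file contents.
sample = io.StringIO(
    '{"question": "Q1?", "ground_truth": "A1", "question_type": "simple"}\n'
    "\n"
    '{"question": "Q2?", "ground_truth": "A2", "question_type": "complex"}\n'
)
questions = load_questions(sample)
```

With the real file, the same helper could be called as `with open(qa_file, encoding="utf-8") as f: questions = load_questions(f)`.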


In [3]:
from config import Config
config = Config.load()
config.__dict__

from libs.bedrock_service import BedrockService
from libs.contextual_rag_service import ContextualRAGService
from libs.opensearch_service import OpensearchService
from libs.reranker import RerankerService

import json

In [14]:
evaluate_system_prompt = """
Evaluate the correctness of the generation on a continuous scale from 0 to 1. A generation can be considered correct (Score: 1) if it includes all the key facts from the ground truth and if every fact presented in the generation is factually supported by the ground truth or common sense.
Example:
Query: Can eating carrots improve your vision?
Answer: Yes, eating carrots significantly improves your vision, especially at night. This is why people who eat lots of carrots never need glasses. Anyone who tells you otherwise is probably trying to sell you expensive eyewear or doesn't want you to benefit from this simple, natural remedy. It's shocking how the eyewear industry has led to a widespread belief that vegetables like carrots don't help your vision. People are so gullible to fall for these money-making schemes.
Ground truth: Well, yes and no. Carrots won’t improve your visual acuity if you have less than perfect vision. A diet of carrots won’t give a blind person 20/20 vision. But, the vitamins found in the vegetable can help promote overall eye health. Carrots contain beta-carotene, a substance that the body converts to vitamin A, an important nutrient for eye health. An extreme lack of vitamin A can cause blindness. Vitamin A can prevent the formation of cataracts and macular degeneration, the world’s leading cause of blindness. However, if your vision problems aren’t related to vitamin A, your vision won’t change no matter how many carrots you eat.
Score: 0.1
Reasoning: While the generation mentions that carrots can improve vision, it fails to outline the reason for this phenomenon and the circumstances under which this is the case. The rest of the response contains misinformation and exaggerations regarding the benefits of eating carrots for vision improvement. It deviates significantly from the more accurate and nuanced explanation provided in the ground truth.
"""

eval_tools = {
    "tools": [
        {
            "toolSpec": {
                "name": "CorrectressGrader",
                "description": "Evaluate the correctness of the answer on a continuous scale from 0 to 1 and explain the reasoning behind the score. A generation can be considered correct (Score: 1) if it includes all the key facts from the ground truth and if every fact presented in the generation is factually supported by the ground truth.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "score": {
                                "type": "number",
                                "description": "The correctness score in the range [0.0, 1.0]"
                            },
                            "reason": {
                                "type": "string",
                                "description": "The reasoning behind the score"
                            }
                        },
                        "required": ["score", "reason"]
                    }
                }
            }
        }
    ]
}
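The schema above declares `score` as a `number` but sets no `minimum` or `maximum`, so nothing in the tool definition itself keeps a returned score inside the 0 to 1 range. A minimal sketch of a client-side check, assuming the grader's tool input arrives as a plain dict with the `score` and `reason` fields defined above; `validate_grade` is a hypothetical helper, not part of the original notebook.

```python
def validate_grade(tool_input):
    """Return (score, reason) if tool_input matches the grader schema, else raise ValueError."""
    if not isinstance(tool_input, dict):
        raise ValueError("tool input must be a JSON object")
    for key in ("score", "reason"):
        if key not in tool_input:
            raise ValueError(f"missing required field: {key}")

    score = tool_input["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside [0.0, 1.0]")

    reason = tool_input["reason"]
    if not isinstance(reason, str):
        raise ValueError("reason must be a string")
    return float(score), reason


score, reason = validate_grade({"score": 0.1, "reason": "Mostly unsupported claims."})
```

Calling such a check before a result is stored would surface malformed grades early instead of letting them skew the final scores.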


In [25]:
from tqdm.notebook import tqdm

bedrock_service = BedrockService(config.aws.region, config.aws.profile, config.bedrock.retries, config.bedrock.embed_model_id, config.bedrock.model_id, config.model.max_tokens, config.model.temperature, config.model.top_p)
opensearch_service = OpensearchService(config.aws.region, config.aws.profile, config.opensearch.prefix, config.opensearch.domain_name, config.opensearch.document_name, config.opensearch.user, config.opensearch.password)
reranker_service = RerankerService(config.reranker.aws_region, config.reranker.aws_profile, config.reranker.reranker_model_id, config.bedrock.retries)
rag_service = ContextualRAGService(bedrock_service=bedrock_service, opensearch_service=opensearch_service, reranker_service=reranker_service)

results = []

with open(qa_file, 'r') as f:
    lines = f.readlines()
    for line in tqdm(lines[5:10]):
        question_data = json.loads(line)
        question = question_data['question']
        ground_truth = question_data['ground_truth']
        # Retrieve context and generate an answer, using the settings defined at the top of the notebook.
        generated = rag_service.do(question=question, document_name=document_name, chunk_size=chunk_size, use_hybrid=True, use_contextual=use_contextual, search_limit=5)
        
        token_usage = generated['usage']

        # Ask the grader model to score the generated answer against the ground truth.
        evaluate_user_template = f"""
        Query: {question}
        Answer: {generated['answer']}
        Ground truth: {ground_truth}
        """

        user_prompt = [{"role": "user", "content": [{"text": evaluate_user_template}]}]
        # Low-variance sampling settings so that repeated runs give comparable scores.
        temperature = 0.0
        top_p = 0.5

        response = bedrock_service.converse_with_tools(
            messages=user_prompt,
            system_prompt=evaluate_system_prompt,
            tools=eval_tools,
            temperature=temperature,
            top_p=top_p,
            max_tokens=4096
        )

        stop_reason = response['stopReason']
        # print(response)

        if stop_reason == 'tool_use':
            tool_requests = response['output']['message']['content']
            

            for tool_request in [x for x in tool_requests if 'toolUse' in x]:
                if tool_request['toolUse']['name'] == 'CorrectressGrader':
                    res = tool_request['toolUse']['input']

                    result = {
                         "question": question,
                         "question_type": question_data['question_type'],
                         "generated_answer": generated['answer'],
                         "ground_truth": ground_truth,
                         "score": res['score']
                    }

                    results.append(result)
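The loop above only records a result when the grader stops with `tool_use`, so `results` can end up shorter than the number of questions evaluated. The sketch below isolates that extraction step into a helper that returns `None` when no matching tool call is present, and adds a small per-type summary of the collected scores. The response layout mirrors the fields the loop reads (`stopReason`, `output.message.content`, `toolUse`). The helper names, the mock response, and the sample scores are hypothetical. The tool name is spelled as it is in this notebook.

```python
from collections import defaultdict


def extract_tool_input(response, tool_name):
    """Return the input dict of the first matching toolUse block, or None."""
    if response.get("stopReason") != "tool_use":
        return None
    for block in response["output"]["message"]["content"]:
        tool_use = block.get("toolUse")
        if tool_use and tool_use.get("name") == tool_name:
            return tool_use["input"]
    return None


def mean_score_by_type(results):
    """Average the 'score' field of result dicts, grouped by 'question_type'."""
    totals = defaultdict(lambda: [0.0, 0])  # question_type -> [sum, count]
    for result in results:
        entry = totals[result["question_type"]]
        entry[0] += result["score"]
        entry[1] += 1
    return {qtype: total / count for qtype, (total, count) in totals.items()}


# Hypothetical response, shaped like the fields the loop above reads.
mock_response = {
    "stopReason": "tool_use",
    "output": {"message": {"role": "assistant", "content": [
        {"text": "Grading the answer."},
        {"toolUse": {"toolUseId": "example-id", "name": "CorrectressGrader",
                     "input": {"score": 0.5, "reason": "Partially correct."}}},
    ]}},
}
grade = extract_tool_input(mock_response, "CorrectressGrader")

# Hypothetical scores, only to show the shape of the summary.
summary = mean_score_by_type([
    {"question_type": "complex", "score": 0.5},
    {"question_type": "complex", "score": 1.0},
    {"question_type": "simple", "score": 0.25},
])
```

A `None` return from `extract_tool_input` could be logged or counted, which would make skipped questions visible rather than silently absent from `results`.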


us-west-2



In [26]:
results

[{'question': 'How does the implementation of invoking the Anthropic Claude model differ between the .NET, Go, and Java SDKs for AWS Bedrock, particularly in terms of request formatting and error handling?',
  'question_type': 'complex',
  'generated_answer': 'Based on the provided information, here are the key differences in implementing invocation of the Anthropic Claude model between the .NET, Go, and Java SDKs for AWS Bedrock:\n\n1. Request Formatting:\n\n- .NET: The example doesn\'t show the full request formatting, but it mentions creating a BedrockRuntime client and setting the model ID.\n\n- Go: \n  - Uses a custom `ClaudeRequest` struct to format the request\n  - Explicitly wraps the prompt with "Human: " and "\\n\\nAssistant:" tags\n  - Marshals the struct to JSON before sending\n\n- Java: The example is not provided in the given chunks.\n\n2. API Invocation:\n\n- .NET: Uses `AmazonBedrockRuntimeClient` with `InvokeModel` method\n\n- Go: Uses `bedrockruntime.NewFromConfig(sdk