# Using Amazon Bedrock

This tutorial shows how to run Ragas evaluations with Amazon Bedrock endpoints through LangChain.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
#%pip install git+https://github.com/austinmw/ragas@bedrock
#%pip install "langchain>=0.0.336"

### Load sample dataset

In [3]:
# Sample dataset

from datasets import load_dataset
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
ds = fiqa_eval['baseline']
ds

Dataset({
    features: ['question', 'ground_truths', 'answer', 'contexts'],
    num_rows: 30
})

Or use your own Parquet dataset. Required columns are:
- `question`: `str` — original question
- `ground_truths`: `List[str]` — ground truth answer(s) (accepts a list in case you'd like to provide multiple answer variations)
- `answer`: `str` — generated answer
- `contexts`: `List[str]` — retrieved document chunks

In [4]:
# from datasets import Dataset

# ds = Dataset.from_parquet('/path/to/data/rag_results.parquet')
# ds
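
Before loading your own data, it can help to check that each record matches the column layout listed above. The following is a minimal sketch using only the standard library; `REQUIRED_COLUMNS` and `validate_record` are hypothetical helpers written for illustration and are not part of Ragas or `datasets`. The example record contains placeholder strings only.

```python
from typing import List

# Hypothetical helper, not part of Ragas: expected Python type per required column.
REQUIRED_COLUMNS = {
    "question": str,
    "ground_truths": list,
    "answer": str,
    "contexts": list,
}


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for column, expected_type in REQUIRED_COLUMNS.items():
        if column not in record:
            problems.append(f"missing column: {column}")
            continue
        value = record[column]
        if not isinstance(value, expected_type):
            problems.append(f"{column} should be {expected_type.__name__}")
        elif expected_type is list and not all(isinstance(v, str) for v in value):
            problems.append(f"{column} should contain only strings")
    return problems


# Placeholder record for illustration only
example = {
    "question": "example question",
    "ground_truths": ["example reference answer"],
    "answer": "example generated answer",
    "contexts": ["example retrieved chunk"],
}
print(validate_record(example))  # prints []
```

Running a check like this over every row before calling `Dataset.from_parquet` makes schema problems visible early, rather than partway through an evaluation run.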

In [5]:
# Inspect the dataset
df = ds.to_pandas()
df.head()

| | question | ground_truths | answer | contexts |
|---|---|---|---|---|
| 0 | How to deposit a cheque issued to an associate... | [Have the check reissued to the proper payee.J... | \nThe best way to deposit a cheque issued to a... | [Just have the associate sign the back and the... |
| 1 | Can I send a money order from USPS as a business? | [Sure you can. You can fill in whatever you w... | \nYes, you can send a money order from USPS as... | [Sure you can. You can fill in whatever you w... |
| 2 | 1 EIN doing business under multiple business n... | [You're confusing a lot of things here. Compan... | \nYes, it is possible to have one EIN doing bu... | [You're confusing a lot of things here. Compan... |
| 3 | Applying for and receiving business credit | ["I'm afraid the great myth of limited liabili... | \nApplying for and receiving business credit c... | [Set up a meeting with the bank that handles y... |
| 4 | 401k Transfer After Business Closure | [You should probably consult an attorney. Howe... | \nIf your employer has closed and you need to ... | [The time horizon for your 401K/IRA is essenti... |


Let's import the metrics that we are going to use.

In [6]:
from ragas.metrics import (
    context_recall,
    context_precision,
    answer_relevancy,
    faithfulness,
    answer_similarity,
    answer_correctness,
)
from ragas.metrics.critique import (
    AspectCritique,
    coherence,
    conciseness,
)

Now let's swap out the default `ChatOpenAI` for `BedrockChat`. Initialize a new instance of `BedrockChat` with the `model_id` of the model you want to use. You will also have to switch the embeddings to `BedrockEmbeddings` in the metrics that use them, which in our case is `answer_relevancy`.

To use the new `BedrockChat` LLM instance with Ragas metrics, you have to create a new instance of `RagasLLM` using the `ragas.llms.LangchainLLM` wrapper. It's a simple wrapper around LangChain that makes LangChain LLM/Chat instances compatible with how Ragas metrics use them.

In [7]:
import boto3
from botocore.config import Config
from ragas.llms import LangchainLLM
from langchain.chat_models import BedrockChat
from langchain.embeddings import BedrockEmbeddings

config = {
    # NOTE: This has only been tested with Claude models!
    "model_id": "anthropic.claude-instant-v1", # E.g "anthropic.claude-v2"
    # NOTE: No need to set Temperature: it is set by the individual metrics (to 0.0 or 0.2 depending on the metric)
    "model_kwargs": {
        "max_tokens_to_sample": 1000,
    },
    "embedding_model_id": "amazon.titan-embed-text-v1",
}

boto_config = Config(
    retries={
        "max_attempts": 20,
        "mode": "adaptive"
    },
)

bedrock_inference = boto3.client(
    service_name="bedrock-runtime",
    config=boto_config,
)

# Initialize BedrockChat
bedrock_model = BedrockChat(
    client=bedrock_inference,
    model_id=config["model_id"],
    model_kwargs=config["model_kwargs"],
)

# wrapper around BedrockChat
ragas_bedrock_model = LangchainLLM(bedrock_model)

# Initialize BedrockEmbeddings
bedrock_embeddings = BedrockEmbeddings(
    client=bedrock_inference,
    model_id=config["embedding_model_id"],
)

First we will swap in custom prompts for each metric that have been tuned to perform better with Anthropic Claude models. We can also change other metric settings here. For metrics that support it, the `strictness` parameter runs a metric evaluation multiple times and selects a mean or majority answer (depending on the metric).

In [8]:
from ragas.prompts.langchain import anthropic_prompts

context_recall.human_template = anthropic_prompts.CONTEXT_RECALL_HUMAN
context_recall.ai_template = anthropic_prompts.CONTEXT_RECALL_AI

context_precision.human_template = anthropic_prompts.CONTEXT_PRECISION_HUMAN
context_precision.ai_template = anthropic_prompts.CONTEXT_PRECISION_AI

answer_relevancy.human_template = anthropic_prompts.ANSWER_RELEVANCY_HUMAN
answer_relevancy.ai_template = anthropic_prompts.ANSWER_RELEVANCY_AI

faithfulness.statements_human_template = anthropic_prompts.FAITHFULNESS_STATEMENTS_HUMAN
faithfulness.statements_ai_template = anthropic_prompts.FAITHFULNESS_STATEMENTS_AI
faithfulness.verdict_human_template = anthropic_prompts.FAITHFULNESS_VERDICTS_HUMAN
faithfulness.verdict_ai_template = anthropic_prompts.FAITHFULNESS_VERDICTS_AI

answer_similarity.threshold = None

answer_correctness.answer_similarity = answer_similarity
answer_correctness.faithfulness = faithfulness

coherence.human_template = anthropic_prompts.ASPECT_CRITIQUE_HUMAN
coherence.ai_template = anthropic_prompts.ASPECT_CRITIQUE_AI
coherence.definition = (
    "Does the submission present ideas, information, or arguments in a "
    "logically sequential manner, clearly distinguishing main points from "
    "supporting details? Evaluate the structure rigorously, ensuring each "
    "part contributes directly to the overall message or argument. "
    "Disregard submissions with any tangential or poorly connected content. "
    "Be very strict!"
)
coherence.strictness = 3

conciseness.human_template = anthropic_prompts.ASPECT_CRITIQUE_HUMAN
conciseness.ai_template = anthropic_prompts.ASPECT_CRITIQUE_AI
conciseness.definition = (
    "Evaluate if the submission communicates its ideas or information using "
    "the fewest possible words, without loss of clarity. Reject submissions with "
    "any redundant, repetitive, cut-off, or extraneous details, regardless of their "
    "relevance to the main topic. The focus should be on brevity and directness. "
    "Be very strict!"
)

conciseness.strictness = 3
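
The `strictness` setting used above repeats an evaluation and aggregates the outcomes. The sketch below illustrates the two aggregation styles mentioned (majority vote and mean) with standard-library code. It is not the Ragas implementation; `majority_verdict` and `mean_score` are hypothetical names, and the input values are made up for illustration.

```python
from collections import Counter
from statistics import mean


def majority_verdict(verdicts):
    """Return the most common verdict across repeated runs."""
    return Counter(verdicts).most_common(1)[0][0]


def mean_score(scores):
    """Return the average score across repeated runs."""
    return mean(scores)


# Three hypothetical runs of a yes/no critique (as with strictness = 3)
print(majority_verdict([1, 0, 1]))  # prints 1

# Three hypothetical runs of a continuous score
print(mean_score([0.5, 1.0, 0.75]))  # prints 0.75
```

With binary verdicts, an odd number of runs avoids ties in the majority vote.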

We can also define custom metrics with the `AspectCritique` class:

In [9]:
awareness = AspectCritique(
    name="awareness",
    definition=(
        "Assess the submission's ability to correctly judge context sufficiency. Answer 'Yes' if the context "
        "is insufficient and the model identifies it as such, or if the context is sufficient and the model "
        "answers the question. Answer 'No' if the context is insufficient but the model fails to recognize "
        "this, or if the context is sufficient but the model incorrectly deems it insufficient. The focus "
        "should be on the model's accuracy in evaluating the sufficiency of the context for the given question."
    )
)
awareness.human_template = anthropic_prompts.ASPECT_GT_CRITIQUE_HUMAN
awareness.ai_template = anthropic_prompts.ASPECT_CRITIQUE_AI
awareness.strictness = 1

Next we'll list all of the metrics we want to use.

In [10]:
# NOTE: Comment out any metrics you don't want to use
metrics = [
    context_recall,
    context_precision,
    answer_relevancy,
    faithfulness,
    answer_similarity,
    answer_correctness,
    coherence,
    conciseness,
    awareness,
]

Finally, we'll change the LLM and embedding models used for each metric. Here we will separately list all metrics, since in some cases we'll want to apply the changes even to metrics that are not listed above. This is because some metrics have dependencies on others (e.g. `answer_correctness`).

In [11]:
# List of all metrics
all_metrics = [
    context_recall, context_precision, answer_relevancy,
    faithfulness, answer_similarity, answer_correctness,
    coherence, conciseness, awareness
]

# Set attributes on metrics
for m in all_metrics:
    m.llm = ragas_bedrock_model
    m.embeddings = bedrock_embeddings
    m.batch_size = 15
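
To illustrate why dependent metrics also need the new settings, here is a small sketch that walks from a selected metric to the metrics it holds as attributes. The objects are `SimpleNamespace` stand-ins that only mimic the attribute layout configured earlier (`answer_correctness.faithfulness`, `answer_correctness.answer_similarity`); they are not real Ragas metrics. The helper `with_dependencies` and the rule "any attribute with a `name` is a dependency" are assumptions made for this example.

```python
from types import SimpleNamespace

# Stand-in objects, NOT real Ragas metrics
faithfulness = SimpleNamespace(name="faithfulness")
answer_similarity = SimpleNamespace(name="answer_similarity")
answer_correctness = SimpleNamespace(
    name="answer_correctness",
    faithfulness=faithfulness,
    answer_similarity=answer_similarity,
)


def with_dependencies(metrics):
    """Return the given metrics plus any metric-like attributes they hold."""
    seen = []
    queue = list(metrics)
    while queue:
        metric = queue.pop(0)
        # Compare by identity so each object is visited once
        if any(metric is s for s in seen):
            continue
        seen.append(metric)
        # Assumption: an attribute that itself has a `name` is a dependency
        for value in vars(metric).values():
            if hasattr(value, "name"):
                queue.append(value)
    return seen


selected = [answer_correctness]
for m in with_dependencies(selected):
    m.llm = "placeholder-llm"  # the tutorial would assign ragas_bedrock_model here

print([m.name for m in with_dependencies(selected)])
# prints ['answer_correctness', 'faithfulness', 'answer_similarity']
```

Listing every metric explicitly, as the tutorial does, is simpler and avoids relying on how a metric stores its dependencies.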

### Evaluation

Running the evaluation is as simple as calling `evaluate` on the `Dataset` with the metrics of your choice.

In [12]:
import warnings
warnings.filterwarnings('ignore', message=".*promote has been superseded by mode='default'.*")

# NOTE: nest_asyncio is only needed when running in a Jupyter notebook; otherwise comment out or remove these two lines.
import nest_asyncio
nest_asyncio.apply()

In [13]:
%%time

from ragas import evaluate

result = evaluate(ds, metrics=metrics)
result

evaluating with [context_recall]


100%|██████████| 2/2 [00:46<00:00, 23.19s/it]


evaluating with [context_precision]


100%|██████████| 2/2 [00:03<00:00,  1.80s/it]


evaluating with [answer_relevancy]


100%|██████████| 2/2 [00:19<00:00,  9.52s/it]


evaluating with [faithfulness]


100%|██████████| 2/2 [00:26<00:00, 13.05s/it]


evaluating with [answer_similarity]


100%|██████████| 2/2 [00:05<00:00,  2.80s/it]


evaluating with [answer_correctness]


100%|██████████| 2/2 [00:33<00:00, 16.59s/it]


evaluating with [coherence]


100%|██████████| 2/2 [00:29<00:00, 14.61s/it]


evaluating with [conciseness]


100%|██████████| 2/2 [00:22<00:00, 11.37s/it]


evaluating with [awareness]


100%|██████████| 2/2 [00:09<00:00,  4.99s/it]

CPU times: user 2.4 s, sys: 192 ms, total: 2.59 s
Wall time: 3min 15s





{'context_recall': 0.6840, 'context_precision': 0.8000, 'answer_relevancy': 0.8484, 'faithfulness': 0.9178, 'answer_similarity': 0.6880, 'answer_correctness': 0.6826, 'coherence': 1.0000, 'conciseness': 0.2667, 'awareness': 0.7000}

In [14]:
# Get the average results
average_results = result.copy()
average_results

{'context_recall': 0.6839682539682539,
 'context_precision': 0.8,
 'answer_relevancy': 0.8484158299742813,
 'faithfulness': 0.9177777777777778,
 'answer_similarity': 0.6879575691920328,
 'answer_correctness': 0.6825502131674449,
 'coherence': 1.0,
 'conciseness': 0.26666666666666666,
 'awareness': 0.7}

And there you have it, all the scores you need. If you want to dig into the results and find examples where your pipeline performed poorly or particularly well, you can convert them into a pandas DataFrame and use your standard analytics tools.

In [15]:
df = result.to_pandas()
df.head()

| | question | contexts | answer | ground_truths | context_recall | context_precision | answer_relevancy | faithfulness | answer_similarity | answer_correctness | coherence | conciseness | awareness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How to deposit a cheque issued to an associate... | [Just have the associate sign the back and the... | \nThe best way to deposit a cheque issued to a... | [Have the check reissued to the proper payee.J... | 0.5 | 1.0 | 0.941635 | 0.666667 | 0.740307 | 0.53682 | 1 | 1 | 1 |
| 1 | Can I send a money order from USPS as a business? | [Sure you can. You can fill in whatever you w... | \nYes, you can send a money order from USPS as... | [Sure you can. You can fill in whatever you w... | 1.0 | 1.0 | 0.96212 | 1.0 | 0.871439 | 0.935719 | 1 | 0 | 1 |
| 2 | 1 EIN doing business under multiple business n... | [You're confusing a lot of things here. Compan... | \nYes, it is possible to have one EIN doing bu... | [You're confusing a lot of things here. Compan... | 0.8 | 1.0 | 0.917308 | 1.0 | 0.610433 | 0.63855 | 1 | 0 | 0 |
| 3 | Applying for and receiving business credit | [Set up a meeting with the bank that handles y... | \nApplying for and receiving business credit c... | ["I'm afraid the great myth of limited liabili... | 1.0 | 1.0 | 0.667156 | 1.0 | 0.662891 | 0.831446 | 1 | 1 | 1 |
| 4 | 401k Transfer After Business Closure | [The time horizon for your 401K/IRA is essenti... | \nIf your employer has closed and you need to ... | [You should probably consult an attorney. Howe... | 0.0 | 1.0 | 0.814123 | 1.0 | 0.259844 | 0.254922 | 1 | 0 | 0 |
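
As a sketch of the kind of analysis this enables, the snippet below finds the weakest rows by score. To stay self-contained it builds a small DataFrame from the `faithfulness` and `answer_correctness` values of the five rows shown above; in the notebook, the `df` returned by `result.to_pandas()` would take the place of `scores`. The 0.8 threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Values copied from the five rows shown above
scores = pd.DataFrame(
    {
        "faithfulness": [0.666667, 1.0, 1.0, 1.0, 1.0],
        "answer_correctness": [0.53682, 0.935719, 0.63855, 0.831446, 0.254922],
    }
)

# The two rows with the lowest answer_correctness
worst = scores.nsmallest(2, "answer_correctness")
print(worst.index.tolist())  # prints [4, 0]

# Rows where faithfulness falls below the chosen threshold
low_faith = scores[scores["faithfulness"] < 0.8]
print(low_faith.index.tolist())  # prints [0]
```

The row indices can then be used to look up the corresponding question, answer, and contexts for manual inspection.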


### Logs

You can access the logs for each metric from the metric objects themselves. For example:

In [16]:
context_recall.logs.keys()

dict_keys(['prompts', 'responses', 'sentences', 'scores'])