# Evaluate Bedrock Imported Models

This notebook walks through evaluating custom models that were fine-tuned on SageMaker, EC2, or elsewhere and then imported into Amazon Bedrock.

To evaluate the models imported into Bedrock, we will use the [FMEval](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-foundation-model-evaluate-auto-lib-custom.html) library. In this notebook we evaluate an LLM fine-tuned for question answering, using the [SpeedOfMagic/trivia_qa_tiny](https://huggingface.co/datasets/SpeedOfMagic/trivia_qa_tiny) dataset.

Before proceeding further, please review the [FMEval License](https://github.com/aws/fmeval/blob/main/LICENSE) and the [SpeedOfMagic/trivia_qa_tiny License](https://huggingface.co/datasets/SpeedOfMagic/trivia_qa_tiny).

## Install the prerequisites

In [None]:
!rm -Rf ~/.cache/pip/*
!rm -Rf /opt/conda/lib/python3.10/site-packages/fsspec*
!rm -Rf /opt/conda/lib/python3.10/site-packages/pytz*
!pip3 uninstall autogluon -y
!pip3 install fmeval --upgrade-strategy only-if-needed --force-reinstall --quiet
!pip3 install pyarrow --upgrade --quiet

## Prepare the Dataset

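We will download the [SpeedOfMagic/trivia_qa_tiny](https://huggingface.co/datasets/SpeedOfMagic/trivia_qa_tiny) dataset and save its `test` split to disk as JSON Lines, which FMEval reads during evaluation. The cell also defines a `create_conversation` helper that converts a row to the OAI (OpenAI) chat messages format; it is not applied here, because the evaluation reads the `question` and `answer` fields directly.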

In [None]:
import os

from datasets import load_dataset

def create_conversation(row):
    """Convert a row to the OAI chat "messages" format (not applied below)."""
    row["messages"] = [
        {"role": "user", "content": row["question"]},
        {"role": "assistant", "content": row["answer"]["value"]},
    ]
    return row

# Load dataset from the hub
dataset = load_dataset("SpeedOfMagic/trivia_qa_tiny")

# Make sure the output directory exists, then save the test split as JSON Lines
os.makedirs("data", exist_ok=True)
dataset["test"].to_json("data/test_dataset.json", orient="records", force_ascii=False)
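
The `create_conversation` helper is defined but never mapped over the dataset. As a minimal sketch of what it would produce, the snippet below applies the same transformation to a single hand-written row. The sample row and the nested `answer.value` layout are assumptions for illustration, inferred from the helper rather than checked against the dataset.

```python
def create_conversation(row):
    """Attach an OAI-style "messages" list built from the question and answer."""
    row["messages"] = [
        {"role": "user", "content": row["question"]},
        {"role": "assistant", "content": row["answer"]["value"]},
    ]
    return row

# Hypothetical row; real rows come from the downloaded dataset.
sample_row = {
    "question": "Which planet is known as the Red Planet?",
    "answer": {"value": "Mars"},
}

# Work on a copy so the sample row itself is left untouched.
converted = create_conversation(dict(sample_row))
print(converted["messages"])
```

With the Hugging Face `datasets` library, the same function could be applied to every row via `dataset.map(create_conversation)`.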

Let's check that the dataset file exists on disk.

In [None]:
import glob

# Check that the dataset file to be used by the evaluation is present
if not glob.glob("data/test_dataset.json"):
    print("ERROR - please make sure the file, data/test_dataset.json, exists.")
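
Beyond checking that the file exists, it can help to confirm that each record carries the fields the evaluation will read. The following is a rough sketch using only the standard library; `missing_fields` is a hypothetical helper, and the demo records are made up.

```python
import json
import os
import tempfile

def missing_fields(path, required=("question", "answer"), limit=5):
    """Return (line_number, missing_keys) for up to `limit` records lacking a required key."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            missing = [key for key in required if key not in record]
            if missing:
                problems.append((line_number, missing))
                if len(problems) >= limit:
                    break
    return problems

# Demo on a throwaway JSON Lines file with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"question": "Q1", "answer": "A1"}) + "\n")
        f.write(json.dumps({"question": "Q2"}) + "\n")
    demo_problems = missing_fields(demo_path)

print(demo_problems)  # [(2, ['answer'])]
```

Calling `missing_fields("data/test_dataset.json")` on the real file should return an empty list if every record has both fields.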

## FMEval Setup

In [None]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy

### Define the parameters

In [None]:
model_id = "<<bedrock_imported_model_arn>>"

### Dataset Config 

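Below we set up a `DataConfig` for the local dataset file written above. `model_input_location` and `target_output_location` name the fields that hold the question and the reference answer.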

In [None]:
config = DataConfig(
    dataset_name="trivia_qa_tiny",
    dataset_uri="data/test_dataset.json",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer"
)
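
The two location parameters are path expressions evaluated against each record (to my understanding, FMEval treats them as JMESPath queries). The sketch below imitates only the simplest case, dot-separated keys, with a hypothetical `lookup` helper, to show what a location selects. The record is made up, and its nested `answer` layout is an assumption based on the `create_conversation` helper.

```python
def lookup(record, path):
    """Resolve a dot-separated path such as "answer.value" against nested dicts."""
    current = record
    for key in path.split("."):
        current = current[key]
    return current

# Hypothetical record with a nested answer object.
record = {
    "question": "Which planet is known as the Red Planet?",
    "answer": {"value": "Mars", "aliases": ["Mars"]},
}

print(lookup(record, "question"))      # the question string
print(lookup(record, "answer"))        # the whole nested object
print(lookup(record, "answer.value"))  # just the answer string
```

If the saved records do nest the answer this way, a location such as `answer.value` may be needed so that the target is a plain string. Inspecting one line of `data/test_dataset.json` is the quickest way to confirm.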

### Model Runner Setup

We will create a Bedrock Model Runner below, which will be used to perform inference on the dataset file described by the Dataset Config above.

In [None]:
bedrock_model_runner = BedrockModelRunner(
    model_id=model_id,
    output='outputs[0].text',
    content_template='{"prompt": $prompt, "max_tokens": 500}',
)
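
The `content_template` is a JSON request body with a `$prompt` placeholder, and `output` is a path into the model's JSON response. The sketch below approximates both steps with the standard library only: `string.Template` for the substitution and plain indexing for the path. The response shape is made up to match `outputs[0].text`, and none of this is FMEval's internal code.

```python
import json
from string import Template

content_template = '{"prompt": $prompt, "max_tokens": 500}'

def build_request_body(prompt):
    """Fill the template and parse the result back into a dict."""
    # json.dumps adds the surrounding quotes and escapes inner ones,
    # which is why $prompt appears unquoted in the template.
    filled = Template(content_template).substitute(prompt=json.dumps(prompt))
    return json.loads(filled)

body = build_request_body('[INST]Who wrote "Hamlet"?[/INST]')
print(body)

# Hypothetical response shaped so that the path outputs[0].text resolves.
response = {"outputs": [{"text": "William Shakespeare", "stop_reason": "stop"}]}
answer_text = response["outputs"][0]["text"]
print(answer_text)
```

If the imported model expects a different request schema or returns a different response shape, both `content_template` and `output` would need to change to match.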

### Run Evaluation

We will use the [QA Accuracy Evaluation Algorithm](https://aws.github.io/fmeval/fmeval/eval_algorithms.html#EvalAlgorithm.QA_ACCURACY) to evaluate the model, since it has been fine-tuned for question answering.

The FMEval library currently supports the following evaluation algorithms:
-  prompt_stereotyping
-  factual_knowledge
-  toxicity
-  qa_toxicity
-  summarization_toxicity
-  general_semantic_robustness
-  accuracy
-  qa_accuracy
-  qa_accuracy_semantic_robustness
-  summarization_accuracy
-  summarization_accuracy_semantic_robustness
-  classification_accuracy
-  classification_accuracy_semantic_robustness

In [None]:
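import warnings
# Record warnings raised during evaluation instead of printing them
with warnings.catch_warnings(record=True) as w:
    eval_algo = QAAccuracy()
    # save=True writes per-record results to disk (see eval_output[0].output_path)
    eval_output = eval_algo.evaluate(model=bedrock_model_runner, dataset_config=config,
                                     prompt_template="[INST]$model_input[/INST]", save=True)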
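
The `prompt_template` wraps each question before it is sent to the model: `$model_input` is replaced by the value found at `model_input_location`. A minimal sketch of that substitution with `string.Template` follows (an illustration, not FMEval's own code; `build_prompt` is a hypothetical helper).

```python
from string import Template

prompt_template = "[INST]$model_input[/INST]"

def build_prompt(model_input):
    """Insert one question into the instruction-style template."""
    return Template(prompt_template).substitute(model_input=model_input)

prompt = build_prompt("Which planet is known as the Red Planet?")
print(prompt)  # [INST]Which planet is known as the Red Planet?[/INST]
```

The `[INST]...[/INST]` markers follow the instruction convention used by Llama 2 and Mistral style models. A model fine-tuned with a different chat format would likely need a matching template.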

#### Parse Evaluation Results

In [None]:
for op in eval_output:
    print(f"Eval Name: {op.eval_name}")
    for score in op.dataset_scores:
        print(f"{score.name} : {score.value}")
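
The word-overlap scores reported by QA Accuracy (`f1_score`, `precision_over_words`, `recall_over_words`, `exact_match_score`) can be illustrated with a small sketch. It computes precision, recall, and F1 over lowercased, whitespace-separated words, plus a strict exact match. FMEval's own implementation applies its own normalization and also defines a quasi-exact match, so numbers from this simplified version may differ.

```python
from collections import Counter

def word_overlap_scores(prediction, target):
    """Simplified word-level precision, recall, F1, and exact match."""
    pred_words = prediction.lower().split()
    target_words = target.lower().split()
    # Multiset intersection counts words shared by prediction and target.
    overlap = sum((Counter(pred_words) & Counter(target_words)).values())
    precision = overlap / len(pred_words) if pred_words else 0.0
    recall = overlap / len(target_words) if target_words else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision_over_words": precision,
        "recall_over_words": recall,
        "f1_score": f1,
        "exact_match_score": float(prediction == target),
    }

scores = word_overlap_scores("The answer is Mars", "Mars")
print(scores)
```

Here the prediction contains the target word but also three extra words, so recall is 1.0 while precision drops to 0.25. That pattern is common when a model answers in full sentences.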

In [None]:
# Create a Pandas DataFrame to visualize the results
import pandas as pd
import json

data = []

# We obtain the path to the results file from "output_path" in the cell above
with open(eval_output[0].output_path, "r") as file:
    for line in file:
        data.append(json.loads(line))
df = pd.DataFrame(data)
df['f1_score'] = df['scores'].apply(lambda x: x[0]['value'])
df['exact_match_score'] = df['scores'].apply(lambda x: x[1]['value'])
df['quasi_exact_match_score'] = df['scores'].apply(lambda x: x[2]['value'])
df['precision_over_words'] = df['scores'].apply(lambda x: x[3]['value'])
df['recall_over_words'] = df['scores'].apply(lambda x: x[4]['value'])
df = df.drop(['scores'], axis=1)
df
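
The cell above picks scores by list position, which depends on the entries always appearing in the same order. A sketch of a name-based alternative follows, assuming each entry in `scores` is a dict with `name` and `value` keys. The records and the `flatten_scores` helper are made up for illustration.

```python
def flatten_scores(record):
    """Copy a record, replacing its "scores" list with one key per score name."""
    flat = {key: value for key, value in record.items() if key != "scores"}
    for score in record.get("scores", []):
        flat[score["name"]] = score["value"]
    return flat

# Hypothetical per-record results; note the differing order of score entries.
records = [
    {"model_input": "Q1", "scores": [
        {"name": "f1_score", "value": 0.4},
        {"name": "exact_match_score", "value": 0.0},
    ]},
    {"model_input": "Q2", "scores": [
        {"name": "exact_match_score", "value": 1.0},
        {"name": "f1_score", "value": 1.0},
    ]},
]

rows = [flatten_scores(record) for record in records]
print(rows)
```

Passing `rows` to `pd.DataFrame` would then give one column per score name, regardless of the order in which the scores were written.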