## Evaluating Factual Knowledge with Bring-Your-Own Model Outputs

This example takes a JumpStart endpoint, runs inference over an entire dataset, and then runs an evaluation on the stored results. It is intended for use cases where the model output field is already populated and we want to run an evaluation algorithm that compares the model output with the target output.

Environment:
- conda_python3 kernel
- Studio Notebook instance type: ml.g4dn.xlarge

In [None]:
#!pip3 install sagemaker

#!pip3 install -U pyarrow
#!pip3 install -U accelerate
#!pip3 install "ipywidgets>=8"
#!pip3 install jsonlines

In [None]:
import glob

# Check for beta wheel and built-in dataset
if not glob.glob("fmeval-0.1.0-py3-none-any.whl"):
    print("ERROR - please make sure file exists: fmeval-0.1.0-py3-none-any.whl")

if not glob.glob("tiny_dataset.jsonl"):
    print("ERROR - please make sure file exists: tiny_dataset.jsonl")

In [2]:
#
# Install the fmeval-*-py3-none-any.whl distribution.
#

#!rm -Rf ~/.cache/pip/*

#!pip3 install fmeval-0.1.0-py3-none-any.whl --upgrade --upgrade-strategy only-if-needed --force-reinstall
#!pip3 install boto3==1.28.65

### JumpStart Model Setup & Endpoint Creation

In [None]:
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel

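# Needed for the FMEval Model Runner Config
model_id, model_version = (
    "huggingface-llm-falcon-7b-instruct-bf16",
    "*",
)

# Set this to the name of an existing endpoint. To deploy a new endpoint instead,
# comment out this line so that the NameError branch in the next cell runs.
endpoint_name = "Enter endpoint name here if already existing"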

In [None]:
%%time

try:
    # If endpoint_name is defined, reuse the existing endpoint
    endpoint_name

except NameError:
    # Otherwise deploy a new JumpStart endpoint for the chosen model
    my_model = JumpStartModel(model_id=model_id)
    predictor = my_model.deploy()

#### Sample Inference

In [None]:
prompt = "Tell me about Amazon SageMaker."
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
    },
}

In [None]:
%%time

import boto3
import json
content_type = "application/json"
runtime = boto3.client("sagemaker-runtime")
    
try:
    endpoint_name = predictor.endpoint_name
    response = predictor.predict(payload)
    print(response[0]["generated_text"])

except NameError:
    # if you have an existing endpoint
    print(f"Utilizing invoke_endpoint API call for existing endpoint: {endpoint_name}")
    response = runtime.invoke_endpoint(EndpointName = endpoint_name, Body = json.dumps(payload), ContentType = content_type)
    result = json.loads(response['Body'].read().decode())
    print(result[0]['generated_text'])
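
The cell above assumes the response body is a JSON list whose first element carries a `generated_text` field. As a minimal sketch of that parsing step with explicit shape checks, the hypothetical helper `extract_generated_text` below (not part of boto3 or SageMaker) works on raw bytes; the sample body is hand-written and stands in for `response["Body"].read()`.

```python
import json


def extract_generated_text(body: bytes) -> str:
    """Return the generated text from a body shaped like [{"generated_text": "..."}]."""
    result = json.loads(body.decode("utf-8"))
    # The endpoint used above returns a list with one dict per generation
    if not isinstance(result, list) or not result:
        raise ValueError(f"Unexpected response shape: {result!r}")
    first = result[0]
    if not isinstance(first, dict) or "generated_text" not in first:
        raise ValueError(f"Missing 'generated_text' in response: {first!r}")
    return first["generated_text"]


# Hand-written stand-in for response["Body"].read(); not real endpoint output
sample_body = json.dumps(
    [{"generated_text": "Amazon SageMaker is a managed ML service."}]
).encode("utf-8")
print(extract_generated_text(sample_body))
```

Raising a `ValueError` on an unexpected shape makes a misconfigured payload format easier to spot than a bare `KeyError` or `IndexError` in the middle of a dataset loop.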

### Model Inference on Dataset

In this case we need to run our model across the dataset prior to configuring an evaluation.

In [None]:
def shape_payload(question: str, payload_shape: dict = None) -> dict:
    """
    Shape the payload for model inference.

    Args:
        question (str): Question for the LLM
        payload_shape (dict): Optional payload template; adjust for the format your LLM expects.
            Defaults to the sampling parameters used in the sample inference above.

    Returns:
        dict: A new payload with the question set as the "inputs" field
    """

    if len(question) == 0:
        raise ValueError("Empty question, please provide a full length question.")

    if payload_shape is None:
        # Build the default per call instead of using a mutable default argument
        payload_shape = {
            "parameters": {
                "do_sample": True,
                "top_p": 0.9,
                "temperature": 0.8,
                "max_new_tokens": 1024,
            },
        }

    # Copy the template so the caller's dict is not modified, then set the question
    payload = dict(payload_shape)
    payload["inputs"] = question
    return payload

In [None]:
import jsonlines

input_file = "tiny_dataset.jsonl"
output_file = "updated_tiny_dataset.jsonl"

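# Open tiny_dataset (or your own dataset) and add a "model_output" field holding each inference result.
# Records without a "question" field are skipped and not written to the output file.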
with jsonlines.open(input_file) as lines, jsonlines.open(output_file, "w") as predictions:
    for line in lines:
        if "question" in line:
            question = line["question"]
            formatted_input = shape_payload(question)
            response = runtime.invoke_endpoint(EndpointName = endpoint_name, Body = json.dumps(formatted_input), ContentType = content_type)
            result = json.loads(response['Body'].read().decode())
            answer = result[0]['generated_text']
            line["model_output"] = answer
            predictions.write(line)
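
Before configuring the evaluation, it can help to confirm that every record in the output file carries the fields used later (`question`, `answer`, `model_output`). The following is a minimal sketch using only the standard library; `find_incomplete_records` is a hypothetical helper, and the demo records are made up for illustration rather than taken from tiny_dataset.

```python
import json
import os
import tempfile

# Field names taken from the dataset loop above and the data config used later
REQUIRED_FIELDS = ("question", "answer", "model_output")


def find_incomplete_records(path: str, required=REQUIRED_FIELDS) -> list:
    """Return (line_number, missing_fields) for each record lacking a required field."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, raw in enumerate(f, start=1):
            if not raw.strip():
                continue  # ignore blank lines
            record = json.loads(raw)
            # Treat absent or empty values as missing
            missing = [name for name in required if not record.get(name)]
            if missing:
                problems.append((line_number, missing))
    return problems


# Illustrative records only; the second one has no model output
demo_records = [
    {"question": "London is the capital of", "answer": "UK<OR>England", "model_output": "England"},
    {"question": "Paris is the capital of", "answer": "France"},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for record in demo_records:
        tmp.write(json.dumps(record) + "\n")
    demo_path = tmp.name

problems = find_incomplete_records(demo_path)
print(problems)
os.remove(demo_path)
```

In the notebook, the same function could be pointed at `updated_tiny_dataset.jsonl` once the inference loop has finished.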

### FMEval Setup

In [None]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

#### Data Config Setup

You can either bring your own dataset or use one of the built-in datasets, such as tiny_dataset. Here we use the JSON Lines dataset created above, which already contains the model output.

In [None]:
config = DataConfig(
    dataset_name="tiny_dataset_model_answers",
    dataset_uri="updated_tiny_dataset.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer",
    model_output_location="model_output"
)

#### Evaluation Result Configuration

By default, results are written to the tmp directory: /tmp/eval_results/factual_knowledge_tiny_dataset.jsonl. Here we instead create our own results directory and set its path in the EVAL_RESULTS_PATH environment variable.

In [None]:
import os
eval_dir = "results-evaluation-model-output"
curr_dir = os.getcwd()
eval_results_path = os.path.join(curr_dir, eval_dir) + "/"
os.environ["EVAL_RESULTS_PATH"] = eval_results_path
if os.path.exists(eval_results_path):
    print(f"Directory '{eval_results_path}' exists.")
else:
    os.mkdir(eval_results_path)
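
The directory setup above can also be written as an idempotent helper. This is only a sketch: `prepare_results_dir` is a hypothetical name, and the demo runs inside a temporary directory so that it leaves no files behind and does not touch `EVAL_RESULTS_PATH`.

```python
import os
import tempfile
from pathlib import Path


def prepare_results_dir(base_dir: str, name: str) -> str:
    """Create base_dir/name if needed and return its path with a trailing separator."""
    results_dir = Path(base_dir) / name
    # exist_ok=True makes repeated calls safe, so no explicit existence check is needed
    results_dir.mkdir(parents=True, exist_ok=True)
    return str(results_dir) + os.sep


with tempfile.TemporaryDirectory() as base:
    results_path = prepare_results_dir(base, "results-evaluation-model-output")
    # A second call is a no-op rather than an error
    prepare_results_dir(base, "results-evaluation-model-output")
    created = os.path.isdir(results_path)
    print(created)
```

In the notebook, the returned path would then be assigned to `os.environ["EVAL_RESULTS_PATH"]`, as in the cell above.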

### Run Evaluation

Here we run the evaluation without a model runner because the dataset already contains the model outputs. For use cases such as this, we simply pass the dataset config and a prompt template.

In [None]:
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))
eval_output = eval_algo.evaluate(dataset_config=config, prompt_template="$feature", save=True)
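
To build intuition for what the `<OR>` delimiter does, here is a simplified sketch of a containment-style score: the target is split on the delimiter, and the record scores 1 if any alternative appears in the model output, ignoring case. This is an assumption about the general idea, not fmeval's actual implementation, and `factual_knowledge_score` is a hypothetical name.

```python
def factual_knowledge_score(model_output: str, target_output: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any acceptable target appears in the model output (case-insensitive), else 0."""
    # Each delimiter-separated piece is one acceptable answer
    candidates = [t.strip().lower() for t in target_output.split(delimiter)]
    output = model_output.lower()
    # Skip empty candidates so a stray delimiter cannot match everything
    return int(any(c and c in output for c in candidates))


hit = factual_knowledge_score("The capital is London, England.", "UK<OR>England")
miss = factual_knowledge_score("I am not sure.", "UK<OR>England")
print(hit, miss)
```

A check like this rewards any response that mentions an accepted answer, so long or hedged generations can still score 1.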

#### Parse Evaluation Results

In [None]:
eval_output

In [None]:
import json
print(json.dumps(eval_output, default=vars, indent=4))

In [None]:
import pandas as pd

data = []
with open("results-evaluation-model-output/factual_knowledge_tiny_dataset_model_answers.jsonl", "r") as file:
    for line in file:
        data.append(json.loads(line))
df = pd.DataFrame(data)
df['eval_algo'] = df['scores'].apply(lambda x: x[0]['name'])
df['eval_score'] = df['scores'].apply(lambda x: x[0]['value'])
df
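
The per-record results can also be aggregated without pandas. The sketch below assumes each line of the results file holds a `scores` list of `{"name": ..., "value": ...}` entries, as the DataFrame code above implies; `mean_scores` is a hypothetical helper, and the sample lines are made up rather than real evaluation output.

```python
import json
from collections import defaultdict


def mean_scores(jsonl_lines) -> dict:
    """Average each named score across JSON Lines records that carry a 'scores' list."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for raw in jsonl_lines:
        if not raw.strip():
            continue  # ignore blank lines
        record = json.loads(raw)
        for score in record["scores"]:
            totals[score["name"]] += score["value"]
            counts[score["name"]] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Illustrative records only, mirroring the structure read by the cell above
sample_lines = [
    json.dumps({"scores": [{"name": "factual_knowledge", "value": value}]})
    for value in (1, 0, 1, 1)
]
averages = mean_scores(sample_lines)
print(averages)
```

Passing an open file handle for the results file instead of `sample_lines` would work the same way, since the function only iterates over lines.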