## Evaluating a model on factual knowledge using pre-existing model outputs

In this example, we use a JumpStart endpoint to run inference on an entire dataset and store the results, then run a factual knowledge evaluation on the stored outputs. This mimics use cases where the dataset used for the evaluation already contains the model outputs.

Environment:
- Base Python 3.0 kernel
- Studio Notebook instance type: ml.g4dn.xlarge

### Setup

In [None]:
# Install the fmeval package

!rm -Rf ~/.cache/pip/*
!pip3 install fmeval --upgrade-strategy only-if-needed --force-reinstall

In [None]:
import glob

# Check that the dataset file to be used by the evaluation is present
if not glob.glob("trex_sample.jsonl"):
    print("ERROR - please make sure the file, trex_sample.jsonl, exists.")

### JumpStart endpoint creation

In [None]:
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel


# Uncomment the line below and fill in the endpoint name if you have an existing endpoint.
# endpoint_name = "Enter your endpoint name here"


# Uncomment the lines below to deploy a new endpoint.
# model_id = "huggingface-llm-falcon-7b-instruct-bf16"
# my_model = JumpStartModel(model_id=model_id)
# predictor = my_model.deploy()

#### Sample endpoint invocation

In [None]:
prompt = "Tell me about Amazon SageMaker."
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
    },
}

In [None]:
%%time

import boto3
import json
content_type = "application/json"
runtime = boto3.client("sagemaker-runtime")
    
try:
    # Invoke the existing endpoint
    print(f"Utilizing invoke_endpoint API call for existing endpoint: {endpoint_name}")
    response = runtime.invoke_endpoint(EndpointName=endpoint_name, Body=json.dumps(payload), ContentType=content_type)
    result = json.loads(response['Body'].read().decode())
    print(result[0]['generated_text'])
    
except NameError:
    # Invoke the predictor that we created earlier
    endpoint_name = predictor.endpoint_name
    response = predictor.predict(payload)
    print(response[0]["generated_text"])

### Performing model inference on the full dataset

Before configuring a factual knowledge evaluation, we will create a new dataset file that contains the same data as the original dataset, but with model outputs added in.

In [None]:
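from typing import Optional


def create_payload(prompt: str, parameters: Optional[dict] = None) -> dict:
    """
    Creates a model invocation payload.

    Args:
        prompt (str): Prompt for the LLM
        parameters (dict): Customizable model invocation parameters.
            Defaults to sampling with top_p=0.9, temperature=0.8, max_new_tokens=32.

    Returns:
        Payload to be used when invoking the model.
    """
    if parameters is None:
        # Build the defaults on each call to avoid sharing a mutable default argument
        parameters = {"do_sample": True, "top_p": 0.9, "temperature": 0.8, "max_new_tokens": 32}

    if len(prompt) == 0:
        raise ValueError("Please provide a non-empty prompt.")

    return {
        "inputs": prompt,
        "parameters": parameters
    }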

In [None]:
!pip install jsonlines

In [None]:
import jsonlines

input_file = "trex_sample.jsonl"
output_file = "trex_sample_with_model_outputs.jsonl"

# For each line in `input_file` that has a "question" field, invoke the model with that question,
# augment the line with the invocation result, and write the augmented line to `output_file`.
# Lines without a "question" field are skipped and do not appear in `output_file`.
with jsonlines.open(input_file) as input_fh, jsonlines.open(output_file, "w") as output_fh:
    for line in input_fh:
        if "question" in line:
            question = line["question"]
            print(f"Question: {question}")
            payload = create_payload(question)
            response = runtime.invoke_endpoint(EndpointName=endpoint_name, Body=json.dumps(payload), ContentType=content_type)
            result = json.loads(response['Body'].read().decode())
            model_output = result[0]['generated_text']
            print(f"Model output: {model_output}")
            print("==============================")
            line["model_output"] = model_output
            output_fh.write(line)
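
The augmentation logic in the loop above can be exercised without a live endpoint. The sketch below is a minimal offline illustration, assuming the endpoint response has the `[{"generated_text": ...}]` shape used in this notebook. `fake_invoke` and `augment_records` are hypothetical names introduced here, and the sample records are made up.

```python
import json


def fake_invoke(payload: dict) -> str:
    """Hypothetical stand-in for the endpoint call; returns a JSON response body."""
    return json.dumps([{"generated_text": f"Echo: {payload['inputs']}"}])


def augment_records(records, invoke=fake_invoke):
    """Return copies of the records that have a "question", with "model_output" added."""
    augmented = []
    for record in records:
        if "question" not in record:
            continue  # mirror the loop above: records without a question are skipped
        payload = {"inputs": record["question"], "parameters": {"max_new_tokens": 32}}
        result = json.loads(invoke(payload))
        # Copy the record rather than mutating the caller's data
        augmented.append({**record, "model_output": result[0]["generated_text"]})
    return augmented


# Made-up sample records, for illustration only
sample = [
    {"question": "Berlin is the capital of", "answers": "Germany"},
    {"note": "no question field"},
]
augmented = augment_records(sample)
print(augmented)
```

Passing the invocation function as an argument keeps the augmentation step testable; in the notebook itself, the real call to `runtime.invoke_endpoint` plays that role.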

### FMEval Setup

In [None]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

#### Data Config Setup

Below, we create a DataConfig for the local dataset file we just created, trex_sample_with_model_outputs.jsonl.
- `dataset_name` is just an identifier for your own reference
- `dataset_uri` is either a local path to a file or an S3 URI
- `dataset_mime_type` is the MIME type of the dataset. Currently, JSON and JSON Lines are supported.
- `model_input_location`, `target_output_location`, and `model_output_location` are JMESPath queries used to find the model inputs, target outputs, and model outputs within the dataset. The values that you specify here depend on the structure of the dataset itself. Take a look at trex_sample_with_model_outputs.jsonl to see where "question", "answers", and "model_output" show up.
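
Since the three locations used here are plain top-level keys, each query should behave like a dictionary lookup on one parsed line. The following is a minimal sketch, using only the standard library, that reports which lines lack a required field. `find_missing_fields` is a hypothetical helper, and the sample lines are made up in an assumed shape rather than taken from the real file.

```python
import json

REQUIRED_FIELDS = ("question", "answers", "model_output")


def find_missing_fields(jsonl_text: str) -> dict:
    """Map 1-based line number -> list of required fields absent from that line."""
    problems = {}
    for number, raw in enumerate(jsonl_text.splitlines(), start=1):
        if not raw.strip():
            continue  # ignore blank lines
        record = json.loads(raw)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems[number] = missing
    return problems


# Made-up lines in the assumed shape of the augmented dataset
sample_text = "\n".join([
    json.dumps({"question": "q1", "answers": "a1<OR>a2", "model_output": "a1"}),
    json.dumps({"question": "q2", "answers": "a3"}),
])
problems = find_missing_fields(sample_text)
print(problems)
```

Running a check like this on the contents of the dataset file before building the `DataConfig` can surface missing fields early.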

In [None]:
config = DataConfig(
    dataset_name="trex_sample_with_model_outputs",
    dataset_uri="trex_sample_with_model_outputs.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers",
    model_output_location="model_output"
)

### Run Evaluation

In the other example notebooks, we usually pass a model runner and a prompt template to the `evaluate` method of the evaluation algorithm. However, since our dataset already contains all of the model inference outputs, we only need to pass our dataset config.

In [None]:
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))
eval_output = eval_algo.evaluate(dataset_config=config, save=True)
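
To build intuition for the score, the sketch below approximates the idea behind a factual knowledge check: split the target output on the delimiter and test whether any alternative appears in the model output, ignoring case. This is a simplified illustration written for this walkthrough, not fmeval's implementation. `contains_any_target` is a hypothetical name, and the matching rule is an assumption.

```python
def contains_any_target(model_output: str, target_output: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any delimited target appears in the model output (case-insensitive), else 0."""
    output_lower = model_output.lower()
    targets = [t.strip().lower() for t in target_output.split(delimiter)]
    # Empty alternatives are ignored so that a stray delimiter cannot match everything
    return int(any(t and t in output_lower for t in targets))


# Made-up examples, for illustration only
score_hit = contains_any_target("The capital is Berlin.", "berlin<OR>Berlin, Germany")
score_miss = contains_any_target("I am not sure.", "Paris<OR>Paris, France")
print(score_hit, score_miss)
```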

#### Parse Evaluation Results

In [None]:
# Pretty-print the evaluation output (notice the score).
import json
print(json.dumps(eval_output, default=vars, indent=4))

In [None]:
# Create a Pandas DataFrame to visualize the results
import pandas as pd

data = []

# We obtain the path to the results file from "output_path" in the cell above
with open("/tmp/eval_results/factual_knowledge_trex_sample_with_model_outputs.jsonl", "r") as file:
    for line in file:
        data.append(json.loads(line))
df = pd.DataFrame(data)
df['eval_algo'] = df['scores'].apply(lambda x: x[0]['name'])
df['eval_score'] = df['scores'].apply(lambda x: x[0]['value'])
df
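
A dataset-level summary can also be computed directly from the per-record results. The sketch below assumes each record carries a `scores` list whose first entry has `name` and `value` keys, as the DataFrame code above implies. The sample records (including the score name) are made up, and `mean_score` is a hypothetical helper.

```python
from statistics import mean


def mean_score(records: list) -> float:
    """Average the first score value across records."""
    return mean(record["scores"][0]["value"] for record in records)


# Made-up records in the assumed per-line shape of the results file
sample_records = [
    {"scores": [{"name": "factual_knowledge", "value": 1}]},
    {"scores": [{"name": "factual_knowledge", "value": 0}]},
    {"scores": [{"name": "factual_knowledge", "value": 1}]},
    {"scores": [{"name": "factual_knowledge", "value": 1}]},
]
average = mean_score(sample_records)
print(average)
```

With the notebook's `data` list in place of `sample_records`, the same helper would give the average per-record score, which can be compared against the score printed in the evaluation output.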