## Running an evaluation using a custom HuggingFace Model Runner

In this notebook, we create a custom Model Runner using HuggingFace's `AutoModelForCausalLM` and evaluate the model on factual knowledge using the FMEval library.

Environment:
- Base Python 3.0 kernel
- Studio Notebook instance type: ml.m5.large

### Setup

In [None]:
# Install the fmeval package

!rm -Rf ~/.cache/pip/*
!pip3 install fmeval --upgrade-strategy only-if-needed --force-reinstall

In [None]:
import glob

# Check that the dataset file to be used by the evaluation is present
if not glob.glob("trex_sample.jsonl"):
    print("ERROR - please make sure file exists: trex_sample.jsonl")

In [None]:
import warnings
import sagemaker
import torch

from dataclasses import dataclass
from typing import Tuple, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer
from fmeval.model_runners.model_runner import ModelRunner

### Create a subclass of `ModelRunner` for our custom HuggingFace model runner

In [None]:
@dataclass(frozen=True)
class HFModelConfig:
    """
    Configures a HuggingFaceCausalLLMModelRunner instance.

    :param model_name: A unique identifier tied to a HuggingFace model.
            See https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/auto#transformers.AutoModel.from_pretrained
    :param max_new_tokens: The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
    :param remove_prompt_from_generated_text: Whether to remove the prompt from text that is generated by the model.
    :param do_sample: Whether to use sampling; greedy decoding is used during generation if False.
    """

    model_name: str
    max_new_tokens: int
    remove_prompt_from_generated_text: bool = True
    do_sample: bool = False


class HuggingFaceCausalLLMModelRunner(ModelRunner):
    def __init__(self, model_config: HFModelConfig):
        self.config = model_config
        self.model = AutoModelForCausalLM.from_pretrained(self.config.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_name)

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        input_ids = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        generations = self.model.generate(
            **input_ids,
            max_new_tokens=self.config.max_new_tokens,
            pad_token_id=self.tokenizer.eos_token_id,
            do_sample=self.config.do_sample,
        )
        generation_contains_input = (
            input_ids["input_ids"][0] == generations[0][: input_ids["input_ids"].shape[1]]
        ).all()
        if self.config.remove_prompt_from_generated_text and not generation_contains_input:
            warnings.warn(
                "Your model does not return the prompt as part of its generations. "
                "`remove_prompt_from_generated_text` does nothing."
            )
        if self.config.remove_prompt_from_generated_text and generation_contains_input:
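            # Decode only the newly generated tokens, dropping special tokens such as EOS
            # so that both branches return text in the same form.
            new_tokens = generations[:, input_ids["input_ids"].shape[1] :]
            output = self.tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]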
        else:
            output = self.tokenizer.batch_decode(generations, skip_special_tokens=True)[0]

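        # Score the prompt itself. With labels=input_ids, the first element of the model output
        # is the mean cross-entropy loss over the prompt tokens, so its negation is the average
        # per-token log-probability (a value <= 0, not a probability in [0, 1]).
        with torch.inference_mode():
            encoded = self.tokenizer(self.tokenizer.bos_token + prompt, return_tensors="pt")
            input_ids = encoded["input_ids"].to(self.model.device)
            model_output = self.model(input_ids, labels=input_ids)
            log_probability = -model_output[0].item()

        return output, log_probability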
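
The second value returned by `predict` is the negated loss of a forward pass in which the labels are the input tokens themselves. The sketch below illustrates, with plain Python and made-up token probabilities, why negating a mean cross-entropy loss yields an average per-token log-probability. The function name `mean_log_probability` and the numbers are hypothetical and are not part of FMEval or Transformers.

```python
import math


def mean_log_probability(token_probs):
    """Average log-probability of a token sequence.

    This equals the negative of the mean cross-entropy loss computed
    over the same tokens.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities for a 4-token prompt (illustration only).
token_probs = [0.5, 0.25, 0.5, 0.25]
score = mean_log_probability(token_probs)

print(score)            # average log-probability, always <= 0
print(math.exp(score))  # geometric mean of the token probabilities
```

Because the value is an average of logarithms, it is comparable across prompts of different lengths, but it should not be read as a probability.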

#### Instantiate a custom model runner using GPT2

In [None]:
hf_config = HFModelConfig(model_name="gpt2", max_new_tokens=32)
model = HuggingFaceCausalLLMModelRunner(model_config=hf_config)
# model.predict returns a tuple whose first element is the generated text; [0] extracts it
print(model.predict("London is the capital of?")[0])

### FMEval Setup

In [None]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

### Evaluate the model on a single sample

In [None]:
eval_algo = FactualKnowledge(FactualKnowledgeConfig("<OR>"))

model_output = model.predict("London is the capital of?")[0]
print(model_output)

eval_algo.evaluate_sample(target_output="UK<OR>England<OR>United Kingdom", model_output=model_output)
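
The target output above packs several acceptable answers into one string, separated by the `<OR>` delimiter passed to `FactualKnowledgeConfig`. As a rough illustration of the idea, the sketch below checks whether any of the alternatives appears in the model output. It is a simplified approximation under that assumption, not FMEval's implementation, and `inclusion_score` is a hypothetical helper.

```python
def inclusion_score(target_output: str, model_output: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any delimiter-separated target appears in the model output, else 0.

    Illustrative approximation only; FMEval's own scoring may normalize text differently.
    """
    targets = [t.strip() for t in target_output.split(delimiter) if t.strip()]
    output = model_output.lower()
    return int(any(t.lower() in output for t in targets))


targets = "UK<OR>England<OR>United Kingdom"
print(inclusion_score(targets, "London is the capital of England."))  # → 1
print(inclusion_score(targets, "I am not sure."))  # → 0
```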

### Evaluate the model using a dataset

#### Data Config Setup

Below, we create a DataConfig for the local dataset file, trex_sample.jsonl.
- `dataset_name` is just an identifier for your own reference
- `dataset_uri` is either a local path to a file or an S3 URI
- `dataset_mime_type` is the MIME type of the dataset. Currently, JSON and JSON Lines are supported.
- `model_input_location` and `target_output_location` are JMESPath queries used to find the model inputs and target outputs within the dataset. `category_location` is similarly used to find the category that each sample belongs to. The values you specify here depend on the structure of the dataset itself. Take a look at `trex_sample.jsonl` to see where `"question"`, `"answers"`, and `"knowledge_category"` show up.
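
To make the location fields concrete, here is a minimal sketch of how such queries select values from one JSON Lines record. The record content is hypothetical; only the three key names come from this notebook. For a top-level key, a JMESPath query such as `question` selects the same value as a plain dictionary lookup, so the sketch uses the standard library only.

```python
import json

# Hypothetical JSON Lines record; the values are made up for illustration.
line = (
    '{"question": "London is the capital of?", '
    '"answers": "UK<OR>England<OR>United Kingdom", '
    '"knowledge_category": "Capitals"}'
)
record = json.loads(line)

# Map each DataConfig location field to the key it points at.
locations = {
    "model_input_location": "question",
    "target_output_location": "answers",
    "category_location": "knowledge_category",
}

# A top-level JMESPath key query behaves like a dictionary lookup.
resolved = {name: record[key] for name, key in locations.items()}
for name, value in resolved.items():
    print(f"{name}: {value}")
```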

In [None]:
config = DataConfig(
    dataset_name="trex_sample",
    dataset_uri="trex_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers",
    category_location="knowledge_category",
)

#### Run Evaluation

In [None]:
eval_output = eval_algo.evaluate(model=model, dataset_config=config, prompt_template="$model_input", save=True)

#### Parse Evaluation Results

In [None]:
# Pretty-print the evaluation output (notice the score).
import json
print(json.dumps(eval_output, default=vars, indent=4))

In [None]:
# Create a Pandas DataFrame to visualize the results
import pandas as pd

data = []

# We obtain the path to the results file from "output_path" in the cell above
with open("/tmp/eval_results/factual_knowledge_trex_sample.jsonl", "r") as file:
    for line in file:
        data.append(json.loads(line))
df = pd.DataFrame(data)
df['eval_algo'] = df['scores'].apply(lambda x: x[0]['name'])
df['eval_score'] = df['scores'].apply(lambda x: x[0]['value'])
df
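
The two `apply` calls above read only the first entry of each `scores` list. As a sketch of the structure they assume, the snippet below flattens every entry of one record into a name-to-value mapping. The record shown is hypothetical: the `scores` layout (a list of objects with `name` and `value`) mirrors what the cell above indexes, while the other field and the score name are made up for illustration.

```python
import json

# Hypothetical results line (illustration only).
line = (
    '{"model_input": "London is the capital of?", '
    '"scores": [{"name": "factual_knowledge", "value": 1}]}'
)
record = json.loads(line)

# Flatten all score entries instead of reading only scores[0].
flat = {score["name"]: score["value"] for score in record["scores"]}
print(flat)  # → {'factual_knowledge': 1}
```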