## Evaluating Falcon-7B-Instruct on prompt stereotyping using JumpStart

In this notebook, we use the FMEval library to evaluate the Falcon-7B-Instruct model (available through JumpStart) for prompt stereotyping.

Environment:
- Base Python 3.0 kernel
- Studio Notebook instance type: ml.m5.xlarge

### Setup

In [None]:
# Install the fmeval package

!rm -Rf ~/.cache/pip/*
!pip3 install fmeval --upgrade-strategy only-if-needed --force-reinstall

In [None]:
import glob

# Check that the dataset file to be used by the evaluation is present
if not glob.glob("crows-pairs_sample.jsonl"):
    print("ERROR - please make sure file exists: crows-pairs_sample.jsonl")

### JumpStart Endpoint Creation

In [None]:
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel

# model_id and model_version are used by a later cell, even if you connect to an existing endpoint.
model_id, model_version = "huggingface-llm-falcon-7b-instruct-bf16", "*"

# Uncomment the lines below and fill in the endpoint name if you have an existing endpoint.
# endpoint_name = "Insert your existing endpoint name here"
# predictor = sagemaker.predictor.Predictor(
#     endpoint_name=endpoint_name,
#     serializer=sagemaker.serializers.JSONSerializer(),
#     deserializer=sagemaker.deserializers.JSONDeserializer(),
# )


# The lines below deploy a new endpoint. Delete them if you are using an existing endpoint.
my_model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = my_model.deploy()
endpoint_name = predictor.endpoint_name

#### Sample endpoint invocation

In [None]:
%%time

prompt = "London is the capital of"
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
        "decoder_input_details": True,
        "details": True,
    },
}

response = predictor.predict(payload)
print(response[0]["generated_text"])

### FMEval Setup

In [None]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.prompt_stereotyping import PromptStereotyping

#### Data Config Setup

Below, we create a DataConfig for the local dataset file, crows-pairs_sample.jsonl.
- `dataset_name` is just an identifier for your own reference
- `dataset_uri` is either a local path to a file or an S3 URI
- `dataset_mime_type` is the MIME type of the dataset. Currently, JSON and JSON Lines are supported.
- `sent_more_input_location`, `sent_less_input_location`, and `category_location` are JMESPath queries used to find the "sent_more" and "sent_less" model inputs (explained below), and the category type for each sample, within the dataset. The values that you specify here depend on the structure of the dataset itself. Take a look at crows-pairs_sample.jsonl to see where "sent_more", "sent_less", and "bias_type" show up.
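
To make the location fields concrete, the sketch below parses one JSON Lines record and looks up the three top-level keys that the DataConfig refers to. The record is a hypothetical placeholder, not a row from crows-pairs_sample.jsonl, and the `lookup` helper is an illustrative stand-in that only handles plain top-level keys (the simplest kind of JMESPath query); it is not how FMEval resolves locations internally.

```python
import json

# Hypothetical record with placeholder text (not taken from the real dataset).
line = (
    '{"sent_more": "<more stereotypical sentence>", '
    '"sent_less": "<less stereotypical sentence>", '
    '"bias_type": "<category>"}'
)


def lookup(record: dict, location: str):
    """Illustrative stand-in: resolve a single top-level key in the record."""
    return record[location]


# DataConfig argument name -> key used in this notebook.
locations = {
    "sent_more_input_location": "sent_more",
    "sent_less_input_location": "sent_less",
    "category_location": "bias_type",
}

record = json.loads(line)
resolved = {name: lookup(record, key) for name, key in locations.items()}

for name, value in resolved.items():
    print(f"{name} -> {value}")
```

If your dataset nests these fields inside other objects, the location strings would need to be adjusted to match that structure.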

For prompt stereotyping, we feed the model pairs of sentences where one sentence ("sent_more") exhibits a higher degree of stereotyping while the other ("sent_less") is less stereotypical. The evaluation compares the log probabilities that the model assigns to the two sentences in each pair, which is why the sample invocation above requests `details` and `decoder_input_details` from the endpoint.
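
As a rough sketch of how such a pair can be scored, assume the model yields one log probability per sentence. A pair then counts as biased when the more stereotypical sentence is the more likely one, and the overall score is the fraction of biased pairs, so a value near 0.5 would indicate no systematic preference. The log probabilities and function names below are invented for illustration; this is not FMEval's implementation.

```python
from statistics import mean


def is_biased(sent_more_log_prob: float, sent_less_log_prob: float) -> bool:
    """True when the model finds the more stereotypical sentence more likely."""
    return sent_more_log_prob > sent_less_log_prob


def stereotyping_score(pairs) -> float:
    """Fraction of (sent_more, sent_less) log-probability pairs that are biased."""
    return mean(1.0 if is_biased(more, less) else 0.0 for more, less in pairs)


# Hypothetical (sent_more_log_prob, sent_less_log_prob) values.
pairs = [(-42.0, -45.5), (-30.0, -28.0), (-51.0, -53.0), (-20.0, -19.0)]

score = stereotyping_score(pairs)
print(score)
```

With these made-up numbers, two of the four pairs favor the more stereotypical sentence.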

In [None]:
config = DataConfig(
    dataset_name="crows-pairs_sample",
    dataset_uri="crows-pairs_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    sent_more_input_location="sent_more",
    sent_less_input_location="sent_less",
    category_location="bias_type",
)

#### Model Runner Setup

The model runner we create below will be used to perform inference on every sample in the dataset.

In [None]:
js_model_runner = JumpStartModelRunner(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
)

### Configuring the evaluation

By default, evaluation results are written to a subdirectory of `/tmp/eval_results`. To write them to a different directory, set the `EVAL_RESULTS_PATH` environment variable before running the evaluation.

In [None]:
import os
eval_dir = "results-eval-prompt-stereotyping"
curr_dir = os.getcwd()
eval_results_path = os.path.join(curr_dir, eval_dir) + "/"
os.environ["EVAL_RESULTS_PATH"] = eval_results_path
if os.path.exists(eval_results_path):
    print(f"Directory '{eval_results_path}' exists.")
else:
    os.mkdir(eval_results_path)

### Run Evaluation

In [None]:
eval_algo = PromptStereotyping()
eval_output = eval_algo.evaluate(
    model=js_model_runner,
    dataset_config=config,
    prompt_template="$model_input",
    save=True,
)

In [None]:
# Pretty-print the evaluation output (notice the score).
import json
print(json.dumps(eval_output, default=vars, indent=4))

In [None]:
# Create a Pandas DataFrame to visualize the results
import pandas as pd

data = []
with open(os.path.join(eval_results_path, "prompt_stereotyping_crows-pairs_sample.jsonl"), "r") as file:
    for line in file:
        data.append(json.loads(line))
df = pd.DataFrame(data)
df['eval_algo'] = df['scores'].apply(lambda x: x[0]['name'])
df['eval_score'] = df['scores'].apply(lambda x: x[0]['value'])
df
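
Per-sample scores are often easier to read when averaged per bias category. The sketch below does this with the standard library only, on hypothetical records. The `scores` layout (a list of objects with `name` and `value`) mirrors what the cell above reads; the `category` key, the score name, and all values are assumptions made for illustration, so check the saved file for the actual field names.

```python
import json
from collections import defaultdict

# Hypothetical per-sample lines; "category", the score name, and the values are assumed.
lines = [
    '{"category": "A", "scores": [{"name": "example_score", "value": 1.5}]}',
    '{"category": "A", "scores": [{"name": "example_score", "value": 0.5}]}',
    '{"category": "B", "scores": [{"name": "example_score", "value": -2.0}]}',
]


def mean_score_by_category(jsonl_lines):
    """Average the first score of each record, grouped by its category."""
    values_by_category = defaultdict(list)
    for line in jsonl_lines:
        record = json.loads(line)
        values_by_category[record["category"]].append(record["scores"][0]["value"])
    return {
        category: sum(values) / len(values)
        for category, values in values_by_category.items()
    }


summary = mean_score_by_category(lines)
print(summary)
```

The same grouping could be done on the DataFrame with a pandas `groupby`, provided the results file contains a category column.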