## Evaluating Toxicity on a Falcon Model in JumpStart

In this example we deploy a Falcon model from SageMaker JumpStart and run the Toxicity evaluation algorithm on a subset of the built-in dataset, `real_toxicity_sample.jsonl`.

Environment:
- conda_python3 kernel
- Studio Notebook instance type: ml.g4dn.2xlarge

### Setup

In [None]:
#!pip3 install sagemaker

#!pip3 install -U pyarrow
#!pip3 install -U accelerate
#!pip3 install "ipywidgets>=8"
#!pip3 install jsonlines

In [None]:
import glob

# Check for fmeval wheel and built-in dataset
if not glob.glob("fmeval-*-py3-none-any.whl"):
    print("ERROR - please make sure file exists: fmeval-*-py3-none-any.whl")

if not glob.glob("real_toxicity_sample.jsonl"):
    print("ERROR - please make sure file exists: real_toxicity_sample.jsonl")

In [None]:
#
# Install the fmeval-*-py3-none-any.whl distribution.
#

#!rm -Rf ~/.cache/pip/*

#!pip3 install fmeval-*-py3-none-any.whl --upgrade --upgrade-strategy only-if-needed --force-reinstall
#!pip3 install boto3==1.28.65

### JumpStart Endpoint Creation & Sample Inference

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

# needed for the FMEval model runner config
model_id, model_version = (
    "huggingface-llm-falcon-7b-instruct-bf16",
    "*",
)

In [None]:
%%time
my_model = JumpStartModel(model_id=model_id)
predictor = my_model.deploy()

In [None]:
%%time


prompt = "Tell me about Amazon SageMaker."
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
    },
}

response = predictor.predict(payload)
print(response[0]["generated_text"])

### FMEval Setup

In [None]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig
from fmeval.eval_algorithms.qa_toxicity import QAToxicity

#### Data Config Setup

For toxicity, we read only the model input from the dataset. The continuation is generated by our model, and that generated text is what gets evaluated for toxicity.

In [None]:
# For the larger toxicity dataset, use "real_toxicity.jsonl" instead
config = DataConfig(
    dataset_name="real_toxicity",
    dataset_uri="real_toxicity_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="prompt",
)
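As a rough illustration of what `model_input_location="prompt"` refers to, the sketch below writes two made-up JSON Lines records and reads the `prompt` field back from each. The file name `example_prompts.jsonl` and the record contents are hypothetical, and the code uses only the standard library rather than fmeval's own data loader.

```python
import json
import os
import tempfile

# Hypothetical records; the real dataset may carry other fields besides "prompt".
records = [
    {"prompt": "The weather today is"},
    {"prompt": "My neighbor always says"},
]

# Write one JSON object per line (the JSON Lines format).
path = os.path.join(tempfile.mkdtemp(), "example_prompts.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back and pull out the field named by model_input_location.
model_input_location = "prompt"
with open(path, encoding="utf-8") as f:
    prompts = [json.loads(line)[model_input_location] for line in f if line.strip()]

print(prompts)
```

Each extracted string plays the role of the model input; no reference continuation is read from the file.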

#### JS Model Runner Setup

Alternatively, if you already have a SageMaker endpoint deployed, you can use the SageMaker Endpoint Model Runner instead.

In [None]:
js_model_runner = JumpStartModelRunner(
    endpoint_name=predictor.endpoint_name,
    model_id=model_id,
    model_version=model_version,
    output='[0].generated_text',
    content_template='{"inputs": $prompt, "parameters": {"do_sample": true, "top_p": 0.9, "temperature": 0.8, "max_new_tokens": 1024}}',
)
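To make the `content_template` and `output` arguments more concrete, here is a minimal sketch of the idea behind them: the prompt is JSON-encoded and substituted for `$prompt`, and the generated text is taken from the first element of the response. The helper names `build_payload` and `extract_output` and the fake response are assumptions for illustration only; this approximates the behavior and is not fmeval's actual implementation.

```python
import json
from string import Template

content_template = '{"inputs": $prompt, "parameters": {"do_sample": true, "top_p": 0.9, "temperature": 0.8, "max_new_tokens": 1024}}'


def build_payload(prompt: str) -> dict:
    # json.dumps adds the surrounding quotes and escapes special characters,
    # which is why $prompt appears unquoted in the template.
    body = Template(content_template).substitute(prompt=json.dumps(prompt))
    return json.loads(body)


def extract_output(response: list) -> str:
    # Plain-Python equivalent of the JMESPath expression "[0].generated_text".
    return response[0]["generated_text"]


payload = build_payload('She said "hello" and')
fake_response = [{"generated_text": "walked away."}]  # stand-in for an endpoint response

print(payload["inputs"])
print(extract_output(fake_response))
```

Because the prompt is JSON-encoded before substitution, prompts containing quotes or newlines still yield a valid request body.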

### Evaluation Report Configuration

We will write the evaluation results to the directory below.

In [None]:
import os
eval_dir = "results-evaluation-toxicity-subset"
curr_dir = os.getcwd()
eval_results_path = os.path.join(curr_dir, eval_dir) + "/"
os.environ["EVAL_RESULTS_PATH"] = eval_results_path
if os.path.exists(eval_results_path):
    print(f"Directory '{eval_results_path}' exists.")
else:
    os.mkdir(eval_results_path)

### Sample Evaluation

In [None]:
eval_algo = Toxicity(ToxicityConfig())  # instantiate the toxicity algorithm

In [None]:
# evaluate a single model output
model_output = "The weather turned awful"
eval_algo.evaluate_sample(model_output=model_output)

In [None]:
# evaluation on the entire dataset
eval_output = eval_algo.evaluate(model=js_model_runner, dataset_config=config, prompt_template="$feature", save=True)

In [None]:
# parse report
eval_output

In [None]:
import json
print(json.dumps(eval_output, default=vars, indent=4))
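Since the evaluation above was run with `save=True`, per-record results should also be available on disk under the results directory. The sketch below assumes those results are stored as JSON Lines files; it makes no assumption about the file names or record fields, and `load_jsonl_records` is a hypothetical helper, not part of fmeval.

```python
import glob
import json
import os


def load_jsonl_records(directory: str) -> dict:
    """Return {file name: list of parsed records} for every .jsonl file in directory."""
    results = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.jsonl"))):
        with open(path, encoding="utf-8") as f:
            # Skip blank lines; parse every other line as one JSON object.
            results[os.path.basename(path)] = [
                json.loads(line) for line in f if line.strip()
            ]
    return results


# Only inspect the results directory if EVAL_RESULTS_PATH has been set.
results_dir = os.environ.get("EVAL_RESULTS_PATH")
if results_dir:
    for name, rows in load_jsonl_records(results_dir).items():
        print(name, len(rows))
```

Printing the record count per file is a quick check that the number of evaluated rows matches the size of the dataset subset.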