## Comparing Model Performance of Summarization Accuracy After Fine-Tuning

In this example, we will take the pre-existing SageMaker endpoints that you deployed in previous exercises and use them to generate data that can be leveraged for quality comparison. This data lets you take a quantitative approach to judging the efficacy of fine-tuning your models.

This example will run through samples of the [Samsum dataset](https://huggingface.co/datasets/Samsung/samsum) (paper [here](https://aclanthology.org/D19-5409/)) on the HuggingFace data hub to generate summaries of chat conversations, and then use the [fmeval library](https://github.com/aws/fmeval) to analyze those summaries.

### Install Dependencies

In [None]:
# Install fmeval and the supporting packages used in this notebook
!pip install -U datasets==2.21.0
!pip install -U jsonlines==4.0.0
!pip install -U fmeval==1.2.0
!pip install -U py7zr==0.22.0

Here you will use the HuggingFace datasets package to load the Samsum dataset. The dataset is pre-split into training and test data, so you can simply take that split using the API.

In [None]:
from datasets import load_dataset

test_dataset  = load_dataset("Samsung/samsum", split="test")

len(test_dataset)

You can see the test dataset has 819 items in it, and they can be accessed via index. Each item includes the text of a conversation (`dialogue`) and a short summary of that dialogue (`summary`).

In [None]:
test_dataset[204]

Create the client objects for calling SageMaker APIs, and supply the names of the SageMaker endpoints you created for the base and fine-tuned versions of the model. If you did not deploy both models, you can simply set them to the same endpoint name.

In [None]:
import sagemaker
import boto3


sess = sagemaker.Session()
boto_session = boto3.session.Session()
region = boto_session.region_name

## Validate endpoint functionality

### Reference your base and fine-tuned endpoints

# ***
# NOTE: PROVIDE YOUR UNIQUE ENDPOINTS HERE OR YOU WILL GET ERRORS
# ***

__If you will be evaluating a model with swappable LoRA adapters, you can use the same endpoint name for both base and tuned with varying adapter references in your inference payload.__

Omitting the adapter will result in the base model being used without any adapter, and specifying an adapter array with its name will use that adapter for inference.
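
The difference between the two kinds of payload can be sketched with a small helper. This is only an illustration: `build_payload` is a hypothetical name, the generation parameters mirror the ones used in the quick test later in this notebook, and it assumes the endpoint reads an `adapters` array from the request body as described above.

```python
def build_payload(prompt, adapter=None, max_new_tokens=256):
    """Build an inference payload; "adapters" is included only when an adapter is named."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "top_p": 0.9,
            "temperature": 0.8,
            "max_new_tokens": max_new_tokens,
        },
    }
    if adapter is not None:
        # Assumption: the endpoint selects a LoRA adapter from this array.
        payload["adapters"] = [adapter]
    return payload


# No adapter: the request targets the base model.
base_payload_example = build_payload("Summarize this conversation.")
# Named adapter: the request targets the fine-tuned variant.
tuned_payload_example = build_payload("Summarize this conversation.", adapter="sum")
```

Keeping the payload construction in one place makes it harder for the base and tuned requests to drift apart in their generation parameters.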

In [None]:
#ENTER YOUR ENDPOINTS HERE
base_endpoint_name = "<YOUR_MODEL_ENDPOINT_HERE>"
tuned_endpoint_name = "<YOUR_MODEL_ENDPOINT_HERE>"

### As a quick test, you will take a base prompt and sample from the dataset to verify that the endpoints provided will work for the upcoming test runs. 

You can also use this as a subjective comparison of the 3 models.

In [None]:
import json

prompt = f"""
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant who is an expert in summarizing conversations.
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Summarize the provided conversation in 2 sentences.

{test_dataset[0]['dialogue']}

Provide the summary directly, without any introduction or preamble. Do not start the response with "Here is a...".<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""

base_payload = {"inputs": prompt,"parameters": {"do_sample": True,"top_p": 0.9,"temperature": 0.8,"max_new_tokens": 256,},}
tuned_payload = {"inputs": prompt,"parameters": {"do_sample": True,"top_p": 0.9,"temperature": 0.8,"max_new_tokens": 256,}, "adapters":["sum"]}
tuned5_payload = {"inputs": prompt,"parameters": {"do_sample": True,"top_p": 0.9,"temperature": 0.8,"max_new_tokens": 256,}, "adapters":["sum5"]}


base_predictor = sagemaker.Predictor(
    endpoint_name = base_endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

base_predictor_response = base_predictor.predict(base_payload)

print(f"Base Model:\n{base_predictor_response['generated_text']}")
print("\n ================ \n")

tuned_predictor = sagemaker.Predictor(
    endpoint_name = tuned_endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

tuned_predictor_response = tuned_predictor.predict(tuned_payload)


print(f"Fine-Tuned Model (1 Epoch):\n{tuned_predictor_response['generated_text']}")
print("\n ================ \n")

tuned5_predictor = sagemaker.Predictor(
    endpoint_name = tuned_endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

tuned5_predictor_response = tuned5_predictor.predict(tuned5_payload)


print(f"Fine-Tuned Model (5 Epochs):\n{tuned5_predictor_response['generated_text']}")
print("\n ================ \n")

### Building the test dataset files

fmeval requires the test data to be in a flat file format, so this section will take the examples from the Samsum dataset object and store them on the local filesystem in jsonlines format.

This file will include the source dialogue and the reference summary so model outputs can be evaluated against ground truth data.

The code uses 10 samples as a base to show functionality, but you can increase that number to gather more datapoints. The more samples, the longer the evaluation will take. If you are running this in a live workshop, it is advised to not go beyond 50 for time purposes. (50 samples will take around 5 minutes to generate the analysis per model)
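
To make the file format concrete, here is a minimal sketch of writing and reading JSON Lines with only the standard library. The record below is a made-up placeholder, not a row from the dataset, and the temporary file path is arbitrary; the notebook itself uses the `jsonlines` package for the same purpose.

```python
import json
import os
import tempfile

# Made-up placeholder record with the same two keys the notebook writes.
records = [
    {
        "dialogue": "A: Are you coming?\nB: Yes, in five minutes.",
        "summary": "B will arrive in five minutes.",
    },
]

path = os.path.join(tempfile.mkdtemp(), "example.jsonl")

# JSON Lines: one complete JSON object per line.
with open(path, "w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")

# Reading it back is one json.loads call per non-empty line.
with open(path, encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh if line.strip()]
```

Because newlines inside a dialogue are escaped by `json.dumps`, each record still occupies exactly one line of the file.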

In [None]:
import jsonlines

# Change this to the number of samples you'd like to run your analysis on. Set to "max" to use the whole set if you don't want to set a number.
#number_of_samples_to_take="max"
number_of_samples_to_take=10

if number_of_samples_to_take == "max" or len(test_dataset) < number_of_samples_to_take:
    number_of_samples_to_take = len(test_dataset)
    
output_file = "samsum_summary_sample.jsonl"

# For each selected sample, copy its dialogue and reference summary
# into one JSON object and write it as a single line of `output_file`.

with jsonlines.open(output_file, "w") as output_fh:
    for i in range(number_of_samples_to_take):
        sample = test_dataset[i]
        line = {}
        text = sample["dialogue"]
        line["dialogue"] = text
        summary = sample["summary"]
        line["summary"] = summary
        print(f'\rCompleted Row {i+1} of {number_of_samples_to_take}', end="")        
        output_fh.write(line)

### fmeval Setup

Now that the source data is prepared, you are ready to configure fmeval to run your tests.

In [None]:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy

#### Data Config Setup

Below, we create a DataConfig for the local dataset file we just created, `samsum_summary_sample.jsonl`.

- `dataset_name` is just an identifier for your own reference
- `dataset_uri` is either a local path to a file or an S3 URI
- `dataset_mime_type` is the MIME type of the dataset. Currently, JSON and JSON Lines are supported.
- `model_input_location`, `target_output_location`, and `model_output_location` are JMESPath queries used to find the model inputs, target outputs, and model outputs within the dataset. The values that you specify here depend on the structure of the dataset itself. In `samsum_summary_sample.jsonl`, each line has a `dialogue` key (the model input) and a `summary` key (the target output). `model_output_location` is not set here because the model outputs are generated during the evaluation run.

Because fmeval creates its reports based on the dataset name, we will need to create 3 objects, one for each of the model variants.

In [None]:
base_model_data_config = DataConfig(
    dataset_name="base_model_samsum_sample",
    dataset_uri="samsum_summary_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="dialogue",
    target_output_location="summary",
)

tuned_model_data_config = DataConfig(
    dataset_name="tuned_model_samsum_sample",
    dataset_uri="samsum_summary_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="dialogue",
    target_output_location="summary",
)

tuned5_model_data_config = DataConfig(
    dataset_name="tuned5_model_samsum_sample",
    dataset_uri="samsum_summary_sample.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="dialogue",
    target_output_location="summary",
)
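
The three configurations above differ only in `dataset_name`. As a sketch of how that repetition could be reduced, the keyword arguments can be generated in a loop and unpacked into `DataConfig`. The names `variants`, `shared_settings`, and `config_kwargs` are hypothetical, and the block builds plain dictionaries so that it runs without fmeval installed.

```python
variants = ["base", "tuned", "tuned5"]

# Settings shared by every variant (same file, same field locations).
shared_settings = {
    "dataset_uri": "samsum_summary_sample.jsonl",
    "model_input_location": "dialogue",
    "target_output_location": "summary",
}

# One keyword-argument dictionary per variant; only the name changes.
config_kwargs = {
    variant: {"dataset_name": f"{variant}_model_samsum_sample", **shared_settings}
    for variant in variants
}

# In the notebook, each entry could then be unpacked, for example:
# DataConfig(dataset_mime_type=MIME_TYPE_JSONLINES, **config_kwargs["base"])
```

The unique `dataset_name` per variant is what keeps the three evaluation reports from overwriting each other.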

fmeval provides a `SageMakerModelRunner` class to facilitate calling the models during evaluation. The model runner, combined with the dataset, provides fmeval with the resources it needs to run through the source data file, generate model responses, then calculate the scores to compare.

Note the `output` maps to `generated_text` which is the output parameter from your model, and that the 2 tuned options have references to `adapters` in their `content_template` to specify the LoRA adapter to swap in for that inference call.

In [None]:
from fmeval.model_runners.sm_model_runner import SageMakerModelRunner

base_sagemaker_model_runner = SageMakerModelRunner(
    endpoint_name=base_endpoint_name,
    output='generated_text',
    content_type='application/json',
    accept_type='application/json',
    content_template='{"inputs": $prompt, "parameters": {"do_sample": true, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 256}}',
)

tuned_sagemaker_model_runner = SageMakerModelRunner(
    endpoint_name=tuned_endpoint_name,
    output='generated_text',
    content_type='application/json',
    accept_type='application/json',
    content_template='{"inputs": $prompt, "parameters": {"do_sample": true, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 256}, "adapters":["sum"]}',
)

tuned5_sagemaker_model_runner = SageMakerModelRunner(
    endpoint_name=tuned_endpoint_name,
    output='generated_text',
    content_type='application/json',
    accept_type='application/json',
    content_template='{"inputs": $prompt, "parameters": {"do_sample": true, "top_p": 0.9, "temperature": 0.9, "max_new_tokens": 256}, "adapters":["sum5"]}',
)

### Evaluation run configuration

Out of the box, fmeval writes the reports for evaluation runs to /tmp on the filesystem. To keep those files for analysis, you will set the `EVAL_RESULTS_PATH` environment variable to a subdirectory of the workshop folder.

fmeval supports a variety of [evaluation algorithms](https://github.com/aws/fmeval/tree/main/src/fmeval/eval_algorithms). In this example you will use `SummarizationAccuracy`, which outputs METEOR, ROUGE, and BERTScore metrics.

For larger datasets and evaluation instance types that have sufficient memory, you can also modify the `PARALLELIZATION_FACTOR` environment variable.

In [None]:
import os
eval_dir = "eval-results"
curr_dir = os.getcwd()
eval_results_path = os.path.join(curr_dir, eval_dir) + "/"
os.environ["EVAL_RESULTS_PATH"] = eval_results_path
# Create the results directory if it does not already exist
os.makedirs(eval_results_path, exist_ok=True)

os.environ["PARALLELIZATION_FACTOR"] = "8"

This is the prompt template that you will use for your evaluation runs.
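
Conceptually, each dataset record passes through two templates before it reaches the endpoint: the prompt template fills `$model_input`, and the `content_template` from the model runner fills `$prompt`. The sketch below reproduces that flow with Python's `string.Template` and shortened, hypothetical templates. It illustrates the idea under those assumptions and is not a copy of fmeval's internals, which may differ.

```python
import json
from string import Template

# Shortened, hypothetical stand-ins for the notebook's templates.
prompt_template = "Summarize the provided conversation in 2 sentences.\n\n$model_input"
content_template = '{"inputs": $prompt, "parameters": {"max_new_tokens": 256}, "adapters": ["sum"]}'


def compose_request_body(dialogue):
    """Return the JSON request body for one dialogue."""
    # Stage 1: place the dataset record into the prompt template.
    prompt = Template(prompt_template).substitute(model_input=dialogue)
    # Stage 2: JSON-encode the prompt (adds quotes, escapes newlines and
    # quotation marks) and place it into the request body template.
    return Template(content_template).substitute(prompt=json.dumps(prompt))


body = json.loads(compose_request_body('A: Lunch at noon?\nB: "Sure."'))
```

Note that `$prompt` appears unquoted in the content template: the JSON encoding step supplies the quotes, which is why dialogues containing newlines or quotation marks still produce a valid request body.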

In [None]:
evaluation_prompt_template = """
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant who is an expert in summarizing conversations.
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Summarize the provided conversation in 2 sentences.

$model_input

Provide the summary directly, without any introduction or preamble. Do not start the response with "Here is a...".<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""

Next you will run 3 evaluations, 1 for each variant of your model. Ignore the verbose outputs/warnings here.

For 10 items each test should take about 1-2 minutes on a `ml.m5.2xlarge` with an inference endpoint of `ml.g5.2xlarge`. The entire dataset will take 15-20 minutes per test.

In [None]:
%%time
eval_algo = SummarizationAccuracy()

base_eval_output = eval_algo.evaluate(
    model=base_sagemaker_model_runner,
    dataset_config=base_model_data_config, 
    prompt_template=evaluation_prompt_template,
    save=True,
    num_records=number_of_samples_to_take
)

In [None]:
%%time
eval_algo = SummarizationAccuracy()

tuned_eval_output = eval_algo.evaluate(
    model=tuned_sagemaker_model_runner,
    dataset_config=tuned_model_data_config, 
    prompt_template=evaluation_prompt_template,
    save=True,
    num_records=number_of_samples_to_take
)

In [None]:
%%time
eval_algo = SummarizationAccuracy()

tuned5_eval_output = eval_algo.evaluate(
    model=tuned5_sagemaker_model_runner,
    dataset_config=tuned5_model_data_config, 
    prompt_template=evaluation_prompt_template,
    save=True,
    num_records=number_of_samples_to_take
)

#### Parse Evaluation Results

Now that your evaluation runs are complete, you can graph the metrics from the eval run to determine the quality of the model output. In this example you will create a histogram of the different metrics and their frequency distribution.

The metrics that are part of the `SummarizationAccuracy` algorithm are:
- **METEOR**: (**M**etric for **E**valuation of **T**ranslation with **E**xplicit **OR**dering) is an evaluation metric for machine translation based on the harmonic mean of unigram precision and recall, with a higher weight on recall. It also applies a fragmentation penalty that lowers the score when the matched words appear in a different order, or in many separate chunks, compared with the reference. Weighting recall more heavily rewards outputs that cover the content of the reference, while the precision term still penalizes extra words.
  
- **ROUGE**: (**R**ecall-**O**riented **U**nderstudy for **G**isting **E**valuation) is an evaluation metric used to assess the quality of NLP tasks such as text summarization and machine translation. It measures the overlap of N-grams between the system-generated summary and the reference summary, providing insights into the precision and recall of the system’s output. There are several variants of ROUGE, including ROUGE-N, which quantifies the overlap of N-grams, and ROUGE-L, which calculates the Longest Common Subsequence (LCS) between the system and reference summaries.
  
- **BERTScore**: (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers Score) is an evaluation metric for natural language processing (NLP) tasks that leverages the pre-trained BERT language model to measure the similarity between two sentences. It computes the cosine similarity between the contextualized embeddings of the words in the candidate and reference sentences. BERTScore has been shown to correlate better with human judgments and provides stronger model selection performance than existing metrics.

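Source: [https://plainenglish.io/community/evaluating-nlp-models-a-comprehensive-guide-to-rouge-bleu-meteor-and-bertscore-metrics-d0f1b1](https://plainenglish.io/community/evaluating-nlp-models-a-comprehensive-guide-to-rouge-bleu-meteor-and-bertscore-metrics-d0f1b1)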
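
To give a feel for what an overlap metric measures, here is a minimal, simplified sketch of ROUGE-N using whitespace tokenization. It is for intuition only: `rouge_n` is a hypothetical helper, and the scores fmeval reports come from its own implementation, which may tokenize and normalize text differently.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return precision, recall, and F1 for n-gram overlap between two texts."""

    def ngram_counts(text):
        tokens = text.lower().split()  # naive whitespace tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Five of the six unigrams match, so precision, recall, and F1 are all 5/6.
scores = rouge_n("the cat sat on the mat", "the cat lay on the mat")
```

A metric like this rewards shared wording only, which is why it is reported alongside BERTScore, which compares meaning through embeddings.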

First you will look at the average overall metrics for each model:

In [None]:
from matplotlib import pyplot as plt
colors = ["red", "green", "blue"]

fig, axs = plt.subplots(3,1, figsize=(10,6))

models = ["Base", "Tuned (1 Epoch)", "Tuned (5 Epochs)"]

meteor_scores = [base_eval_output[0].dataset_scores[0].value, tuned_eval_output[0].dataset_scores[0].value, tuned5_eval_output[0].dataset_scores[0].value]
rouge_scores = [base_eval_output[0].dataset_scores[1].value, tuned_eval_output[0].dataset_scores[1].value, tuned5_eval_output[0].dataset_scores[1].value]
bert_scores = [base_eval_output[0].dataset_scores[2].value, tuned_eval_output[0].dataset_scores[2].value, tuned5_eval_output[0].dataset_scores[2].value]

axs[0].bar(models, meteor_scores, color=colors)
axs[0].set_title("Average METEOR Score")
axs[1].bar(models, rouge_scores, color=colors)
axs[1].set_title("Average ROUGE Score")
axs[2].bar(models, bert_scores, color=colors)
axs[2].set_title("Average BertScore Score")

plt.subplots_adjust(hspace=0.6)

print(f"METEOR Scores: {meteor_scores}")
print(f"ROUGE Scores: {rouge_scores}")
print(f"BertScore Scores: {bert_scores}")
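
The cell above picks scores by position (`dataset_scores[0]`, `[1]`, `[2]`), which depends on the order in which the algorithm returns them. A more defensive sketch looks them up by name instead. It assumes each score object exposes `name` and `value` attributes and that the names are `meteor`, `rouge`, and `bertscore`; check the printed evaluation output to confirm this for your fmeval version. The stand-in objects and their values below are placeholders, not results.

```python
from types import SimpleNamespace


def scores_by_name(dataset_scores):
    """Map each score's name to its value so lookups do not depend on order."""
    return {score.name: score.value for score in dataset_scores}


# Placeholder objects; in the notebook you would pass
# base_eval_output[0].dataset_scores instead.
example_scores = [
    SimpleNamespace(name="meteor", value=0.25),
    SimpleNamespace(name="rouge", value=0.5),
    SimpleNamespace(name="bertscore", value=0.75),
]

lookup = scores_by_name(example_scores)
rouge_value = lookup["rouge"]
```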

Next you can look at small samples of each dataset. You will notice the full prompt, model output (generated from the test), target output (ground truth) and the various scores from the evaluation run.

In [None]:
# Create a Pandas DataFrame to visualize the results
import pandas as pd

base_data = []
# We obtain the path to the results file from "output_path" in the cell above
with open("./eval-results/summarization_accuracy_base_model_samsum_sample.jsonl", "r") as file:
    for line in file:
        base_data.append(json.loads(line))
base_df = pd.DataFrame(base_data)
base_df['meteor_score'] = base_df['scores'].apply(lambda x: x[0]['value'])
base_df['rouge_score'] = base_df['scores'].apply(lambda x: x[1]['value'])
base_df['bert_score'] = base_df['scores'].apply(lambda x: x[2]['value'])
base_df.head()

In [None]:
tuned_data = []
# We obtain the path to the results file from "output_path" in the cell above
with open("./eval-results/summarization_accuracy_tuned_model_samsum_sample.jsonl", "r") as file:
    for line in file:
        tuned_data.append(json.loads(line))
tuned_df = pd.DataFrame(tuned_data)
tuned_df['meteor_score'] = tuned_df['scores'].apply(lambda x: x[0]['value'])
tuned_df['rouge_score'] = tuned_df['scores'].apply(lambda x: x[1]['value'])
tuned_df['bert_score'] = tuned_df['scores'].apply(lambda x: x[2]['value'])
tuned_df.head()

In [None]:
tuned5_data = []
# We obtain the path to the results file from "output_path" in the cell above
with open("./eval-results/summarization_accuracy_tuned5_model_samsum_sample.jsonl", "r") as file:
    for line in file:
        tuned5_data.append(json.loads(line))
tuned5_df = pd.DataFrame(tuned5_data)
tuned5_df['meteor_score'] = tuned5_df['scores'].apply(lambda x: x[0]['value'])
tuned5_df['rouge_score'] = tuned5_df['scores'].apply(lambda x: x[1]['value'])
tuned5_df['bert_score'] = tuned5_df['scores'].apply(lambda x: x[2]['value'])
tuned5_df.head()
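
The three cells above repeat the same loading logic. One way to factor it out is sketched below using only the standard library. `load_scores` is a hypothetical helper, and it assumes every entry in `scores` carries `name` and `value` keys; the demonstration record is a made-up placeholder written to a temporary file.

```python
import json
import os
import tempfile


def load_scores(path):
    """Read an evaluation results file and flatten each record's scores into columns."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            row = {key: value for key, value in record.items() if key != "scores"}
            for score in record.get("scores", []):
                row[f"{score['name']}_score"] = score["value"]
            rows.append(row)
    return rows


# Demonstration with a placeholder record (not real evaluation output).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.jsonl")
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write(json.dumps({
        "model_output": "placeholder",
        "scores": [{"name": "meteor", "value": 0.5}, {"name": "rouge", "value": 0.25}],
    }) + "\n")

demo_rows = load_scores(demo_path)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(...)`. The column names are derived from the score names, so they may differ slightly from the hand-written ones above (for example `bertscore_score` rather than `bert_score`).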

Finally you will plot all the scores for the 3 variants. This helps to visualize the performance of one candidate versus the others.

In [None]:
from matplotlib import pyplot as plt
colors = ["red", "green", "blue"]

fig, axs = plt.subplots(3,1, figsize=(10,6))

base_df['meteor_score'].plot.hist(histtype="step", bins=20, alpha=0.5, ax=axs[0], title = "Meteor Score", xlabel="Score", ylabel="Frequency", color=colors[0], label="Base")
base_df['rouge_score'].plot.hist(histtype="step", bins=20, alpha=0.5,ax=axs[1], title="Rouge Score", xlabel="Score", ylabel="Frequency", color=colors[0], label="Base")
base_df['bert_score'].plot.hist(histtype="step", bins=20, alpha=0.5,ax=axs[2], title="Bert Score", xlabel="Score", ylabel="Frequency", color=colors[0], label="Base")

tuned_df['meteor_score'].plot.hist(histtype="step", bins=20, alpha=0.5, ax=axs[0], title = "Meteor Score", xlabel="Score", ylabel="Frequency", color=colors[1], label="Tuned 1 Epoch")
tuned_df['rouge_score'].plot.hist(histtype="step", bins=20, alpha=0.5,ax=axs[1], title="Rouge Score", xlabel="Score", ylabel="Frequency", color=colors[1], label="Tuned 1 Epoch")
tuned_df['bert_score'].plot.hist(histtype="step", bins=20, alpha=0.5,ax=axs[2], title="Bert Score", xlabel="Score", ylabel="Frequency", color=colors[1], label="Tuned 1 Epoch")

tuned5_df['meteor_score'].plot.hist(histtype="step", bins=20, alpha=0.5, ax=axs[0], title = "Meteor Score", xlabel="Score", ylabel="Frequency", color=colors[2], label="Tuned 5 Epoch")
tuned5_df['rouge_score'].plot.hist(histtype="step", bins=20, alpha=0.5,ax=axs[1], title="Rouge Score", xlabel="Score", ylabel="Frequency", color=colors[2], label="Tuned 5 Epoch")
tuned5_df['bert_score'].plot.hist(histtype="step", bins=20, alpha=0.5,ax=axs[2], title="Bert Score", xlabel="Score", ylabel="Frequency", color=colors[2], label="Tuned 5 Epoch")

axs[0].legend(loc="upper right")
axs[1].legend(loc="upper right")
axs[2].legend(loc="upper right")

plt.subplots_adjust(hspace=0.6) 
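
With only a handful of samples, histograms can be hard to read, so a few summary statistics per variant may be a useful complement. The helper below is a minimal sketch; `summarize` is a hypothetical name, and it works on any iterable of numeric scores.

```python
from statistics import mean, median


def summarize(scores):
    """Return basic descriptive statistics for a collection of scores."""
    values = list(scores)
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# In the notebook, for example:
# summarize(base_df["meteor_score"])
# summarize(tuned_df["meteor_score"])
```

Comparing the median as well as the mean helps show whether a difference between variants is driven by a few outlier samples.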

If you were to run a full dataset evaluation against all 3 variants, your graphs would look something like the ones below. Notice that there is a large boost in model performance with 1 epoch of fine-tuning, with somewhat diminishing returns for the 5 epoch variant.

![](./overall_scores_with_details.png)
![](./full_dataset_eval_results.png)