# Evaluate LLM performance with metrics using Amazon Bedrock Automatic Model Evaluation

## Overview

Automatic model evaluation jobs allow you to quickly evaluate a model's ability to perform a task. You can either provide your own custom prompt dataset that you've tailored to a specific use case, or you can use an available built-in dataset.

For supported regions and models, refer to https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-support.html

In [None]:
# Install dependencies
%pip install --upgrade --quiet awscli boto3 seaborn matplotlib

## Pre-requisites

Refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-automatic.html for the complete list of prerequisites for running automatic model evaluation jobs. Most of them are set up in the following cells.

**Make sure that the following prerequisites are met before running this notebook.**

*1. An S3 bucket in the same region as the Amazon Bedrock models.*

*2. The IAM role running this notebook has permission to create/update IAM roles and the S3 bucket.*
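
The first prerequisite can be checked before any job is submitted. The sketch below is a minimal illustration, not part of the original notebook: the helper names `normalize_bucket_region` and `bucket_in_region` are hypothetical. It relies on the fact that S3 reports buckets in `us-east-1` with an empty `LocationConstraint`, and some older `eu-west-1` buckets with the legacy value `EU`.

```python
from typing import Optional


def normalize_bucket_region(location_constraint: Optional[str]) -> str:
    """Map an S3 LocationConstraint value to a region name."""
    # us-east-1 buckets are reported with a null LocationConstraint
    if location_constraint is None:
        return "us-east-1"
    # Legacy value that S3 may return for older eu-west-1 buckets
    if location_constraint == "EU":
        return "eu-west-1"
    return location_constraint


def bucket_in_region(location_constraint: Optional[str], region: str) -> bool:
    """Return True when the bucket's region matches the given region."""
    return normalize_bucket_region(location_constraint) == region


# Plain values are used here so the example runs without AWS access
print(bucket_in_region(None, "us-east-1"))
print(bucket_in_region("us-west-2", "us-east-1"))
```

In the notebook, the first argument would typically come from `boto3.client('s3').get_bucket_location(Bucket=bucket)['LocationConstraint']` and the second from `boto3.session.Session().region_name`.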

In [None]:
# Import modules

import boto3
from botocore.exceptions import ClientError
import ipywidgets as widgets
import json
import time
import datetime

### Select models for evaluation
Select model1 and model2 for comparison. Refer to https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-support.html for the supported models. You can look up a model ID using the AWS CLI or Boto3 as follows.

**AWS CLI:**
```
aws bedrock list-foundation-models
```

**Boto3:**
```
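import boto3
bedrock_client = boto3.client('bedrock')
# Print the ID of each available foundation model
for summary in bedrock_client.list_foundation_models()['modelSummaries']:
    print(summary['modelId'])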
```

Once you have a model ID from the supported list, update the options list in the next two cells **only if needed.**

Otherwise, choose from the options already provided without changing any model ID.
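
To narrow the full model list down to candidates for the dropdowns, the response can be filtered by output modality and provider. This is a hedged sketch: `text_model_ids` is a hypothetical helper, and the sample data is made up to mirror the shape of the `modelSummaries` list returned by `list_foundation_models()`.

```python
def text_model_ids(model_summaries, provider=None):
    """Return sorted IDs of models that produce text, optionally for one provider."""
    ids = []
    for summary in model_summaries:
        # Skip models that do not generate text (for example image models)
        if "TEXT" not in summary.get("outputModalities", []):
            continue
        if provider and summary.get("providerName") != provider:
            continue
        ids.append(summary["modelId"])
    return sorted(ids)


# Hypothetical sample shaped like list_foundation_models()['modelSummaries']
sample = [
    {"modelId": "meta.llama3-1-8b-instruct-v1:0", "providerName": "Meta",
     "outputModalities": ["TEXT"]},
    {"modelId": "example.image-model-v1", "providerName": "Example",
     "outputModalities": ["IMAGE"]},
    {"modelId": "example.text-model-v1", "providerName": "Example",
     "outputModalities": ["TEXT"]},
]

print(text_model_ids(sample, provider="Meta"))
```

A model appearing in this list is not necessarily supported for evaluation, so the result still needs to be checked against the supported-models page linked above.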

In [None]:
model_1 = widgets.Dropdown(
    options=[
        'meta.llama3-1-8b-instruct-v1:0',
        'meta.llama3-1-70b-instruct-v1:0',
        'meta.llama3-1-405b-instruct-v1:0',
    ],
    value='meta.llama3-1-8b-instruct-v1:0',
    description='Select model1:',
    disabled=False,
)
model_1

In [None]:
model_2 = widgets.Dropdown(
    options=[
        'meta.llama3-2-3b-instruct-v1:0',
        'meta.llama3-1-70b-instruct-v1:0',
        'meta.llama3-2-1b-instruct-v1:0',
    ],
    value='meta.llama3-2-3b-instruct-v1:0',
    description='Select model2:',
    disabled=False,
)
model_2

In [None]:
print('You selected models {} and {} for evaluation'.format(model_1.value, model_2.value))

### Choose an S3 Bucket for Model Evaluation jobs

As part of the prerequisites, we need an S3 bucket in the same region for the input datasets and the output of model evaluation jobs.

In this example, we use SageMaker's default session bucket. If you are running this notebook outside SageMaker, follow the comments in the cell.

In [None]:
import sagemaker #comment if Sagemaker is not used
import boto3
sess = sagemaker.Session() #comment if Sagemaker is not used

# To use a custom S3 bucket, or to run this notebook without SageMaker, set the bucket name as follows
#bucket = ""
bucket=None

if bucket is None and sess is not None: 
    # set to default bucket if a bucket name is not given
    bucket = sess.default_bucket()
 
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=bucket) #comment if Sagemaker is not used
 

print(f"Model Evaluation bucket: {bucket}")


### Enable Cross Origin Resource Sharing (CORS) on S3 bucket

Automatic model evaluation jobs that are created using the Amazon Bedrock console require that you specify a CORS configuration on the S3 bucket.

Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-security-cors.html for more details.
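
Before or after applying a CORS configuration, it can help to confirm that it allows the expected HTTP methods. The sketch below is an illustration only: `missing_cors_methods` is a hypothetical helper, and the required set simply mirrors the four methods used in this notebook's configuration rather than an authoritative list.

```python
REQUIRED_METHODS = ("GET", "PUT", "POST", "DELETE")


def missing_cors_methods(cors_configuration, required=REQUIRED_METHODS):
    """Return the required HTTP methods that no CORS rule allows."""
    allowed = set()
    for rule in cors_configuration.get("CORSRules", []):
        allowed.update(rule.get("AllowedMethods", []))
    return sorted(set(required) - allowed)


# Example configuration that is missing two methods
partial_config = {
    "CORSRules": [
        {"AllowedMethods": ["GET", "PUT"], "AllowedOrigins": ["*"]}
    ]
}

print(missing_cors_methods(partial_config))
```

The same function accepts the dictionary returned by `s3.get_bucket_cors(Bucket=bucket)`, since that response also carries a `CORSRules` list.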

In [None]:
# Define the CORS configuration rules
cors_configuration = {
    'CORSRules': [
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET",
            "PUT",
            "POST",
            "DELETE"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [
            "Access-Control-Allow-Origin"
        ]
    }
]
}

# Set the CORS configuration
s3 = boto3.client('s3')
s3.put_bucket_cors(Bucket=bucket,
                   CORSConfiguration=cors_configuration)

### IAM service role

To run an automatic model evaluation job you must create a service role. The service role allows Amazon Bedrock to perform actions on your behalf in your AWS account. Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/automatic-service-roles.html.

In [None]:
#Create IAM role
iam = boto3.client('iam')
aws_acct = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

assume_role_policy_document = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowBedrockToAssumeRole",
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": aws_acct
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:{}:{}:evaluation-job/*".format(region, aws_acct)
                }
            }
        }
    ]
})



In [None]:
import datetime

role_name="Amazon-Bedrock-model-eval-{}".format(str(datetime.datetime.now().timestamp()).split('.')[0])
create_role_response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument = assume_role_policy_document
)

In [None]:
role_arn = create_role_response["Role"]["Arn"]

role_arn

### Add permissions for Amazon Bedrock and S3 bucket access

In [None]:
aws_s3_policy_doc = json.dumps({
"Version": "2012-10-17",
"Statement": [
    {
        "Sid": "AllowAccessToCustomDatasetsAndOutput",
        "Effect": "Allow",
        "Action": [
            "s3:GetObject",
            "s3:ListBucket",
            "s3:PutObject"
        ],
        "Resource": [
            "arn:aws:s3:::{}".format(bucket),
            "arn:aws:s3:::{}/outputs/".format(bucket),
            "arn:aws:s3:::{}/custom_datasets/".format(bucket),
            "arn:aws:s3:::{}/*".format(bucket),
        ]
    }
]
}
)

aws_br_policy_doc = json.dumps({
        "Version": "2012-10-17",
            "Statement": [
        {
            "Sid": "AllowAccessToBedrockResources",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
                "bedrock:CreateModelInvocationJob",
                "bedrock:StopModelInvocationJob",
                "bedrock:GetProvisionedModelThroughput",
                "bedrock:GetInferenceProfile", 
                "bedrock:ListInferenceProfiles",
                "bedrock:GetImportedModel",
                "bedrock:GetPromptRouter",
                "sagemaker:InvokeEndpoint"
            ],
            "Resource": [
                "arn:aws:bedrock:*::foundation-model/*",
                "arn:aws:bedrock:*:{}:inference-profile/*".format(aws_acct),
                "arn:aws:bedrock:*:{}:provisioned-model/*".format(aws_acct),
                "arn:aws:bedrock:*:{}:imported-model/*".format(aws_acct),
                "arn:aws:bedrock:*:{}:application-inference-profile/*".format(aws_acct),
                "arn:aws:bedrock:*:{}:default-prompt-router/*".format(aws_acct),
                "arn:aws:sagemaker:*:{}:endpoint/*".format(aws_acct),
                "arn:aws:bedrock:*:{}:marketplace/model-endpoint/all-access".format(aws_acct)
            ]
        }
    ]
}
)
    

In [None]:
iam_s3_response = iam.put_role_policy(
    RoleName=role_name,
    PolicyName="s3_access",
    PolicyDocument=aws_s3_policy_doc
)
iam_s3_response

In [None]:
iam_bedrock_response = iam.put_role_policy(
    RoleName=role_name,
    PolicyName="br_access",
    PolicyDocument=aws_br_policy_doc
)
iam_bedrock_response

#### Get ARNs for the selected models

In [None]:
import boto3
bedrock_client = boto3.client('bedrock')
region = boto3.session.Session().region_name
region_prefix = region.split('-')[0]
model_arns = []

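for model in model_1.value, model_2.value:
    fm_response = bedrock_client.get_foundation_model(
        modelIdentifier=model
    )
    # Use the first supported inference type to decide which ARN to evaluate
    inference_type = fm_response['modelDetails']['inferenceTypesSupported'][0]
    if inference_type == "ON_DEMAND":
        model_arn = fm_response['modelDetails']['modelArn']
    elif inference_type == "INFERENCE_PROFILE":
        # Inference profile IDs are prefixed with a geography code, e.g. "us."
        model = "{}.{}".format(region_prefix, model)
        model_arn = bedrock_client.get_inference_profile(
            inferenceProfileIdentifier=model
        )['inferenceProfileArn']
    else:
        # Avoid silently reusing the ARN from the previous iteration
        raise ValueError("Unsupported inference type {} for model {}".format(inference_type, model))
    print(model_arn)
    model_arns.append(model_arn)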

In [None]:
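# Optional: to evaluate a custom imported model, uncomment the line below
# and replace the ARN with the ARN of an imported model in your own account.
#model_arns[1] = "arn:aws:bedrock:us-west-2:072851894905:imported-model/1vb1vige5vll"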

#### Function to submit automatic model evaluation jobs
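
The `inferenceParams` field in the function below is passed as a JSON string. Instead of escaping quotes by hand, the string can be built with `json.dumps`. This is a small sketch with a hypothetical helper name, `build_inference_params`; the default values mirror the ones used in this notebook.

```python
import json


def build_inference_params(max_tokens=1024, temperature=0.3, top_p=0.5):
    """Serialize inference settings into the JSON string form used for inferenceParams."""
    return json.dumps({
        "inferenceConfig": {
            "maxTokens": max_tokens,
            "temperature": temperature,
            "topP": top_p,
        }
    })


params = build_inference_params(temperature=0.1)
print(params)
```

Building the string this way keeps the payload valid JSON when values change, and makes it easy to run the same evaluation with different sampling settings.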

In [None]:


def model_eval(model_arn, dataset, task_type, output_path, job_name, metric_names, custom_ds=False, custom_ds_s3=None):
    if custom_ds:
        ds = {
                'name': dataset,
                'datasetLocation': {
                    's3Uri': custom_ds_s3
                        }
            }
    else:
        ds = {
                'name': dataset
            }
    job_request = bedrock_client.create_evaluation_job(
        jobName=job_name,
        jobDescription="Bedrock Model evaluation job",
        roleArn=role_arn,
        inferenceConfig={
            "models": [
                {
                    "bedrockModel": {
                        "modelIdentifier":model_arn,
                        "inferenceParams":"{\"inferenceConfig\":{\"maxTokens\": 1024,\"temperature\":0.3,\"topP\":0.5}}"
                    }

                }
            ]

        },
        outputDataConfig={
            "s3Uri": output_path
        },
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [
                    {
                        "taskType": task_type,
                        "dataset": ds,
                        "metricNames": metric_names
                    }
                ]
            }
        }
    )

    return job_request

## <ins> Automatic Model Evaluation using Built-in Dataset </ins>

### Define taskType, Dataset and metrics for evaluation

**Task Type:**
Model evaluation supports the following task types that assess different aspects of the model's performance:

* General text generation – the model performs natural language processing and text generation tasks.
* Text summarization – the model summarizes text based on the prompts you provide.
* Question and answer – the model provides answers based on your prompts.
* Text classification – the model categorizes text into predefined classes based on the input dataset.

**Metrics:**
You can choose which of the following metrics the model evaluation job should compute.

* Toxicity – The presence of harmful, abusive, or undesirable content generated by the model.
* Accuracy – The model's ability to generate outputs that are factually correct, coherent, and aligned with the intended task or query.
* Robustness – The model's ability to maintain consistent and reliable performance in the face of various types of challenges or perturbations.

**Datasets:**
Amazon Bedrock provides multiple built-in prompt datasets that you can use in an automatic model evaluation job. Each built-in dataset is based on an open-source dataset. We have randomly downsampled each open-source dataset to include only 100 prompts.

For the complete list of supported datasets, task types, and metrics, refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-prompt-datasets.html.
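
Each evaluation job also needs a unique name, which the next cell derives from the model ARN and a timestamp. The sketch below shows one way to make that step more defensive. It assumes job names may contain only lowercase letters, digits, and hyphens, with a maximum length of 63 characters; this constraint is an assumption to verify against the `CreateEvaluationJob` API reference. `make_job_name` is a hypothetical helper, not part of the original notebook.

```python
import re
import time


def make_job_name(model_arn, prefix="model-eval", timestamp=None, max_len=63):
    """Build a job name from a model ARN and a timestamp, within max_len characters."""
    # Keep the last ARN segment and drop any ":version" suffix
    model_part = model_arn.split("/")[-1].split(":")[0]
    stamp = str(int(time.time() if timestamp is None else timestamp))
    # Lowercase, then collapse every run of other characters into one hyphen
    base = re.sub(r"[^a-z0-9]+", "-", "{}-{}".format(prefix, model_part).lower()).strip("-")
    # Trim the model part rather than the timestamp so names stay unique
    room = max_len - len(stamp) - 1
    return "{}-{}".format(base[:room].rstrip("-"), stamp)


example_arn = "arn:aws:bedrock:us-west-2::foundation-model/meta.llama3-1-8b-instruct-v1:0"
print(make_job_name(example_arn, timestamp=1700000000))
```

Truncating the model portion instead of the timestamp keeps names distinct when the same model is evaluated more than once.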

In [None]:
import datetime

### Use one of the following example combinations of task_type, dataset and metrics, or pick another supported built-in combination from
### https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-prompt-datasets.html#model-evaluation-prompt-datasets-builtin

#### Example-1 #####
task_type = "QuestionAndAnswer"
dataset = "Builtin.NaturalQuestions"
metric_names = ["Builtin.Accuracy", "Builtin.Robustness", "Builtin.Toxicity"]

#### Example-2 #####
#task_type = "Classification"
#dataset = "Builtin.WomensEcommerceClothingReviews"
#metric_names = ["Builtin.Accuracy", "Builtin.Robustness"] 

output_path = "s3://{}/outputs/".format(bucket)

eval_jobs = []
for model_arn in model_arns:
    job_name = "model-eval-{}-{}".format(model_arn.split('/')[-1].split(':')[0], str(datetime.datetime.now().timestamp()).split('.')[0])
    job_name = job_name.replace(".", "-")
    job = model_eval(model_arn, dataset, task_type, output_path, job_name, metric_names)
    eval_jobs.append(job)


In [None]:
# Function to poll both evaluation jobs until each reaches a terminal status
# ("Completed", "Failed", or "Stopped") or the time limit is reached.
def check_job_status(eval_jobs):
    from IPython.display import clear_output
    import time
    from datetime import datetime

    terminal_states = ("Completed", "Failed", "Stopped")
    max_time = time.time() + 2*60*60 # 2 hours - Update the max time if needed
    while time.time() < max_time:
        current_time = datetime.now().strftime("%H:%M:%S")
        get_eval_job1 = bedrock_client.get_evaluation_job(
            jobIdentifier=eval_jobs[0]['jobArn']
        )
        job1_status = get_eval_job1["status"]

        get_eval_job2 = bedrock_client.get_evaluation_job(
            jobIdentifier=eval_jobs[1]['jobArn']
        )
        job2_status = get_eval_job2["status"]

        clear_output(wait=True)
        print(f"{current_time} : Model evaluation job1 is {job1_status} and job2 is {job2_status}.")

        # Stop polling once both jobs have finished, failed, or been stopped
        if job1_status in terminal_states and job2_status in terminal_states:
            break

        time.sleep(60)
    return get_eval_job1, get_eval_job2

In [None]:
# Check job status, polling until both jobs finish
get_eval_job1, get_eval_job2 = check_job_status(eval_jobs)

In [None]:
# Function to get the S3 output location of model evaluation job.
s3_client = boto3.client('s3')
def get_output_jsonl(bucket, eval_job_response, model, task_type, dataset):
    prefix = "{}{}/{}/models/{}/taskTypes/{}/datasets/{}".format("/".join(eval_job_response["outputDataConfig"]["s3Uri"].split('/')[3:]), eval_job_response["jobName"], eval_job_response["jobArn"].split("/")[1], model, task_type, dataset)
    response = s3_client.list_objects(
        Bucket=bucket,
        Prefix=prefix,
    )
    return response['Contents'][0]['Key']

In [None]:
model_val1 = get_eval_job1['inferenceConfig']['models'][0]['bedrockModel']['modelIdentifier'].split('/')[-1]
model_val2 = get_eval_job2['inferenceConfig']['models'][0]['bedrockModel']['modelIdentifier'].split('/')[-1]
job1_output = get_output_jsonl(bucket, get_eval_job1, model_val1, task_type, dataset)
job2_output = get_output_jsonl(bucket, get_eval_job2, model_val2, task_type, dataset)

In [None]:
# Function to retrieve metrics from the output
import json
s3_res = boto3.resource('s3')

def retrieve_metrics(bucket, output_jsonl):
    content_object = s3_res.Object(bucket, output_jsonl)
    jsonl_content = content_object.get()['Body'].read().decode('utf-8')
    output_content = [json.loads(jline) for jline in jsonl_content.splitlines()]
    return output_content

job1_metrics =  retrieve_metrics(bucket, job1_output)
job2_metrics =  retrieve_metrics(bucket, job2_output)

In [None]:
# Function to filter and load the metrics in pandas DataFrame
import pandas as pd

def pd_metrics(model1, model2, metric, job1_metrics, job2_metrics):
    met1 = []
    met2 = []
    met_index = [job1_metrics[0]['automatedEvaluationResult']['scores'].index(i) for i in job1_metrics[0]['automatedEvaluationResult']['scores'] if i["metricName"]==metric]
    for i, (x, y) in enumerate(zip(job1_metrics, job2_metrics)):
        met1.append(x['automatedEvaluationResult']['scores'][met_index[0]]['result'])
        met2.append(y['automatedEvaluationResult']['scores'][met_index[0]]['result'])
    met = pd.DataFrame({model1.split(':')[0]: met1, model2.split(':')[0]: met2})
    return met

In [None]:
metrics = [m.split('.')[1] for m in metric_names]
stats_list = []
for metric in metrics:
    met_pd = pd_metrics(model_1.value, model_2.value, metric, job1_metrics, job2_metrics)
    stats_list.append(met_pd)

In [None]:
# Function to draw a line plot for model comparison per metric
import seaborn as sns
import matplotlib.pyplot as plt

def plot_line_metrics(metrics, stats_list):
    for metric, df in zip(metrics, stats_list):
        plt.figure(figsize=(12, 6))
        sns.set_style("whitegrid")
        sns.lineplot(data=df, markers=True, palette="flare")
        plt.legend(title='Model')
        plt.xlabel('Inference test')
        plt.ylabel(metric)
        plt.title(metric)
        plt.show();

In [None]:
plot_line_metrics(metrics, stats_list)

In [None]:
# Function to plot bar chart for avg accuracy per model
def plt_acc_bar(df, metric):
    # Calculate the average of each column
    column_averages = df.mean()

    # Create a bar plot
    plt.figure()
    sns.barplot(x=column_averages.index, y=column_averages.values)

    # Customize the plot
    plt.title("Average metric - {}".format(metric))
    plt.xlabel('Models')
    plt.ylabel('Average Value')

    # Rotate x-axis labels if there are many columns
    plt.xticks(rotation=45, ha='right')

    # Add value labels on top of each bar
    for i, v in enumerate(column_averages.values):
        plt.text(i, v, f'{v:.2f}', ha='center', va='bottom')

    plt.tight_layout()
    plt.show()

In [None]:
#Average Accuracy
plt_acc_bar(stats_list[0], metrics[0])

In [None]:
#Function to bin the accuracy data in different accuracy(in percentage) bins [0, 20, 40, 60, 80, 100] and compare between models


def bin_data(series, bins_list):
    bins = pd.cut(series, bins=bins_list)
    return bins, bins.value_counts().index

def plot_bin_accuracy(df, bins_list):
    # Apply binning to both columns
    df_binned = df.apply(lambda x: bin_data(x, bins_list)[0])
    bin_edges = bin_data(df.values.flatten(), bins_list)[1]

    # Melt the DataFrame to long format
    df_melted = df_binned.melt(var_name='model', value_name='bin')

    # Count the occurrences of each bin for each model
    df_counted = df_melted.groupby(['model', 'bin']).size().reset_index(name='count')

    # Create the plot
    plt.figure(figsize=(12, 6))
    sns.barplot(x='bin', y='count', hue='model', data=df_counted)

    # Customize the plot
    plt.title('Comparison of Accuracy Range Across Two Models')
    plt.xlabel('Accuracy Range')
    plt.ylabel('Count')
    plt.legend(title='Model')

    # Set x-axis labels to actual bin ranges
    plt.xticks(range(len(bin_edges)), [f'({interval.left:.2f}, {interval.right:.2f}]' for interval in bin_edges], rotation=45, ha='right')

    plt.tight_layout()
    plt.show()

In [None]:
plot_bin_accuracy(stats_list[0], bins_list=[0, 0.2, 0.4, 0.6, 0.8, 1.0]) #update the bin values as needed

## <ins> Automatic Model Evaluation using Custom Dataset </ins>

Now let's evaluate the same models with a custom dataset.

*For demonstration purposes only, we use the Databricks Dolly-15k dataset from Hugging Face.*

**Note: Customers may use their own validation (ground truth) dataset, in the format given below, based on their workload.**


You can use a custom prompt dataset in an automatic model evaluation job. Custom prompt datasets must be stored in Amazon S3, use the JSON Lines format, and have the .jsonl file extension. Each line must be a valid JSON object. There can be up to 1000 prompts in your dataset per automatic evaluation job.

**A custom dataset must use the following key-value pair format.**

`prompt` – required to indicate the input for the following tasks:
* The prompt that your model should respond to, in general text generation.
* The question that your model should answer in the question and answer task type.
* The text that your model should summarize in text summarization task.
* The text that your model should classify in classification tasks.

`referenceResponse` – required to indicate the ground truth response against which your model is evaluated for the following task types:
* The answer for all prompts in question and answer tasks.
* The answer for all accuracy and robustness evaluations.

`category` – (optional) generates evaluation scores reported for each category.

As an example, accuracy requires both the question asked and an answer to check the model's response against. In this example, the key `prompt` holds the question, the key `referenceResponse` holds the answer, and the key `category` holds the category of the question, as follows.

```
{"prompt": "Are The Smiths a good band?", 
"referenceResponse": "The Smiths were one of the most critically acclaimed bands to come from England in the 1980s. Typically classified as an \"indie rock\" band, the band released 4 albums from 1984 until their breakup in 1987. The band members, notably Morrissey and Johnny Marr, would go on to accomplish successful solo careers.",
"category": "general_qa"}
```

Please refer to https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-prompt-datasets.html#model-evaluation-prompt-datasets-custom for more details.
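
The format rules above can be checked locally before uploading a dataset. The following is a minimal sketch under stated assumptions: `validate_prompt_dataset` is a hypothetical helper, it treats `referenceResponse` as required (as it is for the accuracy and robustness metrics used in this notebook), and it applies the 1000-prompt limit mentioned above.

```python
import json


def validate_prompt_dataset(lines, max_prompts=1000, require_reference=True):
    """Return a list of problems found in JSON Lines prompt records."""
    problems = []
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append("line {}: invalid JSON ({})".format(number, err.msg))
            continue
        count += 1
        if not isinstance(record, dict) or not record.get("prompt"):
            problems.append("line {}: missing 'prompt'".format(number))
        elif require_reference and not record.get("referenceResponse"):
            problems.append("line {}: missing 'referenceResponse'".format(number))
    if count > max_prompts:
        problems.append("{} prompts exceed the limit of {}".format(count, max_prompts))
    return problems


good = ['{"prompt": "Are The Smiths a good band?", "referenceResponse": "Yes.", "category": "general_qa"}']
bad = ['{"prompt": "No reference here"}', 'not json']

print(validate_prompt_dataset(good))
print(validate_prompt_dataset(bad))
```

For a file on disk, the function can be called as `validate_prompt_dataset(open(path))`, since iterating over a file object yields one line at a time.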

In [None]:
# Download dataset
!wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl

### In this example, we take the first 100 records of the "open_qa" category from the dolly-15k dataset
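
The cell that follows keeps the first 100 matching records in file order. If a random subset is preferred, a seeded sample gives a reproducible alternative. This is a sketch of that alternative, not the approach used in this notebook; `sample_records` and the seed value are hypothetical.

```python
import random


def sample_records(records, category, k=100, seed=42):
    """Return up to k randomly chosen records from one category, reproducibly."""
    matching = [r for r in records if r.get("category") == category]
    rng = random.Random(seed)  # local generator, leaves global random state untouched
    return rng.sample(matching, min(k, len(matching)))


# Small made-up dataset so the example runs without downloading anything
demo_records = [{"category": "open_qa", "instruction": "q{}".format(i)} for i in range(10)]
demo_records.append({"category": "closed_qa", "instruction": "other"})

subset = sample_records(demo_records, "open_qa", k=5)
print(len(subset))
```

Fixing the seed means both models are always evaluated on the same subset when the notebook is rerun.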

In [None]:
# Function to filter the Dolly dataset by category; the first 100 matches are kept
import json

def filter_jsonl(data, key, value):
    filtered_data = []
    for item in data:
        if item.get(key) == value:
            filtered_data.append(item)
    return filtered_data

with open('databricks-dolly-15k.jsonl', 'r') as file:
    data = [json.loads(line) for line in file]

filtered_data = filter_jsonl(data, "category", "open_qa")[:100]
print(len(filtered_data))

In [None]:
# Function to modify the format as needed for custom dataset

custom_jsonl = './custom_dataset.jsonl'

def write_jsonl(data, filename):
    with open(filename, 'w') as f:
        for item in data:
            item_mod = {}
            item_mod['prompt'] = item['instruction']
            item_mod['referenceResponse'] = item['response']
            item_mod['category'] = item['category']
            f.write(json.dumps(item_mod) + '\n')

# Write to JSONL file
write_jsonl(filtered_data, custom_jsonl)
                     

In [None]:
#Copy dataset jsonl to S3 Bucket
import boto3

# Upload under a full object key ending in .jsonl rather than a bare folder prefix
custom_ds_key = 'custom_datasets/dolly/custom_dataset.jsonl'
s3_res = boto3.resource('s3')
s3_res.Bucket(bucket).upload_file(custom_jsonl, custom_ds_key)


In [None]:
# Choose task_type, metrics and s3 input/output path
task_type = "QuestionAndAnswer"
metric_names = ["Builtin.Accuracy", "Builtin.Robustness", "Builtin.Toxicity"] #Add or remove metrics within the list format
output_path = "s3://{}/outputs/".format(bucket)
cus_ds_s3 = "s3://{}/{}".format(bucket, custom_ds_key)

In [None]:
# Submit automatic model evaluation jobs with custom dataset
import datetime
cust_eval_jobs = []
for model_arn in model_arns:
    job_name = "model-eval-custom-{}-{}".format(model_arn.split('/')[-1].split(':')[0], str(datetime.datetime.now().timestamp()).split('.')[0])
    job_name = job_name.replace(".", "-")
    job = model_eval(model_arn, "custom", task_type, output_path, job_name, metric_names, custom_ds=True, custom_ds_s3=cus_ds_s3)
    cust_eval_jobs.append(job)


In [None]:
# Track evaluation job status in a loop until "Completed" or "Failed"
get_cust_eval_job1, get_cust_eval_job2 = check_job_status(cust_eval_jobs)

In [None]:
# Get evaluation jobs output
model_val1 = get_cust_eval_job1['inferenceConfig']['models'][0]['bedrockModel']['modelIdentifier'].split('/')[-1]
model_val2 = get_cust_eval_job2['inferenceConfig']['models'][0]['bedrockModel']['modelIdentifier'].split('/')[-1]
cust_job1_output = get_output_jsonl(bucket, get_cust_eval_job1, model_val1, task_type, dataset="custom")
cust_job2_output = get_output_jsonl(bucket, get_cust_eval_job2, model_val2, task_type, dataset="custom")

In [None]:
# Retrieve metrics
cust_job1_metrics =  retrieve_metrics(bucket, cust_job1_output)
cust_job2_metrics =  retrieve_metrics(bucket, cust_job2_output)

In [None]:
metrics = [ m.split('.')[1] for m in metric_names]
cust_stats_list = []
for metric in metrics:
    met_pd = pd_metrics(model_1.value, model_2.value, metric, cust_job1_metrics, cust_job2_metrics)
    cust_stats_list.append(met_pd)


In [None]:
# Draw line plot for model comparison per metric
plot_line_metrics(metrics, cust_stats_list)

In [None]:
#Average Accuracy per model
plt_acc_bar(cust_stats_list[0], metrics[0])

In [None]:
#Plot across different ranges of accuracy and compare
plot_bin_accuracy(cust_stats_list[0], bins_list=[0, 0.2, 0.4, 0.6, 0.8, 1.0]) #update the bin values as needed