# Fine-Tuning a Large Language Model with AutoMLV2

In this notebook, we fine-tune a LLaMA-2 7B model for question answering in the healthcare domain. We load a specialized healthcare dataset and train the model with Amazon SageMaker, using SageMaker Pipelines and AutoMLV2. The end goal is to deploy the fine-tuned model for real-time or batch inference so that it can answer the input questions accurately.

## Setup

This notebook was run in Amazon SageMaker Studio. The space is configured with a t3.medium instance, the SageMaker Distribution 1.4 image, and the Python 3 (ipykernel) kernel.

In [24]:
!pip install -U awscli sagemaker boto3 botocore fmeval==1.0.3 datasets --quiet


In [2]:
# SageMaker Python SDK Dependencies
import sagemaker
from sagemaker import ModelPackage
from sagemaker.image_uris import retrieve
from sagemaker.lambda_helper import Lambda
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.callback_step import CallbackStep
from sagemaker.workflow.lambda_step import LambdaStep
from sagemaker.workflow.parameters import ParameterInteger, ParameterString, ParameterFloat
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, CacheConfig
from sagemaker.workflow.function_step import step
from sagemaker.automl.automlv2 import AutoMLV2, AutoMLTextGenerationConfig, AutoMLDataChannel
# Other dependencies
import boto3
import os

# Helpers
import time
from datetime import datetime
from time import gmtime, sleep, strftime

# Dataset
from datasets import load_dataset, Dataset

os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


2024-07-23 08:18:24,344	INFO util.py:154 -- Outdated packages:
  ipywidgets==7.6.5 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
INFO:datasets:PyTorch version 2.3.1 available.


## Initialization

In [3]:
boto_session = boto3.session.Session()
aws_region = boto_session.region_name
sagemaker_client = boto_session.client("sagemaker")
lambda_client = boto_session.client("lambda")
sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session, sagemaker_client=sagemaker_client
)
sqs_client = boto_session.client(
    "sqs",
    region_name=aws_region,
    endpoint_url=f"https://sqs.{aws_region}.amazonaws.com",
)
BUCKET_NAME = sagemaker_session.default_bucket()

sagemaker.config INFO - Fetched defaults config from location: /root/autoMLV2


## IAM Permissions

This notebook simplifies the IAM permissions configuration when creating required IAM roles that can be assumed by the SageMaker and Lambda services. The following managed policies are sufficient to run this notebook but should be further scoped down to improve security (least privilege principle).

- Lambda Execution Role:
    - AmazonSageMakerFullAccess
    - AmazonSQSFullAccess
- SageMaker Execution Role:
    - AmazonSageMakerFullAccess
    - AWSLambda_FullAccess
    - AmazonSQSFullAccess

In [4]:
lambda_execution_role_name = "LambdaExecutionRole"
aws_account_id = boto3.client("sts").get_caller_identity().get("Account")
LAMBDA_EXECUTION_ROLE_ARN = f"arn:aws:iam::{aws_account_id}:role/{lambda_execution_role_name}"  # to be assumed by the Lambda service
SAGEMAKER_EXECUTION_ROLE_ARN = (
    sagemaker.get_execution_role()
)  
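
The role named in `lambda_execution_role_name` must already exist and be assumable by the Lambda service. The following is a minimal sketch of the trust policy such a role would need. The helper name `lambda_trust_policy` is hypothetical, and the snippet only builds the JSON document; it does not call IAM.

```python
import json


def lambda_trust_policy() -> dict:
    """Trust policy that lets the Lambda service assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Serialized form, as it could be passed as the AssumeRolePolicyDocument
# when creating the role with IAM.
trust_policy_json = json.dumps(lambda_trust_policy(), indent=2)
print(trust_policy_json)
```

The managed policies listed above would then be attached to the role separately.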

# SageMaker Training Pipeline 

## Pipeline Parameters

The following cell defines the parameters used by the pipeline steps.

In [5]:
cache_config = CacheConfig(enable_caching=False)
autopilot_job_name = ParameterString(
    name="AutopilotJobName",
    default_value="autopilot-" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
)
# Name of the model to be fine-tuned. Change this to the model of your choice.
base_model_name = ParameterString(name="BaseModelName", default_value="Llama2-7B")
s3_path = ParameterString(
    name="S3Path",
    default_value=sagemaker.s3.s3_path_join(
        "s3://", BUCKET_NAME, autopilot_job_name.default_value
    ),
)

#S3 Output paths
training_output_s3_path = ParameterString(
    name="TrainingOutputS3Path",
    default_value=sagemaker.s3.s3_path_join(
        "s3://", BUCKET_NAME, autopilot_job_name.default_value, "training-output"
    ),
)
output_s3_path = ParameterString(
    name="OutputS3Path",
    default_value=sagemaker.s3.s3_path_join(
        "s3://", BUCKET_NAME, autopilot_job_name.default_value, "output"
    ),
)


## Step 1: Load & Split Dataset
Let's first define the S3 paths where the dataset splits are stored.

`train_dataset_s3_path`: S3 URI where the training dataset is stored.

`validation_dataset_s3_path`: S3 URI where the validation dataset is stored. It is used during the evaluation step.


In [6]:
#Dataset S3 input paths
train_dataset_s3_path = ParameterString(
    name="TrainDatasetS3Path",
    default_value=sagemaker.s3.s3_path_join(
        "s3://", BUCKET_NAME, autopilot_job_name.default_value, "data/train", "train.csv"
    ),
)
validation_dataset_s3_path = ParameterString(
    name="ValidationDatasetS3Path",
    default_value=sagemaker.s3.s3_path_join(
        "s3://", BUCKET_NAME, autopilot_job_name.default_value, "data/validation", "validation.csv"
    ),
)

s3://sagemaker-us-east-1-146025735673/autopilot-2024-07-18-13-34-47/data/train/train.csv
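
All paths defined so far hang off a single per-job prefix. The sketch below reproduces that layout with plain string handling so it can be inspected without AWS access. `s3_join` is a simplified stand-in written for this illustration (it is not `sagemaker.s3.s3_path_join`), and the bucket and job names in the usage line are placeholders.

```python
def s3_join(bucket: str, *keys: str) -> str:
    """Join a bucket and key parts into an s3:// URI with single slashes."""
    return "s3://" + "/".join(part.strip("/") for part in (bucket, *keys))


def job_layout(bucket: str, job_name: str) -> dict:
    """S3 locations used by the training pipeline for one AutoML job."""
    return {
        "base": s3_join(bucket, job_name),
        "training_output": s3_join(bucket, job_name, "training-output"),
        "output": s3_join(bucket, job_name, "output"),
        "train": s3_join(bucket, job_name, "data/train", "train.csv"),
        "validation": s3_join(bucket, job_name, "data/validation", "validation.csv"),
    }


# Placeholder bucket and job name, for illustration only.
for name, uri in job_layout("my-bucket", "autopilot-example").items():
    print(f"{name:16s} {uri}")
```

Because the job name default embeds `datetime.now()`, it is fixed when the parameter cell runs, so every path derived from it shares the same timestamp.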


We load a dataset from Hugging Face to accompany this notebook. The dataset is a set of question/answer pairs covering multiple healthcare cases. You may use our synthetic dataset, or alter this notebook to accommodate your own data. Note that the next cell uploads the dataset to your S3 bucket.
IMPORTANT: When training the model with AutoMLV2, your dataset must contain exactly two columns: 'input' and 'output'.

In [22]:
import load_split_dataset
step_load_dataset = step(load_split_dataset.load_split_dataset, name="load_split_dataset")(train_dataset_s3_path.default_value, validation_dataset_s3_path.default_value)
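
The `load_split_dataset` module itself is not shown in this notebook. As a rough sketch of what such a step has to do, the code below checks the two-column requirement, splits the rows deterministically, and serializes each split as CSV. The function names, the 80/20 split, and the seed are assumptions made for this example, and it uses only the standard library instead of `datasets` or S3.

```python
import csv
import io
import random

REQUIRED_COLUMNS = ["input", "output"]


def split_rows(rows, validation_fraction=0.2, seed=42):
    """Validate the columns, shuffle deterministically, return (train, validation)."""
    for row in rows:
        if sorted(row.keys()) != REQUIRED_COLUMNS:
            raise ValueError(
                f"expected columns {REQUIRED_COLUMNS}, got {sorted(row.keys())}"
            )
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def to_csv(rows):
    """Serialize rows to CSV text with an 'input,output' header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows standing in for the healthcare question/answer pairs.
rows = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(10)]
train_rows, validation_rows = split_rows(rows)
print(len(train_rows), len(validation_rows))
```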

## Step 2: Create an AutoMLV2 Job to train the model

### Define the AutoML text generation config

Let's define the configuration parameters we'll be passing to the Text Generation Job Configuration.

In [35]:
#hyperparameters
epoch_count = ParameterString(name="epochCount", default_value="3")
learning_rate = ParameterString(name="learningRate", default_value="0.00001")
batch_size = ParameterString(name="batchSize", default_value="1") 
learning_rate_warmup_steps = ParameterString(name="learningRateWarmupSteps", default_value="0") 

### Configuration highlights
`epochCount`
* __Description__: Determines how many times the model goes through the entire training dataset.<br>
* __Value '3'__: One epoch means the Llama2 model has been exposed to all 10000 samples and had a chance to learn from them. You can keep 3, or increase the number if the model doesn’t converge within 3 epochs.

`learningRate`
* __Description__: Controls the step size at which the model's parameters are updated during training, and therefore how quickly or slowly the model learns.<br>
* __Value '0.00001'__: A learning rate of 1e-5 or 2e-5 is a good standard when fine-tuning LLMs like Llama2.

`batchSize`
* __Description__: Defines the number of data samples used in each training iteration. It affects convergence speed and memory usage.<br>
* __Value '1'__: Start with 1 to avoid out-of-memory errors.

`learningRateWarmupSteps`
* __Description__: Specifies the number of training steps during which the learning rate gradually increases before reaching its target or maximum value.<br>
* __Value '0'__: No warmup; training starts at the target learning rate. This matches the default set in the cell above.
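
The four hyperparameters are defined as `ParameterString` values because the text generation job configuration takes them as a string-to-string map. The sketch below builds that map and fails early on values that are not numeric. The helper name `text_generation_hyperparameters` and the validation rules are assumptions for illustration, not part of the SageMaker SDK.

```python
def text_generation_hyperparameters(
    epoch_count: str, learning_rate: str, batch_size: str, warmup_steps: str
) -> dict:
    """Build the string-to-string hyperparameter map for a text generation job."""
    integer_values = {
        "epochCount": epoch_count,
        "batchSize": batch_size,
        "learningRateWarmupSteps": warmup_steps,
    }
    for name, value in integer_values.items():
        if not value.isdigit():
            raise ValueError(f"{name} must be a non-negative integer string, got {value!r}")
    float(learning_rate)  # raises ValueError if the string is not numeric
    return {**integer_values, "learningRate": learning_rate}


# Defaults from the cell above.
hyperparameters = text_generation_hyperparameters("3", "0.00001", "1", "0")
print(hyperparameters)
```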

### Create an AutoMLV2 Job to train the model

In [24]:
lambda_start_autopilot_job = Lambda(
    function_name="StartSagemakerAutopilotJob",
    execution_role_arn=LAMBDA_EXECUTION_ROLE_ARN,
    script="start_autopilot_job.py",
    handler="start_autopilot_job.lambda_handler",
    session=sagemaker_session,
)
lambda_start_autopilot_job.upsert()
step_start_autopilot_job = LambdaStep(
    name="StartAutopilotJobStep",
    lambda_func=lambda_start_autopilot_job,
    inputs={
        "TrainDatasetS3Path": train_dataset_s3_path.default_value,
        "TrainingOutputS3Path": training_output_s3_path.default_value,
        "AutopilotJobName": autopilot_job_name.default_value,
        "BaseModelName": base_model_name.default_value,
        "epochCount": epoch_count.default_value,
        "learningRate": learning_rate.default_value,
        "batchSize": batch_size.default_value,
        "learningRateWarmupSteps": learning_rate_warmup_steps.default_value,
        "AutopilotExecutionRoleArn": SAGEMAKER_EXECUTION_ROLE_ARN,
    },
    cache_config=cache_config,
    depends_on=[step_load_dataset]
)
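
The `start_autopilot_job.py` script is not included in this notebook. The following is a hedged sketch of what its handler might look like, assuming it forwards the step inputs to the `create_auto_ml_job_v2` API. The `sagemaker_client` argument is a hypothetical addition that lets the handler be exercised without AWS access, and `AcceptEula` is set on the assumption that the chosen base model requires accepting a license agreement.

```python
HYPERPARAMETER_KEYS = ("epochCount", "learningRate", "batchSize", "learningRateWarmupSteps")


def build_create_job_request(event: dict) -> dict:
    """Translate the LambdaStep inputs into a create_auto_ml_job_v2 request."""
    return {
        "AutoMLJobName": event["AutopilotJobName"],
        "AutoMLJobInputDataConfig": [
            {
                "ChannelType": "training",
                "ContentType": "text/csv;header=present",
                "CompressionType": "None",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": event["TrainDatasetS3Path"],
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": event["TrainingOutputS3Path"]},
        "AutoMLProblemTypeConfig": {
            "TextGenerationJobConfig": {
                "BaseModelName": event["BaseModelName"],
                "TextGenerationHyperParameters": {k: event[k] for k in HYPERPARAMETER_KEYS},
                # Assumption: the base model is gated behind a license agreement.
                "ModelAccessConfig": {"AcceptEula": True},
            }
        },
        "RoleArn": event["AutopilotExecutionRoleArn"],
    }


def lambda_handler(event, context, sagemaker_client=None):
    if sagemaker_client is None:
        import boto3  # imported lazily so the request builder can be tested offline

        sagemaker_client = boto3.client("sagemaker")
    sagemaker_client.create_auto_ml_job_v2(**build_create_job_request(event))
    return {"AutopilotJobName": event["AutopilotJobName"]}
```

Keeping the request construction in a pure function makes it easy to inspect the payload before any job is started.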

## Step 3: Check Completion Status of the AutoMLV2 Job 

In [25]:
lambda_check_autopilot_job_status_function_name = "CheckSagemakerAutopilotJobStatus"
lambda_check_autopilot_job_status = Lambda(
    function_name=lambda_check_autopilot_job_status_function_name,
    execution_role_arn=LAMBDA_EXECUTION_ROLE_ARN,
    script="check_autopilot_job_status.py",
    handler="check_autopilot_job_status.lambda_handler",
    session=sagemaker_session,
    timeout=15,
)
lambda_check_autopilot_job_status.upsert()
queue_url = sqs_client.create_queue(
    QueueName="AutopilotSagemakerPipelinesSqsCallback",
    Attributes={"DelaySeconds": "5", "VisibilityTimeout": "300"},  # visibility timeout: 5 minutes
)["QueueUrl"]
# Add event source mapping
try:
    response = lambda_client.create_event_source_mapping(
        EventSourceArn=sqs_client.get_queue_attributes(
            QueueUrl=queue_url, AttributeNames=["QueueArn"]
        )["Attributes"]["QueueArn"],
        FunctionName=lambda_check_autopilot_job_status_function_name,
        Enabled=True,
        BatchSize=1,
    )
except lambda_client.exceptions.ResourceConflictException:
    pass
step_check_autopilot_job_status_callback = CallbackStep(
    name="CheckAutopilotJobStatusCallbackStep",
    sqs_queue_url=queue_url,
    inputs={"AutopilotJobName": autopilot_job_name},
    outputs=[],
    depends_on=[step_start_autopilot_job],
)
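
The callback step posts a message to the SQS queue and pauses until something reports success or failure for the token in that message. The `check_autopilot_job_status.py` script is not shown here; the sketch below is one plausible shape for it, assuming the message body carries `token` and `arguments` fields. The `sagemaker_client` argument and the function `handle_callback_message` are hypothetical names introduced so that the logic can be tested with a fake client.

```python
import json


def handle_callback_message(body: str, sagemaker_client) -> str:
    """Check the AutoML job named in one callback message and report back."""
    message = json.loads(body)
    token = message["token"]
    job_name = message["arguments"]["AutopilotJobName"]
    status = sagemaker_client.describe_auto_ml_job_v2(AutoMLJobName=job_name)[
        "AutoMLJobStatus"
    ]
    if status == "Completed":
        sagemaker_client.send_pipeline_execution_step_success(
            CallbackToken=token, OutputParameters=[]
        )
    elif status in ("Failed", "Stopped"):
        sagemaker_client.send_pipeline_execution_step_failure(
            CallbackToken=token,
            FailureReason=f"AutoML job {job_name} ended with status {status}",
        )
    else:
        # Raising fails the invocation, so the message returns to the queue
        # after the visibility timeout and the check runs again.
        raise RuntimeError(f"AutoML job {job_name} is still {status}")
    return status


def lambda_handler(event, context, sagemaker_client=None):
    if sagemaker_client is None:
        import boto3  # imported lazily so the logic can be tested offline

        sagemaker_client = boto3.client("sagemaker")
    for record in event["Records"]:
        handle_callback_message(record["body"], sagemaker_client)
```

With the queue attributes used above, an unfinished job would be rechecked roughly every five minutes.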

## Step 4: Create Model

This step creates a SageMaker model from the best candidate generated by the AutoML job and retrieves insights about the model:

- `automl_pipeline_model` is created using the `create_model` method of the AutoMLV2 object, which packages the best candidate model for deployment. This model is identified by `best_candidate_name` and originates from the `best_candidate` data structure.

- `model_insights_report` and `model_explainability_report` are extracted from the properties of `best_candidate`. These reports provide insight into the model's performance and explainability, helping to understand how the model makes its predictions.

In [7]:
model_name = ParameterString(name="ModelName")
model_metrics_report_param = ParameterString(name="ModelMetricsReport")
metrics_report_s3_path = ParameterString(
    name="MetricsReportS3Path",
    default_value=sagemaker.s3.s3_path_join(
        "s3://", BUCKET_NAME, autopilot_job_name.default_value, "metrics/model_metrics.json"
    ),
)
endpoint_name = ParameterString(name="EndpointName", default_value=" ")

In [39]:
from sagemaker.workflow.lambda_step import LambdaOutput, LambdaOutputTypeEnum
output_param_1 = LambdaOutput(
    output_name="model_name_param", output_type=LambdaOutputTypeEnum.String
)
output_param_2 = LambdaOutput(
    output_name="model_metrics_report_param", output_type=LambdaOutputTypeEnum.String
)

output_param_3 = LambdaOutput(
    output_name="endpoint_name_param", output_type=LambdaOutputTypeEnum.String
)

lambda_create_autopilot_model = Lambda(
    function_name="CreateSagemakerAutopilotModel",
    execution_role_arn=LAMBDA_EXECUTION_ROLE_ARN,
    script="create_autopilot_model.py",
    handler="create_autopilot_model.lambda_handler",
    session=sagemaker_session,
    timeout=15,
)
lambda_create_autopilot_model.upsert()
step_create_autopilot_model = LambdaStep(
    name="CreateAutopilotModelStep",
    lambda_func=lambda_create_autopilot_model,
    inputs={
        "AutopilotJobName": autopilot_job_name.default_value,
        "AutopilotExecutionRoleArn": SAGEMAKER_EXECUTION_ROLE_ARN,
        "MetricsReportS3Path": metrics_report_s3_path,
    },
    outputs=[output_param_1, output_param_2, output_param_3],
    cache_config=cache_config,
    depends_on=[step_check_autopilot_job_status_callback]
)

model_name = step_create_autopilot_model.properties.Outputs["model_name_param"]
model_metrics_report = step_create_autopilot_model.properties.Outputs["model_metrics_report_param"]
endpoint_name = step_create_autopilot_model.properties.Outputs["endpoint_name_param"]
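
A `LambdaStep` reads its outputs from the dictionary the Lambda handler returns, matching keys against each `LambdaOutput.output_name`. The `create_autopilot_model.py` script is not shown, so the sketch below only illustrates that contract. The naming scheme for the model and endpoint is hypothetical, and the actual model and endpoint creation is omitted.

```python
OUTPUT_NAMES = ("model_name_param", "model_metrics_report_param", "endpoint_name_param")


def build_step_outputs(model_name: str, metrics_report_s3_path: str, endpoint_name: str) -> dict:
    """Return a payload whose keys match the LambdaOutput names of the step."""
    outputs = {
        "model_name_param": model_name,
        "model_metrics_report_param": metrics_report_s3_path,
        "endpoint_name_param": endpoint_name,
    }
    missing = set(OUTPUT_NAMES) - set(outputs)
    if missing:
        raise KeyError(f"missing step outputs: {sorted(missing)}")
    return outputs


def lambda_handler(event, context):
    job_name = event["AutopilotJobName"]
    # Hypothetical naming scheme; model and endpoint creation is omitted here.
    return build_step_outputs(
        model_name=f"{job_name}-model",
        metrics_report_s3_path=event["MetricsReportS3Path"],
        endpoint_name=f"{job_name}-endpoint",
    )
```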

## Create and Run Training Pipeline

In [40]:
pipeline = Pipeline(
    name="training-pipeline",
    parameters=[
        autopilot_job_name,
        train_dataset_s3_path,
        validation_dataset_s3_path,
        base_model_name,
        epoch_count,
        learning_rate,
        metrics_report_s3_path,
        batch_size,
        learning_rate_warmup_steps,
        output_s3_path
    ],
    steps=[
        step_load_dataset,
        step_start_autopilot_job,
        step_check_autopilot_job_status_callback,
        step_create_autopilot_model,
    ],
    sagemaker_session=sagemaker_session,
)
pipeline.upsert(role_arn=SAGEMAKER_EXECUTION_ROLE_ARN)
pipeline_execution = pipeline.start()
pipeline_execution.wait(delay=20, max_attempts=24 * 60 * 3)  # max wait: 24 hours
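
The 24-hour figure in the comment follows from the polling arithmetic: the execution is polled every `delay` seconds, at most `max_attempts` times. A quick check:

```python
delay_seconds = 20
max_attempts = 24 * 60 * 3  # three polls per minute for 24 hours

max_wait_seconds = delay_seconds * max_attempts
max_wait_hours = max_wait_seconds / 3600
print(max_wait_seconds, max_wait_hours)
```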

# SageMaker Inference Pipeline 

To evaluate the deployed model, we use __fmeval__, a library for evaluating Large Language Models (LLMs) that helps select the best LLM to register. The library contains algorithms that evaluate LLMs for accuracy, toxicity, semantic robustness, and prompt stereotyping across different tasks.
It also provides implementations of the ModelRunner interface. ModelRunner encapsulates the logic for invoking different types of LLMs and exposes a predict method that simplifies interactions with LLMs within the evaluation algorithm code. fmeval has built-in support for Amazon SageMaker endpoints and JumpStart models. Users can extend the interface for their own model classes by implementing the predict method.
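
To make the ModelRunner idea concrete, here is a small stand-in that mimics the contract without importing fmeval. The class names `ModelRunnerLike` and `CallableModelRunner` are hypothetical, and the `(output, log_probability)` return shape is an assumption about the interface; check the fmeval documentation before relying on it.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional, Tuple


class ModelRunnerLike(ABC):
    """Stand-in for the interface described above (not imported from fmeval)."""

    @abstractmethod
    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        """Return (generated text, log probability); either may be None."""


class CallableModelRunner(ModelRunnerLike):
    """Wraps any function that maps a prompt string to a completion string."""

    def __init__(self, invoke: Callable[[str], str], prompt_template: str = "{prompt}"):
        self._invoke = invoke
        self._prompt_template = prompt_template

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        full_prompt = self._prompt_template.format(prompt=prompt)
        # This wrapper has no access to token probabilities, so it returns None.
        return self._invoke(full_prompt), None


# Toy model that upper-cases its prompt, used only to exercise the interface.
runner = CallableModelRunner(lambda text: text.upper(), prompt_template="Q: {prompt}")
print(runner.predict("hi"))
```

An evaluation loop then only needs `predict`, regardless of whether the model sits behind an endpoint or a local function.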

In [23]:
print(endpoint_name.default_value)

ep-autopilot-model-2024-07-22-06-41-20-automl


## Step 1: Preprocess Evaluation data

Let's define the output s3 path to store preprocessed evaluation data.

In [61]:
#Dataset S3 input paths
eval_dataset_s3_path = ParameterString(
    name="EvalDatasetS3Path",
    default_value=sagemaker.s3.s3_path_join(
        "s3://", BUCKET_NAME, autopilot_job_name.default_value, "data/evaluation", "dataset_evaluation.jsonl"
    ),
)

In [62]:
import preprocess_evaluation
step_preprocess_eval_dataset = step(preprocess_evaluation.preprocess_evaluation, name="preprocess_evaluation")(eval_dataset_s3_path.default_value, validation_dataset_s3_path.default_value)
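
The `preprocess_evaluation` module is not shown. Given that the validation data is a CSV with `input` and `output` columns and the evaluation dataset is a `.jsonl` file, the conversion could look roughly like the sketch below. The JSON field names `model_input` and `target_output` are assumptions chosen for this example, and the sample rows are placeholders.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert 'input,output' CSV text into one JSON object per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [
        json.dumps({"model_input": row["input"], "target_output": row["output"]})
        for row in reader
    ]
    return "\n".join(lines) + ("\n" if lines else "")


# Placeholder rows, for illustration only.
sample_csv = "input,output\nquestion 1,answer 1\nquestion 2,answer 2\n"
jsonl_text = csv_to_jsonl(sample_csv)
print(jsonl_text)
```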

## Step 2: Evaluate Model

In [63]:
# Evaluation metrics output location
eval_metrics_output_key = ParameterString(
    name="EvalMetricsOutputKey",
    default_value=f"{autopilot_job_name.default_value}/metrics/evaluation_metrics.json"
)

eval_metrics_output_s3_path = ParameterString(
    name="EvalMetricsOutputS3Path",
    default_value=sagemaker.s3.s3_path_join(
        "s3://", BUCKET_NAME, autopilot_job_name.default_value, "metrics", "evaluation_metrics.json"
    ),
)

In [64]:
import evaluate_model
step_evaluate_model = step(evaluate_model.evaluate_model, name="evaluate")(step_preprocess_eval_dataset, endpoint_name.default_value, BUCKET_NAME, eval_metrics_output_key.default_value)
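
The registration condition later in this notebook is based on an F1 score. As a reminder of what a token-level F1 measures for question answering, here is a simplified sketch. It lower-cases and splits on whitespace only, so it is not a reimplementation of fmeval's scoring, which may normalize text differently.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    predicted_tokens = prediction.lower().split()
    reference_tokens = reference.lower().split()
    if not predicted_tokens or not reference_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(predicted_tokens == reference_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(predicted_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("a b c", "a b d"))
```

A dataset-level score would typically be the mean of this value over all evaluation samples.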

## Step 3: Register Model

In [65]:
#registry parameters
model_approval_status = ParameterString(name="ModelApprovalStatus", default_value="Approved")
model_package_name = ParameterString(
    name="ModelPackageName",
    default_value=autopilot_job_name.default_value + "-model-package",
)
instance_type = ParameterString(name="InstanceType", default_value="ml.m5.xlarge")

In [66]:
lambda_register_autopilot_model = Lambda(
    function_name="RegisterSagemakerAutopilotModel",
    execution_role_arn=LAMBDA_EXECUTION_ROLE_ARN,
    script="register_autopilot_model.py",
    handler="register_autopilot_model.lambda_handler",
    session=sagemaker_session,
    timeout=15,
)
lambda_register_autopilot_model.upsert()
step_register_autopilot_model = LambdaStep(
    name="RegisterAutopilotModelStep",
    lambda_func=lambda_register_autopilot_model,
    inputs={
        "AutopilotJobName": autopilot_job_name,
        "ModelPackageName": model_package_name.default_value,
        "ModelApprovalStatus": model_approval_status.default_value,
        "InstanceType": instance_type.default_value,
        "EvalMetricsOutputS3Path": eval_metrics_output_s3_path.default_value,
        "AutopilotExecutionRoleArn": SAGEMAKER_EXECUTION_ROLE_ARN,
    },
    cache_config=cache_config,
)

## Step 4: Conditional step

In [67]:
from sagemaker.workflow.conditions import ConditionGreaterThan, ConditionGreaterThanOrEqualTo
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import Join, JsonGet
eval_metrics_output_s3_path = Join(
    on="/",
    values=["s3:/",
        BUCKET_NAME, 
        eval_metrics_output_key.default_value
    ]
)

json_metrics_path = "[0][0]['dataset_scores'][0]['f1_score']['value']"
step_conditional_registration = ConditionStep(
        name="ConditionalRegistrationStep",
        conditions=[
            ConditionGreaterThanOrEqualTo(
                left=JsonGet(
                    step_name="evaluate",
                    s3_uri= eval_metrics_output_s3_path,
                    json_path=json_metrics_path,
                ),
                right=0.1,
            )
        ],
        if_steps=[step_register_autopilot_model],
        else_steps=[],  # pipeline end
    )
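
The condition above reads one value from the metrics file and compares it with 0.1. The sketch below applies the same lookup and comparison in plain Python so the gate can be checked locally. The nested structure is inferred from the JSON path string only, and the sample values are made up for illustration; they are not results from this notebook.

```python
def passes_quality_gate(metrics, threshold: float = 0.1) -> bool:
    """Mirror the path [0][0]['dataset_scores'][0]['f1_score']['value'] >= threshold."""
    f1_value = metrics[0][0]["dataset_scores"][0]["f1_score"]["value"]
    return f1_value >= threshold


# Hypothetical metrics documents shaped to match the JSON path.
good_metrics = [[{"dataset_scores": [{"f1_score": {"value": 0.42}}]}]]
poor_metrics = [[{"dataset_scores": [{"f1_score": {"value": 0.05}}]}]]

print(passes_quality_gate(good_metrics), passes_quality_gate(poor_metrics))
```

Only when the gate passes does the pipeline run the registration step; otherwise it ends without registering a model.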

### Create and Run Inference Pipeline

In [68]:
pipeline = Pipeline(
    name="inference-pipeline",
    parameters=[
        autopilot_job_name,
        validation_dataset_s3_path,
        eval_dataset_s3_path,
        endpoint_name,
        eval_metrics_output_key,
        base_model_name,
        metrics_report_s3_path,
        model_approval_status,
        model_package_name,
        instance_type,
        output_s3_path
    ],
    steps=[
        step_preprocess_eval_dataset,
        step_evaluate_model,
        step_conditional_registration
        #add other steps
    ],
    sagemaker_session=sagemaker_session,
)
pipeline.upsert(role_arn=SAGEMAKER_EXECUTION_ROLE_ARN)
pipeline_execution = pipeline.start()
pipeline_execution.wait(delay=20, max_attempts=24 * 60 * 3)  # max wait: 24 hours

sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.Dependencies
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.IncludeLocalWorkDir
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.CustomFileFilter.IgnoreNamePatterns
sagemaker.config INFO - Applied value from config key = SageMaker.PythonSDK.Modules.RemoteFunction.InstanceType


2024-07-23 10:13:52,695 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-146025735673/inference-pipeline/preprocess_evaluation/2024-07-23-10-13-49-649/function
2024-07-23 10:13:52,835 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-east-1-146025735673/inference-pipeline/preprocess_evaluation/2024-07-23-10-13-49-649/arguments
2024-07-23 10:13:53,071 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmpdx5_029e/requirements.txt'
2024-07-23 10:13:53,099 sagemaker.remote_function INFO     Successfully uploaded dependencies and pre execution scripts to 's3://sagemaker-us-east-1-146025735673/inference-pipeline/preprocess_evaluation/2024-07-23-10-13-49-649/pre_exec_script_and_dependencies'
2024-07-23 10:13:53,276 sagemaker.remote_function INFO     Copied user workspace to '/tmp/tmp4ewb8gqo/temp_workspace/sagemaker_remote_function_workspace'
2024-07-23 10:


2024-07-23 10:13:56,379 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-146025735673/inference-pipeline/evaluate/2024-07-23-10-13-49-649/function
2024-07-23 10:13:56,444 sagemaker.remote_function INFO     Uploading serialized function arguments to s3://sagemaker-us-east-1-146025735673/inference-pipeline/evaluate/2024-07-23-10-13-49-649/arguments
2024-07-23 10:13:56,516 sagemaker.remote_function INFO     Copied dependencies file at './requirements.txt' to '/tmp/tmpyb5yebml/requirements.txt'
2024-07-23 10:13:56,551 sagemaker.remote_function INFO     Successfully uploaded dependencies and pre execution scripts to 's3://sagemaker-us-east-1-146025735673/inference-pipeline/evaluate/2024-07-23-10-13-49-649/pre_exec_script_and_dependencies'
2024-07-23 10:13:57,452 sagemaker.remote_function INFO     Uploading serialized function code to s3://sagemaker-us-east-1-146025735673/inference-pipeline/preprocess_evaluation/2024-07-23-10-13-57-452/functio