# Deploy Finetuned Mistral on Amazon SageMaker

Mistral 7B is the open LLM from Mistral AI.

This sample is adapted from the following DJL documentation page:
https://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/vllm_deploy_mistral_7b.html

To deploy your fine-tuned version of Mistral, you only need to provide the location of the new model.

This notebook has been tested on an Amazon SageMaker notebook instance with a single GPU (ml.g5.2xlarge).

The deployment has been tested on an Amazon SageMaker real-time inference endpoint with a single GPU (ml.g5.2xlarge).

## Setup development environment

In [7]:
# TODO: update when container is added to sagemaker sdk
!pip install sagemaker huggingface_hub jinja2 --upgrade --quiet

In [6]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"sagemaker version: {sagemaker.__version__}")


sagemaker role arn: arn:aws:iam::70768*******:role/service-role/AmazonSageMaker-ExecutionRole-20191024T163188
sagemaker session region: us-east-1
sagemaker version: 2.209.0


In [8]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
instance_type = 'ml.g5.2xlarge'  # instance type used for the deployment

## Download the model and upload to s3

We recommend first saving the model to an S3 location and providing the S3 URL in the serving.properties file. This allows faster download times.


<span style="color:red">*If you have already downloaded the model to the notebook, please ignore this step*</span>.

In [4]:
# from huggingface_hub import snapshot_download
# from pathlib import Path
# import os

# # - This will download the model into the current directory, wherever the Jupyter notebook is running
# local_model_path = Path(".")
# local_model_path.mkdir(exist_ok=True)

# # Only download pytorch checkpoint files
# allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model", "*.safetensors"]

# # - Use snapshot_download to download the model, since the model is stored in the repository using LFS
# model_download_path = snapshot_download(
#     repo_id=model_id,
#     cache_dir=local_model_path,
#     allow_patterns=allow_patterns,
# )


#### If you have run the Local_Finetune_Mistral.ipynb notebook, the model artifact should be in Mistral-Finetuned-Merged

#### If you have run the Finetune_Mistral_7B_on_Amazon_SageMaker.ipynb notebook, the model artifact should be in ./results/training_job, downloaded from the SageMaker training job


In [17]:
model_download_path = "Mistral-Finetuned-Merged"
#model_download_path = "./results/training_job"
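Since the upload in the next steps takes a while, it may be worth confirming that the local model directory looks complete first. The following is a minimal sketch using only the standard library; the helper name `summarize_model_dir` and the choice of `config.json` as the marker file are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def summarize_model_dir(model_dir: str) -> dict:
    """Return the file count, total size in bytes, and whether config.json exists."""
    root = Path(model_dir)
    # a missing directory is reported as empty instead of raising
    files = [p for p in root.rglob("*") if p.is_file()] if root.is_dir() else []
    return {
        "num_files": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        "has_config": (root / "config.json").is_file(),
    }


# "Mistral-Finetuned-Merged" mirrors model_download_path in the cell above
summary = summarize_model_dir("Mistral-Finetuned-Merged")
print(f"{summary['num_files']} files, {summary['total_bytes'] / 1e9:.2f} GB, "
      f"config.json present: {summary['has_config']}")
```

A file count of zero or a missing `config.json` would suggest that `model_download_path` points at the wrong folder.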


Define where the model should be uploaded in S3.

In [15]:
# define a variable to contain the s3url of the location that has the model
s3_model_prefix = f"{model_id}/lmi"  # folder within bucket where model artifact will go
pretrained_model_location = f"s3://{sagemaker_session_bucket}/{s3_model_prefix}/"
print(f"Pretrained model will be uploaded to ---- > {pretrained_model_location}")

Pretrained model will be uploaded to ---- > s3://sagemaker-us-east-1-70768*******/mistralai/Mistral-7B-Instruct-v0.1/lmi/


We upload the model files to the S3 bucket. Please be patient: this takes several (around 20) minutes.

In [16]:
model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {model_artifact}")
print(f"We will set option.model_id={model_artifact}")

Model uploaded to --- > s3://sagemaker-us-east-1-70768*******/mistralai/Mistral-7B-Instruct-v0.1/lmi
We will set option.model_id=s3://sagemaker-us-east-1-70768*******/mistralai/Mistral-7B-Instruct-v0.1/lmi


## Choose an inference image

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. Here we show how to use vLLM to host the model.

In [18]:
import sagemaker 

inference_image_uri = sagemaker.image_uris.retrieve(
        framework="djl-deepspeed",
        region=sess.boto_region_name,
        version="0.26.0"
    )

#inference_image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-deepspeed0.12.6-cu121"
print(f"Docker image: {inference_image_uri}")

Docker image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.25.0-deepspeed0.11.0-cu118
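The printed URI above carries a `0.25.0` tag while the call requests version `0.26.0`, which may mean the output was captured from an earlier run. A small check like the one below can make such a mismatch visible. It is a sketch that assumes the standard `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` layout; the helper names are hypothetical.

```python
import re

# Assumed ECR image URI layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
_ECR_URI = re.compile(
    r"^(?P<account>\d+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_image_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository and tag."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a recognised ECR image URI: {uri}")
    return match.groupdict()


def tag_matches_version(uri: str, version: str) -> bool:
    """True when the image tag equals the version or starts with '<version>-'."""
    tag = parse_image_uri(uri)["tag"]
    return tag == version or tag.startswith(version + "-")


uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.25.0-deepspeed0.11.0-cu118"
print(parse_image_uri(uri)["tag"])
print(tag_matches_version(uri, "0.26.0"))
```

In the notebook, `uri` would be `inference_image_uri` and the version would be the value passed to `sagemaker.image_uris.retrieve`.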


### Prepare serving.properties file

In [19]:
code_folder = "code_vllm"
#pretrained_model_location = "TheBloke/mixtral-8x7b-v0.1-AWQ"

In [20]:
from pathlib import Path
code_path = Path(code_folder)
code_path.mkdir(exist_ok=True)

[Large Model Inference Configurations](https://docs.djl.ai/docs/serving/serving/docs/lmi/configurations_large_model_inference_containers.html)

mistralai/Mistral-7B-Instruct-v0.1

In [21]:
%%writefile ./{code_folder}/serving.properties
engine = Python
option.model_id = {{s3url}}
option.dtype=fp16
option.tensor_parallel_degree = 1
option.output_formatter = json
option.task=text-generation
option.model_loading_timeout = 1200
option.rolling_batch=vllm
option.device_map=auto
option.max_model_len=2048

Writing ./code_vllm/serving.properties


In [22]:
import jinja2
jinja_env = jinja2.Environment()
# plug the S3 model location into the `serving.properties` template
properties_path = Path(code_folder) / "serving.properties"
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3url=pretrained_model_location))
!pygmentize {code_folder}/serving.properties | cat -n

     1	engine = Python
     2	option.model_id = s3://sagemaker-us-east-1-70768*******/mistralai/Mistral-7B-Instruct-v0.1/lmi/
     3	option.dtype=fp16
     4	option.tensor_parallel_degree = 1
     5	option.output_formatter = json
     6	option.task=text-generation
     7	option.model_loading_timeout = 600
     8	option.rolling_batch=vllm
     9	option.device_map=auto
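Because the file is produced by template rendering, a quick sanity check can confirm that the `{{s3url}}` placeholder was actually replaced. The sketch below parses only the simple `key = value` subset used in this notebook, not the full Java properties syntax; `parse_properties` and `check_rendered` are hypothetical helper names.

```python
def parse_properties(text: str) -> dict:
    """Parse simple `key = value` lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Malformed line: {line!r}")
        props[key.strip()] = value.strip()
    return props


def check_rendered(props: dict) -> list:
    """Return a list of problems found in a rendered serving.properties."""
    problems = []
    model_id = props.get("option.model_id", "")
    if not model_id:
        problems.append("option.model_id is missing")
    elif "{{" in model_id or "}}" in model_id:
        problems.append("option.model_id still contains a template placeholder")
    return problems


sample = "engine = Python\noption.model_id = {{s3url}}\noption.dtype=fp16\n"
print(check_rendered(parse_properties(sample)))
```

In the notebook, the text to check would come from `Path(code_folder, "serving.properties").read_text()`; an empty list means no problem was detected by these two checks.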


### Create tarball and upload to S3


In [23]:
!rm -f model.tar.gz
!rm -rf {code_folder}/.ipynb_checkpoints
!tar czvf model.tar.gz -C {code_folder} .

./
./serving.properties
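The same archive can also be built from Python, which avoids relying on shell tools and makes it easy to list what ended up in the tarball. This is a sketch using the standard `tarfile` module; the function name `build_model_tarball` is hypothetical, and it assumes the output file lives outside the code folder.

```python
import tarfile
from pathlib import Path


def build_model_tarball(code_dir: str, output: str = "model.tar.gz") -> list:
    """Pack the files under code_dir at the archive root and return member names."""
    root = Path(code_dir)
    with tarfile.open(output, "w:gz") as tar:
        for path in sorted(root.rglob("*")):
            relative = path.relative_to(root)
            # skip notebook checkpoint folders, as the `rm -rf` in the cell above does
            if ".ipynb_checkpoints" in relative.parts:
                continue
            if path.is_file():
                tar.add(path, arcname=relative.as_posix())
    # reopen the archive to report what was actually written
    with tarfile.open(output, "r:gz") as tar:
        return tar.getnames()
```

Calling `build_model_tarball(code_folder)` in this notebook would be expected to return a list containing `serving.properties`, since that is the only file written to the folder.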


In [24]:
s3_code_prefix = f"{model_id}/{code_folder}"
s3_code_artifact = sess.upload_data("model.tar.gz", sagemaker_session_bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-70768*******/mistralai/Mistral-7B-Instruct-v0.1/code_vllm/model.tar.gz


## Create the endpoint

In [25]:
## Deploy the fine-tuned Mistral 7B model to Amazon SageMaker

import json
from sagemaker import Model
from datetime import datetime

# sagemaker config
print(model_id)
print(instance_type)
health_check_timeout = 1200

timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

# create a SageMaker Model with the image uri
llm_model = Model(
  role=role,
  name=f"vllm-{model_id.replace('/', '-').lower().replace('.', '')}-{timestamp}",
  model_data=s3_code_artifact, 
  image_uri=inference_image_uri,
)


mistralai/Mistral-7B-Instruct-v0.1
ml.g5.2xlarge
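The model name above is derived from `model_id` by replacing `/` and removing `.`. A slightly more defensive variant is sketched below. It assumes that SageMaker resource names may contain only letters, digits and hyphens and are limited to 63 characters; both the limit constant and the helper name `make_resource_name` are assumptions for illustration.

```python
import re

MAX_NAME_LENGTH = 63  # assumed SageMaker limit for model and endpoint names


def make_resource_name(prefix: str, model_id: str, timestamp: str) -> str:
    """Build a name made of letters, digits and single hyphens."""
    # drop dots, then collapse every other run of invalid characters into one hyphen
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", model_id.lower().replace(".", "")).strip("-")
    head = f"{prefix}-{slug}"
    # keep the timestamp intact and trim the model part if the name is too long
    room = MAX_NAME_LENGTH - len(timestamp) - 1
    head = head[:room].rstrip("-")
    return f"{head}-{timestamp}"


name = make_resource_name("vllm", "mistralai/Mistral-7B-Instruct-v0.1", "2024-02-28-07-04-46")
print(name, len(name))
```

Trimming the model part instead of the timestamp keeps names unique across repeated deployments of the same model.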


After creating the `Model`, we can deploy it to Amazon SageMaker using the `deploy` method. We deploy the model on the `ml.g5.2xlarge` instance type set in `instance_type`. Because `option.tensor_parallel_degree` is 1, vLLM serves the model on that instance's single GPU.

In [26]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 20 minutes to be able to load the model
)


-----------!

In [27]:
endpoint_name = llm_model.endpoint_name
print(endpoint_name)

vllm-mistralai-mistral-7b-instruct-v01--2024-02-28-07-04-46-288


## Run inference and chat with the model

Below is one example of a prompt and the expected response.

[INST] Below is the question based on the context. Question: Given a reference text about Lollapalooza, where does it take place, who started it and what is it?. Below is the given the context Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an annual American four-day music festival held in Grant Park in Chicago. It originally started as a touring event in 1991, but several years later, Chicago became its permanent location. Music genres include but are not limited to alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. Lollapalooza has also featured visual arts, nonprofit organizations, and political organizations. The festival, held in Grant Park, hosts an estimated 400,000 people each July and sells out annually. Lollapalooza is one of the largest and most iconic music festivals in the world and one of the longest-running in the United States.
Lollapalooza was conceived and created in 1991 as a farewell tour by Perry Farrell, singer of the group Jane's Addiction.. Write a response that appropriately completes the request.[/INST]

#### Dataset ground truth:
Lollapalooze is an annual musical festival held in Grant Park in Chicago, Illinois. It was started in 1991 as a farewell tour by Perry Farrell, singe of the group Jane's Addiction. The festival includes an array of musical genres including alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. The festivals welcomes an estimated 400,000 people each year and sells out annually. Some notable headliners include: the Red Hot Chili Peppers, Chance the Rapper, Metallica, and Lady Gage. Lollapalooza is one of the largest and most iconic festivals in the world and a staple of Chicago.</s>
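The example prompt wraps a question and its context in `[INST] ... [/INST]` markers. A helper along the following lines could assemble such prompts for other questions. It is a sketch that copies the wording of the example verbatim; the function name is hypothetical, and whether the serving stack adds the leading `<s>` token on its own has not been verified here.

```python
def build_instruct_prompt(question: str, context: str) -> str:
    """Wrap a question and its context in the [INST] ... [/INST] markers used above."""
    instruction = (
        "Below is the question based on the context. "
        f"Question: {question}. "
        f"Below is the given the context {context}. "
        "Write a response that appropriately completes the request."
    )
    # the leading <s> mirrors the prompt string used in the inference cell of this notebook
    return f"<s>[INST] {instruction}[/INST]"


example = build_instruct_prompt(
    "Where does Lollapalooza take place?",
    "Lollapalooza is an annual music festival held in Grant Park in Chicago",
)
print(example)
```

Keeping the inference prompt identical in structure to the fine-tuning data is the point of the helper, which is why the instruction text is not reworded.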

In [28]:
import logging

import boto3
import json

# Create a Boto3 client for SageMaker Runtime
sagemaker_client = boto3.client("sagemaker-runtime")

max_tokens_to_sample = 200
 
# Define the prompt and other parameters
prompt = """
<s>[INST] Below is the question based on the context. 
Question: Given a reference text about Lollapalooza, where does it take place, who started it and what is it?. 
Below is the given the context Lollapalooza /ˌlɒləpəˈluːzə/ (Lolla) is an annual American four-day music festival held in Grant Park in Chicago. 
It originally started as a touring event in 1991, but several years later, Chicago became its permanent location. Music genres include but are not limited to alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. Lollapalooza has also featured visual arts, nonprofit organizations, and political organizations. 
The festival, held in Grant Park, hosts an estimated 400,000 people each July and sells out annually. Lollapalooza is one of the largest and most iconic music festivals in the world and one of the longest-running in the United States. Lollapalooza was conceived and created in 1991 as a farewell tour by Perry Farrell, singer of the group Jane's Addiction.. 
Write a response that appropriately completes the request.[/INST]
"""

# hyperparameters for llm
parameters = {
    "max_new_tokens": max_tokens_to_sample,
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.5,
}

contentType = 'application/json'

body = json.dumps({
    "inputs": prompt,
    # specify the parameters as needed
    "parameters": parameters
})



In [29]:
response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name, Body=body, ContentType=contentType)

# Process the response
response_body = json.loads(response.get('Body').read())

In [30]:
print(response_body['generated_text'])

Response: Lollapalooze is an annual musical festival held in Grant Park in Chicago, Illinois. It was started in 1991 as a farewell tour by Perry Farrell, singe of the group Jane's Addiction. The festival includes an array of musical genres including alternative rock, heavy metal, punk rock, hip hop, and electronic dance music. The festivals welcomes an estimated 400,000 people each year and sells out annually. Some notable headliners include: the Red Hot Chili Peppers, Chance the Rapper, Metallica, and Lady Gage. Lollapalooza is one of the largest and most iconic festivals in the world and a staple of Chicago.
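The cell above reads `generated_text` directly from the decoded response. If the response shape is ever different, for example a single-element list as some other serving containers return, the lookup fails with a bare `KeyError` or `TypeError`. The sketch below shows a more defensive parser; the list-wrapped case is an assumption about other containers, not something observed in this notebook.

```python
import json


def extract_generated_text(raw_body: bytes) -> str:
    """Decode an endpoint response body and return its generated text."""
    payload = json.loads(raw_body)
    # assumption: some containers wrap the result in a single-element list
    if isinstance(payload, list):
        if not payload:
            raise ValueError("Empty response list")
        payload = payload[0]
    if not isinstance(payload, dict) or "generated_text" not in payload:
        raise ValueError(f"Unexpected response shape: {payload!r}")
    return payload["generated_text"].strip()


print(extract_generated_text(b'{"generated_text": " Hello from the endpoint. "}'))
```

In the notebook this would be called as `extract_generated_text(response["Body"].read())`.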


## Clean up


In [None]:
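# Model.deploy() returns None unless predictor_cls is set, so clean up through the session and the Model object
#sess.delete_endpoint(endpoint_name)
#llm_model.delete_model()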