
# Train and Deploy open LLMs with Amazon SageMaker

In this SageMaker example, we are going to learn how to fine-tune open LLMs, like [Mistral](https://huggingface.co/models?other=mistral), using [QLoRA](https://arxiv.org/abs/2305.14314) and how to deploy them afterwards using the [Hugging Face LLM Inference DLC](https://huggingface.co/blog/sagemaker-huggingface-llm).

In our example, we are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft). We will also make use of new and efficient features and methods, including Flash Attention, Dataset Packing, and Mixed Precision Training.

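In detail, you will learn how to:

1. Set up the development environment
2. Load and prepare the dataset
3. Fine-tune Mistral 7B with QLoRA on Amazon SageMaker
4. Deploy the fine-tuned Mistral 7B on Amazon SageMaker
5. Stream inference requests from the deployed model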

## 1. Setup Development Environment

In [None]:
# !pip install "transformers==4.44.2" "datasets[s3]==2.18.0" "sagemaker>=2.190.0" "huggingface_hub[cli]" --upgrade --quiet

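If you want to use a personal Hugging Face token, you need to log in to your Hugging Face account so that the token can be used to access gated repositories. To download models with a private Hugging Face account, you must log in to your own account, generate a token, accept the model's terms of use, and adjust the code accordingly. This workshop will not walk everyone through that step; we will place the model in S3 instead.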

In [None]:
!pip install -U sagemaker transformers

In [None]:
# from dotenv import load_dotenv
import os

# load_dotenv(dotenv_path="../.env")
# os.getenv("HF_TOKEN")
!huggingface-cli login --token YOUR_HF_TOKEN

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


## 2. Load and prepare the dataset

We will use [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k), an open source dataset of instruction-following records on categories outlined in the [InstructGPT paper](https://arxiv.org/abs/2203.02155), including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

```python
{
  "instruction": "What is world of warcraft",
  "context": "",
  "response": "World of warcraft is a massive online multi player role playing game. It was released in 2004 by bizarre entertainment"
}
```

To load the `dolly` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.

In [None]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011


To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function, `format_dolly`, that takes a sample and returns a string in our instruction format.

In [None]:
def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt


Let's test our formatting function on a random example.
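For a deterministic illustration, the sketch below repeats the formatting logic so that it runs on its own and applies it to the World of Warcraft record shown earlier. Because that record has an empty `context`, the `### Context` section is left out of the prompt.

```python
def format_dolly(sample):
    """Same logic as the formatting function above, repeated so this block runs standalone."""
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # keep only the parts that are present and separate them with a blank line
    return "\n\n".join(part for part in [instruction, context, response] if part is not None)


# Record copied from the example shown earlier; its context field is empty.
sample = {
    "instruction": "What is world of warcraft",
    "context": "",
    "response": "World of warcraft is a massive online multi player role playing game. It was released in 2004 by bizarre entertainment",
}

prompt = format_dolly(sample)
print(prompt)
```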

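{"messages":[{"content":"You are an AWS fortune teller. You will receive the user's question and answer, and you need to reply to the user in a silly, funny, entertaining, chatty, friendly, pun-filled tone","role":"system"},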

In [None]:
# from datasets import load_dataset

# # Convert dataset to OAI messages
# system_message = """You are Llama, an AI assistant created by Philipp to be helpful and honest. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

# def create_conversation(sample):
#     if sample["messages"][0]["role"] == "system":
#         return sample
#     else:
#       sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
#       return sample

# # Load dataset from the hub
# dataset = load_dataset("HuggingFaceH4/no_robots")

# # Add system message to each conversation
# columns_to_remove = list(dataset["train"].features)
# columns_to_remove.remove("messages")
# dataset = dataset.map(create_conversation, remove_columns=columns_to_remove,batched=False)

# # Filter out conversations which are corrupted with wrong turns, keep which have even number of turns after adding system message
# dataset["train"] = dataset["train"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)
# dataset["test"] = dataset["test"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)

In [None]:
# # save train_dataset to s3 using our SageMaker session
# input_path = f's3://{sess.default_bucket()}/datasets/llama3'

# # save datasets to s3
# dataset["train"].to_json(f"{input_path}/train/dataset.json", orient="records")
# train_dataset_s3_path = f"{input_path}/train/dataset.json"
# dataset["test"].to_json(f"{input_path}/test/dataset.json", orient="records")
# test_dataset_s3_path = f"{input_path}/test/dataset.json"

# print(f"Training data uploaded to:")
# print(train_dataset_s3_path)
# print(test_dataset_s3_path)
# print(f"https://s3.console.aws.amazon.com/s3/buckets/{sess.default_bucket()}/?region={sess.boto_region_name}&prefix={input_path.split('/', 3)[-1]}/")

In [None]:
from random import randrange

print(format_dolly(dataset[randrange(len(dataset))]))

In addition to formatting our samples, we also want to pack multiple samples into one sequence to make training more efficient. This means that we stack multiple samples into one sequence and separate them with an EOS token. Packing/stacking samples can be done during training or beforehand. We will do it before training to save time. We created a utility method [pack_dataset](./scripts/utils/pack_dataset.py) that takes a dataset and a packing function and returns a packed dataset.
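To make the idea of packing concrete, here is a minimal sketch on toy token ids. `pack_sequences` is a hypothetical helper written for this illustration, not the `pack_dataset` utility itself, and the EOS id of `0` is an assumption; the real utility may treat leftover tokens differently.

```python
from itertools import chain


def pack_sequences(tokenized_samples, chunk_length):
    """Concatenate token-id lists and cut them into chunks of exactly chunk_length.

    Hypothetical helper for illustration only; returns the full chunks and
    the leftover tokens that did not fill a complete chunk.
    """
    stream = list(chain.from_iterable(tokenized_samples))
    usable = (len(stream) // chunk_length) * chunk_length
    chunks = [stream[i : i + chunk_length] for i in range(0, usable, chunk_length)]
    remainder = stream[usable:]
    return chunks, remainder


EOS = 0  # assumed EOS token id for this toy example
samples = [[5, 6, 7, EOS], [8, 9, EOS], [10, 11, 12, 13, 14, EOS]]

chunks, remainder = pack_sequences(samples, chunk_length=4)
print(chunks)     # full chunks; a sample may be split across two chunks
print(remainder)  # tokens left over after the last full chunk
```

Every chunk has the same length, so no padding tokens are needed, which is where the efficiency gain comes from.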


In [None]:
!pip install tensorflow --quiet

In [None]:
from transformers import AutoTokenizer

model_id = "stabilityai/stablelm-2-zephyr-1_6b" 
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# from transformers import AutoModelForSeq2SeqLM

# tokenizer = AutoModelForSeq2SeqLM.from_pretrained('google-t5/t5-3b', token=True)

To pack/stack our dataset we need to first tokenize it and then we can pack it with the `pack_dataset` method. To prepare our dataset we will now: 
1. Format our samples using the template method and add an EOS token at the end of each sample
2. Tokenize our dataset to convert it from text to tokens
3. Pack our dataset to 2048 tokens


In [None]:
from random import randint
# add utils method to path for loading dataset
import sys
sys.path.append("../scripts/utils") 
from pack_dataset import pack_dataset


# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample
print(dataset[randint(0, len(dataset) - 1)]["text"])  # randint is inclusive on both ends

# tokenize dataset
dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
)

# chunk dataset
lm_dataset = pack_dataset(dataset, chunk_length=2048) # We use 2048 as the maximum length for packing

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

After processing the dataset, we are going to use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload it to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

In [None]:
from random import sample

# keep a random subset of 10 packed samples
sampled_indices = sample(range(len(lm_dataset)), 10)
lm_dataset = lm_dataset.select(sampled_indices)


In [None]:
lm_dataset

In [None]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/mistral/dolly/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

## 3. Fine-Tune Mistral 7B with QLoRA on Amazon SageMaker

We are going to use the method introduced in the paper "[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)" by Tim Dettmers et al. QLoRA is a technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. The TL;DR of how QLoRA works is:

* Quantize the pretrained model to 4 bits and freeze it.
* Attach small, trainable adapter layers. (LoRA)
* Finetune only the adapter layers, while using the frozen quantized model for context.
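
To get a feel for why training only the adapters is cheap, the sketch below counts parameters for a single weight matrix. A LoRA adapter replaces the update to a `d_out x d_in` matrix with two low-rank factors, `B` (`d_out x rank`) and `A` (`rank x d_in`). The layer shape and rank used here are hypothetical values chosen for illustration, not the settings used by `run_qlora.py`.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer shape and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 64

full = d_out * d_in                                  # frozen (quantized) weights
adapter = lora_trainable_params(d_out, d_in, rank)   # trainable adapter weights

print(f"frozen weight parameters : {full:,}")
print(f"trainable LoRA parameters: {adapter:,}")
print(f"trainable fraction       : {adapter / full:.2%}")
```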

We prepared a [run_qlora.py](./scripts/run_qlora.py), which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training. That way you can use the model as a normal model without any additional code. The model will be temporarily offloaded to disk if it is too large to fit into memory.

In addition to QLoRA, we will leverage the [Flash Attention 2 integration with Transformers](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flash-attention-2) to speed up the training. Flash Attention 2 is an efficient attention mechanism that is up to 3x faster than the standard attention mechanism.

In [None]:
from huggingface_hub import HfFolder


# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'num_train_epochs': 3,                            # number of training epochs
  'per_device_train_batch_size': 1,                 # batch size for training
  'gradient_accumulation_steps': 2,                 # number of update steps to accumulate gradients over
  'gradient_checkpointing': True,                   # save memory but slower backward pass
  'fp16': True,                                     # use fp16 mixed precision training
  'learning_rate': 2e-4,                            # learning rate
  'max_grad_norm': 0.3,                             # Maximum norm (for gradient clipping)
  'warmup_ratio': 0.03,                             # warmup ratio
  "lr_scheduler_type":"constant",                   # learning rate scheduler
  'save_strategy': "epoch",                         # save strategy for checkpoints
  "logging_steps": 10,                              # log every x steps
  'merge_adapters': True,                           # whether to merge LoRA into the model (needs more memory)
  'use_flash_attn': True,                           # Whether to use Flash Attention
  'output_dir': '/tmp/run',                         # output directory, where to save assets during training
}

if HfFolder.get_token() is not None:
    hyperparameters['hf_token'] = HfFolder.get_token() # huggingface token to access gated models, e.g. llama 2
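
As a quick sanity check on these settings, the sketch below works out the effective batch size implied by the batch size and gradient accumulation values above. The device count of one is an assumption (a single instance with a single GPU), and the 2048-token sequence length is the packing length used earlier.

```python
# Values copied from the hyperparameters above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2

# Assumption: one instance with a single GPU.
num_devices = 1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print(f"effective batch size: {effective_batch_size}")

# With packed sequences of 2048 tokens, each optimizer step sees this many tokens.
chunk_length = 2048
tokens_per_step = effective_batch_size * chunk_length
print(f"tokens per optimizer step: {tokens_per_step}")
```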

In order to create a SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure for us. Amazon SageMaker takes care of starting and managing all the required EC2 instances, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then it starts the training job by running the entry point script.

> Note: Make sure that you include the `requirements.txt` in the `source_dir` if you are using a custom training script. We recommend cloning the whole repository.

In [None]:
from sagemaker.huggingface import HuggingFace

# define Training Job Name 
job_name = f'huggingface-qlora-{hyperparameters["model_id"].replace("/","-").replace(".","-")}'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_qlora.py',    # train script
    source_dir           = '../scripts',      # directory which includes all the files needed for training
    # TODO: confirm which other instance types can be used here https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-instance-types.html
    instance_type        = 'ml.p3.2xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    max_run              = 2*24*60*60,        # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.36',            # the transformers version used in the training job
    pytorch_version      = '2.1',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
    disable_output_compression = True         # do not compress output to save training time and cost
)

> You can also use `g5.2xlarge` instead of the `g5.4xlarge` instance type, but then it is not possible to use the `merge_adapters` parameter, since the model needs to fit into memory to merge the LoRA weights into the model weights. You could instead save the adapter weights and merge them using [merge_adapter_weights.py](./scripts/merge_adapter_weights.py) after training.

We can now start our training job, with the `.fit()` method passing our S3 path to the training script.

In [None]:
# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

In our example for Mistral 7B, the SageMaker training job took `13968 seconds`, which is about `3.9 hours`. The ml.g5.4xlarge instance we used costs `$2.03 per hour` for on-demand usage. As a result, the total cost for training our fine-tuned Mistral model was only ~`$8`. 
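
The cost figure follows directly from the training time and the hourly price quoted above; a short sketch of the arithmetic:

```python
training_seconds = 13968       # training time reported above
price_per_hour_usd = 2.03      # on-demand price quoted above

hours = training_seconds / 3600
cost = hours * price_per_hour_usd

print(f"{hours:.2f} hours -> ${cost:.2f}")
```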

Now let's make sure SageMaker has successfully uploaded the model to S3. We can use the `model_data` property of the estimator to get the S3 path to the model. Since we used `merge_adapters=True` and `disable_output_compression=True`, the model is stored as raw files in the S3 bucket.

In [None]:
huggingface_estimator.model_data["S3DataSource"]["S3Uri"].replace("s3://", "https://s3.console.aws.amazon.com/s3/buckets/")

### Compress the model data and register model

We need to compress the model data to register the model. The following cell may take about 10 minutes.
Reference: https://discuss.huggingface.co/t/deploy-from-s3-failed/52165/3

In [None]:
import boto3
import os
import tarfile

# Initialize the S3 client
s3 = boto3.client('s3')

# Get the S3 path from huggingface_estimator
s3_uri = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

# Parse the S3 bucket and prefix
s3_path = s3_uri.replace("s3://", "")
bucket_name, prefix = s3_path.split('/', 1)

# Local directory name
local_dir = '/tmp/model'

# Create the local directory
os.makedirs(local_dir, exist_ok=True)

# List all files under the S3 prefix
objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix)

# Download each file to the local directory
for obj in objects.get('Contents', []):
    file_key = obj['Key']
    if file_key.endswith('/'):
        continue  # skip folder placeholder keys, which have no file name
    file_name = os.path.join(local_dir, file_key.split('/')[-1])
    s3.download_file(bucket_name, file_key, file_name)

# Compress the downloaded files into a .tar.gz
# (arcname='.' keeps the files at the archive root, so they extract directly into /opt/ml/model)
tar_path = '/tmp/model.tar.gz'
with tarfile.open(tar_path, 'w:gz') as tar:
    tar.add(local_dir, arcname='.')

print(f"Model files are compressed into {tar_path}")

# Upload the .tar.gz file back to S3
upload_key = prefix.rstrip('/') + '/model.tar.gz'
s3.upload_file(tar_path, bucket_name, upload_key)

# Build the S3 URI and convert it to an AWS console URL
s3_url = f"s3://{bucket_name}/{upload_key}"
console_url = s3_url.replace("s3://", "https://s3.console.aws.amazon.com/s3/buckets/")

print(f"File uploaded to S3: {console_url}")
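
SageMaker extracts `model.tar.gz` into `/opt/ml/model`, so the layout inside the archive matters. The sketch below uses only the standard library to build an archive with the files at its root and then lists the members, which is a cheap check to run before uploading. `make_model_archive` and the stand-in file names are hypothetical and exist only for this illustration.

```python
import os
import tarfile
import tempfile


def make_model_archive(model_dir, tar_path):
    """Write every file in model_dir to a .tar.gz with no leading folder in the archive."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            tar.add(os.path.join(model_dir, name), arcname=name)
    return tar_path


# Stand-in files for illustration; a real model directory holds the trained artifacts.
work_dir = tempfile.mkdtemp()
model_dir = os.path.join(work_dir, "model")
os.makedirs(model_dir)
for name in ("config.json", "tokenizer.json"):
    with open(os.path.join(model_dir, name), "w") as f:
        f.write("{}")

archive = make_model_archive(model_dir, os.path.join(work_dir, "model.tar.gz"))

# List the archive members to confirm there is no extra top-level folder.
with tarfile.open(archive) as tar:
    members = tar.getnames()
print(members)
```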


### Create Model Group and register version

In [None]:
import boto3
import hashlib

# Initialize the SageMaker client
sagemaker_client = boto3.client('sagemaker')

# Get the model name from huggingface_estimator (here, the training job name)
model_name = huggingface_estimator.latest_training_job.name if huggingface_estimator.latest_training_job else "default-model"

# Shorten the name with a hash, or truncate it, so that it does not exceed 63 characters
if len(model_name) > 50:
    model_name_hash = hashlib.sha256(model_name.encode()).hexdigest()[:8]
    model_group_name = f"{model_name[:45]}-{model_name_hash}-group"
else:
    model_group_name = f"{model_name}-group"

# Make sure model_group_name is no longer than 63 characters
model_group_name = model_group_name[:63]

# Create the Model Group
response = sagemaker_client.create_model_package_group(
    ModelPackageGroupName=model_group_name,
    ModelPackageGroupDescription=f'This group contains versions of the model: {model_name}.'
)

print(f"Model Group ARN: {response['ModelPackageGroupArn']}")
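
The name-shortening logic can be tested on its own, without calling AWS. The sketch below is a slightly generalized variant of the cell above rather than a copy of it: `shorten_group_name` is a hypothetical helper that computes how many characters to keep from the 63-character limit instead of using fixed cut-offs.

```python
import hashlib

MAX_NAME_LENGTH = 63  # length limit used in the cell above


def shorten_group_name(model_name, suffix="-group"):
    """Return model_name + suffix, shortened with a hash when it would be too long."""
    candidate = f"{model_name}{suffix}"
    if len(candidate) <= MAX_NAME_LENGTH:
        return candidate
    # a short hash of the full name keeps truncated names distinguishable
    digest = hashlib.sha256(model_name.encode()).hexdigest()[:8]
    keep = MAX_NAME_LENGTH - len(suffix) - len(digest) - 1  # 1 for the joining hyphen
    return f"{model_name[:keep]}-{digest}{suffix}"


short_name = shorten_group_name("huggingface-qlora-demo")
long_name = shorten_group_name("huggingface-qlora-" + "x" * 80)

print(short_name, len(short_name))
print(long_name, len(long_name))
```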


In [None]:
from sagemaker import image_uris

# S3 path of the model, i.e. the model.tar.gz file you already uploaded
model_data_url = s3_url

# Model Group name
model_package_group_name = model_group_name

# Use the PyTorch inference container provided by SageMaker
inference_image = image_uris.retrieve(
    framework="pytorch",  # use PyTorch
    region=sess.boto_session.region_name,
    version="1.9.0",  # 1.9.0 is used here; choose the version that matches the PyTorch version used for training
    image_scope="inference",
    py_version="py38",
    instance_type="ml.m5.large"
)

# Register the model version
response = sagemaker_client.create_model_package(
    ModelPackageGroupName=model_package_group_name,
    ModelPackageDescription='Model version for stablelm-2-zephyr-1_6b',  # optional description
    InferenceSpecification={
        'Containers': [
            {
                'Image': inference_image,
                'ModelDataUrl': model_data_url,
            }
        ],
        'SupportedContentTypes': ['application/json'],
        'SupportedResponseMIMETypes': ['application/json'],
    },
    ModelApprovalStatus='PendingManualApproval',  # or 'Approved' if you do not need manual approval
)

model_package_arn = response['ModelPackageArn']
print(f"Model Package ARN: {model_package_arn}")

You should see a similar folder structure and files in your S3 bucket:

![S3 Bucket](../assets/s3.png)

Now, let's deploy our model to an endpoint. 🚀

## 4. Deploy Fine-tuned Mistral 7B on Amazon SageMaker

We are going to use the [Hugging Face LLM Inference DLC](https://huggingface.co/blog/sagemaker-huggingface-llm#what-is-hugging-face-llm-inference-dlc), a purpose-built inference container to easily deploy LLMs in a secure and managed environment. The DLC is powered by [Text Generation Inference (TGI)](https://huggingface.co/docs/text-generation-inference/index), a solution for deploying and serving Large Language Models (LLMs).

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our `HuggingFaceModel` model class with an `image_uri` pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the `sagemaker` SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers).

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="1.4.5",
  session=sess,
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

We can now create a `HuggingFaceModel` using the container URI and the S3 path to our model. We also need to set our TGI configuration, including the number of GPUs and the maximum number of input tokens. You can find a full list of configuration options [here](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher).

## (Optional) Deploy Serverless Inference Endpoint
If you prefer to deploy a serverless endpoint, uncomment and run the following cells, which demonstrate the serverless inference endpoint deployment. You may skip this section if you'd prefer to deploy an on-demand inference endpoint.

In [None]:
from sagemaker import ModelPackage
from sagemaker.serverless import ServerlessInferenceConfig

# ARN of the model package, taken from the Model Registry
model_package_arn = model_package_arn

# Create the ModelPackage object
model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=sess,
    env={  # add environment variables
        'HF_MODEL_ID': '',  # set to an empty string or a placeholder
        'SM_NUM_GPUS': '0',  # Serverless does not support GPUs, so set to 0
        'MAX_INPUT_LENGTH': '1024',
        'MAX_TOTAL_TOKENS': '2048'
    }
)

# Serverless Inference configuration
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10
)

# Deploy the model to Serverless Inference
predictor = model.deploy(
    serverless_inference_config=serverless_config,
    container_startup_health_check_timeout=300
)


In [None]:
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

# Get the model data location
model_data_url = s3_url

# Model configuration
config = {
    'HF_MODEL_ID': 'stabilityai/stablelm-2-zephyr-1_6b',  # set the correct model ID
    'SM_NUM_GPUS': '0',  # Serverless does not support GPUs, so set to 0
    'MAX_INPUT_LENGTH': '1024',
    'MAX_TOTAL_TOKENS': '2048'
}

# Create the Hugging Face model
llm_model = HuggingFaceModel(
    role=role,  # IAM execution role defined in the setup section
    model_data=model_data_url,
    env=config  # set environment variables
)

# Deploy the Serverless Inference endpoint
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,  # memory size
    max_concurrency=10
)

llm_model.deploy(
    serverless_inference_config=serverless_config,
    container_startup_health_check_timeout=300
)


## Real-time Inference Endpoint

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel # API reference: https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html

# S3 path where the trained model is stored
# if you deploy the model at a later time, add the S3 path here
model_s3_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data={'S3DataSource':{'S3Uri': model_s3_path,'S3DataType': 'S3Prefix','CompressionType': 'None'}}, # uncompressed model files under an S3 prefix
  env=config
)
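
Container environment variables are passed as strings, which is why the cell above wraps the numeric values in `json.dumps`. Since `MAX_TOTAL_TOKENS` covers the input plus the generated tokens, the input limit has to leave room for generation. The sketch below shows one way to build and check such a configuration; `build_tgi_env` is a hypothetical helper for illustration, not part of the `sagemaker` SDK.

```python
import json


def build_tgi_env(model_path, num_gpus, max_input_length, max_total_tokens):
    """Build the environment dict, checking that the token limits are consistent."""
    if max_input_length >= max_total_tokens:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    return {
        "HF_MODEL_ID": model_path,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }


env = build_tgi_env("/opt/ml/model", num_gpus=1, max_input_length=1024, max_total_tokens=2048)
print(env)

# Tokens left for generation when the input uses its full budget.
generation_budget = 2048 - 1024
print("tokens left for generation:", generation_budget)
```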

After we have created the HuggingFaceModel we can deploy it to Amazon SageMaker using the deploy method.

In [None]:

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 300 seconds (5 minutes) to be able to load the model
)

SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.


## 5. Stream Inference Requests from the Deployed Model

[Amazon SageMaker supports streaming responses](https://aws.amazon.com/de/blogs/machine-learning/elevating-the-generative-ai-experience-introducing-streaming-support-in-amazon-sagemaker-hosting/) from your model. We can leverage this to create a streaming Gradio application with a better user experience.

We created a sample application that you can use to test your model. You can find the code in [sagemaker_chat.py](../demo/sagemaker_chat.py). The application will stream the responses from the model and display them in the UI. You can also use the application to test your model with your own inputs.

In [None]:
!pip install gradio

In [None]:
import gradio as gr
import boto3
import json
import io

# hyperparameters for llm
parameters = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:", "<|endoftext|>", " User:", "###"],
}

system_prompt = "You are a helpful Assistant, called Falcon. Knowing everything about AWS."


# Helper for reading lines from a stream
class LineIterator:
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                return line[:-1]
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    continue
                raise
            if "PayloadPart" not in chunk:
                print(f"Unknown event type: {chunk}")
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])


# helper method to format prompt
def format_prompt(message, history, system_prompt):
    prompt = ""
    if system_prompt:
        prompt += f"System: {system_prompt}\n"
    for user_prompt, bot_response in history:
        prompt += f"User: {user_prompt}\n"
        prompt += f"Falcon: {bot_response}\n"  # Response already contains "Falcon: "
    prompt += f"""User: {message}
Falcon:"""
    return prompt


def create_gradio_app(
    endpoint_name,
    session=boto3,
    parameters=parameters,
    system_prompt=system_prompt,
    format_prompt=format_prompt,
    concurrency_count=4,
    share=True,
):
    smr = session.client("sagemaker-runtime")

    def generate(
        prompt,
        history,
    ):
        formatted_prompt = format_prompt(prompt, history, system_prompt)

        request = {"inputs": formatted_prompt, "parameters": parameters, "stream": True}
        resp = smr.invoke_endpoint_with_response_stream(
            EndpointName=endpoint_name,
            Body=json.dumps(request),
            ContentType="application/json",
        )

        output = ""
        for c in LineIterator(resp["Body"]):
            c = c.decode("utf-8")
            if c.startswith("data:"):
                chunk = json.loads(c[len("data:"):].strip())
                if chunk["token"]["special"]:
                    continue
                if chunk["token"]["text"] in request["parameters"]["stop"]:
                    break
                output += chunk["token"]["text"]
                for stop_str in request["parameters"]["stop"]:
                    if output.endswith(stop_str):
                        output = output[: -len(stop_str)]
                        output = output.rstrip()
                        yield output

                yield output
        return output

    demo = gr.ChatInterface(generate, title="Chat with Amazon SageMaker", chatbot=gr.Chatbot(layout="panel"))

    demo.queue(concurrency_count=concurrency_count).launch(share=share)
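
The streaming logic can be exercised without a live endpoint. The sketch below feeds a fake event stream through a simplified line splitter and decodes the `data:` lines. The event shape (`PayloadPart` / `Bytes`) and the token payload format are assumptions taken from the code above; `iter_lines`, `token_texts`, and the fake events are hypothetical names and data for illustration.

```python
import json


def iter_lines(events):
    """Yield complete newline-terminated lines from a stream of PayloadPart events."""
    buffer = b""
    for event in events:
        if "PayloadPart" not in event:
            continue  # ignore events that carry no payload bytes
        buffer += event["PayloadPart"]["Bytes"]
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line
    if buffer:
        yield buffer  # trailing data without a final newline


def token_texts(events):
    """Decode `data:` lines and yield the text of each non-special token."""
    for raw in iter_lines(events):
        line = raw.decode("utf-8")
        if not line.startswith("data:"):
            continue
        chunk = json.loads(line[len("data:"):])
        if not chunk["token"]["special"]:
            yield chunk["token"]["text"]


# Fake event stream: the second JSON message is split across two events on purpose.
fake_events = [
    {"PayloadPart": {"Bytes": b'data:{"token": {"text": "Hello", "special": false}}\n\ndata:{"tok'}},
    {"PayloadPart": {"Bytes": b'en": {"text": " world", "special": false}}\n\n'}},
    {"PayloadPart": {"Bytes": b'data:{"token": {"text": "</s>", "special": true}}\n'}},
]

streamed_text = "".join(token_texts(fake_events))
print(streamed_text)
```

Buffering until a newline arrives is what lets the parser cope with a JSON message that is split across two network events.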


In [None]:
# add demo directory to path ../demo/
import sys
sys.path.append("../demo") 
# from sagemaker_chat import create_gradio_app

# hyperparameters for llm
parameters = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
    "stop": ["###", "</s>"],
}

# define format function for our input
def format_prompt(message, history, system_prompt):
    prompt = ""
    for user_prompt, bot_response in history:
        prompt += f"### Instruction\n{user_prompt}\n\n"
        prompt += f"### Answer\n{bot_response}\n\n"
    prompt += f"### Instruction\n{message}\n\n### Answer\n"
    return prompt

# create gradio app
create_gradio_app(
    llm.endpoint_name,           # Sagemaker endpoint name
    session=sess.boto_session,   # boto3 session used to send request 
    parameters=parameters,       # Request parameters
    system_prompt=None,          # System prompt to use -> Mistral does not support system prompts
    format_prompt=format_prompt, # Function to format prompt
    # concurrency_count=4,         # Number of concurrent requests
    share=True,                  # Share app publicly
)

# the app is launched inside create_gradio_app()
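
To see what the endpoint actually receives, the sketch below repeats the instruction-style `format_prompt` so that it runs on its own and prints the prompt for a short conversation. The conversation turns are made up for illustration.

```python
def format_prompt(message, history, system_prompt=None):
    """Same instruction template as the cell above; system_prompt is accepted but unused."""
    prompt = ""
    for user_prompt, bot_response in history:
        prompt += f"### Instruction\n{user_prompt}\n\n"
        prompt += f"### Answer\n{bot_response}\n\n"
    # end with an open "### Answer" section for the model to complete
    prompt += f"### Instruction\n{message}\n\n### Answer\n"
    return prompt


# Made-up conversation turns, for illustration only.
history = [("What is Amazon S3?", "An object storage service.")]
prompt = format_prompt("What is Amazon SageMaker?", history)
print(prompt)
```

The prompt uses the same `### Instruction` / `### Answer` layout as the training data, and `###` is listed as a stop sequence so generation ends before the model starts a new section.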

![gradio](../assets/gradio.png)

Don't forget to delete the endpoint after you are done with the example. 

In [None]:
llm.delete_model()
llm.delete_endpoint()