# Deploying an instruction-tuned LLM (Falcon 7B Instruct) on Amazon SageMaker

In this tutorial, we deploy a recently released open-source large language model:

1. Falcon 7B Instruct. This is a 7B Falcon model whose license was recently changed to the permissive Apache 2.0. [HF link](https://huggingface.co/tiiuae/falcon-7b-instruct)

This notebook was tested on SageMaker Studio with the Data Science 2.0 kernel and is meant to deploy to an `ml.g4dn.xlarge` instance in the `ap-southeast-1` region.

## Setup of IAM role
We use the default SageMaker Studio execution role as the model execution role. Outside of Studio, the cell falls back to looking up an IAM role named `sagemaker_execution_role`. Change this if you have a specific role.

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

## (Optional) Download and packaging of model code
To run this example, we need to upgrade certain packages and replace the `inference.py` file used by the HuggingFace container. This requires packaging the model and the custom code into a single compressed archive (`model.tar.gz`). To speed up the tutorial, you can download the pre-packaged archive from the link I provide. If you want to package it yourself, uncomment the cells below to install the necessary packages, download the original HF files, write the additional files, and archive the folder.

### Download HF Files

In [None]:
# ## Install HuggingFace Hub
# !pip install huggingface_hub

In [None]:
# ## Download Falcon 7B Instruct
# from huggingface_hub import snapshot_download
# from pathlib import Path
# import shutil
# model_id = "tiiuae/falcon-7b-instruct"
# model_tar_dir = Path(model_id.split("/")[-1])
# if model_tar_dir.exists():
#     shutil.rmtree(str(model_tar_dir))
# model_tar_dir.mkdir(exist_ok=True)
# ignore_patterns = ["*.msgpack", "*.h5","*weight.bin"]
# # Download model from Hugging Face into model_dir
# snapshot_download(model_id, local_dir=str(model_tar_dir), local_dir_use_symlinks=False,ignore_patterns=ignore_patterns)

### Add inference.py and requirements.txt

In [None]:
# ## Create code directory
# code_path = Path("./falcon-7b-instruct/code")
# code_path.mkdir(exist_ok=True)

In [None]:
# %%writefile falcon-7b-instruct/code/inference.py
# import torch
# from transformers import pipeline, AutoTokenizer

# def model_fn(model_dir):
#     tokenizer = AutoTokenizer.from_pretrained(model_dir)
#     instruct_pipeline = pipeline(
#         'text-generation',
#         model=model_dir,
#         torch_dtype=torch.bfloat16,
#         trust_remote_code=True,
#         device_map="auto",
#         tokenizer=tokenizer,
#         model_kwargs={"load_in_8bit": True},
#     )

#     return instruct_pipeline

In [None]:
# %%writefile falcon-7b-instruct/code/requirements.txt
# transformers==4.27.4
# accelerate==0.18.0
# bitsandbytes==0.38.1
# einops

### Package into a tar.gz archive and upload to S3

In [None]:
# !apt-get update && apt-get -y install pigz

In [None]:
# import os
# parent_dir=os.getcwd()

# # change to model dir
# os.chdir("falcon-7b-instruct")
# # use pigz for faster and parallel compression
# !tar -cf model.tar.gz --use-compress-program=pigz *
# # change back to parent dir
# os.chdir(parent_dir)
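
Before uploading, it can help to confirm that the archive has the layout written by the cells above, with `code/inference.py` and `code/requirements.txt` at the archive root. The following is a minimal sketch under that assumption; `missing_members` and `REQUIRED_MEMBERS` are hypothetical names introduced here for illustration and are not part of the SageMaker SDK.

```python
import tarfile

# Assumption: these are the custom-code entries created earlier in this notebook.
REQUIRED_MEMBERS = ("code/inference.py", "code/requirements.txt")


def missing_members(archive_path, required=REQUIRED_MEMBERS):
    """Return the required entries that are absent from the archive."""
    # "r:*" lets tarfile detect the compression (gzip here) on its own
    with tarfile.open(archive_path, "r:*") as tar:
        # Normalize names such as "./code/inference.py" to "code/inference.py"
        names = {n[2:] if n.startswith("./") else n for n in tar.getnames()}
    return [member for member in required if member not in names]
```

Calling `missing_members("falcon-7b-instruct/model.tar.gz")` should return an empty list when both files were packaged; any returned entry points at a file that was left out of the archive.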

In [None]:
# from sagemaker.s3 import S3Uploader

# sess = sagemaker.Session()
# role = sagemaker.get_execution_role()
# s3_model_uri = S3Uploader.upload(
#     local_path=str(model_tar_dir.joinpath("model.tar.gz")),
#     desired_s3_uri=f"s3://{sess.default_bucket()}/falcon-7b-instruct/{model_tar_dir}",
# )

## Start here with the pre-packaged tar file
Comment out the cell below if you are packaging the model yourself.

In [None]:
from pathlib import Path
from sagemaker.s3 import S3Uploader

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
# Upload the pre-packaged archive from the current working directory
s3_model_uri = S3Uploader.upload(
    local_path=str(Path.cwd().joinpath("model.tar.gz")),
    desired_s3_uri=f"s3://{sess.default_bucket()}/falcon-7b-instruct/",
)

## Deployment of model
Now we deploy the model to Amazon SageMaker as a real-time endpoint.

In [None]:
from sagemaker.huggingface import HuggingFaceModel
import json

use_quantization = True # whether to use quantization or not
instance_type = "ml.g4dn.xlarge" # instance type to use for deployment
number_of_gpu = 1 # number of gpus to use for inference and tensor parallelism
health_check_timeout = 300 # Increase the timeout for the health check to 5 minutes for downloading the model

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  model_data=s3_model_uri,
  role=role,
  image_uri="763104351884.dkr.ecr.ap-southeast-1.amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04",
  model_server_workers=1,
  env={
    'HF_MODEL_QUANTIZE': json.dumps(use_quantization),
    'SM_NUM_GPUS': json.dumps(number_of_gpu)
  }
)
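
The choice of 8-bit quantization on a single-GPU instance can be motivated with rough arithmetic. The sketch below only estimates the memory taken by the weights; activations and the KV cache need additional room. The parameter count (about 7 billion, rounded from the model name) and the GPU memory size (one 16 GiB NVIDIA T4 on `ml.g4dn.xlarge`) are assumptions, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 7e9      # assumption: ~7 billion parameters, rounded from the model name
GPU_MEMORY_GIB = 16   # assumption: a single 16 GiB T4 on ml.g4dn.xlarge

for label, nbytes in [("bfloat16", 2), ("int8", 1)]:
    needed = weight_memory_gib(NUM_PARAMS, nbytes)
    print(f"{label}: ~{needed:.1f} GiB of {GPU_MEMORY_GIB} GiB for weights")
```

Under these assumptions, 16-bit weights would take most of the GPU memory, while 8-bit weights leave considerably more headroom for generation.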

In [None]:
import uuid
endpoint_name = 'falcon-7b-instruct-{}'.format(str(uuid.uuid4()))
predictor = llm_model.deploy(
  endpoint_name=endpoint_name,
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout
)

## Test our deployed model directly with SageMaker SDK

In [None]:
payload = """Summarize the following passage.
Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. 
It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don't have to manage servers. 
It also provides common machine learning algorithms that are optimized to run efficiently against extremely large data in a distributed environment. 
With native support for bring-your-own-algorithms and frameworks, SageMaker offers flexible distributed training options that adjust to your specific workflows. 
Deploy a model into a secure and scalable environment by launching it with a few clicks from SageMaker Studio or the SageMaker console.<endoftext>"""

parameters = {
    "max_new_tokens": 512,
    "temperature": 0.7,
}

# Run prediction
result = predictor.predict({
    "inputs": payload,
    "parameters": parameters,
})
print(result[0]['generated_text'].split('<endoftext>')[1])

In [None]:
payload = """Write a Java program that reverses a string and duplicates it 10 times.<endoftext>"""

parameters = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "do_sample": True,
    "top_p": 0.1,
    "top_k": 50,
}

# Run prediction
result = predictor.predict({
    "inputs": payload,
    "parameters": parameters,
})
print(result[0]['generated_text'].split('<endoftext>')[1])
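
The expression `result[0]['generated_text'].split('<endoftext>')[1]` raises an `IndexError` if the marker is missing from the returned text. A slightly more defensive variant is sketched below; `extract_completion` is a hypothetical helper, and the response shape (a list with one dict holding `generated_text`) is assumed from the cells above.

```python
def extract_completion(result, marker="<endoftext>"):
    """Return the text after the first `marker`, or the full text if it is absent."""
    generated = result[0]["generated_text"]
    # Split once so that everything after the first marker is kept together
    parts = generated.split(marker, 1)
    completion = parts[1] if len(parts) > 1 else generated
    return completion.strip()
```

It can be used as `print(extract_completion(result))` in place of the `print` calls above. Note that, unlike the original expression, it keeps any text that follows a second marker.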

## Deploy your own Chatbot backed by Amazon SageMaker
We use Gradio to deploy a simple chatbot backed by the model that you've just deployed.

In [None]:
!pip install gradio --upgrade

In [None]:
import gradio as gr

# hyperparameters for llm
parameters = {
    "do_sample": True,
    "top_p": 0.1,
    "temperature": 0.7,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
}

with gr.Blocks() as demo:
    gr.Markdown("## Chat with Falcon 7B Instruct")
    with gr.Column():
        chatbot = gr.Chatbot()
        with gr.Row():
            with gr.Column():
                message = gr.Textbox(label="Chat Message Box", placeholder="Chat Message Box", show_label=False)
            with gr.Column():
                with gr.Row():
                    submit = gr.Button("Submit")
                    clear = gr.Button("Clear")

    def respond(message, chat_history):
        # convert chat history to prompt
        converted_chat_history = ""
        for c in chat_history:
            converted_chat_history += f"<|prompter|>{c[0]}<|endoftext|><|assistant|>{c[1]}<|endoftext|>"
        prompt = f"{converted_chat_history}<|prompter|>{message}<|endoftext|><|assistant|>"

        # send request to endpoint
        llm_response = predictor.predict({"inputs": prompt, "parameters": parameters})

        # remove prompt from response
        parsed_response = llm_response[0]["generated_text"][len(prompt):]
        # str.split returns a new list, so the truncated text must be assigned back
        if 'User' in parsed_response:
            parsed_response = parsed_response.split('User')[0]
        if '<|endoftext|>' in parsed_response:
            parsed_response = parsed_response.split('<|endoftext|>')[0]
        chat_history.append((message, parsed_response))
        return "", chat_history

    submit.click(respond, [message, chatbot], [message, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(share=True)
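
The prompt construction and response trimming inside `respond` can also be factored into pure functions, which makes them easy to check without a live endpoint. This is a sketch that reuses the special tokens from the cell above; `build_prompt` and `clean_response` are hypothetical names, not part of Gradio or the SageMaker SDK.

```python
PROMPTER, ASSISTANT, EOS = "<|prompter|>", "<|assistant|>", "<|endoftext|>"


def build_prompt(chat_history, message):
    """Flatten (user, assistant) turns plus the new message into one prompt."""
    turns = "".join(
        f"{PROMPTER}{user}{EOS}{ASSISTANT}{bot}{EOS}" for user, bot in chat_history
    )
    return f"{turns}{PROMPTER}{message}{EOS}{ASSISTANT}"


def clean_response(generated_text, prompt, stop_markers=("User", EOS)):
    """Drop the echoed prompt, then cut the text at each stop marker in turn."""
    response = generated_text[len(prompt):]
    for marker in stop_markers:
        # split(...)[0] leaves the text unchanged when the marker is absent
        response = response.split(marker)[0]
    return response
```

Inside `respond`, these would be used as `prompt = build_prompt(chat_history, message)` and `parsed_response = clean_response(llm_response[0]["generated_text"], prompt)`.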