# Falcon SageMaker Inference with CTranslate2

Sample code for deploying [Falcon 40B Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) to a SageMaker endpoint, using CTranslate2 for faster inference.

This notebook was tested in a SageMaker Studio notebook on ml.m5.rxlarge, using the PyTorch 2.0.0 Python 3.10 CPU Optimized container.

In [10]:
!pip install "sagemaker>=2.143.0" -U
!pip install ctranslate2 transformers torch einops

Collecting sagemaker>=2.143.0
  Using cached sagemaker-2.165.0-py2.py3-none-any.whl
Collecting attrs<24,>=23.1.0 (from sagemaker>=2.143.0)
  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting PyYAML==6.0 (from sagemaker>=2.143.0)
  Using cached PyYAML-6.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (682 kB)
Installing collected packages: PyYAML, attrs, sagemaker
  Attempting uninstall: PyYAML
    Found existing installation: PyYAML 5.4.1
    Uninstalling PyYAML-5.4.1:
      Successfully uninstalled PyYAML-5.4.1
  Attempting uninstall: attrs
    Found existing installation: attrs 22.2.0
    Uninstalling attrs-22.2.0:
      Successfully uninstalled attrs-22.2.0
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.153.0
    Uninstalling sagemaker-2.153.0:
      Successfully uninstalled sagemaker-2.153.0
ERROR: pip's dependency resolver does not currently take into account all the packages that 

In [11]:
import sagemaker, boto3, json
from sagemaker import get_execution_role
from sagemaker.pytorch.model import PyTorchModel

role = get_execution_role()
region = boto3.Session().region_name
sess = sagemaker.Session()
bucket = sess.default_bucket()

sagemaker.__version__

'2.165.0'

## Convert Model

Convert the model to the CTranslate2 format with int8 quantization. The conversion loads the checkpoint on the CPU, so it requires a large amount of host memory.
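As a rough guide to how much memory the weights alone occupy, here is a back-of-the-envelope sketch. It assumes about 40 billion parameters (taken from the model name, not measured) and counts weights only, so the real peak during conversion is likely higher. The helper name `weight_gib` is hypothetical.

```python
# Rough estimate of weight storage at different precisions.
# Assumption: ~40e9 parameters, inferred from the model name.
NUM_PARAMS = 40e9


def weight_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the size of the weights in GiB for a given element width."""
    return num_params * bytes_per_param / 2**30


for dtype, width in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]:
    print(f"{dtype:>16}: ~{weight_gib(NUM_PARAMS, width):.0f} GiB")
```

Under these assumptions the int8 output is roughly a quarter of the float32 size, which is what makes the later single-instance deployment plausible.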

In [14]:
!rm -rf scripts/model
!ct2-transformers-converter --low_cpu_mem_usage --model tiiuae/falcon-40b-instruct --quantization int8 --output_dir scripts/model  --trust_remote_code

Loading checkpoint shards:   0%|                          | 0/9 [00:00<?, ?it/s]^C


KeyboardInterrupt: 

In [13]:
!ls -l scripts/

total 4
drwxr-xr-x 3 root root 6144 Jun 17 00:02 code


## Package and Upload Model

In [9]:
!apt update -y
!apt install pigz -y

Hit:1 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Reading package lists... Done
Building dependency tree       
Reading state information... Done
64 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
pigz is already the newest version (2.4-1).
0 upgraded, 0 newly installed, 0 to remove and 64 not upgraded.


In [10]:
%cd scripts
# !tar -czvf ../package.tar.gz *
!tar cv ./ | pigz -p 8 > ../package.tar.gz # compress with 8 parallel threads
%cd -

/root/aws-ml-jp/tasks/generative-ai/text-to-text/fine-tuning/instruction-tuning/CTranslate2/scripts
./
./.ipynb_checkpoints/
./code/
./code/requirements.txt
./code/.ipynb_checkpoints/
./code/.ipynb_checkpoints/inference-checkpoint.py
./code/.ipynb_checkpoints/requirements-checkpoint.txt
./code/inference.py
/root/aws-ml-jp/tasks/generative-ai/text-to-text/fine-tuning/instruction-tuning/CTranslate2
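The listing above shows that `.ipynb_checkpoints` directories end up in the archive. If that is not wanted, one option is to build the archive from Python and filter them out. This is a minimal sketch using the standard-library `tarfile` module; `build_package` and `EXCLUDED_DIRS` are hypothetical names, and compression here is single-threaded, unlike `pigz`, so it will be slower for a large model directory.

```python
import tarfile
from pathlib import Path

# Directory names to leave out of the package (an assumption; adjust as needed).
EXCLUDED_DIRS = {".ipynb_checkpoints", "__pycache__"}


def _skip_unwanted(info: tarfile.TarInfo):
    """Return None to drop a member (and everything under it), else keep it."""
    if EXCLUDED_DIRS.intersection(Path(info.name).parts):
        return None
    return info


def build_package(src_dir: str, out_path: str) -> list:
    """Write src_dir to a gzip-compressed tarball and return the member names."""
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname="." keeps paths relative, like `tar cv ./` run inside scripts/.
        tar.add(src_dir, arcname=".", filter=_skip_unwanted)
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()
```

For example, `build_package("scripts", "package.tar.gz")` run from the notebook directory would produce the same layout as the shell command, minus the checkpoint folders. Keep `out_path` outside `src_dir` so the archive does not try to include itself.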


In [11]:
model_path = sess.upload_data('package.tar.gz', bucket=bucket, key_prefix="Falcon40B-Inference-CTranslate2")
model_path

's3://sagemaker-us-west-2-867115166077/Falcon40B-Inference-CTranslate2/package.tar.gz'

## Deploy Model

In [None]:
from sagemaker.serializers import JSONSerializer

endpoint_name = "Falcon40B-Inference-CTranslate"

huggingface_model = PyTorchModel(
    model_data=model_path,
    framework_version="2.0",
    py_version='py310',
    role=role,
    name=endpoint_name,
    env={
        "model_params": json.dumps({
            "tokenizer": "tiiuae/falcon-40b-instruct",
            "model": "model",
            "prompt_input": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
            "prompt_no_input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n"
        }),
        "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600"
    }
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.12xlarge',
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
)
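The `model_params` environment variable carries the two prompt templates as a JSON string. The actual `code/inference.py` is not shown in this notebook, so the following is only a sketch of how a handler might consume it; `build_prompt` is a hypothetical helper, and the shortened templates are stand-ins for the full ones above.

```python
import json


def build_prompt(model_params_json: str, instruction: str, input_text: str = "") -> str:
    """Pick the template that matches the request and fill in its placeholders."""
    params = json.loads(model_params_json)
    if input_text:
        return params["prompt_input"].format(instruction=instruction, input=input_text)
    return params["prompt_no_input"].format(instruction=instruction)


# Shortened stand-in templates; inside the container the JSON string would
# presumably come from os.environ["model_params"].
example_params = json.dumps({
    "prompt_input": "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
    "prompt_no_input": "### Instruction:\n{instruction}\n\n### Response:\n",
})

print(build_prompt(example_params, "Summarize the text.", "CTranslate2 is an inference engine."))
```

Passing the templates through the environment keeps the prompt format configurable at deploy time without repackaging the model archive.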

## Inference

In [None]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor_client=Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)
data = {
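    "instruction": """When did Virgin Australia start operating? Please answer concisely.""".replace("\n", "<NL>"),  # system
    "input": """Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route[3]. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney[4].""".replace("\n", "<NL>"),  # user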
    "max_new_tokens": 64,
    "sampling_temperature": 0.3,
    "stop_ids": [0, 1],
}
response = predictor_client.predict(
    data=data
)
print(response.replace("<NL>", "\n"))

## Benchmark

In [None]:
%timeit response = predictor_client.predict(data=data)

1.36 s ± 320 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
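`%timeit` reports only a mean and standard deviation. If tail latency matters, a small timing loop can collect per-request samples instead. This is a sketch under stated assumptions: `measure_latency` is a hypothetical helper, the p95 value is a simple index-based approximation, and the `time.sleep` call is a stand-in for a real endpoint request.

```python
import statistics
import time


def measure_latency(call, runs: int = 20, warmup: int = 2) -> dict:
    """Time repeated invocations of `call` and summarize latency in seconds."""
    for _ in range(warmup):
        call()  # warm-up requests are not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean": statistics.mean(samples),
        "p50": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }


# Stand-in workload so the sketch runs without an endpoint.
stats = measure_latency(lambda: time.sleep(0.01), runs=5, warmup=1)
print({name: round(value, 4) for name, value in stats.items()})
```

Against the deployed endpoint, the call would be something like `measure_latency(lambda: predictor_client.predict(data=data))`.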

## Delete Endpoint

In [None]:
predictor.delete_model()
predictor.delete_endpoint()