# Deploy LLaVA-v1.5-7B model on Amazon SageMaker

***This notebook works best with the `conda_python3` kernel on an `ml.t3.large` machine***.

---

In this notebook we download the [LLaVA-v1.5-7B](https://huggingface.co/anymodality/llava-v1.5-7b) model and deploy it on SageMaker. We use the `huggingface-pytorch-inference` container and deploy the model on an `ml.g5.xlarge` instance.

The downloaded model files are archived into a `model.tar.gz` file that is uploaded to the default SageMaker S3 bucket. The `inference.py` file is overwritten with a [`llava_inference.py`](./llava_inference.py) file that has code to run inference on an image stored in S3.

## Step 1. Setup

Install the required Python packages and import the relevant modules.

In [6]:
import sys
!{sys.executable} -m pip install -r requirements.txt



In [7]:
import os
import shutil
import logging
import sagemaker
import globals as g
import requests as req
from typing import Dict
from pathlib import Path
from utils import get_bucket_name
from sagemaker.s3 import S3Uploader
from sagemaker import get_execution_role
from huggingface_hub import snapshot_download
from sagemaker.huggingface.model import HuggingFaceModel

[2024-01-09 21:20:50,380] p15034 {credentials.py:1075} INFO - Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [8]:
# global constants
!pygmentize globals.py

"""
Global variables used throughout the code.
"""
import os
import boto3
import sagemaker

# model deployment
HF_MODEL_ID = "anymodality/llava-v1.5-13b"
HF_MODEL_ID: str = "anymodality/llava-v1.5-7b"

HF_TASK: str = "question-answering"
TRANSFORMERS_VERSION: str = "4.28.1"
PYTORCH_VERSION: str = "2.0.0"
PYTHON_VERSION: st

In [9]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

In [10]:
bucket_name: str = get_bucket_name(g.CFN_STACK_NAME)
s3_model_uri: str = os.path.join("s3://", bucket_name, g.BUCKET_PREFIX, os.path.basename(g.HF_MODEL_ID))
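
The cell above builds the S3 URI with `os.path.join`, which produces forward slashes on the Linux notebook instance. As a minimal sketch, assuming you want the same result regardless of the host operating system, `posixpath.join` can be used instead. The helper name `build_s3_uri` and the sample bucket and prefix values are hypothetical and not part of this repository.

```python
import posixpath


def build_s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI using forward slashes."""
    # drop empty parts and stray slashes so "multimodal/" does not produce "//"
    clean = [p.strip("/") for p in parts if p.strip("/")]
    if not clean:
        return f"s3://{bucket}"
    return f"s3://{bucket}/{posixpath.join(*clean)}"


# hypothetical values, for illustration only
print(build_s3_uri("my-bucket", "multimodal", "llava-v1.5-7b"))
```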

In [11]:
model_dir: str = g.HF_MODEL_ID.split("/")[-1]
model_tar_gz_path: str = os.path.join(os.path.dirname(os.getcwd()), f"model_{model_dir}.tar.gz")
logger.info(f"HF_MODEL_ID={g.HF_MODEL_ID}, model_dir={model_dir}, model_tar_gz_path={model_tar_gz_path}")

[2024-01-09 21:20:52,324] p15034 {2234532258.py:3} INFO - HF_MODEL_ID=anymodality/llava-v1.5-7b, model_dir=llava-v1.5-7b, model_tar_gz_path=/home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog1-TitanEmbeddings-LVM/model_llava-v1.5-7b.tar.gz


## Step 2: Prepare the `model.tar.gz`

1. Download the model files from HuggingFace.

1. Replace the downloaded `code/inference.py` with [`llava_inference.py`](./llava_inference.py).

1. Archive the model directory into a `.tar.gz` file.

Download the model files. **This takes about 5 minutes**.

In [12]:
%%time
model_path: str = os.path.join(os.path.dirname(os.getcwd()), model_dir)
Path(model_path).mkdir(exist_ok=True)
# Download model from Hugging Face into model_dir
snapshot_download(g.HF_MODEL_ID, local_dir=model_path, local_dir_use_symlinks=False)

Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

deploy_llava.ipynb:   0%|          | 0.00/11.1k [00:00<?, ?B/s]

code/inference.py:   0%|          | 0.00/3.19k [00:00<?, ?B/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/27.1k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.55k [00:00<?, ?B/s]

code/requirements.txt:   0%|          | 0.00/55.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/749 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.16k [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.54G [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

CPU times: user 17 s, sys: 32.8 s, total: 49.8 s
Wall time: 3min 2s


'/home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog1-TitanEmbeddings-LVM/llava-v1.5-7b'
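
Before archiving several gigabytes of weights, it can be worth confirming that every weight shard arrived. The following is a minimal sketch, assuming the standard Hugging Face sharded checkpoint layout in which `pytorch_model.bin.index.json` contains a `weight_map` from parameter names to shard file names. The helper name `missing_shards` is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def missing_shards(model_path: str,
                   index_name: str = "pytorch_model.bin.index.json") -> List[str]:
    """Return shard file names listed in the index that are absent from model_path."""
    index = json.loads((Path(model_path) / index_name).read_text())
    # "weight_map" maps each parameter name to the shard file that stores it
    shards = sorted(set(index["weight_map"].values()))
    return [s for s in shards if not (Path(model_path) / s).is_file()]
```

After a complete download, `missing_shards(model_path)` should return an empty list.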

In [13]:
# update the inference script
inf_dest: str = os.path.join(model_path, 'code', 'inference.py')
shutil.copyfile("llava_inference.py", inf_dest)

'/home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog1-TitanEmbeddings-LVM/llava-v1.5-7b/code/inference.py'
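
The contents of `llava_inference.py` are not shown in this notebook. As a rough sketch of the general shape such a script takes, the SageMaker Hugging Face inference toolkit looks in `code/inference.py` for optional handler functions such as `model_fn` and `predict_fn`. The bodies below are placeholders, and the payload keys `image` and `question` are assumptions made for illustration; they are not taken from the actual script.

```python
from typing import Any, Dict


def model_fn(model_dir: str) -> Dict[str, Any]:
    """Called once at container start; model_dir is where model.tar.gz was extracted."""
    # placeholder: a real script would load the tokenizer, weights and image processor here
    return {"model_dir": model_dir}


def predict_fn(data: Dict[str, Any], model: Dict[str, Any]) -> Dict[str, Any]:
    """Called per request with the deserialized payload and the object from model_fn."""
    # "image" and "question" are hypothetical payload keys used only in this sketch
    image = data.get("image")
    question = data.get("question", "")
    # placeholder response: a real script would run generation on the image and prompt
    return {"echo": {"image": image, "question": question},
            "model_dir": model["model_dir"]}
```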

Create a .tar.gz file. **This step takes about 10 minutes**.

In [14]:
%%time
# Create SageMaker model.tar.gz artifact
!cd {model_path};tar -cf {model_tar_gz_path} --use-compress-program=pigz *;cd -

/home/ec2-user/SageMaker/multimodal-rag-on-slide-decks/Blog1-TitanEmbeddings-LVM/notebooks
CPU times: user 9.15 s, sys: 1.01 s, total: 10.2 s
Wall time: 10min 7s
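
The shell command in the previous cell archives the contents of the model directory rather than the directory itself, so `code/inference.py` ends up at the root of the archive. If `pigz` is not installed, a slower pure-Python alternative is possible with the standard `tarfile` module. This is a minimal sketch, and the helper names `make_model_archive` and `archive_members` are hypothetical.

```python
import os
import tarfile
from typing import List


def make_model_archive(model_path: str, tar_gz_path: str) -> None:
    """Archive the contents of model_path (not the directory itself) as a .tar.gz."""
    with tarfile.open(tar_gz_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_path)):
            # skip hidden entries to mirror the shell glob "*" used above
            if name.startswith("."):
                continue
            # arcname keeps entries relative to the archive root, e.g. "code/inference.py"
            tar.add(os.path.join(model_path, name), arcname=name)


def archive_members(tar_gz_path: str) -> List[str]:
    """List member names so the layout can be checked before uploading."""
    with tarfile.open(tar_gz_path, "r:gz") as tar:
        return tar.getnames()
```

Listing the members of a multi-gigabyte archive reads the whole file, so `archive_members` is best tried on a small test directory first.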


Upload the `model.tar.gz` to S3. **This step takes about 3 minutes**.

In [15]:
%%time
# upload model.tar.gz to s3
S3Uploader.upload(local_path=model_tar_gz_path, desired_s3_uri=s3_model_uri)
logger.info(f"model uploaded to: {s3_model_uri}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


[2024-01-09 21:37:41,602] p15034 {<timed exec>:3} INFO - model uploaded to: s3://multimodal-bucket-731963050968/multimodal/llava-v1.5-7b


CPU times: user 1min 48s, sys: 2min 4s, total: 3min 52s
Wall time: 3min 38s


## Step 3: Deploy the model on SageMaker

Here we deploy the model on SageMaker. We use the [HuggingFaceModel](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html) class from the SageMaker SDK. **This step takes about 10 minutes**.

In [16]:
%%time

# set the env vars for the model
config: Dict = dict(HF_TASK=g.HF_TASK)

model_data: str = os.path.join(s3_model_uri, f"model_{os.path.basename(g.HF_MODEL_ID)}.tar.gz")
instance_type: str = "ml.g5.xlarge"
instance_count: int = 1
logger.info(f"going to deploy {g.HF_MODEL_ID} model, model_data={model_data}, instance_type={instance_type}, instance_count={instance_count}")

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=model_data,   
   role=get_execution_role(),                                  
   transformers_version=g.TRANSFORMERS_VERSION,  
   pytorch_version=g.PYTORCH_VERSION,            
   py_version=g.PYTHON_VERSION,                
   model_server_workers=1,
   env=config
)

# deploy the endpoint
predictor = huggingface_model.deploy(initial_instance_count=instance_count,
                                     instance_type=instance_type)
logger.info("finished deploying model")

[2024-01-09 21:37:41,618] p15034 {<timed exec>:7} INFO - going to deploy anymodality/llava-v1.5-7b model, model_data=s3://multimodal-bucket-731963050968/multimodal/llava-v1.5-7b/model_llava-v1.5-7b.tar.gz, instance_type=ml.g5.xlarge, instance_count=1


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


[2024-01-09 21:37:42,257] p15034 {session.py:3645} INFO - Creating model with name: huggingface-pytorch-inference-2024-01-09-21-37-42-256
[2024-01-09 21:37:42,858] p15034 {session.py:5321} INFO - Creating endpoint-config with name huggingface-pytorch-inference-2024-01-09-21-37-42-858
[2024-01-09 21:37:43,173] p15034 {session.py:4223} INFO - Creating endpoint with name huggingface-pytorch-inference-2024-01-09-21-37-42-858


--------------!

[2024-01-09 21:45:15,145] p15034 {<timed exec>:23} INFO - finished deploying model


CPU times: user 507 ms, sys: 11.5 ms, total: 518 ms
Wall time: 7min 33s


The [HuggingFaceModel](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html) class encapsulates several defaults. Let's examine the parameters of the deployed model to review its settings.

In [17]:
logger.info(f"model info -> {vars(huggingface_model)}")

[2024-01-09 21:45:15,162] p15034 {985546471.py:1} INFO - model info -> {'framework_version': '4.28.1', 'pytorch_version': '2.0.0', 'tensorflow_version': None, 'py_version': 'py310', 'model_data': 's3://multimodal-bucket-731963050968/multimodal/llava-v1.5-7b/model_llava-v1.5-7b.tar.gz', 'image_uri': None, 'predictor_cls': <class 'sagemaker.huggingface.model.HuggingFacePredictor'>, 'name': 'huggingface-pytorch-inference-2024-01-09-21-37-42-256', '_base_name': 'huggingface-pytorch-inference', 'sagemaker_session': <sagemaker.session.Session object at 0x7f77fc431c60>, 'algorithm_arn': None, 'model_package_arn': None, '_sagemaker_config': {}, 'role': 'arn:aws:iam::731963050968:role/multimodal-stack-SMExecutionRole-JpUn1cG4CQ5G', 'vpc_config': None, 'endpoint_name': 'huggingface-pytorch-inference-2024-01-09-21-37-42-858', '_is_compiled_model': False, '_compilation_job_name': None, '_is_edge_packaged_model': False, 'inference_recommender_job_results': None, 'inference_recommendations': None, '

Save the name of the deployed endpoint so that the other notebooks can create a [`Predictor`](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) and use this model.

In [18]:
_ = Path(g.ENDPOINT_FILENAME).write_text(predictor.endpoint_name)
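
As a minimal sketch of how another notebook might pick up the saved endpoint, the file can be read back and passed to `HuggingFacePredictor` from the SageMaker SDK. The helper names `read_endpoint_name` and `attach_predictor` are hypothetical, and the request payload the endpoint expects depends on `llava_inference.py`, so no invocation is shown here.

```python
from pathlib import Path


def read_endpoint_name(filename: str) -> str:
    """Read the endpoint name saved by this notebook, dropping surrounding whitespace."""
    return Path(filename).read_text().strip()


def attach_predictor(endpoint_name: str):
    """Create a predictor for an already deployed endpoint (requires AWS credentials)."""
    # imported lazily so that reading the file does not require the SageMaker SDK
    from sagemaker.huggingface.model import HuggingFacePredictor
    return HuggingFacePredictor(endpoint_name=endpoint_name)
```

In a downstream notebook this would be used as `predictor = attach_predictor(read_endpoint_name(g.ENDPOINT_FILENAME))`.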