# Standard instructions for using the LMI container on SageMaker
In this tutorial, you will deploy the LMI (Large Model Inference) container from the AWS Deep Learning Containers (DLC) to SageMaker and run inference with it.

Please make sure the following permissions are granted before running the notebook:

- S3 bucket push access
- SageMaker access

## Step 1: Let's upgrade the SageMaker SDK and import dependencies

In [None]:
%pip install sagemaker boto3 awscli --upgrade  --quiet

In [None]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers, multidatamodel

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment (public attribute instead of the private _region_name)
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

## Step 2: Start preparing model artifacts
The LMI container expects the following artifacts to set up the model:
- serving.properties (required): Defines the model server settings
- model.py (optional): A Python file that defines the core inference logic
- requirements.txt (optional): Any additional pip packages that need to be installed
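
The artifact layout above can be checked before packaging. The following is a minimal sketch using only the standard library; `check_model_folder` is a hypothetical helper introduced for illustration and is not part of the LMI container or the SageMaker SDK.

```python
import os
import tempfile

# Artifact names taken from the list above; the helper itself is hypothetical.
REQUIRED = ("serving.properties",)
OPTIONAL = ("model.py", "requirements.txt")


def check_model_folder(folder):
    """Return (missing_required, present_optional) for a model folder."""
    present = set(os.listdir(folder))
    missing = [name for name in REQUIRED if name not in present]
    optional = [name for name in OPTIONAL if name in present]
    return missing, optional


# Usage: a throwaway folder that contains only serving.properties.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "serving.properties"), "w") as f:
        f.write("engine=Python\noption.model_id=facebook/opt-350m\n")
    missing, optional = check_model_folder(tmp)
    print("missing required:", missing, "| optional present:", optional)
```

An empty `missing` list means the folder has everything the container strictly needs.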

In [None]:
%%writefile model.py
from djl_python import Input, Output
import os
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

predictor = None

def get_model(properties):
    model_name = properties['model_id']
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    dtype = torch.float16
    model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, torch_dtype=dtype)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)
    return generator


def handle(inputs: Input) -> None:
    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_json()['prompt']
    result = predictor(data, do_sample=True)
    return Output().add(result)


In [None]:
import shutil
import os
models_to_run = [
    "facebook/opt-350m",
    "bigscience/bloomz-560m",
    "EleutherAI/gpt-neo-125M",
    "cerebras/Cerebras-GPT-590M",
]
# Folder names are the lowercase model names without the organization prefix
model_folders = [model.split("/")[1].lower() for model in models_to_run]

for folder, model in zip(model_folders, models_to_run):
    if os.path.exists(folder):
        shutil.rmtree(folder)
    os.makedirs(folder)
    with open(os.path.join(folder, "serving.properties"), "w") as f:
        f.write(f"engine=Python\noption.model_id={model}\n")
    shutil.copyfile("model.py", f"{folder}/model.py")

### DJLServing memory management for MME

In DJLServing, you can control how much memory is allocated on each CPU/GPU on SageMaker. The following settings are available:

- `required_memory_mb`: CPU/GPU required memory in MB
- `reserved_memory_mb`: CPU/GPU memory reserved for computation, in MB
- `gpu.required_memory_mb`: GPU required memory in MB
- `gpu.reserved_memory_mb`: GPU memory reserved for computation, in MB

If you need 20 GB of CPU memory and 2 GB of GPU memory, you can set:

```
required_memory_mb=20480
gpu.required_memory_mb=2048
```
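
The unit conversion behind these values can be sketched as below. This is only an illustration that assumes 1 GB = 1024 MB, as the numbers above imply; `gb_to_mb` and `memory_properties` are hypothetical helpers, not DJLServing APIs.

```python
def gb_to_mb(gb):
    """Convert gigabytes to megabytes, assuming 1 GB = 1024 MB."""
    return int(gb * 1024)


def memory_properties(cpu_gb=None, gpu_gb=None):
    """Render serving.properties lines for the requested memory sizes."""
    lines = []
    if cpu_gb is not None:
        lines.append(f"required_memory_mb={gb_to_mb(cpu_gb)}")
    if gpu_gb is not None:
        lines.append(f"gpu.required_memory_mb={gb_to_mb(gpu_gb)}")
    return "\n".join(lines)


# Reproduces the two lines shown above for 20 GB CPU and 2 GB GPU memory.
print(memory_properties(cpu_gb=20, gpu_gb=2))
```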

In the following code, we will create a "bomb" model that tries to take over all GPU memory, and see how that affects the result. For more information on these settings, see the [serving.properties documentation](https://docs.djl.ai/docs/serving/serving/docs/modes.html#servingproperties).

In [None]:
%%writefile serving.properties
engine=Python
option.model_id=facebook/opt-350m
gpu.reserved_memory_mb=30000

In [None]:
%%sh
cp -r opt-350m/ bomb/
mv serving.properties bomb/
tar czvf opt-350m.tar.gz opt-350m/
tar czvf bloomz-560m.tar.gz bloomz-560m/
tar czvf gpt-neo-125m.tar.gz gpt-neo-125m/
tar czvf cerebras-gpt-590m.tar.gz cerebras-gpt-590m/
tar czvf bomb.tar.gz bomb/
rm -rf opt-350m/ bloomz-560m/ gpt-neo-125m/ cerebras-gpt-590m/ bomb/ model.py
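
If you prefer to stay in Python, the packaging step can be approximated with the standard `tarfile` module. This is a sketch, not the notebook's method; `pack_model_folder` is a hypothetical helper, and it keeps the folder as the top-level entry of the archive, as `tar czvf opt-350m.tar.gz opt-350m/` does.

```python
import os
import tarfile
import tempfile


def pack_model_folder(folder, out_dir="."):
    """Create <folder-name>.tar.gz with the folder as the top-level entry."""
    name = os.path.basename(os.path.normpath(folder))
    archive = os.path.join(out_dir, f"{name}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(folder, arcname=name)
    return archive


# Usage: package a throwaway model folder and list the archive contents.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "opt-350m")
    os.makedirs(folder)
    with open(os.path.join(folder, "serving.properties"), "w") as f:
        f.write("engine=Python\noption.model_id=facebook/opt-350m\n")
    archive = pack_model_folder(folder, out_dir=tmp)
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())
    print(members)
```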

## Step 3: Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.

### Getting the container image URI

The available frameworks are:
- djl-deepspeed (0.20.0, 0.21.0)
- djl-fastertransformer (0.21.0)

In [None]:
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.21.0",
)

### Upload artifacts to S3 and create the SageMaker model

In [None]:
s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
for model_name in model_folders:
    code_artifact = sess.upload_data(f"{model_name}.tar.gz", bucket, s3_code_prefix)
    print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")
env = {"HUGGINGFACE_HUB_CACHE": "/tmp", "TRANSFORMERS_CACHE": "/tmp"}
model_s3_folder = os.path.dirname(code_artifact) + "/"

model = multidatamodel.MultiDataModel("LMITestModel", model_s3_folder, image_uri=image_uri, env=env, role=role)
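
On a multi-model endpoint, each model is addressed by its path relative to the shared S3 prefix. The sketch below shows how one uploaded artifact URI splits into that prefix and the `target_model` key used at prediction time. The bucket name `example-bucket` is a placeholder and `split_artifact_uri` is a hypothetical helper.

```python
import posixpath


def split_artifact_uri(code_artifact):
    """Split an S3 artifact URI into (model data prefix, target model key)."""
    prefix = posixpath.dirname(code_artifact) + "/"
    target_model = posixpath.basename(code_artifact)
    return prefix, target_model


# "example-bucket" is a placeholder; sess.default_bucket() returns the real name.
prefix, target = split_artifact_uri(
    "s3://example-bucket/large-model-lmi/code/opt-350m.tar.gz"
)
print(prefix)
print(target)
```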

### Create SageMaker endpoint

You need to specify the instance type to use and the endpoint name.

In [None]:
instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             # container_startup_health_check_timeout=3600
            )

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

## Step 4: Test and benchmark the inference

In [None]:
print(predictor.predict({"prompt": "Large model inference is"}, target_model="opt-350m.tar.gz"))
print(predictor.predict({"prompt": "Large model inference is"}, target_model="bloomz-560m.tar.gz"))
print(predictor.predict({"prompt": "Large model inference is"}, target_model="gpt-neo-125m.tar.gz"))
print(predictor.predict({"prompt": "Large model inference is"}, target_model="cerebras-gpt-590m.tar.gz"))
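
For a rough latency benchmark, the calls above can be wrapped in a small timing helper. This is a minimal sketch: `time_calls` is a hypothetical helper, and `fake_predict` is a stand-in so the example runs without an endpoint.

```python
import statistics
import time


def time_calls(fn, payload, repeats=5):
    """Call fn(payload) several times and summarize wall-clock latency."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(payload)
        latencies.append(time.perf_counter() - start)
    return {
        "n": len(latencies),
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }


def fake_predict(payload):
    """Stand-in for predictor.predict so the sketch runs offline."""
    return [{"generated_text": payload["prompt"] + " ..."}]


stats = time_calls(fake_predict, {"prompt": "Large model inference is"})
print(stats)
```

Against the real endpoint, you could pass `lambda p: predictor.predict(p, target_model="opt-350m.tar.gz")` as `fn`. The first call to each model is likely to be slower than later ones, since the model has to be loaded first.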

### Testing a bomb model

Now let's see what happens when a model needs 30 GB of GPU memory:

In [None]:
code_artifact = sess.upload_data("bomb.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")
try:
    predictor.predict({"prompt": "Large model inference is"}, target_model="bomb.tar.gz")
except Exception as e:
    # The bomb model asks for more GPU memory than is available, so loading is expected to fail
    print(f"Loading failed: {e}")
    print("You can still load more models that are smaller than the GPU memory size")

The model failed to load because the GPU has 24 GB of memory in total and cannot hold a 30 GB model. The model server is still alive, however. Behind the scenes, SageMaker unloads all models to free up space, so no model is currently loaded. You can rerun the four predictions above and the model server will load the models again.
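
The arithmetic behind this failure can be sketched as follows. The fit rule here is an assumption made for illustration (required plus reserved memory must not exceed the device total) and is not DJLServing's actual admission logic. `fits_on_gpu` is a hypothetical helper.

```python
GPU_TOTAL_MB = 24 * 1024  # 24 GB total GPU memory, as stated above


def fits_on_gpu(required_mb, reserved_mb, total_mb=GPU_TOTAL_MB):
    """Assumed rule, for illustration only: required + reserved <= total."""
    return required_mb + reserved_mb <= total_mb


# The bomb model sets gpu.reserved_memory_mb=30000, which exceeds 24576 MB.
bomb_fits = fits_on_gpu(required_mb=0, reserved_mb=30000)
# A model that needs 2 GB of GPU memory stays well within the limit.
small_fits = fits_on_gpu(required_mb=2048, reserved_mb=0)
print(bomb_fits, small_fits)
```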

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()