In this notebook I walk through how to deploy a fine-tuned LLM to a SageMaker real-time endpoint.
The LLM used here is a fine-tuned Llama2-7b model.

A few things SageMaker offers:
* A variety of [hosting options](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html)
* Container [logs](https://docs.aws.amazon.com/sagemaker/latest/dg/logging-cloudwatch.html) and [metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html), all available in CloudWatch
* Automatic metadata capture for model lineage
* A robust selection of managed hosting images for popular [frameworks](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html)

In this example we use the [Deep Java Library (DJL) Serving](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/using_djl.html) framework to host a fine-tuned Llama2-7b model.

To host a model on SageMaker, the model files must be in S3. For language models, the model weights, model config, and tokenizer config are expected in S3.
For example:
```
my_bucket/my_model/
|- config.json
|- added_tokens.json
|- pytorch_model-*-of-*.bin # model weights can be partitioned into multiple checkpoints
|- tokenizer.json
|- tokenizer_config.json
|- vocab.json
```
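Before deploying, it can help to confirm that the S3 prefix holds the pieces listed above. The following is a minimal sketch, not part of the original notebook: `missing_model_artifacts` is a hypothetical helper, and the required file patterns are assumptions based on the example layout (your checkpoint may use different weight file names). It works on a plain list of file names, so you can feed it the keys returned by whatever S3 listing call you prefer.

```python
from fnmatch import fnmatch

# Assumed patterns, taken from the example layout above; adjust for your checkpoint.
REQUIRED_PATTERNS = {
    "model config": "config.json",
    "tokenizer config": "tokenizer_config.json",
    "model weights": "pytorch_model*.bin",
}


def missing_model_artifacts(file_names):
    """Return the labels of required artifacts that no file name matches."""
    return [
        label
        for label, pattern in REQUIRED_PATTERNS.items()
        if not any(fnmatch(name, pattern) for name in file_names)
    ]


# File names as they might appear under my_bucket/my_model/ (illustrative only)
example_listing = [
    "config.json",
    "pytorch_model-00001-of-00002.bin",
    "pytorch_model-00002-of-00002.bin",
    "tokenizer.json",
    "tokenizer_config.json",
]
print(missing_model_artifacts(example_listing))  # prints []
```

An empty list means every assumed artifact type was found; any labels returned name what is still missing.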

The SageMaker-managed DJL images come with a default inference handler for serving, so you do not need to provide one. If you do provide your own, you can pass it as a local path or an S3 URI (when using an S3 URI, the inference artifacts in the source directory must be compressed into a `tar.gz` archive).
Here is an example of a directory containing my inference artifacts:
```
sourcedir/
|- script.py # Inference handler code
|- serving.properties # Model Server configuration file
|- requirements.txt # Additional Python requirements that will be installed at runtime via PyPI
|- lib
    |- *.whl files # In contrast to requirements.txt, packaged wheel files to be installed at runtime
```
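When the inference artifacts are passed as an S3 URI, they have to be packaged as a `tar.gz` first. Below is a minimal sketch using Python's standard `tarfile` module. `package_source_dir` is a hypothetical helper, and placing the files at the top level of the archive (rather than under a `sourcedir/` folder) is an assumption you should verify against your container.

```python
import tarfile
from pathlib import Path


def package_source_dir(source_dir, output_path="sourcedir.tar.gz"):
    """Compress the contents of source_dir into a gzip-compressed tar archive.

    Assumption: entries are stored relative to source_dir, so script.py sits
    at the top level of the archive instead of under a "sourcedir/" folder.
    """
    source_dir = Path(source_dir)
    with tarfile.open(output_path, "w:gz") as archive:
        for path in sorted(source_dir.rglob("*")):
            # recursive=False because rglob already visits every nested entry
            archive.add(
                path,
                arcname=str(path.relative_to(source_dir)),
                recursive=False,
            )
    return output_path
```

Calling `package_source_dir("code")` would write `sourcedir.tar.gz` to the working directory, ready to upload to S3. Keep the output path outside the source directory so the archive does not try to include itself.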



IMPORT MODULES

In [4]:
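import sagemaker
import boto3

# Create a session to look up the default bucket for this account and region
sess = sagemaker.Session()
sagemaker_session_bucket = sess.default_bucket()
# IAM role that SageMaker assumes to read from S3 and create the endpoint
role = sagemaker.get_execution_role()
# Re-create the session so it is explicitly bound to that bucket
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)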

The inference script I used to serve this model is the same one used to host the Llama 2 models on JumpStart. You can write your own inference script to meet your needs.
All SageMaker JumpStart artifacts are hosted in an S3 bucket managed by the service team.
The inference artifacts for Llama 2 models on JumpStart can be found here:
* s3://jumpstart-cache-prod-{region}/source-directory-tarballs/meta/inference/textgeneration/v1.1.0/sourcedir.tar.gz
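The `{region}` placeholder in that URI has to be filled in with your AWS region. A small sketch of doing so and splitting the result into bucket and key is shown here; `jumpstart_sourcedir_location` is a hypothetical helper, and `us-east-1` is only an example region.

```python
from urllib.parse import urlparse

# URI template copied from the text above; only the region is filled in.
JUMPSTART_SOURCEDIR_TEMPLATE = (
    "s3://jumpstart-cache-prod-{region}/source-directory-tarballs/"
    "meta/inference/textgeneration/v1.1.0/sourcedir.tar.gz"
)


def jumpstart_sourcedir_location(region):
    """Return (uri, bucket, key) for the JumpStart inference tarball in a region."""
    uri = JUMPSTART_SOURCEDIR_TEMPLATE.format(region=region)
    parsed = urlparse(uri)
    return uri, parsed.netloc, parsed.path.lstrip("/")


uri, bucket, key = jumpstart_sourcedir_location("us-east-1")  # example region
print(bucket)  # prints jumpstart-cache-prod-us-east-1
print(key)
```

The bucket and key can then be handed to the AWS CLI or to an S3 client to download the tarball and inspect the script locally.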

In [7]:
!pygmentize code/inference.py

import itertools
import os
from enum import Enum
from typing import Any
from typing import Dict
from typing import List
from typing import Optional

import deepspeed
import torch
from djl_python import Input
from djl_python import Output
from djl_python.deepspeed import DeepSpeedService
from sagemaker_jumpstart_huggingface_script_utilities ...

We will use the [DJL DeepSpeed](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#deepspeedmodel) image to host our Llama 2 model. It comes prepackaged with certain [modules](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-dlc.html). For a full list of supported DL frameworks, see [deep-learning-containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

In [23]:
from sagemaker.djl_inference.model import DeepSpeedModel

djl_model = DeepSpeedModel(
    "s3://<path to model files>",  # S3 URI containing model files. This can also be a HuggingFace Hub model id
    role,  # SageMaker execution role
    source_dir="code",  # local dir holding your custom inference script and other dependencies
    entry_point="inference.py",  # inference script located within the source_dir path
    dtype="fp16",  # the data type to use for loading your model
    task="text-generation",
    model_loading_timeout=3600,  # seconds to wait for the model to load
    tensor_parallel_degree=1,  # number of GPUs to partition the model across using tensor parallelism
    max_tokens=4096,  # maximum number of tokens (input + output) the DeepSpeed engine is configured for
)

In [24]:
predictor = djl_model.deploy("ml.g5.4xlarge", # Instance type
                             initial_instance_count=1)

-----------------!
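The choice of `dtype="fp16"` with `tensor_parallel_degree=1` can be sanity-checked with back-of-the-envelope arithmetic. This is a rough sketch under stated assumptions: about 7 billion parameters, and a single 24 GB GPU on `ml.g5.4xlarge`. It counts weights only and ignores activations, the KV cache, and framework overhead, so treat the numbers as a lower bound.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate memory needed for the model weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 7e9     # assumption: "7b" taken as 7 billion parameters
GPU_MEMORY_GB = 24   # assumption: one 24 GB GPU on ml.g5.4xlarge

for dtype in ("fp32", "fp16"):
    needed = weight_memory_gb(NUM_PARAMS, dtype)
    fits = needed < GPU_MEMORY_GB
    print(f"{dtype}: ~{needed:.0f} GB of weights, fits on one GPU: {fits}")
```

Under these assumptions fp16 weights need about 14 GB, which fits on one GPU, while fp32 weights would need about 28 GB and would not.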

In [43]:
prompt = "What is the capital of Nigeria?"

In [44]:
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.1,
        "top_k": 5,
        "max_new_tokens": 512,
        "repetition_penalty": 1.03,
    },
}

# send request to endpoint
response = predictor.predict(payload,
                             custom_attributes='accept_eula=true')
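`predictor.predict` is a convenience wrapper. A client without the SageMaker SDK can call the same endpoint through the SageMaker runtime API instead. The sketch below assumes boto3 is installed and AWS credentials are configured. `build_request_body` and `invoke_llama_endpoint` are hypothetical helper names, and you must supply the endpoint name yourself (for example from `predictor.endpoint_name`). boto3 is imported inside the function so the payload helper can be used without it.

```python
import json


def build_request_body(prompt, **parameters):
    """Serialize a prompt and generation parameters to the JSON body used above."""
    return json.dumps({"inputs": prompt, "parameters": parameters}).encode("utf-8")


def invoke_llama_endpoint(endpoint_name, body, region_name=None):
    """Send the request with boto3 and return the decoded JSON response."""
    import boto3  # imported lazily; AWS credentials are needed at call time

    client = boto3.client("sagemaker-runtime", region_name=region_name)
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=body,
        CustomAttributes="accept_eula=true",  # same attribute as the predict call
    )
    return json.loads(response["Body"].read())


body = build_request_body(
    "What is the capital of Nigeria?",
    do_sample=True, top_p=0.9, temperature=0.1, top_k=5, max_new_tokens=512,
)
```

With a live endpoint, `invoke_llama_endpoint(predictor.endpoint_name, body)` should return the same kind of structure as `predictor.predict`, assuming the endpoint accepts `application/json`.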

In [45]:
response

[{'generation': '\nWhat is the capital of Nigeria?\nThe capital of Nigeria is Abuja.\nWhat is the capital of Nigeria? The capital of Nigeria is Abuja. Check out this story on USATODAY.com: http://usat.ly/1bY43ZI\nAP Published 12:00 a.m. ET March 17, 2013 | Updated 12:00 a.m. ET March 17, 2013\nAbuja, Nigeria(Photo: AP)\nThe capital of Nigeria is Abuja.\nRead or Share this story: http://usat.ly/1bY43ZI'}]
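The response above echoes the prompt and keeps generating past the answer. One way to tidy it up on the client side is sketched below. `first_answer_line` is a hypothetical helper, and its heuristic (skip echoed prompt lines, keep the first remaining line) is an assumption that suits short factual questions only. The sample is a shortened copy of the output above.

```python
def first_answer_line(response, prompt):
    """Return the first non-empty generated line that is not an echo of the prompt."""
    generation = response[0]["generation"]
    for line in generation.splitlines():
        line = line.strip()
        if line and line != prompt.strip():
            return line
    return ""


# Shortened copy of the endpoint output shown above
sample_response = [{
    "generation": "\nWhat is the capital of Nigeria?\n"
                  "The capital of Nigeria is Abuja.\n"
                  "What is the capital of Nigeria? The capital of Nigeria is Abuja."
}]
print(first_answer_line(sample_response, "What is the capital of Nigeria?"))
# prints: The capital of Nigeria is Abuja.
```

Lowering `max_new_tokens` in the payload is another way to limit how far the model runs on after answering.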