# Deploy a fine-tuned TinyLlama-1.1B model for generative AI inference

## Specify the LMI container image

[SageMaker LMI containers](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-dlc.html) use [DJLServing](https://github.com/deepjavalibrary/djl-serving), a model server that is integrated with the [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx) library to support tensor parallelism across NeuronCores. The DJL model server and the transformers-neuronx library are the core components of the container, which also includes the [Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/). This setup loads models onto [AWS Inferentia2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inferentia2.html) accelerators, parallelizes them across multiple [NeuronCores](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html#neuroncores-v2-arch), and serves them via HTTP endpoints. The cells below use the SageMaker Python SDK to look up the container image.

In [5]:
import logging 
sagemaker_config_logger = logging.getLogger("sagemaker.config") 
sagemaker_config_logger.setLevel(logging.WARNING)

# Import the SageMaker SDK and set up our session
import sagemaker
from sagemaker import Model, image_uris, serializers
import boto3

# NOTE: AWS Workshop Studio events may require a specific region for model deployment (us-east-2 at the time of writing).
# This notebook uses ap-northeast-1; adjust region_name to match your environment.
boto3_sess = boto3.Session(region_name="ap-northeast-1")

sess = sagemaker.session.Session(boto_session = boto3_sess)  # sagemaker session for interacting with different AWS APIs
role = sagemaker.get_execution_role()  # execution role for the endpoint

Couldn't call 'get_role' to get Role ARN from role name SSMDefaultRoleForPVREReporting to get Role path.


In [6]:
image_uri = image_uris.retrieve(
        framework="djl-neuronx",
        region=sess.boto_session.region_name,
        version="0.24.0"
    )
image_uri

'763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/djl-inference:0.24.0-neuronx-sdk2.14.1'
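
The returned URI follows the usual Amazon ECR layout, `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a minimal sketch, the URI can be split into its parts to confirm the region and container version, and to get the registry host used for `docker login` later. The helper name `split_ecr_uri` is hypothetical and not part of the SageMaker SDK.

```python
def split_ecr_uri(uri: str) -> dict:
    """Split an ECR image URI into registry, account, region, repository and tag."""
    registry, _, remainder = uri.partition("/")
    repository, _, tag = remainder.partition(":")
    # The registry host looks like <account>.dkr.ecr.<region>.amazonaws.com
    host_parts = registry.split(".")
    return {
        "registry": registry,
        "account": host_parts[0],
        "region": host_parts[3],
        "repository": repository,
        "tag": tag,
    }


parts = split_ecr_uri(
    "763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/djl-inference:0.24.0-neuronx-sdk2.14.1"
)
print(parts["region"], parts["repository"], parts["tag"])
```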

## Prepare Model Serving Artifacts

The LMI container supports loading models from an Amazon Simple Storage Service (Amazon S3) bucket or the Hugging Face Hub. The parameters needed to load and host the model are specified in the *`serving.properties`* file.

In [35]:
# Create the serving.properties file required by the model server

file_content = f"""engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.model_id=TinyLlama/TinyLlama-1.1B-Chat-v1.0
option.batch_size=1
option.neuron_optimize_level=1
option.tensor_parallel_degree=2
option.load_in_8bit=false
option.n_positions=512
option.rolling_batch=auto
option.dtype=fp16"""

with open("serving.properties","w") as f:
    f.write(file_content)
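
Because starting the container takes several minutes, it can help to sanity-check the file before launching. The following is a minimal sketch, assuming plain `key=value` lines. The `parse_properties` helper is hypothetical and is not how DJLServing itself reads the file.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# A shortened copy of the settings written above, used here for illustration.
sample = """engine=Python
option.model_id=TinyLlama/TinyLlama-1.1B-Chat-v1.0
option.tensor_parallel_degree=2
option.n_positions=512"""

props = parse_properties(sample)
tp_degree = int(props["option.tensor_parallel_degree"])
n_positions = int(props["option.n_positions"])
print(props["option.model_id"], tp_degree, n_positions)
```

To check the real file, pass `open("serving.properties").read()` instead of `sample`.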


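Copy *`serving.properties`* into the `mycode/` directory, which is mounted into the container as the model directory in the next section. (For a SageMaker endpoint, this directory is instead packaged as a tarball and uploaded to an S3 bucket, as in the optional section at the end.)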

In [36]:
%%sh
mkdir -p mycode
cp serving.properties mycode/
# tar czvf mycode.tar.gz mycode/
# rm -rf mycode

## Create Container
Next, we start the container locally with the model configuration defined earlier. Startup usually takes 4-5 minutes because the model is compiled while it loads.

In [32]:
!aws ecr get-login-password --region ap-northeast-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.ap-northeast-1.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


In [None]:
!docker run -it --rm --network=host \
  -v $(pwd)/mycode:/opt/ml/model/ \
  -v `pwd`/logs:/opt/djl/logs \
  -u djl \
  --device /dev/neuron0 \
  --device /dev/neuron1 \
  --device /dev/neuron2 \
  --device /dev/neuron3 \
  --device /dev/neuron4 \
  --device /dev/neuron5 \
  --device /dev/neuron6 \
  --device /dev/neuron7 \
  --device /dev/neuron8 \
  --device /dev/neuron9 \
  --device /dev/neuron10 \
  --device /dev/neuron11 \
  -e MODEL_LOADING_TIMEOUT=7200 \
  -e PREDICT_TIMEOUT=360 \
  {image_uri} serve

INFO  ModelServer Starting model server ...
INFO  Ec2Utils DJL will collect telemetry to help us better understand our users' needs, diagnose issues, and deliver additional features. If you would like to learn more or opt-out please go to: https://docs.djl.ai/docs/telemetry.html for more information.
INFO  ModelServer Starting djl-serving: 0.24.0 ...
INFO  ModelServer
Model server home: /opt/djl
Current directory: /opt/djl
Temp directory: /tmp
Command line: -Dlog4j.configurationFile=/usr/local/djl-serving-0.24.0/conf/log4j2.xml -Xmx1g -Xms1g -Xss2m -XX:+ExitOnOutOfMemoryError
Number of CPUs: 192
Number of Neuron cores: 24
Max heap size: 1024
Config file: /opt/djl/conf/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Default job_queue_size: 1000
Default batch_size: 1
Default max_batch_delay: 100
Default max_idle_time: 60
Model Store: /opt/ml/model
Initial Models: ALL
Netty th
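
Since compilation takes a few minutes, we may want to wait for the server before sending requests. Below is a minimal sketch that polls a health-check URL until it answers with HTTP 200. It assumes the server exposes `/ping` on port 8080, as SageMaker-compatible containers generally do. The function name, timeout, and polling interval are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url="http://127.0.0.1:8080/ping", timeout_s=600,
                     interval_s=10, opener=urllib.request.urlopen):
    """Poll `url` until it returns HTTP 200; return False once the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or a non-2xx status: the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage (requires the container above to be running):
# ready = wait_until_ready()
```

The `opener` parameter exists only so that the function can be exercised without a live server.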

## Inference tests
After the model server has started, we can send it real-time inference requests, either directly over HTTP or, for a SageMaker endpoint, through a Predictor object:
- Create a predictor to submit inference requests and receive responses
- Requests and responses are in JSON format

In [None]:
%%sh
curl -X POST "http://127.0.0.1:8080/predictions/model" \
     -H 'Content-Type: application/json' \
     -d '{"seq_length":512,
          "inputs":
                    "Welcome to Amazon Elastic Compute Cloud,"
          }'
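
The same request can also be sent from Python. This is a minimal sketch using only the standard library. It assumes the local endpoint URL from the `curl` call and the `inputs`/`parameters` payload layout used with the predictor in this notebook. The `build_request` helper is hypothetical.

```python
import json
import urllib.request


def build_request(prompt, url="http://127.0.0.1:8080/predictions/model",
                  max_new_tokens=32):
    """Build a JSON POST request for the local model server."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("Welcome to Amazon Elastic Compute Cloud,")
print(req.get_method(), req.full_url)

# To send it (requires the container from the previous step to be running):
# with urllib.request.urlopen(req, timeout=360) as resp:
#     print(resp.read().decode("utf-8"))
```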

Let's submit an inference request to the model server and receive the inference result.

In [None]:
review_text = "I couldn't believe this was the same director as Antonia's Line.<br /><br />This film has it all, \
a boring plot, disjointed flashbacks, a subplot that has nothing to do with the main plot what so ever, \
and totally uninteresting characters.It was painful to watch. Soooo, painful."

In [None]:
prompt = f"###Query: Classify the following movie review as positive or negative\n \
###Review: {review_text}\n \
###Classification:"

In [None]:
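# `predictor` must be a sagemaker.Predictor bound to a deployed endpoint
# (see the optional section below for how one is created).
result = predictor.predict(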
    {"inputs": prompt, "parameters": {"max_new_tokens":32, "do_sample":"true"}}
)
result
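
The prompt construction and the reading of the answer can be wrapped in small helpers. This is a minimal sketch: `build_prompt` and `extract_label` are hypothetical names, the builder omits the leading space that the line continuations above leave before `###Review` and `###Classification`, and the response is assumed to be a JSON body with a `generated_text` field, as used later in this notebook.

```python
import json


def build_prompt(review_text: str) -> str:
    """Build the classification prompt used in the cells above."""
    return (
        "###Query: Classify the following movie review as positive or negative\n"
        f"###Review: {review_text}\n"
        "###Classification:"
    )


def extract_label(raw_response) -> str:
    """Return 'positive', 'negative', or 'unknown' from a JSON response body."""
    text = json.loads(raw_response)["generated_text"].lower()
    pos, neg = text.find("positive"), text.find("negative")
    if pos == -1 and neg == -1:
        return "unknown"
    # Whichever label appears first in the generated text wins.
    if neg == -1 or (pos != -1 and pos < neg):
        return "positive"
    return "negative"


# A made-up response body, for illustration only (not real model output).
fake_response = b'{"generated_text": " Negative"}'
print(extract_label(fake_response))
```

Because sampling is enabled, the generated text can vary between calls, so a tolerant parser like this is only a rough heuristic.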

In [None]:
review_text = "This movie is one of my all-time favorites. I think that Sean Penn did a great job acting. \
It is one of the few true stories that made it to film that I really like. It is in my top 10 films of all-time. \
I watch it over and over and never get tired of it. Great movie!"

In [None]:
prompt = f"###Query: Classify the following movie review as positive or negative\n \
###Review: {review_text}\n \
###Classification:"

In [None]:
result = predictor.predict(
    {"inputs": prompt, "parameters": {"max_new_tokens":32, "do_sample":"true"}}
)
result

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()

Congratulations on completing the LLM deployment for the inference module!

## (Optional) Deploy original TinyLlama model from Hugging Face hub

If you have spare time, you can optionally deploy the original TinyLlama model from the [Hugging Face hub](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.4) for even more fun!

In this scenario, you specify the Hugging Face model name with the *`model_id`* parameter so that the model is downloaded directly from the Hugging Face repo. The remaining steps are the same as before.

In [None]:
image_uri = image_uris.retrieve(
        framework="djl-neuronx",
        region=sess.boto_session.region_name,
        version="0.24.0"
    )
image_uri

In [None]:
%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.model_id=TinyLlama/TinyLlama-1.1B-Chat-v0.4
option.batch_size=1
option.neuron_optimize_level=1
option.tensor_parallel_degree=2
option.load_in_8bit=false
option.n_positions=512
option.rolling_batch=auto
option.dtype=fp16

In [None]:
%%sh
mkdir -p mycode
mv serving.properties mycode/
tar czvf mycode.tar.gz mycode/
rm -rf mycode

In [None]:
s3_code_prefix = "neuron_events2024/large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mycode.tar.gz", bucket, s3_code_prefix)
print(f"Code uploaded to --- > {code_artifact}")

In [None]:
instance_type = "ml.inf2.xlarge"
endpoint_name = sagemaker.utils.name_from_base("tinyllama-original-model")

In [None]:
model = Model(image_uri=image_uri, model_data=code_artifact, role=role, sagemaker_session = sess)

model._is_compiled_model = True

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             container_startup_health_check_timeout=500,
             volume_size=256,
             endpoint_name=endpoint_name)

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer()
)

In [None]:
prompt = "How to get in a good university?"
formatted_prompt = (
    f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
)
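
The same `<|im_start|>` / `<|im_end|>` layout extends to several turns. This is a minimal sketch that renders a list of `(role, content)` pairs and then opens an assistant turn. The `format_chat` helper is hypothetical, and only the `user` and `assistant` roles shown above are assumed.

```python
def format_chat(messages) -> str:
    """Render (role, content) pairs in the layout above and open an assistant turn."""
    parts = [
        f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in messages
    ]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


single_turn = format_chat([("user", "How to get in a good university?")])
print(single_turn)
```

For a single user message, this produces the same string as `formatted_prompt` above.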

In [None]:
result = predictor.predict(
    {"inputs": formatted_prompt, "parameters": {"max_new_tokens":512, "do_sample":"true"}}
)

In [None]:
import json
print(json.loads(result)["generated_text"])

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()