# Llama 2 7B Chat model demo deployment guide for Neuron instances
In this tutorial, you will deploy an LMI (Large Model Inference) container from the AWS Deep Learning Containers (DLC) to SageMaker and run inference with it.

Please make sure the following permissions are granted before running the notebook:

- S3 bucket push access
- SageMaker access

## Step 1: Upgrade the SageMaker SDK and import dependencies

In [None]:
%pip install sagemaker --upgrade  --quiet

In [None]:
import json
import io
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

## Step 2: Start preparing model artifacts
In the LMI container, we expect some artifacts to help set up the model. To use your own model, point the configuration at the location (for example, an S3 bucket) of your model and compiled model artifacts.
- serving.properties (required): Defines the model server settings
- model.py (optional): A Python file that defines the core inference logic
- requirements.txt (optional): Any additional pip packages to install

In [None]:
%%writefile serving.properties
engine=Python
option.model_id=TheBloke/Llama-2-7B-Chat-fp16
option.max_rolling_batch_size=4
option.tensor_parallel_degree=12
option.n_positions=1280
option.rolling_batch=auto
option.model_loading_timeout=2400
option.output_formatter=jsonlines
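
A quick way to sanity-check these settings before packaging is to parse them locally. The sketch below is illustrative only: `parse_properties` and `max_new_tokens_budget` are hypothetical helpers, not part of LMI, and the budget calculation assumes that `option.n_positions` caps the combined length of the prompt and the generated tokens.

```python
def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def max_new_tokens_budget(props, prompt_tokens):
    """Tokens left for generation.

    ASSUMPTION: option.n_positions bounds prompt tokens plus generated tokens.
    """
    return max(int(props["option.n_positions"]) - prompt_tokens, 0)


# Same settings as the serving.properties cell above.
SERVING_PROPERTIES = """\
engine=Python
option.model_id=TheBloke/Llama-2-7B-Chat-fp16
option.max_rolling_batch_size=4
option.tensor_parallel_degree=12
option.n_positions=1280
option.rolling_batch=auto
option.model_loading_timeout=2400
option.output_formatter=jsonlines
"""

props = parse_properties(SERVING_PROPERTIES)
print(props["option.max_rolling_batch_size"])
print(max_new_tokens_budget(props, prompt_tokens=30))
```

Under that assumption, a prompt of 30 tokens would leave room for fewer than 1280 new tokens, so it may be worth keeping `max_new_tokens` below `option.n_positions` when calling the endpoint.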

In [None]:
%%writefile model.py
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
from djl_python import Input, Output
from djl_python.rolling_batch.neuron_rolling_batch import NeuronRollingBatch
from djl_python.properties_manager.tnx_properties import TransformerNeuronXProperties
from djl_python.transformers_neuronx import TransformersNeuronXService
from djl_python.neuron_utils.utils import task_from_config

class CustomTNXService(TransformersNeuronXService):
    def __init__(self) -> None:
        super().__init__()
        self.rolling_batch_config = dict()

    def set_configs(self, properties):
        self.config = TransformerNeuronXProperties(**properties)
        if self.config.rolling_batch != "disable":
            # batch_size needs to match max_rolling_batch_size for precompiled Neuron models running rolling batch
            self.config.batch_size = self.config.max_rolling_batch_size
            if "output_formatter" in properties:
                self.rolling_batch_config["output_formatter"] = properties.get(
                    "output_formatter")

        self.model_config = AutoConfig.from_pretrained(
            self.config.model_id_or_path, revision=self.config.revision)

        self.set_model_loader_class()
        if not self.config.task:
            self.config.task = task_from_config(self.model_config)

    def set_rolling_batch(self):
        if self.config.rolling_batch != "disable":
            self.rolling_batch = NeuronRollingBatch(
                self.model, self.tokenizer, self.config.batch_size,
                self.config.n_positions, **self.rolling_batch_config)


_service = CustomTNXService()

def handle(inputs: Input):
    global _service
    if not _service.initialized:
        _service.initialize(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warm up the model on startup
        return None

    return _service.inference(inputs)

In [None]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
mv model.py mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel
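
If a shell is not available, the same packaging step can be sketched with Python's standard `tarfile` module. This is an assumed equivalent rather than part of the original workflow; `package_model` is a hypothetical helper, and the demo writes placeholder files into a temporary directory so that it does not touch the real artifacts.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model(src_dir, archive_path):
    """Create a gzip tarball whose top-level folder is src_dir's name.

    Returns the sorted member names so the layout can be inspected.
    """
    src_dir = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=src_dir.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo in a throwaway directory with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "mymodel"
    model_dir.mkdir()
    (model_dir / "serving.properties").write_text("engine=Python\n")
    (model_dir / "model.py").write_text("# placeholder\n")
    members = package_model(model_dir, Path(tmp) / "mymodel.tar.gz")
    print(members)
```

Listing the members before uploading is a cheap way to confirm that `serving.properties` and `model.py` sit under the `mymodel/` folder, as in the shell cell above.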

## Step 3: Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.

### Getting the container image URI

See the [available Large Model Inference DLC images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).


In [None]:
image_uri = image_uris.retrieve(
    framework="djl-neuronx",
    region=sess.boto_session.region_name,
    version="0.27.0",
)

### Upload the artifact to S3 and create the SageMaker model

In [None]:
s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print("S3 Code or Model tar ball uploaded")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

### Create SageMaker endpoint

You need to specify the instance type and the endpoint name.

In [None]:
instance_type = "ml.inf2.24xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model-demo")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             container_startup_health_check_timeout=2400,
             volume_size=256,
             endpoint_name=endpoint_name)

# Requests are sent as JSON, so we specify the JSON serializer; streamed responses are parsed manually below
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
)

## Step 4: Test inference

The LineIterator class reassembles the streamed bytes into complete lines, so that each JSON record can be parsed even when it arrives split across several events.

In [None]:
class LineIterator:
    """
    A helper class for parsing the byte stream input. 
    
    The output of the model will be newline-delimited JSON, for example
    (other fields omitted):
    ```
    b'{"token": {"text": " a"}}\n'
    b'{"token": {"text": " challenging"}}\n'
    b'{"token": {"text": " problem"}}\n'
    ...
    ```
    
    While usually each PayloadPart event from the event stream will contain a byte array
    with a full JSON object, this is not guaranteed and some of the JSON objects may be
    split across PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"token": '}}
    {'PayloadPart': {'Bytes': b'{"text": " problem"}}\n'}}
    ```
    
    This class accounts for this by appending the incoming bytes to an internal buffer
    and returning complete lines (ending with a '\n' character) from '__next__'.
    It tracks the last read position to ensure that previously returned bytes are
    not exposed again.
    """
    
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord('\n'):
                self.read_pos += len(line)
                return line[:-1]
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                # Stream ended: return any trailing bytes that lack a final newline, then stop
                if line:
                    self.read_pos += len(line)
                    return line
                raise
            if 'PayloadPart' not in chunk:
                print('Unknown event type:', chunk)
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])
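
To see the reassembly idea in isolation, here is a minimal generator-based sketch that can run without an endpoint. It is not the `LineIterator` class itself: `iter_json_lines` is a hypothetical helper, and the events below are fabricated to mimic a JSON object split across two `PayloadPart` chunks.

```python
import json


def iter_json_lines(events):
    """Yield one decoded JSON object per newline-terminated line in the stream."""
    pending = b""
    for event in events:
        part = event.get("PayloadPart")
        if part is None:
            continue  # skip events that carry no payload bytes
        pending += part["Bytes"]
        # Emit every complete line currently held in the buffer.
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            if line:
                yield json.loads(line)
    if pending.strip():
        yield json.loads(pending)  # trailing line without a final newline


# Fabricated events: the second JSON object is split across two chunks.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"token": {"text": " a"}}\n{"token": '}},
    {"PayloadPart": {"Bytes": b'{"text": " problem"}}\n'}},
]
records = list(iter_json_lines(fake_events))
print([r["token"]["text"] for r in records])
```

The generator keeps only the unconsumed tail in memory, whereas the class above keeps the whole stream in a `BytesIO` buffer and tracks a read position; either approach handles records that straddle chunk boundaries.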

This is a simple helper that sends a prompt to the endpoint, streams the response through the line iterator, and caps the number of new tokens to generate.

In [None]:
def chat_stream_inference(prompt, tokens):
    body = {"inputs": prompt, "parameters": {"max_new_tokens": tokens, "details": True}}
    resp = sess.sagemaker_runtime_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(body),
        ContentType="application/json",
    )
    event_stream = resp['Body']

    for line in LineIterator(event_stream):
        record = json.loads(line)
        # Skip records that carry no token text instead of failing on a missing key
        text = (record.get("token") or {}).get("text")
        if text is not None:
            print(text, end='')
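
When the generated text is needed as a string rather than printed token by token, the request and response handling can be sketched offline as follows. `build_request_body` and `join_token_text` are hypothetical helpers, and the sample lines are made up; only the `token.text` field and the request parameters are taken from the cell above.

```python
import json


def build_request_body(prompt, max_new_tokens):
    """Serialize the request payload used by the streaming call."""
    body = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "details": True},
    }
    return json.dumps(body)


def join_token_text(lines):
    """Concatenate token.text from JSON lines, skipping records without it."""
    pieces = []
    for line in lines:
        record = json.loads(line)
        text = (record.get("token") or {}).get("text")
        if text is not None:
            pieces.append(text)
    return "".join(pieces)


# Made-up response lines; the last one has no token and is ignored.
sample_lines = [
    b'{"token": {"text": "Once"}}',
    b'{"token": {"text": " upon"}}',
    b'{"note": "no token here"}',
]
print(build_request_body("Hello", 16))
print(join_token_text(sample_lines))
```

Returning the joined string makes it easier to post-process or log the output, at the cost of losing the incremental display.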

### Text Generation

In [None]:
first_prompt = "[INST] Tell me the story of Red Riding Hood in a 4 act play with lots of emojis [/INST]"

chat_stream_inference(first_prompt, 1280)
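
The `[INST] ... [/INST]` tags in the prompt above follow the Llama 2 chat format. A small helper can apply them consistently; the sketch below assumes the commonly documented single-turn template, including the optional `<<SYS>>` system block, and `build_llama2_prompt` is a hypothetical name.

```python
def build_llama2_prompt(user_message, system_prompt=None):
    """Wrap a single-turn message in Llama 2 chat tags.

    The system block is optional and is only added when a prompt is given.
    """
    if system_prompt:
        return (
            f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"


print(build_llama2_prompt("Tell me a short story."))
print(build_llama2_prompt("Tell me a short story.", system_prompt="Answer in one paragraph."))
```

Chat-tuned models may respond more reliably when every prompt uses the template they were trained on, so it can be worth wrapping plain prompts with such a helper before sending them.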

### Translation

In [None]:
second_prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe peluche
cheese => """

chat_stream_inference(second_prompt, 3)

### Question Answering

In [None]:
third_prompt = "Could you remind me when was the C programming language invented?"

chat_stream_inference(third_prompt, 25)

### Classification

In [None]:
fourth_prompt = """Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been :+1:"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredible"
Sentiment:"""

chat_stream_inference(fourth_prompt, 2)

### Summarization

In [None]:
fifth_prompt = """Your task is to summarize the following article into five bullet points. The article will be denoted by three backticks.
```
Amazon SageMaker is a cloud based machine-learning platform that allows the creation, training, and deployment by developers of machine-learning (ML) models on the cloud. It can be used to deploy ML models on embedded systems and edge-devices. SageMaker was launched in November 2017.

Capabilities
SageMaker enables developers to operate at a number of different levels of abstraction when training and deploying machine learning models. At its highest level of abstraction, SageMaker provides pre-trained ML models that can be deployed as-is. In addition, SageMaker provides a number of built-in ML algorithms that developers can train on their own data. Further, SageMaker provides managed instances of TensorFlow and Apache MXNet, where developers can create their own ML algorithms from scratch. Regardless of which level of abstraction is used, a developer can connect their SageMaker-enabled ML models to other AWS services, such as the Amazon DynamoDB database for structured data storage, AWS Batch for offline batch processing, or Amazon Kinesis for real-time processing.

Development interfaces
A number of interfaces are available for developers to interact with SageMaker. First, there is a web API that remotely controls a SageMaker server instance. While the web API is agnostic to the programming language used by the developer, Amazon provides SageMaker API bindings for a number of languages, including Python, JavaScript, Ruby, Java, and Go. In addition, SageMaker provides managed Jupyter Notebook instances for interactively programming SageMaker and other applications.
```
In summary what are the main points of the article above?"""

chat_stream_inference(fifth_prompt, 250)

## Clean up the environment

Run the cell below to delete the endpoint, the endpoint configuration, and the model after testing.

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()