# Deploy Llama 3 to SageMaker

## Configuration

In [104]:
import dotenv
import os
import sagemaker
import sagemaker.huggingface
import boto3

dotenv.load_dotenv(".env", override=True)

role_name = os.environ.get("role_name")
model_id = os.environ.get("model_id")
image_uri = sagemaker.huggingface.get_huggingface_llm_image_uri("huggingface")

try:
    role = sagemaker.get_execution_role()  # Works when running inside a SageMaker notebook
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName=role_name)["Role"]["Arn"]


instance_type = "ml.g5.2xlarge"
endpoint_name = "llama-3-8b-tgi"

role_name, model_id, image_uri

Couldn't call 'get_role' to get Role ARN from role name wellxie to get Role path.


('HuggingfaceExecuteRole',
 'shenzhi-wang/Llama3-8B-Chinese-Chat',
 '763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.0.2-gpu-py310-cu121-ubuntu22.04')
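
The configuration cell reads `role_name` and `model_id` from `.env`, and `os.environ.get` silently returns `None` when a key is missing. As a minimal sketch (the helper name `require_env` is hypothetical and not part of the notebook), the settings could be checked up front so a missing key fails with a clear message:

```python
import os


def require_env(names, environ=None):
    """Return the values for `names`, raising if any is missing or empty.

    `require_env` is a hypothetical helper; `environ` defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing required settings: {', '.join(missing)}")
    return [environ[name] for name in names]


# Example with an in-memory mapping instead of the real environment
role_name, model_id = require_env(
    ["role_name", "model_id"],
    environ={
        "role_name": "HuggingfaceExecuteRole",
        "model_id": "shenzhi-wang/Llama3-8B-Chinese-Chat",
    },
)
```

In the notebook itself, calling `require_env(["role_name", "model_id"])` after `dotenv.load_dotenv(...)` would read from the real environment.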

## Deployment


In [105]:

model = sagemaker.huggingface.HuggingFaceModel(
  image_uri=image_uri,
  role=role,
  env={
    'HF_MODEL_ID': model_id,
    'SM_NUM_GPUS': "1",  # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': "4000",  # Max length of the input, in tokens
    'MAX_TOTAL_TOKENS': "8192",  # Max total tokens per request (input + generated)
    # 'MAX_BATCH_TOTAL_TOKENS': "4096",  # Limits the number of tokens that can be processed in parallel during generation
    # 'MESSAGES_API_ENABLED': "True",  # Keep the Messages API disabled: with it enabled, a streaming request returns one full JSON response split into chunks rather than true token-by-token streaming.
    # 'HUGGING_FACE_HUB_TOKEN': "<REPLACE WITH YOUR TOKEN>",  # Token used to log in to the Hugging Face Hub
  }
)


llm = model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  endpoint_name=endpoint_name,
  container_startup_health_check_timeout=1800
)


------------!
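
The two limits set in `env` interact at request time: the input must fit within `MAX_INPUT_LENGTH`, and input tokens plus `max_new_tokens` must fit within `MAX_TOTAL_TOKENS`. The sketch below illustrates that arithmetic only. The function name `max_new_tokens_budget` is hypothetical, and it assumes you already know the input's token count (for example, from the model's tokenizer).

```python
def max_new_tokens_budget(input_tokens, max_input_length=4000, max_total_tokens=8192):
    """Largest `max_new_tokens` that fits alongside `input_tokens` input tokens.

    Defaults mirror MAX_INPUT_LENGTH and MAX_TOTAL_TOKENS from the deployment above.
    """
    if input_tokens > max_input_length:
        raise ValueError(
            f"input has {input_tokens} tokens, limit is {max_input_length}"
        )
    return max_total_tokens - input_tokens


budget = max_new_tokens_budget(192)
print(budget)  # 8000
```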

## Testing

In [107]:

import boto3
import json

smr = boto3.client('sagemaker-runtime')


endpoint_name = "llama-3-8b-tgi"


# prompt = "Write a sonnet about spring."
prompt = """
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant. Answer briefly!<|eot_id|>
<|start_header_id|>user<|end_header_id|>
write a sonnet about spring.<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""


response_model = smr.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompt,
            "parameters": {
                "top_p": 0.6,
                # TGI rejects the request with an input validation error if
                # `inputs` tokens + `max_new_tokens` exceeds MAX_TOTAL_TOKENS
                # (8192 here), so keep the input short enough.
                "max_new_tokens": 8000,
                "temperature": 0.9,
                "return_full_text": False,
                "stop": ["<|eot_id|>"],
            },
            "stream": True,
        }
    ),
    ContentType="application/json",
)


stream = response_model.get("Body")

if stream:
    buffer = b""  # Accumulate raw bytes; a payload part may end mid-message or mid-character
    for event in stream:
        chunk = event.get("PayloadPart")
        if chunk:
            buffer += chunk.get("Bytes")
            message = buffer.strip()

            # Parse only once the buffer holds one complete `data:{...}` message
            if message.startswith(b"data:") and message.endswith(b"}"):
                chunk_obj = json.loads(message[len(b"data:"):])  # Drop only the leading prefix
                buffer = b""
                print(chunk_obj.get("token").get("text"), end="")


As spring awakens from its winter sleep,
The world regains its vibrant hue and life,
Buds burst forth, and flowers bloom with glee,
In every color, a beauty to strife.

The sun shines bright, and warmth fills the air,
Birds sing their sweet melodies so free,
A time of renewal, hope and joy are there,
A season of growth, a time to be.

The earth awakens from its winter's sleep,
And all around, new life begins to creep,
A time of promise, a time to reap,
A season of beauty, a time to keep.

So let us cherish this time of year,
And let our spirits soar, without a fear.<|eot_id|>
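
The streaming loop in the test cell has to cope with payload parts that do not line up with message boundaries. Here is a rough sketch of a line-based parser for that case. It assumes each message arrives as a single `data:{...}` line terminated by a newline and carries a `token.text` field, as in the output above. The name `iter_sse_tokens` is hypothetical, and the byte chunks are simulated rather than read from a real endpoint.

```python
import json


def iter_sse_tokens(byte_chunks):
    """Yield token text from an iterable of raw byte chunks of `data:{...}` lines."""
    buffer = b""
    for part in byte_chunks:
        buffer += part
        # Everything before the last newline is complete; the rest may be partial.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            line = line.strip()
            if line.startswith(b"data:"):
                message = json.loads(line[len(b"data:"):])
                yield message["token"]["text"]
    # Flush a final message that arrived without a trailing newline.
    tail = buffer.strip()
    if tail.startswith(b"data:"):
        yield json.loads(tail[len(b"data:"):])["token"]["text"]


# Simulated payload parts: one message split in two, then two messages in one part
parts = [
    b'data:{"token":{"text":"Hel',
    b'lo"}}\n\n',
    b'data:{"token":{"text":" wor"}}\n\ndata:{"token":{"text":"ld"}}\n\n',
]
result = "".join(iter_sse_tokens(parts))
print(result)  # Hello world
```

With the real response, the chunks could be supplied as `(event["PayloadPart"]["Bytes"] for event in stream if "PayloadPart" in event)`.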

## Configure BRConnector

Once the SageMaker endpoint is deployed, you can configure BRConnector to use it.

First, open the Manager UI and add a new model, configuring the parameters as follows:

```json
{
  "regions": [
    "us-west-2"
  ],
  "endpointName": "llama-3-8b-tgi"
}
```


![image.png](./llama3-config.jpg)

Remember to replace the region and endpoint name with those of your own deployment, and to authorize the model for a Group or API Key.
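
To avoid typing the region and endpoint name by hand, the parameters could be generated from the same values used in the notebook. This is only a sketch: the function name `brconnector_model_config` is hypothetical, and the field names are copied from the JSON shown above rather than from any BRConnector API reference.

```python
import json


def brconnector_model_config(endpoint_name, regions):
    """Build the model parameters JSON in the shape shown in this section."""
    config = {"regions": list(regions), "endpointName": endpoint_name}
    return json.dumps(config, indent=2)


config_text = brconnector_model_config("llama-3-8b-tgi", ["us-west-2"])
print(config_text)
```

The printed text can then be pasted into the model parameters field in the Manager UI.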
