# Hugging Face Text Generation Inference available for AWS Inferentia2

Text Generation Inference (TGI) is a purpose-built solution for deploying and serving Large Language Models (LLMs) for production workloads at scale. TGI enables high-performance text generation using Tensor Parallelism and continuous batching for the most popular open LLMs, including Llama, Mistral, and more. Text Generation Inference is used in production by companies such as Grammarly, Uber, Deutsche Telekom, and many more.

The integration of TGI into Amazon SageMaker, in combination with AWS Inferentia2, presents a powerful solution and viable alternative to GPUs for building production LLM applications. The seamless integration ensures easy deployment and maintenance of models, making LLMs more accessible and scalable for a wide range of production use cases.

With the new TGI for AWS Inferentia2 on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low-latency LLM experiences like HuggingChat, OpenAssistant, and Serverless Endpoints for LLMs on the Hugging Face Hub.


## Deploy Mistral-7b on AWS Inferentia2 using Amazon SageMaker

Mistral-7B-Instruct-v0.2 is an instruction fine-tuned, 7-billion-parameter generative text model from Mistral AI. This demo runs it on AWS Inferentia2.

We are going to show you how to:

1. Set up development environment
2. Retrieve the TGI Neuronx Image
3. Deploy Mistral-7B to Amazon SageMaker
4. Run inference and chat with the model

Let’s get started.

### 1. Set up development environment
We are going to use the SageMaker Python SDK to deploy Mistral-7B to Amazon SageMaker. Make sure you have an AWS account configured and the SageMaker Python SDK installed.
You need access to an IAM Role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

In [3]:
!pip install transformers "sagemaker>=2.206.0" --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
optimum-neuron 0.0.20 requires transformers==4.36.2, but you have transformers 4.39.0 which is incompatible.

[notice] A new release of pip is available: 23.3.1 -> 24.0
[notice] To update, run: pip install --upgrade pip


In [4]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it doesn't exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::850751315356:role/SagemakerEMRNoAuthProductWi-SageMakerExecutionRole-9PI0ILA029B4
sagemaker session region: us-east-2


### 2. Retrieve TGI Neuronx Image

The new Hugging Face TGI Neuronx DLCs can be used to run inference on AWS Inferentia2. You can use the get_huggingface_llm_image_uri method of the SageMaker Python SDK to retrieve the appropriate Hugging Face TGI Neuronx DLC URI based on your desired backend, session, region, and version. You can find all the available versions [here](https://github.com/aws/deep-learning-containers/releases?q=tgi+AND+neuronx&expanded=true).


In [12]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface-neuronx",
  version="0.0.20"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")


llm image uri: 763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.20-neuronx-py310-ubuntu22.04
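
The printed URI packs the registry account, region, repository, and image tag into one string. As a rough sketch, it can be split into those parts, for example to confirm that the image region matches the session region. The `parse_ecr_uri` helper and the regular expression are illustrative assumptions, not part of the SageMaker SDK, and the pattern only covers the standard `amazonaws.com` host form shown above.

```python
import re
from typing import NamedTuple


class EcrImage(NamedTuple):
    account: str
    region: str
    repository: str
    tag: str


# Pattern for "<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>".
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri: str) -> EcrImage:
    """Split an ECR image URI into its components (hypothetical helper)."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised ECR image URI: {uri}")
    return EcrImage(**match.groupdict())


# URI copied from the cell output above.
image = parse_ecr_uri(
    "763104351884.dkr.ecr.us-east-2.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.20-neuronx-py310-ubuntu22.04"
)
print(image.region, image.tag)
```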


## Text Generation Inference

Text Generation Inference (TGI) on Inferentia2 supports popular open LLMs, including Llama, Mistral, and more. You can check the full list of supported models (text-generation) [here](https://huggingface.co/docs/optimum-neuron/package_reference/export#supported-architectures).

You can find detailed information about the base model on its [Model Card](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).

Compilation arguments of the pre-compiled model:

{
  "num_cores": 2,
  "auto_cast_type": "fp16"
}

Input shapes of the pre-compiled model:

{
  "sequence_length": 2048,
  "batch_size": 1
}
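
Because these shapes are fixed at compile time, the prompt and the generated tokens have to share a single window of `sequence_length` tokens. The sketch below works out how many new tokens remain for a given prompt length. The function name `max_new_tokens_budget` is hypothetical, and the prompt length is assumed to be counted with the model's own tokenizer.

```python
# Static shapes of the compiled model, taken from the JSON above.
SEQUENCE_LENGTH = 2048
BATCH_SIZE = 1


def max_new_tokens_budget(prompt_tokens: int, sequence_length: int = SEQUENCE_LENGTH) -> int:
    """Return how many tokens can still be generated after a prompt of the given length."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # A prompt that already fills (or exceeds) the window leaves no room to generate.
    return max(sequence_length - prompt_tokens, 0)


# A 1024-token prompt leaves 1024 tokens for generation.
print(max_new_tokens_budget(1024))
```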

In [15]:
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config & model config
instance_type = "ml.inf2.8xlarge"
health_check_timeout = 900
batch_size = 1
sequence_length = 2048

# Define model and endpoint configuration parameters
config = {
  # Pre-compiled Llama 2 alternative: "aws-neuron/Llama-2-7b-chat-hf-seqlen-2048-bs-1"
  'HF_MODEL_ID': "aws-neuron/Mistral-7B-Instruct-v0.2-seqlen-2048-bs-1-cores-2",
  'MAX_CONCURRENT_REQUESTS': json.dumps(batch_size),
  'MAX_INPUT_LENGTH': json.dumps(1024),
  'MAX_TOTAL_TOKENS': json.dumps(sequence_length),
  # TGI rejects the configuration unless max_batch_prefill_tokens >= max_input_length
  'MAX_BATCH_PREFILL_TOKENS': json.dumps(int(sequence_length*batch_size / 2)),
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(sequence_length*batch_size),
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)
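
The token limits in the configuration above depend on each other, and an inconsistent combination only fails once the container starts. The following sketch checks them locally first. Only the `max_batch_prefill_tokens` rule is taken from the error message quoted in the cell above; the other two rules are assumptions about how TGI validates its arguments, and `validate_tgi_limits` is a hypothetical helper.

```python
from typing import List


def validate_tgi_limits(
    max_input_length: int,
    max_total_tokens: int,
    max_batch_prefill_tokens: int,
    max_batch_total_tokens: int,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the limits look consistent."""
    problems = []
    # Assumption: there must be room for at least one generated token.
    if max_input_length >= max_total_tokens:
        problems.append("`max_total_tokens` must be > `max_input_length`")
    # Rule quoted in the configuration cell above.
    if max_batch_prefill_tokens < max_input_length:
        problems.append("`max_batch_prefill_tokens` must be >= `max_input_length`")
    # Assumption: a batch must be able to hold at least one full-length sequence.
    if max_batch_total_tokens < max_total_tokens:
        problems.append("`max_batch_total_tokens` must be >= `max_total_tokens`")
    return problems


# Values used in the configuration above.
sequence_length, batch_size = 2048, 1
print(validate_tgi_limits(
    max_input_length=1024,
    max_total_tokens=sequence_length,
    max_batch_prefill_tokens=int(sequence_length * batch_size / 2),
    max_batch_total_tokens=sequence_length * batch_size,
))
```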

### 3. Deploy Mistral-7B to Amazon SageMaker

### Compiling LLMs for Inferentia2

At the time of writing, AWS Inferentia2 does not support dynamic shapes for inference, so the sequence length and batch size must be specified ahead of time. To make it easier for customers to use the full power of Inferentia2, we created a Neuron model cache, which contains pre-compiled configurations for the most popular LLMs. A cached configuration is defined by the model architecture (Mistral), model size (7B), Neuron version (2.16), number of Inferentia cores (2), batch size (2), and sequence length (2048).

This means compiling fine-tuned checkpoints for Mistral 7B with the same configuration will take only a few minutes. Examples of this are mistralai/Mistral-7B-v0.1 and HuggingFaceH4/zephyr-7b-beta.
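
The pre-compiled model IDs used in this walkthrough appear to encode their static shapes in the name (`-seqlen-2048-bs-1-cores-2`). As a sketch, those values can be read back from the ID so the endpoint settings stay in sync with the compiled artifact. This naming pattern is inferred from the two IDs that appear in this document, not from a documented convention, and `shapes_from_model_id` is a hypothetical helper.

```python
import re
from typing import Dict

# Suffix pattern inferred from the model IDs used in this document.
_SHAPE_SUFFIX = re.compile(
    r"-seqlen-(?P<sequence_length>\d+)-bs-(?P<batch_size>\d+)(?:-cores-(?P<num_cores>\d+))?$"
)


def shapes_from_model_id(model_id: str) -> Dict[str, int]:
    """Extract the static shapes from a model ID, assuming the suffix pattern above."""
    match = _SHAPE_SUFFIX.search(model_id)
    if match is None:
        raise ValueError(f"no shape suffix found in: {model_id}")
    # The cores part is optional, so skip groups that did not match.
    return {key: int(value) for key, value in match.groupdict().items() if value is not None}


shapes = shapes_from_model_id("aws-neuron/Mistral-7B-Instruct-v0.2-seqlen-2048-bs-1-cores-2")
print(shapes)
```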

In [19]:
# Deploy model to an endpoint
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)

Your model is not compiled. Please compile your model before using Inferentia.


------------------------!

In [20]:
from transformers import AutoTokenizer

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("aws-neuron/Mistral-7B-Instruct-v0.2-seqlen-2048-bs-1-cores-2")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

# Generation arguments
payload = {
    "do_sample": False,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
    "return_full_text": True,
    "stop": ["</s>"]
}

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
chat = llm.predict({"inputs":prompt, "parameters":payload})
print(chat[0]["generated_text"][len(prompt):])

 Yes, I can certainly help you with a classic mayonnaise recipe. Here's a simple one that you can make at home:

Ingredients:
- 1 egg yolk
- 1 tablespoon of Dijon mustard
- 1 cup of vegetable oil (canola or safflower oil work well)
- 1-2 tablespoons of white wine vinegar or lemon juice
- Salt to taste

Instructions:
1. In a medium-sized bowl, whisk together the egg yolk and mustard until it becomes pale and thick.
2. Start adding the oil very slowly, drop by drop, while continuously whisking the mixture. This is called the "emulsion" stage, and it's important to keep the mixture thick and smooth.
3. Once you've added about a quarter of the oil, you can start adding it in a thin, steady stream. Make sure the oil is fully incorporated before adding more.
4. If the mayonnaise starts to thicken too much, you can add a few drops of water to thin it out.
5. Once all the oil has been added, whisk in the vinegar or lemon juice and salt to taste.
6. Refrigerate the mayonnaise for at least an ho
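
To make the request above less opaque, the sketch below shows roughly what the chat template produces and how the prompt is stripped from the response when `return_full_text` is true. The `[INST] ... [/INST]` rendering is only an approximation of the Mistral instruct format: the exact whitespace and special tokens are defined by the tokenizer's own template, which remains the source of truth. Both function names are hypothetical.

```python
from typing import Dict, List


def approximate_mistral_prompt(messages: List[Dict[str, str]], bos: str = "<s>", eos: str = "</s>") -> str:
    """Approximate the instruct prompt format; not a replacement for apply_chat_template."""
    parts = [bos]
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        elif message["role"] == "assistant":
            parts.append(f"{message['content']}{eos}")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    return "".join(parts)


def extract_completion(response: List[Dict[str, str]], prompt: str) -> str:
    """Return only the newly generated text from a response shaped like the one above."""
    text = response[0]["generated_text"]
    # Only strip the prefix when the endpoint actually echoed the prompt back.
    return text[len(prompt):] if text.startswith(prompt) else text


prompt = approximate_mistral_prompt([{"role": "user", "content": "Do you have mayonnaise recipes?"}])
fake_response = [{"generated_text": prompt + " Yes, here is one."}]  # stand-in for llm.predict(...)
print(extract_completion(fake_response, prompt))
```

Checking `startswith` before slicing keeps the helper safe if `return_full_text` is later switched off, in which case blind slicing by `len(prompt)` would cut into the answer.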

### Usage with Optimum-Neuron

The same pre-compiled model can also be loaded directly with the optimum-neuron pipeline API. As the output below shows, this requires the transformers_neuronx package to be installed.

In [10]:
from optimum.neuron import pipeline

# Load pipeline from Hugging Face repository
pipe = pipeline("text-generation", "aws-neuron/Mistral-7B-Instruct-v0.2-seqlen-2048-bs-1-cores-2")


ModuleNotFoundError: _from_pretrained requires the `transformers_neuronx` package. You can install it by running: pip install transformers_neuronx
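
The error above is raised only once the pipeline tries to load the model. A small pre-flight check, sketched below with the standard library, can report the missing dependency earlier. The module name `transformers_neuronx` and the install command come from the error message itself; `missing_modules` is a hypothetical helper and only handles top-level module names.

```python
from importlib import util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if util.find_spec(name) is None]


missing = missing_modules(["transformers_neuronx"])
if missing:
    # Install hint copied from the error message above.
    print("Missing modules, install with: pip install " + " ".join(missing))
```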