## lmi-dist rollingbatch Mixtral-8x7B deployment guide

# Deploying Mixtral with LMI 

### In this tutorial, you will use the vLLM backend of the Large Model Inference (LMI) DLC to deploy an AWQ-quantized Mixtral-8x7B model and run inference with it.

Please make sure the following permissions are granted before running the notebook:

* S3 bucket push access
* SageMaker access




### Step 1: Upgrade the SageMaker SDK and import dependencies

In [None]:
%pip install sagemaker boto3 awscli huggingface_hub --upgrade  --quiet

In [None]:
import json
import os
import time
from pathlib import Path

import boto3
import jinja2
import sagemaker
from huggingface_hub import snapshot_download
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # AWS region of the current SageMaker session
account_id = sess.account_id() 
s3_client = boto3.client("s3")


### Step 2.0: Download model artifacts

In [None]:

bucket = sess.default_bucket()  # bucket for the serving code tarball
model_bucket = sess.default_bucket()  # bucket for the model weights
s3_code_prefix = "hf-large-model-djl/mixtral8-7b-awq"  # folder within the bucket for the code tarball
s3_model_prefix = "mixtral8-7b/lmi"  # folder within the bucket where the model weights will go

jinja_env = jinja2.Environment()

In [None]:
# Download the model into the directory where the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = "TheBloke/mixtral-8x7b-v0.1-AWQ"
# Only download config, tokenizer, and weight files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model", "*.safetensors"]
# Use snapshot_download because the model files are stored in the repository using Git LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)
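
To get a feel for which repository files the `allow_patterns` filter would keep, the patterns can be tried against a list of file names with the standard-library `fnmatch` module. This is only a rough sketch: the file names below are hypothetical, and `huggingface_hub` applies its own matching internally, so treat the result as an approximation of the filter rather than its exact implementation.

```python
from fnmatch import fnmatch

# Same patterns as in the download cell.
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model", "*.safetensors"]

# Hypothetical repository listing, for illustration only.
repo_files = [
    "config.json",
    "tokenizer.model",
    "model-00001-of-00003.safetensors",
    "README.md",
    ".gitattributes",
]


def is_allowed(filename, patterns):
    """Return True if the file name matches at least one glob pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


kept = [name for name in repo_files if is_allowed(name, allow_patterns)]
skipped = [name for name in repo_files if not is_allowed(name, allow_patterns)]
print("kept:", kept)
print("skipped:", skipped)
```

Files such as `README.md` are skipped, which keeps the upload to S3 limited to what the container needs to load the model.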

In [None]:
# define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"
print(f"Pretrained model will be uploaded to ---- > {pretrained_model_location}")

In [None]:
model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {model_artifact}")
print(f"We will set option.model_id={model_artifact}")

### Step 2: Start preparing model artifacts

The LMI container expects the following artifacts to set up the model:

* serving.properties (required): defines the model server settings
* model.py (optional): a Python file that defines the core inference logic
* requirements.txt (optional): any additional pip packages to install

In [None]:
!mkdir -p mymodel

In [None]:
%%writefile ./mymodel/serving.properties
engine=Python
option.model_id={{s3url}}
option.tensor_parallel_degree=4
option.max_rolling_batch_size=16
option.quantize=awq
option.rolling_batch=vllm
option.max_model_len=25456
option.dtype=fp16
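
As a rough sanity check on `option.tensor_parallel_degree=4`, the arithmetic below estimates how much GPU memory the quantized weights need per device. Every input is an assumption rather than a value taken from this notebook: roughly 46.7 billion parameters for Mixtral-8x7B, about 4 bits per weight for AWQ, and 24 GB per GPU on the instance type used later. The estimate ignores quantization scales, layers that are not quantized, and runtime overhead, so real usage will be higher.

```python
# Back-of-the-envelope estimate; all inputs are assumptions.
TOTAL_PARAMS = 46.7e9        # approximate Mixtral-8x7B parameter count
BITS_PER_WEIGHT = 4          # AWQ stores weights in roughly 4 bits
TENSOR_PARALLEL_DEGREE = 4   # matches option.tensor_parallel_degree
GPU_MEMORY_GB = 24           # assumed per-GPU memory on ml.g5.12xlarge

weights_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
weights_per_gpu_gb = weights_gb / TENSOR_PARALLEL_DEGREE
headroom_per_gpu_gb = GPU_MEMORY_GB - weights_per_gpu_gb

print(f"Quantized weights: ~{weights_gb:.1f} GB in total")
print(f"Per GPU: ~{weights_per_gpu_gb:.1f} GB of weights")
print(f"Per GPU: ~{headroom_per_gpu_gb:.1f} GB left for KV cache and activations")
```

The remaining headroom is what the KV cache can use, which is why `option.max_model_len` and `option.max_rolling_batch_size` may need tuning on smaller GPUs.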

In [None]:
# Render the S3 model location into the `serving.properties` template
properties_path = Path("mymodel/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3url=pretrained_model_location))
!pygmentize mymodel/serving.properties | cat -n
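
Before packaging, it can help to confirm that the rendered file no longer contains a `{{...}}` placeholder. The sketch below parses simple `key=value` lines and reports any value that still looks like an unrendered template. The helper names and the sample content are hypothetical; in the notebook you would pass the text of `mymodel/serving.properties` instead.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def unrendered_keys(properties):
    """Return keys whose values still contain a Jinja-style placeholder."""
    return [key for key, value in properties.items() if "{{" in value]


# Hypothetical rendered content, for illustration only.
rendered = """engine=Python
option.model_id=s3://example-bucket/mixtral8-7b/lmi/
option.tensor_parallel_degree=4
option.rolling_batch=vllm
"""

props = parse_properties(rendered)
print(props["option.model_id"])
print(unrendered_keys(props))
```

An empty list from `unrendered_keys` suggests the template was rendered; a non-empty list points at the keys to fix before uploading.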

In [None]:
%%sh
rm -f mymodel.tar.gz
rm -rf mymodel/.ipynb_checkpoints
tar czvf mymodel.tar.gz -C mymodel .
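
If a shell is not available, the same packaging step can be approximated with the standard-library `tarfile` module. The sketch below is an illustration under assumptions, not the notebook's own method: `package_model_dir` is a hypothetical helper, and the demo runs against a temporary directory with a placeholder file. It places the directory contents at the root of the archive, as `tar -C mymodel .` does.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model_dir(model_dir, archive_path):
    """Create a gzip tarball with the directory contents at the archive root."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in sorted(Path(model_dir).iterdir()):
            if path.name == ".ipynb_checkpoints":
                continue  # skip notebook checkpoints, as the shell cell does
            tar.add(str(path), arcname=path.name)
    # Re-open the archive to list what was actually written.
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return tar.getnames()


# Demonstration in a temporary directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "mymodel"
    demo_dir.mkdir()
    (demo_dir / "serving.properties").write_text("engine=Python\n")
    (demo_dir / ".ipynb_checkpoints").mkdir()
    members = package_model_dir(demo_dir, Path(tmp) / "mymodel.tar.gz")

print(members)
```

Listing the members after writing is a cheap way to confirm that `serving.properties` sits at the top level of the tarball.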

### Step 3: Start building SageMaker endpoint

#### Getting the container image URI

See the available Large Model Inference DLCs [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

In [None]:
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.26.0",
)

#### Upload the artifact to S3 and create a SageMaker model

In [None]:
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

#### Create SageMaker endpoint with a specified instance type

In [None]:
instance_type = "ml.g5.12xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model-mixtral-8x7b-12x")
print(f"endpoint_name: {endpoint_name}")

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=1800,
)

# Requests and responses are JSON, so we specify both the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

In [None]:
predictor.endpoint_name

In [None]:
# Uncomment the lines below to attach to an endpoint that is already deployed.
# predictor = sagemaker.Predictor(
#     endpoint_name='jumpstart-dft-hf-llm-mixtral-8x7b-endpoint',
#     sagemaker_session=sess,
#     serializer=serializers.JSONSerializer(),
# )

### Step 4: Run inference

The cells below send a few example prompts to the endpoint.


In [None]:
predictor.predict(
    {"inputs": "The future of Artificial Intelligence is", "parameters": {"max_new_tokens":128, "do_sample":True}}
)

In [None]:
predictor.predict(
    {"inputs": "what is the derivative of x squared", "parameters": {"max_new_tokens":128, "do_sample":True}}
)
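
The request bodies above share one shape: an `inputs` string plus a `parameters` dictionary. A small helper can build that shape and catch obvious mistakes before a request is sent. This is a minimal sketch: `build_payload` is a hypothetical helper, and `temperature` is shown only as an example of an extra generation parameter, so check the LMI documentation for the parameters your container version accepts.

```python
import json


def build_payload(prompt, max_new_tokens=128, do_sample=True, **extra_parameters):
    """Build a request body with "inputs" and "parameters" keys."""
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    parameters = {"max_new_tokens": max_new_tokens, "do_sample": do_sample}
    parameters.update(extra_parameters)
    return {"inputs": prompt, "parameters": parameters}


example_payload = build_payload(
    "The future of Artificial Intelligence is", temperature=0.7
)
print(json.dumps(example_payload, indent=2))
# In the notebook, this would be sent with: predictor.predict(example_payload)
```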

### Step 5: Inference Recommender

#### Inference Recommender for Mixtral with LMI

In [None]:
import json

# Long RAG-style prompt used as the sample payload for the Inference Recommender job
payload = {
    "inputs": (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know, don't try to make up an answer.:\n\n"
        "Also, with information from a knowledge library, without any retraining, see retrieval augmented generation. "
        "Prompt engineering is the process of designing and refining the prompts or input for a large model to generate specific types of output. "
        "Prompt engineering involves selecting appropriate keywords, providing context, and shaping the input in a way that encourages the model to produce the desired response and is the vital technique to actively shape the behavior and output of foundation models. "
        "Effective prompt engineering is crucial for directing model behavior and achieving desired responses. "
        "Through prompt engineering, you can control a model's tone, style, and domain expertise without more involved customization measures like fine-tuning. "
        "The goal is to provide sufficient context and guidance to the model so that it can generalize and perform well on unseen or limited data scenarios. "
        "Fine-tuning a pre-trained foundation model is an affordable way to take advantage of their broad capabilities while customizing a model on your own small corpus. "
        "Fine-tuning is the customization method that involved further training and does change the weights of your model. "
        "Fine-tuning might be useful.\n\n"
        "customization method that involved further training and does change the weights of your model. "
        "Fine tuning might be useful.\n\n"
        "you if you need to customize your model to specific business needs, your model to successfully work with domain-specific language such as industry jargon, technical terms, or other specialized vocabulary. "
        "Enhanced performance for specific tasks. "
        "Accurate, relative, and context-aware responses in applications. "
        "Responses that are more factual, less toxic, and better aligned to specific requirement. "
        "There are two main approaches that you can take for fine-tuning depending on your use case and chosen foundation model. "
        "If you are interested in fine-tuning your model on domain-specific data, see domain adaptation fine-tuning. "
        "If you are interested in instruction-based fine-tuning, using prompt and response examples, see instruction-based fine-tuning. "
        "Retrieval Augmented generation. "
        "Foundation models are usually trained offline, making the model agnostic to any data that is created after the model was trained. "
        "Additionally, foundation models are trained on very general domain corpora, making them less effective for domain-specific tasks. "
        "You can use Retrieval Augmented Generation, or RAG, to retrieve data from outside of foundation model and augment your prompts by adding the relevant retrieval data in context. "
        "For more information about RAG model architectures, see Retrieval Augmented Generation for Knowledge-Intensive NLP Tasks. "
        "With RAG, the external data used to augment your prompts can come from multiple data source, such as.\n\n"
        "Amazon SageMaker is a fully managed machine learning service. "
        "With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. "
        "It provides an integrated Jupyter authoring notebook instance for easy access for your data source for exploration and analysis. "
        "So you don't have to manage servers. "
        "It also provides common machine learning algorithms that are optimized to run efficiently against extremely large data in a distributed environment. "
        "With native support for bring your own algorithms and frameworks, SageMaker offers flexible distributed training options that adjust to your specific workloads. "
        "Deploy a model into a secure and scalable environment by launching it with a few clicks from SageMaker Studio or the SageMaker Console. "
        "Nation models are extremely powerful models able to solve a wide array of tasks. "
        "To solve most tasks efficiently, these models require some form of customization. "
        "The recommended way to first customize a foundation model to a specific use case is through prompt engineering. "
        "Providing your foundation model with well-engineered, context-rich prompts can help achieve desired results without any fine-tuning or changing of model weights. "
        "For more information, see prompt engineering for foundation models.\n\n"
        "If prompt engineering alone is not enough to customize your foundation model to a specific task, you can fine-tune the foundation model on additional domain-specific data. "
        "The fine-tuning process involves changing model weights. "
        "If you want to customize your model\n\n"
        "Question: how to prompt engineer\nHelpful Answer:"
    ),
    "parameters": {"max_new_tokens": 1200, "do_sample": True},
}

with open('payload2.json', 'w') as f:
    json.dump(payload, f)

In [None]:
!tar -czvf payload2.tar.gz payload2.json

In [None]:
s3_location = f"s3://{bucket}/sagemaker/InferenceRecommender/djl-inference"
payload_tar_url = sagemaker.s3.S3Uploader.upload("payload2.tar.gz", s3_location)
print(payload_tar_url)

In [None]:
mixtral_endpoint_name = predictor.endpoint_name
mixtral_model_name = model.name

In [None]:
import time 
import boto3

sm_client = boto3.client('sagemaker')

job_name = "Mixtral-8x7B-v01-awq-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
response = sm_client.create_inference_recommendations_job(
    JobName=job_name,
    JobType="Default",
    RoleArn=role,
    InputConfig={
        "ContainerConfig": {
            "Domain": "NATURAL_LANGUAGE_PROCESSING",
            "Task": "TEXT_GENERATION",
            "PayloadConfig": {
                "SamplePayloadUrl": payload_tar_url,
                "SupportedContentTypes": ["application/json"],
            },
            # Specify the instance types you would like to test
            "SupportedInstanceTypes": ["ml.g5.12xlarge"],
            "SupportedEndpointType": "RealTime",
        },
        "ModelName": mixtral_model_name,
        "Endpoints": [{"EndpointName": mixtral_endpoint_name}],
    },
)

In [None]:
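sm_client = boto3.client("sagemaker")
# Optional: to check on a job started in an earlier session, uncomment the line
# below and set it to that job's name instead of reusing the job_name created above.
# job_name = "Mixtral-8x7B-v01-awq-2024-02-01-00-05-16"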

In [None]:
describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)

while describe_IR_job_response["Status"] in ["IN_PROGRESS", "PENDING"]:
    describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)
    print(describe_IR_job_response["Status"])
    time.sleep(15)
    
print(f'Inference Recommender job {job_name} has finished with status {describe_IR_job_response["Status"]}.')
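
The polling loop above runs until the job leaves the `PENDING` and `IN_PROGRESS` states, so it can wait indefinitely if a job stalls. A variant with a timeout is sketched below. `wait_for_job` is a hypothetical helper, and `fake_describe` merely stands in for `sm_client.describe_inference_recommendations_job` so the example runs without AWS access.

```python
import time


def wait_for_job(describe, job_name, poll_seconds=15, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe(JobName=...) until the job finishes or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe(JobName=job_name)
        status = response["Status"]
        if status not in ("PENDING", "IN_PROGRESS"):
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_name} still {status} after {timeout_seconds}s")
        sleep(poll_seconds)


# Stand-in for the SageMaker describe call, for illustration only.
_fake_statuses = iter(["PENDING", "IN_PROGRESS", "COMPLETED"])


def fake_describe(JobName):
    return {"JobName": JobName, "Status": next(_fake_statuses)}


result = wait_for_job(fake_describe, "demo-job", sleep=lambda seconds: None)
print(result["Status"])
```

In the notebook, the call would look like `wait_for_job(sm_client.describe_inference_recommendations_job, job_name)`.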

In [None]:
metrics = describe_IR_job_response["InferenceRecommendations"][0]["Metrics"]
# MaxInvocations is reported per minute; 1550 is the number of tokens assumed per invocation
token_per_sec = round(metrics["MaxInvocations"] * 1550 / 60, 2)
cost_per_sec = round(metrics["CostPerHour"] / 3600, 5)
cost_per_1k_token = round(cost_per_sec / token_per_sec * 1000, 5)
print("According to the Inference Recommender job, the corresponding metrics are as below:\n")
print(f"Max tokens per second is about {token_per_sec}")
print(f"Cost per second is about ${cost_per_sec}")
print(f"Cost per 1k tokens is about ${cost_per_1k_token}")

Example output:
- According to the Inference Recommender job, the corresponding metrics are as below:
- Max tokens per second is about `775.0`
- Cost per second is about `$0.00197`
- Cost per 1k tokens is about `$0.00254`
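
The conversion from Inference Recommender metrics to per-token cost can be wrapped in a small function so the assumptions are explicit. This is a sketch under stated assumptions: `throughput_and_cost` is a hypothetical helper, the default of 1550 tokens per invocation mirrors the constant used in the cell above, and the input values are hypothetical numbers chosen so that the result reproduces the example output.

```python
def throughput_and_cost(max_invocations_per_minute, cost_per_hour, tokens_per_invocation=1550):
    """Convert per-minute invocations and hourly cost into per-token figures."""
    tokens_per_sec = round(max_invocations_per_minute * tokens_per_invocation / 60, 2)
    cost_per_sec = round(cost_per_hour / 3600, 5)
    cost_per_1k_tokens = round(cost_per_sec / tokens_per_sec * 1000, 5)
    return tokens_per_sec, cost_per_sec, cost_per_1k_tokens


# Hypothetical metric values, chosen to match the example output.
tokens_per_sec, cost_per_sec, cost_per_1k_tokens = throughput_and_cost(
    max_invocations_per_minute=30, cost_per_hour=7.09
)
print(f"Max tokens per second is about {tokens_per_sec}")
print(f"Cost per second is about ${cost_per_sec}")
print(f"Cost per 1k tokens is about ${cost_per_1k_tokens}")
```

Because the token count per invocation depends on the prompt and on `max_new_tokens`, it is worth measuring it for your own payload rather than reusing 1550.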