## Deploy Mixtral-8x7B using the LMI-Dist backend of the LMI container

### Mixtral Architecture

Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model with the same architecture as Mistral 7B, except that each layer is composed of 8 feedforward blocks (experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Although each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters but uses only 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens, and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks.
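The routing step described above can be illustrated with a toy sketch in plain Python. This is not Mixtral's implementation: the names `top2_route` and `moe_layer`, the toy experts, and the router logits are all hypothetical. It assumes the gate weights are a softmax over the two selected router logits.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top2 = ranked[:2]
    # Assumption: gate weights are a softmax over the selected logits only.
    weights = softmax([router_logits[i] for i in top2])
    return list(zip(top2, weights))


def moe_layer(x, router_logits, experts):
    """Weighted sum of the outputs of the two selected experts."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy setup: 8 "experts", where expert i scales its input by (i + 1).
experts = [lambda x, k=i + 1: [k * v for v in x] for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

routes = top2_route(router_logits)
output = moe_layer([1.0, 2.0], router_logits, experts)
print(routes)
print(output)
```

Only two of the eight experts run for this token, which is why the active parameter count is much smaller than the total.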

#### LMI-Dist 

The DeepSpeed container includes the LMI Distributed Inference Library (LMI-Dist), an inference library for running large model inference that combines optimizations from several open-source libraries, including vLLM, Text-Generation-Inference (up to version 0.9.4), FasterTransformer, and DeepSpeed. It incorporates popular open-source technologies such as FlashAttention, PagedAttention, fused kernels, and efficient GPU communication kernels to accelerate the model and reduce memory consumption.

In this tutorial, you will use the lmi-dist backend of the Large Model Inference (LMI) DLC to deploy Mixtral-8x7B and run inference with it.


#### Pre-reqs

Make sure the following permissions are granted before running the notebook:

* S3 bucket push access
* SageMaker access


#### Set up and Installs

In [2]:
%pip install sagemaker --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [3]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # AWS account ID of the current session


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


#### Prepare Model Artifacts

The LMI container expects the following artifacts to set up the model:

* serving.properties (required): Defines the model server settings
* model.py (optional): A Python file that defines the core inference logic
* requirements.txt (optional): Any additional pip packages that need to be installed

If you use LMI-Dist as the rolling batch option with DJL Serving, you can set the options listed in this [table](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html#large-model-inference-lmi-dist) in serving.properties.

In [4]:
%%writefile serving.properties
engine=MPI
option.model_id=mistralai/Mixtral-8x7B-v0.1
option.tensor_parallel_degree=8
option.max_rolling_batch_size=32
option.rolling_batch=lmi-dist

Overwriting serving.properties
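Outside a notebook, where the `%%writefile` magic is unavailable, the same file content can be produced from a dictionary. The helper below is a hypothetical sketch, not part of the SageMaker SDK; it only renders the text and leaves writing it to disk to the caller. Note that `tensor_parallel_degree=8` shards the model across 8 GPUs, which matches the 8 GPUs of the ml.g5.48xlarge instance used later.

```python
def render_serving_properties(engine, options):
    """Render serving.properties text; each option key gets the 'option.' prefix."""
    lines = [f"engine={engine}"]
    lines += [f"option.{key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


properties_text = render_serving_properties(
    "MPI",
    {
        "model_id": "mistralai/Mixtral-8x7B-v0.1",
        "tensor_parallel_degree": 8,   # one shard per GPU
        "max_rolling_batch_size": 32,  # max concurrent requests in a batch
        "rolling_batch": "lmi-dist",   # select the LMI-Dist backend
    },
)
print(properties_text)
```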


The code and configuration you want to deploy can be stored either locally or in S3. Here, the files are bundled into a tar.gz file that is later uploaded to S3, where SageMaker retrieves it.

In [5]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/serving.properties
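If a shell is not available, the same archive layout can be built with Python's standard `tarfile` module. This is a minimal sketch: `package_model` is a hypothetical helper, and the demo writes to a temporary directory to avoid touching the working directory.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model(properties_text, archive_path, root="mymodel"):
    """Create a tar.gz containing <root>/serving.properties."""
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp) / root
        model_dir.mkdir()
        (model_dir / "serving.properties").write_text(properties_text)
        with tarfile.open(archive_path, "w:gz") as tar:
            # arcname keeps the top-level folder name inside the archive
            tar.add(str(model_dir), arcname=root)
    return archive_path


# Demo: build the archive in a scratch directory and list its members.
out_dir = tempfile.mkdtemp()
archive = package_model("engine=MPI\n", str(Path(out_dir) / "mymodel.tar.gz"))
with tarfile.open(archive) as tar:
    names = tar.getnames()
print(names)
```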


#### Build SageMaker endpoint

#### Getting the container image URI

Retrieve the ECR image URI for the DJL DeepSpeed Large Model Inference container. The image URI is looked up from the framework name, AWS region, and framework version, which lets us select the right pre-built SageMaker Docker image for our environment.

See available Large Model Inference DLC's [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers)



In [6]:
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.26.0",
)

#### Upload artifact on S3 and create SageMaker model

This code uploads the tarball containing the model configuration to the session's default S3 bucket under the prefix "large-model-lmi-dist/code", prints the S3 path of the uploaded file, and then creates a SageMaker model that points to it. The model is given the S3 path so that SageMaker knows where to retrieve the artifacts when deploying the model. The role parameter specifies the IAM role that SageMaker assumes to access the uploaded artifacts.

In [7]:
s3_code_prefix = "large-model-lmi-dist/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-420618410968/large-model-lmi-dist/code/mymodel.tar.gz


#### Create SageMaker endpoint

This cell deploys the model to a SageMaker endpoint for real-time inference. The instance_type defines the machine instance for the endpoint, in this case ml.g5.48xlarge, a large multi-GPU instance whose 8 GPUs match the tensor_parallel_degree=8 set in serving.properties. The endpoint name is generated programmatically from a base name. The model is deployed with a long container startup health check timeout (1800 seconds), because the container needs time to download the model weights and load them onto the GPUs.

A SageMaker Predictor is then created to call the deployed endpoint for real-time inference. The endpoint name, SageMaker session, and a JSON serializer for the request payload are specified. No deserializer is set, so responses are returned as raw bytes. The predictor provides a simple interface for calling the endpoint, so it can be easily integrated into applications.

In [8]:
instance_type = "ml.g5.48xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")
print(f"endpoint_name: {endpoint_name}")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             container_startup_health_check_timeout=1800
            )

# requests are sent as JSON, so we specify a JSON serializer; no deserializer is set, so responses come back as raw bytes
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
)

endpoint_name: lmi-model-2024-02-18-17-21-25-477
----------------!
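Applications that do not use the SageMaker Python SDK can call the same endpoint through the boto3 `sagemaker-runtime` client. The sketch below is one possible shape: `build_payload` and `invoke` are hypothetical helper names, and the client is passed in as an argument so that nothing here contacts AWS until you call `invoke` with a real client.

```python
import json


def build_payload(prompt, max_new_tokens=128, do_sample=True):
    """Build the request body in the inputs/parameters format used in this notebook."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": do_sample},
    }


def invoke(runtime_client, endpoint_name, payload):
    """Send a JSON request to the endpoint and return the parsed JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # "Body" is a streaming object; read() returns the raw response bytes.
    return json.loads(response["Body"].read())


# Example usage (requires AWS credentials and a running endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   result = invoke(runtime, endpoint_name, build_payload("Hello"))
```

Passing the client in also makes the helper easy to exercise with a stub in unit tests.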

#### Run inference

This code snippet sends a text generation request to the deployed Mixtral-8x7B model. The inputs key passes the text prompt "The future of Artificial Intelligence is" to seed the generation. The parameters key configures generation: max_new_tokens limits the output to at most 128 new tokens, while do_sample enables stochastic sampling from the model's predicted distribution rather than taking the most likely token each time. The model uses the prompt and parameters to continue the sentence.

In [9]:
predictor.predict(
    {"inputs": "The future of Artificial Intelligence is", "parameters": {"max_new_tokens":128, "do_sample":True}}
)

b'{"generated_text": " still uncertain. As humans we strive to continue evolving and so do the machines that assist us every day. In order to help assistants like Siri and Alexa create a more seamless everyday experience we strive to analyse and write \xe2\x80\x98service descriptions\xe2\x80\x99.\\n\\nA service description is a string of natural language designed to assist our vocal assistants in knowing how to accomplish a task for a user. We make an effort and spend time aligning these descriptions to align the description given by the AI and the expectation given by the person fulfilling that request.\\n\\nWithout a service description attached to your app, voice assistants"}'
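Because the predictor has no deserializer, the response above is a raw byte string containing JSON. A minimal sketch of decoding it follows; `extract_generated_text` is a hypothetical helper, and the sample bytes are a shortened excerpt of the output shown above.

```python
import json


def extract_generated_text(raw_response):
    """Decode a UTF-8 JSON response and return its 'generated_text' field."""
    if isinstance(raw_response, (bytes, bytearray)):
        raw_response = raw_response.decode("utf-8")
    return json.loads(raw_response)["generated_text"]


# Shortened excerpt of the response returned by predictor.predict(...)
sample = (
    b'{"generated_text": " still uncertain. We strive to analyse and write '
    b'\xe2\x80\x98service descriptions\xe2\x80\x99.\\n\\nA service description"}'
)
text = extract_generated_text(sample)
print(text)
```

Alternatively, passing `deserializer=deserializers.JSONDeserializer()` when constructing the Predictor makes `predict` return a parsed Python object instead of bytes.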

#### Clean Up Endpoint

In [10]:
predictor.delete_endpoint()