# Deploy Codestral on Amazon SageMaker with DeepSpeed

---

[Codestral](https://mistral.ai/news/codestral/) is an open-weight generative AI model explicitly designed for code generation tasks. It helps developers write and interact with code through a shared instruction and completion API endpoint. As it masters code and English, it can be used to design advanced AI applications for software developers. Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.

SageMaker provides a [DeepSpeed container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) that lets you use its managed serving capabilities and offloads the undifferentiated heavy lifting of hosting large models.

[DJL](https://docs.djl.ai/) provides the serving framework, while [DeepSpeed](https://www.deepspeed.ai/) is the key sharding library we use to host large models. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance, programming-language-agnostic universal model serving solution powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, refer to this [blog post](https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/).

In this notebook, we deploy the open-weight `Codestral 22B` model across the GPUs of an `ml.g5.12xlarge` instance. DeepSpeed is used for tensor-parallel inference, while DJLServing handles inference requests and the distributed workers. For further reading on DeepSpeed, refer to this [paper](https://arxiv.org/pdf/2207.00032.pdf).
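As a rough sanity check on why a 4-GPU instance is a reasonable fit, the sketch below estimates the memory needed for the weights alone. It assumes roughly 22 billion parameters, 2 bytes per parameter for 16-bit weights, and the 4 NVIDIA A10G GPUs (24 GB each) of an `ml.g5.12xlarge`. The function name is illustrative, and memory for the KV cache and activations is deliberately ignored.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 22e9      # assumption: roughly 22 billion parameters
BYTES_PER_PARAM = 2    # 16-bit weights (fp16 or bf16)
NUM_GPUS = 4           # ml.g5.12xlarge has 4 NVIDIA A10G GPUs
GPU_MEMORY_GB = 24     # memory per A10G GPU

total_gb = weight_memory_gb(NUM_PARAMS, BYTES_PER_PARAM)
per_gpu_gb = total_gb / NUM_GPUS  # assumes an even split under tensor parallelism

print(f"weights total:   {total_gb:.1f} GB")
print(f"weights per GPU: {per_gpu_gb:.1f} GB of {GPU_MEMORY_GB} GB")
```

Under these assumptions the weights do not fit on a single 24 GB GPU but leave headroom on each of the four once sharded, which is the motivation for tensor parallelism here.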

---

As a 22B model, Codestral sets a new standard on the performance/latency space for code generation compared to previous models used for coding. With its larger context window of 32k (compared to 4k, 8k or 16k for competitors), Codestral outperforms all other models in RepoBench, a long-range eval for code generation.

![codestral](imgs/codestral.png)

<b><i>To deploy Codestral on SageMaker with TGI, refer to the 'Deploy Codestral on TGI' notebook located in this folder.</i></b>

---

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> Codestral is a 22B open-weight model licensed under the new Mistral AI Non-Production License, which means that you can use it for research and testing purposes. Codestral can be downloaded from Hugging Face.
If you want to use the model in the course of commercial activity, commercial licenses are available on demand by reaching out to the Mistral team.
</div>

##### Reach out to Mistral to explore Codestral for commercial use cases: [Contact the Mistral team](https://mistral.ai/contact/)

##### More on the Mistral AI Non-Production License: [Mistral AI Non-Production License](https://mistral.ai/news/mistral-ai-non-production-license-mnpl/)

---

## Requirements

1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)
    - For Notebook Instance type, choose `ml.t3.medium`.
2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).
3. Install the required packages.

<div class="alert alert-block alert-info"> 

<b>NOTE:</b>

- For <a href="https://aws.amazon.com/sagemaker/studio/" target="_blank">Amazon SageMaker Studio</a>, select Kernel "<span style="color:green;">Python 3 (ipykernel)</span>".

- For <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html" target="_blank">Amazon SageMaker Studio Classic</a>, select Image "<span style="color:green;">Base Python 3.0</span>" and Kernel "<span style="color:green;">Python 3</span>".

</div>

To run this notebook, install the following dependencies:

In [None]:
!pip install boto3==1.34.132 -qU --force-reinstall --no-warn-conflicts
!pip install sagemaker==2.224.1 -qU --force-reinstall --no-warn-conflicts

---
## Create SageMaker compatible Model artifact and Upload Model to S3

SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is useful when there is no custom preprocessing of the input data or postprocessing of the model's predictions.

In this notebook, we demonstrate how to deploy a model without any inference code.

SageMaker expects the model artifact as a tarball with the following layout:

```
code
└── serving.properties
```

- `serving.properties` is the configuration file that can be used to configure the model server.


In [1]:
import boto3
import json
import sagemaker

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
smr_client = boto3.client("sagemaker-runtime")

### Initialize parameters

In [3]:
# execution role for the endpoint
role = sagemaker.get_execution_role()

# sagemaker session for interacting with different AWS APIs
sess = sagemaker.session.Session()

# Region
region_name = sess.boto_region_name

# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
bucket = None
if bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    bucket = sess.default_bucket()

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region_name}")

sagemaker role arn: arn:aws:iam::570598552974:role/txt2sql-SageMakerExecutionRole-PAgMr5TND4x0
sagemaker bucket: sagemaker-us-east-1-570598552974
sagemaker session region: us-east-1


### Image URI of the DJL Container

In [4]:
inference_image_uri = sagemaker.image_uris.retrieve(
    framework="djl-deepspeed",
    region=region_name,
    version="0.27.0"
)
print(f"DLC Image going to be used is ---- > {inference_image_uri}")

DLC Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121


See more details about DLC images [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) and [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/announcements/deepspeed-deprecation.md).

### DJL parameters using `serving.properties`

Here is a list of settings that we use in this configuration file:

- `engine`: The engine for DJL to use. In this case, we intend to use Accelerate and hence set it as **Python**.
- `option.model_id`: The model id of a pretrained model hosted inside a model repository on [huggingface.co](https://huggingface.co/models). The container uses this model id to download the corresponding model repository from huggingface.co. This setting is optional and is not needed when you bring your own model. In that case, set `option.s3url` to the URI of the Amazon S3 bucket that contains the model.
- `option.s3url` (Optional): Set this to the URI of the Amazon S3 bucket that contains the model. When this is set, the container uses s5cmd to download the model from Amazon S3. This is fast and particularly useful when downloading large models like this one.
- `option.dtype`: The data type you plan to cast the model weights to. If not provided, LMI uses fp16.
- `option.task`: The task used in Hugging Face for different pipelines. The default is text-generation. For further reading on DJL parameters on SageMaker, follow the [link](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/deepspeed_user_guide.html).
- `option.rolling_batch`: Enables continuous batching (iteration level batching) with one of the supported backends. Available backends differ by container, see [Inference Library Configurations](https://docs.djl.ai/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#inference-library-configuration) for mappings.
    - In the LMI Container:
        - to use vLLM, use `option.rolling_batch=vllm`
        - to use lmi-dist, use `option.rolling_batch=lmi-dist`
        - to use Hugging Face Accelerate, use `option.rolling_batch=auto` for text generation models, or `option.rolling_batch=disable` for non-text generation models.
- `option.tensor_parallel_degree`: Set to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the number of workers per model that are started when DJLServing runs. For example, on an 8-GPU machine with 8 partitions, there is 1 worker per model to serve the requests. For further reading on DeepSpeed, follow the [link](https://www.deepspeed.ai/tutorials/inference-tutorial/#initializing-for-inference).
- `option.device_map`: The HuggingFace accelerate device_map to use.

For more details on the configuration options and an exhaustive list, refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html).
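Before packaging the configuration, it can help to check that the tensor parallel degree fits the target instance. The following is a minimal sketch, not part of DJLServing: `parse_properties` and `check_tensor_parallel_degree` are hypothetical helpers, and the check assumes the degree is written as an integer and that the GPU count (4 for `ml.g5.12xlarge`) is supplied by hand.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_tensor_parallel_degree(props: dict, num_gpus: int) -> int:
    """Return the configured degree, raising if it does not fit the GPU count."""
    # Assumption: the value is an integer, not a symbolic setting.
    degree = int(props["option.tensor_parallel_degree"])
    if degree < 1 or degree > num_gpus:
        raise ValueError(
            f"tensor_parallel_degree={degree} does not fit {num_gpus} GPUs"
        )
    return degree


sample = """
engine=Python
option.dtype=bf16
option.tensor_parallel_degree=4
"""

props = parse_properties(sample)
degree = check_tensor_parallel_degree(props, num_gpus=4)
print(props["engine"], degree)
```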

In [5]:
%%writefile serving.properties

engine=Python
option.model_id=mistral-community/Codestral-22B-v0.1
option.dtype=bf16
option.task=text-generation
option.rolling_batch=vllm
option.tensor_parallel_degree=4
option.device_map=auto

Writing serving.properties


In [6]:
%%sh

mkdir sm_code_artifact
mv serving.properties sm_code_artifact/
tar czvf sm_code_artifact.tar.gz sm_code_artifact/
rm -rf sm_code_artifact

sm_code_artifact/
sm_code_artifact/serving.properties
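
If a shell is not available, the same packaging step can be sketched in Python with the standard `tarfile` module. The helper name `build_artifact` and the use of a temporary directory are illustrative choices, and the one-line properties text stands in for the real configuration file.

```python
import tarfile
import tempfile
from pathlib import Path


def build_artifact(properties_text: str, out_dir: Path,
                   folder: str = "sm_code_artifact") -> Path:
    """Write serving.properties into `folder` and pack it as <folder>.tar.gz."""
    src = out_dir / folder
    src.mkdir(parents=True, exist_ok=True)
    (src / "serving.properties").write_text(properties_text)

    tar_path = out_dir / f"{folder}.tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname keeps archive paths relative to the folder name
        tar.add(src, arcname=folder)
    return tar_path


with tempfile.TemporaryDirectory() as tmp:
    tar_path = build_artifact("engine=Python\n", Path(tmp))
    with tarfile.open(tar_path) as tar:
        members = sorted(tar.getnames())

print(members)
```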


### Create SageMaker endpoint

In [7]:
# Hugging Face Model Id
model_id = "mistral-community/Codestral-22B-v0.1"

# SageMaker Instance Type
instance_type = "ml.g5.12xlarge"

# Folder within bucket where code artifact will go
s3_code_prefix = f"""{model_id.replace("/", "-").replace(".", "x").lower()}/sm_code_artifact"""

# Endpoint name
endpoint_name_prefix = "codestral-22b-vllm"
endpoint_name = sagemaker.utils.name_from_base(endpoint_name_prefix)

print(f"instance_type: {instance_type}")
print(f"model_id: {model_id}")
print(f"s3_code_prefix: {s3_code_prefix}")
print(f"endpoint_name: {endpoint_name}")

instance_type: ml.g5.12xlarge
model_id: mistral-community/Codestral-22B-v0.1
s3_code_prefix: mistral-community-codestral-22b-v0x1/sm_code_artifact
endpoint_name: codestral-22b-vllm-2024-06-25-13-10-12-119


In [8]:
s3_code_artifact = sess.upload_data("sm_code_artifact.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-570598552974/mistral-community-codestral-22b-v0x1/sm_code_artifact/sm_code_artifact.tar.gz


In [9]:
# Deploy model to an endpoint
model = sagemaker.Model(
    image_uri=inference_image_uri,
    model_data=s3_code_artifact,
    role=role
)

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
)

--------------!
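
`model.deploy` waits for the endpoint by default. If you reconnect to an endpoint later, or deploy without waiting, a status check along the following lines may help. The sketch assumes a client exposing `describe_endpoint(EndpointName=...)` that returns a dict with an `EndpointStatus` key, as the boto3 `sagemaker` client does. The helper name, polling interval, and `StubClient` are illustrative, and the stub exists only so the example runs without AWS access.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    for _ in range(max_polls):
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"endpoint {endpoint_name} not ready after {max_polls} polls")


class StubClient:
    """Stand-in for boto3.client("sagemaker"), used only to exercise the helper."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointStatus": next(self._statuses)}


stub = StubClient(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(stub, "demo-endpoint", poll_seconds=0)
print(final_status)
```

With a real client, pass `boto3.client("sagemaker")` and the endpoint name created above instead of the stub.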

## Run inference and chat with the model

### Supported Inference Parameters

---
This model supports the following inference payload parameters:

* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
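
The constraints listed above can be checked on the client side before a request is sent. The following is a minimal sketch under that assumption: `build_payload` is a hypothetical helper, not part of the SageMaker SDK, and it only covers the four parameters described in this section.

```python
import json


def build_payload(prompt, max_new_tokens=None, temperature=None,
                  top_p=None, return_full_text=None):
    """Build a request body, validating only the parameters that are given."""
    params = {}
    if max_new_tokens is not None:
        # bool is a subclass of int, so reject it explicitly
        if (isinstance(max_new_tokens, bool)
                or not isinstance(max_new_tokens, int)
                or max_new_tokens <= 0):
            raise ValueError("max_new_tokens must be a positive integer")
        params["max_new_tokens"] = max_new_tokens
    if temperature is not None:
        if temperature <= 0:
            raise ValueError("temperature must be a positive float")
        params["temperature"] = float(temperature)
    if top_p is not None:
        if not 0 <= top_p <= 1:
            raise ValueError("top_p must be between 0 and 1")
        params["top_p"] = float(top_p)
    if return_full_text is not None:
        if not isinstance(return_full_text, bool):
            raise ValueError("return_full_text must be a boolean")
        params["return_full_text"] = return_full_text
    return {"inputs": prompt, "parameters": params}


payload = build_payload(
    "Write a Python function that reverses a string.",
    max_new_tokens=256,
    temperature=0.2,
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary has the same `inputs` and `parameters` shape as the request bodies used in the inference cells of this notebook.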

---

### Sample code generation questions

1. "Create a Python class for a multi-threaded web scraper that can handle rate limiting, proxy rotation, and dynamic content loading. Include methods for parsing HTML with BeautifulSoup and storing results in a SQLite database."
2. "Implement a Red-Black Tree data structure in C++ with methods for insertion, deletion, and rebalancing. Include a visualization function that prints the tree structure to the console."
3. "Write a Rust function that implements the Aho-Corasick string matching algorithm for efficient multi-pattern searching. Optimize it for memory usage and include comprehensive error handling."
4. "Develop a JavaScript module for a real-time collaborative text editor using operational transformation. Implement functions for handling concurrent edits, conflict resolution, and syncing with a backend server."
5. "Create a Python script that uses asyncio to concurrently process large CSV files, perform complex data transformations, and upload the results to an S3 bucket. Include proper error handling and logging."
6. "Implement a microservices architecture in Go for a basic e-commerce platform. Include services for user authentication, product catalog, order processing, and inventory management. Use gRPC for inter-service communication and implement circuit breaking for resilience."
7. "Provide me with a python script to recompile huggingface models with optimum neuron for inferentia"

---

### Inference using SageMaker SDK

In [11]:
# Create a SageMaker Predictor for the endpoint created in the prior step
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

In [12]:
prompt = "Create a Python class for a multi-threaded web scraper that can handle rate limiting, proxy rotation, and dynamic content loading. Include methods for parsing HTML with BeautifulSoup and storing results in a SQLite database."

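inputs = {
    "inputs": prompt,
    "parameters": {
        # NOTE: with do_sample=False the backend is expected to decode greedily,
        # in which case temperature and top_p have no effect.
        "temperature": 0.8,
        "top_p": 0.95,
        "max_new_tokens": 4000,
        "do_sample": False
    }
}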
response = predictor.predict(inputs)
print(response['generated_text'].strip())

This is a complex task that requires a good understanding of Python, web scraping, and multithreading. Here's a basic outline of how you might approach this:

1. Create a `Scraper` class with the following methods:
   - `__init__`: Initialize the scraper with a list of proxies and a rate limit.
   - `scrape`: Start the scraping process. This should create and start a new thread for each URL in the list.
   - `_scrape_url`: A private method that scrapes a single URL. This should handle rate limiting, proxy rotation, and dynamic content loading.
   - `_parse_html`: A private method that parses HTML using BeautifulSoup and returns the parsed data.
   - `_store_data`: A private method that stores the scraped data in a SQLite database.

2. For rate limiting, you can use the `time.sleep` function to pause the scraper for a certain amount of time between requests.

3. For proxy rotation, you can use a queue to store the proxies and rotate them by dequeuing a proxy, using it, and then enqueuin

### Inference using Boto3 SDK

In [14]:
prompt = "Implement a Red-Black Tree data structure in C++ with methods for insertion, deletion, and rebalancing. Include a visualization function that prints the tree structure to the console."

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
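    Body=json.dumps(
        {
            "inputs": prompt,
            "parameters": {
                # NOTE: with do_sample=False the backend is expected to decode greedily,
                # in which case temperature and top_p have no effect.
                "temperature": 0.8,
                "top_p": 0.95,
                "max_new_tokens": 4000,
                "do_sample": False
            },
        }
    ),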
    ContentType="application/json",
)["Body"].read().decode("utf8")

print(json.loads(response)['generated_text'])



To implement a Red-Black Tree, you'll need to define a Node structure that includes a color, parent, left child, and right child. Then, implement the following methods:

1. `insert(int value)`: Inserts a new node with the given value into the tree while maintaining the Red-Black Tree properties.
2. `delete(int value)`: Deletes the node with the given value from the tree while maintaining the Red-Black Tree properties.
3. `rebalance(Node* node)`: Rebalances the tree by performing rotations and color flips to maintain the Red-Black Tree properties.
4. `visualize()`: Prints the tree structure to the console in a readable format.

Here's a basic outline of how you can structure your code:

```cpp
#include <iostream>

enum Color { RED, BLACK };

struct Node {
    int value;
    Color color;
    Node* parent;
    Node* left;
    Node* right;
};

class RedBlackTree {
public:
    RedBlackTree();
    ~RedBlackTree();

    void insert(int value);
    void remove(int value);
    void visualize(
```

## Conclusion
In this notebook, we demonstrated how to use SageMaker large model inference containers to host Codestral 22B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker machine learning instance.

## Clean Up

In [15]:
# Delete the endpoint
sess.delete_endpoint(endpoint_name)

In [16]:
# Delete the endpoint configuration and the model, even if endpoint creation failed
sess.delete_endpoint_config(endpoint_name)
model.delete_model()