# Deploy a fine-tuned Mistral-7B-Instruct model via DJL on SageMaker

This notebook walks through deploying a LoRA fine-tuned Mistral 7B Instruct model on Amazon SageMaker using [DeepSpeed and DJL Serving](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-tutorials-deepspeed-djl.html).

Refer to this [AWS Blog post](https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/) for more details on the deployment approach. This model served as the fine-tuned head of a custom RAG architecture; check the blog post for more details.

Steps:
1. **Prepare the Deployment Package**
    * Organize the necessary files, including `requirements.txt`, `serving.properties`, and `model.py`, within a designated directory.
    * Package the directory contents into a `tar.gz` file.

2. **Upload the Deployment Package to Amazon S3**
    * Upload the packaged `tar.gz` file to an Amazon S3 bucket, which serves as the storage location for the deployment package.

3. **Deploy the Model as a SageMaker Endpoint**
    * Deploy the packaged model as a SageMaker endpoint that can then be invoked for inference through the API.

*Note: This notebook assumes familiarity with Amazon SageMaker, DJL, and basic concepts of deploying machine learning models. Additional documentation and resources are available for further reference and exploration.*


### 0. Initialization

In [None]:
#!pip install sagemaker --upgrade --quiet

In [None]:
import sagemaker
from sagemaker.session import Session
from sagemaker import image_uris
from sagemaker import Model

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
session = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
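region = session.boto_region_name  # public attribute; avoids relying on the private _region_name

# DJL DeepSpeed serving container image for this region
image_uri = image_uris.retrieve(framework="djl-deepspeed", version="0.24.0", region=region)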

### 1. Preparing deployment package
Our directory should have the following structure:

your_local_dir    
├── model.py    
├── serving.properties    
├── requirements.txt    
└── your_model_artifacts_dir (the fine-tuned model)    
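
Before packaging, it can help to confirm that the directory matches the layout above. The following is a minimal sketch, not part of the original notebook: `missing_package_files` is a hypothetical helper, and the file names are taken from the tree shown above.

```python
from pathlib import Path

# File names taken from the directory tree above.
REQUIRED_FILES = ("model.py", "serving.properties", "requirements.txt")


def missing_package_files(package_dir):
    """Return the required file names that are absent from package_dir."""
    root = Path(package_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling `missing_package_files("your_local_dir")` should return an empty list once all three files have been written.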

In [None]:
!mkdir -p your_local_dir

Prepare requirements.txt and serving.properties in ./your_local_dir

In [None]:
%%writefile your_local_dir/serving.properties
engine=Python
option.model_id=mistralai/Mistral-7B-Instruct-v0.2
option.dtype=fp16
option.tensor_parallel_degree=4
option.enable_streaming=true
option.entryPoint=model.py
option.adapter_checkpoint=your_model_artifacts_dir
option.adapter_name=your_adapter_name
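
In `model.py`, the handler reads `adapter_checkpoint` and `adapter_name` from `inputs.get_properties()`, that is, the `option.*` keys without their prefix. The sketch below imitates that mapping locally so the file can be sanity-checked before deployment. Both functions are hypothetical helpers written for illustration; they are not DJL Serving's own parser.

```python
def parse_serving_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def handler_options(props):
    """Keep only the option.* entries and drop the 'option.' prefix."""
    prefix = "option."
    return {k[len(prefix):]: v for k, v in props.items() if k.startswith(prefix)}
```

For the file above, `handler_options(parse_serving_properties(open("your_local_dir/serving.properties").read()))` would contain the keys `adapter_checkpoint` and `adapter_name`, but not `engine`.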

In [None]:
%%writefile your_local_dir/requirements.txt
git+https://github.com/huggingface/transformers
accelerate==0.23.0

Prepare model.py in ./your_local_dir

In [None]:
%%writefile your_local_dir/model.py
import logging

import torch
from djl_python.encode_decode import decode, encode
from djl_python.inputs import Input
from djl_python.outputs import Output
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

device = "cuda"
model = None
tokenizer = None

logger = logging.getLogger(__name__)


def Mistral_Infer(query,
        do_sample=True,
        temperature=0.1,
        top_p=0.92,
        top_k=0,
        max_new_tokens=512,
):

    pipe = pipeline(
        "text-generation", 
        model=model, 
        tokenizer=tokenizer, 
        torch_dtype=torch.bfloat16, 
        device_map="auto"
    )

    sequences = pipe(
        f"<s>[INST] {query} [/INST]",
        do_sample=do_sample,
        max_new_tokens=max_new_tokens, 
        temperature=temperature, 
        top_k=top_k, 
        top_p=top_p,
        num_return_sequences=1,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
   
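    # Drop anything after the '[STOP][STOP]' marker, then keep only the text
    # that follows the prompt's closing [/INST] tag
    full_text = sequences[0]['generated_text'].split('[STOP][STOP]')[0]
    answer = full_text.split('[/INST]')[1]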
    
    return answer


def evaluate(instruction,
        do_sample=True,
        temperature=0.1,
        top_p=0.92,
        top_k=0,
        max_new_tokens=512,
        **kwargs,
):
    response = Mistral_Infer(instruction,
                             do_sample,
                             temperature,
                             top_p,
                             top_k,
                             max_new_tokens
                            )    
    return response
 
    
def load_base_model(adapter_checkpoint, adapter_name):
    model_name = "mistralai/Mistral-7B-Instruct-v0.2"    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        low_cpu_mem_usage=True,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(model, adapter_checkpoint, adapter_name)    
    return model, tokenizer


def inference(inputs: Input):
    json_input = decode(inputs, "application/json")
    sequence = json_input.get("inputs")
    # Forward the optional "parameters" object (e.g. do_sample, temperature) to evaluate()
    generation_kwargs = json_input.get("parameters", {})
    output = Output()
    outs = evaluate(sequence, **generation_kwargs)
    encode(output, outs, "application/json")
    return output


def handle(inputs: Input):
    """
    Default handler function
    """
    global model, tokenizer
    if model is None:
        # load the model once and keep it in memory for later requests
        props = inputs.get_properties()
        model, tokenizer = load_base_model(props.get("adapter_checkpoint"), props.get("adapter_name"))

    if inputs.is_empty():
        # initialization request
        return None

    return inference(inputs)
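
The prompt formatting and answer extraction in `Mistral_Infer` can be checked locally without loading the model. The sketch below isolates those two string operations. `build_prompt` and `extract_answer` are hypothetical names, and the final `.strip()` is an addition that is not in the handler above.

```python
STOP_MARKER = "[STOP][STOP]"  # marker that Mistral_Infer splits on
INST_CLOSE = "[/INST]"        # closing tag of the instruction prompt


def build_prompt(query):
    """Wrap a query in the instruction format used by Mistral_Infer."""
    return f"<s>[INST] {query} [/INST]"


def extract_answer(generated_text):
    """Cut the text at the stop marker and return what follows the prompt."""
    full_text = generated_text.split(STOP_MARKER)[0]
    # .strip() is an assumption added here; the handler returns the raw text
    return full_text.split(INST_CLOSE)[1].strip()
```

Because the text-generation pipeline returns the prompt followed by the completion, the split on `[/INST]` is what removes the echoed prompt from the response.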


### 2. Package the model artifacts and upload the tar.gz file to S3

You can upload the files to S3 with AWS KMS key encryption; refer to the [AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html) for details.

In [None]:
%%bash
cp -r your_model_artifacts_dir your_local_dir/

In [None]:
%%bash
tar -cvzf your_model_package.tar.gz your_local_dir/
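
If you prefer to stay in Python, the packaging step can be sketched with the standard-library `tarfile` module. `package_directory` is a hypothetical helper; it also returns the archive's member names so you can confirm that `serving.properties` and `model.py` were included.

```python
import tarfile
from pathlib import Path


def package_directory(source_dir, archive_path):
    """Create a gzip-compressed tar archive of source_dir and return its member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the directory itself as the top-level entry, like the tar command above
        tar.add(source_dir, arcname=Path(source_dir).name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

For example, `package_directory("your_local_dir", "your_model_package.tar.gz")` should list entries such as `your_local_dir/model.py`.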

In [None]:
%%bash
aws s3 cp your_model_package.tar.gz s3://your_s3_bucket/
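
The KMS-encrypted upload mentioned above can also be sketched with boto3. This is an illustration under assumptions: `kms_extra_args` and `upload_package` are hypothetical helpers, and the bucket, object key, and KMS key ID are placeholders you would supply. When no key ID is given, S3 falls back to the AWS managed KMS key.

```python
def kms_extra_args(kms_key_id=None):
    """Build the ExtraArgs dict that requests SSE-KMS encryption for an upload."""
    args = {"ServerSideEncryption": "aws:kms"}
    if kms_key_id:
        args["SSEKMSKeyId"] = kms_key_id
    return args


def upload_package(local_path, bucket, key, kms_key_id=None):
    """Upload local_path to s3://bucket/key with SSE-KMS and return the S3 URI."""
    import boto3  # imported here so kms_extra_args works without boto3 installed

    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key, ExtraArgs=kms_extra_args(kms_key_id))
    return f"s3://{bucket}/{key}"
```

A call such as `upload_package("your_model_package.tar.gz", "your_s3_bucket", "your_model_package.tar.gz")` would return the URI to use as the model location in the next step.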

### 3. Deploy as SageMaker Inference Endpoint

In [None]:
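# Select an instance type with at least as many GPUs as option.tensor_parallel_degree
# in serving.properties (4 here); ml.g5.12xlarge provides 4 GPUs, while ml.g5.2xlarge provides only 1
instance_type = "ml.g5.12xlarge"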

model_s3_location = "s3://your_s3_bucket/your_model_package.tar.gz"

In [None]:
import sagemaker.djl_inference

model = Model(
    image_uri,
    model_data=model_s3_location,
    predictor_cls = sagemaker.djl_inference.DJLPredictor, 
    role=role
)

In [None]:
predictor = model.deploy(
    initial_instance_count=1, 
    instance_type=instance_type
)

### 4. Testing the endpoint

In [None]:
import boto3
import json

endpoint = 'the_deployed_sagemaker_endpoint_name'  # e.g. predictor.endpoint_name
runtime = boto3.client('sagemaker-runtime')

payload = {
    "inputs": "ask your own question?",
    "parameters": {
        "do_sample": True,
        "temperature": 0.1,
    }
}

In [None]:
import time

st = time.time()

response = runtime.invoke_endpoint(EndpointName=endpoint,
                                   ContentType='application/json',
                                   Body=json.dumps(payload).encode("utf-8"))

et = time.time()
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
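
A single timed call can be noisy, especially for the first request after deployment. The sketch below repeats a call several times and summarizes the latencies. `measure_latency` and `summarize` are hypothetical helpers, and `call` stands for any zero-argument function, for example a lambda wrapping `runtime.invoke_endpoint(...)`.

```python
import statistics
import time


def measure_latency(call, runs=5):
    """Invoke call() `runs` times and return the per-call latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return the minimum, median, and maximum of a list of latencies."""
    return {
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }
```

Each invocation of a deployed endpoint incurs usage on a running instance, so keep `runs` small when measuring against a real endpoint.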

In [None]:
pred = json.loads(response['Body'].read())
pred

### 5. Clean-up resources

In [None]:
# uncomment the following lines to delete the endpoint and model
# predictor.delete_endpoint()
# model.delete_model()