# Deploy open-source Large Language Models on Amazon SageMaker

In this notebook, we show how to deploy open-source LLMs from Hugging Face on Amazon SageMaker. The notebook contains three sections:
- Section 1: Deploy the Falcon model and an embedding model to Amazon SageMaker
- Section 2: Use a RAG-based approach with [LangChain](https://python.langchain.com/en/latest/index.html) and SageMaker endpoints to build a simplified question-answering application
- Section 3: (Optional) Run a SageMaker Inference Recommender job to determine the cost and performance of the LLM

***
This notebook is designed to run on the `Python 3 Data Science 3.0` kernel in Amazon SageMaker Studio.
***

#### 1. Setup development environment

We are going to use the `sagemaker` Python SDK to deploy Falcon to Amazon SageMaker. We need an AWS account configured and the `sagemaker` Python SDK installed.

In [None]:
!pip install sagemaker boto3 --upgrade --quiet
!pip install ipywidgets==7.0.0 langchain==0.0.148 faiss-cpu==1.7.4 unstructured==0.9.3 --quiet


## Section 1: Deploy Falcon model and embedding model to Amazon SageMaker
In this section, we will deploy the open-source [Falcon 7b instruct model](https://huggingface.co/tiiuae/falcon-7b-instruct) on SageMaker for real-time inference. 
To deploy [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) to Amazon SageMaker, we create a SageMaker `Model` and define the endpoint configuration, including the model ID, `instance_type`, and so on. We will use an `ml.g5.2xlarge` instance type.


This example deploys an open-source LLM to Amazon SageMaker using the [Large Model Inference (LMI)](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-dlc.html) Deep Learning Container (DLC) and runs inference against it. We will deploy [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct), an open-source chat LLM trained by TII.



In [None]:
import sagemaker
import boto3
import json
import time
from sagemaker import Model, image_uris, serializers, deserializers

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

bucket = sess.default_bucket()  
region = boto3.Session().region_name
print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {region}")
sm_client = boto3.client('sagemaker')
smr = boto3.client('sagemaker-runtime')


### Start preparing model artifacts
In the LMI container, we expect some artifacts to help set up the model:
- serving.properties (required): Defines the model server settings
- model.py (optional): A Python file that defines the core inference logic
- requirements.txt (optional): Any additional pip packages that need to be installed

In the **serving.properties** file, define the engine to use and the model to host. Note the `tensor_parallel_degree` parameter, which is also set in this scenario. Tensor parallelism divides the model into multiple parts across GPUs when no single GPU has enough memory for the entire model. In this case we use an `ml.g5.2xlarge` instance, which provides one GPU, so the value is 1. Be careful not to specify a value larger than the number of GPUs the instance provides, or your deployment will fail.
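
The GPU-count constraint can be checked before the archive is packaged. The following is a minimal sketch, not part of the LMI tooling: `GPUS_PER_INSTANCE`, `parse_properties`, and `check_tensor_parallel_degree` are hypothetical helpers, and the lookup table is maintained by hand.

```python
# Hypothetical, hand-maintained lookup table (not part of the SageMaker SDK).
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.48xlarge": 8,
}


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_tensor_parallel_degree(properties_text, instance_type):
    """Return the configured degree, or raise if it exceeds the GPU count."""
    props = parse_properties(properties_text)
    degree = int(props.get("option.tensor_parallel_degree", "1"))
    gpus = GPUS_PER_INSTANCE[instance_type]
    if degree > gpus:
        raise ValueError(
            f"tensor_parallel_degree={degree} exceeds the {gpus} GPU(s) on {instance_type}"
        )
    return degree


sample_properties = """engine=MPI
option.model_id=tiiuae/falcon-7b-instruct
option.tensor_parallel_degree=1
"""
print(check_tensor_parallel_degree(sample_properties, "ml.g5.2xlarge"))
```

Running the check against the instance type you intend to deploy on catches a mismatch before the endpoint spends several minutes starting up.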

In [None]:
%%writefile serving.properties
engine=MPI
option.model_id=tiiuae/falcon-7b-instruct
option.trust_remote_code=true
option.tensor_parallel_degree=1
option.paged_attention=false
option.rolling_batch=auto
#option.s3url = {{s3url}}

In [None]:
%%sh
mkdir mymodel-7b
mv serving.properties mymodel-7b/
tar czvf mymodel-7b.tar.gz mymodel-7b/
rm -rf mymodel-7b

### Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.

#### Getting the container image URI

In [None]:
image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.23.0"
)
image_uri

Then we upload the artifacts to S3 and create a SageMaker model.

In [None]:
s3_code_prefix = "large-model-lmi/code"
code_artifact = sess.upload_data("mymodel-7b.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(sagemaker_session=sess, image_uri=image_uri, model_data=code_artifact, role=role)

#### Create SageMaker endpoint

We can now call the `deploy` function to create the LLM endpoint. You need to specify the instance type to use and the endpoint name.

In [None]:
instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model-falcon-7b")

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
    wait=False
)
falcon_model_name = model.name

SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes. During this time, please continue with the following section to deploy the embedding model for the RAG solution. We will invoke the deployed endpoint once all the models are deployed successfully.

For more model deployment examples, see the notebooks in the [SageMaker examples repository](https://github.com/aws/amazon-sagemaker-examples/tree/51cbf4a77c98dc9b74fde6a8c47db0dad40fb910/inference/generativeai).

#### Deploy the all-MiniLM-L6-v2 embedding model on SageMaker

In this section, we host the pre-trained [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) Hugging Face sentence transformer model on SageMaker and use it to generate 384-dimensional vector embeddings for the documents in our knowledge library.

In [None]:
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.model import Model
from sagemaker.predictor import Predictor
import os

instance_type = "ml.g5.2xlarge" # instance type to use for deployment
model_version = "*"
env = {
    "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
    "TS_DEFAULT_WORKERS_PER_MODEL": "1",
    # This model requires the HF_TASK param
    # https://huggingface.co/docs/transformers/main/main_classes/pipelines#transformers.pipeline.task
    "HF_TASK": "feature-extraction",
}


In [None]:
repository = "sentence-transformers/all-MiniLM-L6-v2"
model_name = repository.split("/")[-1]
embed_s3_location = f"s3://{sess.default_bucket()}/sagemaker/{model_name}/model.tar.gz"
pwd = os.getcwd()  # current path (os.system("pwd") would return the exit code, not the path)

In [None]:
# Download the model from hf.co/models with git clone.
!git lfs install
!git clone https://huggingface.co/$repository

In [None]:
# Create a model.tar.gz archive in S3 and delete downloaded folder
%cd $model_name
!tar zcvf model.tar.gz *
!aws s3 cp model.tar.gz $embed_s3_location
%cd ..
%rm -r $model_name

In [None]:
model_id= "huggingface-textembedding-all-MiniLM-L6-v2"

embed_endpoint_name = model_id + '-' + instance_type.split('.')[-1]

# Retrieve the inference container uri. This is the base HuggingFace container image for the default model above.
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=instance_type,
)
model_inference = Model(
    image_uri=deploy_image_uri,
    model_data=embed_s3_location,
    role=role,
    predictor_cls=Predictor,
    name=model_id,
    env=env,
)
model_predictor_inference = model_inference.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    predictor_cls=Predictor,
    endpoint_name=embed_endpoint_name,
    wait=False
)
print(f"Deployment of model {model_id} has started.")

#### Wait until all the endpoints are up and running

In [None]:
# wait for the endpoint to leave the "Creating" state
def wait_for_endpoint(endpoint_name=None):
    status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]

    while status == "Creating":
        time.sleep(15)
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        print(status)

    # fail loudly instead of reporting success for a failed deployment
    if status != "InService":
        raise RuntimeError(f"endpoint {endpoint_name} ended with status {status}")
    print(f"endpoint {endpoint_name} is in service now.")

In [None]:
for ep_name in [endpoint_name, embed_endpoint_name]:
    wait_for_endpoint(ep_name)
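
The polling loop above waits for as long as the endpoint stays in `Creating`. A variant with a timeout can be sketched as follows. The helper `poll_until_done` and its parameters are hypothetical, and the status lookup is passed in as a callable so the sketch runs without AWS access.

```python
import time


def poll_until_done(get_status, pending=("Creating",), interval=15, timeout=1800, sleep=time.sleep):
    """Call get_status() until it leaves the pending states or the timeout expires."""
    waited = 0
    status = get_status()
    while status in pending:
        if waited >= timeout:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(interval)
        waited += interval
        status = get_status()
    return status


# Stand-in for sm_client.describe_endpoint(...)["EndpointStatus"].
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = poll_until_done(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

With the clients defined earlier in this notebook, `get_status` could be `lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`.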

### Test endpoint outputs
Now we can invoke each endpoint to test its output. First, let's check the text-to-text endpoint that uses the Falcon model.

In [None]:
body = {"inputs": "what is life?", "parameters": {"max_new_tokens":400, "return_full_text": False}}
output = smr.invoke_endpoint(EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json")
resp = json.loads(output['Body'].read().decode("utf8"))
print(resp["generated_text"])

Then run the following code to generate embeddings of the input using the embedding model.

In [None]:
input_str = {"text_inputs": resp["generated_text"]}
output = smr.invoke_endpoint(EndpointName=embed_endpoint_name, Body=json.dumps(input_str), ContentType="application/json")
embeddings = output['Body'].read().decode("utf-8")
print(embeddings)

## Section 2: Use a RAG-based approach with [LangChain](https://python.langchain.com/en/latest/index.html) and SageMaker endpoints to build a simplified question-answering application


We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to the LLM.

To achieve that, we will do the following:

1. **Generate embeddings for each document in the knowledge library with the Hugging Face all-MiniLM-L6-v2 embedding model.**
2. **Identify the top K most relevant documents based on the user query.**
    - 2.1 **For a query of your interest, generate the embedding of the query using the same embedding model.**
    - 2.2 **Search for the indexes of the top K most relevant documents in the embedding space using in-memory Faiss search.**
    - 2.3 **Use the indexes to retrieve the corresponding documents.**
3. **Combine the retrieved documents with the prompt and question and send them to the SageMaker LLM.**
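
Steps 2.1 to 2.3 can be illustrated without any endpoint. The sketch below uses plain Python and cosine similarity on made-up 3-dimensional vectors standing in for the 384-dimensional embeddings; in the notebook itself the search is delegated to the FAISS index. The names `cosine_similarity`, `top_k`, and the toy data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indexes of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy data for illustration only.
toy_docs = ["fine-tuning guide", "pricing page", "prompt engineering tips", "release notes"]
toy_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.8, 0.0, 0.2], [0.0, 0.2, 0.9]]
query_vec = [1.0, 0.0, 0.0]

indexes = top_k(query_vec, toy_vecs, k=2)      # step 2.2: search for the indexes
retrieved = [toy_docs[i] for i in indexes]     # step 2.3: map indexes to documents
print(retrieved)
```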



Note: The retrieved document/text should be large enough to contain enough information to answer a question; but small enough to fit into the LLM prompt -- maximum sequence length of 1024 tokens. 
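
One way to respect this limit is to keep the highest-ranked passages until an estimated token budget is used up. The sketch below is an assumption-laden illustration: it uses the rough heuristic of 4 characters per token instead of a real tokenizer, and `fit_to_budget` and `reserved_for_answer` are hypothetical names.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not a tokenizer


def estimate_tokens(text):
    """Estimate the token count by ceiling division of the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fit_to_budget(passages, question, max_tokens=1024, reserved_for_answer=256):
    """Keep passages in ranked order until the estimated prompt budget runs out."""
    budget = max_tokens - reserved_for_answer - estimate_tokens(question)
    kept = []
    for passage in passages:
        cost = estimate_tokens(passage)
        if cost > budget:
            break
        kept.append(passage)
        budget -= cost
    return kept


# Placeholder passages of 2000, 800, and 1200 characters.
passages = ["a" * 2000, "b" * 800, "c" * 1200]
kept = fit_to_budget(passages, "what is life?")
print(len(kept))
```

Whether room must be reserved for the generated answer depends on how the serving container counts the sequence length, so treat `reserved_for_answer` as a tunable assumption.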

---
To build a simplified QA application with LangChain, we need to:
1. Wrap our SageMaker endpoints for the embedding model and the LLM in `langchain.embeddings.SagemakerEndpointEmbeddings` and `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. That requires a small override of the `SagemakerEndpointEmbeddings` class to make it compatible with the SageMaker embedding model.
2. Prepare the dataset to build the knowledge base.

---

First, we wrap the SageMaker endpoint for the embedding model in `langchain.embeddings.SagemakerEndpointEmbeddings`, using a small subclass that overrides `embed_documents` to make it compatible with the SageMaker embedding model.

In [None]:
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import ContentHandlerBase
from typing import Any, Dict, List, Optional
import json

class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int = 5) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together in one request. Defaults to 5.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        # clamp to at least 1 so an empty input list does not give range() a zero step
        _chunk_size = max(1, min(chunk_size, len(texts)))
        for i in range(0, len(texts), _chunk_size):
            response = self._embedding_func(texts[i : i + _chunk_size])
            results.extend(response)
        return results


class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: List[str], model_kwargs: Dict = {}) -> bytes:
        # prompt is the batch of texts passed in by embed_documents
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: Any) -> List[List[float]]:
        # output is the streaming response body returned by the endpoint
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json["embedding"]
        return embeddings


content_handler = ContentHandler()

embeddings = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=embed_endpoint_name,
    region_name=region,
    content_handler=content_handler,
)
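
To see what the batching and the content handler do without calling an endpoint, the sketch below reproduces the same serialization on in-memory data. `io.BytesIO` stands in for the streaming response body, the response values are made up, and the function names are hypothetical stand-ins for the methods above.

```python
import io
import json


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def transform_input(texts, model_kwargs=None):
    """Serialize a batch of texts the way the embedding handler above does."""
    payload = {"text_inputs": texts, **(model_kwargs or {})}
    return json.dumps(payload).encode("utf-8")


def transform_output(body):
    """Read a file-like response body and return the list of embeddings."""
    return json.loads(body.read().decode("utf-8"))["embedding"]


texts = ["first text", "second text", "third text"]
request_bodies = [transform_input(batch) for batch in batched(texts, 2)]

# Made-up response with 2-dimensional vectors; the real model returns 384 values per text.
fake_body = io.BytesIO(json.dumps({"embedding": [[0.1, 0.2], [0.3, 0.4]]}).encode("utf-8"))
vectors = transform_output(fake_body)
print(len(request_bodies), vectors)
```

Three texts with a batch size of 2 produce two requests, and each response contributes one vector per input text.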




Next, we wrap the SageMaker endpoint for the LLM in `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`.

In [None]:
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint

parameters = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.1
}


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"inputs": prompt, "parameters": {**model_kwargs}})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = output.read()
        res = json.loads(response_json)
        ans = res['generated_text']
        return ans 


content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    region_name=region,
    model_kwargs=parameters,
    content_handler=content_handler,
)

In [None]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import Chroma, AtlasDB, FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import DirectoryLoader

Use LangChain to read the `txt` data. LangChain has multiple built-in loaders for different file formats such as `csv`, `html`, and `pdf`. For details, see [LangChain document loaders](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html).

In [None]:
loader = DirectoryLoader("./data/", glob="**/*.txt")
documents = loader.load()

We generate embeddings for each document in the knowledge library with the Hugging Face all-MiniLM-L6-v2 embedding model.

In [None]:
docsearch = FAISS.from_documents(documents, embeddings)

In [None]:
question = "what is the recommended way to first customize a foundation model?"

Based on the question above, we then **identify top K most relevant documents based on user query, where K = 3 in this setup**.

In [None]:
docs = docsearch.similarity_search_with_score(question, k=3)  # k=3 matches the text; the default is 4

In [None]:
docs

In [None]:
source = []
context = []
for doc, score in docs:
    context.append(doc)
    source.append(doc.metadata['source'].split('/')[-1])
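
Several retrieved chunks can come from the same file, so the `source` list built above may contain repeated names. A small sketch of de-duplicating while keeping the ranking order is shown below. The helper `unique_sources` and the file paths are hypothetical.

```python
def unique_sources(paths):
    """Return file names from the paths, de-duplicated, in first-seen order."""
    names = [path.split("/")[-1] for path in paths]
    return list(dict.fromkeys(names))


# Hypothetical paths for illustration only.
example_paths = ["data/guide.txt", "data/faq.txt", "data/guide.txt"]
print(unique_sources(example_paths))
```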

Finally, we **combine the retrieved documents with the prompt and question and send them to the SageMaker LLM.** 

We define a customized prompt as below.

In [None]:
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.:\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

In [None]:
chain = load_qa_chain(llm=sm_llm, prompt=PROMPT)

In [None]:
result = chain({"input_documents": context, "question": question}, return_only_outputs=True)["output_text"]
print(result)
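
To make the final prompt visible, the sketch below fills the same template with plain string formatting. It assumes the chain simply concatenates the retrieved documents with a blank line between them before substitution. `build_prompt` and the two passages are hypothetical.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def build_prompt(passages, question, separator="\n\n"):
    """Join the retrieved passages and substitute them into the template."""
    return PROMPT_TEMPLATE.format(context=separator.join(passages), question=question)


filled_prompt = build_prompt(
    ["Passage one.", "Passage two."],
    "what is the recommended way to first customize a foundation model?",
)
print(filled_prompt)
```

Printing the filled prompt is a quick way to check that the retrieved text fits within the model's sequence length before sending it to the endpoint.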

### Run the Question and Answering chatbot application

Once all the endpoints are deployed successfully, you can open a terminal in SageMaker Studio and use the command below to run the chatbot [Streamlit](https://streamlit.io/) application. Note that you need to install the required Python packages specified in the `requirements.txt` file. You also need to set the environment variables to the names of the endpoints deployed in your account. When you run the `chatbot-streamlit.py` file, it picks up the endpoint names from these environment variables.

```
$ pip install -r requirements.txt
$ export nlp_ep_name=<the falcon endpoint name deployed in your account>
$ export embed_ep_name=<the embedding endpoint name deployed in your account>
$ streamlit run chatbot-streamlit.py --server.port 6006 --server.maxUploadSize 6
```
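
The contents of `chatbot-streamlit.py` are not shown in this notebook. As an illustration only, one way a script could read the two variables exported above and fail early when one is missing is sketched below. `endpoint_names_from_env` is a hypothetical helper and the endpoint names are placeholders.

```python
import os


def endpoint_names_from_env(environ=None):
    """Read the two endpoint names, raising KeyError if either is unset."""
    environ = os.environ if environ is None else environ
    names = {}
    for var in ("nlp_ep_name", "embed_ep_name"):
        value = environ.get(var)
        if not value:
            raise KeyError(f"environment variable {var} is not set")
        names[var] = value
    return names


# Placeholder values passed explicitly so the sketch does not depend on the shell.
example_env = {"nlp_ep_name": "falcon-endpoint", "embed_ep_name": "embed-endpoint"}
print(endpoint_names_from_env(example_env))
```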

To access the Streamlit UI, copy your SageMaker Studio URL and replace `lab?` with `proxy/[PORT NUMBER]/`. Because we set the server port to 6006, the URL should look like:

```
https://<domain ID>.studio.<region>.sagemaker.aws/jupyter/default/proxy/6006/
```

Replace the domain ID and region with the correct values for your account to access the UI, shown below:
![streamlitUI](./img/Streamlit_UI.png)
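
The URL rewrite described above can be sketched as a small helper. `studio_proxy_url` is a hypothetical name, and the domain ID and region in the example are placeholders.

```python
def studio_proxy_url(studio_url, port):
    """Replace the 'lab?...' tail of a Studio URL with 'proxy/<port>/'."""
    base, found, _ = studio_url.partition("lab?")
    if not found:
        raise ValueError("expected a SageMaker Studio URL containing 'lab?'")
    return f"{base}proxy/{port}/"


# Placeholder domain ID and region, for illustration only.
example_url = "https://d-abc123.studio.us-east-1.sagemaker.aws/jupyter/default/lab?"
proxy_url = studio_proxy_url(example_url, 6006)
print(proxy_url)
```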

You can find some suggested prompts in the left-hand sidebar. When you upload the sample files (you can find them in the test folder), the chatbot will automatically provide prompt suggestions based on the input data type.

**Congratulations on finishing lab 1 !!!**

_____________________________________________________________________________

## Section 3: (Optional) Understand the model hosting performance using SageMaker Inference Recommender

[Amazon SageMaker Inference Recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html) is a capability of Amazon SageMaker that reduces the time required to get machine learning (ML) models in production by automating load testing and model tuning across SageMaker ML instances. You can use Inference Recommender to deploy your model to a real-time or serverless inference endpoint that delivers the best performance at the lowest cost. Inference Recommender helps you select the best instance type and configuration (such as instance count, container parameters, and model optimizations) or serverless configuration (such as max concurrency and memory size) for your ML models and workloads.

In [None]:
!tar -czvf payload.tar.gz test_file/payload.json
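
The contents of `test_file/payload.json` are not shown here. If you need to create a sample payload yourself, the sketch below writes one and packs it with the standard library. It assumes the payload has the same JSON shape as the test request sent to the Falcon endpoint earlier, and it stores the file at the archive root, whereas the `tar` command above keeps the `test_file/` prefix. `make_payload_archive` is a hypothetical helper.

```python
import json
import os
import tarfile
import tempfile


def make_payload_archive(payload, workdir):
    """Write payload.json and pack it into payload.tar.gz inside workdir."""
    json_path = os.path.join(workdir, "payload.json")
    with open(json_path, "w") as f:
        json.dump(payload, f)
    archive_path = os.path.join(workdir, "payload.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(json_path, arcname="payload.json")
    return archive_path


sample_payload = {"inputs": "what is life?", "parameters": {"max_new_tokens": 400}}
with tempfile.TemporaryDirectory() as tmp:
    archive = make_payload_archive(sample_payload, tmp)
    with tarfile.open(archive) as tar:
        archive_names = tar.getnames()
        restored = json.load(tar.extractfile("payload.json"))
print(archive_names)
```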

In [None]:
s3_location = f"s3://{bucket}/sagemaker/InferenceRecommender/{falcon_model_name}"
payload_tar_url = sagemaker.s3.S3Uploader.upload("payload.tar.gz", s3_location)
print(payload_tar_url)

In [None]:
job_name = f"{falcon_model_name}-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
response = sm_client.create_inference_recommendations_job(
    JobName=job_name,
    JobType='Default',
    RoleArn=role,
    InputConfig={
        'ContainerConfig': {
            'Domain': 'NATURAL_LANGUAGE_PROCESSING',
            'Task': 'TEXT_GENERATION',
            'PayloadConfig': {
                'SamplePayloadUrl': payload_tar_url,
                'SupportedContentTypes': ["application/json"],
            },
            #specify the instance types you would like to test out
            'SupportedInstanceTypes': ['ml.g5.2xlarge'], 
            'SupportedEndpointType': 'RealTime'
        },
        'ModelName': falcon_model_name
    },
)

In [None]:
# # uncomment this section to wait for the inference job to finish
# describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)

# while describe_IR_job_response["Status"] in ["IN_PROGRESS", "PENDING"]:
#     describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)
#     print(describe_IR_job_response["Status"])
#     time.sleep(15)
    
# print(f'Inference Recommender job {job_name} has finished with status {describe_IR_job_response["Status"]}.')

Now, let's use the inference recommender job results to calculate the approximate invocation cost for the LLM endpoint.

In [None]:
describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)
describe_IR_job_response['InferenceRecommendations']

The Inference Recommender job reports the following metrics:
- 'ModelLatency'
- 'CostPerInference'
- 'CostPerHour'
- 'MaxInvocations' per minute

and more.

Note that the sample JSON input file contains about 6,200 characters, which is roughly 1,550 tokens per invocation (1 token is approximately 4 characters). To estimate the cost per 1K tokens, run inference many times with an average-sized payload and take the best tokens-per-second throughput you observe (different instance types can result in different throughput, model latency, and cost). Then divide the cost per second by the tokens per second and multiply by 1,000. Alternatively, divide the per-invocation cost by the tokens per invocation and multiply by 1,000. The two results should be similar. SageMaker also supports auto scaling, which scales your endpoint out or in based on the invocation traffic pattern to save cost.

In [None]:
metrics = describe_IR_job_response['InferenceRecommendations'][0]['Metrics']
token_per_sec = round(metrics['MaxInvocations']*1550/60, 2)
cost_per_sec = round(metrics['CostPerHour']/3600, 5)
cost_per_1k_token = round(cost_per_sec/token_per_sec * 1000, 5)
print("According to the Inference Recommender job, the corresponding metrics are as below:\n")
print(f"Max tokens per second is about {token_per_sec}")
print(f"Cost per second is about ${cost_per_sec}")
print(f"Cost per 1k tokens is about ${cost_per_1k_token}")
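
The two ways of computing the price per 1K tokens described above can be compared side by side. In the sketch below the metric values are made up rather than measured, and it assumes that the per-invocation cost equals the hourly cost divided by the number of invocations per hour. The function names are hypothetical.

```python
TOKENS_PER_INVOCATION = 1550  # 6,200 characters at roughly 4 characters per token


def cost_per_1k_from_throughput(cost_per_hour, max_invocations_per_minute):
    """Cost per second divided by tokens per second, scaled to 1,000 tokens."""
    tokens_per_sec = max_invocations_per_minute * TOKENS_PER_INVOCATION / 60
    cost_per_sec = cost_per_hour / 3600
    return cost_per_sec / tokens_per_sec * 1000


def cost_per_1k_from_invocation(cost_per_inference):
    """Per-invocation cost divided by tokens per invocation, scaled to 1,000 tokens."""
    return cost_per_inference / TOKENS_PER_INVOCATION * 1000


# Made-up metric values, not results from a real job.
example_cost_per_hour = 1.5
example_max_invocations = 30  # per minute
example_cost_per_inference = example_cost_per_hour / (example_max_invocations * 60)  # assumption

by_throughput = cost_per_1k_from_throughput(example_cost_per_hour, example_max_invocations)
by_invocation = cost_per_1k_from_invocation(example_cost_per_inference)
print(f"{by_throughput:.6f} {by_invocation:.6f}")
```

Under that assumption the two formulas are algebraically identical. With real job metrics they can differ slightly because the reported values are measured and rounded independently.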

## Clean up

In [None]:
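# Uncomment to delete the resources created in this notebook.
# for ep in [endpoint_name, embed_endpoint_name]:
#     sm_client.delete_endpoint(EndpointName=ep)
#     sm_client.delete_endpoint_config(EndpointConfigName=ep)  # assumes the config shares the endpoint's name
# sm_client.delete_model(ModelName=falcon_model_name)
# sm_client.delete_model(ModelName=model_id)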