# Adjustment of Amazon SageMaker execution role policies

To build Docker images and deploy infrastructure resources through the AWS Serverless Application Model (AWS SAM) from SageMaker Studio, we need to adjust the SageMaker execution role. This is the IAM role that grants SageMaker Studio permission to perform actions against the AWS API.

To adjust accordingly copy the following JSON:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/*",
            "Condition": {
                "StringLikeIfExists": {
                    "iam:PassedToService": "codebuild.amazonaws.com"
                }
            }
        }
    ]
}
```

We then change the execution role's permissions by adding the above JSON as an inline policy.
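
If you prefer to script this step instead of using the IAM console, the inline policy can also be attached with boto3. The following is a minimal sketch and not part of the original lab: the helper names are hypothetical, the IAM client is passed in by the caller, and the caller needs permission to call `iam:PutRolePolicy`.

```python
import json


def role_name_from_arn(role_arn: str) -> str:
    """Extract the role name (last path segment) from an IAM role ARN."""
    return role_arn.split("/")[-1]


def attach_inline_policy(iam_client, role_arn: str, policy_name: str, policy: dict) -> None:
    """Attach `policy` to the role as an inline policy via IAM PutRolePolicy."""
    iam_client.put_role_policy(
        RoleName=role_name_from_arn(role_arn),
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )


# Usage (assumes valid AWS credentials; `policy` is the JSON above as a dict and
# `role_arn` is the ARN of the SageMaker execution role):
#   import boto3
#   attach_inline_policy(boto3.client("iam"), role_arn, "my-inline-policy", policy)
```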





# Adjustment of Amazon SageMaker execution role trust relationships

To build Docker images from SageMaker Studio, we need to establish a trust relationship between the execution role and the AWS CodeBuild service (this is where the Docker build is executed). We therefore adjust the role's trust relationship by copying this JSON:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "codebuild.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
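
The same change can be sketched in Python. The helper below adds a service principal to an existing trust policy without touching the statements that are already there; the function names are hypothetical, and the IAM client is supplied by the caller. With boto3, the current trust policy is available as a dict under `get_role(...)["Role"]["AssumeRolePolicyDocument"]`.

```python
import copy
import json


def add_trusted_service(trust_policy: dict, service: str) -> dict:
    """Return a copy of the trust policy that also lets `service` assume the role."""
    updated = copy.deepcopy(trust_policy)
    for statement in updated["Statement"]:
        principal = statement.get("Principal")
        value = principal.get("Service") if isinstance(principal, dict) else None
        services = [value] if isinstance(value, str) else (value or [])
        if (statement.get("Effect") == "Allow"
                and statement.get("Action") == "sts:AssumeRole"
                and service in services):
            return updated  # service is already trusted, nothing to add
    updated["Statement"].append({
        "Effect": "Allow",
        "Principal": {"Service": service},
        "Action": "sts:AssumeRole",
    })
    return updated


def update_trust_policy(iam_client, role_name: str, trust_policy: dict) -> None:
    """Replace the role's trust policy via IAM UpdateAssumeRolePolicy."""
    iam_client.update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(trust_policy),
    )
```

Working on a copy keeps the original policy available in case the update has to be rolled back.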




# Installation and import of required dependencies



In [None]:
!pip install sagemaker==2.163.0 --upgrade

In [None]:
!pip install sagemaker-studio-image-build

In [None]:
!pip install aws-sam-cli

In [None]:
import sagemaker
import boto3
import os
import tarfile
import requests
from io import BytesIO
from tqdm import tqdm


# Setup of notebook environment



In [None]:
# Retrieve SM execution role
role = sagemaker.get_execution_role()

In [None]:
# Create a new STS client
sts_client = boto3.client('sts')

# Call the GetCallerIdentity operation to retrieve the account ID
response = sts_client.get_caller_identity()
account_id = response['Account']
account_id

In [None]:
# Retrieve region
region = boto3.Session().region_name
region

# Setup of S3 bucket for storage of knowledge documents
Amazon Kendra provides multiple built-in adapters for integrating with data sources to build up a document index, e.g. S3, web-scraper, RDS, Box, Dropbox, etc. In this lab we store the documents containing the knowledge to be infused into the application in S3. For this purpose we create a dedicated S3 bucket.

In [None]:
# Bucket name for storing the knowledge documents
model_bucket_name = f'immersion-day-bucket-{account_id}-{region}'
model_bucket_name

In [None]:
# Create S3 bucket
s3_client = boto3.client('s3', region_name=region)
location = {'LocationConstraint': region}

bucket_name = model_bucket_name

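# Check if bucket already exists (head_bucket raises ClientError if the
# bucket is missing or not accessible)
from botocore.exceptions import ClientError

bucket_exists = True
try:
    s3_client.head_bucket(Bucket=bucket_name)
except ClientError:
    bucket_exists = False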

# Create bucket if it does not exist
if not bucket_exists:
    if region == 'us-east-1':
        s3_client.create_bucket(Bucket=bucket_name)
    else: 
        s3_client.create_bucket(Bucket=bucket_name,
        CreateBucketConfiguration=location)
    print(f"Bucket '{bucket_name}' created successfully")

# Frontend

All relevant components for building a dockerized frontend application can be found in the "docker_and_app" directory. It consists of the following files: 
- ```streamlit_app.py```: actual frontend utilizing the popular streamlit framework
- ```Dockerfile```: Dockerfile providing the blueprint for the creation of a Docker image
- ```requirements.txt```: specifying the dependencies required to be installed for hosting the frontend application
- ```setup.sh```: setup script containing all the steps necessary to create an ECR repository, build the Docker image, and push it to the repository we created


## UI

The chatbot frontend web application "AWSomeChat" looks as follows:

![chat-frontend](app_ss.png)

To chat with the chatbot enter a message into the light grey input box and press ENTER. The chat conversation will appear below.

At the top of the page you can see the session ID assigned to your chat conversation. It is used to map conversation histories to a specific user, since the chatbot backend is stateless. To start a new conversation, press the "Clear Chat" and "Reset Session" buttons at the top right of the page.


## Dockerization and hosting

In order to prepare our frontend application to be hosted as a Docker container, we execute the bash script setup.sh. It looks as follows: 

```bash 
#!/bin/bash

# Get the AWS account ID
aws_account_id=$(aws sts get-caller-identity --query Account --output text)
aws_region=$(aws configure get region)

echo "AccountId = ${aws_account_id}"
echo "Region = ${aws_region}"


# Create a new ECR repository
echo "Creating ECR Repository..."
aws ecr create-repository --repository-name rag-app

# Get the login command for the new repository
echo "Logging into the repository..."
#$(aws ecr get-login --no-include-email)
aws ecr get-login-password --region ${aws_region} | docker login --username AWS --password-stdin ${aws_account_id}.dkr.ecr.${aws_region}.amazonaws.com

# Build and push the Docker image and tag it
echo "Building and pushing Docker image..."
sm-docker build -t "${aws_account_id}.dkr.ecr.${aws_region}.amazonaws.com/rag-app:latest" --repository rag-app:latest .
```

The script performs the following steps in a sequential manner:

1. Retrieval of the AWS account id and region
2. Create a new ECR repository with the name rag-app. Note: this operation will fail, if the repository already exists within your account. This is intended behaviour and can be ignored.
3. Login to the respective ECR repository. 
4. Build the Docker image and tag it with the "latest" tag using the sagemaker-studio-image-build package we previously installed. The "sm-docker build" command will push the built image into the specified repository automatically. All compute will be carried out in AWS CodeBuild.
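
Steps 1 and 2 can also be expressed in Python, which makes the "repository already exists" case explicit instead of relying on a failing CLI call. This is a sketch under assumptions: the function names are hypothetical, and the ECR client (for example `boto3.client("ecr")`) is passed in by the caller.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build the fully qualified ECR image URI for a repository and tag."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def ensure_ecr_repository(ecr_client, repository: str) -> bool:
    """Create the repository if it is missing. Returns True if it was created."""
    try:
        ecr_client.create_repository(repositoryName=repository)
        return True
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # Same situation the shell script tolerates: the repository is already there
        return False
```

Deriving the image URI from the region variable avoids tagging the image for a region other than the one the repository lives in.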

In [None]:
!cd fe && bash setup.sh

The frontend Docker image is now residing in the respective ECR repository and can be used for our deployment at a later point in time during the lab.

# Document store and retriever component

## Kendra

Amazon Kendra is an intelligent search service that uses natural language processing and advanced machine learning algorithms to return specific answers to search questions from your data.

Unlike traditional keyword-based search, Amazon Kendra uses its semantic and contextual understanding capabilities to decide whether a document is relevant to a search query. It returns specific answers to questions, giving users an experience that's close to interacting with a human expert.


## Creation of a Kendra index
To use Kendra for retrieval-augmented generation we need to first create a Kendra index. This index will hold all the knowledge we want to infuse into our LLM-powered chatbot application. 
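
The index can be created in the Kendra console or through the API. The following sketch uses the boto3 Kendra client, passed in by the caller; the function names are hypothetical, and `role_arn` is assumed to be an IAM role that Kendra is allowed to assume (this is a different role from the SageMaker execution role). Index creation is asynchronous, so the second helper polls until the index is usable.

```python
import time


def create_kendra_index(kendra_client, name: str, role_arn: str) -> str:
    """Start creating a Developer Edition index and return its ID."""
    response = kendra_client.create_index(
        Name=name,
        Edition="DEVELOPER_EDITION",
        RoleArn=role_arn,
    )
    return response["Id"]


def wait_until_active(kendra_client, index_id: str,
                      poll_seconds: float = 30, max_polls: int = 120) -> None:
    """Poll DescribeIndex until the index is ACTIVE; raise on failure or timeout."""
    for _ in range(max_polls):
        status = kendra_client.describe_index(Id=index_id)["Status"]
        if status == "ACTIVE":
            return
        if status == "FAILED":
            raise RuntimeError(f"Kendra index {index_id} failed to create")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Kendra index {index_id} not active after {max_polls} polls")
```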

## Using Kendra with LLMs 
Now that we have an understanding of the basics of Kendra, we want to use it to index our documents. Kendra indexing allows us to query our enterprise knowledge base without having to worry about handling different data types (PDF, XML) in S3, connectors to SaaS applications, or web pages.



We use an Amazon Kendra index to ingest enterprise unstructured data from data sources such as wiki pages, MS SharePoint sites, Atlassian Confluence, and document repositories such as Amazon S3. When a user interacts with the GenAI app, the flow is as follows:

1. The user makes a request to the GenAI app.
2. The app issues a search query to the Amazon Kendra index based on the user request.
3. The index returns search results with excerpts of relevant documents from the ingested enterprise data.
4. The app sends the user request, along with the data retrieved from the index as context, in the LLM prompt.
5. The LLM returns a succinct response to the user request based on the retrieved data.
6. The response from the LLM is sent back to the user.
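
The flow above can be sketched in a few lines of plain Python. The retriever and the LLM are passed in as functions, so the sketch makes no assumption about how Kendra or the model endpoint are called; the function names and the prompt wording are illustrative only.

```python
from typing import Callable, List


def build_prompt(question: str, excerpts: List[str]) -> str:
    """Combine the retrieved excerpts and the user question into one LLM prompt."""
    context = "\n\n".join(excerpts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question: str,
           retrieve: Callable[[str], List[str]],
           generate: Callable[[str], str]) -> str:
    """Run steps 2 to 5 of the flow for a single user request."""
    excerpts = retrieve(question)              # steps 2 and 3: query the index
    prompt = build_prompt(question, excerpts)  # step 4: request plus context
    return generate(prompt)                    # step 5: LLM response
```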



## Uploading knowledge documents into an Amazon Kendra index

Next we are going to add some more documents from S3 to show how easy it is to integrate different data sources into a Kendra index. 
First we download some interesting PDF files from the internet, but feel free to add any PDF you find interesting as well. 

In [None]:
import os
import boto3
import requests
from io import BytesIO
from tqdm import tqdm

# Create an S3 client
s3 = boto3.client('s3')

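# Create the bucket if it doesn't exist (normally it was already created above)
bucket_name = f'immersion-day-bucket-{account_id}-{region}'
existing_buckets = [b['Name'] for b in s3.list_buckets()['Buckets']]
if bucket_name not in existing_buckets:
    if region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        # Outside us-east-1, S3 requires an explicit LocationConstraint
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': region})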

# List of URLs to download PDFs from
pdf_urls = [
    "https://patentimages.storage.googleapis.com/bb/0f/5a/6ef847538a6ab5/US10606565.pdf",
    "https://patentimages.storage.googleapis.com/f7/50/e4/81af7ddcbb2773/US9183397.pdf",
    "https://docs.aws.amazon.com/pdfs/enclaves/latest/user/enclaves-user.pdf",
    "https://docs.aws.amazon.com/pdfs/ec2-instance-connect/latest/APIReference/ec2-instance-connect-api.pdf",
]

# Download PDFs from the URLs and upload them to the S3 bucket
for url in tqdm(pdf_urls):
    response = requests.get(url, stream=True)
    response.raise_for_status()  # stop early instead of uploading an error page
    filename = os.path.basename(url)
    print(f"Working on {filename}")
    fileobj = BytesIO()
    total_size = int(response.headers.get('content-length', 0))
    block_size = 1024
    progress_bar = tqdm(total=total_size, unit='iB', unit_scale=True)
    for data in response.iter_content(block_size):
        progress_bar.update(len(data))
        fileobj.write(data)
    progress_bar.close()
    fileobj.seek(0)
    s3.upload_fileobj(fileobj, bucket_name, filename)

Let's use those documents in Kendra. First navigate to the Amazon Kendra console. 

Under "Data Management" you will find the tab "Data Sources". Navigate there and add a new data source via "Add data source". 
Take some time to inspect all the different connectors that are available out of the box. We will use S3 as our source. 

It is worth noting that Kendra respects enterprise-level access attributes. This means that it can deny queries if a user is not authorized to retrieve a document. 

You can add the sample bucket that is offered at the top of the connectors list as a data source, but for the sake of demonstration we will add the bucket with our downloaded PDFs as well. 



After the connection has been established, you can sync your data source by clicking "sync now". 
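
The console steps have API equivalents. The sketch below registers the S3 bucket as a data source and starts a sync job with the boto3 Kendra client, which the caller passes in. The function names and the default data source name are hypothetical, and `role_arn` is assumed to be a role that allows Kendra to read the bucket.

```python
def add_s3_data_source(kendra_client, index_id: str, bucket: str, role_arn: str,
                       name: str = "s3-knowledge-documents") -> str:
    """Register an S3 bucket as a Kendra data source and return the data source ID."""
    response = kendra_client.create_data_source(
        IndexId=index_id,
        Name=name,
        Type="S3",
        Configuration={"S3Configuration": {"BucketName": bucket}},
        RoleArn=role_arn,
    )
    return response["Id"]


def start_sync(kendra_client, index_id: str, data_source_id: str) -> str:
    """Start a sync job (the API counterpart of "Sync now") and return its execution ID."""
    response = kendra_client.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)
    return response["ExecutionId"]
```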

# Model: Falcon40b-instruct via SageMaker Real-time Endpoint with HuggingFace LLM Deep Learning Container


For the response generation step we need a generative model that comes up with an appealing chatbot-like answer, taking into account the user's request and the relevant documents retrieved from the Kendra index. For this purpose, (instruction-fine-tuned) decoder or encoder-decoder models are usually used. I will be using the Falcon40b-instruct model.

## Deploy Falcon40b-instruct on Amazon SageMaker

The Falcon40b-instruct model can be deployed to SageMaker using the HuggingFace LLM DLC as shown below:

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

# TGI config
config = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to quantize
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

In [None]:
# Deploy model to an endpoint

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout, # seconds to wait for the model to load (300 s = 5 minutes)
    )
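
Once the endpoint is in service, it can be queried through the predictor returned by `deploy`. The following sketch assumes the request and response format used by the HuggingFace LLM container (an `inputs` string plus a `parameters` dict, and a list containing a `generated_text` field); the helper names and default parameter values are illustrative.

```python
def build_tgi_payload(prompt: str, max_new_tokens: int = 256, temperature: float = 0.7) -> dict:
    """Build a request body for the text generation container."""
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "return_full_text": False,  # return only the completion, not the prompt
        },
    }


def generate(predictor, prompt: str) -> str:
    """Send the prompt to the endpoint and return the generated text."""
    response = predictor.predict(build_tgi_payload(prompt))
    return response[0]["generated_text"]


# Usage (assumes the `llm` predictor from the deployment cell above):
#   print(generate(llm, "What is Amazon EC2?"))
```

Keep in mind that the prompt length plus `max_new_tokens` has to stay within the `MAX_TOTAL_TOKENS` limit configured for the container.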


# Orchestration layer

We encapsulate the execution of the individual steps involved in the retrieval-augmented generation design pattern in an orchestration layer. I will use the popular framework LangChain for this.


### KendraIndexRetriever

Indexes refer to ways to structure documents so that LLMs can best interact with them. The most common way that indexes are used in chains is in a "retrieval" step. This step refers to taking a user's query and returning the most relevant documents. We draw this distinction because (1) an index can be used for other things besides retrieval, and (2) retrieval can use other logic besides an index to find relevant documents. We therefore have the concept of a "Retriever" interface, which is the interface that most chains work with.

Most of the time when we talk about indexes and retrieval we are talking about indexing and retrieving unstructured data (like text documents). For interacting with structured data (SQL tables, etc.) or APIs there are other design patterns that we will not cover in this lab. The primary index and retrieval types supported by LangChain are currently centered around vector databases and document retrieval systems, Kendra being one of them.

### ConversationBufferWindowMemory
Memory is the concept of storing and retrieving data in the process of a conversation. There are two main methods:

- Based on input, fetch any relevant pieces of data
- Based on the input and output, update state accordingly

There are two main types of memory: short term and long term.

Short term memory generally refers to how data is passed within a single conversation (generally previous ChatMessages or summaries of them), while long term memory deals with how to fetch and update information between conversations. In this lab we will focus on short term memory in the form of the ```ConversationBufferWindowMemory```. 

```ConversationBufferWindowMemory``` keeps a list of the interactions of the conversation over time. It only uses the last K interactions. This can be useful for keeping a sliding window of the most recent interactions, so the buffer does not get too large. 
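
To illustrate the sliding window idea, here is a minimal plain-Python sketch. It is not LangChain's implementation; the class name `WindowMemory` and the `Human:`/`AI:` rendering are assumptions made for illustration.

```python
from collections import deque


class WindowMemory:
    """Keeps only the last k (user, assistant) exchanges of a conversation."""

    def __init__(self, k: int):
        # A deque with maxlen drops the oldest exchange automatically
        self._turns = deque(maxlen=k)

    def save(self, user_message: str, ai_message: str) -> None:
        """Record one completed exchange."""
        self._turns.append((user_message, ai_message))

    def as_text(self) -> str:
        """Render the window as text that can be placed into a prompt."""
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self._turns)
```

With `k=2`, saving a third exchange silently discards the first one, which bounds the prompt size regardless of how long the conversation runs.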

### DynamoDBChatMessageHistory

In their plain form, LangChain's memory classes store the conversation in memory. However, this approach does not work when hosting the orchestration layer as a stateless microservice. To overcome this, LangChain offers ChatMessageHistory classes as memory plugins that interface with durable storage backends. The ```DynamoDBChatMessageHistory``` class leverages Amazon DynamoDB to store the chat message history in a durable manner. It supports session management through a session_id parameter that matches the partition key in the DynamoDB schema.
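
The idea of keying the history by session can be sketched with the DynamoDB table interface directly. This is not the LangChain class; the class name, the attribute names `SessionId` and `History`, and the message format are assumptions for illustration. The table object (for example `boto3.resource("dynamodb").Table(...)`) is passed in by the caller.

```python
class SessionHistory:
    """Loads and stores the messages of one chat session in a DynamoDB table."""

    def __init__(self, table, session_id: str):
        self.table = table            # object exposing get_item and put_item
        self.session_id = session_id  # value of the partition key

    def messages(self) -> list:
        """Return the stored messages, or an empty list for a new session."""
        item = self.table.get_item(Key={"SessionId": self.session_id}).get("Item")
        return item["History"] if item else []

    def add(self, role: str, content: str) -> None:
        """Append one message and write the full history back."""
        history = self.messages() + [{"role": role, "content": content}]
        self.table.put_item(Item={"SessionId": self.session_id, "History": history})
```

Because every request reads and writes the history through the table, any instance of the stateless backend can continue any conversation.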

### SagemakerEndpoint

For encapsulating functionality around LLM inference, LangChain provides the LLM class, which is specifically designed for interfacing with LLMs. There are lots of LLM providers on the market, as well as a huge variety of hosting options. This class is designed to provide a standard interface for all of them.

Amazon SageMaker is a service for building, training, and deploying machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. Inheriting from the generic LLM class, the ```SagemakerEndpoint``` class provides functionality specifically tied to LLMs hosted via SageMaker endpoints through a standard interface. It uses the endpoint name as the unique identifier of the targeted endpoint.
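
The endpoint class needs to know how to turn a prompt into a request body and a response body back into text. In LangChain this is done by a content handler. The sketch below shows the shape of such a handler for the JSON format assumed for the Falcon endpoint; to keep it runnable without LangChain installed it does not subclass LangChain's `LLMContentHandler`, which a real handler would do. The class name is hypothetical.

```python
import json


class FalconContentHandler:
    """Translates between prompts and the JSON bodies exchanged with the endpoint."""

    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        """Serialize the prompt and generation parameters into a request body."""
        body = {"inputs": prompt, "parameters": model_kwargs}
        return json.dumps(body).encode("utf-8")

    def transform_output(self, output) -> str:
        """Parse the file-like response body and extract the generated text."""
        response = json.loads(output.read().decode("utf-8"))
        return response[0]["generated_text"]
```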

### PromptTemplate

The new way of programming models is through prompts. A "prompt" refers to the input to the model. This input is rarely hard coded, but rather is often constructed from multiple components. A PromptTemplate is responsible for the construction of this input. LangChain provides several classes and functions to make constructing and working with prompts easy, ```PromptTemplate``` being one of them.

A PromptValue is what is eventually passed to the model. Most of the time, this value is not hardcoded but is rather dynamically created based on a combination of user input, other non-static information (often coming from multiple sources), and a fixed template string. We call the object responsible for creating the PromptValue a ```PromptTemplate```. This object exposes a method for taking in input variables and returning a PromptValue.
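
The concept can be illustrated with plain Python string formatting. This is not LangChain's `PromptTemplate` class; the template text and the function name are assumptions chosen for this lab's use case.

```python
RAG_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def render_prompt(template: str, **variables: str) -> str:
    """Fill the template's placeholders; raises KeyError if a variable is missing."""
    return template.format(**variables)
```

Calling `render_prompt(RAG_TEMPLATE, context=..., question=...)` produces the final string that is sent to the model, while the fixed template stays unchanged between requests.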

### ConversationalRetrievalChain

A chain is an incredibly generic concept that refers to a sequence of modular components (or other chains) combined in a particular way to accomplish a common use case.

The ```ConversationalRetrievalChain``` is a special purpose chain designed for chatbot implementations infusing knowledge via retrieval-augmented generation. 
This chain has two steps. First, it condenses the current question and the chat history into a standalone question. This is necessary to create a standalone request to use for retrieval. After that, it performs the retrieval and then answers the question using retrieval-augmented generation with a separate model. Part of the power of the declarative nature of LangChain is that you can easily use a separate language model for each call. This makes it possible to use a cheaper and faster model for the simpler task of condensing the question, and a more expensive model for answering the question. However, within this lab we will be using one model for both steps.
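
The two steps can be sketched in plain Python as follows. This is an illustration of the pattern, not LangChain's implementation: the function name and prompt wording are assumptions, and the LLM and the retriever are passed in as functions.

```python
from typing import Callable, List, Tuple

CONDENSE_TEMPLATE = (
    "Given the conversation and a follow-up question, rephrase the follow-up "
    "question as a standalone question.\n\n"
    "Conversation:\n{history}\n\n"
    "Follow-up question: {question}\nStandalone question:"
)


def conversational_rag(question: str,
                       history: List[Tuple[str, str]],
                       llm: Callable[[str], str],
                       retrieve: Callable[[str], List[str]]) -> str:
    """Condense the question using the history, then answer it from retrieved context."""
    # Step 1: condense into a standalone question (skipped for the first turn)
    if history:
        rendered = "\n".join(f"Human: {q}\nAI: {a}" for q, a in history)
        prompt = CONDENSE_TEMPLATE.format(history=rendered, question=question)
        standalone = llm(prompt).strip()
    else:
        standalone = question
    # Step 2: retrieve with the standalone question and generate the answer
    context = "\n\n".join(retrieve(standalone))
    return llm(f"Context:\n{context}\n\nQuestion: {standalone}\nAnswer:")
```

A follow-up such as "How can I create one?" carries no searchable subject on its own, which is why the retriever receives the condensed question rather than the raw user input.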

## Create Lambda function codebase 

We will now look into the orchestrator implementation, which is meant to be hosted through AWS Lambda with a Python runtime. You can find the source code in the ```rag_app``` directory. It consists of the following components:
- ```kendra``` directory: implementation of the Kendra retriever. This can be used as is and does not require further attention.
- ```rag_app.py```: implementation of the orchestration layer as AWS Lambda handler function.
- ```requirements.txt```: specifying the dependencies required to be installed for hosting the orchestration layer.


# Application Deployment

Finally, we want to put all pieces together and deploy the LLM-powered chatbot application. 

## Infrastructure as Code: CloudFormation and SAM

The AWS Serverless Application Model (SAM) is an open-source framework for building serverless applications. It provides shorthand syntax to express functions, APIs, databases, and event source mappings. With just a few lines per resource, you can define the application you want and model it using YAML. During deployment, SAM transforms and expands the SAM syntax into AWS CloudFormation syntax, enabling you to build serverless applications faster.

## Application stack resources

The application stack is defined through the ```template.yml``` AWS SAM file in yaml format. Once executed, it spins up the following resources:
- AWS Lambda function for hosting the orchestration layer
- Amazon API Gateway for exposing the orchestration layer in a RESTful way
- ExecutionRole for the AWS Lambda function
- VPC including two Subnets, an InternetGateway, ElasticIp, RoutingTables for hosting the application
- ECS Service/Cluster including a TaskDefinition and SecurityGroups for hosting the Frontend
- ExecutionRole for the ECS Task
- LogGroup for Observability
- LoadBalancer for exposing the Frontend
- Amazon DynamoDB table for durable storage of the chat history

## Deploy stack with SAM

Before we deploy the AWS SAM stack, we need to adjust the Lambda function's environment variable pointing to the Kendra index. 

Now we are ready for deployment. 


In [None]:
# Building the code artifacts
!sam build

In [None]:
# Deploying the stack
!sam deploy --stack-name rag-stack --resolve-s3 --capabilities CAPABILITY_IAM

Once the deployment is done, we can go to the CloudFormation console and select the "Resources" tab of the stack "rag-stack". Click on the "Physical ID" of the LoadBalancer and copy the DNS name from the page you are forwarded to. You can now reach the web application in a browser by using this DNS name as the URL.
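
The DNS name can also be looked up programmatically. The sketch below assumes that the stack's load balancer is an `AWS::ElasticLoadBalancingV2::LoadBalancer` resource (this is not confirmed by the text above) and that CloudFormation and ELBv2 clients are passed in by the caller; the function name is hypothetical.

```python
def load_balancer_dns(cfn_client, elbv2_client, stack_name: str) -> str:
    """Return the DNS name of the first load balancer found in the stack."""
    resources = cfn_client.describe_stack_resources(StackName=stack_name)["StackResources"]
    for resource in resources:
        if resource["ResourceType"] == "AWS::ElasticLoadBalancingV2::LoadBalancer":
            arn = resource["PhysicalResourceId"]  # for this type the physical ID is the ARN
            described = elbv2_client.describe_load_balancers(LoadBalancerArns=[arn])
            return described["LoadBalancers"][0]["DNSName"]
    raise LookupError(f"No load balancer found in stack {stack_name}")


# Usage (assumes valid AWS credentials):
#   import boto3
#   print(load_balancer_dns(boto3.client("cloudformation"), boto3.client("elbv2"), "rag-stack"))
```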


# Application testing

Now that we are in the chat, let us try out some questions for our chatbot. 

Let's ask about Amazon EC2: what it is, how we can create an instance, and some more information about it. 
Take a look at the conversation below.

<p align="center">
  <img src="ChatEC2.png" alt="A chat with the model about EC2">
</p>



First of all, we can see that the LLM has memory of the previous conversation turn, as we reference EC2 implicitly via "Okay. How can I create one?" 

Secondly, we see the shortcoming of the low number of characters retrieved from Kendra, as I am using a free demo account.


#### Discussion of the patents that we uploaded to the Kendra bucket


<p align="center">
  <img src="PatentChat.png" alt="A chat with the model about one of the patents we downloaded">
</p>



# Cleanup


In [None]:
!sam delete --no-prompts --stack-name rag-stack --region $region
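
Note that `sam delete` only removes the resources of the SAM stack. The SageMaker endpoint hosting the model was created through the SageMaker SDK and has to be deleted separately. A minimal sketch, assuming the predictor object `llm` returned by the deployment cell is still available (the helper name is hypothetical):

```python
def delete_llm_endpoint(predictor) -> None:
    """Delete the SageMaker model first, then the endpoint and its configuration."""
    predictor.delete_model()     # must run while the endpoint configuration still exists
    predictor.delete_endpoint()  # also removes the endpoint configuration by default


# Usage in this notebook:
#   delete_llm_endpoint(llm)
```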