# Building a simple RAG chatbot with LangChain, Hugging Face, FAISS, Amazon SageMaker and Amazon Textract

In [1]:
%%sh
pip install sagemaker langchain amazon-textract-caller amazon-textract-textractor sentence-transformers pypdf pip install faiss-cpu -qU

In [2]:
import boto3, json, sagemaker
from typing import Dict
from langchain import LLMChain
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.llms import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Deploy LLM on SageMaker

In [3]:
#t5 XL
# Hub Model configuration. https://huggingface.co/models
role = sagemaker.get_execution_role()

hub = {
	#'HF_MODEL_ID':'google/flan-t5-small',
    'HF_MODEL_ID':'google/flan-t5-xl',
	'SM_NUM_GPUS': json.dumps(1)
}



# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
	image_uri=get_huggingface_llm_image_uri("huggingface",version="1.1.0"),
	env=hub,
	role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
	initial_instance_count=1,
	instance_type="ml.g4dn.4xlarge",
	container_startup_health_check_timeout=300,
    endpoint_name="flan-t5-demo"
  )
  
# send request
predictor.predict({
	"inputs": "Translate to German:  My name is Arthur",
})

--------!

[{'generated_text': 'Ich bin Arthur.'}]

In [4]:
endpoint_name = predictor.endpoint_name
endpoint_name

'flan-t5-demo'

Step 2. Ask a question to LLM without providing the context
To better illustrate why we need retrieval-augmented generation (RAG) based approach to solve the question and anwering problem. Let's directly ask the model a question and see how they respond.

In [5]:
question = "Which instances can I use with Managed Spot Training in SageMaker?"

out = predictor.predict({"inputs": question})
out

[{'generated_text': 'SageMaker and SageMaker XL.'}]

Step 3. Improve the answer to the same question using prompt engineering with insightful context
To better answer the question well, we provide extra contextual information, combine it with a prompt, and send it to model together with the question. Below is an example.

In [6]:
context = """Managed Spot Training can be used with all instances
supported in Amazon SageMaker. Managed Spot Training is supported
in all AWS Regions where Amazon SageMaker is currently available."""

In [7]:
prompt_template = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

text_input = prompt_template.replace("{context}", context).replace("{question}", question)

out = predictor.predict({"inputs": text_input})
generated_text = out[0]["generated_text"]
print(f"[Input]: {question}\n[Output]: {generated_text}")

[Input]: Which instances can I use with Managed Spot Training in SageMaker?
[Output]: all instances supported in Amazon SageMaker


Let's see if our LLM is capable of following our instructions...

In [9]:
unanswerable_question = "What color is my desk?"

text_input = prompt_template.replace("{context}", context).replace("{question}", unanswerable_question)

out = predictor.predict({"inputs": text_input})
generated_text = out[0]["generated_text"]
print(f"[Input]: {unanswerable_question}\n[Output]: {generated_text}")

[Input]: What color is my desk?
[Output]: I don't know


Step 4. Use RAG based approach to identify the correct documents, and use them along with prompt and question to query LLM
We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to LLM.

To achieve that, we will do following.

Generate embedings for each of document in the knowledge library with the MiniLM embedding model.
Identify top K most relevant documents based on user query.
For a query of your interest, generate the embedding of the query using the same embedding model.
Search the indexes of top K most relevant documents in the embedding space using the SageMaker KNN algorithm.
Use the indexes to retrieve the corresponded documents.
Combine the retrieved documents with prompt and question and send them into LLM.
Note: The retrieved document/text should be large enough to contain enough information to answer a question; but small enough to fit into the LLM prompt -- maximum sequence length of 1024 tokens.

In [58]:
hub_config = {
    'HF_MODEL_ID': 'sentence-transformers/all-MiniLM-L6-v2', # model_id from hf.co/models
    'HF_TASK': 'feature-extraction'
}

huggingface_model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.6", # transformers version used
    pytorch_version="1.7", # pytorch version used
    py_version="py36", # python version of the DLC
)

Then we deploy the model as we did earlier for our generative LLM:

In [60]:
encoder = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name="minilm-demo"
)

------!

We can then create the embeddings like so:

In [61]:
out = encoder.predict({"inputs": ["some text here", "some more text goes here too"]})

We will see that we have two outputs (one for each of our input sentences):

In [62]:
len(out)

2

In [63]:
len(out[0]), len(out[1])

(8, 8)

In [64]:
len(out[0][0])

384

Perfect! There's just one problem, how do we transform these eight vector embeddings into a single sentence embedding? For this, we simply take the mean value across each vector dimension, like so:

In [65]:
import numpy as np

embeddings = np.mean(np.array(out), axis=1)
embeddings.shape

(2, 384)

Now we have two 384-dimensional vector embeddings, one for each of our input texts. To make our lives easier later, we will wrap this encoding process into a single function

In [17]:
from typing import List

def embed_docs(docs: List[str]) -> List[List[float]]:
    out = encoder.predict({'inputs': docs})
    embeddings = np.mean(np.array(out), axis=1)
    return embeddings.tolist()

## Configure LLM in LangChain

In [50]:
model_kwargs = {"max_new_tokens": 512, "top_p": 0.8, "temperature": 0.8}

In [20]:
class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        print (response_json)
        return response_json["generated_texts"][0]


content_handler = ContentHandler()

In [11]:
#to get logs
class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        input_data = {"text_inputs": prompt, **model_kwargs}
        input_str = json.dumps(input_data)
        
        print(f"Transformed Input: {input_str}")
        
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        
        print(f"Transformed Output: {response_json}")
        
        return response_json["generated_texts"][0]

content_handler = ContentHandler()


In [51]:
#adjust based on chatgpt suggestions
class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        input_data = {
            "inputs": prompt,  # Adjust this field based on the expected input format
            **model_kwargs,
        }
        input_str = json.dumps(input_data)
        return input_str.encode("utf-8")
     
    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        print (response_json)
        return response_json["generated_texts"][0]


content_handler = ContentHandler()


In [52]:
import boto3
sm_client = boto3.client("sagemaker-runtime") # needed for AWS credentials

llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    model_kwargs=model_kwargs,
    content_handler=content_handler,
    client=sm_client,
)



## Zero-shot example

In [53]:
system_prompt = """
As a helpful energy specialist, please answer the question, focusing on numerical data.
Don't invent facts. If you can't provide a factual answer, say you don't know what the answer is.

{question}
"""

prompt = PromptTemplate.from_template(system_prompt + "{content}")

In [54]:
system_prompt = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

prompt = PromptTemplate.from_template(system_prompt + "{context}")

In [55]:
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [40]:
context ="Solar investments in China have increased by 30% each year in the last decade"
question = "What is the latest trend for solar investments in China?"

query = f"question: {question}"

In [41]:
query = f"question: {question}"
print(query)

question: What is the latest trend for solar investments in China?


In [47]:
answer = llm_chain.run(text_input)
print(answer)

  warn_deprecated(
  warn_deprecated(


ValueError: Missing some input keys: {'question', 'context'}

In [61]:
from typing import Dict
import json
from langchain_community.llms.sagemaker_endpoint import SagemakerEndpoint, LLMContentHandler
#from langchain.chains import PromptTemplate, LLMChain

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        input_data = {
            "inputs": prompt,  # Adjust this field based on the expected input format
            **model_kwargs,
        }
        input_str = json.dumps(input_data)
        return input_str.encode("utf-8")

    def transform_output(self, output: 'StreamingBody') -> str:
            response_json = json.loads(output.read().decode("utf-8"))
            print(response_json)
            return response_json[0]["generated_text"]

content_handler = ContentHandler()

import boto3

# Replace 'endpoint_name' and 'model_kwargs' with your actual values
endpoint_name = "flan-t5-demo"
model_kwargs = {"max_new_tokens": 512, "top_p": 0.8, "temperature": 0.8}

sm_client = boto3.client("sagemaker-runtime")  # needed for AWS credentials

llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    model_kwargs=model_kwargs,
    content_handler=content_handler,
    client=sm_client,
)

system_prompt = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

prompt_template = PromptTemplate.from_template(system_prompt + "{context}")

llm_chain = LLMChain(llm=llm, prompt=prompt_template)

context = "Solar investments in China have increased by 30% each year in the last decade"
question = "What is the latest trend for solar investments in China?"

query = f"question: {question}"
print(query)

answer = llm_chain.run({"context": context, "question": query})
print(answer)


question: What is the latest trend for solar investments in China?


  warn_deprecated(
  warn_deprecated(


[{'generated_text': "I don't know"}]
I don't know


In [36]:
import json
import boto3

# Create a SageMaker client
sagemaker_runtime = boto3.client('sagemaker-runtime')  # Replace 'your-region' with your AWS region

# Define the input data
text_input = {
    "inputs": "Answer the following QUESTION based on the CONTEXT\ngiven. If you do not know the answer and the CONTEXT doesn't\ncontain the answer truthfully say \"I don't know\".\n\nCONTEXT:\nSolar investments in China have increased by 30% each year in the last decade\n\nQUESTION:\nWhat is the latest trend for solar investments in China?\n\nANSWER:\nSolar investments in China have increased by 30% each year in the last decade",
    "max_new_tokens": 512,
    "top_p": 0.8,
    "temperature": 0.8
}

# Call the SageMaker endpoint
try:
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName='flan-t5-demo',
        ContentType='application/json',
        Body=json.dumps(text_input)
    )

    result = json.loads(response['Body'].read().decode())
    print(result)

except Exception as e:
    print(f"Error raised by inference endpoint: {e}")


[{'generated_text': "I don't know"}]


## RAG example with PDF files

In [62]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFDirectoryLoader

In [44]:
#!pip install pypdf
#!pip install pypdf2

In [63]:
#this is working ,it is created as pypdf can't load s3 object directly
import boto3
import tempfile
from io import BytesIO
from langchain.document_loaders import PyPDFLoader
import os

# Specify your S3 bucket name and item name
bucket_name = "bo-automation"
item_name = "langchain-rag-demo/Coal2022.pdf"

# Create an S3 client
s3 = boto3.client("s3")

# Get the PDF file content from S3
response = s3.get_object(Bucket=bucket_name, Key=item_name)
pdf_content = response["Body"].read()

# Use BytesIO to create a file-like object from the PDF content
pdf_file = BytesIO(pdf_content)

# Save the contents to a temporary file
with tempfile.NamedTemporaryFile(delete=False) as temp_file:
    temp_file.write(pdf_content)
    temp_file_path = temp_file.name

# Create a PyPDFLoader instance and load the PDF document
loader = PyPDFLoader(temp_file_path)
docs = loader.load()
print(len(docs))

# Now you can work with the 'document' object, which represents the PDF content
# For example, you can access the pages: document.pages

# Optionally, delete the temporary file
os.remove(temp_file_path)

137


In [132]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


# Specify the chunk size and overlap
chunk_size = 180
chunk_overlap = 0

# Initialize the splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

all_chunks = []

# Assuming 'docs' is the list of loaded documents
for document in docs:
    # Extract text from the document
    #text = document.content  # Adjust this based on the actual structure of your Document object

    # Split the text into chunks
    chunks = splitter.split_documents(docs)

    # Add the chunks to the list
    all_chunks += chunks

    # Print information about the chunks
    print(f"Original text length: {len(docs)}, number of chunks: {len(chunks)}")


Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701
Original text length: 137, number of chunks: 1701


### Analyze documents with Amazon Textract and split them in chunks

In [48]:
import boto3
from PyPDF2 import PdfReader
from io import BytesIO

# Assuming you have the list of URIs
uris = ["s3://bo-automation/langchain-rag-demo/Coal2022.pdf","s3://bo-automation/langchain-rag-demo/WorldEnergyInvestment2023.pdf", "s3://bo-automation/langchain-rag-demo/WorldEnergyOutlook2023.pdf"]

for uri in uris:
    print(f"Loading {uri}")

    # Extract bucket name and object key from the URI
    uri_parts = uri.split("/")
    bucket_name = uri_parts[2]
    item_name = "/".join(uri_parts[3:])

    # Load the PDF using s3fs and PyPDF2
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, item_name)
    fs = obj.get()['Body'].read()
    pdf = PdfReader(BytesIO(fs))

    # Extract text using PyPDF2
    text = ""
    for page_num in range(len(pdf.pages)):
        page = pdf.pages[page_num]
        text += page.extract_text()

    # Process the extracted text as needed
    print(f"Loaded {uri}, text length: {len(text)}")
    #print(text)


ModuleNotFoundError: No module named 'PyPDF2'

### Embed document chunks and store them in FAISS
https://github.com/facebookresearch/faiss 

In [133]:
#from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

In [134]:
# Define embedding model
# See https://huggingface.co/spaces/mteb/leaderboard

embedding_model_id = "BAAI/bge-small-en-v1.5"

embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_id,
)

In [1]:
%%time
# Embed chunks
#embeddings_db = FAISS.from_documents(all_chunks, embeddings)
print(f"Number of chunks: {len(all_chunks)}")
#print(f"Number of embeddings: {len(embeddings)}")

# Embed chunks
embeddings_db = FAISS.from_documents(all_chunks, embeddings)


NameError: name 'all_chunks' is not defined

In [137]:
# Save database
embeddings_db.save_local("faiss_index")

### Shortcut : load existing embedding database

In [138]:
embeddings_db = FAISS.load_local("faiss_index", embeddings)

********

### Configure RAG chain

In [139]:
retriever = embeddings_db.as_retriever(search_kwargs={"k": 10})

In [76]:
# Define prompt template
prompt_template = """
As a helpful energy specialist, please answer the question below, focusing on numerical data and using only the context below.
Don't invent facts. If you can't provide a factual answer, say you don't know what the answer is.

question: {question}

context: {context}
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

In [140]:
#working version
# Define prompt template1
system_prompt = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

prompt_template = PromptTemplate.from_template(system_prompt + "{context}")

In [141]:
chain = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff",
    retriever=retriever, 
    chain_type_kwargs = {"prompt": prompt_template})

### Ask our question again

In [144]:
question = "outlook for Global coal consumption in 2025?"
answer = chain.run({"query": question})
#print(answer)

#answer = chain.run({"query": question})
print(answer)


  warn_deprecated(
  warn_deprecated(
  warn_deprecated(
  warn_deprecated(
  warn_deprecated(


[{'generated_text': 'By 2025, coal demand is forecast to increase by 2. 6 Mt, despite'}]
By 2025, coal demand is forecast to increase by 2. 6 Mt, despite


In [123]:
question = "what is the impact of High coal prices on coal mining projects?"
answer = chain.run({"query": question})
print(answer)

  warn_deprecated(
  warn_deprecated(
  warn_deprecated(
  warn_deprecated(
  warn_deprecated(


[{'generated_text': "I don't know"}]
I don't know


Alternate approach

In [66]:
def rag_query(question: str) -> str:
    # create query vec
    query_vec = embed_docs(question)[0]
    # query pinecone
    res = index.query(query_vec, top_k=5, include_metadata=True)
    # get contexts
    contexts = [match.metadata['text'] for match in res.matches]
    # build the multiple contexts string
    context_str = construct_context(contexts=contexts)
    # create our retrieval augmented prompt
    text_input = prompt_template.replace("{context}", context_str).replace("{question}", question)
    # make prediction
    out = llm.predict({"inputs": text_input})
    return out[0]["generated_text"]

In [67]:
rag_query("What does STEPS mean?")

NameError: name 'index' is not defined

## Delete endpoint and model

In [145]:
predictor.delete_model()
predictor.delete_endpoint()

In [151]:
import boto3

sagemaker = boto3.client('sagemaker')
response = sagemaker.list_endpoints()

if not response['Endpoints']:
    print("No active endpoints.")
