# Building a simple RAG chatbot with LangChain, Hugging Face, FAISS, Amazon SageMaker and Amazon Textract

In [1]:
%%sh
pip install sagemaker langchain amazon-textract-caller amazon-textract-textractor sentence-transformers pypdf pip install faiss-cpu -qU

In [2]:
import boto3, json, sagemaker
from typing import Dict
from langchain import LLMChain
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.llms import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Deploy LLM on SageMaker

In [3]:
#t5 small
# Hub Model configuration. https://huggingface.co/models
role = sagemaker.get_execution_role()

hub = {
	#'HF_MODEL_ID':'google/flan-t5-small',
    'HF_MODEL_ID':'google/flan-t5-xl',
	'SM_NUM_GPUS': json.dumps(1)
}



# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
	image_uri=get_huggingface_llm_image_uri("huggingface",version="1.1.0"),
	env=hub,
	role=role, 
)

# deploy model to SageMaker Inference
llm = huggingface_model.deploy(
	initial_instance_count=1,
	instance_type="ml.g4dn.4xlarge",
	container_startup_health_check_timeout=300,
    endpoint_name="flan-t5-demo"
  )
  
# send request
llm.predict({
	"inputs": "Translate to German:  My name is Arthur",
})

-------!

[{'generated_text': 'Ich bin Arthur.'}]

In [4]:
endpoint_name = llm.endpoint_name
endpoint_name

'flan-t5-demo'

Step 2. Ask a question to LLM without providing the context
To better illustrate why we need retrieval-augmented generation (RAG) based approach to solve the question and anwering problem. Let's directly ask the model a question and see how they respond.

In [5]:
question = "Which instances can I use with Managed Spot Training in SageMaker?"

out = llm.predict({"inputs": question})
out

[{'generated_text': 'SageMaker and SageMaker XL.'}]

Step 3. Improve the answer to the same question using prompt engineering with insightful context
To better answer the question well, we provide extra contextual information, combine it with a prompt, and send it to model together with the question. Below is an example.

In [6]:
context = """Managed Spot Training can be used with all instances
supported in Amazon SageMaker. Managed Spot Training is supported
in all AWS Regions where Amazon SageMaker is currently available."""

In [7]:
prompt_template = """Answer the following QUESTION based on the CONTEXT
given. If you do not know the answer and the CONTEXT doesn't
contain the answer truthfully say "I don't know".

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

text_input = prompt_template.replace("{context}", context).replace("{question}", question)

out = llm.predict({"inputs": text_input})
generated_text = out[0]["generated_text"]
print(f"[Input]: {question}\n[Output]: {generated_text}")

[Input]: Which instances can I use with Managed Spot Training in SageMaker?
[Output]: all instances supported in Amazon SageMaker


Let's see if our LLM is capable of following our instructions...

In [8]:
unanswerable_question = "What color is my desk?"

text_input = prompt_template.replace("{context}", context).replace("{question}", unanswerable_question)

out = llm.predict({"inputs": text_input})
generated_text = out[0]["generated_text"]
print(f"[Input]: {unanswerable_question}\n[Output]: {generated_text}")

[Input]: What color is my desk?
[Output]: I don't know


Step 4. Use RAG based approach to identify the correct documents, and use them along with prompt and question to query LLM
We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to LLM.

To achieve that, we will do following.

Generate embedings for each of document in the knowledge library with the MiniLM embedding model.
Identify top K most relevant documents based on user query.
For a query of your interest, generate the embedding of the query using the same embedding model.
Search the indexes of top K most relevant documents in the embedding space using the SageMaker KNN algorithm.
Use the indexes to retrieve the corresponded documents.
Combine the retrieved documents with prompt and question and send them into LLM.
Note: The retrieved document/text should be large enough to contain enough information to answer a question; but small enough to fit into the LLM prompt -- maximum sequence length of 1024 tokens.

In [9]:
hub_config = {
    'HF_MODEL_ID': 'sentence-transformers/all-MiniLM-L6-v2', # model_id from hf.co/models
    'HF_TASK': 'feature-extraction'
}

huggingface_model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.6", # transformers version used
    pytorch_version="1.7", # pytorch version used
    py_version="py36", # python version of the DLC
)

Then we deploy the model as we did earlier for our generative LLM:

In [11]:
encoder = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.2xlarge",
    endpoint_name="minilm-demo"
)

-----!

We can then create the embeddings like so:

In [12]:
out = encoder.predict({"inputs": ["some text here", "some more text goes here too"]})

We will see that we have two outputs (one for each of our input sentences):

In [13]:
len(out)

2

In [14]:
len(out[0]), len(out[1])

(8, 8)

In [15]:
len(out[0][0])

384

Perfect! There's just one problem, how do we transform these eight vector embeddings into a single sentence embedding? For this, we simply take the mean value across each vector dimension, like so:

In [16]:
import numpy as np

embeddings = np.mean(np.array(out), axis=1)
embeddings.shape

(2, 384)

Now we have two 384-dimensional vector embeddings, one for each of our input texts. To make our lives easier later, we will wrap this encoding process into a single function

In [17]:
from typing import List

def embed_docs(docs: List[str]) -> List[List[float]]:
    out = encoder.predict({'inputs': docs})
    embeddings = np.mean(np.array(out), axis=1)
    return embeddings.tolist()

## Configure LLM in LangChain

In [19]:
model_kwargs = {"max_new_tokens": 512, "top_p": 0.8, "temperature": 0.8}

In [20]:
class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        print (response_json)
        return response_json["generated_texts"][0]


content_handler = ContentHandler()

In [8]:
#modifed for flan t5
class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        input_str = json.dumps({"text_inputs": f"translate English to German: {prompt}", **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        print (response_json)
        return response_json["generated_texts"][0]


content_handler = ContentHandler()

In [50]:
from typing import Dict, List
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
import json

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs: list[str], model_kwargs: Dict) -> bytes:
       input_str = json.dumps({"text_inputs": inputs, **model_kwargs})
       return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
       response_json = json.loads(output.read().decode("utf-8"))
       return response_json["embedding"]

content_handler = ContentHandler()

In [21]:
sm_client = boto3.client("sagemaker-runtime") # needed for AWS credentials

llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    model_kwargs=model_kwargs,
    content_handler=content_handler,
    client=sm_client,
)

## Zero-shot example

In [22]:
system_prompt = """
As a helpful energy specialist, please answer the question, focusing on numerical data.
Don't invent facts. If you can't provide a factual answer, say you don't know what the answer is.
"""

prompt = PromptTemplate.from_template(system_prompt + "{content}")

In [23]:
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [24]:
question = "What is the latest trend for solar investments in China?"

query = f"question: {question}"

In [25]:
answer = llm_chain.run({query})
print(answer)

  warn_deprecated(
  warn_deprecated(


ValueError: Error raised by inference endpoint: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (422) from primary with message "Failed to deserialize the JSON body into the target type: missing field `inputs` at line 1 column 336". See https://eu-north-1.console.aws.amazon.com/cloudwatch/home?region=eu-north-1#logEventViewer:group=/aws/sagemaker/Endpoints/flan-t5-demo in account 254455524940 for more information.

In [56]:
from langchain.docstore.document import Document
example_doc_1 = """
Peter and Elizabeth took a taxi to attend the night party in the city. While in the party, Elizabeth collapsed and was rushed to the hospital.
Since she was diagnosed with a brain injury, the doctor told Peter to stay besides her until she gets well.
Therefore, Peter stayed with her at the hospital for 3 days without leaving.
"""

docs = [
    Document(
        page_content=example_doc_1,
    )
]

from typing import Dict

from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains.question_answering import load_qa_chain
import json
from typing import Dict, List
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
import json

query = """How long was Elizabeth hospitalized?
"""

prompt_template = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs: list[str], model_kwargs: Dict) -> bytes:
       input_str = json.dumps({"text_inputs": inputs, **model_kwargs})
       return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
       response_json = json.loads(output.read().decode("utf-8"))
       return response_json["embedding"]

content_handler = ContentHandler()

chain = load_qa_chain(
    llm=SagemakerEndpoint(
        endpoint_name="XYZ",
        credentials_profile_name="XYZ",
        region_name="XYZ",
        model_kwargs={"temperature": 1e-10},
        content_handler=content_handler,
    ),
    prompt=PROMPT,
)

chain({"input_documents": docs, "question": query}, return_only_outputs=True)

ValidationError: 1 validation error for SagemakerEndpoint
__root__
  Could not load credentials to authenticate with AWS client. Please check that credentials in the specified profile name are valid. (type=value_error)

## RAG example with PDF files

In [28]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFDirectoryLoader

### Upload local PDF files to S3

Sources:
* https://www.iea.org/reports/world-energy-investment-2023
* https://www.iea.org/reports/coal-2022
* https://www.iea.org/reports/world-energy-outlook-2023

Feel free to use your own files, the code below should work without any change.

In [29]:
# Define S3 bucket and prefix for PDF storage

bucket = "bo-automation"
prefix = "langchain-rag-demo"

In [15]:
%%sh -s $bucket $prefix
aws s3 cp --recursive pdfs s3://$1/$2/


The user-provided path pdfs does not exist.


CalledProcessError: Command 'b'aws s3 cp --recursive pdfs s3://$1/$2/\n'' returned non-zero exit status 255.

In [24]:
import boto3

# Create an S3 client
s3 = boto3.client('s3')

# Get the current AWS account ID
account_id = boto3.client('sts').get_caller_identity().get('Account')

# Get the default S3 bucket associated with the current AWS session
default_bucket = f'sagemaker-{account_id}'

print(f"Default S3 Bucket: {default_bucket}")


Default S3 Bucket: sagemaker-254455524940


In [16]:
import boto3

# Replace 'your-s3-bucket-name' with the actual name of your S3 bucket
bucket_name = 'bo-automation'

# Create an S3 client
s3 = boto3.client('s3')

# Validate that the specified bucket exists
try:
    s3.head_bucket(Bucket=bucket_name)
except s3.exceptions.ClientError as e:
    if e.response['Error']['Code'] == '404':
        print(f"The specified bucket '{bucket_name}' does not exist.")
    else:
        print(f"Error accessing the bucket '{bucket_name}': {e}")
    # You might want to handle this error appropriately based on your use case.

# Assign the bucket variable
bucket = bucket_name

# Now you can use the 'bucket' variable in your code
print(f"The S3 bucket is: {bucket}")


The S3 bucket is: bo-automation


In [48]:
# Build list of S3 URIs

s3 = boto3.client("s3")
objs = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
objs = objs['Contents']
uris = [f's3://{bucket}/{obj["Key"]}' for obj in objs]
uris    

['s3://bo-automation/langchain-rag-demo/',
 's3://bo-automation/langchain-rag-demo/Coal2022.pdf',
 's3://bo-automation/langchain-rag-demo/WorldEnergyInvestment2023.pdf',
 's3://bo-automation/langchain-rag-demo/WorldEnergyOutlook2023.pdf']

In [49]:
print(uris)

['s3://bo-automation/langchain-rag-demo/', 's3://bo-automation/langchain-rag-demo/Coal2022.pdf', 's3://bo-automation/langchain-rag-demo/WorldEnergyInvestment2023.pdf', 's3://bo-automation/langchain-rag-demo/WorldEnergyOutlook2023.pdf']


In [19]:
!pip install pypdf
!pip install pypdf2

Collecting pypdf2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 kB[0m [31m12.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pypdf2
Successfully installed pypdf2-3.0.1


In [30]:
#this is working ,it is created as pypdf can't load s3 object directly
import boto3
import tempfile
from io import BytesIO
from langchain.document_loaders import PyPDFLoader
import os

# Specify your S3 bucket name and item name
bucket_name = "bo-automation"
item_name = "langchain-rag-demo/Coal2022.pdf"

# Create an S3 client
s3 = boto3.client("s3")

# Get the PDF file content from S3
response = s3.get_object(Bucket=bucket_name, Key=item_name)
pdf_content = response["Body"].read()

# Use BytesIO to create a file-like object from the PDF content
pdf_file = BytesIO(pdf_content)

# Save the contents to a temporary file
with tempfile.NamedTemporaryFile(delete=False) as temp_file:
    temp_file.write(pdf_content)
    temp_file_path = temp_file.name

# Create a PyPDFLoader instance and load the PDF document
loader = PyPDFLoader(temp_file_path)
docs = loader.load()
print(len(docs))

# Now you can work with the 'document' object, which represents the PDF content
# For example, you can access the pages: document.pages

# Optionally, delete the temporary file
os.remove(temp_file_path)

137


In [32]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a splitter instance
splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=0)

# Assuming 'docs' is the list of loaded documents
for document in docs:
    # Extract text from the document
    #text = document.content  # Adjust this based on the actual structure of your Document object

    # Split the text into chunks
    chunks = splitter.split_documents(docs)

    # Add the chunks to the list
    all_chunks += chunks

    # Print information about the chunks
    print(f"Original text length: {len(docs)}, number of chunks: {len(chunks)}")


Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108
Original text length: 137, number of chunks: 1108


In [31]:
import boto3
from io import BytesIO
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader
#from langchain.text import RecursiveCharacterTextSplitter

# Specify your S3 bucket name and prefix
bucket_name = "bo-automation"
prefix = "langchain-rag-demo/"

# Create an S3 client
s3 = boto3.client("s3")

# List all objects in the S3 bucket with the given prefix
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
uris = [f"s3://{bucket_name}/{obj['Key']}" for obj in response.get('Contents', [])]

# Specify the chunk size and overlap
chunk_size = 256
chunk_overlap = 0

# Initialize the splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

all_chunks = []

for uri in uris:
    print(f"Loading {uri}")

    # Load the PDF using PyPDFDirectoryLoader
    loader = PyPDFDirectoryLoader(uri)
    document = loader.load()

    # Split the document into chunks
    chunks = splitter.split_documents(document)

    # Add the chunks to the list
    all_chunks += chunks

    print(f"Loaded {uri}, {len(document)} pages, {len(chunks)} chunks")

# Now 'all_chunks' contains the text chunks from all the loaded PDFs


Loading s3://bo-automation/langchain-rag-demo/
Loaded s3://bo-automation/langchain-rag-demo/, 0 pages, 0 chunks
Loading s3://bo-automation/langchain-rag-demo/Coal2022.pdf
Loaded s3://bo-automation/langchain-rag-demo/Coal2022.pdf, 0 pages, 0 chunks
Loading s3://bo-automation/langchain-rag-demo/WorldEnergyInvestment2023.pdf
Loaded s3://bo-automation/langchain-rag-demo/WorldEnergyInvestment2023.pdf, 0 pages, 0 chunks
Loading s3://bo-automation/langchain-rag-demo/WorldEnergyOutlook2023.pdf
Loaded s3://bo-automation/langchain-rag-demo/WorldEnergyOutlook2023.pdf, 0 pages, 0 chunks


In [33]:
loader = PyPDFDirectoryLoader("s3://bo-automation/langchain-rag-demo/")
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
#loader = PyPDFDirectoryLoader("langchain-rag-demo/")
docs = loader.load()
len(docs)



0

In [17]:
for uri in uris:
    print(f"Loading {uri}")
    loader = PyPDFDirectoryLoader(uri)
    document = loader.load()
    print(f"Loaded {uri}, {len(document)} pages")
    print(document)  # Print the loaded document for inspection


NameError: name 'uris' is not defined

In [34]:

splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=0)

all_chunks = []

for uri in uris:
    loader = PyPDFDirectoryLoader(uri)
    document = loader.load()
    chunks = splitter.split_documents(document)
    all_chunks += chunks
    print(f"Loaded {uri}, {len(document)} pages, {len(chunks)} chunks")


Loaded s3://bo-automation/langchain-rag-demo/, 0 pages, 0 chunks
Loaded s3://bo-automation/langchain-rag-demo/Coal2022.pdf, 0 pages, 0 chunks
Loaded s3://bo-automation/langchain-rag-demo/WorldEnergyInvestment2023.pdf, 0 pages, 0 chunks
Loaded s3://bo-automation/langchain-rag-demo/WorldEnergyOutlook2023.pdf, 0 pages, 0 chunks


### Analyze documents with Amazon Textract and split them in chunks

In [33]:
!pip install PyPDF2
import s3fs
from PyPDF2 import PdfReader

# Specify the S3 path to the PDF file
pdf_file_path = "s3://bo-automation/langchain-rag-demo/Coal2022.pdf"

# Use s3fs to open the file from S3
fs = s3fs.S3FileSystem()
with fs.open(pdf_file_path, "rb") as file:
    # Use PdfReader instead of PdfFileReader
    pdf_reader = PdfReader(file)
    
    # Get the number of pages using len(reader.pages)
    num_pages = len(pdf_reader.pages)
    
    # Do further processing as needed
    print(f"Number of pages: {num_pages}")



Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Number of pages: 137


In [51]:
import boto3
from PyPDF2 import PdfReader
from io import BytesIO

# Assuming you have the list of URIs
uris = ["s3://bo-automation/langchain-rag-demo/Coal2022.pdf","s3://bo-automation/langchain-rag-demo/WorldEnergyInvestment2023.pdf", "s3://bo-automation/langchain-rag-demo/WorldEnergyOutlook2023.pdf"]

for uri in uris:
    print(f"Loading {uri}")

    # Extract bucket name and object key from the URI
    uri_parts = uri.split("/")
    bucket_name = uri_parts[2]
    item_name = "/".join(uri_parts[3:])

    # Load the PDF using s3fs and PyPDF2
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, item_name)
    fs = obj.get()['Body'].read()
    pdf = PdfReader(BytesIO(fs))

    # Extract text using PyPDF2
    text = ""
    for page_num in range(len(pdf.pages)):
        page = pdf.pages[page_num]
        text += page.extract_text()

    # Process the extracted text as needed
    print(f"Loaded {uri}, text length: {len(text)}")
    #print(text)


Loading s3://bo-automation/langchain-rag-demo/Coal2022.pdf
Loaded s3://bo-automation/langchain-rag-demo/Coal2022.pdf, text length: 221136
Loading s3://bo-automation/langchain-rag-demo/WorldEnergyInvestment2023.pdf
Loaded s3://bo-automation/langchain-rag-demo/WorldEnergyInvestment2023.pdf, text length: 280452
Loading s3://bo-automation/langchain-rag-demo/WorldEnergyOutlook2023.pdf
Loaded s3://bo-automation/langchain-rag-demo/WorldEnergyOutlook2023.pdf, text length: 915956


In [77]:
from langchain.document_loaders.pdf import PyPDFLoader
#from langchain.text.chunkers.character import RecursiveCharacterTextSplitter

# Specify the S3 URI for the single PDF document
single_uri = "s3://bo-automation/langchain-rag-demo/Coal2022.pdf"

# Create a PyPDFLoader instance and load the document
loader = PyPDFLoader(single_uri)
document = loader.load()

# Print the result for debugging
print(f"Loaded {single_uri}, {len(document)} pages")

# Check if the document is loaded successfully
if document:
    # Specify the chunk size and overlap
    chunk_size = 256
    chunk_overlap = 0

    # Create a RecursiveCharacterTextSplitter instance
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    # Split the text into chunks
    chunks = splitter.split_documents(document)

    # Print information about the chunks
    print(f"Number of chunks: {len(chunks)}")
else:
    print("Loading document failed.")


FileNotFoundError: [Errno 2] No such file or directory: 's3://bo-automation/langchain-rag-demo/Coal2022.pdf'

In [35]:
def split_text_into_chunks(text, chunk_size=256, chunk_overlap=0):
    chunks = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - chunk_overlap

    return chunks

# Assuming you have the list of URIs
uris = ["s3://bo-automation/langchain-rag-demo/Coal2022.pdf",
        "s3://bo-automation/langchain-rag-demo/WorldEnergyInvestment2023.pdf",
        "s3://bo-automation/langchain-rag-demo/WorldEnergyOutlook2023.pdf"]

# Specify the chunk size and overlap
chunk_size = 256
chunk_overlap = 0

all_chunks = []

for uri in uris:
    print(f"Loading {uri}")

    # Extract bucket name and object key from the URI
    uri_parts = uri.split("/")
    bucket_name = uri_parts[2]
    item_name = "/".join(uri_parts[3:])

    # Load the PDF using s3fs and PyPDF2
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, item_name)
    fs = obj.get()['Body'].read()
    pdf = PdfReader(BytesIO(fs))

    # Extract text using PyPDF2
    text = ""
    for page_num in range(len(pdf.pages)):
        page = pdf.pages[page_num]
        text += page.extract_text()

    # Split the text into chunks
    text_chunks = split_text_into_chunks(text, chunk_size, chunk_overlap)

    # Print information about the chunks
    print(f"Loaded {uri}, original text length: {len(text)}, number of chunks: {len(text_chunks)}")

    # Process the text chunks as needed
    #for i, chunk in enumerate(text_chunks):
        #print(f"Chunk {i + 1} length: {len(chunk)}")
        # Process each chunk as needed
    # Append chunks to the list
    all_chunks += text_chunks
    print(f"number of chunks: {len(all_chunks)}")

Loading s3://bo-automation/langchain-rag-demo/Coal2022.pdf


NameError: name 'PdfReader' is not defined

### Embed document chunks and store them in FAISS
https://github.com/facebookresearch/faiss 

In [36]:
#from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

In [43]:
# Define embedding model
# See https://huggingface.co/spaces/mteb/leaderboard

embedding_model_id = "BAAI/bge-small-en-v1.5"

embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_id,
)

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/90.3k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

In [44]:
%%time
# Embed chunks
#embeddings_db = FAISS.from_documents(all_chunks, embeddings)
print(f"Number of chunks: {len(all_chunks)}")
#print(f"Number of embeddings: {len(embeddings)}")

# Embed chunks
embeddings_db = FAISS.from_documents(all_chunks, embeddings)


Number of chunks: 152904
CPU times: user 3min 43s, sys: 4.48 s, total: 3min 48s
Wall time: 3min 24s


In [45]:
# Save database
embeddings_db.save_local("faiss_index")

### Shortcut : load existing embedding database

In [46]:
embeddings_db = FAISS.load_local("faiss_index", embeddings)

********

### Configure RAG chain

In [58]:
retriever = embeddings_db.as_retriever(search_kwargs={"k": 10})

In [59]:
# Define prompt template
prompt_template = """
As a helpful energy specialist, please answer the question below, focusing on numerical data and using only the context below.
Don't invent facts. If you can't provide a factual answer, say you don't know what the answer is.

question: {question}

context: {context}
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

In [60]:
chain = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff",
    retriever=retriever, 
    chain_type_kwargs = {"prompt": prompt})

### Ask our question again

In [62]:
question = "What is the latest trend for coal demand in Europe?"
answer = chain.run({"query": question})
print(answer)

Expected value not found in generated_text


In [None]:
question = "What does STEPS mean?"
answer = chain.run({"query": question})
print(answer)

## Delete endpoint and model

In [33]:
predictor.delete_model()
predictor.delete_endpoint()

In [33]:
import boto3

sagemaker = boto3.client('sagemaker')
response = sagemaker.list_endpoints()

if not response['Endpoints']:
    print("No active endpoints.")
