<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel: PySpark</strong>
</div>

## Lab 03. Build a Retrieval Augmented Generation System using Amazon EMR Spark Distributed Processing and an OpenSearch Vector Database
---

## Contents

- [Overview](#overview)
- [Connect to an Existing EMR Cluster](#connect-to-an-existing-emr-cluster)
- [Upload Files from Local to S3](#upload-files-from-local-to-s3)
- [Convert PDF to Text](#convert-pdf-to-text)
- [Run Parallelized Embeddings using Amazon EMR (EC2 Spark based Processing)](#run-parallelized-embeddings-using-amazon-emr-ec2-spark-based-processing)
- [Putting it All Together](#putting-it-all-together)


## Overview

In this notebook, we demonstrate how to build a Retrieval Augmented Generation (RAG) system using the following components:
1. Embedding Model: `BAAI/bge-base-en-v1.5`
2. Text Generation Model: `meta-/llama2-7b-chat`
3. Vector Database: OpenSearch, used to store the embeddings
4. Streamlit UI: A chat interface to talk to your documents

## Connect to an Existing EMR Cluster

### Why an empty cell, you ask?

Connect to an EMR cluster while this cell is selected: click the `Cluster` button in the top-right section of this JupyterLab window, select a cluster, click `Connect`, select `No Credentials`, and voila!

<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;"> Stop! Please read this!
</div>


In [None]:
%load_ext sagemaker_studio_analytics_extension.magics

In [None]:
%%help

## Upload Files from Local to S3

In [None]:
%%local
!python3 -m pip install setuptools

In [None]:
%%local
!python3 -m pip install sagemaker==2.192.0

In [None]:
%%local
import os
import json
import glob
import boto3
import sagemaker
from tqdm import tqdm

In [None]:
%%local
REGION = "us-west-2"
sess = sagemaker.Session()
default_bucket = sess.default_bucket()
s3_client = boto3.client("s3")

print(f"Using default bucket ---> {default_bucket}")

A few sample PDF files are available under `./AWSGuides/`. These are the sample documents we'll use to build our RAG application.

In [None]:
%%local
def upload_raw_pdf_files_to_bucket(destination_bucket, destination_prefix, raw_pdf_files):
    
    print(f"Uploading ---> {len(raw_pdf_files)} files!")
    
    uploaded_file_s3uris = []
    for pdf_file in tqdm(raw_pdf_files, total=len(raw_pdf_files)):
        pdf_fname = os.path.basename(pdf_file).replace(",", "").replace(" ", "-")
        
        pdf_dest_prefix = os.path.join(destination_prefix, pdf_fname)
        
        s3_client.upload_file(
            pdf_file, 
            destination_bucket, 
            pdf_dest_prefix
        )
        uploaded_file_s3uris.append(f"s3://{destination_bucket}/{pdf_dest_prefix}")
    
    return uploaded_file_s3uris

pdf_files_to_upload = glob.glob("./AWSGuides/*.pdf")

destination_prefix = "Lab03/raw-pdfs"

files_paths_in_s3 = upload_raw_pdf_files_to_bucket(
    destination_bucket=default_bucket, 
    destination_prefix=destination_prefix,
    raw_pdf_files=pdf_files_to_upload
)

print(f"Uploaded files to ---> {files_paths_in_s3}")
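
The key naming used inside `upload_raw_pdf_files_to_bucket` can be checked without touching S3. The sketch below mirrors the same clean-up (commas removed, spaces replaced with hyphens). `build_s3_key` is a hypothetical helper that is not part of the lab code, the file name is made up, and joining with `posixpath` is an assumption made so that keys use forward slashes on any operating system.

```python
import os
import posixpath


def build_s3_key(destination_prefix, local_path):
    """Derive the S3 object key for a local PDF (hypothetical helper)."""
    # Same clean-up as the upload function: drop commas, replace spaces with hyphens
    fname = os.path.basename(local_path).replace(",", "").replace(" ", "-")
    # posixpath.join always uses "/" as the separator
    return posixpath.join(destination_prefix, fname)


# Made-up file name, used only to show the transformation
key = build_s3_key("Lab03/raw-pdfs", "./AWSGuides/Amazon SageMaker, Developer Guide.pdf")
print(key)  # Lab03/raw-pdfs/Amazon-SageMaker-Developer-Guide.pdf
```

Running a check like this before uploading makes it easier to predict the `s3://` URIs that the upload cell prints.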

Let's send these variables from our local instance to the PySpark primary node using the `%%send_to_spark` command.

In [None]:
%%send_to_spark -i REGION -t str -n REGION

In [None]:
%%send_to_spark -i destination_prefix -t str -n SRC_FILE_PREFIX

In [None]:
%%send_to_spark -i default_bucket -t str -n SRC_BUCKET_NAME

## Convert PDF to Text

In [None]:
import os
import boto3
import json
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
import io

In [None]:
print(f"Source bucket and prefix to read pdf files ---> {SRC_BUCKET_NAME} {SRC_FILE_PREFIX}")

In [None]:
def list_files_in_s3_bucket_prefix(bucket_name, prefix):
    
    s3 = boto3.client('s3')

    # Paginate through the objects in the specified bucket and prefix, and collect all keys (file paths)
    paginator = s3.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

    file_paths = []
    for page in page_iterator:
        if "Contents" in page:
            for obj in page["Contents"]:
                if os.path.basename(obj["Key"]):
                    file_paths.append(obj["Key"])

    return file_paths

all_pdf_files = list_files_in_s3_bucket_prefix(
    bucket_name=SRC_BUCKET_NAME, 
    prefix=SRC_FILE_PREFIX
)
print(f"Found {len(all_pdf_files)} files ---> {all_pdf_files}")
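
The filtering logic in `list_files_in_s3_bucket_prefix` can be exercised without an S3 call. The sketch below runs the same checks over hand-written pages shaped like `list_objects_v2` results; `collect_file_keys` and the keys in `fake_pages` are made up for illustration.

```python
import os


def collect_file_keys(pages):
    """Collect object keys from list_objects_v2-style pages, skipping folder placeholders."""
    file_paths = []
    for page in pages:
        # A page without "Contents" means no objects matched the prefix
        for obj in page.get("Contents", []):
            # Keys ending in "/" have an empty basename and are skipped
            if os.path.basename(obj["Key"]):
                file_paths.append(obj["Key"])
    return file_paths


# Hand-written stand-in for paginator output (made-up keys)
fake_pages = [
    {"Contents": [{"Key": "Lab03/raw-pdfs/"}, {"Key": "Lab03/raw-pdfs/guide-a.pdf"}]},
    {"Contents": [{"Key": "Lab03/raw-pdfs/guide-b.pdf"}]},
    {},
]
print(collect_file_keys(fake_pages))
```

The `os.path.basename` check is what drops the `Lab03/raw-pdfs/` placeholder key, so only real files reach the Spark job.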

Let's pair each file key with its bucket name, so that every Spark task knows which bucket to download from.

In [None]:
all_pdf_files = [(SRC_BUCKET_NAME, fpath) for fpath in all_pdf_files]
type(all_pdf_files)

Let's convert our list to a Spark RDD so that it can be processed in parallel.

In [None]:
pdfs_rdd = spark.sparkContext.parallelize(all_pdf_files)
type(pdfs_rdd)

Each core node fetches a PDF file from our list, downloads it into memory, and returns a PyPDF2 `PdfReader` reference for downstream workloads.

![EMR Read PDFs into Memory](./media/EMR-Doc-Read.jpg)

In [None]:
def load_pdf_from_s3_into_memory(row):
    """
    Load a PDF file from an S3 bucket directly into memory.

    Returns (key, PdfReader, page_count). On failure, returns (key, None, 0)
    so that callers can always unpack three values.
    """
    src_bucket_name, src_file_key = row
    try:
        s3 = boto3.client('s3')
        pdf_file = io.BytesIO()
        s3.download_fileobj(src_bucket_name, src_file_key, pdf_file)
        pdf_file.seek(0)  # rewind so PdfReader reads from the start of the buffer
        pdf_reader = PdfReader(pdf_file)
        return (src_file_key, pdf_reader, len(pdf_reader.pages))

    except Exception as e:
        print(f"Failed to load {src_file_key}: {e}")
        return (src_file_key, None, 0)
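
The `pdf_file.seek(0)` call in `load_pdf_from_s3_into_memory` is easy to overlook. The sketch below shows why it is needed, assuming the download leaves the stream position at the end of the data, as an ordinary write does. The bytes are a made-up stand-in for a downloaded PDF.

```python
import io

buffer = io.BytesIO()
# Stands in for s3.download_fileobj writing the object into the buffer
buffer.write(b"%PDF-1.4 example bytes")

at_end = buffer.read()      # the position is at the end after writing, so nothing is read
buffer.seek(0)              # rewind to the start of the buffer
from_start = buffer.read()  # now the full content is available

print(at_end)
print(from_start)
```

Without the rewind, a reader that starts from the current position would see an empty stream.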

Let's load the PDF files into memory concurrently using the RDD `map` operation, then collect the results back on our primary node.

In [None]:
pdfs_in_memory = pdfs_rdd.map(load_pdf_from_s3_into_memory).collect()

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x_labels = [pdfx.split('/')[-1] for pdfx, _, _ in pdfs_in_memory]
y_values = [pages_count for _, _, pages_count in pdfs_in_memory]
x = range(len(y_values))

# Create a figure and a set of subplots
fig, axs = plt.subplots(2, 1, figsize=(10, 10))

# First Subplot: Bar Chart
axs[0].bar(x, y_values, color=['red', 'green', 'blue'])
axs[0].set_title('Bar Chart')
axs[0].set_xticks(x)
axs[0].set_xticklabels(x_labels, rotation=45, ha="right")
axs[0].set_ylabel('Pdf Pages Count --->')

_bottom = 0
for (pdf_name, page_count, color) in zip(x_labels, y_values, ['red', 'green', 'blue']):
    axs[1].bar([0], [page_count], bottom=_bottom, color=color, label=pdf_name)
    _bottom += page_count
axs[1].set_title('Stacked Bar Chart')
axs[1].set_xticks([0])
axs[1].set_xticklabels(['Documents'], rotation=45, ha="right")
axs[1].set_ylabel('Stacked Pages Count --->')

# Add a legend to the second subplot
axs[1].legend()

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()

%matplot plt

In [None]:
class CustomDocument:
    def __init__(self, text, path, number):
        self.page_content = text
        self.metadata = {
            'source': path, 
            'page': number  
        }

    def __repr__(self):
        # This method is for representing the object in a way that’s clear to a human (also can be used for debugging)
        return f"Document(page_content='{self.page_content}', metadata={self.metadata})"

    # Optionally, if you need a string representation of the instance that is more user-friendly, 
    # you can implement the __str__ method
    def __str__(self):
        return f"Page Content: {self.page_content}\nSource: {self.metadata['source']}\nPage Number: {self.metadata['page']}"
    
def extract_text_from_pdf_reader(row):
    """ 
    Extract text from a page of the document 
    """
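    # Unpack outside the try block so both names are defined in the except branch
    doc_path, page_num = row
    try:
        page_text = global_pdfs_in_mem_dict[doc_path].pages[page_num].extract_text()
        return page_text, doc_path, page_num
    except Exception as e:
        # The error message takes the place of the page text so the failure stays visible
        return str(e), doc_path, page_num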

In [None]:
global_pdfs_in_mem_dict = {_key: pdf_reader for _key, pdf_reader, _ in pdfs_in_memory}

In [None]:
docs_instances = []
for (file_src, _, page_count) in pdfs_in_memory:
    for pg_num in range(page_count):
        docs_instances.append((file_src, pg_num))
print(f"Created {len(docs_instances)} parallel instances to process!")

In [None]:
docs_instances_rdd = spark.sparkContext.parallelize(docs_instances)

Every PDF document has `n` pages to process, and this work can be executed in parallel using Spark.

Each document is split page by page, and each page is read from a global dictionary of in-memory PDFs.

![PageLevelProcessingEMRPDFtoTxt](./media/PageLevelProcessingEMRPDFtoTxt.jpg)
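
The page-level fan-out can be sketched in plain Python. The hypothetical `build_page_tasks` helper below expands per-document page counts into one task per page, which is the shape of the list handed to `parallelize`. The lab code carries a `PdfReader` in each tuple as well; it is left out here, and the document names and page counts are made up.

```python
def build_page_tasks(pdf_summaries):
    """Expand (key, page_count) pairs into one (key, page_number) task per page."""
    return [
        (key, page_number)
        for key, page_count in pdf_summaries
        for page_number in range(page_count)
    ]


# Made-up documents and page counts
tasks = build_page_tasks([("guide-a.pdf", 2), ("guide-b.pdf", 3)])
print(len(tasks))
print(tasks[:3])
```

Because every task is an independent `(key, page_number)` pair, Spark is free to schedule pages of the same document on different executors.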

In [None]:
documents = docs_instances_rdd.map(extract_text_from_pdf_reader).collect()
documents_custom = [
    CustomDocument(text=text, path=doc_source, number=page_num) 
    for text, doc_source, page_num in documents
]

In [None]:
documents_custom[121]

We split pages using a reference chunk size. The chunk size is an experimental value that you can tune. To learn more about chunk size and how `RecursiveCharacterTextSplitter` works, see: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
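
To build intuition for what a chunk size and a chunk overlap mean, here is a deliberately simplified sketch that cuts text into fixed-size character windows. This is not how `RecursiveCharacterTextSplitter` works internally (it splits on separators such as paragraph breaks, newlines, and spaces before merging pieces); `sliding_window_chunks` is a hypothetical helper used only to illustrate the two parameters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap (simplified illustration)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each window repeats the last three characters of the previous one, which is the role overlap plays: context that straddles a chunk boundary appears in both neighbours.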

In [None]:
global_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
docs = global_text_splitter.split_documents(documents_custom)
print(f"Total number of docs pre-split {len(documents_custom)} | after split {len(docs)}")

In [None]:
# create data
plt.clf()

x_labels = ["Pre-Split", "Post Split"]
y = [len(documents_custom), len(docs)]
x = range(len(x_labels))

fig, axs = plt.subplots(1, 1, figsize=(7, 5))

# First Subplot: Bar Chart
axs.bar(x, y, color=["red", "blue"])
axs.set_title('Pre/Post RecursiveCharacterTextSplitter Split')
axs.set_xticks(x)
axs.set_xticklabels(x_labels, rotation=45, ha="right")
axs.set_ylabel('Text # to Process -->')

# No legend is needed here: the bars carry no labels, so axs.legend() would only emit a warning

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()

%matplot plt

In [None]:
print(docs[1001])

## Run Parallelized Embeddings using Amazon EMR (EC2 Spark based Processing)

In [None]:
def generate_embeddings(input_text_sample):
    
    assert isinstance(input_text_sample, str), f"Input must be a single string but found {type(input_text_sample)}"
    
    lambda_client = boto3.client('lambda', region_name='us-west-2') 

    # Prepare the data to send to the Lambda function
    data = {
        "input": input_text_sample
    }

    # Invoke the Lambda function
    response = lambda_client.invoke(
        FunctionName="invokeEmbeddingEndpoint",
        InvocationType="RequestResponse",
        Payload=json.dumps(data)
    )

    # Decode and load the response payload
    response_payload = json.loads(response['Payload'].read().decode("utf-8"))

    # Extract status and embeddings from the response
    status_code, embeddings = int(response_payload['statusCode']), json.loads(response_payload['body'])

    return status_code, embeddings
    
class EmbeddingsGenerator:
    
    @staticmethod
    def embed_documents(input_text, normalize=True):
        """
        Generate embeddings for the provided text, invoking a Lambda function.
        """
        assert isinstance(input_text, list), "Input type must be a list for the embed_documents function"
        
        input_text_rdd = spark.sparkContext.parallelize(input_text)
        
        embeddings_generated = input_text_rdd.map(generate_embeddings).collect()
        
        embedding_response = []
        for s_code, embeddings in embeddings_generated:
            if s_code == 200:
                embedding_response.append(embeddings)
            else:
                pass
        
        return embedding_response
    
    @staticmethod
    def embed_query(input_text):
        status_code, embedding = generate_embeddings(input_text)
        if status_code == 200:
            return embedding
        else:
            return None

In [None]:
response_code, sample_sentence_embedding = generate_embeddings(docs[1000].page_content)
print(f"Status {response_code}, Embedding size of the document --->", len(sample_sentence_embedding))
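
The response handling in `generate_embeddings` can be tried offline. The sketch below parses a hand-written payload with the shape that function expects: a `statusCode` field plus a `body` field holding a JSON-encoded list. That shape is inferred from the code above; the actual `invokeEmbeddingEndpoint` Lambda is not shown in this notebook, and the 3-dimensional vector is made up.

```python
import json


def parse_embedding_response(raw_payload):
    """Decode a Lambda-style response into (status_code, embeddings)."""
    response_payload = json.loads(raw_payload.decode("utf-8"))
    status_code = int(response_payload["statusCode"])
    # The body is itself a JSON string, so it needs a second json.loads
    embeddings = json.loads(response_payload["body"])
    return status_code, embeddings


# Hand-written payload with a made-up 3-dimensional vector
fake_payload = json.dumps(
    {"statusCode": 200, "body": json.dumps([0.1, -0.2, 0.3])}
).encode("utf-8")

status, vector = parse_embedding_response(fake_payload)
print(status, len(vector))
```

The double decoding is the part worth noticing: the outer document is JSON, and its `body` is a JSON string nested inside it.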

In [None]:
%%local
INDEX_NAME_OSE = "amz-guides-index"
with open("../studio-local-ui/indexname.txt", "w") as f:
    f.write(INDEX_NAME_OSE)

In [None]:
%%send_to_spark -i INDEX_NAME_OSE -t str -n INDEX_NAME_OSE

In [None]:
%%local
def get_secret(secret_name, region_name="us-west-2"):
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )
    get_secret_value_response = client.get_secret_value(
        SecretId=secret_name
    )
    secrets = json.loads(get_secret_value_response['SecretString'])
    user = secrets['username']
    pwd = secrets['password']
    return user, pwd

# Use the function
my_secret_name = "OpenSearchSecret-workshop-studio-v2-cfn"  
user, pwd = get_secret(my_secret_name, REGION)
print(f"Session user ---> {user}")  # avoid printing the password in notebook output

# write data to a local file 
with open("../studio-local-ui/opensearchlogin.txt", "w") as f:
    f.write(f"{user}|||{pwd}")
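
The login file uses a simple `user|||pwd` format. How `rag_app.py` reads it back is not shown in this notebook, so the sketch below is only an assumption about a reasonable round trip: `write_login` and `read_login` are hypothetical helpers, and the credentials are placeholders written to a temporary directory.

```python
import os
import tempfile

SEPARATOR = "|||"


def write_login(path, user, pwd):
    """Write the credentials in the user|||pwd format."""
    with open(path, "w") as f:
        f.write(f"{user}{SEPARATOR}{pwd}")


def read_login(path):
    """Read the credentials back as a (user, pwd) tuple."""
    with open(path) as f:
        # Split only on the first separator so a password containing "|||" still round-trips
        user, pwd = f.read().split(SEPARATOR, 1)
    return user, pwd


with tempfile.TemporaryDirectory() as tmp_dir:
    login_path = os.path.join(tmp_dir, "opensearchlogin.txt")
    write_login(login_path, "demo-user", "demo-password")
    print(read_login(login_path))
```

Keep in mind that a plain-text credentials file is only suitable for a workshop setting; anyone with access to the directory can read it.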

In [None]:
%%send_to_spark -i user -t str -n user

In [None]:
%%send_to_spark -i pwd -t str -n pwd

<div style="background-color: #FFFF00; border-left: 5px solid yellow; padding: 10px; color: black;">
    Please navigate to your AWS Management Console to find your OpenSearch Domain Endpoint URL
</div>

Let's find the URL of your OpenSearch domain. Switch over to your AWS Management Console and search for `Amazon OpenSearch Service`. You should see an active OpenSearch domain in `Green` status, listed as `opensearchservi-xxxx`.
1. Click on your OpenSearch domain
2. Find `Domain endpoint`
3. Copy the URL: https://xxx.es.amazonaws.com

In [None]:
%%local
OPENSEARCH_DOMAIN_URL = "https://search-opensearchservi-yoetoghxcrcw-v4bi7r4eb6jvbidmjg3bbohw4q.us-west-2.es.amazonaws.com"
with open("../studio-local-ui/opesearchurl.txt", "w") as f:
    f.write(OPENSEARCH_DOMAIN_URL)

In [None]:
%%send_to_spark -i OPENSEARCH_DOMAIN_URL -t str -n OPENSEARCH_DOMAIN_URL

The step below parallelizes the following operations using EMR Spark:
1. Takes the text chunks of our documents and invokes an embedding endpoint to encode them
2. Ingests the embeddings, text, and text metadata into the OpenSearch database
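
The second step depends on every embedding staying paired with the text it was computed from. The sketch below shows one way to keep that pairing when some embedding calls fail: zip the chunks with their results first, then skip failures. `build_index_records`, the record field names, and the sample data are illustrative assumptions and do not describe what `OpenSearchVectorSearch` does internally.

```python
def build_index_records(chunks, embedding_results):
    """Pair each chunk with its embedding, skipping chunks whose embedding call failed."""
    records = []
    for (text, metadata), (status_code, embedding) in zip(chunks, embedding_results):
        if status_code != 200:
            # Skip the chunk together with its failed result so later pairs stay aligned
            continue
        records.append({"vector_field": embedding, "text": text, "metadata": metadata})
    return records


# Made-up chunks and embedding results; the second call is a simulated failure
chunks = [("text a", {"page": 0}), ("text b", {"page": 1}), ("text c", {"page": 2})]
results = [(200, [0.1, 0.2]), (500, None), (200, [0.3, 0.4])]

for record in build_index_records(chunks, results):
    print(record["text"], record["vector_field"])
```

Filtering failed results before pairing would shift every later vector onto the wrong text, which is why the zip happens first.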

In [None]:
import time
from langchain.vectorstores import OpenSearchVectorSearch

start = time.time()
docsearch = OpenSearchVectorSearch.from_documents(
    docs, 
    EmbeddingsGenerator, 
    opensearch_url=OPENSEARCH_DOMAIN_URL,
    bulk_size=len(docs),
    http_auth=(user, pwd),
    index_name=INDEX_NAME_OSE,
    engine="faiss"
)

end = time.time()
print(f"Total Time for ingestion: {round(end - start, 2)} secs")

In [None]:
query = "What is Amazon SageMaker?"
sample_responses = docsearch.similarity_search(
    query, 
    k=5, 
    space_type="cosineSimilarity", 
    search_type="painless_scripting"
)

In [None]:
sample_responses[4].page_content
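
The search above ranks chunks by cosine similarity between the query embedding and each stored embedding. The plain-Python sketch below shows that ranking on made-up 2-dimensional vectors. It illustrates the metric only; how OpenSearch computes and rescales scores internally is not covered here, and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [
        (name, cosine_similarity(query_vector, vec))
        for name, vec in indexed_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 2-dimensional vectors; real embeddings have hundreds of dimensions
index = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
for name, score in top_k([1.0, 0.1], index, k=2):
    print(name, round(score, 3))
```

Cosine similarity compares direction rather than magnitude, so a long chunk and a short chunk about the same topic can still score close to each other.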

## Putting it All Together

To recap:

1. We connected to an EMR Spark cluster to leverage PySpark for distributed data processing at scale.
2. We pushed some raw data into S3 (in reality, this data can be housed anywhere: Redshift, S3, RDS, DynamoDB, Snowflake, etc.).
3. We parallelized document extraction from S3 using PySpark: our PySpark `Core` nodes reached out to the document store (S3) and read each file into memory for downstream processing.
4. We then split the processing of each document down to the page level, further parallelizing the PDF reading process using PySpark.
5. We chunked our document corpus using `LangChain`'s `RecursiveCharacterTextSplitter`, converted the text into embeddings using the `BAAI/bge-base-en-v1.5` embedding model, and ingested these embeddings into an OpenSearch index, all using PySpark parallel processing.
6. Now we use `Streamlit` to interact with the text generation model and the document embeddings through a UI.

Now, let's host the app. To do this, we will connect to the system terminal. Navigate to `File` > `New` > `Terminal` to launch a new `System Terminal`.

Then run the following commands using `System Terminal`:

`cd ~/sagemaker-studio-foundation-models/studio-local-ui`

`streamlit run rag_app.py --server.runOnSave true`

Or, you can run the same bash commands from inside a notebook cell, as below:

In [None]:
%%bash
cd ../studio-local-ui
streamlit run rag_app.py --server.runOnSave true

##### Navigate to https://example.studio.us-west-2.sagemaker.aws/jupyterlab/default/proxy/8501/

Replace "example" with your current URL hash: `https://use_this_hash.studio.us-west-2...`
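
If you prefer to build the link programmatically, the sketch below fills the URL pattern shown above. `build_streamlit_proxy_url` is a hypothetical helper, and the default region and port are simply the values that appear in the example URL; adjust them if your setup differs.

```python
def build_streamlit_proxy_url(studio_hash, region="us-west-2", port=8501):
    """Fill in the Studio proxy URL pattern for the Streamlit app (hypothetical helper)."""
    return (
        f"https://{studio_hash}.studio.{region}.sagemaker.aws"
        f"/jupyterlab/default/proxy/{port}/"
    )


# "example" is the placeholder hash from the heading above; use your own
print(build_streamlit_proxy_url("example"))
```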