# Populate RAG DB

This notebook aggregates the logic of notebooks 01 to 05 in order to quickly build the RAG content.

## Download and process documents

In this first step, we download the PDF documents, process them with Amazon Textract using the textractor library, and store the documents on Amazon S3.

### Install needed libraries

In [None]:
%pip install -q pypdf amazon-textract-textractor[pdf] requests 2> /dev/null

In [None]:
!sudo apt-get update -y &> /dev/null && sudo apt install poppler-utils -y 2> /dev/null

### Run the processing script locally

We can run this step locally because it does not access the Aurora database. Accessing Aurora would require the Studio space to have access to the VPC that hosts the database.

In [None]:
!python scripts/prepare_documents.py

## Calculate embeddings for the documents and load data into Aurora Postgres

Now that we have the text extracted from the documents, we can split it into chunks, calculate the embeddings for each chunk, and load this data into Aurora Postgres. We also load the analytical data from the `extracted_entities.csv` file.
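
The actual chunking logic lives in the processing script, which is not reproduced in this notebook. As a rough illustration of the idea only, the sketch below splits text into fixed-size character windows that overlap, so that a sentence cut at a chunk boundary still appears whole in a neighbouring chunk. The function name `chunk_text` and the `chunk_size` and `overlap` parameters are assumptions made for this example, not names taken from the workshop code.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split `text` into character chunks of at most `chunk_size`,
    where consecutive chunks share `overlap` characters.

    Hypothetical helper for illustration; the workshop script may chunk differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # distance between the starts of two chunks
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk produced this way would then be passed to an embedding model and stored with its vector. A token-based or sentence-aware splitter is often a better fit for real documents; character windows are used here only to keep the sketch dependency-free.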

We first retrieve some values from the deployed infrastructure, which are needed for the SageMaker processor container to run.

In [None]:
import json
import os
import boto3

ssm = boto3.client("ssm")
secretsmanager = boto3.client("secretsmanager")
region = boto3.session.Session().region_name

security_group_parameter = "/AgenticLLMAssistantWorkshop/SMProcessingJobSecurityGroupId"
dbsecret_arn_parameter = "/AgenticLLMAssistantWorkshop/DBSecretARN"
subnet_ids_parameter = "/AgenticLLMAssistantWorkshop/SubnetIds"
s3_bucket_name_parameter = "/AgenticLLMAssistantWorkshop/AgentDataBucketParameter"
script_processor_container_parameter = "/AgenticLLMAssistantWorkshop/ScriptProcessorContainer"

security_group = ssm.get_parameter(Name=security_group_parameter)
security_group = security_group["Parameter"]["Value"]

db_secret_arn = ssm.get_parameter(Name=dbsecret_arn_parameter)
db_secret_arn = db_secret_arn["Parameter"]["Value"]

subnet_ids = ssm.get_parameter(Name=subnet_ids_parameter)
private_subnets_with_egress_ids = json.loads(subnet_ids["Parameter"]["Value"])

s3_bucket_name = ssm.get_parameter(Name=s3_bucket_name_parameter)
s3_bucket_name = s3_bucket_name["Parameter"]["Value"]

processed_documents_s3_key = "documents_processed.json"
sql_tables_s3_key = "structured_metadata"

script_processor_container_uri = ssm.get_parameter(Name=script_processor_container_parameter)["Parameter"]["Value"]
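
The cell above repeats the same two-step pattern for every parameter: call `get_parameter`, then read `["Parameter"]["Value"]` from the response. If you prefer less repetition, a small helper along these lines could wrap that pattern. The helper names `get_parameter_value` and `get_json_parameter` are hypothetical and not part of the workshop code; the only boto3 call used is `get_parameter`, as in the cell above.

```python
import json


def get_parameter_value(ssm_client, name: str) -> str:
    """Return the string value of an SSM parameter.

    `ssm_client` is expected to behave like boto3.client("ssm").
    Hypothetical helper for illustration.
    """
    response = ssm_client.get_parameter(Name=name)
    return response["Parameter"]["Value"]


def get_json_parameter(ssm_client, name: str):
    """Return an SSM parameter whose value is a JSON document, already parsed."""
    return json.loads(get_parameter_value(ssm_client, name))


# Possible usage with the names defined in the cell above:
# security_group = get_parameter_value(ssm, security_group_parameter)
# private_subnets_with_egress_ids = get_json_parameter(ssm, subnet_ids_parameter)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object, without AWS credentials.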

In [None]:
!aws s3 cp data/extracted_entities.csv s3://{s3_bucket_name}/{sql_tables_s3_key}/extracted_entities.csv

### Run the code using a SageMaker script processor

Amazon SageMaker processors are tools that enable users to perform data preprocessing, postprocessing, feature engineering, data validation, and model evaluation tasks on fully managed infrastructure. The Processor class handles general processing tasks, while specialized classes like ScriptProcessor and FrameworkProcessor allow for running scripts and using pre-built containers for popular machine learning frameworks respectively. These processors make it easy to prepare data for machine learning, evaluate models, and perform other data processing tasks without having to manage the underlying infrastructure, allowing data scientists and ML engineers to focus on their core work.

For this use case, we use a ScriptProcessor with a custom-built container (you can see the definition of the container in `data_pipelines/data/docker/Dockerfile`), which executes the script `data_pipelines/data/scripts/prepare_and_load_data_in_aurora.py`.

When executing the script, we can tell the SageMaker processor to load data from S3 and make it available in the container. We do this to provide the processed documents file and the CSV file.
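
To make this concrete: each `ProcessingInput` names a destination directory inside the container, and the script can read the copied files from that directory like any local files. The sketch below shows one way a script might discover and load such inputs. It is not the workshop's script: the helper names `list_input_files` and `load_json` are hypothetical, and the exact file names that appear under each destination are an assumption.

```python
import json
from pathlib import Path


def list_input_files(input_dir: str) -> list:
    """Return all files found under `input_dir`, sorted by path.

    Returns an empty list if the directory does not exist.
    Hypothetical helper for illustration.
    """
    root = Path(input_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


def load_json(path) -> object:
    """Parse a JSON file that was copied into the container."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Inside the processing container, usage might look like this
# (the directory matches a ProcessingInput destination):
# for path in list_input_files("/opt/ml/processing/input/processed_documents"):
#     documents = load_json(path)
```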

The container running the script is configured with network access to the Aurora database (see `NetworkConfig`).

(Estimated time: ~3 min)

In [None]:
%%time
from sagemaker import get_execution_role
from sagemaker.network import NetworkConfig
from sagemaker.processing import ProcessingInput, ScriptProcessor

# Attach the job to the subnets and security group retrieved earlier
current_network_config = NetworkConfig(
    subnets=private_subnets_with_egress_ids, security_group_ids=[security_group]
)

# Initialize the ScriptProcessor
script_processor = ScriptProcessor(
    image_uri=script_processor_container_uri,
    role=get_execution_role(),
    instance_type="ml.t3.large",
    instance_count=1,
    base_job_name="load-rag-data",
    env={"SQL_DB_SECRET_ID": db_secret_arn, "AWS_DEFAULT_REGION": region},
    network_config=current_network_config,
    command=["python3"]
)

# Run the processing job
script_processor.run(
    code="scripts/prepare_and_load_data_in_aurora.py",
    inputs=[
        ProcessingInput(
            input_name="processed_documents",
            source=f"s3://{s3_bucket_name}/{processed_documents_s3_key}",
            destination="/opt/ml/processing/input/processed_documents",
        ),
        ProcessingInput(
            input_name="sqltables",
            source=f"s3://{s3_bucket_name}/{sql_tables_s3_key}",
            destination="/opt/ml/processing/input/sqltables",
        ),
    ],
)