# 1 - Ingest with LlamaParse into S3 for KB

In this notebook, we use LlamaParse to pre-process complex documents and stage them in S3 for ingestion into an Amazon Bedrock Knowledge Base (KB).

### LlamaParse is a document parser optimized for RAG over complex documents
- ✅ Extracts tables / charts
- ✅ Accepts natural-language parsing instructions
- ✅ JSON mode
- ✅ Image Extraction
- ✅ Support for 10+ document types (.pdf, .pptx, .docx, .xml, and more)

### Installation

Install `llama-index` (the core framework) and `llama-parse` (the LlamaParse client).

In [90]:
%pip install llama-index
%pip install llama-parse

Note: you may need to restart the kernel to use updated packages.

### Setup and imports

In [3]:
import nest_asyncio

nest_asyncio.apply()

Get an API key from http://cloud.llamaindex.ai/ and set it in the `LLAMA_CLOUD_API_KEY` environment variable.

In [4]:
import os

os.environ['LLAMA_CLOUD_API_KEY'] = 'llx-...'
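
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming a hypothetical helper `resolve_api_key` (it is not part of `llama-parse`): reuse the key if it is already in the environment, and prompt for it otherwise.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the LlamaCloud API key, prompting only if it is not already set.

    Hypothetical helper for illustration; not part of llama-parse.
    """
    key = env.get("LLAMA_CLOUD_API_KEY")
    if not key:
        # getpass hides the typed value so it does not end up in cell output
        key = prompt("LLAMA_CLOUD_API_KEY: ")
        env["LLAMA_CLOUD_API_KEY"] = key  # make it visible to later cells
    return key
```

Calling `resolve_api_key()` once near the top of the notebook would then replace the hard-coded assignment.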

### Download data

For this demo, we will build a simple knowledge base from two 10-K filings, one from Uber and one from Lyft.

In [62]:
!mkdir -p './data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O './data/10k/lyft_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O './data/10k/uber_2021.pdf'

--2024-04-18 13:31:02--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1440303 (1.4M) [application/octet-stream]
Saving to: ‘./data/10k/lyft_2021.pdf’


2024-04-18 13:31:02 (18.6 MB/s) - ‘./data/10k/lyft_2021.pdf’ saved [1440303/1440303]

--2024-04-18 13:31:02--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response...
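
The captured output above is cut off for the second file, so it can be worth confirming that both downloads are actually PDFs before sending them to the parser. A small sketch, assuming hypothetical helpers `looks_like_pdf` and `check_downloads` (neither is in the original notebook); it only checks the file header, not that the file is complete:

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(5) == b"%PDF-"


def check_downloads(paths):
    """Map each path to whether its header looks like a PDF."""
    return {str(p): looks_like_pdf(p) for p in paths}
```

For this notebook, `check_downloads(["./data/10k/lyft_2021.pdf", "./data/10k/uber_2021.pdf"])` should report `True` for both paths if the downloads succeeded.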

### Load and parse data with LlamaParse

In [5]:
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
    api_key=os.environ.get('LLAMA_CLOUD_API_KEY'),  # set via api_key param or in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # Optionally you can define a language, default=en
)

file_extractor = {".pdf": parser}
reader = SimpleDirectoryReader(
    input_dir='data/10k/',
    file_extractor=file_extractor
)

In [6]:
documents = reader.load_data()

Started parsing the file under job_id 705de5a7-2782-40e8-a5cf-813bb198a9fe
Started parsing the file under job_id f35906d9-4b66-4d31-99d8-6cdb087c8934


In [7]:
documents[0].metadata

{'file_path': '/Users/suo/dev/rag-bedrock/data/10k/lyft_2021.pdf',
 'file_name': 'lyft_2021.pdf',
 'file_type': 'application/pdf',
 'file_size': 1440303,
 'creation_date': '2024-04-18',
 'last_modified_date': '2024-04-18'}

### Upload data as markdown and metadata files

Now, upload and stage the parsed markdown files for ingestion into Amazon Bedrock KB.

We also create a metadata JSON file for each document (the format is specific to Amazon Bedrock KB).
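
To make the sidecar format concrete, here is a minimal sketch of how one metadata file is named and structured. `build_metadata_sidecar` is a hypothetical helper that mirrors the convention used in this notebook: the metadata object key is the document's key plus `.metadata.json`, and the attributes sit under a top-level `metadataAttributes` field. The sample values are taken from the Lyft document metadata shown earlier.

```python
import json


def build_metadata_sidecar(object_key, metadata):
    """Return (sidecar_key, json_body) for one staged document.

    Hypothetical helper; it mirrors the naming and structure used in this
    notebook rather than any official client API.
    """
    sidecar_key = f"{object_key}.metadata.json"
    body = json.dumps({"metadataAttributes": metadata}, indent=4)
    return sidecar_key, body


key, body = build_metadata_sidecar(
    "lyft_2021.pdf.md",
    {
        "file_name": "lyft_2021.pdf",
        "file_type": "application/pdf",
        "file_size": 1440303,
    },
)
print(key)
print(body)
```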

In [8]:
import boto3 
import botocore

# Create an S3 client
s3 = boto3.client('s3')

# Specify the bucket
bucket_name = 'bedrock-kb-10ks'

In [9]:
import json

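def create_bucket(bucket_name):
    """Create the bucket if it does not already exist."""
    try:
        s3.head_bucket(Bucket=bucket_name)
        print(f"Bucket '{bucket_name}' already exists.")
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == '404':
            # Outside us-east-1, S3 requires an explicit LocationConstraint
            region = s3.meta.region_name
            if region in (None, 'us-east-1'):
                s3.create_bucket(Bucket=bucket_name)
            else:
                s3.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region},
                )
            print(f"Bucket '{bucket_name}' created successfully.")
        else:
            # Any other error (e.g. 403) means the bucket could not be checked
            print(f"Error checking bucket: {str(e)}")
            raise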

def upload_document(document, bucket_name):
    try:
        object_key = document.metadata['file_path']
        
        # Upload the text as a markdown file
        s3.put_object(
            Body=document.text.encode('utf-8'),
            Bucket=bucket_name,
            Key=f"{object_key}.md",
            ContentType='text/markdown'
        )
        print(f"Text uploaded to S3 as '{object_key}.md'")

        # Format the metadata in the desired structure
        formatted_metadata = {
            "metadataAttributes": document.metadata
        }
        
        # Upload the metadata as a JSON file
        metadata_json = json.dumps(formatted_metadata, indent=4)
        s3.put_object(
            Body=metadata_json.encode('utf-8'),
            Bucket=bucket_name,
            Key=f"{object_key}.md.metadata.json",
            ContentType='application/json'
        )
        print(f"Metadata uploaded to S3 as '{object_key}.md.metadata.json'")
    
    except Exception as e:
        print(f"Error uploading document: {str(e)}")

In [10]:
# If the bucket does not exist yet, run create_bucket(bucket_name) first
for doc in documents:
    upload_document(doc, bucket_name)

Text uploaded to S3 as '/Users/suo/dev/rag-bedrock/data/10k/lyft_2021.pdf.md'
Metadata uploaded to S3 as '/Users/suo/dev/rag-bedrock/data/10k/lyft_2021.pdf.md.metadata.json'
Text uploaded to S3 as '/Users/suo/dev/rag-bedrock/data/10k/uber_2021.pdf.md'
Metadata uploaded to S3 as '/Users/suo/dev/rag-bedrock/data/10k/uber_2021.pdf.md.metadata.json'
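
The object keys printed above are the absolute local file paths, so they start with `/` and carry the local directory layout into the bucket. A minimal sketch of one alternative, assuming a hypothetical helper `s3_key_for` and a hypothetical `10k` prefix (neither appears in the original notebook), keeps only the file name. It assumes POSIX-style paths like the ones shown in the output.

```python
from pathlib import PurePosixPath


def s3_key_for(file_path, prefix="10k"):
    """Build a short S3 key for the parsed markdown of `file_path`.

    Hypothetical helper: drops the local directories and keeps the file name.
    """
    name = PurePosixPath(file_path).name
    return f"{prefix}/{name}.md" if prefix else f"{name}.md"
```

One way to use it would be to derive the key in `upload_document` from `s3_key_for(document.metadata['file_path'])` instead of the raw `file_path`, and to append `.metadata.json` to that key for the sidecar file.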


### Next: Create Amazon Bedrock KB

Now, you can create an Amazon Bedrock KB either:
1. via the AWS management console, or
2. programmatically following this guide: https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/0_create_ingest_documents_test_kb.ipynb
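
Whichever route you take, the staged files are only indexed once the KB's data source has been synced. As a rough sketch, assuming the KB and its S3 data source already exist, a sync can be started with the `start_ingestion_job` operation of the boto3 `bedrock-agent` client. `start_sync` is a hypothetical wrapper, and the IDs in the usage note are placeholders you would replace with your own.

```python
def start_sync(client, knowledge_base_id, data_source_id):
    """Start an ingestion job and return its ID.

    `client` is expected to be boto3.client("bedrock-agent"); it is passed in
    so the function can be exercised without AWS access.
    """
    response = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    return response["ingestionJob"]["ingestionJobId"]
```

In the notebook this would be called as `start_sync(boto3.client("bedrock-agent"), "<knowledge-base-id>", "<data-source-id>")`.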