# File Processor
1. Read a PDF file and split it into chunks
2. Generate embeddings for the saved chunks
3. Store the embeddings in a vector store (OpenSearch)


## 0. Prerequisites

In [None]:
%load_ext autoreload
%autoreload 2
%pip install ipywidgets

### 1. Process PDF and Chunk

In [None]:
from pathlib import Path

input_file = 'data/bedrock-ug.pdf'
chunk_size = 1000
start_page = 15

# Additional Parameters for Contextual Retrieval
add_contextual = True
document_size = 20000

document_name = Path(input_file).resolve().stem
document_name

#### Split Document into chunked format

In [None]:
from helpers.document_helper import DocumentHelper

chunked_document = DocumentHelper.split(
    full_text=DocumentHelper.load_pdf(input_file, start_page=start_page),
    chunk_size=chunk_size,
    max_document_length=document_size if add_contextual else -1,
)

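import json

# Save the result to a JSON file. Single quotes inside the f-string keep this
# line valid on Python versions before 3.12.
output_file = f"output/{document_name}_{chunk_size}{'_situated' if add_contextual else ''}_chunks.json"
Path(output_file).parent.mkdir(parents=True, exist_ok=True)  # create output/ if it is missing

with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(chunked_document, f, ensure_ascii=False, indent=2)
    print(f"Chunks saved to {output_file}")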
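
The later cells read `doc_id`, `content`, and `chunks` (each with a `chunk_id` and `content`) from the saved JSON. The sketch below shows one way such a structure could be produced from raw text. `split_text` is a hypothetical stand-in, not the actual `DocumentHelper.split` implementation, and it assumes a plain character-based split with no regard for sentence boundaries.

```python
def split_text(full_text: str, chunk_size: int, max_document_length: int = -1) -> list[dict]:
    """Hypothetical stand-in for DocumentHelper.split (character-based)."""
    # With max_document_length <= 0, treat the whole text as a single document.
    doc_len = max_document_length if max_document_length > 0 else max(len(full_text), 1)
    documents = []
    for doc_id, doc_start in enumerate(range(0, len(full_text), doc_len)):
        doc_content = full_text[doc_start:doc_start + doc_len]
        # Cut each document into fixed-size chunks; the last one may be shorter.
        chunks = [
            {"chunk_id": chunk_id, "content": doc_content[i:i + chunk_size]}
            for chunk_id, i in enumerate(range(0, len(doc_content), chunk_size))
        ]
        documents.append({"doc_id": doc_id, "content": doc_content, "chunks": chunks})
    return documents


sample = split_text("a" * 2500, chunk_size=1000, max_document_length=2000)
print([(d["doc_id"], [len(c["content"]) for c in d["chunks"]]) for d in sample])
# [(0, [1000, 1000]), (1, [500])]
```

Grouping chunks under a bounded parent document is what lets the situating step in 2-1 send the parent `content` alongside each chunk without exceeding the model's input limit.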



### 2. Embedding

### 2-0. Load Requirements

In [None]:
from libs.bedrock_service import BedrockService
from libs.opensearch_service import OpensearchService

from config import Config
config = Config.load()
config.__dict__


In [None]:
bedrock_service = BedrockService(config.aws.region, config.aws.profile, config.bedrock.retries, config.bedrock.embed_model_id, config.bedrock.model_id, config.model.max_tokens, config.model.temperature, config.model.top_p)
opensearch_service = OpensearchService(config.aws.region, config.aws.profile, config.opensearch.prefix, config.opensearch.domain_name, config.opensearch.document_name, config.opensearch.user, config.opensearch.password)


### 2-1. Situate Document

In [None]:
temperature = 0.0
top_p = 0.5

In [None]:
import json
import time
from tqdm.notebook import tqdm

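# Defined outside the `if` so the embedding step in 2-3 can open the file even when add_contextual is False.
chunked_file = f"output/{document_name}_{chunk_size}{'_situated' if add_contextual else ''}_chunks.json"

if add_contextual: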

    with open(chunked_file, 'r', encoding='utf-8') as f:
        documents = json.load(f)

    total_token_usage = {"inputTokens": 0, "outputTokens": 0, "totalTokens": 0}
    documents_token_usage = {}

    sys_prompt = """
    You're an expert at providing a succinct context, targeted for specific text chunks.

    <instruction>
    - Offer 1-5 short sentences that explain what specific information this chunk provides within the document.
    - Focus on the unique content of this chunk, avoiding general statements about the overall document.
    - Clarify how this chunk's content relates to other parts of the document and its role in the document.
    - If there's essential information in the document that backs up this chunk's key points, mention the details.
    </instruction>
    """
    fail_count = 0

    for doc_index, document in tqdm(enumerate(documents), leave = False, total=len(documents)):
        if fail_count > 10:
            break
        doc_content = document['content']

        # Make sure the per-document token counter exists before accumulating into it.
        document.setdefault('token_usage', {"inputTokens": 0, "outputTokens": 0, "totalTokens": 0})
        
        for chunk in tqdm(document['chunks']):
            if 'simulated' in chunk:
                continue
            document_context_prompt = f"""
            <document>
            {doc_content}
            </document>
            """

            chunk_content = chunk['content']
            chunk_context_prompt = f"""
            Here is the chunk we want to situate within the whole document:

            <chunk>
            {chunk_content}
            </chunk>

            Skip the preamble and only provide the concise context.
            """
            usr_prompt = [{
                    "role": "user", 
                    "content": [
                        {"text": document_context_prompt},
                        {"text": chunk_context_prompt}
                    ]
                }]
            
            try:
                response = bedrock_service.converse(
                    messages=usr_prompt, 
                    system_prompt=sys_prompt,
                    temperature=temperature,
                    top_p=top_p,
                    max_tokens=4096
                )
                situated_context = response['output']['message']['content'][0]['text'].strip()
                chunk['content'] = f"Context:\n{situated_context}\n\nChunk:\n{chunk['content']}"
                chunk['simulated'] = True

                if 'usage' in response:
                    usage = response['usage']
                    for key in ['inputTokens', 'outputTokens', 'totalTokens']:
                        document['token_usage'][key] += usage.get(key, 0)
                print(f"completed generating context for chunk [{doc_index}_{chunk['chunk_id']}]")

            except Exception as e:
                print(f"Error generating context for chunk [{doc_index}_{chunk['chunk_id']}]: {e}")
                fail_count += 1
            time.sleep(5)

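    # Write the situated chunks back; ensure_ascii=False keeps non-ASCII text readable, as in the chunking step.
    with open(chunked_file, "w", encoding='utf-8') as f:
        json.dump(documents, f, ensure_ascii=False, indent=4)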

In [None]:
documents[-1]
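
The situating loop accumulates a `token_usage` counter on each document but never totals them. A minimal sketch of how the overall usage could be summarized follows; `summarize_token_usage` is a hypothetical helper, and the example numbers are illustrative, not measured results.

```python
def summarize_token_usage(documents: list[dict]) -> dict:
    """Sum the per-document 'token_usage' counters written by the situating loop."""
    total = {"inputTokens": 0, "outputTokens": 0, "totalTokens": 0}
    for document in documents:
        usage = document.get("token_usage", {})
        for key in total:
            total[key] += usage.get(key, 0)
    return total


# Illustrative data only; real values come from the situated JSON file.
example_documents = [
    {"token_usage": {"inputTokens": 1200, "outputTokens": 80, "totalTokens": 1280}},
    {"token_usage": {"inputTokens": 900, "outputTokens": 60, "totalTokens": 960}},
    {},  # a document that was never processed
]
print(summarize_token_usage(example_documents))
# {'inputTokens': 2100, 'outputTokens': 140, 'totalTokens': 2240}
```

In the notebook this would be called as `summarize_token_usage(documents)` after the loop finishes.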

### 2-2. Create Index

In [None]:
# Configure Index
index_prefix = "aws_"
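# Apply the prefix in every case so the index always matches the `index_prefix*` listing below.
base_name = document_name
if add_contextual and not document_name.startswith("contextual_"):
    base_name = f"contextual_{document_name}"
index_name = f"{index_prefix}{base_name}_{chunk_size}"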

overwrite_index = True

opensearch_index_configuration = {
    "settings": {
        "index.knn": True,
        "index.knn.algo_param.ef_search": 512
    },
    "mappings": {
        "properties": {
            "metadata": {
                "properties": {
                    "source": {
                        "type": "keyword"
                    },
                    "doc_id": {
                        "type": "keyword"
                    },
                    "timestamp": {
                        "type": "date"
                    }
                }
            },
            "content": {
                "type": "text",
                "analyzer": "standard"
            },
            "content_embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "engine": "faiss",
                    "name": "hnsw",
                    "parameters": {
                        "ef_construction": 512,
                        "m": 16
                    },
                    "space_type": "l2"
                }
            }
        }
    }
}

index_name
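
The mapping above sets `space_type` to `l2`, so nearest neighbours are ranked by Euclidean distance. As a minimal sketch, assuming the conversion described in the OpenSearch k-NN documentation (score = 1 / (1 + squared L2 distance)), the relationship between distance and search score looks like this. The function names are hypothetical and the vectors are toy values rather than real 1024-dimensional embeddings.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_score(a, b):
    """Assumed k-NN score for the l2 space: closer vectors score nearer to 1."""
    return 1.0 / (1.0 + l2_squared(a, b))


print(l2_score([1.0, 0.0], [1.0, 0.0]))  # identical vectors
print(l2_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal unit vectors
```

Identical vectors give the maximum score of 1.0, and the score falls toward 0 as the distance grows, which is why higher `_score` values in the test query indicate closer matches.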

In [None]:
if overwrite_index:
    if opensearch_service.opensearch_client.indices.exists(index=index_name):
        opensearch_service.opensearch_client.indices.delete(index=index_name)
    
    opensearch_service.opensearch_client.indices.create(index=index_name, body=opensearch_index_configuration)
else:
    if not opensearch_service.opensearch_client.indices.exists(index=index_name):
        opensearch_service.opensearch_client.indices.create(index=index_name, body=opensearch_index_configuration)

index_pattern = f"{index_prefix}*" if index_prefix else "*"
indices = opensearch_service.opensearch_client.cat.indices(index=index_pattern, format="json")

indices_name = [item['index'] for item in indices]
indices_name


### 2-3. Embed Document

In [None]:
import json
from tqdm.notebook import tqdm
from libs.bedrock_service import BedrockService
from datetime import datetime

bedrock_service = BedrockService(config.aws.region, config.aws.profile, config.bedrock.retries, config.bedrock.embed_model_id, config.bedrock.model_id, config.model.max_tokens, config.model.temperature, config.model.top_p)

with open(chunked_file, 'r', encoding='utf-8') as f:
    documents = json.load(f)

embedded_documents = []

for document in tqdm(documents):
    doc_id = document['doc_id']
    embedded_chunks = []

    for chunk in tqdm(document['chunks']):
        context = chunk['content']
        chunk_embedding = bedrock_service.embedding(text=context)
        if chunk_embedding:
            chunk_id = chunk['chunk_id']
            _id = f"{doc_id}_{chunk_id}"
            embedded_chunk = {
                "metadata": {
                    "source": document_name, 
                    "doc_id": doc_id,
                    "chunk_id": chunk_id,
                    "timestamp": datetime.now().isoformat()
                },
                "content": chunk['content'],
                "content_embedding": chunk_embedding
            }
            embedded_chunks.append(embedded_chunk)

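            # Use the deterministic chunk ID so re-running the cell updates documents instead of duplicating them.
            opensearch_service.opensearch_client.index(
                index=index_name,
                id=_id,
                body=embedded_chunk
            )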
            
    # Record the results once per document, after all of its chunks have been processed.
    embedded_documents.append({
        "_id": doc_id,
        "embedded_chunks": embedded_chunks
    })
        
print(f"Successfully embedded and stored documents in index '{index_name}'")
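
The loop above sends one indexing request per chunk. For larger documents, the same records could be prepared as bulk actions instead. The sketch below only builds the actions and uses a stand-in embedder so it runs without AWS access; `build_bulk_actions` and `fake_embed` are hypothetical names, not part of the notebook's libraries.

```python
from datetime import datetime


def build_bulk_actions(documents, embed, index_name, source):
    """Yield one bulk action per chunk; `embed` maps text to a vector (or None)."""
    for document in documents:
        for chunk in document["chunks"]:
            vector = embed(chunk["content"])
            if not vector:
                continue  # skip chunks whose embedding failed
            yield {
                "_index": index_name,
                "_id": f"{document['doc_id']}_{chunk['chunk_id']}",
                "_source": {
                    "metadata": {
                        "source": source,
                        "doc_id": document["doc_id"],
                        "chunk_id": chunk["chunk_id"],
                        "timestamp": datetime.now().isoformat(),
                    },
                    "content": chunk["content"],
                    "content_embedding": vector,
                },
            }


# Stand-in embedder for illustration: returns None for empty text to mimic a failure.
fake_embed = lambda text: [float(len(text))] if text else None
demo_docs = [{"doc_id": 0, "chunks": [{"chunk_id": 0, "content": "hello"},
                                      {"chunk_id": 1, "content": ""}]}]
actions = list(build_bulk_actions(demo_docs, fake_embed, "aws_demo_1000", "demo"))
print([a["_id"] for a in actions])
# ['0_0']
```

Assuming `opensearch_service.opensearch_client` is an opensearch-py client, the actions could then be passed to `opensearchpy.helpers.bulk(client, actions)`, with `embed` set to `lambda t: bedrock_service.embedding(text=t)`.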


### 2-4. Test Query

In [None]:
question = "What is Bedrock?"

question_embedding = bedrock_service.embedding(text=question)
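# The index name is hard-coded here; update it if `input_file`, `chunk_size`, or `add_contextual` change.
knn = opensearch_service.search_by_knn(question_embedding, 'contextual_bedrock-ug_1000')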
knn
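
The source of `OpensearchService.search_by_knn` is not shown in this notebook. For orientation, the sketch below shows the general shape of an OpenSearch approximate k-NN query body against the `content_embedding` field defined in 2-2. `build_knn_query` and its defaults are hypothetical and are not the helper's actual implementation.

```python
def build_knn_query(vector, k=5, field="content_embedding"):
    """Build an approximate k-NN search body for an OpenSearch knn_vector field."""
    return {
        "size": k,
        "_source": {"excludes": [field]},  # keep the large vector out of the returned hits
        "query": {"knn": {field: {"vector": vector, "k": k}}},
    }


body = build_knn_query([0.1, 0.2, 0.3], k=3)
print(body["query"])
# {'knn': {'content_embedding': {'vector': [0.1, 0.2, 0.3], 'k': 3}}}
```

With an opensearch-py client, a body like this would typically be sent via `client.search(index=..., body=body)`, using the question embedding as `vector`.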