## Managed RAG Lab: Document Enrichment (Optional)

This lab is optional. Run this notebook if you want to see how the Swagger API YAML files and diagrams are enriched for RAG.

In this scenario, the Swagger API documentation consists of JSON or YAML files, flow diagrams, and other non-rich-text formats. Amazon Q Business does not handle these data types by default, so this notebook runs an enrichment process that generates synthetic documentation from the YAML files and images for Amazon Q Business. Here is a flow diagram of this lab:


![Lab Diagram](../static/q-business.png)

### > Test Lab01 setup has run

In [None]:
from utils import (
    get_text_response,
    load_yaml_to_string,
    upload_file_to_s3,
    image_to_base64
)
from termcolor import colored
import os
import json
import time

# Claude pricing in us-east-1, per 1,000 tokens
input_per_1k = 0.00025
output_per_1k = 0.00125
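
The two per-1,000-token prices above feed the cost estimate printed at the end of each enrichment loop. A minimal sketch of that arithmetic follows; the function name `estimate_cost` and the token counts are hypothetical and chosen only for illustration.

```python
# Prices copied from the cell above (per 1,000 tokens).
input_per_1k = 0.00025
output_per_1k = 0.00125


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost for the given token counts."""
    return (input_per_1k * input_tokens + output_per_1k * output_tokens) / 1000


# Hypothetical token counts, for illustration only.
example_cost = estimate_cost(input_tokens=40_000, output_tokens=8_000)
print(f"Estimated cost: ${example_cost:.4f}")
```

Dividing once at the end, as the notebook does, keeps the running totals as plain integer token counts.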

### > Load parameters from lab00-setup
If you have not run lab00-setup, please go back and run the setup notebook.

In [None]:
%store -r bucket
%store -r prefix
%store -r yml_dir
%store -r uml_dir
%store -r data_dir
# check that all 5 values print without errors
print(bucket)
print(prefix)
print(yml_dir)
print(uml_dir)
print(data_dir)
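
If any of the stored values is missing or empty, the later cells break in less obvious ways. The following is a small sketch of a fail-fast check, assuming the five values are plain strings; the helper name `find_missing_params` and the sample values are hypothetical.

```python
def find_missing_params(params: dict) -> list:
    """Return the names of parameters that are missing or empty."""
    return [name for name, value in params.items() if not value]


# Hypothetical values; in the notebook these would come from `%store -r`.
sample_params = {
    "bucket": "my-example-bucket",
    "prefix": "q-business",
    "yml_dir": "data/yml_files",
    "uml_dir": "",  # simulate a value that was never stored
    "data_dir": "data",
}

missing = find_missing_params(sample_params)
if missing:
    print(f"Missing parameters, re-run lab00-setup: {missing}")
```

In the notebook, the dictionary could be built with `globals().get(name)` for each of the five names, so that an undefined variable shows up as `None` instead of raising `NameError`.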

### Enrich documents using Amazon Bedrock (Optional)
If you are interested in how to enrich the OpenAPI files and UML diagrams using foundation models from Amazon Bedrock, please proceed with the rest of this notebook.

### > Enrich the yml files

Here is a prompt for the Anthropic Claude 3 model to generate synthetic documentation from YAML files.

In [None]:
prompt_template="""
You will be provided with an OpenAPI YAML file containing the specification for a set of APIs. Your
task is to generate a principle-level documentation for these APIs in JSON format.

Here are the steps you should follow:

1. Read the provided <yaml> carefully and understand the APIs, their
endpoints, request/response data structures, and other details.

<yaml>
{YAML_FILE}
</yaml>

2. In the <description> field of your JSON output, provide a comprehensive description of the APIs.
Explain what each API does, what data properties the requests take, and what the expected response
messages are. Call out any limits and frequently encountered errors. Use examples from the YAML file to
illustrate your points.

3. In the <stats> field, generate some overall statistics about the APIs and present them in bullet
list sentence style, such as:
- Number of routes/endpoints?
- Number of request data models?
- Number of response data models?
- Any other relevant stats you can extract from the YAML file

4. In the <faq> field, generate a list of 20 questions and corresponding answers for a Frequently Asked
Questions (FAQ) section. Start with simple questions about the APIs and gradually increase the
complexity. The questions should cover various aspects of the APIs, such as their functionality,
data structures, error handling, and so on. The answers should be clear, concise, and informative,
using examples from the YAML file where appropriate.

5. Structure your JSON output as follows:

{
"description": "<description>",
"stats": "<stats>",
"faq": [
{
"question": "<question>",
"answer": "<answer>"
},
...
]
}

Replace <description>, <stats>, <question>, and <answer> with the appropriate content you generated
in the previous steps.

Please provide your response in JSON format only, without any additional explanations or comments.
"""
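
The template mixes a `{YAML_FILE}` placeholder with literal JSON braces, which is why the notebook fills it with `str.replace` and not `str.format`. The sketch below shows the difference on a shortened, hypothetical template; `fill_template` is an illustrative name, not a function from `utils`.

```python
# Shortened stand-in for the prompt template above.
short_template = """<yaml>
{YAML_FILE}
</yaml>
Structure your JSON output as follows:
{
"description": "<description>"
}
"""


def fill_template(template: str, yaml_text: str) -> str:
    """Substitute only the {YAML_FILE} placeholder, leaving other braces alone."""
    return template.replace("{YAML_FILE}", yaml_text)


filled = fill_template(short_template, "openapi: 3.0.0")
print(filled)

# str.format treats the JSON braces as replacement fields and raises.
try:
    short_template.format(YAML_FILE="openapi: 3.0.0")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True
print("str.format failed:", format_failed)
```

Using `str.format` would require doubling every literal brace in the JSON skeleton, which makes the prompt harder to read and edit.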

### > Let's preview one of the YAML files

In [None]:
from IPython.display import display, Markdown

yml_file = f"{data_dir}/yml_files/petstore.yml"
yml_str = load_yaml_to_string(yml_file)


display(Markdown(f"""```yml\n{yml_str}```"""))

### > Generate enriched documents from YAML

The code below processes the YAML files (which contain Swagger API documentation) to generate human-readable documentation in text format. Amazon Q Business can then use this documentation to build an AI assistant that understands these APIs.
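
The processing loop in this section retries a model call when its output is not valid JSON. The same pattern can be sketched as a standalone helper. Here `call_with_json_retry` and `fake_model` are hypothetical names, and `fake_model` stands in for the notebook's `get_text_response`.

```python
import json
import time


def call_with_json_retry(call, max_retries=3, delay=0.0):
    """Call `call()` until its string result parses as JSON.

    Returns the parsed object, or None if every attempt fails.
    """
    for attempt in range(max_retries):
        text = call()
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            print(f"Attempt {attempt + 1}/{max_retries} failed: {err}")
            if attempt < max_retries - 1:
                time.sleep(delay)  # wait before asking the model again
    return None


# Stand-in for a model call: the first reply is malformed, the second is valid.
replies = iter(['{"description": "oops"', '{"description": "Pet store API"}'])


def fake_model():
    return next(replies)


result = call_with_json_retry(fake_model)
print(result)
```

Returning `None` on failure lets the caller decide explicitly whether to skip the file, so a failed file never reuses a previous file's result.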

In [None]:
!cd .. && rm -rf `find -type d -name .ipynb_checkpoints`
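
The shell command above deletes `.ipynb_checkpoints` folders so that `os.listdir` does not return them as if they were input files. An alternative sketch filters the listing and deletes nothing; the helper name `list_input_files` and the extension tuple are assumptions made for illustration.

```python
import os
import tempfile


def list_input_files(directory, extensions=(".yml", ".yaml")):
    """Return sorted regular files in `directory` with one of the extensions."""
    names = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip sub-directories such as .ipynb_checkpoints and unrelated files.
        if os.path.isfile(path) and name.lower().endswith(extensions):
            names.append(name)
    return names


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, ".ipynb_checkpoints"))
    for name in ("petstore.yml", "notes.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write("placeholder\n")
    found = list_input_files(tmp)

print(found)
```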

In [None]:
input_tokens = 0
output_tokens = 0

for yml_filename in os.listdir(yml_dir):
    # Construct the full file path
    yml_filepath = os.path.join(yml_dir, yml_filename)

    yml_str = load_yaml_to_string(yml_filepath)

    query = prompt_template.replace("{YAML_FILE}", yml_str)

    max_retries = 3
    delay = 2  # Delay in seconds between retries
    
    document_json = None  # reset so a failed file never reuses the previous result

    for attempt in range(max_retries):
        try:
            response = get_text_response(text_query=query)

            # Count tokens for every call, including calls whose output fails to parse
            input_tokens += response["usage"]["input_tokens"]
            output_tokens += response["usage"]["output_tokens"]

            document_json = json.loads(response["content"][0]["text"])
            print("JSON loaded successfully...\n")
            break
        except json.JSONDecodeError as e:
            if attempt == max_retries - 1:
                print(f"Failed to load JSON after {max_retries} attempts. Skipping {yml_filename}.\n")
            else:
                print(f"Failed to load JSON (attempt {attempt + 1}/{max_retries}): {e}\n")
                time.sleep(delay)

    if document_json is None:
        continue  # every attempt failed, so there is nothing to build for this file
                
    # yml upload file to s3
    key = f"{prefix}/{yml_filepath.replace(data_dir+'/', '')}"
    s3_path = upload_file_to_s3(yml_filepath, bucket, key)
    
    print(f'yml file uploaded to {s3_path}...\n')
    
    # build the doc
    doc = f"""
    Documentation for {yml_filename.split(".")[0]}
    
    Description:
    {document_json["description"]}
    {document_json["stats"]}
    
    FAQ:
    
    """
    
    for faq in document_json["faq"]:
        doc += faq["question"] + "\n\n" + faq["answer"] + "\n\n"
        
    txt_filename = yml_filename.split(".")[0]+".txt"
    txt_filepath = f"{data_dir}/yml_questions/{txt_filename}"
    
    with open(txt_filepath, 'w', encoding='utf-8') as file:
        file.write(doc)
    print(f'documentation generated at {txt_filepath}...\n')
    
    # txt file upload file to s3
    key = f"{prefix}/{txt_filepath.replace(data_dir+'/', '')}"
    metadata = {"s3_url":s3_path}
    s3_path = upload_file_to_s3(txt_filepath, bucket, key, metadata=metadata)
    
    print(f'documentation uploaded to {s3_path}...\n')
    
total_cost = (
    input_per_1k * input_tokens +
    output_per_1k * output_tokens
) / 1000

print('\n')
print('========================================================================')
print('Estimated cost:', colored(f"${total_cost}", 'green'), f"in us-east-1 region with {colored(input_tokens, 'green')} input tokens and {colored(output_tokens, 'green')} output tokens.")
print('========================================================================')
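
`json.loads` in the loop above fails if the model wraps its JSON in a code fence or adds a sentence before it. One possible mitigation, sketched here under the assumption that the reply contains exactly one top-level JSON object, is to parse only the span between the first `{` and the last `}` and then check for the expected keys. `extract_json_object` is a hypothetical helper, not part of the notebook's `utils` module.

```python
import json

REQUIRED_KEYS = ("description", "stats", "faq")


def extract_json_object(text):
    """Parse the outermost {...} span in `text` and check the expected keys."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    document = json.loads(text[start:end + 1])
    missing = [key for key in REQUIRED_KEYS if key not in document]
    if missing:
        raise ValueError(f"model output is missing keys: {missing}")
    return document


# Made-up reply that wraps the JSON in prose and a Markdown code fence.
fence = "`" * 3
reply = (
    "Here is the documentation:\n"
    f"{fence}json\n"
    '{"description": "Pet store API", "stats": "- 3 routes", "faq": []}\n'
    f"{fence}"
)
parsed = extract_json_object(reply)
print(parsed["description"])
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` to handle both malformed JSON and missing keys in one place.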

### > Enrich UML diagrams

Here is a prompt for the Anthropic Claude 3 model to generate synthetic captions from UML diagrams.

In [None]:
prompt_template="""
You will be provided with a UML diagram image describing a set of APIs. Your
task is to generate a principle-level documentation for these APIs in JSON format.

Here are the steps you should follow:

1. Read the provided image carefully to understand the APIs, their
endpoints, request/response data structures, and other details.

2. In the <description> field of your JSON output, provide a comprehensive description of the APIs.
Explain what each API does, what data properties the requests take, and what the expected response
messages are. Call out any limits and frequently encountered errors. Use examples from the diagram to
illustrate your points.

3. In the <stats> field, generate some overall statistics about the APIs and present them in bullet
list sentence style, such as:
- Number of routes/endpoints?
- Number of request data models?
- Number of response data models?
- Any other relevant stats you can extract from the diagram

4. In the <faq> field, generate a list of 20 questions and corresponding answers for a Frequently Asked
Questions (FAQ) section. Start with simple questions about the APIs and gradually increase the
complexity. The questions should cover various aspects of the APIs, such as their functionality,
data structures, error handling, and so on. The answers should be clear, concise, and informative,
using examples from the diagram where appropriate.

5. Structure your JSON output as follows:

{
"description": "<description>",
"stats": "<stats>",
"faq": [
{
"question": "<question>",
"answer": "<answer>"
},
...
]
}

Replace <description>, <stats>, <question>, and <answer> with the appropriate content you generated
in the previous steps.

Please provide your response in JSON format only, without any additional explanations or comments.
"""

### > Let's preview one of the UML diagrams

In [None]:
from PIL import Image
from IPython.display import display

# Open the JPG image file
image = Image.open(f"{data_dir}/uml_diagrams/petstore.jpg")
image = image.convert("RGB")
display(image)
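
Before a diagram can be sent to the model it has to be serialized. The notebook delegates this to `image_to_base64` from `utils`, whose implementation is not shown here. As a rough sketch of the underlying idea, using only the standard library and a hypothetical helper name `bytes_to_base64`, raw image bytes can be encoded to an ASCII string and decoded back without loss.

```python
import base64


def bytes_to_base64(data: bytes) -> str:
    """Encode raw bytes as an ASCII base64 string."""
    return base64.b64encode(data).decode("ascii")


# The first four bytes of a JFIF JPEG file, used here as sample data.
sample = b"\xff\xd8\xff\xe0"
encoded = bytes_to_base64(sample)
print(encoded)

# Decoding restores the original bytes exactly.
assert base64.b64decode(encoded) == sample
```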

### > Generate enriched documents from UML
The code below processes the UML diagrams (which contain Swagger API information) to generate human-readable documentation in text format. Amazon Q Business can then use this documentation to build an AI assistant that understands these APIs.

In [None]:
!cd .. && rm -rf `find -type d -name .ipynb_checkpoints`

In [None]:
input_tokens = 0
output_tokens = 0

for uml_filename in os.listdir(uml_dir):
    # Construct the full file path
    uml_filepath = os.path.join(uml_dir, uml_filename)

    image = Image.open(uml_filepath)
    image = image.convert("RGB")

    query = prompt_template

    max_retries = 3
    delay = 2  # Delay in seconds between retries
    
    document_json = None  # reset so a failed file never reuses the previous result

    for attempt in range(max_retries):
        try:
            response = get_text_response(image_base64=image_to_base64(image), text_query=query)

            # Count tokens for every call, including calls whose output fails to parse
            input_tokens += response["usage"]["input_tokens"]
            output_tokens += response["usage"]["output_tokens"]

            document_json = json.loads(response["content"][0]["text"])
            print("JSON loaded successfully...\n")
            break
        except json.JSONDecodeError as e:
            if attempt == max_retries - 1:
                print(f"Failed to load JSON after {max_retries} attempts. Skipping {uml_filename}.\n")
            else:
                print(f"Failed to load JSON (attempt {attempt + 1}/{max_retries}): {e}\n")
                time.sleep(delay)

    if document_json is None:
        continue  # every attempt failed, so there is nothing to build for this file
                
    # upload the UML diagram to s3
    key = f"{prefix}/{uml_filepath.replace(data_dir+'/', '')}"
    s3_path = upload_file_to_s3(uml_filepath, bucket, key)
    
    print(f'UML diagram uploaded to {s3_path}...\n')
    
    # build the doc (use uml_filename here, not the yml_filename left over from the YAML loop)
    doc = f"""
    Documentation for {uml_filename.split(".")[0]}
    
    Description:
    {document_json["description"]}
    {document_json["stats"]}
    
    FAQ:
    
    """
    
    for faq in document_json["faq"]:
        doc += faq["question"] + "\n\n" + faq["answer"] + "\n\n"
        
    txt_filename = uml_filename.split(".")[0]+".txt"
    txt_filepath = f"{data_dir}/uml_questions/{txt_filename}"
    
    with open(txt_filepath, 'w', encoding='utf-8') as file:
        file.write(doc)
    print(f'documentation generated at {txt_filepath}...\n')
    
    # txt file upload file to s3
    key = f"{prefix}/{txt_filepath.replace(data_dir+'/', '')}"
    metadata = {"s3_url":s3_path}
    s3_path = upload_file_to_s3(txt_filepath, bucket, key, metadata=metadata)
    
    print(f'documentation uploaded to {s3_path}...\n')
    
total_cost = (
    input_per_1k * input_tokens +
    output_per_1k * output_tokens
) / 1000

print('\n')
print('========================================================================')
print('Estimated cost:', colored(f"${total_cost}", 'green'), f"in us-east-1 region with {colored(input_tokens, 'green')} input tokens and {colored(output_tokens, 'green')} output tokens.")
print('========================================================================')
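
Both loops assemble the output text with an indented f-string, which leaves leading spaces on the lines of the generated file. A possible alternative, sketched with a hypothetical `build_doc` helper and made-up sample data, joins the parts explicitly so the text carries no incidental indentation.

```python
def build_doc(name, document):
    """Render the model's JSON output as a plain-text document."""
    lines = [
        f"Documentation for {name}",
        "",
        "Description:",
        document["description"],
        document["stats"],
        "",
        "FAQ:",
        "",
    ]
    for faq in document["faq"]:
        # One blank line between question and answer, and after each pair.
        lines += [faq["question"], "", faq["answer"], ""]
    return "\n".join(lines)


# Made-up sample data in the shape the prompt asks the model to return.
sample_document = {
    "description": "Pet store API.",
    "stats": "- 3 routes",
    "faq": [{"question": "What does the API do?", "answer": "It manages pets."}],
}

doc = build_doc("petstore", sample_document)
print(doc)
```

The same helper could serve both the YAML and the UML loop, which would also remove the risk of one loop referring to the other loop's filename variable.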