## Setting Up AWS Textract and AWS S3 for OCR

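Note: Put the access key, secret access key, default region, and bucket name in the .env file as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, and AWS_BUCKET_NAME (the names the code below reads).
The S3 bucket must already exist.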
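
A minimal sketch of a startup check for the .env settings, assuming the four variable names that the cells below read with os.getenv. The helper name missing_env_vars and the sample values are hypothetical and only illustrate the idea.

```python
import os

# Variable names read by the notebook cells via os.getenv().
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_BUCKET_NAME",
)

def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Illustrative settings (placeholder values) with two entries left out.
sample = {"AWS_ACCESS_KEY_ID": "example", "AWS_DEFAULT_REGION": "us-east-1"}
print(missing_env_vars(sample))  # ['AWS_SECRET_ACCESS_KEY', 'AWS_BUCKET_NAME']
```

Calling missing_env_vars() with no argument after load_dotenv() checks the real environment, which gives a clearer error than a failed AWS call later on.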

## Automated PDF Text Extraction with AWS Textract

This Python script defines a class, TextractJobRunner, that automates analyzing a PDF document with AWS Textract. It uploads a local PDF file to an S3 bucket, starts an asynchronous Textract analysis job with the LAYOUT and TABLES features, and polls AWS every five seconds until the job finishes. On success, it retrieves every page of the analysis results through NextToken pagination and saves the complete, raw data into a single JSON file on your local machine for later use.

In [1]:
import os
import time
import boto3
import json
from dotenv import load_dotenv

load_dotenv()

class TextractJobRunner:
    def __init__(self, base_dir="output"):
        self.textract = boto3.client(
            'textract',
            aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
            region_name=os.getenv("AWS_DEFAULT_REGION")
        )
        self.s3 = boto3.client(
            's3',
            aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
            region_name=os.getenv("AWS_DEFAULT_REGION")
        )
        self.bucket_name = os.getenv("AWS_BUCKET_NAME")
        self.base_dir = base_dir

    def run_job_and_save_response(self, file_path):
        """
        Takes a PDF, runs Textract analysis, and saves the raw JSON output.
        """
        file_name_no_ext = os.path.splitext(os.path.basename(file_path))[0]
        output_folder = os.path.join(self.base_dir, file_name_no_ext)
        os.makedirs(output_folder, exist_ok=True)

        # Upload to S3
        s3_key = f'textract-analysis/{os.path.basename(file_path)}'
        self.s3.upload_file(file_path, self.bucket_name, s3_key)
        print(f"Uploaded to S3: {s3_key}")

        # Start Textract job
        response = self.textract.start_document_analysis(
            DocumentLocation={'S3Object': {'Bucket': self.bucket_name, 'Name': s3_key}},
            FeatureTypes=['LAYOUT', 'TABLES']
        )
        job_id = response['JobId']
        print(f"Textract Job Started: {job_id}")

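        # Poll for completion
        while True:
            result = self.textract.get_document_analysis(JobId=job_id)
            status = result['JobStatus']
            print(f"Job status: {status}")
            if status != 'IN_PROGRESS':
                # Any terminal status other than SUCCEEDED (FAILED, PARTIAL_SUCCESS)
                # is treated as an error so the loop cannot spin forever.
                if status != 'SUCCEEDED':
                    raise RuntimeError(
                        f"Textract job ended with status {status}: {result.get('StatusMessage', '')}"
                    )
                break
            time.sleep(5)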

        # Retrieve all results using pagination
        results = []
        next_token = None
        while True:
            # Only pass NextToken once a previous page has returned one
            kwargs = {'JobId': job_id}
            if next_token:
                kwargs['NextToken'] = next_token
            response = self.textract.get_document_analysis(**kwargs)
            results.append(response)
            next_token = response.get('NextToken')
            if not next_token:
                break
        print(f"Retrieved {len(results)} pages of results.")

        # Save the raw results to a JSON file
        json_path = os.path.join(output_folder, "raw_textract_response.json")
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2)
        print(f"Saved raw Textract response to: {json_path}")

        return json_path

processor = TextractJobRunner()
processor.run_job_and_save_response("test_info_extract.pdf")

Uploaded to S3: textract-analysis/test_info_extract.pdf
Textract Job Started: c852bd39678d05af52b923989f4809995de2f1da5f93e167c7addc680e71967c
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: SUCCEEDED
Retrieved 1 pages of results.
Saved raw Textract response to: output\test_info_extract\raw_textract_response.json


'output\\test_info_extract\\raw_textract_response.json'
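
The polling step above waits indefinitely. A minimal sketch of the same loop with a timeout follows, assuming a caller-supplied get_status callable in place of the real Textract call so that it runs without AWS. The names wait_for_job and TERMINAL_STATES and the default timeout are hypothetical.

```python
import time

# Statuses after which polling should stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "PARTIAL_SUCCESS"}

def wait_for_job(get_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll get_status() until a terminal status, or raise after `timeout` seconds."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"Job still {status} after {waited:.0f}s")
        sleep(interval)
        waited += interval

# Usage with a stand-in status source (no AWS call is made here).
statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
print(wait_for_job(lambda: next(statuses), sleep=lambda seconds: None))  # SUCCEEDED
```

In the class above, get_status could be a small lambda that calls get_document_analysis and returns its JobStatus field.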

## Textract JSON Parser and Figure Extractor

This Python script defines two main functions to process the output from an AWS Textract analysis. The first function, save_layout, parses the raw JSON response to identify and organize all layout elements (like titles and text blocks) into a structured layout.csv file. The second function, save_figures, uses the same JSON data along with the original PDF to locate the coordinates of each figure, extracts the actual images by cropping the PDF, and saves them as individual PNG files into a specified folder.
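
To illustrate how a layout element gets its text, here is a small sketch that follows a layout block's CHILD relationships to its LINE blocks. The sample blocks are hand-written in the shape Textract returns, not real output, and the helper name layout_text is hypothetical.

```python
def layout_text(block, line_map):
    """Join the text of every LINE child referenced by a layout block."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel.get("Type") != "CHILD":
            continue
        for child_id in rel.get("Ids", []):
            line = line_map.get(child_id)
            if line is not None:
                parts.append(line.get("Text", ""))
    return " ".join(parts).strip()

# Hand-written sample blocks (illustrative only).
blocks = [
    {"Id": "l1", "BlockType": "LINE", "Text": "Home Energy Report:"},
    {"Id": "l2", "BlockType": "LINE", "Text": "electricity"},
    {"Id": "t1", "BlockType": "LAYOUT_TITLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["l1", "l2"]}]},
]
line_map = {b["Id"]: b for b in blocks if b["BlockType"] == "LINE"}
print(layout_text(blocks[2], line_map))  # Home Energy Report: electricity
```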

In [3]:
import os
import json
import pandas as pd
from collections import defaultdict
import fitz

def save_layout(results, output_folder):
    """
    Builds layout.csv from the raw Textract JSON: one row per LAYOUT_* block, in reading order.
    """
    rows = []
    layout_counters = defaultdict(int)
    reading_order = 0
    line_map = {}
    
    for page in results:
        line_map.update({b['Id']: b for b in page['Blocks'] if b['BlockType'] == 'LINE'})
        layout_blocks = [b for b in page['Blocks'] if b['BlockType'].startswith('LAYOUT')]

        for block in layout_blocks:
            layout_key = block['BlockType'].replace('LAYOUT_', '').capitalize()
            layout_counters[layout_key] += 1
            layout_label = f"{layout_key} {layout_counters[layout_key]}"

            line_text = ''
            for rel in block.get('Relationships', []):
                if rel.get('Type') == 'CHILD':
                    line_text = ' '.join(line_map.get(i, {}).get('Text', '') for i in rel.get('Ids', []) if i in line_map)

            rows.append({
                'Page number': block.get('Page', 1),
                'Layout': layout_label,
                'Text': line_text.strip(),
                'Reading Order': reading_order,
                'Confidence score % (Layout)': block.get('Confidence', 0)
            })
            reading_order += 1

    layout_path = os.path.join(output_folder, 'layout.csv')
    pd.DataFrame(rows).to_csv(layout_path, index=False)
    print(f"Saved raw layout to {layout_path}")

def save_figures(results, pdf_path, output_folder):
    """
    Crops each LAYOUT_FIGURE block out of the PDF using its bounding box from the raw JSON and saves it as a PNG in output_folder.
    """
    os.makedirs(output_folder, exist_ok=True)
    
    doc = fitz.open(pdf_path)

    all_blocks = [block for page in results for block in page.get('Blocks', [])]
    figure_blocks = [b for b in all_blocks if b.get('BlockType') == 'LAYOUT_FIGURE']

    for i, block in enumerate(figure_blocks, 1):
        page_num = block.get('Page')
        page = doc.load_page(page_num - 1)
        box = block['Geometry']['BoundingBox']
        
        clip_rect = fitz.Rect(
            box['Left'] * page.rect.width, box['Top'] * page.rect.height,
            (box['Left'] + box['Width']) * page.rect.width,
            (box['Top'] + box['Height']) * page.rect.height
        )
        
        pix = page.get_pixmap(clip=clip_rect, dpi=200)
        output_path = os.path.join(output_folder, f"figure_{i}.png")
        pix.save(output_path)
        
    doc.close()
    print(f"Saved {len(figure_blocks)} figures to '{output_folder}'.")


base_dir = "output"
document_folder = "test_info_extract"
pdf_file_path = "test_info_extract.pdf"
figure_output_folder = os.path.join(base_dir, document_folder, "figures")

output_folder_path = os.path.join(base_dir, document_folder)
json_file_path = os.path.join(output_folder_path, "raw_textract_response.json")
with open(json_file_path, 'r', encoding='utf-8') as f:
    results_data = json.load(f)

save_layout(results_data, output_folder_path)
save_figures(results_data, pdf_file_path, figure_output_folder)

Saved raw layout to output\test_info_extract\layout.csv
Saved 12 figures to 'output\test_info_extract\figures'.
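
The cropping step depends on converting Textract's bounding box, which is expressed as fractions of the page width and height, into page coordinates. A minimal sketch of that arithmetic in plain Python follows; the function name bbox_to_rect and the 600 x 800 page size are made up for illustration.

```python
def bbox_to_rect(box, page_width, page_height):
    """Convert a normalized bounding box to (x0, y0, x1, y1) page coordinates."""
    x0 = box["Left"] * page_width
    y0 = box["Top"] * page_height
    x1 = (box["Left"] + box["Width"]) * page_width
    y1 = (box["Top"] + box["Height"]) * page_height
    return (x0, y0, x1, y1)

# Illustrative box covering the middle band of a 600 x 800 page.
box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25}
print(bbox_to_rect(box, 600, 800))  # (150.0, 400.0, 450.0, 600.0)
```

The resulting tuple has the same ordering as the four arguments passed to fitz.Rect in save_figures.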


## CSV and Markdown Generation from Layout Data

This Python script automates the final step of document reconstruction by taking a layout.csv file and a folder of previously extracted figure images as input. It first reads the CSV and systematically finds every row corresponding to a figure, updating its 'Text' column with the correct relative path to the saved image file. After saving this updated data to a new layout_with_figures.csv, the script then generates a complete final_layout.md file, converting titles to Markdown headers and the newly added figure paths into proper image links, effectively creating a readable version of the original document.

In [4]:
import pandas as pd
import os
import re

csv_path = os.path.join(output_folder_path, 'layout.csv')
figures_folder_path = os.path.join(output_folder_path, 'figures')


df = pd.read_csv(csv_path, na_filter=False)

figure_files = sorted(
    [f for f in os.listdir(figures_folder_path) if f.startswith('figure_') and f.endswith('.png')],
    key=lambda x: int(re.search(r'figure_(\d+)\.png', x).group(1))
)
print(f"\nFound {len(figure_files)} figure image files.")

figure_rows_indices = df[df['Layout'].str.startswith('Figure', na=False)].index

# Loop through the figure rows and update the 'Text' column.
for i, df_index in enumerate(figure_rows_indices):
    figure_path = os.path.join('figures', figure_files[i])
    df.loc[df_index, 'Text'] = figure_path

output_csv_path = os.path.join(output_folder_path, 'layout_with_figures.csv')
df.to_csv(output_csv_path, index=False)
print(f"\nSuccessfully created new CSV with integrated figure filenames: {output_csv_path}")

md_content = []
for index, row in df.iterrows():
    layout_type = str(row.get('Layout', '')).split(' ')[0]
    text = str(row.get('Text', ''))

    if not text:
        continue

    if layout_type == 'Title':
        md_content.append(f"# {text}\n")
    elif layout_type == 'Header':
        md_content.append(f"## {text}\n")
    elif layout_type == 'Figure':
        md_content.append(f"![{text}]({text})\n")
    else: 
        md_content.append(f"{text}\n")

# Save the final markdown content to a file.
md_path = os.path.join(output_folder_path, 'final_layout.md')
with open(md_path, 'w', encoding='utf-8') as f:
    f.write("\n".join(md_content))
print(f"Successfully created Markdown file: {md_path}")



Found 12 figure image files.

Successfully created new CSV with integrated figure filenames: output\test_info_extract\layout_with_figures.csv
Successfully created Markdown file: output\test_info_extract\final_layout.md
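
The two core pieces of this step, numeric sorting of the figure files and the mapping from a layout row to a Markdown line, can be sketched with the standard library alone. Two details are assumptions that differ from the cell above: the alt text uses the layout label rather than the path, and backslashes are converted to forward slashes so the image link also resolves outside Windows. The function names are hypothetical.

```python
import re

def figure_number(filename):
    """Return N from 'figure_N.png' so files sort numerically, not alphabetically."""
    match = re.fullmatch(r"figure_(\d+)\.png", filename)
    if match is None:
        raise ValueError(f"Unexpected figure file name: {filename}")
    return int(match.group(1))

def row_to_markdown(layout, text):
    """Render one layout row ('Title 1', 'Figure 3', ...) as a Markdown line."""
    kind = layout.split(" ")[0]
    if kind == "Title":
        return f"# {text}"
    if kind == "Header":
        return f"## {text}"
    if kind == "Figure":
        # Assumption: normalize Windows separators for portable links.
        link = text.replace("\\", "/")
        return f"![{layout}]({link})"
    return text

print(sorted(["figure_10.png", "figure_2.png", "figure_1.png"], key=figure_number))
# ['figure_1.png', 'figure_2.png', 'figure_10.png']
print(row_to_markdown("Figure 1", "figures\\figure_1.png"))
# ![Figure 1](figures/figure_1.png)
```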


## Contextual Image Summarization with S3 Integration

This Python script automates the process of generating rich, contextual descriptions for figures listed in a CSV file. It iterates through the layout data, and for each figure, it uploads the corresponding local image file to an AWS S3 bucket, generating a temporary, secure pre-signed URL. This URL, along with the text immediately before and after the figure, is then used to create a dynamic prompt for a large language model (gpt-4o-mini) via LangChain. After processing all figures in a batch, the script collects the AI-generated summaries and updates the original layout data, saving the final, enriched content into a new CSV file named layout_with_summaries.csv.

In [6]:
import os
import pandas as pd
from dotenv import load_dotenv
import boto3

# LangChain Imports
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

load_dotenv()

def upload_image_to_s3(local_path, bucket_name, s3_client):
    """
    Uploads a local image file to an S3 bucket and returns a temporary,
    secure pre-signed URL valid for one hour.
    """
    
    # Define a unique object name for the image in the S3 bucket
    s3_key = f"figures/{os.path.basename(local_path)}"
    
    # Upload the file
    s3_client.upload_file(local_path, bucket_name, s3_key)
    
    # Generate a pre-signed URL that grants temporary access
    url = s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket_name, 'Key': s3_key},
        ExpiresIn=3600 
    )
    print(f"Uploaded {os.path.basename(local_path)} and generated pre-signed URL.")
    return url

csv_path = os.path.join(output_folder_path, 'layout_with_figures.csv')
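# na_filter=False keeps empty Text cells as '' so the context strings never contain NaN
df = pd.read_csv(csv_path, na_filter=False)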

s3_client = boto3.client('s3')
bucket_name = os.getenv("AWS_BUCKET_NAME")

batch_input = []
figure_filenames_in_order = []
for i, row in df.iterrows():
    if str(row.get('Layout')).startswith('Figure'):
        image_path = os.path.join(output_folder_path, row['Text'])

        # Get context from surrounding text rows
        text_before = df.loc[i - 1, 'Text'] if i > 0 else "No text before."
        text_after = df.loc[i + 1, 'Text'] if i < len(df) - 1 else "No text after."
        
        # Build the dynamic prompt for this specific figure
        dynamic_prompt = f"""Describe the image in detail. It is part of a home energy report.

        CONTEXT BEFORE IMAGE: "{text_before}"
        CONTEXT AFTER IMAGE: "{text_after}"

        Based on the context above, analyze the image. Be specific about graphs, such as bar plots. If it's a simple icon with no data, just say 'icon' - no other explanation.
        """
        
        # Upload the image to S3 and get the secure URL
        image_s3_url = upload_image_to_s3(image_path, bucket_name, s3_client)
        
        if image_s3_url:
            batch_input.append({
                "image_url_input": image_s3_url,
                "prompt_text": dynamic_prompt,
                "original_path": os.path.basename(image_path)
            })
            figure_filenames_in_order.append(os.path.basename(image_path))

# Define the LangChain Prompt Template and Chain 
messages = [
    (
        "user",
        [
            {"type": "text", "text": "{prompt_text}"},
            {
                "type": "image_url",
                "image_url": {"url": "{image_url_input}"},
            },
        ],
    )
]

prompt = ChatPromptTemplate.from_messages(messages)
model = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | model | StrOutputParser()

# Run the Batch Process
if batch_input:

    image_summaries = chain.batch(batch_input, {"max_concurrency": 5})
    for i, summary in enumerate(image_summaries):
        filename = figure_filenames_in_order[i]
        print(f"File: {filename}")
        print(f"Description: {summary}\n")

    figure_indices = df[df['Layout'].str.startswith('Figure', na=False)].index.tolist()

    summary_counter = 0
    for idx in figure_indices:
        if summary_counter < len(image_summaries):
            original_layout_label = df.loc[idx, 'Layout']
            new_summary = image_summaries[summary_counter]
            
            df.loc[idx, 'Text'] = new_summary
            print(f"Updated '{original_layout_label}' with new summary.")
            
            summary_counter += 1

    output_dir = os.path.dirname(csv_path) or '.'
    output_path = os.path.join(output_dir, 'layout_with_summaries.csv')
    df.to_csv(output_path, index=False)
    print(f"\nSuccessfully created final CSV with summaries: {output_path}")

else:
    print("No figures found in the CSV to process.")

Uploaded figure_1.png and generated pre-signed URL.
Uploaded figure_2.png and generated pre-signed URL.
Uploaded figure_3.png and generated pre-signed URL.
Uploaded figure_4.png and generated pre-signed URL.
Uploaded figure_5.png and generated pre-signed URL.
Uploaded figure_6.png and generated pre-signed URL.
Uploaded figure_7.png and generated pre-signed URL.
Uploaded figure_8.png and generated pre-signed URL.
Uploaded figure_9.png and generated pre-signed URL.
Uploaded figure_10.png and generated pre-signed URL.
Uploaded figure_11.png and generated pre-signed URL.
Uploaded figure_12.png and generated pre-signed URL.
File: figure_1.png
Description: The image contains a bar plot comparing energy usage in kilowatt-hours (kWh) among three different categories of homes:

1. **You**: Represented by a blue bar, it shows an energy usage of **125 kWh**.
2. **Similar nearby homes**: Represented by an orange bar, it shows an energy usage of **103 kWh**.
3. **Efficient nearby homes**: Represent
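
The context lookup and prompt assembly from the cell above can be isolated into two small functions, which makes them easy to check without S3 or the model. This is a sketch over a plain list of strings rather than the DataFrame; the function names are hypothetical and the sample rows are illustrative.

```python
def surrounding_text(texts, i,
                     before_default="No text before.",
                     after_default="No text after."):
    """Return the text of the rows directly before and after row i."""
    before = texts[i - 1] if i > 0 else before_default
    after = texts[i + 1] if i < len(texts) - 1 else after_default
    return before, after

def build_figure_prompt(before, after):
    """Assemble the per-figure prompt from its neighbouring text."""
    return (
        "Describe the image in detail. It is part of a home energy report.\n\n"
        f'CONTEXT BEFORE IMAGE: "{before}"\n'
        f'CONTEXT AFTER IMAGE: "{after}"\n\n'
        "Based on the context above, analyze the image."
    )

# Illustrative rows: the figure sits at index 1.
rows = ["18% more than similar nearby homes", "figures/figure_1.png", "Save more this spring"]
before, after = surrounding_text(rows, 1)
print(build_figure_prompt(before, after))
```

Like the original cell, this takes the immediately adjacent rows even when a neighbour is itself a figure, in which case the context is a file path rather than prose.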

## Multimodal Data Preparation for Advanced RAG

This Python script prepares the data structure for a Retrieval-Augmented Generation (RAG) system by loading two CSV files: one containing the original document layout with figure paths, and another with AI-generated summaries. It creates one "parent" document per row of the original content to serve as the ground truth for retrieval. It then applies a custom chunking logic to the rows of the summaries file, greedily grouping consecutive rows into larger, context-rich "child" documents of up to 1000 characters and recording in each child's metadata the IDs of the parents it covers. The final output consists of two aligned sets of documents (original parents and chunked summary children), following the same parent/child pattern as a MultiVectorRetriever: searches run on the rich summaries, and the original, precise content is returned.

In [None]:
import os
import uuid
import pandas as pd
from dotenv import load_dotenv
import boto3

# LangChain Imports
from langchain.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.messages import HumanMessage

# Load environment variables from your .env file
load_dotenv()

csv_figures_path = 'output/test_info_extract/layout_with_figures.csv'
csv_summaries_path = 'output/test_info_extract/layout_with_summaries.csv'

try:
    df_figures = pd.read_csv(csv_figures_path, na_filter=False)
    df_summaries = pd.read_csv(csv_summaries_path, na_filter=False)
    df_figures_sorted = df_figures.sort_values(by='Reading Order').reset_index(drop=True)
    df_summaries_sorted = df_summaries.sort_values(by='Reading Order').reset_index(drop=True)
    assert len(df_figures_sorted) == len(df_summaries_sorted), "CSV files must have the same number of rows."
    print(f"Successfully loaded and sorted data from both CSV files.")
except (FileNotFoundError, KeyError, AssertionError) as e:
    print(f"Error loading or processing CSV files: {e}")
    exit()

# Create Parent Documents and Custom Child Chunks
print("Creating parent documents and custom child chunks...")
parent_documents = []
child_docs = []
parent_doc_ids = []

for index, row in df_figures_sorted.iterrows():
    doc_id = str(uuid.uuid4())
    parent_doc_ids.append(doc_id)
    parent_content = f"{row['Layout']}: {row['Text']}"
    parent_documents.append(Document(page_content=parent_content, metadata={"doc_id": doc_id}))

max_chars = 1000
current_chunk_text = ""
ids_for_current_chunk = []
texts_to_chunk = [f"{row['Layout']}: {row['Text']}" for index, row in df_summaries_sorted.iterrows()]

for i, text_to_add in enumerate(texts_to_chunk):
    doc_id = parent_doc_ids[i]
    if not current_chunk_text:
        current_chunk_text = text_to_add
        ids_for_current_chunk.append(doc_id)
        continue
    if len(current_chunk_text) + len(text_to_add) + 1 > max_chars:
        metadata = {"parent_doc_ids": ",".join(ids_for_current_chunk)}
        child_docs.append(Document(page_content=current_chunk_text, metadata=metadata))
        current_chunk_text = text_to_add
        ids_for_current_chunk = [doc_id]
    else:
        current_chunk_text += "\n" + text_to_add
        ids_for_current_chunk.append(doc_id)
if current_chunk_text:
    metadata = {"parent_doc_ids": ",".join(ids_for_current_chunk)}
    child_docs.append(Document(page_content=current_chunk_text, metadata=metadata))
print(f"Created {len(child_docs)} child chunks from {len(parent_documents)} parent documents.")


Successfully loaded and sorted data from both CSV files.
Creating parent documents and custom child chunks...
Created 6 child chunks from 38 parent documents.
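
The grouping logic is easier to follow on plain strings. The sketch below mirrors the loop above without LangChain, returning (chunk text, parent IDs) pairs. The function name group_into_chunks and the tiny max_chars value in the example are illustrative only.

```python
def group_into_chunks(texts, ids, max_chars=1000):
    """Greedily pack consecutive texts into chunks of at most max_chars characters.

    A single text longer than max_chars still becomes its own chunk.
    Returns a list of (chunk_text, [parent ids]) tuples.
    """
    chunks = []
    current_text, current_ids = "", []
    for text, doc_id in zip(texts, ids):
        if not current_text:
            # Start a new chunk with this text.
            current_text = text
            current_ids.append(doc_id)
        elif len(current_text) + len(text) + 1 > max_chars:
            # Adding this text (plus a newline) would overflow: flush and restart.
            chunks.append((current_text, current_ids))
            current_text, current_ids = text, [doc_id]
        else:
            current_text += "\n" + text
            current_ids.append(doc_id)
    if current_text:
        chunks.append((current_text, current_ids))
    return chunks

print(group_into_chunks(["aaaa", "bbbb", "cc"], ["1", "2", "3"], max_chars=9))
# [('aaaa\nbbbb', ['1', '2']), ('cc', ['3'])]
```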


## Hybrid Search and Parent Document Retrieval

This Python script sets up a sophisticated, two-part retrieval system for a RAG pipeline. First, it creates a hybrid search mechanism by combining a keyword-based retriever (BM25Retriever) with a semantic vector-based retriever (Chroma), allowing it to find documents based on both exact words and contextual meaning. This hybrid search is performed on a collection of smaller "child" documents (like summaries). The script then defines a custom function that takes the results of this hybrid search, extracts the IDs of the original "parent" documents they correspond to, and fetches those complete parent documents from an in-memory store, ensuring that the final output is the full, original source content.

In [None]:
# Set up the Retriever
emb = OpenAIEmbeddings()
vectorstore = Chroma(collection_name="rag_s3_ordered_retriever", embedding_function=emb)
vectorstore.add_documents(child_docs)
store = InMemoryStore()
store.mset(list(zip(parent_doc_ids, parent_documents)))

bm25_retriever = BM25Retriever.from_documents(child_docs)
bm25_retriever.k = 3
semantic_retriever = vectorstore.as_retriever(search_kwargs={'k': 3})
hybrid_retriever = EnsembleRetriever(retrievers=[bm25_retriever, semantic_retriever], weights=[0.5, 0.5])

def get_parents_from_hybrid_search(query):
    child_chunks = hybrid_retriever.invoke(query)
    
    # Get the string of parent IDs from the metadata
    ordered_parent_ids = []
    seen_ids = set()
    for chunk in child_chunks:
        parent_ids_str = chunk.metadata.get("parent_doc_ids", "")
        for p_id in parent_ids_str.split(","):
            if p_id and p_id not in seen_ids:
                ordered_parent_ids.append(p_id)
                seen_ids.add(p_id)
                
    return store.mget(ordered_parent_ids)

retriever = RunnableLambda(get_parents_from_hybrid_search)
print("\nSuccessfully built the vector store and retriever.")


Successfully built the vector store and retriever.
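
To give a feel for how two ranked result lists can be merged with weights, here is a simplified sketch of weighted reciprocal rank fusion, a common rank-merging technique for hybrid search. It is not LangChain's implementation; whether EnsembleRetriever behaves exactly like this, including the constant c=60, is an assumption.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, weights, c=60):
    """Merge ranked lists of document IDs; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (c + rank) for the documents it returned.
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

keyword_hits = ["a", "b", "c"]    # stand-in for BM25 results
semantic_hits = ["b", "c", "a"]   # stand-in for vector search results
print(reciprocal_rank_fusion([keyword_hits, semantic_hits], [0.5, 0.5]))
# ['b', 'a', 'c']
```

Document b wins because it ranks near the top of both lists, which is the behaviour a hybrid retriever is after.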


## Dynamic Multimodal RAG Chain

This Python script defines the final, dynamic stage of a Retrieval-Augmented Generation (RAG) pipeline designed to handle both text and images. It creates a chain that first retrieves relevant documents from your vector store. A custom parsing function then processes these documents: if a document represents a figure, it uploads the corresponding image to AWS S3 on-the-fly to generate a secure, temporary URL. Finally, another function dynamically assembles a multimodal prompt for the language model, combining all the retrieved text with any generated image URLs, allowing the AI to generate a comprehensive answer based on both the textual and visual context.

In [None]:
# Define the RAG Chain with On-the-Fly URL Generation
def parse_docs_and_generate_urls(docs):
    image_urls, text_content = [], []
    for doc in docs:
        if not doc: continue
        
        page_content = doc.page_content
        # Check if the document is a figure
        if "Figure" in page_content.split(':')[0]:
            # The figure path is in the text part of the content. It is relative to the
            # document's output folder (output_folder_path), not to base_dir.
            relative_fig_path = page_content.split(': ', 1)[1]
            local_fig_path = os.path.join(output_folder_path, relative_fig_path)
            
            url = upload_image_to_s3(local_fig_path, bucket_name, s3_client)
            if url:
                image_urls.append(url)
                # Add the figure's layout label as text context
                text_content.append(page_content.split(':')[0] + ":")
        else:
            text_content.append(page_content)
            
    return {"images": image_urls, "texts": text_content}

def build_prompt_with_urls(inputs):
    context_text = "\n".join(inputs["context"]['texts']).strip()
    prompt_content = [{
        "type": "text",
        "text": f"""Answer the question based only on the following context.

Context:
---
{context_text}
---

Question: {inputs['question']}
"""
    }]
    for url in inputs["context"]["images"]:
        prompt_content.append({"type": "image_url", "image_url": {"url": url}})
    return [HumanMessage(content=prompt_content)]

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = (
    {"context": retriever | RunnableLambda(parse_docs_and_generate_urls), "question": RunnablePassthrough()}
    | RunnableLambda(build_prompt_with_urls)
    | llm
    | StrOutputParser()
)
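
The shape of the multimodal message content can be checked without LangChain or S3. The sketch below builds the same list of text and image_url parts from plain inputs and shows why the label split is limited to the first ": ". The function names and the example.com URL are placeholders.

```python
def split_label_and_body(page_content):
    """Split 'Figure 1: figures/figure_1.png' at the first ': ' only."""
    label, _, body = page_content.partition(": ")
    return label, body

def build_multimodal_content(texts, image_urls, question):
    """Return one text part followed by one image_url part per image."""
    context_text = "\n".join(texts).strip()
    content = [{
        "type": "text",
        "text": (
            "Answer the question based only on the following context.\n\n"
            f"Context:\n---\n{context_text}\n---\n\n"
            f"Question: {question}\n"
        ),
    }]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content

# Later colons stay in the body because only the first ': ' splits.
print(split_label_and_body("Text 1: March report Account number: 954137"))
# ('Text 1', 'March report Account number: 954137')

parts = build_multimodal_content(
    ["Text 5: 18% more than similar nearby homes", "Figure 1:"],
    ["https://example.com/figure_1.png"],  # placeholder URL
    "What was the energy usage compared to neighbors?",
)
print([part["type"] for part in parts])  # ['text', 'image_url']
```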

## RAG Pipeline Execution with Source Inspection

This Python script demonstrates how to run and inspect the results of the complete Retrieval-Augmented Generation (RAG) pipeline. It defines a sample query and invokes the main chain to get the final, AI-generated answer. For debugging, it then invokes the retriever a second time on its own to obtain the specific "reference chunks" (the parent documents) that the hybrid search returns for that query. The script prints both the retrieved source documents for verification and the final answer, providing a clear way to understand what information the language model used as its context.

In [None]:
# --- 5. Example Usage ---
if __name__ == '__main__':
    query = "What was the energy usage compared to neighbors?"
    
    print(f"\n--- Example Search for: '{query}' ---")

    # 1. First, get the final answer from the main chain
    print("\nGenerating answer...")
    answer = chain.invoke(query)

    # 2. Then, get the reference documents from the retriever separately to inspect them
    print("Retrieving reference documents...")
    reference_docs = retriever.invoke(query)

    # 3. Print the retrieved context (the reference parent documents)
    print("\n--- REFERENCE CHUNKS (Parent Documents) ---")
    if reference_docs:
        for doc in reference_docs:
            if doc:
                print(doc.page_content)
    else:
        print("No context retrieved.")

    # 4. Print the final answer
    print("\n--- FINAL ANSWER ---")
    print(answer)


--- Example Search for: 'What was the energy usage compared to neighbors?' ---

Generating answer...
Retrieving reference documents...

--- REFERENCE CHUNKS (Parent Documents) ---
Title 1: Home Energy Report: electricity
------------------------------
Text 1: March report Account number: 954137 Service address: 1627 Tulip Lane
------------------------------
Text 2: Dear JILL DOE, here is your usage analysis for March.
------------------------------
Text 3: Your electric use:
------------------------------
Text 4: Above typical use
------------------------------
Text 5: 18% more than similar nearby homes
------------------------------
Figure 1: figures\figure_1.png
------------------------------
Figure 10: figures\figure_10.png
------------------------------
Text 20: Save more this spring
------------------------------
Text 21: Reduce use and save money on your electric bill with these thorough tips, from the kitchen to the laundry room.
------------------------------
Figure 11: figure

In [30]:
# --- 5. Example Usage ---
if __name__ == '__main__':
    query = "How much is my electrical consumption for April and how much is for the similar homes?"
    
    print(f"\n--- Example Search for: '{query}' ---")

    # 1. First, get the final answer from the main chain
    print("\nGenerating answer...")
    answer = chain.invoke(query)

    # 2. Then, get the reference documents from the retriever separately to inspect them
    print("Retrieving reference documents...")
    reference_docs = retriever.invoke(query)

    # 3. Print the retrieved context (the reference parent documents)
    print("\n--- REFERENCE CHUNKS (Parent Documents) ---")
    if reference_docs:
        for doc in reference_docs:
            if doc:
                print(doc.page_content)
    else:
        print("No context retrieved.")

    # 4. Print the final answer
    print("\n--- FINAL ANSWER ---")
    print(answer)


--- Example Search for: 'How much is my electrical consumption for April and how much is for the similar homes?' ---

Generating answer...
Retrieving reference documents...

--- REFERENCE CHUNKS (Parent Documents) ---
Title 1: Home Energy Report: electricity
Text 1: March report Account number: 954137 Service address: 1627 Tulip Lane
Text 2: Dear JILL DOE, here is your usage analysis for March.
Text 3: Your electric use:
Text 4: Above typical use
Text 5: 18% more than similar nearby homes
Figure 1: figures\figure_1.png
Figure 10: figures\figure_10.png
Text 20: Save more this spring
Text 21: Reduce use and save money on your electric bill with these thorough tips, from the kitchen to the laundry room.
Figure 11: figures\figure_11.png
Figure 5: figures\figure_5.png
Text 13: Watch this space for new ways to save energy each month.
Footer 1: Turn over for more savings ideas.
Figure 6: figures\figure_6.png
Title 2: Your top three tailored energy-saving tips
Text 14: Caulk windows and doors
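
The cells above run retrieval twice per query, once inside the chain and once for inspection. A minimal sketch of an alternative that retrieves once and returns both the answer and its sources follows. The retrieve and generate callables are stand-ins for the retriever and the rest of the chain; the function name answer_with_sources is hypothetical.

```python
def answer_with_sources(query, retrieve, generate):
    """Retrieve once, generate from those documents, and return both."""
    docs = retrieve(query)
    answer = generate(query, docs)
    return {"answer": answer, "sources": docs}

# Stand-ins so the sketch runs without a vector store or a model.
def fake_retrieve(query):
    return ["Text 5: 18% more than similar nearby homes"]

def fake_generate(query, docs):
    return f"Answer based on {len(docs)} source(s)."

result = answer_with_sources(
    "What was the energy usage compared to neighbors?", fake_retrieve, fake_generate
)
print(result["answer"])   # Answer based on 1 source(s).
print(result["sources"])  # ['Text 5: 18% more than similar nearby homes']
```

Because the sources come from the same retrieval call that fed the answer, the printed reference chunks are guaranteed to be the ones the model saw.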