# Rhubarb Large Document Processing

This cookbook demonstrates how to use Rhubarb's sliding window approach to process documents with more than 20 pages using Claude models.

Claude models can process at most 20 pages at a time. Rhubarb's sliding window approach works around this limit by splitting a larger document into windows of pages, processing each window separately, and overlapping consecutive windows by a configurable number of pages.
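
To make the windowing concrete, here is a minimal sketch of how overlapping page windows could be computed. It is an illustration only, not Rhubarb's actual implementation: the helper `window_ranges`, its 1-based inclusive page numbering, and the default window size of 20 are assumptions chosen to match the limit described above.

```python
def window_ranges(total_pages, window_size=20, overlap=2):
    """Return 1-based, inclusive (start, end) page ranges covering a document.

    Hypothetical helper for illustration; not part of Rhubarb.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")

    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + window_size - 1, total_pages)
        ranges.append((start, end))
        if end == total_pages:
            break
        # The next window re-reads the last `overlap` pages of this one
        start = end - overlap + 1
    return ranges


print(window_ranges(45))  # prints [(1, 20), (19, 38), (37, 45)]
```

With an overlap of 2, each window after the first repeats the last two pages of the previous one, so content that spans a window boundary is seen in full at least once.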

In [None]:
import boto3
import json
from rhubarb import DocAnalysis
from rhubarb.models import LanguageModels

## Setup

First, let's set up our AWS session and create a DocAnalysis object with sliding window enabled.

In [None]:
# Initialize AWS session
session = boto3.Session()

# Path to a large PDF document (more than 20 pages)
file_path = "path/to/your/large-document.pdf"

# Create DocAnalysis with sliding window enabled
doc_analysis = DocAnalysis(
    file_path=file_path,
    boto3_session=session,
    modelId=LanguageModels.CLAUDE_SONNET_V2,
    max_tokens=2048,
    temperature=0.0,
    sliding_window_overlap=2     # Number of pages to overlap between windows (1-10)
)

## Process a Large Document

Now let's process the document by asking a question about its content.

In [None]:
# Ask a question about the document
question = "Please summarize the main points of this document."

# Process the document
response = doc_analysis.run(question)

# Print the response
print(json.dumps(response, indent=2))

## Understanding the Response

When using the sliding window approach, the response will contain results from each window along with metadata about the processing.

In [None]:
# Extract window information
if "sliding_window_processing" in response:
    window_meta = response["sliding_window_processing"]
    print(f"Total windows processed: {window_meta['total_windows']}")

    # Print the page range covered by each window
    for i, window_info in enumerate(window_meta["window_info"], start=1):
        start = window_info["current_window_start"]
        end = window_info["current_window_end"]
        print(f"Window {i}: Pages {start}-{end} of {window_info['total_pages']}")
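
Because consecutive windows overlap, it can be useful to confirm that the reported windows leave no page unprocessed. The sketch below runs this check over a hand-written sample of the `window_info` list. The helper `uncovered_pages` is hypothetical, and it assumes page numbers are 1-based and inclusive, which you should verify against a real response.

```python
def uncovered_pages(window_info):
    """Return the page numbers that no window covers.

    Assumes 1-based, inclusive page numbers (an assumption, not confirmed
    against Rhubarb's output). Hypothetical helper for illustration.
    """
    if not window_info:
        return []

    total_pages = window_info[0]["total_pages"]
    covered = set()
    for window in window_info:
        first = window["current_window_start"]
        last = window["current_window_end"]
        covered.update(range(first, last + 1))

    return [page for page in range(1, total_pages + 1) if page not in covered]


# Hand-written sample data, not real Rhubarb output
sample_window_info = [
    {"current_window_start": 1, "current_window_end": 20, "total_pages": 45},
    {"current_window_start": 19, "current_window_end": 38, "total_pages": 45},
    {"current_window_start": 37, "current_window_end": 45, "total_pages": 45},
]

print(uncovered_pages(sample_window_info))  # prints []
```

An empty list means every page fell inside at least one window; any page numbers returned would point to a gap worth investigating.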

## Using a Custom Output Schema

You can also use a custom output schema with the sliding window approach.

In [None]:
# Define a custom output schema
output_schema = {
    "type": "object",
    "properties": {
        "summary": {
            "type": "string",
            "description": "A summary of the document content"
        },
        "key_points": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "description": "Key points from the document"
        }
    },
    "required": ["summary", "key_points"]
}

# Process the document with the custom schema
response_with_schema = doc_analysis.run(
    "Please summarize the main points of this document.",
    output_schema=output_schema
)

# Print the response
print(json.dumps(response_with_schema, indent=2))
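
With a schema like this, each window presumably yields its own `summary` and `key_points`, and the overlapping pages may cause the same key point to appear more than once. The following sketch merges per-window results and drops repeated key points. It assumes you have already extracted one schema-shaped dictionary per window; how these are nested in the actual response is not shown here, and `merge_window_results` is a hypothetical helper, not part of Rhubarb.

```python
def merge_window_results(window_results):
    """Combine per-window dicts that follow the summary/key_points schema.

    Hypothetical helper for illustration. Key points are compared
    case-insensitively with whitespace collapsed, keeping the first occurrence.
    """
    summaries = []
    key_points = []
    seen = set()

    for result in window_results:
        summary = result.get("summary", "").strip()
        if summary:
            summaries.append(summary)

        for point in result.get("key_points", []):
            normalized = " ".join(point.lower().split())
            if normalized not in seen:
                seen.add(normalized)
                key_points.append(point)

    return {"summary": " ".join(summaries), "key_points": key_points}


# Hand-written sample data, not real model output
window_results = [
    {"summary": "Covers the introduction.",
     "key_points": ["Scope defined", "Budget approved"]},
    {"summary": "Covers the rollout plan.",
     "key_points": ["budget  approved", "Rollout starts in Q3"]},
]

merged = merge_window_results(window_results)
print(merged["key_points"])
# prints ['Scope defined', 'Budget approved', 'Rollout starts in Q3']
```

Exact-match deduplication only catches literal repeats. Key points that are reworded between windows would need a fuzzier comparison or a follow-up model call to consolidate.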

## Manually Using the LargeDocumentProcessor

For more control over the processing, you can use the LargeDocumentProcessor directly.

In [None]:
from rhubarb.file_converter import LargeDocumentProcessor

# Create a LargeDocumentProcessor
processor = LargeDocumentProcessor(file_path=file_path, s3_client=session.client('s3'))

# Get information about the document
print(f"Total pages: {processor.total_pages}")
print(f"Current window: {processor.get_window_info()}")

# Get the current window of pages as base64
pages = processor.get_pages_as_base64()
print(f"Number of pages in current window: {len(pages)}")

# Move to the next window
processor.move_to_next_window(overlap=2)
print(f"After moving to next window: {processor.get_window_info()}")

## Custom Processing Function

You can also define a custom processing function to use with the LargeDocumentProcessor.

In [None]:
def custom_processor(page_data, **kwargs):
    """Custom function to process a window of pages"""
    window_info = kwargs.get("window_info", {})

    # Print information about the current window, if it was provided
    # (indexing an empty dict would raise a KeyError)
    if window_info:
        print(f"Processing pages {window_info['current_window_start']}-{window_info['current_window_end']} of {window_info['total_pages']}")
    
    # Here you would typically process the pages using your own logic
    # For this example, we'll just return some basic information
    return {
        "pages_processed": len(page_data),
        "page_numbers": [page["page"] for page in page_data],
        "window_info": window_info
    }

# Process the entire document using our custom function
results = processor.process_document(
    processor_func=custom_processor,
    use_converse_api=False,
    overlap=2
)

# Print the results
print(json.dumps(results, indent=2))

## Conclusion

In this cookbook, we've demonstrated how to use Rhubarb's sliding window approach to process documents with more than 20 pages using Claude models. This approach allows you to work with large documents while respecting Claude's 20-page limitation.