# Office Document Processing with Rhubarb

This notebook demonstrates how to process Microsoft Office documents (Excel, PowerPoint, Word) using Rhubarb.

## Prerequisites

- AWS credentials configured
- Rhubarb installed with Office format dependencies: `pip install pyrhubarb`
- Office documents to process

The following dependencies are automatically installed with Rhubarb for Office format support:
- `openpyxl` for Excel file processing
- `python-pptx` for PowerPoint file processing  
- `matplotlib` for visual rendering
- `python-docx` for Word document processing (already supported)
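
To confirm that these packages are importable in the current environment, a quick check along the following lines may help. This is a minimal sketch and not part of Rhubarb: `OFFICE_MODULES` and `missing_modules` are hypothetical names, and the import names (`pptx`, `docx`) differ from the pip package names (`python-pptx`, `python-docx`).

```python
from importlib.util import find_spec

# Import names for the packages listed above. The import name can differ
# from the pip name: python-pptx imports as "pptx", python-docx as "docx".
OFFICE_MODULES = ["openpyxl", "pptx", "matplotlib", "docx"]


def missing_modules(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


missing = missing_modules(OFFICE_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All Office format dependencies are importable.")
```

If anything is reported missing, reinstalling with `pip install pyrhubarb` is the first thing to try.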

In [None]:
import boto3
from rhubarb import DocAnalysis
import json
import os

# Replace with the name of your own AWS profile
os.environ['AWS_PROFILE'] = 'aws-sample-works+prod-Admin'

# Create a boto3 session
session = boto3.Session()
print("AWS session created successfully")

## Excel Spreadsheet Processing

Rhubarb can process Excel files (.xlsx, .xls) and automatically chunks large spreadsheets.

In [None]:
# Example: Processing an Excel file
# Replace with path to your Excel file
excel_file_path = "/path/to/spreadsheet.xlsx"

# If the file does not exist, the except branch below prints a hint instead of failing
try:
    excel_analysis = DocAnalysis(
        file_path=excel_file_path,
        boto3_session=session
    )
    
    # Ask questions about the spreadsheet
    response = excel_analysis.run(
        message="What are the key data points in this spreadsheet? Summarize the main findings."
    )
    
    print("Excel Analysis Result:")
    print(response["output"])
    
except FileNotFoundError:
    print("Excel file not found. Please update the file path to an actual Excel file.")
except Exception as e:
    print(f"Error processing Excel file: {e}")
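
It can be convenient to validate a local path before handing it to `DocAnalysis`. The sketch below is an illustration rather than Rhubarb functionality: `office_kind` and `check_local_file` are hypothetical helpers, and the extension table simply mirrors the formats this notebook mentions. It applies to local paths only, not to `s3://` locations.

```python
from pathlib import Path

# Extensions taken from the formats mentioned in this notebook.
OFFICE_KINDS = {
    ".xlsx": "excel",
    ".xls": "excel",
    ".pptx": "powerpoint",
    ".docx": "word",
}


def office_kind(path):
    """Return 'excel', 'powerpoint', or 'word' for a path, else None."""
    return OFFICE_KINDS.get(Path(path).suffix.lower())


def check_local_file(path):
    """Raise a descriptive error before any analysis is attempted."""
    p = Path(path)
    if office_kind(p) is None:
        raise ValueError(f"Unsupported extension: {p.suffix or '(none)'}")
    if not p.is_file():
        raise FileNotFoundError(f"No such file: {p}")
    return p
```

Calling `check_local_file(excel_file_path)` at the top of a cell would then fail early with a clear message.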

### Processing Specific Worksheets

To process only certain worksheets, pass their 1-based positions in the `pages` parameter.

In [None]:
# Process only specific worksheets (sheets 1, 2, and 3)
try:
    excel_specific = DocAnalysis(
        file_path=excel_file_path,
        pages=[1, 2, 3],  # Process first 3 worksheets
        boto3_session=session
    )
    
    response = excel_specific.run(
        message="Compare the data across these three worksheets. What patterns do you see?"
    )
    
    print("Specific Worksheets Analysis:")
    print(response["output"])
    
except FileNotFoundError:
    print("Excel file not found for specific worksheet processing.")
except Exception as e:
    print(f"Error: {e}")
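
When worksheets are known by name rather than by position, a small helper can translate names into `pages` values. This is a sketch under the assumption that `pages` numbers are 1-based and follow workbook order, as the "first 3 worksheets" comment in the cell above suggests. `pages_for_sheets` and the sheet names are hypothetical.

```python
def pages_for_sheets(sheet_names, wanted):
    """Map worksheet names to 1-based positions in workbook order."""
    positions = {name: index for index, name in enumerate(sheet_names, start=1)}
    unknown = [name for name in wanted if name not in positions]
    if unknown:
        raise KeyError(f"Unknown worksheet(s): {unknown}")
    return sorted(positions[name] for name in wanted)


# Hypothetical sheet names, for illustration only.
sheet_names = ["Summary", "Q1", "Q2", "Q3", "Q4"]
print(pages_for_sheets(sheet_names, ["Q2", "Summary"]))  # [1, 3]
```

With `openpyxl`, the ordered list of names is available as `load_workbook(path, read_only=True).sheetnames`.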

## PowerPoint Presentation Processing

Process PowerPoint presentations slide by slide.

In [None]:
# Example: Processing a PowerPoint file
ppt_file_path = "/path/to/powerpoint.pptx"

try:
    ppt_analysis = DocAnalysis(
        file_path=ppt_file_path,
        pages=[1, 3, 5],  # Analyze slides 1, 3, and 5 only
        boto3_session=session
    )
    
    # Analyze the presentation
    response = ppt_analysis.run(
        message="Summarize the key points from this presentation. What are the main themes?"
    )
    
    print("PowerPoint Analysis Result:")
    print(response["output"])
    
except FileNotFoundError:
    print("PowerPoint file not found. Please update the file path to an actual PPTX file.")
except Exception as e:
    print(f"Error processing PowerPoint file: {e}")

### Processing Specific Slides

In [None]:
# Process specific slides
try:
    ppt_specific = DocAnalysis(
        file_path=ppt_file_path,
        pages=[1, 5, 10],  # Process slides 1, 5, and 10
        boto3_session=session
    )
    
    response = ppt_specific.run(
        message="What are the key messages on these specific slides?"
    )
    
    print("Specific Slides Analysis:")
    print(response["output"])
    
except FileNotFoundError:
    print("PowerPoint file not found for specific slide processing.")
except Exception as e:
    print(f"Error: {e}")
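
For longer slide selections, writing the list by hand gets tedious. The following sketch expands a compact string such as `"1,3,5-7"` into a list suitable for `pages`. `parse_pages` is a hypothetical helper, not a Rhubarb function, and it assumes slide numbers start at 1.

```python
def parse_pages(spec):
    """Expand a string such as '1,3,5-7' into a sorted list of unique ints."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"Invalid range: {part}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("Slide numbers start at 1")
    return sorted(pages)


print(parse_pages("1,3,5-7"))  # [1, 3, 5, 6, 7]
```

The result can be passed directly, for example `pages=parse_pages("1,5,10")`.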

### Including PowerPoint Speaker Notes

The `include_powerpoint_notes` parameter controls whether speaker notes are included in the analysis. It defaults to `False`, so notes are ignored unless you opt in.

In [None]:
# Example: Processing PowerPoint with speaker notes included
try:
    ppt_with_notes = DocAnalysis(
        file_path=ppt_file_path,
        pages=[1, 2, 3],  # Process first 3 slides
        include_powerpoint_notes=True,  # Include speaker notes
        boto3_session=session
    )
    
    response = ppt_with_notes.run(
        message="Analyze the content and speaker notes. What additional context do the notes provide?"
    )
    
    print("PowerPoint Analysis with Speaker Notes:")
    print(response["output"])
    
except FileNotFoundError:
    print("PowerPoint file not found.")
except Exception as e:
    print(f"Error: {e}")

# Compare with notes disabled (default behavior)
try:
    ppt_without_notes = DocAnalysis(
        file_path=ppt_file_path,
        pages=[1, 2, 3],  # Same slides
        include_powerpoint_notes=False,  # Default - no speaker notes
        boto3_session=session
    )
    
    response = ppt_without_notes.run(
        message="Analyze just the slide content without speaker notes."
    )
    
    print("\nPowerPoint Analysis without Speaker Notes:")
    print(response["output"])
    
except FileNotFoundError:
    print("PowerPoint file not found.")
except Exception as e:
    print(f"Error: {e}")

## Structured Data Extraction from Office Documents

Pass a JSON schema as `output_schema` to `run()` to extract structured data from Office documents.

In [None]:
# Define a schema for extracting structured data from a financial spreadsheet
financial_schema = {
    "type": "object",
    "properties": {
        "total_revenue": {
            "type": "number",
            "description": "Total revenue amount"
        },
        "total_expenses": {
            "type": "number", 
            "description": "Total expenses amount"
        },
        "net_profit": {
            "type": "number",
            "description": "Net profit (revenue - expenses)"
        },
        "key_metrics": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "metric_name": {"type": "string"},
                    "value": {"type": "number"},
                    "unit": {"type": "string"}
                }
            }
        }
    },
    "required": ["total_revenue", "total_expenses", "net_profit"]
}

try:
    # Extract structured data from Excel file
    structured_analysis = DocAnalysis(
        file_path=excel_file_path,
        boto3_session=session
    )
    
    response = structured_analysis.run(
        message="Extract the financial data from this spreadsheet",
        output_schema=financial_schema
    )
    
    print("Structured Data Extraction:")
    print(json.dumps(response["output"], indent=2))
    
except FileNotFoundError:
    print("Excel file not found for structured extraction.")
except Exception as e:
    print(f"Error: {e}")
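
A schema constrains the shape of the output, but it does not guarantee that the extracted numbers agree with each other. The sketch below adds a plain-Python sanity check, assuming `response["output"]` is a dict shaped like `financial_schema`. `check_financials` is a hypothetical helper and the sample values are made up.

```python
import math

REQUIRED_FIELDS = ("total_revenue", "total_expenses", "net_profit")


def check_financials(output, tolerance=0.01):
    """Return a list of problems found in an extracted financial record."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = output.get(field)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field} is missing or not a number")
    if not problems:
        expected = output["total_revenue"] - output["total_expenses"]
        if not math.isclose(output["net_profit"], expected, abs_tol=tolerance):
            problems.append(
                f"net_profit {output['net_profit']} != revenue - expenses ({expected})"
            )
    return problems


# Made-up values for illustration, not taken from any real document.
sample = {"total_revenue": 1200.0, "total_expenses": 950.0, "net_profit": 250.0}
print(check_financials(sample))  # []
```

An empty list means the required fields are present and `net_profit` matches the definition given in the schema description.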

## Large Document Processing

Rhubarb automatically splits large Office documents into chunks. The `sliding_window_overlap` parameter sets the overlap between consecutive chunks.

In [None]:
# Process a large Excel file with sliding window
large_file_path = "/path/to/spreadsheet.xlsx"

try:
    large_doc = DocAnalysis(
        file_path=large_file_path,
        sliding_window_overlap=2,  # Overlap between chunks
        boto3_session=session
    )
    
    response = large_doc.run(
        message="Analyze trends and patterns across this entire dataset"
    )
    
    print("Large Document Analysis:")
    print(response["output"])
    
except FileNotFoundError:
    print("Large Excel file not found.")
except Exception as e:
    print(f"Error: {e}")
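
The idea behind overlapping chunks can be illustrated independently of Rhubarb. The sketch below is a generic sliding window, not Rhubarb's internal implementation: `sliding_windows` and its `window_size` parameter are hypothetical, introduced only to show how an overlap of 2 makes neighbouring chunks share content.

```python
def sliding_windows(pages, window_size, overlap):
    """Split pages into windows where consecutive windows share `overlap` items."""
    if overlap < 0 or window_size <= overlap:
        raise ValueError("window_size must be larger than a non-negative overlap")
    step = window_size - overlap
    windows = []
    for start in range(0, len(pages), step):
        windows.append(pages[start:start + window_size])
        if start + window_size >= len(pages):
            break  # the last window already reaches the end
    return windows


print(sliding_windows(list(range(1, 11)), window_size=5, overlap=2))
# [[1, 2, 3, 4, 5], [4, 5, 6, 7, 8], [7, 8, 9, 10]]
```

The shared items (4 and 5, then 7 and 8) give each chunk some context from the previous one, which is the usual motivation for an overlap.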

## S3 Integration

Process Office documents stored in Amazon S3.

In [None]:
# Process Office documents from S3
s3_excel_path = "s3://your-bucket/path/to/spreadsheet.xlsx"
s3_ppt_path = "s3://your-bucket/path/to/presentation.pptx"

try:
    # Excel from S3
    s3_excel = DocAnalysis(
        file_path=s3_excel_path,
        boto3_session=session
    )
    
    excel_response = s3_excel.run(
        message="What insights can you derive from this S3-stored spreadsheet?"
    )
    
    print("S3 Excel Analysis:")
    print(excel_response["output"])
    
except Exception as e:
    print(f"S3 Excel processing error: {e}")

try:
    # PowerPoint from S3
    s3_ppt = DocAnalysis(
        file_path=s3_ppt_path,
        boto3_session=session
    )
    
    ppt_response = s3_ppt.run(
        message="Summarize this S3-stored presentation"
    )
    
    print("S3 PowerPoint Analysis:")
    print(ppt_response["output"])
    
except Exception as e:
    print(f"S3 PowerPoint processing error: {e}")
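
A malformed S3 path is easy to miss when every exception is caught generically, as in the cells above. The following sketch validates the URI shape up front using only the standard library. `split_s3_uri` is a hypothetical helper, and it does not handle keys that contain `?` or `#`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key), raising on malformed input."""
    parsed = urlparse(uri)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not bucket or not key:
        raise ValueError(f"Not a valid S3 object URI: {uri!r}")
    return bucket, key


print(split_s3_uri("s3://your-bucket/path/to/spreadsheet.xlsx"))
# ('your-bucket', 'path/to/spreadsheet.xlsx')
```

The resulting bucket and key could also be passed to the boto3 S3 client's `head_object(Bucket=..., Key=...)` to confirm the object exists before starting an analysis.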

## Streaming Responses

Use `run_stream()` to receive the response incrementally as it is generated.

In [None]:
# Streaming analysis of Office documents
try:
    streaming_analysis = DocAnalysis(
        file_path=excel_file_path,
        boto3_session=session
    )
    
    print("Streaming Excel Analysis:")
    for chunk in streaming_analysis.run_stream(
        message="Provide a detailed analysis of the data trends in this spreadsheet"
    ):
        # Streamed chunks do not share the structure of run() responses, so handle
        # objects with a .content attribute, dicts with a 'content' key, and plain strings
        if hasattr(chunk, 'content'):
            print(chunk.content, end='', flush=True)
        elif isinstance(chunk, dict) and 'content' in chunk:
            print(chunk['content'], end='', flush=True)
        else:
            print(chunk, end='', flush=True)
    
    print("\n\nStreaming complete.")
    
except FileNotFoundError:
    print("File not found for streaming analysis.")
except Exception as e:
    print(f"Streaming error: {e}")
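
The chunk handling in the cell above can be factored into a helper that also keeps the full text for later use. This is a sketch using the same three cases as the cell. `chunk_text` and `collect_stream` are hypothetical names, and the chunks below are stand-ins rather than real `run_stream()` output.

```python
from types import SimpleNamespace


def chunk_text(chunk):
    """Return the text carried by a streamed chunk, whatever its shape."""
    if hasattr(chunk, "content"):
        return str(chunk.content)
    if isinstance(chunk, dict) and "content" in chunk:
        return str(chunk["content"])
    return str(chunk)


def collect_stream(chunks):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    for chunk in chunks:
        text = chunk_text(chunk)
        print(text, end="", flush=True)
        parts.append(text)
    return "".join(parts)


# Stand-in chunks for illustration; real chunks would come from run_stream().
fake_chunks = ["Revenue ", {"content": "rose "}, SimpleNamespace(content="in Q3.")]
full_text = collect_stream(fake_chunks)
```

In the notebook, `collect_stream(streaming_analysis.run_stream(message=...))` would print incrementally and return the complete answer.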

## Multi-Format Document Analysis

Run the same prompt against documents in different Office formats and collect the results for comparison.

In [None]:
# Analyze multiple Office document types
documents = {
    "Excel Report": "/path/to/financial_report.xlsx",
    "PowerPoint Summary": "/path/to/executive_summary.pptx",
    "Word Document": "/path/to/detailed_analysis.docx"
}
}

results = {}

for doc_type, file_path in documents.items():
    try:
        analysis = DocAnalysis(
            file_path=file_path,
            boto3_session=session
        )
        
        response = analysis.run(
            message="What are the key insights from this document?"
        )
        
        results[doc_type] = response["output"]
        print(f"\n{doc_type} Analysis:")
        print("=" * 50)
        print(response["output"])
        
    except FileNotFoundError:
        print(f"{doc_type} file not found: {file_path}")
    except Exception as e:
        print(f"Error processing {doc_type}: {e}")

print("\nMulti-format analysis complete.")
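
Once `results` is populated, placing the outputs side by side makes them easier to compare. The sketch below renders such a dict as a small Markdown report. `results_to_markdown` is a hypothetical helper, the sample entries are invented, and the outputs are assumed to be convertible with `str()`.

```python
def results_to_markdown(results):
    """Render a {document label: analysis output} dict as Markdown sections."""
    lines = []
    for doc_type, output in results.items():
        lines.append(f"### {doc_type}")
        lines.append("")
        lines.append(str(output).strip())
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Invented outputs for illustration only.
sample_results = {
    "Excel Report": "Revenue figures by quarter.",
    "Word Document": "Narrative explanation of the figures.",
}
print(results_to_markdown(sample_results))
```

The returned string could be written to a file or displayed in the notebook with `IPython.display.Markdown`.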

## Best Practices for Office Document Processing

1. **File Size Management**: For very large Excel files, use the `pages` parameter to process only the worksheets you need
2. **Memory Efficiency**: Rhubarb automatically uses read-only mode for Excel files to optimize memory usage
3. **S3 Integration**: Store large Office files in S3 for better performance and scalability
4. **Error Handling**: Always implement proper error handling for file format and processing issues
5. **Structured Extraction**: Use JSON schemas for consistent data extraction from Office documents
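
The error-handling advice above can be applied without repeating the same `try`/`except` in every cell. The following is a minimal sketch of one way to do that. `run_safely` is a hypothetical wrapper, and the broad `except Exception` mirrors the cells in this notebook rather than recommending it for production code.

```python
def run_safely(label, action):
    """Call action() and return (ok, value_or_message) instead of raising."""
    try:
        return True, action()
    except FileNotFoundError as exc:
        return False, f"{label}: file not found ({exc})"
    except Exception as exc:  # broad on purpose, mirroring the cells above
        return False, f"{label}: {type(exc).__name__}: {exc}"


ok, value = run_safely("demo", lambda: 1 / 0)
print(ok, value)  # False demo: ZeroDivisionError: division by zero
```

In the notebook, `action` would be a function or lambda that builds a `DocAnalysis` and calls `run()`, so that every document is processed with the same failure reporting.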

## Supported Features

- ✅ Excel (.xlsx, .xls) with automatic chunking for large files
- ✅ PowerPoint (.pptx) with slide-by-slide processing
- ✅ Word (.docx) with paragraph-based processing
- ✅ S3 integration for all Office formats
- ✅ Page/sheet/slide selection
- ✅ Streaming responses
- ✅ Structured data extraction
- ✅ Large document processing with sliding window
- ✅ Visual rendering at 150 DPI for optimal quality