# PPTX Image Extraction Demo

This notebook demonstrates the image extraction module, which extracts SmartArt, charts, and pictures from PowerPoint (`.pptx`) presentations.

**Features:**
- Detect slides with visual content (SmartArt, pictures, charts)
- Render slides as PNG images
- Upload to S3 with structured paths
- Store mappings in OpenSearch for retrieval

## 1. Configuration

**Edit these values for your environment:**

In [None]:
# =============================================================================
# CONFIGURATION - Edit these values
# =============================================================================

# S3 bucket for storing extracted images
S3_BUCKET = "your-bucket-name"  # <-- Edit this

# S3 prefix (folder path) for images
S3_PREFIX = "images/pptx/"  # <-- Edit if needed

# Path to your PowerPoint file
PPTX_PATH = "./sample_presentation.pptx"  # <-- Edit this

# OpenSearch endpoint (optional - for storing mappings)
OPENSEARCH_ENDPOINT = None  # <-- Set if you want to store mappings

# =============================================================================
print(f"S3 Bucket: {S3_BUCKET}")
print(f"S3 Prefix: {S3_PREFIX}")
print(f"PPTX Path: {PPTX_PATH}")

## 2. Install Dependencies (if needed)

In [None]:
# Uncomment if running on SageMaker or in a fresh environment
# !pip install python-pptx Pillow structlog

## 3. Import Modules

In [None]:
import sys
from pathlib import Path

# Add src to path if running from notebooks directory
src_path = Path("../src").resolve()
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Import the image extraction module
from rag_assist.ingestion.images import (
    PPTXImageDetector,
    PPTXImageExtractor,
    S3ImageStore,
    ImageMapper,
    extract_images_from_pptx,
    ImageInfo,
    SlideImageMapping,
    ExtractionResult,
)

print("Modules imported successfully!")

## 4. Load and Inspect PPTX File

In [None]:
from pptx import Presentation

# Load presentation
pptx_path = Path(PPTX_PATH)
if not pptx_path.exists():
    print(f"ERROR: File not found: {pptx_path}")
    print("Please update PPTX_PATH in Cell 1 to point to a valid .pptx file")
else:
    prs = Presentation(str(pptx_path))
    print(f"File: {pptx_path.name}")
    print(f"Total slides: {len(prs.slides)}")
    print(f"File size: {pptx_path.stat().st_size / 1024:.1f} KB")
    print("\nSlide titles:")
    for i, slide in enumerate(prs.slides, 1):
        title = ""
        if slide.shapes.title and slide.shapes.title.has_text_frame:
            title = slide.shapes.title.text[:50]
        print(f"  Slide {i}: {title or '(no title)'}")

## 5. Detect Slides with Visual Content

The detector identifies slides containing:
- **Pictures**: Embedded images
- **SmartArt**: Diagrams and flowcharts
- **Charts**: Data visualizations
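
For context, the three content types can be told apart in the slide XML itself. The sketch below is not the `PPTXImageDetector` implementation, which this notebook does not show. It is a minimal standard-library illustration that assumes the usual Open XML layout: pictures are `<p:pic>` elements, while charts and SmartArt are graphic frames whose `<a:graphicData>` element carries a chart or diagram URI. The function name `classify_slide_xml` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
CHART_URI = "http://schemas.openxmlformats.org/drawingml/2006/chart"
DIAGRAM_URI = "http://schemas.openxmlformats.org/drawingml/2006/diagram"  # SmartArt


def classify_slide_xml(xml_bytes):
    """Report which kinds of visual content appear in one slide's XML."""
    root = ET.fromstring(xml_bytes)
    # Charts and SmartArt are identified by the uri attribute on graphicData.
    uris = {gd.get("uri") for gd in root.iter(f"{{{A_NS}}}graphicData")}
    return {
        "has_pictures": root.find(f".//{{{P_NS}}}pic") is not None,
        "has_smartart": DIAGRAM_URI in uris,
        "has_charts": CHART_URI in uris,
    }


# Hand-written stand-in for a slide part (illustration only).
SAMPLE_SLIDE = b"""<p:sld
    xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
    xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <p:cSld><p:spTree>
    <p:pic/>
    <p:graphicFrame><a:graphic>
      <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/chart"/>
    </a:graphic></p:graphicFrame>
  </p:spTree></p:cSld>
</p:sld>"""

print(classify_slide_xml(SAMPLE_SLIDE))
```

To try this on a real file, read the `ppt/slides/slideN.xml` members with `zipfile.ZipFile`. Note that the part numbering does not necessarily match the presentation order, which is one reason to prefer the detector for real work.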

In [None]:
# Initialize detector
detector = PPTXImageDetector(
    include_pictures=True,
    include_smartart=True,
    include_charts=True,
)

# Detect slides with images
slide_mappings = detector.detect_image_slides(PPTX_PATH)

print(f"\nFound {len(slide_mappings)} slides with visual content:\n")
print(f"{'Slide':<8} {'Title':<30} {'Pictures':<10} {'SmartArt':<10} {'Charts':<10}")
print("-" * 70)

for mapping in slide_mappings:
    title = mapping.slide_title[:28] + ".." if len(mapping.slide_title) > 30 else mapping.slide_title
    print(
        f"{mapping.slide_number:<8} "
        f"{title:<30} "
        f"{'Yes' if mapping.has_pictures else '-':<10} "
        f"{'Yes' if mapping.has_smartart else '-':<10} "
        f"{'Yes' if mapping.has_charts else '-':<10}"
    )

# Get just the slide numbers
slide_numbers = detector.get_slide_numbers_with_images(PPTX_PATH)
print(f"\nSlide numbers with images: {slide_numbers}")

## 6. Extract and Preview Images

Extract images from the detected slides. The extractor tries three strategies in order:
1. Render the slide with LibreOffice (most accurate for SmartArt).
2. Fall back to extracting the embedded images.
3. Create a placeholder image if extraction fails.
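
The steps above form a fallback chain. The following is a minimal sketch of that pattern, not the extractor's actual code: `render_with_fallback` and the three stand-in strategies are hypothetical names, and the stand-ins only simulate failures so the control flow is visible.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each strategy takes a slide number and returns image bytes, or None on failure.
Strategy = Tuple[str, Callable[[int], Optional[bytes]]]


def render_with_fallback(slide_number: int, strategies: Sequence[Strategy]) -> Tuple[str, bytes]:
    """Return (strategy_name, image_bytes) from the first strategy that succeeds."""
    for name, strategy in strategies:
        try:
            data = strategy(slide_number)
        except Exception:
            # Treat any error as "this strategy is unavailable" and move on.
            continue
        if data:
            return name, data
    raise RuntimeError(f"No strategy produced an image for slide {slide_number}")


# Stand-in strategies for illustration only.
def libreoffice_render(slide_number: int) -> Optional[bytes]:
    raise FileNotFoundError("soffice not installed")  # simulate a missing binary


def embedded_image(slide_number: int) -> Optional[bytes]:
    return None  # simulate a slide with no embedded picture


def placeholder(slide_number: int) -> Optional[bytes]:
    return f"placeholder for slide {slide_number}".encode()


used, image = render_with_fallback(
    3,
    [
        ("libreoffice", libreoffice_render),
        ("embedded", embedded_image),
        ("placeholder", placeholder),
    ],
)
print(used)
```

Returning the strategy name alongside the bytes makes it easy to log which path produced each image, which helps when diagnosing why a slide ended up as a placeholder.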

In [None]:
from IPython.display import display, Image as IPImage
from io import BytesIO

# Initialize extractor
extractor = PPTXImageExtractor(
    output_format="png",
    use_libreoffice=True,  # Set to False if LibreOffice not installed
    fallback_to_placeholder=True,
)

# Extract images from detected slides
extracted_images = extractor.extract_slides(
    PPTX_PATH,
    slide_numbers=slide_numbers,
)

print(f"Extracted {len(extracted_images)} images\n")

# Preview extracted images
for slide_num, img_bytes, info in extracted_images:
    print(f"\n--- Slide {slide_num} ---")
    print(f"Image ID: {info.image_id}")
    print(f"Type: {info.image_type}")
    print(f"Size: {info.width_px}x{info.height_px} px ({info.size_bytes / 1024:.1f} KB)")
    
    # Display image (scaled down for notebook)
    display(IPImage(data=img_bytes, width=400))

## 7. Upload Images to S3

Upload extracted images to your S3 bucket.
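
The exact key layout that `S3ImageStore` uses is not shown in this notebook. As a rough illustration of what a structured path could look like, the sketch below builds a key from the prefix, a document ID, and a slide number. The helper `build_image_key`, the `slide_NNN` naming, and the example document ID are all assumptions.

```python
import posixpath


def build_image_key(prefix, document_id, slide_number, extension="png"):
    """Build an S3 object key such as images/pptx/<document_id>/slide_007.png.

    Hypothetical layout for illustration; S3ImageStore may use a different one.
    """
    filename = f"slide_{slide_number:03d}.{extension}"
    # Strip slashes so "images/pptx/" and "images/pptx" give the same key.
    return posixpath.join(prefix.strip("/"), document_id, filename)


key = build_image_key("images/pptx/", "doc-1234", 7)
print(key)
```

Zero-padding the slide number keeps keys in slide order when they are listed lexicographically.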

In [None]:
# Initialize S3 store
store = S3ImageStore(
    bucket=S3_BUCKET,
    prefix=S3_PREFIX,
)

# Upload images
print("Uploading images to S3...\n")

s3_uris = []
for slide_num, img_bytes, info in extracted_images:
    try:
        uri = store.upload_image(img_bytes, info)
        s3_uris.append(uri)
        print(f"Slide {slide_num}: {uri}")
    except Exception as e:
        print(f"Slide {slide_num}: FAILED - {e}")

print(f"\nUploaded {len(s3_uris)} images to S3")

## 8. Generate Presigned URLs (for viewing)
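
A presigned URL is signed against a bucket and key rather than an `s3://` URI, so the URI has to be split first. The sketch below shows one way to do that with plain boto3. It is an assumption about the general approach, not a description of how `S3ImageStore.get_presigned_url` is implemented; `split_s3_uri` and `presign` are hypothetical helpers, and `presign` needs boto3 and valid AWS credentials to run.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into (bucket, key)."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI must contain both a bucket and a key: {uri!r}")
    return bucket, key


def presign(uri, expires_in=3600):
    """Return a time-limited GET URL for the object (requires boto3 and credentials)."""
    import boto3  # imported lazily so split_s3_uri works without boto3 installed

    bucket, key = split_s3_uri(uri)
    return boto3.client("s3").generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # seconds
    )


print(split_s3_uri("s3://your-bucket-name/images/pptx/doc-1234/slide_007.png"))
```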

In [None]:
print("Presigned URLs (valid for 1 hour):\n")

for uri in s3_uris:
    try:
        presigned_url = store.get_presigned_url(uri, expiration=3600)
        print(f"{uri}")
        print(f"  -> {presigned_url[:80]}...\n")
    except Exception as e:
        print(f"{uri}: Error generating URL - {e}\n")

## 9. Store Mappings (Optional - requires OpenSearch)

Store image-to-slide mappings in OpenSearch for integration with your RAG system.
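
The index schema that `ImageMapper` writes is not shown here. To give a sense of what a stored mapping might contain, the sketch below assembles a JSON-ready document from the attributes this notebook already reads (`slide_number`, `slide_title`, `has_pictures`, `has_smartart`, `has_charts`, and each image's `image_id`, `image_type`, and `s3_uri`). The helper `mapping_to_document`, the document shape, and the example values are hypothetical.

```python
import json
from types import SimpleNamespace


def mapping_to_document(mapping):
    """Flatten a slide mapping into a plain dict that can be serialized as JSON."""
    return {
        "slide_number": mapping.slide_number,
        "slide_title": mapping.slide_title,
        "has_pictures": mapping.has_pictures,
        "has_smartart": mapping.has_smartart,
        "has_charts": mapping.has_charts,
        "images": [
            {"image_id": img.image_id, "image_type": img.image_type, "s3_uri": img.s3_uri}
            for img in mapping.images
        ],
    }


# Stand-in objects with made-up values, used instead of real SlideImageMapping/ImageInfo.
example_mapping = SimpleNamespace(
    slide_number=7,
    slide_title="Architecture",
    has_pictures=False,
    has_smartart=True,
    has_charts=False,
    images=[
        SimpleNamespace(
            image_id="img-001",
            image_type="smartart",
            s3_uri="s3://your-bucket-name/images/pptx/doc-1234/slide_007.png",
        )
    ],
)

document = mapping_to_document(example_mapping)
print(json.dumps(document, indent=2))
```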

In [None]:
# Skip this cell if you don't have OpenSearch configured

if OPENSEARCH_ENDPOINT:
    # Initialize mapper with your OpenSearch client
    # Replace with your actual OpenSearch client initialization
    from rag_assist.vectorstore.opensearch_client import OpenSearchClient
    
    os_client = OpenSearchClient(endpoint=OPENSEARCH_ENDPOINT)
    mapper = ImageMapper(opensearch_client=os_client)
    
    # Update slide mappings with extracted image info
    slide_to_info = {info.slide_number: info for _, _, info in extracted_images}
    for mapping in slide_mappings:
        if mapping.slide_number in slide_to_info:
            mapping.images = [slide_to_info[mapping.slide_number]]
    
    # Store mappings
    count = mapper.store_mappings(slide_mappings)
    print(f"Stored {count} mappings in OpenSearch")
else:
    print("OpenSearch not configured. Skipping mapping storage.")
    print("Set OPENSEARCH_ENDPOINT in Cell 1 to enable this feature.")

## 10. Demo: Retrieve Images for a Slide

This cell shows how to look up the image for a given slide number, simulating what a RAG retrieval step would do.
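
When OpenSearch is not available, the same lookup can be approximated with an in-memory index keyed by document ID and slide number. This is a small sketch under that assumption; `build_slide_index` and the sample records are hypothetical and stand in for data you would derive from `extracted_images`.

```python
from collections import defaultdict


def build_slide_index(records):
    """Group S3 URIs by (document_id, slide_number) for constant-time lookup."""
    index = defaultdict(list)
    for document_id, slide_number, s3_uri in records:
        index[(document_id, slide_number)].append(s3_uri)
    return dict(index)


# Made-up records: (document_id, slide_number, s3_uri)
records = [
    ("doc-1234", 3, "s3://your-bucket-name/images/pptx/doc-1234/slide_003.png"),
    ("doc-1234", 7, "s3://your-bucket-name/images/pptx/doc-1234/slide_007.png"),
]

slide_index = build_slide_index(records)
print(slide_index.get(("doc-1234", 7), []))   # URIs for a slide that has an image
print(slide_index.get(("doc-1234", 99), []))  # empty list for a slide without one
```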

In [None]:
# Get document ID from extracted images
if extracted_images:
    doc_id = extracted_images[0][2].document_id
    print(f"Document ID: {doc_id}\n")
    
    # Simulate retrieval: given a slide number, find the image
    test_slide_num = slide_numbers[0] if slide_numbers else 1
    print(f"Looking up images for slide {test_slide_num}...\n")
    
    # In production, this would query OpenSearch
    # For demo, we use our local data
    for slide_num, img_bytes, info in extracted_images:
        if slide_num == test_slide_num:
            print(f"Found image: {info.s3_uri}")
            print(f"Type: {info.image_type}")
            print("\nDisplaying image:")
            display(IPImage(data=img_bytes, width=500))
            break
else:
    print("No images extracted. Run cells 5-6 first.")

## 11. Full Pipeline (One Function Call)

Run the complete extraction pipeline with a single function call.

In [None]:
# Run full pipeline
result = extract_images_from_pptx(
    pptx_path=PPTX_PATH,
    s3_bucket=S3_BUCKET,
    s3_prefix=S3_PREFIX,
    opensearch_client=None,  # Set to your client if using OpenSearch
    store_mappings=False,    # Set to True if using OpenSearch
)

print("=" * 50)
print("EXTRACTION RESULT")
print("=" * 50)
print(f"Document ID: {result.document_id}")
print(f"Filename: {result.filename}")
print(f"Total slides: {result.total_slides}")
print(f"Slides with images: {result.slides_with_images}")
print(f"Images extracted: {len(result.images)}")
print(f"Errors: {len(result.errors)}")

if result.errors:
    print("\nErrors encountered:")
    for error in result.errors:
        print(f"  - {error}")

print("\nExtracted images:")
for img in result.images:
    print(f"  Slide {img.slide_number}: {img.s3_uri}")

## 12. Cleanup (Optional)

Delete test images from S3.

In [None]:
# Uncomment to delete uploaded images
# WARNING: This will delete all images for this document from S3!

# if extracted_images:
#     doc_id = extracted_images[0][2].document_id
#     deleted = store.delete_document_images(doc_id)
#     print(f"Deleted {deleted} images from S3")

---

## Summary

This notebook demonstrated:

1. **Detection**: `PPTXImageDetector` identifies slides with SmartArt, pictures, and charts
2. **Extraction**: `PPTXImageExtractor` renders slides as PNG images
3. **Storage**: `S3ImageStore` uploads images to S3 with structured paths
4. **Mapping**: `ImageMapper` stores slide-to-image mappings in OpenSearch
5. **Pipeline**: `extract_images_from_pptx()` runs the full workflow

### Integration with RAG

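To integrate with your existing text-based RAG pipeline, look up images for any retrieved chunk that carries a slide number:

```python
# `retrieve_chunk` and `query` stand in for your existing retrieval code;
# `mapper` and `store` are the ImageMapper and S3ImageStore created earlier.
chunk = retrieve_chunk(query)
presigned_urls = []  # stays empty when the chunk has no slide or no images
if chunk.metadata.slide_number:
    images = mapper.get_s3_uris_for_slide(
        document_id=chunk.metadata.document_id,
        slide_number=chunk.metadata.slide_number,
    )
    if images:
        # Include these URLs in the response alongside the chunk text
        presigned_urls = [store.get_presigned_url(uri) for uri in images]
```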