# PPTX Image Extraction Demo (v2 - PDF Approach)

This notebook demonstrates extracting images from PowerPoint presentations using a PDF intermediate step.

**Workflow:**
1. **Detect** slides that contain pictures, SmartArt, or charts using python-pptx
2. **Convert** the PPTX to PDF using LibreOffice (python-pptx cannot render slides; LibreOffice renders SmartArt and charts)
3. **Extract** only the detected PDF pages as PNG using PyMuPDF
4. **Upload** the PNGs to S3 and record which slide each image came from

## 1. Configuration

**Edit these values for your environment:**

In [None]:
# =============================================================================
# CONFIGURATION - Edit these values
# =============================================================================

# S3 bucket for storing extracted images
S3_BUCKET = "your-bucket-name"  # <-- Edit this

# S3 prefix (folder path) for images
S3_PREFIX = "images/pptx/"  # <-- Edit if needed

# Path to your PowerPoint file
PPTX_PATH = "./sample_presentation.pptx"  # <-- Edit this

# Image quality (DPI - higher = better quality, larger files)
DPI = 150  # Recommended: 150 for web, 300 for print

# =============================================================================
print(f"S3 Bucket: {S3_BUCKET}")
print(f"S3 Prefix: {S3_PREFIX}")
print(f"PPTX Path: {PPTX_PATH}")
print(f"DPI: {DPI}")
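
Before running the later cells, it can help to sanity-check these values. The sketch below is a hypothetical helper (`check_config` is not part of `rag_assist`), and the DPI range and trailing-slash check are assumptions chosen for illustration, not requirements of the library.

```python
from pathlib import Path


def check_config(bucket, prefix, pptx_path, dpi):
    """Return a list of human-readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    if not bucket or bucket == "your-bucket-name":
        problems.append("S3_BUCKET is still the placeholder value")
    # Assumption: the prefix is treated as a folder path, so it should end with "/"
    if prefix and not prefix.endswith("/"):
        problems.append("S3_PREFIX does not end with '/'")
    if Path(pptx_path).suffix.lower() != ".pptx":
        problems.append("PPTX_PATH does not point to a .pptx file")
    # Assumption: 50-600 DPI is a sensible working range for slide renders
    if not 50 <= dpi <= 600:
        problems.append("DPI is outside the assumed 50-600 range")
    return problems


for problem in check_config("your-bucket-name", "images/pptx/", "./sample_presentation.pptx", 150):
    print(f"WARNING: {problem}")
```

In the notebook you would call it as `check_config(S3_BUCKET, S3_PREFIX, PPTX_PATH, DPI)`.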

## 2. Install Dependencies

In [None]:
# Uncomment and run if dependencies are not installed
# %pip installs into the environment of the running kernel
# %pip install python-pptx PyMuPDF Pillow

# For SageMaker/Amazon Linux - install LibreOffice if not available
# !sudo yum install -y libreoffice-headless
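
To see whether LibreOffice is already available before installing anything, you can look for its executable on `PATH`. This is a minimal sketch using only the standard library; the executable names `soffice` and `libreoffice` are the common ones, but which name `PPTXtoPDFConverter` actually looks for is an assumption.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first candidate executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


print(find_libreoffice() or "LibreOffice not found on PATH")
```

A `None` result only means the binary is not on `PATH`; some installations place it elsewhere.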

## 3. Import Modules

In [None]:
import sys
from pathlib import Path

# Add src to path if running from notebooks directory
src_path = Path("../src").resolve()
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Import the image extraction module
from rag_assist.ingestion.images import (
    PPTXImageDetector,
    PPTXtoPDFConverter,
    PDFPageExtractor,
    S3ImageStore,
    extract_pptx_images,
    ImageInfo,
    SlideImageMapping,
    ExtractionResult,
)

print("Modules imported successfully!")

## 4. Load and Inspect PPTX File

In [None]:
from pptx import Presentation

# Load presentation
pptx_path = Path(PPTX_PATH)
if not pptx_path.exists():
    print(f"ERROR: File not found: {pptx_path}")
    print("Please update PPTX_PATH in the Configuration cell (section 1) to point to a valid .pptx file")
else:
    prs = Presentation(str(pptx_path))
    print(f"File: {pptx_path.name}")
    print(f"Total slides: {len(prs.slides)}")
    print(f"File size: {pptx_path.stat().st_size / 1024:.1f} KB")
    print("\nSlide titles:")
    for i, slide in enumerate(prs.slides, 1):
        title = ""
        if slide.shapes.title and slide.shapes.title.has_text_frame:
            title = slide.shapes.title.text[:50]
        print(f"  Slide {i}: {title or '(no title)'}")

## 5. Step 1: Detect Slides with Visual Content

The detector identifies slides containing:
- **Pictures**: Embedded images
- **SmartArt**: Diagrams and flowcharts
- **Charts**: Data visualizations

In [None]:
# Initialize detector
detector = PPTXImageDetector(
    include_pictures=True,
    include_smartart=True,
    include_charts=True,
)

# Detect slides with images
slide_mappings = detector.detect_image_slides(PPTX_PATH)
slide_numbers = [m.slide_number for m in slide_mappings]

print(f"\nFound {len(slide_mappings)} slides with visual content:\n")
print(f"{'Slide':<8} {'Title':<30} {'Pictures':<10} {'SmartArt':<10} {'Charts':<10}")
print("-" * 70)

for mapping in slide_mappings:
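    # Truncate long titles so the table columns stay aligned; tolerate a missing title
    slide_title = mapping.slide_title or "(no title)"
    title = (slide_title[:28] + "..") if len(slide_title) > 30 else slide_title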
    print(
        f"{mapping.slide_number:<8} "
        f"{title:<30} "
        f"{'Yes' if mapping.has_pictures else '-':<10} "
        f"{'Yes' if mapping.has_smartart else '-':<10} "
        f"{'Yes' if mapping.has_charts else '-':<10}"
    )

print(f"\nSlide numbers to extract: {slide_numbers}")
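
To make the three content types concrete, here is a standard-library sketch that scans the slide XML inside a `.pptx` package directly. It is not how `PPTXImageDetector` is implemented (that is not shown in this notebook); it only illustrates what distinguishes the types in the file format: pictures are `p:pic` elements, while SmartArt and charts are graphic frames whose `a:graphicData` carries a diagram or chart URI. One caveat: the number in `slideN.xml` is a part name and is not guaranteed to match the presentation order.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
CHART_URI = "http://schemas.openxmlformats.org/drawingml/2006/chart"
DIAGRAM_URI = "http://schemas.openxmlformats.org/drawingml/2006/diagram"  # SmartArt


def scan_slide_xml(xml_bytes):
    """Classify the visual content found in one slide's XML."""
    root = ET.fromstring(xml_bytes)
    uris = {el.get("uri") for el in root.iter(f"{{{A_NS}}}graphicData")}
    return {
        "pictures": root.find(f".//{{{P_NS}}}pic") is not None,
        "smartart": DIAGRAM_URI in uris,
        "charts": CHART_URI in uris,
    }


def scan_pptx(path):
    """Map each slide part number to its scan result."""
    results = {}
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            match = re.fullmatch(r"ppt/slides/slide(\d+)\.xml", name)
            if match:
                results[int(match.group(1))] = scan_slide_xml(zf.read(name))
    return dict(sorted(results.items()))
```

Calling `scan_pptx(PPTX_PATH)` gives a rough cross-check against the table printed above.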

## 6. Step 2: Convert PPTX to PDF

LibreOffice converts PPTX to PDF with proper SmartArt and chart rendering.

In [None]:
# Initialize converter
converter = PPTXtoPDFConverter(timeout=120)

# Convert PPTX to PDF
print("Converting PPTX to PDF...")
try:
    pdf_path = converter.convert(PPTX_PATH)
    print(f"PDF created: {pdf_path}")
    print(f"PDF size: {pdf_path.stat().st_size / 1024:.1f} KB")
except RuntimeError as e:
    print(f"ERROR: {e}")
    print("\nMake sure LibreOffice is installed:")
    print("  macOS: brew install libreoffice")
    print("  Debian/Ubuntu: sudo apt install libreoffice")
    print("  Amazon Linux: sudo yum install libreoffice-headless")
    pdf_path = None
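
For reference, a headless LibreOffice conversion can be driven from Python roughly as follows. This is a sketch of the general technique, not the code behind `PPTXtoPDFConverter`; the function names are hypothetical, and it assumes the executable is called `soffice` and is on `PATH`.

```python
import subprocess
from pathlib import Path


def build_convert_command(pptx_path, out_dir, binary="soffice"):
    """Build the LibreOffice command line for a headless PPTX-to-PDF conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(pptx_path),
    ]


def convert_to_pdf(pptx_path, out_dir, timeout=120):
    """Run the conversion and return the path of the PDF that was written."""
    pptx_path, out_dir = Path(pptx_path), Path(out_dir)
    # Raises FileNotFoundError, CalledProcessError, or TimeoutExpired on failure
    subprocess.run(
        build_convert_command(pptx_path, out_dir),
        check=True,
        timeout=timeout,
        capture_output=True,
    )
    pdf_path = out_dir / f"{pptx_path.stem}.pdf"
    # Check for the output file explicitly rather than trusting the exit code alone
    if not pdf_path.exists():
        raise RuntimeError(f"LibreOffice did not produce {pdf_path}")
    return pdf_path
```

The output file takes the input's base name with a `.pdf` extension, which is why the sketch derives `pdf_path` from `pptx_path.stem`.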

## 7. Step 3: Extract Specific PDF Pages as PNG

PyMuPDF renders the detected pages as high-quality PNG images.

In [None]:
from IPython.display import display, Image as IPImage

if pdf_path and slide_numbers:
    # Initialize extractor
    extractor = PDFPageExtractor(dpi=DPI)
    
    # Get document ID for metadata
    document_id = slide_mappings[0].document_id if slide_mappings else "unknown"
    
    # Extract pages
    print(f"Extracting {len(slide_numbers)} pages at {DPI} DPI...\n")
    extracted_images = extractor.extract_pages(
        pdf_path,
        slide_numbers,
        document_id,
        pptx_path.name,
    )
    
    print(f"Extracted {len(extracted_images)} images\n")
    
    # Preview extracted images
    for page_num, img_bytes, info in extracted_images:
        print(f"--- Slide {page_num} ---")
        print(f"Size: {info.width_px}x{info.height_px} px ({info.size_bytes / 1024:.1f} KB)")
        display(IPImage(data=img_bytes, width=500))
        print()
else:
    print("No PDF available or no slides to extract.")
    extracted_images = []
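
The relationship between DPI and output size follows from PDF units: one point is 1/72 inch, so rendering at a given DPI scales the page by `dpi / 72`. The sketch below shows that arithmetic and one way to render a page with PyMuPDF. The helper names are hypothetical, `PDFPageExtractor` may work differently, and the slide-to-page mapping assumes the PDF contains one page per slide in presentation order (which may not hold if the converter omits hidden slides).

```python
def pixel_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def page_index_for_slide(slide_number):
    """Slides are numbered from 1; PDF pages are indexed from 0."""
    if slide_number < 1:
        raise ValueError("slide numbers start at 1")
    return slide_number - 1


def render_slide_png(pdf_path, slide_number, dpi=150):
    """Render one PDF page to PNG bytes using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    zoom = dpi / 72
    with fitz.open(str(pdf_path)) as doc:
        page = doc[page_index_for_slide(slide_number)]
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return pix.tobytes("png")


# A default 16:9 slide is 13.333 x 7.5 inches, i.e. 960 x 540 points
print(pixel_size(960, 540, 150))
```

Doubling the DPI doubles both dimensions, so the pixel count (and roughly the file size) grows about fourfold.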

## 8. Step 4: Upload Images to S3

In [None]:
if extracted_images:
    # Initialize S3 store
    store = S3ImageStore(
        bucket=S3_BUCKET,
        prefix=S3_PREFIX,
    )
    
    # Upload images
    print("Uploading images to S3...\n")
    
    s3_uris = []
    for page_num, img_bytes, info in extracted_images:
        try:
            uri = store.upload_image(img_bytes, info)
            s3_uris.append(uri)
            print(f"Slide {page_num}: {uri}")
        except Exception as e:
            print(f"Slide {page_num}: FAILED - {e}")
    
    print(f"\nUploaded {len(s3_uris)} images to S3")
else:
    print("No images to upload.")
    s3_uris = []
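
One way to keep the slide number recoverable from S3 is to encode it in the object key. The layout below (`<prefix>/<document_id>/slide_NNN.png`) is purely an assumption for illustration; the key scheme that `S3ImageStore` actually uses is not shown in this notebook, and `build_image_key` is a hypothetical helper.

```python
def build_image_key(prefix, document_id, slide_number, ext="png"):
    """Build an S3 object key that encodes the document and slide number."""
    filename = f"slide_{slide_number:03d}.{ext}"
    # Drop empty parts so an empty prefix does not produce a leading slash
    parts = [part for part in (prefix.strip("/"), document_id, filename) if part]
    return "/".join(parts)


key = build_image_key("images/pptx/", "doc123", 7)
print(f"s3://your-bucket-name/{key}")

# With boto3, the upload itself could then look like:
#   boto3.client("s3").put_object(
#       Bucket="your-bucket-name", Key=key, Body=img_bytes, ContentType="image/png"
#   )
```

Zero-padding the slide number keeps keys in slide order when listed alphabetically.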

## 9. Generate Presigned URLs (for viewing)

In [None]:
if s3_uris:
    print("Presigned URLs (valid for 1 hour):\n")
    
    for uri in s3_uris:
        try:
            presigned_url = store.get_presigned_url(uri, expiration=3600)
            print(f"{uri}")
            print(f"  -> {presigned_url[:100]}...\n")
        except Exception as e:
            print(f"{uri}: Error - {e}\n")
else:
    print("No S3 URIs available.")
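
A presigned URL is generated from a bucket and key, so an `s3://` URI has to be split first. The sketch below shows that step and the corresponding boto3 call. Both function names are hypothetical, and whether `S3ImageStore.get_presigned_url` does exactly this is an assumption.

```python
def parse_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must include a bucket and a key: {uri!r}")
    return bucket, key


def presign(uri, expiration=3600):
    """Return a time-limited GET URL for the object (requires boto3 and AWS credentials)."""
    import boto3  # imported here so parse_s3_uri works without boto3 installed

    bucket, key = parse_s3_uri(uri)
    return boto3.client("s3").generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expiration,
    )


print(parse_s3_uri("s3://your-bucket-name/images/pptx/doc123/slide_007.png"))
```

Anyone who holds a presigned URL can fetch the object until it expires, so short expirations are the safer default.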

## 10. Full Pipeline (Single Function Call)

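Run the complete pipeline (detect, convert, extract, upload) with a single function call. If you already ran sections 5-8, this repeats the conversion and upload, so the same slides may be written to S3 a second time.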

In [None]:
# Run full pipeline
result = extract_pptx_images(
    pptx_path=PPTX_PATH,
    s3_bucket=S3_BUCKET,
    s3_prefix=S3_PREFIX,
    dpi=DPI,
    cleanup_pdf=True,  # Delete intermediate PDF after extraction
)

print("=" * 50)
print("EXTRACTION RESULT")
print("=" * 50)
print(f"Document ID: {result.document_id}")
print(f"Filename: {result.filename}")
print(f"Total slides: {result.total_slides}")
print(f"Slides with images: {result.slides_with_images}")
print(f"Images extracted: {len(result.images)}")
print(f"Errors: {len(result.errors)}")

if result.errors:
    print("\nErrors encountered:")
    for error in result.errors:
        print(f"  - {error}")

print("\nExtracted images:")
for img in result.images:
    print(f"  Slide {img.slide_number}: {img.s3_uri}")

## 11. Demo: Retrieve Image for a Slide

Given a slide number, retrieve the associated image S3 URI.

In [None]:
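if result.images:
    # Create the store here so this cell also works when the step-by-step
    # upload in section 8 was skipped (store would otherwise be undefined)
    store = S3ImageStore(bucket=S3_BUCKET, prefix=S3_PREFIX)

    # Pick a slide number to look up
    test_slide = result.images[0].slide_number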
    
    print(f"Looking up image for slide {test_slide}...\n")
    
    # Find the image for this slide
    for img in result.images:
        if img.slide_number == test_slide:
            print("Found image:")
            print(f"  Image ID: {img.image_id}")
            print(f"  S3 URI: {img.s3_uri}")
            print(f"  Size: {img.width_px}x{img.height_px} px")
            
            # Generate presigned URL
            if img.s3_uri:
                url = store.get_presigned_url(img.s3_uri)
                print(f"\n  Presigned URL: {url[:80]}...")
            break
else:
    print("No images extracted. Run cells above first.")
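
If you look up many slides, scanning the list each time is unnecessary; a dictionary keyed by slide number gives direct access. This sketch uses stand-in objects with made-up values so it runs on its own, and `index_by_slide` is a hypothetical helper. It assumes at most one image per slide, which matches rendering one PDF page per slide.

```python
from types import SimpleNamespace


def index_by_slide(images):
    """Map slide_number -> image record; a later duplicate would overwrite an earlier one."""
    return {img.slide_number: img for img in images}


# Stand-in records; in the notebook, pass result.images instead
images = [
    SimpleNamespace(slide_number=2, s3_uri="s3://example-bucket/images/pptx/doc/slide_002.png"),
    SimpleNamespace(slide_number=5, s3_uri="s3://example-bucket/images/pptx/doc/slide_005.png"),
]
index = index_by_slide(images)

hit = index.get(2)
miss = index.get(9)  # slides without visual content have no entry
print(hit.s3_uri if hit else "no image")
print(miss.s3_uri if miss else "no image")
```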

## 12. Cleanup (Optional)

Delete test images from S3.

In [None]:
# Uncomment to delete uploaded images
# WARNING: This will delete all images for this document from S3!

# if result.images:
#     doc_id = result.document_id
#     deleted = store.delete_document_images(doc_id)
#     print(f"Deleted {deleted} images from S3")

# Clean up local PDF if it still exists
# if pdf_path and pdf_path.exists():
#     pdf_path.unlink()
#     print(f"Deleted local PDF: {pdf_path}")

---

## Summary

This notebook demonstrated the PDF-based approach for extracting images from PPTX:

| Step | Class | Description |
|------|-------|-------------|
| 1 | `PPTXImageDetector` | Detect slides with SmartArt/images/charts |
| 2 | `PPTXtoPDFConverter` | Convert PPTX to PDF (LibreOffice) |
| 3 | `PDFPageExtractor` | Render PDF pages as PNG (PyMuPDF) |
| 4 | `S3ImageStore` | Upload to S3, generate presigned URLs |

**Full pipeline:** `extract_pptx_images()` runs all steps in one call.

### Integration with RAG

```python
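# During retrieval, if a chunk carries a slide number, attach that slide's image.
# `retrieve_chunk` and `query` are placeholders for your own retrieval code.
chunk = retrieve_chunk(query)
slide_number = chunk.metadata.slide_number
if slide_number:
    # Find the image for this slide (None if the slide had no visual content)
    img = next((i for i in result.images if i.slide_number == slide_number), None)
    if img and img.s3_uri:
        presigned_url = store.get_presigned_url(img.s3_uri)
        # Include presigned_url in the response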
```