# PPT Content Extraction Demo - 2-Step Framework

This notebook demonstrates the 2-step framework for extracting and embedding content from PowerPoint presentations.

- **Step 1 (PPT Path):** Parse PPTX directly → Extract text → Generate embeddings
- **Step 2 (PDF Path):** Parse PDF version → Extract text + images → Generate embeddings

**Key Features:**
- Text extraction from **both** PPT and PDF for redundancy
- Duplicate text between PPT and PDF is detected and skipped
- Images from PDF are linked to text by slide_number/page_number
- If image extraction fails, text is still retrievable
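
The linking rule in the list above can be sketched in a few lines. This is only an illustration, not the library's implementation: `TextRecord`, `ImageRecord`, and `link_images` are hypothetical names, the S3 URI is a placeholder, and the sketch assumes the PDF was exported with exactly one page per slide so that `page_number` equals `slide_number`.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:  # hypothetical stand-in for the text extracted from one slide
    slide_number: int
    text: str


@dataclass
class ImageRecord:  # hypothetical stand-in for one uploaded page image
    page_number: int
    s3_uri: str


def link_images(texts, images):
    """Pair each text record with the image whose page_number matches its slide_number."""
    by_page = {img.page_number: img for img in images}
    linked = []
    for record in texts:
        img = by_page.get(record.slide_number)
        # A missing image does not drop the text; the link is simply None.
        linked.append((record, img.s3_uri if img else None))
    return linked


texts = [TextRecord(1, "Intro"), TextRecord(2, "Architecture")]
# Placeholder URI; page 2 has no image, simulating a failed image extraction.
images = [ImageRecord(1, "s3://example-bucket/page-1.png")]

linked = link_images(texts, images)
for record, uri in linked:
    print(record.slide_number, record.text, uri)
```

Slide 2 still appears in the output with `None` as its image, which is the behaviour the last bullet describes: text remains retrievable even when its image is missing.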

## 1. Configuration

**Edit these values for your environment:**

In [None]:
# =============================================================================
# CONFIGURATION - Edit these values
# =============================================================================

# S3 bucket for storing extracted images
S3_BUCKET = "your-bucket-name"  # <-- Edit this

# S3 prefix (folder path) for images
S3_PREFIX = "images/ppt/"  # <-- Edit if needed

# Path to your PowerPoint file
PPTX_PATH = "./sample_presentation.pptx"  # <-- Edit this

# Path to your PDF file (manually converted from PPT)
PDF_PATH = "./sample_presentation.pdf"  # <-- Edit this

# Image quality (DPI - higher = better quality, larger files)
DPI = 150  # Recommended: 150 for web, 300 for print

# OpenSearch configuration (if using full pipeline)
OPENSEARCH_HOST = "your-opensearch-endpoint"  # <-- Edit this
OPENSEARCH_INDEX = "ppt-content"

# =============================================================================
print(f"S3 Bucket: {S3_BUCKET}")
print(f"S3 Prefix: {S3_PREFIX}")
print(f"PPTX Path: {PPTX_PATH}")
print(f"PDF Path: {PDF_PATH}")
print(f"DPI: {DPI}")

## 2. Install Dependencies

In [None]:
# Uncomment and run if dependencies are not installed
# !pip install python-pptx PyMuPDF Pillow boto3 opensearch-py

## 3. Import Modules

In [None]:
import sys
from pathlib import Path

# Add src to path if running from notebooks directory
src_path = Path("../src").resolve()
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Import the extraction modules
from rag_assist.ingestion.images import (
    # Data models
    SlideTextContent,
    PPTExtractionResult,
    PDFPageContent,
    PDFExtractionResult,
    # Extractors
    PPTXTextExtractor,
    PDFContentExtractor,
    # Deduplication
    TextDeduplicator,
    # S3 storage
    S3ImageStore,
    # Embedders (optional - requires Bedrock)
    # CohereTextEmbedder,
    # TitanMultimodalEmbedder,
)

print("Modules imported successfully!")

## 4. Step 1: Extract Text from PPTX

Extract text content from all slides using python-pptx.

In [None]:
# Initialize PPT text extractor
ppt_extractor = PPTXTextExtractor(
    include_speaker_notes=True,
    include_tables=True,
    detect_visual_content=True,
)

# Extract text from PPTX
pptx_path = Path(PPTX_PATH)
if not pptx_path.exists():
    print(f"ERROR: File not found: {pptx_path}")
    print("Please update PPTX_PATH in the Configuration cell (section 1) to point to a valid .pptx file")
    ppt_result = None
else:
    ppt_result = ppt_extractor.extract(PPTX_PATH)
    
    print(f"\n{'='*60}")
    print("STEP 1: PPT TEXT EXTRACTION RESULT")
    print(f"{'='*60}")
    print(f"Document ID: {ppt_result.document_id}")
    print(f"Filename: {ppt_result.filename}")
    print(f"Total slides: {ppt_result.total_slides}")
    print(f"Slides extracted: {ppt_result.slide_count}")
    print(f"Slides with visual content: {ppt_result.slides_with_visuals}")
    print(f"Errors: {len(ppt_result.errors)}")

In [None]:
# Display extracted text from each slide
if ppt_result:
    print("\nExtracted Slides:")
    print(f"{'='*60}")
    
    for slide in ppt_result.slides:
        print(f"\n--- Slide {slide.slide_number} ---")
        print(f"Title: {slide.title or '(no title)'}")
        print(f"Has visual content: {slide.has_visual_content}")
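        body = slide.body_text
        if len(body) > 200:
            print(f"Body text preview: {body[:200]}...")
        else:
            print(f"Body text: {body}")
        if slide.speaker_notes:
            notes = slide.speaker_notes
            # Append an ellipsis only when the notes are actually truncated
            print(f"Speaker notes: {notes[:100]}{'...' if len(notes) > 100 else ''}")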
        if slide.tables:
            print(f"Tables: {len(slide.tables)}")

## 5. Step 2: Extract Content from PDF

Extract both text AND images from the PDF version of the presentation.
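
The `DPI` value from the configuration controls how large each page image is. As a rough guide, and assuming the extractor renders each whole page to a raster image at that DPI, the pixel size follows from the PDF page size, since one PDF point is 1/72 inch. The helper name `rendered_size_px` and the 960 x 540 point page (a 13.333 x 7.5 inch widescreen slide) are assumptions for illustration.

```python
PDF_POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch


def rendered_size_px(width_pt, height_pt, dpi):
    """Return (width, height) in pixels for a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / PDF_POINTS_PER_INCH)
    height_px = round(height_pt * dpi / PDF_POINTS_PER_INCH)
    return width_px, height_px


# Assumed page size: a 16:9 slide of 13.333 x 7.5 inches, i.e. 960 x 540 points.
for dpi in (72, 150, 300):
    w, h = rendered_size_px(960, 540, dpi)
    print(f"{dpi} DPI -> {w} x {h} px")
```

Under these assumptions a page renders to 2000 x 1125 px at 150 DPI and 4000 x 2250 px at 300 DPI, so doubling the DPI roughly quadruples the pixel count and, before compression, the image size.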

In [None]:
# Initialize PDF content extractor
pdf_extractor = PDFContentExtractor(dpi=DPI)

# Extract content from PDF
pdf_path = Path(PDF_PATH)
if not pdf_path.exists():
    print(f"ERROR: File not found: {pdf_path}")
    print("Please update PDF_PATH in the Configuration cell (section 1) to point to a valid .pdf file")
    pdf_result = None
else:
    # Use same document_id as PPT for linking
    document_id = ppt_result.document_id if ppt_result else None
    pdf_result = pdf_extractor.extract(PDF_PATH, document_id=document_id)
    
    print(f"\n{'='*60}")
    print("STEP 2: PDF CONTENT EXTRACTION RESULT")
    print(f"{'='*60}")
    print(f"Document ID: {pdf_result.document_id}")
    print(f"Filename: {pdf_result.filename}")
    print(f"Total pages: {pdf_result.total_pages}")
    print(f"Pages extracted: {pdf_result.page_count}")
    print(f"Pages with images: {pdf_result.pages_with_images}")
    print(f"Errors: {len(pdf_result.errors)}")

In [None]:
# Preview extracted PDF pages (text + images)
from IPython.display import display, Image as IPImage

if pdf_result:
    print("\nExtracted PDF Pages:")
    print(f"{'='*60}")
    
    for page in pdf_result.pages[:3]:  # Show first 3 pages
        print(f"\n--- Page {page.page_number} ---")
        print(f"Text length: {len(page.text_content)} chars")
        print(f"Text preview: {page.text_content[:200]}..." if len(page.text_content) > 200 else f"Text: {page.text_content}")
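        if page.has_image:
            # Image metadata is only reported when an image was extracted for this page
            print(f"Image size: {page.width_px}x{page.height_px} px ({page.size_bytes / 1024:.1f} KB)")
            display(IPImage(data=page.image_bytes, width=500))
        else:
            print("Image: (not available)")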

## 6. Text Deduplication

Compare PPT text with PDF text and identify duplicates to avoid indexing the same content twice.
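
To make the idea of a similarity threshold concrete, here is a minimal sketch using the standard library's `difflib.SequenceMatcher`. It is not how `TextDeduplicator` is implemented, which this notebook does not show. The `normalize`, `similarity`, and `is_duplicate` helpers are hypothetical, and treating a score equal to the threshold as a duplicate is an assumption.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so layout differences do not affect the score."""
    return " ".join(text.lower().split())


def similarity(ppt_text, pdf_text):
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(ppt_text), normalize(pdf_text)).ratio()


def is_duplicate(ppt_text, pdf_text, threshold=0.85):
    """Treat the pair as duplicate when the similarity reaches the threshold."""
    return similarity(ppt_text, pdf_text) >= threshold


# The same slide text, extracted with different line breaks and spacing.
ppt_text = "Quarterly Results\nRevenue grew 12% year over year"
pdf_text = "Quarterly  Results Revenue grew 12% year over year"

print(is_duplicate(ppt_text, pdf_text))
print(is_duplicate(ppt_text, "System architecture overview"))
```

The first pair differs only in whitespace, so it normalizes to identical strings and counts as a duplicate. The second pair is unrelated text and falls below the threshold.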

In [None]:
if ppt_result and pdf_result:
    # Initialize deduplicator
    deduplicator = TextDeduplicator(similarity_threshold=0.85)
    
    # Find duplicates
    duplicates = deduplicator.find_duplicates(ppt_result.slides, pdf_result.pages)
    
    print(f"\n{'='*60}")
    print("DEDUPLICATION ANALYSIS")
    print(f"{'='*60}")
    print(f"\n{'Slide/Page':<12} {'PPT Len':<10} {'PDF Len':<10} {'Similarity':<12} {'Duplicate?'}")
    print("-" * 60)
    
    for dup in duplicates:
        print(
            f"{dup.slide_number:<12} "
            f"{dup.ppt_text_length:<10} "
            f"{dup.pdf_text_length:<10} "
            f"{dup.similarity_score:<12.2%} "  # alignment and width go before precision and type
            f"{'Yes' if dup.is_duplicate else 'No'}"
        )
    
    # Get unique PDF text
    unique_pdf_text = deduplicator.get_unique_pdf_text(ppt_result.slides, pdf_result.pages)
    
    print("\n\nSummary:")
    print(f"  Total PPT slides: {len(ppt_result.slides)}")
    print(f"  Total PDF pages: {len(pdf_result.pages)}")
    print(f"  Duplicate pages: {sum(1 for d in duplicates if d.is_duplicate)}")
    print(f"  Unique PDF text pages: {len(unique_pdf_text)}")
else:
    print("Missing PPT or PDF result. Run extraction cells first.")

## 7. Upload Images to S3

In [None]:
if pdf_result:
    # Initialize S3 store
    s3_store = S3ImageStore(
        bucket=S3_BUCKET,
        prefix=S3_PREFIX,
    )
    
    # Upload images from all PDF pages
    print("\nUploading images to S3...")
    
    s3_uris = []
    for page in pdf_result.pages:
        if not page.has_image:
            continue
            
        try:
            uri = s3_store.upload(
                page.image_bytes,
                document_id=page.document_id,
                page_number=page.page_number,
            )
            page.s3_uri = uri
            s3_uris.append(uri)
            print(f"Page {page.page_number}: {uri}")
        except Exception as e:
            print(f"Page {page.page_number}: FAILED - {e}")
    
    print(f"\nUploaded {len(s3_uris)} images to S3")
else:
    print("No PDF result available. Run PDF extraction first.")

## 8. Generate Presigned URLs

In [None]:
# globals().get avoids a NameError when the upload cell has not been run yet
if pdf_result and globals().get("s3_uris"):
    print("\nPresigned URLs (valid for 1 hour):")
    
    for page in pdf_result.pages:
        if page.s3_uri:
            try:
                presigned_url = s3_store.get_presigned_url(page.s3_uri, expiration=3600)
                print(f"\nPage {page.page_number}:")
                print(f"  S3 URI: {page.s3_uri}")
                print(f"  URL: {presigned_url[:100]}...")
            except Exception as e:
                print(f"Page {page.page_number}: Error - {e}")
else:
    print("No S3 URIs available. Run upload cell first.")

## 9. Generate Embeddings (Optional)

Generate embeddings using Amazon Bedrock (requires AWS credentials).

In [None]:
# Uncomment to generate embeddings
# Requires: AWS credentials with Bedrock access

# from rag_assist.ingestion.images import CohereTextEmbedder, TitanMultimodalEmbedder

# # Initialize embedders
# text_embedder = CohereTextEmbedder()
# image_embedder = TitanMultimodalEmbedder()

# # Generate text embeddings for PPT slides
# if ppt_result:
#     print("Generating text embeddings...")
#     for slide in ppt_result.slides[:3]:  # First 3 for demo
#         embedding = text_embedder.embed(slide.full_text)
#         print(f"Slide {slide.slide_number}: {len(embedding)} dimensions")

# # Generate image embeddings for PDF pages
# if pdf_result:
#     print("\nGenerating image embeddings...")
#     for page in pdf_result.pages[:3]:  # First 3 for demo
#         if page.has_image:
#             embedding = image_embedder.embed_image(page.image_bytes)
#             print(f"Page {page.page_number}: {len(embedding)} dimensions")

## 10. Full Indexing Pipeline (Optional)

Index content to OpenSearch using the full pipeline.

In [None]:
# Uncomment to run full indexing pipeline
# Requires: OpenSearch cluster, AWS credentials with Bedrock access

# from opensearchpy import OpenSearch
# from rag_assist.ingestion.images import (
#     PPTContentIndexer,
#     CohereTextEmbedder,
#     TitanMultimodalEmbedder,
#     S3ImageStore,
# )

# # Initialize OpenSearch client
# opensearch_client = OpenSearch(
#     hosts=[{'host': OPENSEARCH_HOST, 'port': 443}],
#     http_compress=True,
#     use_ssl=True,
# )

# # Initialize components
# text_embedder = CohereTextEmbedder()
# image_embedder = TitanMultimodalEmbedder()
# s3_store = S3ImageStore(bucket=S3_BUCKET, prefix=S3_PREFIX)

# # Initialize indexer
# indexer = PPTContentIndexer(
#     opensearch_client=opensearch_client,
#     text_embedder=text_embedder,
#     image_embedder=image_embedder,
#     s3_store=s3_store,
#     index_name=OPENSEARCH_INDEX,
# )

# # Ensure index exists
# indexer.ensure_index_exists()

# # Run full pipeline
# ppt_result, pdf_result = indexer.index_full_pipeline(PPTX_PATH, PDF_PATH)

# print(f"\nIndexing Results:")
# print(f"  PPT text indexed: {ppt_result.text_documents_indexed}")
# print(f"  PDF text indexed: {pdf_result.text_documents_indexed}")
# print(f"  PDF images indexed: {pdf_result.image_documents_indexed}")

## 11. Search and Retrieve (Optional)

Search for content and retrieve linked images.

In [None]:
# Uncomment to search content
# Requires: Completed indexing (section 10, Full Indexing Pipeline)

# from rag_assist.ingestion.images import PPTContentRetriever

# # Initialize retriever
# retriever = PPTContentRetriever(
#     opensearch_client=opensearch_client,
#     text_embedder=text_embedder,
#     s3_store=s3_store,
#     index_name=OPENSEARCH_INDEX,
# )

# # Search for content
# query = "What is the architecture?"
# results = retriever.search(query, top_k=5, include_images=True)

# print(f"\nSearch Results for: '{query}'")
# print("=" * 60)

# for r in results:
#     print(f"\nSlide {r.slide_number} (score: {r.text_score:.3f})")
#     print(f"Title: {r.title}")
#     print(f"Text: {r.text_content[:200]}...")
#     if r.image_presigned_url:
#         print(f"Image: {r.image_presigned_url[:80]}...")
#         # display(IPImage(url=r.image_presigned_url, width=400))

## 12. Cleanup (Optional)

In [None]:
# Uncomment to delete uploaded images from S3
# WARNING: This will delete all images for this document from S3!

# if pdf_result:
#     doc_id = pdf_result.document_id
#     deleted = s3_store.delete_document_images(doc_id)
#     print(f"Deleted {deleted} images from S3")

---

## Summary

This notebook demonstrated the 2-step framework for PPT content extraction:

| Step | Component | Description |
|------|-----------|-------------|
| 1 | `PPTXTextExtractor` | Extract text from PPTX (primary text source) |
| 2 | `PDFContentExtractor` | Extract text + images from PDF |
| - | `TextDeduplicator` | Detect duplicate text between PPT and PDF |
| - | `S3ImageStore` | Upload images to S3, generate presigned URLs |
| - | `CohereTextEmbedder` | Generate text embeddings via Bedrock |
| - | `TitanMultimodalEmbedder` | Generate image embeddings via Bedrock |
| - | `PPTContentIndexer` | Index to OpenSearch |
| - | `PPTContentRetriever` | Search and retrieve with linked images |

### Integration with RAG

```python
# During retrieval, get text + linked image:
results = retriever.search(query, include_images=True)

for result in results:
    # Text content (from PPT or unique PDF text)
    context = result.text_content
    
    # Linked image (if available)
    if result.image_presigned_url:
        # Include image in response
        image_url = result.image_presigned_url
```
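
One way to turn those results into model input is sketched below. It is an illustration only: `RetrievedSlide` is a hypothetical stand-in that mirrors the fields used in the snippet above, `build_context` is not part of the library, and the URL is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrievedSlide:  # hypothetical; mirrors the result fields used above
    slide_number: int
    text_content: str
    image_presigned_url: Optional[str] = None


def build_context(results):
    """Return (context_text, image_urls) assembled from retrieved slides."""
    parts, image_urls = [], []
    for r in results:
        # Tag each passage with its slide number so answers can cite the source slide.
        parts.append(f"[Slide {r.slide_number}] {r.text_content}")
        if r.image_presigned_url:
            image_urls.append(r.image_presigned_url)
    return "\n\n".join(parts), image_urls


results = [
    RetrievedSlide(3, "System architecture overview", "https://example.com/slide3.png"),
    RetrievedSlide(7, "Deployment steps"),  # no linked image
]
context, image_urls = build_context(results)
print(context)
print(image_urls)
```

Slides without a linked image still contribute their text to the context; only the image list is shorter.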