# PPT Content Extraction Demo – Local Testing (No AWS)

This notebook runs **entirely locally** on your machine (e.g. Windows). The main flow needs no AWS services (S3, Bedrock, or OpenSearch).

**Local flow:**
1. **Step 1:** Extract text from PPTX (python-pptx)
2. **Step 2:** Extract text + images from PDF (PyMuPDF)
3. **Deduplication:** Compare PPT vs PDF text
4. **Save images:** Write PDF page images to a local folder

Use local paths for your PPT and PDF files (same folder as the notebook or any folder on your machine). Optional cloud cells (S3, Bedrock, OpenSearch) are at the end for when you run on SageMaker.

## 1. Configuration (Local Paths)

**Edit these for your Windows/local environment.** Set `DATA_DIR` to the folder that contains your PPTX and PDF files (e.g. the same folder as this notebook, or `C:\Users\You\data`).

In [None]:
# =============================================================================
# LOCAL CONFIGURATION - No AWS required
# =============================================================================
from pathlib import Path

# Folder where your PPTX and PDF files live (use . for same folder as notebook)
DATA_DIR = Path(".")  # e.g. Path(".") or Path("C:/Users/You/data")

# Filenames (or use full paths in PPTX_PATH / PDF_PATH below)
PPTX_FILENAME = "sample_presentation.pptx"
PDF_FILENAME = "sample_presentation.pdf"

# Full paths to files (Windows-friendly: use Path or raw strings)
PPTX_PATH = DATA_DIR / PPTX_FILENAME
PDF_PATH = DATA_DIR / PDF_FILENAME

# Where to save extracted PDF page images locally (no S3)
OUTPUT_DIR = Path("./output")
OUTPUT_IMAGES_DIR = OUTPUT_DIR / "images"  # images saved as output/images/<doc_id>/page_001.png

# Image quality for PDF rendering (DPI)
DPI = 150  # 150 for web, 300 for print

# =============================================================================
print("Local config (no AWS):")
print(f"  DATA_DIR:      {DATA_DIR.resolve()}")
print(f"  PPTX_PATH:     {PPTX_PATH}")
print(f"  PDF_PATH:     {PDF_PATH}")
print(f"  OUTPUT_DIR:   {OUTPUT_DIR.resolve()}")
print(f"  DPI:          {DPI}")
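
Before running the extraction steps, it can help to confirm that the configured input files exist and that the output folder can be created. The following is a minimal sketch using only `pathlib`; the helper names `missing_inputs` and `prepare_output_dir` are hypothetical and not part of this notebook's modules. The demo runs in a temporary folder so it does not touch your project files.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_inputs(paths: Iterable[Path]) -> List[Path]:
    """Return the paths that do not point to an existing regular file."""
    return [p for p in paths if not p.is_file()]


def prepare_output_dir(output_dir: Path) -> Path:
    """Create the output folder (and any parents) if needed; return it resolved."""
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir.resolve()


# Demo in a throwaway folder: only the PPTX exists, so the PDF is reported missing.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "sample_presentation.pptx").touch()
    wanted = [
        data_dir / "sample_presentation.pptx",
        data_dir / "sample_presentation.pdf",
    ]
    print("Missing:", [p.name for p in missing_inputs(wanted)])
    print("Output folder:", prepare_output_dir(data_dir / "output" / "images"))
```

In the notebook itself you would call `missing_inputs([PPTX_PATH, PDF_PATH])` and `prepare_output_dir(OUTPUT_IMAGES_DIR)` with the values from the configuration cell.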

## 2. Install Dependencies (Local Only)

For local testing you only need **python-pptx** and **PyMuPDF**. No boto3 or OpenSearch for Sections 1–8.
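
To see whether the two packages are already available in the current kernel, a small check like the one below may be useful. It is a sketch, not part of the notebook's modules: `check_installed` is a hypothetical helper, and it relies on the fact that the import names differ from the pip names (`pptx` for python-pptx, and `fitz` for PyMuPDF; recent PyMuPDF releases can also be imported as `pymupdf`).

```python
import importlib.util
from typing import Dict

# pip package name -> import name
REQUIRED_MODULES = {"python-pptx": "pptx", "PyMuPDF": "fitz"}


def check_installed(modules: Dict[str, str]) -> Dict[str, bool]:
    """Map each pip package name to True if its import name can be located."""
    return {
        package: importlib.util.find_spec(import_name) is not None
        for package, import_name in modules.items()
    }


for package, ok in check_installed(REQUIRED_MODULES).items():
    print(f"{package}: {'installed' if ok else 'missing'}")
```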

In [None]:
# Local testing only (no AWS). Uncomment to install into this notebook's kernel:
# %pip install python-pptx PyMuPDF

# Optional – only needed for cloud sections 9–12:
# %pip install boto3 opensearch-py

## 3. Import Modules

In [None]:
import sys
from pathlib import Path

# Add current folder to path (notebook and .py files in same folder)
folder = Path(".").resolve()
if str(folder) not in sys.path:
    sys.path.insert(0, str(folder))

# Local-only imports (no AWS)
from ppt_text_extractor import PPTXTextExtractor
from pdf_content_extractor import PDFContentExtractor
from text_deduplicator import TextDeduplicator

print("Modules imported successfully (local mode, no AWS).")

## 4. Step 1: Extract Text from PPTX

Extract text content from all slides using python-pptx.
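
The internals of `PPTXTextExtractor` are not shown in this notebook. As a rough illustration of what slide text extraction with python-pptx can look like, here is a minimal sketch; `collect_shape_text` and `extract_slide_texts` are hypothetical names, and the sketch ignores tables, grouped shapes, and speaker notes, which the real extractor may handle.

```python
from types import SimpleNamespace
from typing import Iterable, List


def collect_shape_text(shapes: Iterable) -> List[str]:
    """Return the stripped, non-empty text of every shape that has a text frame."""
    texts = []
    for shape in shapes:
        # Shapes without text (for example pictures) report has_text_frame == False.
        if getattr(shape, "has_text_frame", False):
            text = shape.text_frame.text.strip()
            if text:
                texts.append(text)
    return texts


def extract_slide_texts(pptx_path: str) -> List[List[str]]:
    """Open a .pptx file and return one list of text blocks per slide."""
    from pptx import Presentation  # imported lazily; requires python-pptx

    presentation = Presentation(pptx_path)
    return [collect_shape_text(slide.shapes) for slide in presentation.slides]


# Stand-in objects mimicking the attributes used above, so the helper
# can be tried without a real presentation file.
fake_shapes = [
    SimpleNamespace(has_text_frame=True, text_frame=SimpleNamespace(text="  Agenda  ")),
    SimpleNamespace(has_text_frame=False),  # e.g. a picture
    SimpleNamespace(has_text_frame=True, text_frame=SimpleNamespace(text="")),
]
print(collect_shape_text(fake_shapes))  # ['Agenda']
```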

In [None]:
# Initialize PPT text extractor
ppt_extractor = PPTXTextExtractor(
    include_speaker_notes=True,
    include_tables=True,
    detect_visual_content=True,
)

# Extract text from PPTX
pptx_path = Path(PPTX_PATH)
if not pptx_path.exists():
    print(f"ERROR: File not found: {pptx_path}")
    print("Please update PPTX_PATH in Section 1 (Configuration) to point to a valid .pptx file")
    ppt_result = None
else:
    ppt_result = ppt_extractor.extract(PPTX_PATH)
    
    print(f"\n{'='*60}")
    print("STEP 1: PPT TEXT EXTRACTION RESULT")
    print(f"{'='*60}")
    print(f"Document ID: {ppt_result.document_id}")
    print(f"Filename: {ppt_result.filename}")
    print(f"Total slides: {ppt_result.total_slides}")
    print(f"Slides extracted: {ppt_result.slide_count}")
    print(f"Slides with visual content: {ppt_result.slides_with_visuals}")
    print(f"Errors: {len(ppt_result.errors)}")

In [None]:
# Display extracted text from each slide
if ppt_result:
    print("\nExtracted Slides:")
    print(f"{'='*60}")
    
    for slide in ppt_result.slides:
        print(f"\n--- Slide {slide.slide_number} ---")
        print(f"Title: {slide.title or '(no title)'}")
        print(f"Has visual content: {slide.has_visual_content}")
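        body = slide.body_text
        if len(body) > 200:
            print(f"Body text preview: {body[:200]}...")
        else:
            print(f"Body text: {body}")
        if slide.speaker_notes:
            notes = slide.speaker_notes
            # Only add an ellipsis when the notes were actually truncated
            print(f"Speaker notes: {notes[:100]}{'...' if len(notes) > 100 else ''}")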
        if slide.tables:
            print(f"Tables: {len(slide.tables)}")

## 5. Step 2: Extract Content from PDF

Extract both text AND images from the PDF version of the presentation.
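
`PDFContentExtractor` is a local module whose code is not shown here. The sketch below illustrates one way per-page text and a PNG rendering could be obtained with PyMuPDF, and how the `DPI` setting relates to pixel size (PDF dimensions are in points, 72 per inch). The names `rendered_size_px` and `extract_pages` are hypothetical, and `get_pixmap(dpi=...)` assumes a reasonably recent PyMuPDF release; the pixel size PyMuPDF reports may differ from this estimate by a pixel.

```python
from typing import List, Tuple

POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def rendered_size_px(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


def extract_pages(pdf_path: str, dpi: int = 150) -> List[dict]:
    """Return text and a PNG rendering for each page of a PDF."""
    import fitz  # PyMuPDF, imported lazily

    pages = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc, start=1):
            pixmap = page.get_pixmap(dpi=dpi)
            pages.append({
                "page_number": index,
                "text": page.get_text(),
                "image_bytes": pixmap.tobytes("png"),
                "width_px": pixmap.width,
                "height_px": pixmap.height,
            })
    return pages


# A 13.333 x 7.5 inch widescreen slide is 960 x 540 points.
print(rendered_size_px(960, 540, 150))  # (2000, 1125)
```

Doubling the DPI from 150 to 300 doubles each dimension and roughly quadruples the pixel count, which is worth keeping in mind for image file sizes.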

In [None]:
# Initialize PDF content extractor
pdf_extractor = PDFContentExtractor(dpi=DPI)

# Extract content from PDF
pdf_path = Path(PDF_PATH)
if not pdf_path.exists():
    print(f"ERROR: File not found: {pdf_path}")
    print("Please update PDF_PATH in Section 1 (Configuration) to point to a valid .pdf file")
    pdf_result = None
else:
    # Use same document_id as PPT for linking
    document_id = ppt_result.document_id if ppt_result else None
    pdf_result = pdf_extractor.extract(PDF_PATH, document_id=document_id)
    
    print(f"\n{'='*60}")
    print("STEP 2: PDF CONTENT EXTRACTION RESULT")
    print(f"{'='*60}")
    print(f"Document ID: {pdf_result.document_id}")
    print(f"Filename: {pdf_result.filename}")
    print(f"Total pages: {pdf_result.total_pages}")
    print(f"Pages extracted: {pdf_result.page_count}")
    print(f"Pages with images: {pdf_result.pages_with_images}")
    print(f"Errors: {len(pdf_result.errors)}")

In [None]:
# Preview extracted PDF pages (text + images)
from IPython.display import display, Image as IPImage

if pdf_result:
    print("\nExtracted PDF Pages:")
    print(f"{'='*60}")
    
    for page in pdf_result.pages[:3]:  # Show first 3 pages
        print(f"\n--- Page {page.page_number} ---")
        print(f"Text length: {len(page.text_content)} chars")
        text = page.text_content
        if len(text) > 200:
            print(f"Text preview: {text[:200]}...")
        else:
            print(f"Text: {text}")
        # Image details are only meaningful when the page was actually rendered
        if page.has_image:
            print(f"Image size: {page.width_px}x{page.height_px} px ({page.size_bytes / 1024:.1f} KB)")
            display(IPImage(data=page.image_bytes, width=500))

## 6. Text Deduplication

Compare PPT text with PDF text and identify duplicates to avoid indexing the same content twice.
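
How `TextDeduplicator` computes similarity is not shown in this notebook. To make the idea of a similarity threshold concrete, here is a minimal sketch that uses the standard library's `difflib.SequenceMatcher` as a stand-in metric; the function names are hypothetical and the real module may use a different measure.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not count."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two texts as duplicates when their similarity reaches the threshold."""
    return similarity(a, b) >= threshold


# Same words, different spacing and case: identical after normalization.
print(similarity("Hello   World", "hello world"))  # 1.0
print(is_duplicate("abc", "xyz"))                  # False
```

A higher threshold flags fewer pages as duplicates; a lower one tolerates more differences, such as footers or page numbers that appear only in the PDF.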

In [None]:
if ppt_result and pdf_result:
    # Initialize deduplicator
    deduplicator = TextDeduplicator(similarity_threshold=0.85)
    
    # Find duplicates
    duplicates = deduplicator.find_duplicates(ppt_result.slides, pdf_result.pages)
    
    print(f"\n{'='*60}")
    print("DEDUPLICATION ANALYSIS")
    print(f"{'='*60}")
    print(f"\n{'Slide/Page':<12} {'PPT Len':<10} {'PDF Len':<10} {'Similarity':<12} {'Duplicate?'}")
    print("-" * 60)
    
    for dup in duplicates:
        print(
            f"{dup.slide_number:<12} "
            f"{dup.ppt_text_length:<10} "
            f"{dup.pdf_text_length:<10} "
            f"{dup.similarity_score:<12.2%} "  # alignment and width come before precision and type
            f"{'Yes' if dup.is_duplicate else 'No'}"
        )
    
    # Get unique PDF text
    unique_pdf_text = deduplicator.get_unique_pdf_text(ppt_result.slides, pdf_result.pages)
    
    print("\n\nSummary:")
    print(f"  Total PPT slides: {len(ppt_result.slides)}")
    print(f"  Total PDF pages: {len(pdf_result.pages)}")
    print(f"  Duplicate pages: {sum(1 for d in duplicates if d.is_duplicate)}")
    print(f"  Unique PDF text pages: {len(unique_pdf_text)}")
else:
    print("Missing PPT or PDF result. Run extraction cells first.")

## 7. Save Images Locally (No S3)

Save extracted PDF page images to a folder on disk so you can verify the pipeline without AWS.
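
One quick way to verify the saved files is to check that each one starts with the 8-byte PNG signature. The sketch below does this with the standard library only; `is_png` and `invalid_pngs` are hypothetical helpers, and the check assumes the extractor produces PNG data, as the `.png` filenames in this notebook suggest.

```python
from pathlib import Path
from typing import List

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def is_png(data: bytes) -> bool:
    """Return True if the bytes begin with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


def invalid_pngs(folder: Path) -> List[Path]:
    """Return the .png files in a folder whose content lacks the PNG signature."""
    bad = []
    for path in sorted(folder.glob("*.png")):
        with path.open("rb") as fh:
            if not is_png(fh.read(len(PNG_SIGNATURE))):
                bad.append(path)
    return bad


print(is_png(PNG_SIGNATURE + b"...image data..."))  # True
print(is_png(b"%PDF-1.7"))                          # False
```

After the save cell has run, `invalid_pngs(OUTPUT_IMAGES_DIR / pdf_result.document_id)` should return an empty list if every file was written as PNG.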

In [None]:
if pdf_result:
    # Save each page image to a local folder (Windows-friendly paths)
    out_dir = OUTPUT_IMAGES_DIR / pdf_result.document_id
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = 0
    for page in pdf_result.pages:
        if not page.has_image:
            continue
        try:
            path = out_dir / f"page_{page.page_number:03d}.png"
            path.write_bytes(page.image_bytes)
            print(f"Page {page.page_number}: {path}")
            saved += 1
        except Exception as e:
            print(f"Page {page.page_number}: FAILED - {e}")
    print(f"\nSaved {saved} images to {out_dir.resolve()}")
else:
    print("No PDF result. Run PDF extraction first.")

## 8. Local Testing Complete

You have run the full **local** pipeline: PPT text → PDF text + images → deduplication → images saved to disk. No AWS required.

---
**Optional (cloud):** Sections 9–12 below use AWS (S3, Bedrock, OpenSearch). Run them only on SageMaker or with AWS credentials configured. The next cell is still local and only confirms the saved images.

In [None]:
# Local testing: images are in OUTPUT_IMAGES_DIR / document_id / page_001.png ...
# No presigned URLs needed locally. Use optional S3/cloud cells below when on SageMaker.
if pdf_result and OUTPUT_IMAGES_DIR.exists():
    out_dir = OUTPUT_IMAGES_DIR / pdf_result.document_id
    if out_dir.exists():
        files = list(out_dir.glob("*.png"))
        print(f"Local images saved: {len(files)} files in {out_dir.resolve()}")

## 9. [Cloud] Generate Embeddings (Optional – AWS Bedrock)

Only when testing on SageMaker or with AWS credentials. Skip for local testing.

In [None]:
# Uncomment to generate embeddings
# Requires: AWS credentials with Bedrock access

# from embedders import CohereTextEmbedder, TitanMultimodalEmbedder

# # Initialize embedders
# text_embedder = CohereTextEmbedder()
# image_embedder = TitanMultimodalEmbedder()

# # Generate text embeddings for PPT slides
# if ppt_result:
#     print("Generating text embeddings...")
#     for slide in ppt_result.slides[:3]:  # First 3 for demo
#         embedding = text_embedder.embed(slide.full_text)
#         print(f"Slide {slide.slide_number}: {len(embedding)} dimensions")

# # Generate image embeddings for PDF pages
# if pdf_result:
#     print("\nGenerating image embeddings...")
#     for page in pdf_result.pages[:3]:  # First 3 for demo
#         if page.has_image:
#             embedding = image_embedder.embed_image(page.image_bytes)
#             print(f"Page {page.page_number}: {len(embedding)} dimensions")

## 10. [Cloud] Full Indexing Pipeline (Optional – OpenSearch + Bedrock)

Only when testing on SageMaker. Requires OpenSearch and AWS credentials.
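
The cell below refers to `OPENSEARCH_HOST`, `OPENSEARCH_INDEX`, `S3_BUCKET`, and `S3_PREFIX`, none of which are defined elsewhere in this notebook. One possible way to supply them is through environment variables, as in this sketch. The helper `load_cloud_settings`, the choice of environment variables, and the default prefix are all assumptions made for illustration.

```python
import os
from typing import Dict, Mapping

# Hypothetical setting names, chosen to match the constants used in the cloud cells.
REQUIRED_SETTINGS = ("OPENSEARCH_HOST", "OPENSEARCH_INDEX", "S3_BUCKET")
OPTIONAL_DEFAULTS = {"S3_PREFIX": "ppt-images/"}  # illustrative default only


def load_cloud_settings(env: Mapping[str, str]) -> Dict[str, str]:
    """Collect cloud settings from a mapping, failing early if any are missing."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing cloud settings: {', '.join(missing)}")
    settings = {name: env[name] for name in REQUIRED_SETTINGS}
    for name, default in OPTIONAL_DEFAULTS.items():
        settings[name] = env.get(name, default)
    return settings


# Demo with placeholder values; on SageMaker you would pass os.environ instead.
example_env = {
    "OPENSEARCH_HOST": "search.example.invalid",
    "OPENSEARCH_INDEX": "ppt-content",
    "S3_BUCKET": "example-bucket",
}
settings = load_cloud_settings(example_env)
print(settings["S3_PREFIX"])  # ppt-images/
```

Failing early with a clear message avoids a `NameError` partway through the indexing cell.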

In [None]:
# Uncomment to run full indexing pipeline
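# Requires: OpenSearch cluster, AWS credentials with Bedrock access
# Define OPENSEARCH_HOST, OPENSEARCH_INDEX, S3_BUCKET and S3_PREFIX first (they are not set anywhere in this notebook)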

# from opensearchpy import OpenSearch
# from content_indexer import PPTContentIndexer
# from embedders import CohereTextEmbedder, TitanMultimodalEmbedder
# from s3_store import S3ImageStore

# # Initialize OpenSearch client
# opensearch_client = OpenSearch(
#     hosts=[{'host': OPENSEARCH_HOST, 'port': 443}],
#     http_compress=True,
#     use_ssl=True,
# )

# # Initialize components
# text_embedder = CohereTextEmbedder()
# image_embedder = TitanMultimodalEmbedder()
# s3_store = S3ImageStore(bucket=S3_BUCKET, prefix=S3_PREFIX)

# # Initialize indexer
# indexer = PPTContentIndexer(
#     opensearch_client=opensearch_client,
#     text_embedder=text_embedder,
#     image_embedder=image_embedder,
#     s3_store=s3_store,
#     index_name=OPENSEARCH_INDEX,
# )

# # Ensure index exists
# indexer.ensure_index_exists()

# # Run full pipeline
# ppt_result, pdf_result = indexer.index_full_pipeline(PPTX_PATH, PDF_PATH)

# print(f"\nIndexing Results:")
# print(f"  PPT text indexed: {ppt_result.text_documents_indexed}")
# print(f"  PDF text indexed: {pdf_result.text_documents_indexed}")
# print(f"  PDF images indexed: {pdf_result.image_documents_indexed}")

## 11. [Cloud] Search and Retrieve (Optional – OpenSearch + S3)

Only when indexing (Section 10) has been run on SageMaker.

In [None]:
# Uncomment to search content
# Requires: Completed indexing (Section 10)

# from content_retriever import PPTContentRetriever

# # Initialize retriever
# retriever = PPTContentRetriever(
#     opensearch_client=opensearch_client,
#     text_embedder=text_embedder,
#     s3_store=s3_store,
#     index_name=OPENSEARCH_INDEX,
# )

# # Search for content
# query = "What is the architecture?"
# results = retriever.search(query, top_k=5, include_images=True)

# print(f"\nSearch Results for: '{query}'")
# print("=" * 60)

# for r in results:
#     print(f"\nSlide {r.slide_number} (score: {r.text_score:.3f})")
#     print(f"Title: {r.title}")
#     print(f"Text: {r.text_content[:200]}...")
#     if r.image_presigned_url:
#         print(f"Image: {r.image_presigned_url[:80]}...")
#         # display(IPImage(url=r.image_presigned_url, width=400))

## 12. [Cloud] Cleanup (Optional – S3)

Delete uploaded images from S3 when testing on SageMaker.

In [None]:
# Uncomment to delete uploaded images from S3
# WARNING: This will delete all images for this document from S3!

# if pdf_result:
#     doc_id = pdf_result.document_id
#     deleted = s3_store.delete_document_images(doc_id)
#     print(f"Deleted {deleted} images from S3")

---

## Summary

**Local testing (no AWS):** Sections 1–8 run entirely on your machine (e.g. Windows). Use local paths for PPTX and PDF; images are saved to `output/images/<document_id>/`.

| Step | Component | Description |
|------|-----------|-------------|
| 1 | `PPTXTextExtractor` | Extract text from PPTX (primary text source) |
| 2 | `PDFContentExtractor` | Extract text + images from PDF |
| - | `TextDeduplicator` | Detect duplicate text between PPT and PDF |
| - | Local save | Save PDF page images to `OUTPUT_IMAGES_DIR` |

**Cloud (optional):** Sections 9–12 use AWS (Bedrock, S3, OpenSearch). Run those only on SageMaker or when you have credentials.