PDF to Markdown Pipeline - Usage Example Notebook
================================================

This notebook demonstrates how to use the high-fidelity PDF to Markdown pipeline
with LangChain Ollama integration.

Prerequisites:
- Ollama server running with a vision model (e.g., llama3.2-vision:11b)
- Required Python packages: PyMuPDF, langchain-ollama, Pillow (installed in the first cell below)
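
Before running the cells below, it can help to confirm that the Ollama server is reachable and that the vision model has been pulled. The following is a minimal sketch, not part of the pipeline: it queries Ollama's `/api/tags` endpoint (which lists locally available models) using only the standard library. The helper names `parse_model_names` and `list_ollama_models` are illustrative assumptions.

```python
import json
import urllib.error
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_ollama_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the models the server reports, or [] if it cannot be queried."""
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_model_names(json.load(response))
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # Covers refused connections, timeouts, malformed URLs, and bad JSON.
        print(f"Could not query {url}: {exc}")
        return []
```

For example, `"llama3.2-vision:11b" in list_ollama_models("http://127.0.0.1:11434")` should be true once the model is available on a local server.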

In [None]:
# Install required packages (run once)
%pip install pymupdf langchain-ollama pillow

Note: you may need to restart the kernel to use updated packages.






# Simple pipeline (vision only)

In [None]:
from src.vision_parser.parser import convert_pdf_to_markdown, PDFToMarkdownPipeline

In [None]:
# Configuration
PDF_PATH = "sample_document.pdf"
OLLAMA_MODEL = "qwen2.5vl:3b-q4_K_M"  # Or any vision model you have
OLLAMA_BASE_URL = "http://127.0.0.1:11434"  # Or a remote server, e.g. "http://192.168.100.80:1818"
OUTPUT_DIR = "./output"

print("=== Simple Vision-Only PDF Conversion ===")

# Method 1: Simple one-liner
result = convert_pdf_to_markdown(
    pdf_path=PDF_PATH,
    ollama_model=OLLAMA_MODEL,
    ollama_base_url=OLLAMA_BASE_URL,
    output_dir=OUTPUT_DIR,
    dpi=300,  # Higher DPI for better quality
    log_level="INFO"
)

if result.success:
    print(f"Successfully converted {len(result.pages)} pages")
    print(f"Output saved to: {OUTPUT_DIR}")
    
    # Show first page preview
    if result.pages:
        first_page = result.pages[0]
        print(f"\nFirst page preview ({len(first_page)} characters):")
        print("=" * 50)
        print(first_page[:500] + "..." if len(first_page) > 500 else first_page)
        print("=" * 50)

=== Simple Vision-Only PDF Conversion ===
2025-07-07 12:12:49,916 - src.vision_parser.parser - INFO - Connected to Ollama at http://127.0.0.1:11434
2025-07-07 12:12:49,919 - src.vision_parser.parser - INFO - VisionProcessor initialized with model: qwen2.5vl:3b-q4_K_M
2025-07-07 12:12:49,922 - src.vision_parser.parser - INFO - Text validation enabled with threshold: 0.65
2025-07-07 12:12:49,925 - src.vision_parser.parser - INFO - Processing PDF with 4 pages...
2025-07-07 12:12:49,925 - src.vision_parser.parser - INFO - Processing page 1/4
2025-07-07 12:18:24,955 - src.vision_parser.parser - INFO - Processing page 2/4
2025-07-07 12:23:49,168 - src.vision_parser.parser - INFO - Processing page 3/4
2025-07-07 12:28:41,672 - src.vision_parser.parser - INFO - Processing page 4/4
2025-07-07 12:33:08,641 - src.vision_parser.parser - INFO - Running text validation...
2025-07-07 12:33:08,737 - src.vision_parser.validation - INFO - Text validation ✅ PASSED (score: 0.895)
2025-07-07 12:33:08,745 -
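
The timestamps in the log above show how long each page took. As a rough sketch (not part of the pipeline), the per-page duration can be recovered by parsing the log lines and subtracting consecutive timestamps. The function name `page_durations` and the regular expressions are assumptions that match only the log format shown here.

```python
import re
from datetime import datetime

# Matches lines such as "2025-07-07 12:12:49,925 - logger.name - INFO - message"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - \S+ - \w+ - (.*)$")
PAGE_START = re.compile(r"Processing page (\d+)/(\d+)")


def page_durations(log_text: str) -> dict[int, float]:
    """Map each page number to the seconds elapsed until the next log entry."""
    entries = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            entries.append((stamp, match.group(2)))

    durations = {}
    for (start, message), (end, _) in zip(entries, entries[1:]):
        page = PAGE_START.search(message)
        if page:
            durations[int(page.group(1))] = (end - start).total_seconds()
    return durations


sample_log = """\
2025-07-07 12:12:49,925 - src.vision_parser.parser - INFO - Processing page 1/4
2025-07-07 12:18:24,955 - src.vision_parser.parser - INFO - Processing page 2/4
2025-07-07 12:23:49,168 - src.vision_parser.parser - INFO - Processing page 3/4
2025-07-07 12:28:41,672 - src.vision_parser.parser - INFO - Processing page 4/4
2025-07-07 12:33:08,641 - src.vision_parser.parser - INFO - Running text validation...
"""

for page, seconds in page_durations(sample_log).items():
    print(f"Page {page}: {seconds:.1f} s")
```

For this run the pages took roughly 335, 324, 293, and 267 seconds, so about five minutes per page at 300 DPI with this model and hardware.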

### Example of using the pipeline directly for more control

In [None]:
# Initialize pipeline
pipeline = PDFToMarkdownPipeline(
    ollama_model="qwen2.5vl:3b-q4_K_M",
    ollama_base_url="http://192.168.100.80:1818",
    enable_validation=False,
    dpi=400  # Higher resolution
)

# Convert PDF
result = pipeline.convert_pdf("sample_document.pdf")

if result.success:
    # Process each page individually
    for i, page_content in enumerate(result.pages):
        print(f"\nPage {i+1} ({len(page_content)} characters)")
    
    # Save with custom directory
    saved_files = pipeline.save_results(result, "./custom_output")
    print(f"Saved {len(saved_files)} files")
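
When the per-page strings are available, as with `result.pages` above, they can also be combined into a single document by hand. The sketch below is one possible approach and does not describe what `save_results` does internally; `write_combined_markdown` is a hypothetical helper, and the horizontal-rule separator is an arbitrary choice.

```python
from pathlib import Path


def write_combined_markdown(pages: list[str], output_path: str) -> Path:
    """Join per-page Markdown strings into one file, separated by rules."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Skip pages that came back empty so they do not leave stray separators.
    body = "\n\n---\n\n".join(p.strip() for p in pages if p.strip())
    path.write_text(body + "\n", encoding="utf-8")
    return path
```

A call such as `write_combined_markdown(result.pages, "./custom_output/combined.md")` would then produce one file for the whole PDF.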

# Hybrid parser

In [5]:
# Import the pipeline components
import sys
import fitz
sys.path.append('.')  # Adjust path as needed

from src.hybrid_parser.pipeline import (
    PDFToMarkdownPipeline, 
    PipelineConfig,
    convert_pdf_to_markdown
)
from pathlib import Path



In [2]:
# Configuration
OLLAMA_MODEL = "qwen2.5vl:3b-q4_K_M"  # Change to your preferred model
OLLAMA_BASE_URL = "http://192.168.100.80:1818"  # Your Ollama server URL
PDF_PATH = "sample_document.pdf"  # Path to your PDF file
OUTPUT_DIR = "./output"



print(f"Using model: {OLLAMA_MODEL}")
print(f"Ollama server: {OLLAMA_BASE_URL}")

Using model: qwen2.5vl:3b-q4_K_M
Ollama server: http://192.168.100.80:1818


### Method 1: Simple conversion (output is saved to a file)

In [3]:
print("=== Simple Conversion ===")

result = convert_pdf_to_markdown(
    pdf_path=PDF_PATH,
    ollama_model=OLLAMA_MODEL,
    ollama_base_url=OLLAMA_BASE_URL,
    output_dir=OUTPUT_DIR,
    log_level="DEBUG"
)

if result.success:
    print(f"✅ Successfully converted {len(result.pages)} pages")
    print(f"📁 Output saved to: {OUTPUT_DIR}")
else:
    print("❌ Conversion failed:")
    for error in result.errors:
        print(f"   - {error}")

=== Simple Conversion ===
2025-07-05 14:47:36,901 - src.vision_processor - INFO - Connected to Ollama at http://192.168.100.80:1818
2025-07-05 14:47:36,904 - src.vision_processor - INFO - VisionProcessor initialized with model: qwen2.5vl:3b-q4_K_M
2025-07-05 14:47:36,906 - src.markdown_generator - DEBUG - MarkdownGenerator initialized with config: {'dpi': 300, 'vision_model_temp': 0.1, 'text_extraction_priority': True, 'image_embed_mode': 'base64', 'preserve_formatting': True, 'table_detection_threshold': 0.7, 'formula_detection_threshold': 0.8, 'min_image_size': (50, 50)}
2025-07-05 14:47:36,914 - src.pipeline - INFO - Processing PDF with 4 pages...
2025-07-05 14:47:36,922 - src.pipeline - INFO - Processing page 1/4
2025-07-05 14:47:36,923 - src.pipeline - INFO -   Analyzing page structure...
2025-07-05 14:47:36,936 - src.pdf_analyzer - DEBUG - Has text:True
2025-07-05 14:47:36,941 - src.pdf_analyzer - DEBUG - Text_coverage:0.1297491770806737
2025-07-05 14:47:37,006 - src.pdf_analyzer

### Method 2: Advanced usage with custom configuration

In [6]:
print("\n=== Advanced Configuration ===")

# Create custom configuration
config = PipelineConfig()
config.dpi = 400  # Higher resolution for better OCR
config.vision_model_temp = 0.1  # Lower temperature for consistent output
config.text_extraction_priority = True  # Prefer text extraction when possible
config.preserve_formatting = True  # Maintain original formatting
config.image_embed_mode = "base64"  # Embed images as base64

# Initialize pipeline with custom config
pipeline = PDFToMarkdownPipeline(
    ollama_model=OLLAMA_MODEL,
    ollama_base_url=OLLAMA_BASE_URL,
    config=config
)

# Show pipeline information
print("Pipeline Configuration:")
info = pipeline.get_pipeline_info()
for component, name in info["components"].items():
    print(f"  {component}: {name}")


=== Advanced Configuration ===
Pipeline Configuration:
  analyzer: PDFAnalyzer
  text_extractor: TextExtractor
  vision_processor: VisionProcessor
  integrator: ContentIntegrator
  markdown_generator: MarkdownGenerator
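
Raising `dpi` increases the size of the image sent to the vision model quadratically, which affects both memory use and processing time. The arithmetic is sketched below: PDF page sizes are expressed in points (72 per inch), so the pixel size is `points * dpi / 72`. The helper `rendered_size` is illustrative, and the renderer's own rounding may differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rasterised at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


A4 = (595, 842)  # A4 page size in points, rounded
for dpi in (150, 300, 400):
    width, height = rendered_size(*A4, dpi)
    megapixels = width * height / 1e6
    print(f"{dpi} DPI -> {width} x {height} px ({megapixels:.1f} MP)")
```

Going from 300 to 400 DPI on an A4 page therefore means about 15.5 MP instead of 8.7 MP, roughly 1.8 times as many pixels.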


### Method 3: Process single page for testing

In [7]:
print("\n=== Single Page Processing ===")

import fitz

# Open PDF and process first page only
with fitz.open(PDF_PATH) as doc:
    if doc.page_count > 0:
        first_page = doc[0]
        
        print("Analyzing first page...")
        analysis = pipeline.analyzer.analyze_page_content(first_page)
        
        print(f"Page Analysis:")
        print(f"  - Has extractable text: {analysis.has_extractable_text}")
        print(f"  - Text coverage: {analysis.text_coverage:.2f}")
        print(f"  - Has images: {analysis.has_images}")
        print(f"  - Has tables: {analysis.has_tables}")
        print(f"  - Has formulas: {analysis.has_formulas}")
        print(f"  - Recommended strategy: {analysis.strategy.value}")
        print(f"  - Confidence: {analysis.confidence:.2f}")
        
        # Process the page
        print("\nProcessing page...")
        page_markdown = pipeline.convert_page(first_page)
        
        print(f"\nGenerated markdown ({len(page_markdown)} characters):")
        print("=" * 50)
        print(page_markdown[:500] + "..." if len(page_markdown) > 500 else page_markdown)
        print("=" * 50)


=== Single Page Processing ===
Analyzing first page...
Page Analysis:
  - Has extractable text: True
  - Text coverage: 0.12
  - Has images: False
  - Has tables: False
  - Has formulas: False
  - Layout complexity: 0.90
  - Recommended strategy: vision_only
  - Confidence: 0.80

Processing page...
  Analyzing page structure...
  Strategy: vision_only (confidence: 0.80)
  Processing with vision model...
    General content extraction...
  Integrating content...
  Generating markdown...

Generated markdown (0 characters):
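
The run above returned 0 characters without raising an error, so an empty result is worth checking for explicitly. A minimal sketch, assuming pages are plain strings as in `result.pages`; `empty_pages` is an illustrative helper, not part of the pipeline.

```python
def empty_pages(pages: list[str], min_chars: int = 1) -> list[int]:
    """Return 1-based numbers of pages whose Markdown is shorter than min_chars."""
    return [
        number
        for number, text in enumerate(pages, start=1)
        if len(text.strip()) < min_chars  # whitespace-only output counts as empty
    ]


pages = ["# Title\n\nSome text.", "", "   \n"]
print(empty_pages(pages))
```

Pages flagged this way could be retried with a different model or a different DPI.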



### Method 4: Batch processing with custom content handling

In [7]:
print("\n=== Batch Processing with Content Analysis ===")

def analyze_pdf_structure(pdf_path: str):
    """Analyze entire PDF structure before processing"""
    with fitz.open(pdf_path) as doc:
        analyses = {}
        
        print(f"Analyzing PDF structure ({doc.page_count} pages)...")
        
        for page_num in range(doc.page_count):
            page = doc[page_num]
            analysis = pipeline.analyzer.analyze_page_content(page)
            analyses[page_num] = analysis
            
            print(f"Page {page_num + 1}: {analysis.strategy.value} "
                  f"(conf: {analysis.confidence:.2f})")
    
    return analyses

# Analyze structure first
if Path(PDF_PATH).exists():
    pdf_analyses = analyze_pdf_structure(PDF_PATH)
    
    # Show summary statistics
    strategies = [a.strategy.value for a in pdf_analyses.values()]
    strategy_counts = {s: strategies.count(s) for s in set(strategies)}
    
    print("\nStrategy Distribution:")
    for strategy, count in strategy_counts.items():
        print(f"  {strategy}: {count} pages")


=== Batch Processing with Content Analysis ===
Analyzing PDF structure (1 pages)...
Page 1: vision_only (conf: 0.80, complex: 0.90)

Strategy Distribution:
  vision_only: 1 pages

Average layout complexity: 0.90
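
The strategy tally in the cell above can also be written with `collections.Counter`, which counts in a single pass and sorts by frequency. In this sketch plain strings stand in for `analysis.strategy.value`; only `vision_only` appears in the output above, and the other labels are hypothetical.

```python
from collections import Counter

# Stand-ins for analysis.strategy.value; "text_only" and "hybrid" are
# hypothetical labels used only for illustration.
strategies = ["vision_only", "text_only", "vision_only", "hybrid"]

strategy_counts = Counter(strategies)
for strategy, count in strategy_counts.most_common():
    print(f"  {strategy}: {count} pages")
```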


### Method 5: Testing different vision models

In [None]:
print("\n=== Model Comparison ===")

# List of models to test (uncomment available models)
test_models = [
    "llama3.2-vision:11b",
    "qwen2.5vl:3b-q4_K_M",
    # "llava:13b",
    # "bakllava",
]

def test_model_performance(models: list, test_pdf: str):
    """Test different models on the same page"""
    if not Path(test_pdf).exists():
        print(f"Test PDF not found: {test_pdf}")
        return {}  # Empty dict so callers can still iterate over the results
    
    with fitz.open(test_pdf) as doc:
        test_page = doc[0]  # Use first page for testing
        
        results = {}
        
        for model in models:
            try:
                print(f"\nTesting model: {model}")
                
                # Create pipeline with this model
                test_pipeline = PDFToMarkdownPipeline(model, OLLAMA_BASE_URL)
                
                # Process page
                markdown = test_pipeline.convert_page(test_page)
                
                results[model] = {
                    "success": True,
                    "length": len(markdown),
                    "preview": markdown[:200] + "..." if len(markdown) > 200 else markdown
                }
                
                print(f"  ✅ Success - {len(markdown)} chars")
                
            except Exception as e:
                results[model] = {
                    "success": False,
                    "error": str(e)
                }
                print(f"  ❌ Failed: {e}")

    return results

# Run model comparison (only if you have multiple models)
if len(test_models) > 1:
    model_results = test_model_performance(test_models, PDF_PATH)
    
    print("\n=== Model Comparison Results ===")
    for model, result in model_results.items():
        if result["success"]:
            print(f"{model}: {result['length']} characters")
        else:
            print(f"{model}: FAILED - {result['error']}")
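
Output length alone says little about a model; processing time is usually the other half of the comparison. Below is a small sketch of a timing wrapper that could be placed around `test_pipeline.convert_page(test_page)`. The names `timed` and `fake_convert` are illustrative, and `fake_convert` only stands in for the real conversion call.

```python
import time
from typing import Any, Callable


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for test_pipeline.convert_page(test_page)
def fake_convert(text: str) -> str:
    time.sleep(0.01)
    return text.upper()


markdown, seconds = timed(fake_convert, "# heading")
print(f"{len(markdown)} chars in {seconds:.2f} s")
```

In the comparison loop this would become `markdown, seconds = timed(test_pipeline.convert_page, test_page)`, with `seconds` stored alongside `length` in the results.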

### Method 6: Error handling and debugging

In [None]:
print("\n=== Error Handling Examples ===")

# Test with non-existent file
print("Testing with non-existent file...")
bad_result = convert_pdf_to_markdown("nonexistent.pdf", OLLAMA_MODEL, OLLAMA_BASE_URL)
print(f"Expected failure: {not bad_result.success}")

# Test with wrong Ollama URL
print("\nTesting with wrong Ollama URL...")
try:
    bad_pipeline = PDFToMarkdownPipeline(OLLAMA_MODEL, "http://localhost:99999")
    # This will fail when we try to use the vision processor
    print("Pipeline created (will fail on actual processing)")
except Exception as e:
    print(f"Connection error: {e}")

# %%
# Final summary
print("\n" + "="*60)
print("PDF to Markdown Pipeline Demo Complete!")
print("="*60)

if Path(OUTPUT_DIR).exists():
    output_files = list(Path(OUTPUT_DIR).glob("*"))
    print(f"\nGenerated files in {OUTPUT_DIR}:")
    for file in output_files:
        size = file.stat().st_size if file.is_file() else 0
        print(f"  📄 {file.name} ({size:,} bytes)")

print("\nPipeline ready for production use!")
print("💡 Tip: Adjust PipelineConfig settings for your specific needs")
print("🔧 Remember to tune vision model temperature and DPI settings")
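
The wrong-URL test in the error-handling cell uses port 99999, which is outside the valid range (0-65535), so such a URL can be rejected before any connection attempt. The sketch below shows one way to do that with the standard library; `is_valid_base_url` is a hypothetical helper, not part of the pipeline.

```python
from urllib.parse import urlsplit


def is_valid_base_url(url: str) -> bool:
    """Check scheme, host, and port range without contacting the server."""
    try:
        parts = urlsplit(url)
        parts.port  # accessing .port raises ValueError if it is out of range
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


print(is_valid_base_url("http://127.0.0.1:11434"))
print(is_valid_base_url("http://localhost:99999"))
```

A check like this only catches malformed URLs; a well-formed URL pointing at a server that is down still needs the `try`/`except` handling shown in the error-handling cell.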
