# LlamaParse Parsing Pipeline

This notebook walks through a LlamaParse parsing pipeline for PDF documents:
- Document parsing with LlamaParse, LlamaIndex's cloud PDF parsing service
- Markdown conversion and formatting
- Performance timing and analysis
- Comparison notes against Unstructured parsing

## Setup and Imports

In [None]:
import sys
import os
from pathlib import Path
import time
import json
from typing import Dict, Any

# Add the src directory to Python path
sys.path.append('../src')

from simple_rag.parsers.parser_llama import LlamaParseProcessor
from simple_rag.main_parser import MainParserProcessor

## Configuration

Set up the input PDF file and output directory for processing.

In [None]:
# Configuration
PDF_FILE = "../data/raw/test_p1_7.pdf"  # Change this to your PDF file
OUTPUT_DIR = Path("../data/processed")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"📄 Input PDF: {PDF_FILE}")
print(f"📁 Output directory: {OUTPUT_DIR}")
print(f"✅ PDF exists: {os.path.exists(PDF_FILE)}")

## Environment Check

Verify that the LLAMA_CLOUD_API_KEY is properly configured.

In [None]:
# Check for LlamaCloud API key
api_key = os.getenv('LLAMA_CLOUD_API_KEY')
if api_key:
    # Show only the last four characters so the key does not leak into saved notebook output
    print(f"🔑 LLAMA_CLOUD_API_KEY found: ...{api_key[-4:]}")
else:
    print("⚠️  LLAMA_CLOUD_API_KEY not found in environment variables")
    print("   Please set your API key: export LLAMA_CLOUD_API_KEY='your_key_here'")

## Initialize LlamaParse Parser

Create the LlamaParseProcessor instance used for all parsing calls below.

In [None]:
# Initialize the LlamaParse parser
try:
    parser = LlamaParseProcessor()
    print("🦙 LlamaParse parser initialized successfully")
    print(f"   API Key configured: {parser.api_key is not None}")
except Exception as e:
    print(f"❌ Failed to initialize LlamaParse parser: {e}")
    print("   Please check your LLAMA_CLOUD_API_KEY configuration")
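
LlamaParseProcessor lives in this project's src directory and is not shown in the notebook. As a rough sketch of what such a wrapper could look like, the hypothetical `SketchLlamaParser` below assumes the `llama-parse` package and its `LlamaParse(api_key=..., result_type="markdown")` and `load_data(path)` interface. The class name, method names, and error handling are illustrative assumptions, not the project's actual implementation.

```python
import os
from typing import List, Optional


class SketchLlamaParser:
    """Hypothetical stand-in for LlamaParseProcessor (not the project's code)."""

    def __init__(self, api_key: Optional[str] = None) -> None:
        # Fall back to the environment variable checked earlier in the notebook.
        self.api_key = api_key or os.getenv("LLAMA_CLOUD_API_KEY")
        if not self.api_key:
            raise ValueError("LLAMA_CLOUD_API_KEY is not set")

    def parse_document(self, pdf_path: str) -> List:
        # Imported lazily so the class can be constructed without the package.
        from llama_parse import LlamaParse  # assumes `pip install llama-parse`

        client = LlamaParse(api_key=self.api_key, result_type="markdown")
        return client.load_data(pdf_path)
```

Failing in the constructor when the key is missing surfaces configuration problems before any network call is attempted.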

## Document Parsing

Parse the PDF document with LlamaParse and record how long the call takes.

In [None]:
# Start timing
start_time = time.time()

print("🚀 Starting LlamaParse parsing...")
print("=" * 50)
print("⏳ This may take a few moments as LlamaParse processes the document...")

try:
    # Parse the document
    documents = parser.parse_document(PDF_FILE, verbose=True)
    
    parsing_time = time.time() - start_time
    print(f"\n⏱️  LlamaParse parsing completed in {parsing_time:.2f} seconds")
    print(f"📊 Documents extracted: {len(documents)}")
    
    # Show document info
    for i, doc in enumerate(documents):
        content_length = len(doc.text) if hasattr(doc, 'text') else 0
        print(f"   Document {i+1}: {content_length} characters")
        
except Exception as e:
    print(f"❌ LlamaParse parsing failed: {e}")
    documents = []
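
Because parsing runs on a cloud service, a transient network error or rate limit can make a single call fail. The sketch below shows one way to retry with exponential backoff. `retry_with_backoff` and its parameters are hypothetical helpers, not part of LlamaParse or this project.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

In this notebook it could wrap the parse call as `retry_with_backoff(lambda: parser.parse_document(PDF_FILE, verbose=True))`. The `sleep` parameter is injectable so the helper can be exercised without real delays.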

## Markdown Conversion

Convert the parsed content to well-formatted Markdown.

In [None]:
if documents:
    print("\n📝 Converting to Markdown format...")
    print("=" * 35)
    
    # Convert to markdown
    markdown_content = parser.convert_to_markdown(documents, verbose=True)
    
    print(f"\n📄 Markdown conversion completed")
    print(f"📊 Total markdown length: {len(markdown_content)} characters")
    
    # Show a preview of the markdown
    preview_length = 500
    print(f"\n📖 Markdown Preview (first {preview_length} chars):")
    print("-" * 50)
    print(markdown_content[:preview_length])
    if len(markdown_content) > preview_length:
        print("...")
else:
    print("⚠️  No documents to convert - skipping markdown conversion")
    markdown_content = ""
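
The body of `convert_to_markdown` is not shown in this notebook. One plausible minimal version simply joins the text of each parsed document. The sketch below assumes each document exposes a `.text` attribute, as the parsing cell does; `ParsedDoc` and `join_documents_as_markdown` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ParsedDoc:
    """Hypothetical stand-in for a parsed document exposing a .text attribute."""
    text: str


def join_documents_as_markdown(documents: Iterable, separator: str = "\n\n---\n\n") -> str:
    """Concatenate the non-empty .text of each document, separated by a rule."""
    parts = []
    for doc in documents:
        # Tolerate documents with a missing or None text attribute.
        text = (getattr(doc, "text", "") or "").strip()
        if text:
            parts.append(text)
    return separator.join(parts)
```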

## Content Analysis

Analyze the structure and content of the parsed markdown.

In [None]:
if markdown_content:
    print("\n📋 Content Structure Analysis:")
    print("=" * 35)
    
    # Count different markdown elements
    lines = markdown_content.split('\n')
    
    headers = [line for line in lines if line.strip().startswith('#')]
    # Treat a line as a table row only when it starts with a pipe, so the
    # table and paragraph counts do not overlap
    tables = [line for line in lines if line.strip().startswith('|')]
    paragraphs = [line for line in lines if line.strip() and not line.strip().startswith(('#', '|'))]
    
    print(f"📊 Structure Summary:")
    print(f"   Total lines: {len(lines)}")
    print(f"   Headers: {len(headers)}")
    print(f"   Non-empty text lines: {len(paragraphs)}")
    print(f"   Table lines: {len(tables)}")
    
    # Show headers structure
    if headers:
        print(f"\n📑 Document Structure (Headers):")
        for header in headers[:10]:  # Show first 10 headers
            stripped = header.strip()  # tolerate leading whitespace before '#'
            level = len(stripped) - len(stripped.lstrip('#'))
            title = stripped.strip('#').strip()
            indent = "  " * (level - 1)
            print(f"   {indent}{'#' * level} {title}")
        if len(headers) > 10:
            print(f"   ... and {len(headers) - 10} more headers")
else:
    print("⚠️  No markdown content to analyze")
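
The line-based counts above treat any line that starts with `#` as a header, including comment lines inside fenced code blocks. A slightly more careful pass could track fences, as in this sketch. `summarize_markdown` is a hypothetical helper, and its header pattern is a simplification of the Markdown heading rules.

```python
import re
from typing import Dict

# One to six '#' characters followed by whitespace and some text.
HEADER_RE = re.compile(r"^ {0,3}(#{1,6})\s+\S")


def summarize_markdown(text: str) -> Dict[str, int]:
    """Count headers, table rows, and text lines, skipping fenced code."""
    counts = {"headers": 0, "table_rows": 0, "text_lines": 0, "code_lines": 0}
    in_fence = False
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence  # toggle on opening and closing fences
            continue
        if in_fence:
            counts["code_lines"] += 1
        elif not stripped:
            continue
        elif HEADER_RE.match(line):
            counts["headers"] += 1
        elif stripped.startswith("|"):
            counts["table_rows"] += 1
        else:
            counts["text_lines"] += 1
    return counts
```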

## Save Results

Save the processed markdown content to a file for further use.

In [None]:
if markdown_content:
    # Save markdown results
    output_filename = f"{Path(PDF_FILE).stem}_llamaparse_notebook.md"
    output_path = OUTPUT_DIR / output_filename
    
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(markdown_content)
    
    print(f"\n💾 Markdown saved to: {output_path}")
    print(f"📁 File size: {output_path.stat().st_size / 1024:.1f} KB")
    
    # Also save metadata as JSON
    metadata = {
        "source_file": PDF_FILE,
        "parser": "LlamaParse",
        "processing_time": parsing_time,
        "document_count": len(documents),
        "markdown_length": len(markdown_content),
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
    }
    
    metadata_path = OUTPUT_DIR / f"{Path(PDF_FILE).stem}_llamaparse_notebook_metadata.json"
    with open(metadata_path, 'w', encoding='utf-8') as f:
        json.dump(metadata, f, indent=2)
    
    print(f"📋 Metadata saved to: {metadata_path}")
else:
    print("⚠️  No content to save")

## Performance Analysis

Analyze the performance characteristics of the LlamaParse parsing pipeline.

In [None]:
if documents:
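    # Reuse the parse duration measured above. Recomputing time.time() - start_time
    # in this cell would also count idle time between cell executions.
    total_time = parsing_time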
    
    print("\n⚡ PERFORMANCE ANALYSIS")
    print("=" * 30)
    
    file_size_mb = os.path.getsize(PDF_FILE) / (1024 * 1024)
    chars_per_second = len(markdown_content) / total_time if total_time > 0 else 0
    mb_per_second = file_size_mb / total_time if total_time > 0 else 0
    
    print(f"📄 Input file size: {file_size_mb:.2f} MB")
    print(f"⏱️  Total processing time: {total_time:.2f} seconds")
    print(f"⚡ Processing speed: {chars_per_second:.0f} chars/second")
    print(f"⚡ Throughput: {mb_per_second:.2f} MB/second")
    
    # Quality metrics
    if markdown_content:
        output_size_mb = len(markdown_content.encode('utf-8')) / (1024 * 1024)
        compression_ratio = file_size_mb / output_size_mb if output_size_mb > 0 else 0
        print(f"💾 Output size: {output_size_mb:.2f} MB")
        print(f"📉 Compression ratio: {compression_ratio:.1f}x")
        
        # Content density
        words = len(markdown_content.split())
        print(f"📝 Word count: {words:,}")
        print(f"📊 Words per MB: {words/file_size_mb:.0f}")
else:
    print("⚠️  No performance data available due to parsing failure")
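
In a notebook, cells run at different moments, so timing each stage separately gives more reliable numbers than one start timestamp. The sketch below uses a small context manager built on `time.perf_counter`. `timed` and the dictionary it fills are hypothetical names, not part of the project.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, sink: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Accumulate so a stage that runs more than once is summed.
        sink[stage] = sink.get(stage, 0.0) + time.perf_counter() - start
```

A cell could then wrap each step, for example `with timed("parse", stage_times): ...` and `with timed("markdown", stage_times): ...`, where `stage_times` is a plain dict, and report the entries side by side.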

## Results Summary

Display comprehensive results from the LlamaParse parsing pipeline.

In [None]:
print("\n📊 FINAL RESULTS SUMMARY")
print("=" * 50)

if documents:
    print(f"📄 Document: {os.path.basename(PDF_FILE)}")
    print(f"🦙 Parser: LlamaParse (AI-powered)")
    print(f"⏱️  Total Processing Time: {total_time:.2f} seconds")
    print(f"✅ Status: Successfully processed")
    print()
    print(f"📋 Content Summary:")
    print(f"   - Documents extracted: {len(documents)}")
    print(f"   - Markdown length: {len(markdown_content):,} characters")
    print(f"   - Word count: {len(markdown_content.split()):,} words")
    
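    # headers and tables are defined only when the Content Analysis cell ran on non-empty markdown
    if 'headers' in locals() and headers:
        print(f"   - Headers found: {len(headers)}")
    if 'tables' in locals() and tables:
        print(f"   - Table lines: {len(tables)}")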
    
    print(f"\n💾 Output Files:")
    if 'output_path' in locals():
        print(f"   - Markdown: {output_path}")
    if 'metadata_path' in locals():
        print(f"   - Metadata: {metadata_path}")
else:
    print(f"📄 Document: {os.path.basename(PDF_FILE)}")
    print(f"🦙 Parser: LlamaParse (AI-powered)")
    print(f"❌ Status: Processing failed")
    print(f"⚠️  Check the error message above, the PDF path, and your API key configuration")

## Comparison Notes

### LlamaParse vs Unstructured Parsing

**LlamaParse Advantages:**
- AI-powered understanding of document structure
- Better handling of complex layouts and tables
- Clean markdown output with proper formatting
- Excellent for documents with complex visual elements

**LlamaParse Considerations:**
- Requires API key and internet connection
- Processing time depends on cloud service
- Usage may be subject to rate limits and costs
- Less granular control over individual elements

**Best Use Cases:**
- Research papers with complex formatting
- Technical documents with tables and figures
- Documents where structure preservation is critical
- When high-quality markdown output is needed

## Conclusion

The LlamaParse parsing pipeline in this notebook provides:

- **Structure recognition** for documents with complex layouts and tables
- **Markdown output** saved alongside a JSON metadata file
- **Timing and throughput metrics** for each run
- **Cloud-based processing**, so no local parsing models are required

The processed markdown is ready for further chunking, embedding, and integration into your RAG system.
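
As a starting point for the chunking step, the sketch below splits the saved markdown at heading lines. It is a minimal illustration, not the project's chunker: `split_by_headings` is a hypothetical helper, and it does not skip `#` lines inside fenced code blocks or enforce a maximum chunk size.

```python
import re
from typing import Dict, List

# A heading is one to six '#' characters, whitespace, then the title text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> List[Dict[str, str]]:
    """Split markdown into chunks, starting a new chunk at every heading line."""
    chunks: List[Dict[str, str]] = []
    title, body = "", []

    def flush() -> None:
        # Emit the current chunk unless it has neither a title nor any text.
        text = "\n".join(body).strip()
        if title or text:
            chunks.append({"title": title, "text": text})

    for line in markdown.split("\n"):
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous chunk before starting a new one
            title, body = match.group(2), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each returned dict keeps the heading text as `title`, which can be stored as metadata next to the chunk's embedding.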