# StormParse PDF Parsing Tutorial

This interactive notebook demonstrates how to use the StormParse API for PDF document parsing with practical examples.

## Setup and Installation

First, install `requests`, import the libraries we need, and configure the API connection.

In [None]:
%pip install requests

# Import required libraries
import requests
import time
import os
import re
import json
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed

# Configuration
# StormParse API configuration
API_BASE_URL = "https://live-storm-apis-parse-router.sionic.im"
API_KEY = "demo-test"  # Replace with your API key

# Headers with authorization
headers = {
    'Authorization': f'Bearer {API_KEY}',
    'Accept': '*/*'
}

# Confirm the configuration. No request is sent here, because the base URL
# might not accept GET requests.
print(f"✅ API configured for: {API_BASE_URL}")
print(f"✅ Using API key: {API_KEY[:10]}...")

Note: you may need to restart the kernel to use updated packages.
✅ API configured for: https://live-storm-apis-parse-router.sionic.im
✅ Using API key: demo_Kx8fH...
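
Hard-coding the key in the notebook is convenient for a demo, but it is easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead is shown here. The variable name `STORMPARSE_API_KEY` and the helpers `load_api_key` and `build_headers` are hypothetical names chosen for this example; only the header layout comes from the configuration cell above.

```python
import os


def load_api_key(env_var="STORMPARSE_API_KEY", default="demo-test"):
    """Return the API key from the environment, or a default if it is unset.

    The environment variable name is an assumption made for this sketch.
    """
    return os.environ.get(env_var, default)


def build_headers(api_key):
    """Build the same request headers as the configuration cell."""
    return {
        'Authorization': f'Bearer {api_key}',
        'Accept': '*/*'
    }


headers = build_headers(load_api_key())
```

With this in place, `API_KEY = load_api_key()` could replace the hard-coded assignment, and the rest of the notebook would work unchanged.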


## Helper Functions

Let's define some helper functions that we'll use throughout the tutorial.

In [19]:
def upload_document(file_path, api_key=API_KEY, api_base_url=API_BASE_URL):
    """Upload a document to StormParse API."""
    headers = {
        'Authorization': f'Bearer {api_key}',
        'Accept': '*/*'
    }
    
    with open(file_path, 'rb') as f:
        files = {'file': f}
        response = requests.post(f"{api_base_url}/api/v1/parsing/upload", files=files, headers=headers)
    
    if response.status_code != 200:
        raise Exception(f"Upload failed with status {response.status_code}")
    
    result = response.json()
    if result['result_type'] != 'SUCCESS':
        raise Exception(f"Upload error: {result['error']}")
    
    return result['success']['job_id']

def wait_for_job(job_id, api_key=API_KEY, api_base_url=API_BASE_URL, timeout=300):
    """Wait for a parsing job to complete."""
    headers = {
        'Authorization': f'Bearer {api_key}',
        'Accept': '*/*'
    }
    
    start_time = time.time()
    
    while time.time() - start_time < timeout:
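        # A per-request timeout keeps a stalled connection from blocking the loop
        # beyond the overall timeout
        response = requests.get(f"{api_base_url}/api/v1/parsing/job/{job_id}", headers=headers, timeout=30)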
        
        if response.status_code != 200:
            raise Exception(f"Status check failed with status {response.status_code}")
        
        result = response.json()
        if result['result_type'] != 'SUCCESS':
            raise Exception(f"Status error: {result['error']}")
        
        data = result['success']
        
        if data['state'] == 'COMPLETED':
            return data['pages']
        elif data['state'] == 'FAILED':
            raise Exception("Parsing job failed")
        
        time.sleep(2)
    
    raise Exception(f"Timeout after {timeout} seconds")

def parse_document(file_path, api_key=API_KEY, api_base_url=API_BASE_URL):
    """Upload and parse a document, returning the pages."""
    print(f"Uploading: {file_path}")
    job_id = upload_document(file_path, api_key, api_base_url)
    print(f"Job ID: {job_id}")
    
    print("Waiting for completion...", end="")
    pages = wait_for_job(job_id, api_key, api_base_url)
    print(" ✅ Done!")
    
    return pages

print("Helper functions loaded successfully!")

Helper functions loaded successfully!
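
Both `upload_document` and `wait_for_job` read the same response envelope: a `result_type` field, a `success` payload, and an `error` field. The sketch below factors that check into a single function so it can be tried without a network call. The name `unwrap_envelope` and the sample dictionaries are hypothetical; only the three field names come from the helpers above, and the real API may return other fields or other `result_type` values.

```python
def unwrap_envelope(result):
    """Return the 'success' payload, or raise if the envelope reports an error.

    Assumes the envelope shape used by the helper functions:
    {'result_type': 'SUCCESS', 'success': {...}} on success, with an
    'error' field otherwise.
    """
    if result.get('result_type') != 'SUCCESS':
        # .get() avoids a KeyError when the error field is missing
        raise RuntimeError(f"API error: {result.get('error', 'unknown error')}")
    return result['success']


# Hypothetical sample responses, for illustration only
ok = {'result_type': 'SUCCESS', 'success': {'job_id': 'example-job-id'}}
bad = {'result_type': 'FAIL', 'error': 'example error message'}

print(unwrap_envelope(ok)['job_id'])
try:
    unwrap_envelope(bad)
except RuntimeError as exc:
    print(exc)
```

Raising a specific exception type such as `RuntimeError` rather than a bare `Exception` lets callers tell API errors apart from unrelated failures.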


## Example 1: Basic PDF Parsing

Let's start with a simple example of parsing a PDF document.

In [20]:
def parse_simple_pdf(file_path):
    """Parse a PDF and display basic information."""
    try:
        pages = parse_document(file_path)
        
        print(f"\n✅ Successfully parsed {len(pages)} pages")
        
        # Show statistics
        total_chars = sum(len(page['content']) for page in pages)
        total_words = sum(len(page['content'].split()) for page in pages)
        
        print(f"📊 Total characters: {total_chars:,}")
        print(f"📊 Total words: {total_words:,}")
        print(f"📊 Average words per page: {total_words // len(pages) if pages else 0:,}")
        
        return pages
        
    except Exception as e:
        print(f"❌ Error parsing PDF: {e}")
        return None

# Test with actual PDF files from 99_example_docs directory
# Available PDFs:
# - 삼성 재무제표.pdf (Samsung Financial Statement)
# - 삼성 청소기 퀵 가이드.pdf (Samsung Vacuum Cleaner Quick Guide)  
# - 신라스테이_여수_호텔안내자료.pdf (Shilla Stay Yeosu Hotel Information)

sample_pdf = "../99_example_docs/삼성 재무제표.pdf"
pages = parse_simple_pdf(sample_pdf)

Uploading: ../99_example_docs/삼성 재무제표.pdf
Job ID: 6090af87-c73f-47fa-95e9-35e267fbe30d
Waiting for completion... ✅ Done!

✅ Successfully parsed 2 pages
📊 Total characters: 3,875
📊 Total words: 587
📊 Average words per page: 293
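
The statistics above depend only on the shape of `pages`: a list of dictionaries, each with a `content` string (and a `pageNumber`, used in the next example). A minimal offline sketch of the same calculation follows, using made-up pages rather than real API output. `page_stats` and `sample_pages` are hypothetical names introduced for illustration.

```python
def page_stats(pages):
    """Compute page, character, and word counts for a list of page dicts."""
    total_chars = sum(len(p['content']) for p in pages)
    # str.split() with no arguments splits on runs of whitespace,
    # so this counts whitespace-separated tokens
    total_words = sum(len(p['content'].split()) for p in pages)
    return {
        'pages': len(pages),
        'chars': total_chars,
        'words': total_words,
        # Guard against division by zero for an empty document
        'avg_words_per_page': total_words // len(pages) if pages else 0,
    }


# Made-up pages that mimic the structure returned by parse_document
sample_pages = [
    {'pageNumber': 1, 'content': 'alpha beta gamma'},
    {'pageNumber': 2, 'content': 'delta epsilon'},
]

print(page_stats(sample_pages))
```

Because the count is based on whitespace, it may differ from a linguistic word count, especially for text that is not in English.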


## Example 2: Display Page Content

Let's examine the content of individual pages.

In [21]:
def display_page_content(pages, page_num=1, preview_length=500):
    """Display content from a specific page."""
    if not pages:
        print("No pages to display")
        return
    
    if page_num > len(pages) or page_num < 1:
        print(f"Invalid page number. Document has {len(pages)} pages.")
        return
    
    page = pages[page_num - 1]
    
    print(f"\n📄 Page {page['pageNumber']} Content:")
    print("=" * 60)
    
    content = page['content']
    
    # Show preview or full content
    if len(content) > preview_length:
        print(content[:preview_length] + "...")
        print(f"\n[Showing first {preview_length} characters of {len(content)} total]")
    else:
        print(content)
    
    print("=" * 60)

# Display first page content from the parsed PDF
if 'pages' in globals() and pages:
    display_page_content(pages, page_num=1)


📄 Page 1 Content:
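This document presents the consolidated statement of financial position of Samsung Electronics Co., Ltd. and its subsidiaries. The 54th fiscal period is as of December 31, 2022, and the 53rd fiscal period is as of December 31, 2021. All amounts are in millions of Korean won (KRW).

The table compares asset and liability items for the 54th period (current) and the 53rd period (prior). The related note numbers are listed next to each item.

### Assets

#### I. Current assets

Total current assets were KRW 218,470,581 million in the 54th period, a slight increase from KRW 218,163,185 million in the 53rd period. The amounts by item are as follows.
1.  Cash and cash equivalents (Notes 4, 28): KRW 49,680,710 million in the 54th period, KRW 39,031,415 million in the 53rd period.
2.  Short-term financial instruments (Notes 4, 28): KRW 65,102,886 million in the 54th period, KRW 81,708,986 million in the 53rd period.
3.  Short-term financial assets at amortized cost (Notes 4, 28): KRW 414,610 million in the 54th period, 3,369,0...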

[Showing first 500 characters of 2547 total]


## Example 3: Batch Processing Multiple PDFs

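Process every PDF in a directory concurrently and save each result to a file. The batch function uses a thread pool (`max_workers=3` by default) and calls a `save_parsed_content(pages, base_name, save_format)` helper, which is not defined in the cells above and must be available before this cell runs.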
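
One possible shape for the `save_parsed_content` helper that the batch cell calls is sketched here. The original implementation is not shown in this section, so everything in it is an assumption: the `{base_name}_parsed_{timestamp}` file name is modeled on the saved-file names in the output further down, while the per-format layouts (page separators for `txt`, `##` headings for `md`, a raw dump for `json`) are invented for illustration.

```python
import json
from datetime import datetime


def save_parsed_content(pages, output_name, save_format='txt'):
    """Save parsed pages as txt, json, or md and return the file name.

    Hypothetical stand-in for the helper used by the batch cell.
    """
    if save_format not in ('txt', 'json', 'md'):
        raise ValueError(f"Unsupported format: {save_format}")

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f"{output_name}_parsed_{timestamp}.{save_format}"

    with open(output_file, 'w', encoding='utf-8') as f:
        if save_format == 'json':
            # ensure_ascii=False keeps Korean text readable in the file
            json.dump(pages, f, ensure_ascii=False, indent=2)
        else:
            for page in pages:
                if save_format == 'md':
                    f.write(f"## Page {page['pageNumber']}\n\n")
                else:
                    f.write(f"--- Page {page['pageNumber']} ---\n")
                f.write(page['content'])
                f.write("\n\n")

    print(f"Saved as {save_format}: {output_file}")
    return output_file
```

For example, `save_parsed_content(pages, "report", "md")` would write a file named like `report_parsed_<timestamp>.md` to the current working directory and return that name.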

In [22]:
def batch_process_and_save_pdfs(directory_path, max_workers=3, api_key=API_KEY, api_base_url=API_BASE_URL, save_format='txt'):
    """Process all PDFs in a directory and save each one."""
    # Check if directory exists
    if not os.path.exists(directory_path):
        print(f"Directory not found: {directory_path}")
        return {}
    
    # Find all PDF files
    pdf_files = [
        os.path.join(directory_path, f)
        for f in os.listdir(directory_path)
        if f.lower().endswith('.pdf')
    ]
    
    if not pdf_files:
        print(f"No PDF files found in {directory_path}")
        return {}
    
    print(f"\n📁 Found {len(pdf_files)} PDF files to process")
    for pdf in pdf_files:
        print(f"  • {os.path.basename(pdf)}")
    
    results = {}
    
    def process_and_save_single_pdf(file_path):
        """Process a single PDF file and save it."""
        try:
            # Parse the document
            pages = parse_document(file_path, api_key, api_base_url)
            
            # Save the parsed content
            base_name = os.path.splitext(os.path.basename(file_path))[0]
            saved_file = save_parsed_content(pages, base_name, save_format)
            
            return {
                'status': 'success',
                'file': os.path.basename(file_path),
                'pages': len(pages),
                'words': sum(len(p['content'].split()) for p in pages),
                'chars': sum(len(p['content']) for p in pages),
                'saved_as': saved_file,
                'content': pages  # Store the actual content if needed
            }
        except Exception as e:
            return {
                'status': 'error',
                'file': os.path.basename(file_path),
                'error': str(e)
            }
    
    # Process PDFs concurrently
    print("\n⏳ Processing and saving PDFs...")
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(process_and_save_single_pdf, pdf): pdf
            for pdf in pdf_files
        }
        
        for future in as_completed(future_to_file):
            result = future.result()
            results[result['file']] = result
            
            # Show progress
            if result['status'] == 'success':
                print(f"  ✅ {result['file']}: {result['pages']} pages, {result['words']:,} words")
                print(f"      → Saved as: {result['saved_as']}")
            else:
                print(f"  ❌ {result['file']}: {result['error']}")
    
    # Summary
    successful = sum(1 for r in results.values() if r['status'] == 'success')
    total_pages = sum(r.get('pages', 0) for r in results.values())
    total_words = sum(r.get('words', 0) for r in results.values())
    
    print(f"\n📊 Batch Processing Summary:")
    print(f"  • Processed: {successful}/{len(pdf_files)} PDFs successfully")
    print(f"  • Total pages: {total_pages:,}")
    print(f"  • Total words: {total_words:,}")
    print(f"\n💾 Saved Files:")
    for result in results.values():
        if result['status'] == 'success' and 'saved_as' in result:
            print(f"  • {result['saved_as']}")
    
    return results

# Process and save all PDFs in multiple formats
def batch_process_all_formats(directory_path, api_key=API_KEY):
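    """Process PDFs and save them in every available format.

    Note: each format triggers a separate batch run, so every PDF is
    uploaded and parsed once per format.
    """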
    formats = ['txt', 'json', 'md']
    all_results = {}
    
    for fmt in formats:
        print(f"\n{'='*60}")
        print(f"📄 Saving all PDFs as {fmt.upper()} format")
        print(f"{'='*60}")
        
        results = batch_process_and_save_pdfs(
            directory_path, 
            max_workers=3, 
            api_key=api_key,
            save_format=fmt
        )
        all_results[fmt] = results
    
    return all_results

# Example usage: Process and save all PDFs
example_dir = "../99_example_docs"

# Option 1: Save all as text files
print("\n🚀 Processing and saving all PDFs as text files...")
results_txt = batch_process_and_save_pdfs(example_dir, api_key=API_KEY, save_format='txt')

# Option 2: Save all in multiple formats (uncomment to use)
# print("\n🚀 Processing and saving all PDFs in all formats...")
# all_format_results = batch_process_all_formats(example_dir, api_key=API_KEY)

# Function to save all PDFs combined into one file
def save_all_pdfs_combined(results, output_filename="all_pdfs_combined"):
    """Combine all successfully parsed PDFs into a single file."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f"{output_filename}_{timestamp}.txt"
    
    successful_pdfs = [r for r in results.values() if r['status'] == 'success' and 'content' in r]
    
    if not successful_pdfs:
        print("No successfully parsed PDFs to combine")
        return None
    
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(f"# Combined PDF Documents\n")
        f.write(f"# Parsed on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"# Total documents: {len(successful_pdfs)}\n\n")
        f.write("="*80 + "\n\n")
        
        for result in successful_pdfs:
            f.write(f"\n{'='*80}\n")
            f.write(f"DOCUMENT: {result['file']}\n")
            f.write(f"Pages: {result['pages']}, Words: {result['words']:,}\n")
            f.write(f"{'='*80}\n\n")
            
            for page in result['content']:
                f.write(f"--- Page {page['pageNumber']} ---\n")
                f.write(page['content'])
                f.write("\n\n")
    
    print(f"✅ Combined all PDFs into: {output_file}")
    return output_file

# Combine all successfully parsed PDFs (if you want a single combined file)
if results_txt:
    combined_file = save_all_pdfs_combined(results_txt)


🚀 Processing and saving all PDFs as text files...

📁 Found 3 PDF files to process
  • 삼성 청소기 퀵 가이드.pdf
  • 신라스테이_여수_호텔안내자료.pdf
  • 삼성 재무제표.pdf

⏳ Processing and saving PDFs...
Uploading: ../99_example_docs/삼성 청소기 퀵 가이드.pdf
Uploading: ../99_example_docs/신라스테이_여수_호텔안내자료.pdf
Uploading: ../99_example_docs/삼성 재무제표.pdf
Job ID: 3a65adde-c803-4df5-b1b3-ca169098ad49
Waiting for completion...Job ID: bad2668c-4e91-425f-9798-71bd8153d0fe
Waiting for completion...Job ID: 1f588a35-792a-438c-bad7-526459fcad0e
Waiting for completion... ✅ Done!
✅ Saved as text: 신라스테이_여수_호텔안내자료_parsed_20250806_182406.txt
  ✅ 신라스테이_여수_호텔안내자료.pdf: 14 pages, 3,223 words
      → Saved as: 신라스테이_여수_호텔안내자료_parsed_20250806_182406.txt
 ✅ Done!
✅ Saved as text: 삼성 재무제표_parsed_20250806_182410.txt
  ✅ 삼성 재무제표.pdf: 2 pages, 679 words
      → Saved as: 삼성 재무제표_parsed_20250806_182410.txt
 