Skip to content

ernestofgonzalez/epub-utils

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

epub-utils

PyPI Changelog Python 3.x License

A Python library and CLI tool for inspecting ePub from the terminal.

Features

  • Complete EPUB Support - Parse both EPUB 2.0.1 and EPUB 3.0+ specifications with container, package, manifest, spine, and table of contents inspection
  • Rich Metadata Extraction - Extract Dublin Core metadata (title, author, language, publisher) with key-value, XML, and raw output formats for easy scripting
  • Content Analysis - Access document content by manifest ID or file path, with plain text extraction for content analysis and word counting
  • File System Navigation - Browse and extract any file within EPUB archives (XHTML, CSS, images, fonts) with detailed file information including sizes and compression ratios
  • Multiple Output Formats - XML with syntax highlighting, raw content, key-value pairs, plain text, and formatted tables to suit different workflows
  • CLI and Python API - Comprehensive command-line tool for terminal workflows plus a clean Python library for programmatic access
  • Standards Compliance - Built-in validation capabilities and adherence to W3C/IDPF specifications for reliable EPUB processing
  • Performance Optimized - Lazy loading, efficient ZIP parsing, and optional lxml support for handling large EPUB collections

Installation

epub-utils is available as a PyPI package

pip install epub-utils

Use as a CLI tool

The basic format is:

epub-utils EPUB_PATH COMMAND [OPTIONS]

Commands

  • container - Display the container.xml contents

    # Show container.xml with syntax highlighting
    epub-utils book.epub container
    
    # Show container.xml as raw content
    epub-utils book.epub container --format raw
    
    # Show container.xml with pretty formatting
    epub-utils book.epub container --pretty-print
  • package - Display the package OPF file contents

    # Show package.opf with syntax highlighting
    epub-utils book.epub package
    
    # Show package.opf as raw content
    epub-utils book.epub package --format raw
  • toc - Display the table of contents file contents

    # Show toc.ncx/nav.xhtml with syntax highlighting (auto-detect)
    epub-utils book.epub toc
    
    # Show toc.ncx/nav.xhtml as raw content
    epub-utils book.epub toc --format raw
    
    # Force NCX format (EPUB 2 navigation control file)
    epub-utils book.epub toc --ncx
    
    # Force Navigation Document (EPUB 3 navigation file)
    epub-utils book.epub toc --nav
  • metadata - Display the metadata information from the package file

    # Show metadata with syntax highlighting
    epub-utils book.epub metadata
    
    # Show metadata as key-value pairs
    epub-utils book.epub metadata --format kv
    
    # Show metadata with pretty formatting
    epub-utils book.epub metadata --pretty-print
  • manifest - Display the manifest information from the package file

    # Show manifest with syntax highlighting
    epub-utils book.epub manifest
    
    # Show manifest as raw content
    epub-utils book.epub manifest --format raw
  • spine - Display the spine information from the package file

    # Show spine with syntax highlighting
    epub-utils book.epub spine
    
    # Show spine as raw content
    epub-utils book.epub spine --format raw
  • content - Display the content of a document by its manifest item ID

    # Show content with syntax highlighting
    epub-utils book.epub content chapter1
    
    # Show raw HTML/XML content
    epub-utils book.epub content chapter1 --format raw
    
    # Show plain text content (HTML tags stripped)
    epub-utils book.epub content chapter1 --format plain
  • files - List all files in the EPUB archive or display content of a specific file

    # List all files in table format (default)
    epub-utils book.epub files
    
    # List all files as simple paths
    epub-utils book.epub files --format raw
    
    # Display content of a specific file by path
    epub-utils book.epub files OEBPS/chapter1.xhtml
    
    # Display XHTML file content in different formats
    epub-utils book.epub files OEBPS/chapter1.xhtml --format raw
    epub-utils book.epub files OEBPS/chapter1.xhtml --format xml --pretty-print
    epub-utils book.epub files OEBPS/chapter1.xhtml --format plain
    
    # Display non-XHTML files (CSS, images, etc.)
    epub-utils book.epub files OEBPS/styles/main.css
    epub-utils book.epub files META-INF/container.xml

Options

  • -h, --help - Show help message and exit

  • -v, --version - Show program version and exit

  • -fmt, --format - Output format (default: xml)

    • xml - Display with XML syntax highlighting (default)
    • raw - Display raw content without formatting
    • plain - Display plain text content (HTML tags stripped, for content command only)
    • kv - Display key-value pairs (where supported)
  • -pp, --pretty-print - Pretty-print XML output (applies to xml and raw formats only)

    # Display as raw content
    epub-utils book.epub package --format raw
    
    # Display with XML syntax highlighting (default)
    epub-utils book.epub package --format xml
    
    # Display as key-value pairs (for supported commands)
    epub-utils book.epub metadata --format kv
    
    # Display plain text content (content command only)
    epub-utils book.epub content chapter1 --format plain
    
    # Pretty-print XML with proper indentation
    epub-utils book.epub package --pretty-print
    
    # Combine format and pretty-print options
    epub-utils book.epub metadata --format raw --pretty-print

Use as a Python library

from epub_utils import Document

# Load an EPUB document
doc = Document("path/to/book.epub")

Basic Document Access

Access the main components of an EPUB document:

# Get container information
container = doc.container
print(container.to_xml())  # Formatted XML with syntax highlighting
print(container.to_str())  # Raw XML content

# Get package information  
package = doc.package
print(package.to_xml())    # Formatted XML with syntax highlighting
print(package.to_str())    # Raw XML content

# Get table of contents
toc = doc.toc
if toc:  # TOC might be None if not present
    print(toc.to_xml())    # Formatted XML with syntax highlighting
    print(toc.to_str())    # Raw XML content

# Access specific navigation formats
ncx = doc.ncx  # NCX format (EPUB 2 or EPUB 3 with NCX)
if ncx:
    print("NCX navigation available")
    print(ncx.to_xml())

nav = doc.nav  # Navigation Document (EPUB 3 only)
if nav:
    print("Navigation Document available")
    print(nav.to_xml())
    print(toc.to_str())    # Raw XML content

Working with Metadata

Access and format metadata information:

# Access package metadata
metadata = doc.package.metadata

# Basic Dublin Core elements
print(f"Title: {metadata.title}")
print(f"Creator: {metadata.creator}")
print(f"Identifier: {metadata.identifier}")
print(f"Language: {metadata.language}")
print(f"Publisher: {metadata.publisher}")
print(f"Date: {metadata.date}")

# Dynamic attribute access for any metadata field
isbn = getattr(metadata, 'isbn', 'Not available')
series = getattr(metadata, 'series', 'Not available')

# Get formatted metadata output
print(metadata.to_xml())     # Formatted XML with syntax highlighting
print(metadata.to_str())     # Raw XML content  
print(metadata.to_kv())      # Key-value format for easy parsing

Working with Manifest

Access the manifest to see all files in the EPUB:

# Get manifest information
manifest = doc.package.manifest

# Access all manifest items
for item in manifest.items:
    print(f"ID: {item['id']}")
    print(f"File: {item['href']}")
    print(f"Type: {item['media_type']}")
    print(f"Properties: {item['properties']}")

# Find specific items
nav_item = manifest.find_by_property('nav')
chapter = manifest.find_by_id('chapter1')
xhtml_items = manifest.find_by_media_type('application/xhtml+xml')

# Get formatted manifest output
print(manifest.to_xml())     # Formatted XML with syntax highlighting
print(manifest.to_str())     # Raw XML content

Working with Spine

Access the spine to see the reading order:

# Get spine information
spine = doc.package.spine

# Access spine properties
print(f"TOC reference: {spine.toc}")
print(f"Page progression: {spine.page_progression_direction}")

# Access spine items in reading order
for itemref in spine.itemrefs:
    print(f"ID: {itemref['idref']}")
    print(f"Linear: {itemref['linear']}")
    print(f"Properties: {itemref['properties']}")

# Find specific spine item
spine_item = spine.find_by_idref('chapter1')

# Get formatted spine output
print(spine.to_xml())        # Formatted XML with syntax highlighting
print(spine.to_str())        # Raw XML content

Content Extraction

Extract content from specific documents within the EPUB:

# Access content by manifest item ID
try:
    content = doc.find_content_by_id('chapter1')
    
    # Get content in different formats
    print(content.to_xml())      # Formatted XHTML with syntax highlighting
    print(content.to_str())      # Raw XHTML content
    print(content.to_plain())    # Plain text with HTML tags stripped
    
    # Access the parsed content tree for advanced processing
    tree = content.tree
    inner_text = content.inner_text
    
except ValueError as e:
    print(f"Content not found: {e}")

# Find publication resources by ID (for non-spine items)
try:
    resource = doc.find_pub_resource_by_id('cover-image')
except ValueError as e:
    print(f"Resource not found: {e}")

File Operations

List and access files directly by their paths in the EPUB archive:

# Get information about all files
files_info = doc.get_files_info()
for file_info in files_info:
    print(f"Path: {file_info['path']}")
    print(f"Size: {file_info['size']} bytes")
    print(f"Compressed: {file_info['compressed_size']} bytes")
    print(f"Modified: {file_info['modified']}")

# Access specific file by path
try:
    # For XHTML files, returns XHTMLContent object
    xhtml_content = doc.get_file_by_path('OEBPS/chapter1.xhtml')
    print(xhtml_content.to_xml())
    print(xhtml_content.to_plain())
    
    # For other files, returns raw string content
    css_content = doc.get_file_by_path('OEBPS/styles/main.css')
    print(css_content)
    
except ValueError as e:
    print(f"File not found: {e}")

Output Formatting Options

All document components support flexible output formatting:

# Pretty-printed XML output
print(metadata.to_str(pretty_print=True))
print(manifest.to_xml(pretty_print=True))

# Syntax highlighting can be controlled
print(package.to_xml(highlight_syntax=True))   # With highlighting (default)
print(package.to_xml(highlight_syntax=False))  # Without highlighting

Industry Standards & Compliance

epub-utils provides comprehensive support for industry-standard ePub specifications and related technologies, ensuring broad compatibility across the digital publishing ecosystem.

Supported EPUB Standards

  • EPUB 2.0.1 (IDPF, 2010)

    • Complete OPF 2.0 package document support
    • NCX navigation control file support
    • Dublin Core metadata extraction
    • Legacy EPUB compatibility
  • EPUB 3.0+ (IDPF/W3C, 2011-present)

    • EPUB 3.3 specification compliance
    • HTML5-based content documents
    • Navigation document (nav.xhtml) support
    • Enhanced accessibility features
    • Media overlays and scripting support

Metadata Standards

  • Dublin Core Metadata Initiative (DCMI)

    • Dublin Core Metadata Element Set v1.1
    • Dublin Core Metadata Terms (DCTERMS)
  • Open Packaging Format (OPF)

    • OPF 2.0 specification (EPUB 2.0.1)
    • OPF 3.0 specification (EPUB 3.0+)

The library maintains strict adherence to published specifications while providing robust handling of real-world EPUB variations commonly found in commercial and open-source reading applications.

About

A Python library and CLI tool for inspecting ePub from the terminal

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

  •  
  •