epub-utils

A Python library and CLI tool for inspecting EPUB files from the terminal.

Features

Complete EPUB Support - Parse both EPUB 2.0.1 and EPUB 3.0+ specifications with container, package, manifest, spine, and table of contents inspection
Rich Metadata Extraction - Extract Dublin Core metadata (title, author, language, publisher) with key-value, XML, and raw output formats for easy scripting
Content Analysis - Access document content by manifest ID or file path, with plain text extraction for content analysis and word counting
File System Navigation - Browse and extract any file within EPUB archives (XHTML, CSS, images, fonts) with detailed file information including sizes and compression ratios
Multiple Output Formats - XML with syntax highlighting, raw content, key-value pairs, plain text, and formatted tables to suit different workflows
CLI and Python API - Comprehensive command-line tool for terminal workflows plus a clean Python library for programmatic access
Standards Compliance - Built-in validation capabilities and adherence to W3C/IDPF specifications for reliable EPUB processing
Performance Optimized - Lazy loading, efficient ZIP parsing, and optional lxml support for handling large EPUB collections

Installation

epub-utils is available as a PyPI package:

pip install epub-utils

Use as a CLI tool

The basic format is:

epub-utils EPUB_PATH COMMAND [OPTIONS]

Commands

container - Display the container.xml contents

# Show container.xml with syntax highlighting
epub-utils book.epub container

# Show container.xml as raw content
epub-utils book.epub container --format raw

# Show container.xml with pretty formatting
epub-utils book.epub container --pretty-print
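
As background, container.xml is the entry point of an EPUB archive: it names the package document that the other commands read. The following is a minimal standard-library sketch of that lookup, not epub-utils' own implementation. The helper names `find_package_path` and `build_sample_epub` and the in-memory sample archive are assumptions made for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in the OCF specification.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_package_path(epub_file):
    """Return the path of the package document (OPF) named in container.xml.

    `epub_file` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(epub_file) as archive:
        root = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml does not declare a rootfile")
    return rootfile.get("full-path")


def build_sample_epub():
    """Build a tiny in-memory archive so the sketch runs without a real book."""
    container = (
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles>'
        "</container>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", container)
    buffer.seek(0)
    return buffer


print(find_package_path(build_sample_epub()))
```

Passing a real path such as "book.epub" instead of the sample buffer should work the same way, since `zipfile.ZipFile` accepts both.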

package - Display the package OPF file contents

# Show package.opf with syntax highlighting
epub-utils book.epub package

# Show package.opf as raw content
epub-utils book.epub package --format raw

toc - Display the table of contents file contents

# Show toc.ncx/nav.xhtml with syntax highlighting (auto-detect)
epub-utils book.epub toc

# Show toc.ncx/nav.xhtml as raw content
epub-utils book.epub toc --format raw

# Force NCX format (EPUB 2 navigation control file)
epub-utils book.epub toc --ncx

# Force Navigation Document (EPUB 3 navigation file)
epub-utils book.epub toc --nav
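
The two navigation files that --ncx and --nav select between are both declared in the package document. As a rough standard-library illustration (not necessarily how epub-utils resolves them): the EPUB 3 Navigation Document is the manifest item whose properties include nav, and the NCX is the manifest item referenced by the spine's toc attribute. `locate_navigation` and `SAMPLE_OPF` are hypothetical names used only in this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace of the OPF package document (shared by EPUB 2 and EPUB 3).
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical package document that declares both navigation files.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="chapter1"/>
  </spine>
</package>
"""


def locate_navigation(opf_xml):
    """Return the hrefs of the Navigation Document and the NCX, or None for each."""
    root = ET.fromstring(opf_xml)
    items = root.findall("opf:manifest/opf:item", OPF_NS)

    # EPUB 3: the manifest item carrying the "nav" property.
    nav_href = None
    for item in items:
        if "nav" in item.get("properties", "").split():
            nav_href = item.get("href")
            break

    # EPUB 2 (or EPUB 3 with a legacy NCX): the spine's toc attribute holds
    # the ID of the manifest item that points at the NCX file.
    ncx_href = None
    spine = root.find("opf:spine", OPF_NS)
    ncx_id = spine.get("toc") if spine is not None else None
    if ncx_id:
        for item in items:
            if item.get("id") == ncx_id:
                ncx_href = item.get("href")
                break

    return {"nav": nav_href, "ncx": ncx_href}


print(locate_navigation(SAMPLE_OPF))
```

The returned hrefs are relative to the package document, so they still need to be joined with its directory before reading the file from the archive.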

metadata - Display the metadata information from the package file

# Show metadata with syntax highlighting
epub-utils book.epub metadata

# Show metadata as key-value pairs
epub-utils book.epub metadata --format kv

# Show metadata with pretty formatting
epub-utils book.epub metadata --pretty-print
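
To make the idea of key-value metadata concrete, here is a small standard-library sketch that flattens the Dublin Core elements of a package document into (name, value) pairs. It is an illustration only: the exact layout of the --format kv output is not described here, and `dublin_core_pairs` and `SAMPLE_OPF` are made-up names.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set used inside <metadata>.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical package document with a handful of Dublin Core elements.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:12345</dc:identifier>
    <dc:title>Sample Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
</package>
"""


def dublin_core_pairs(opf_xml):
    """Return (name, value) tuples for every Dublin Core element, in document order."""
    root = ET.fromstring(opf_xml)
    prefix = "{" + DC_NS + "}"
    pairs = []
    for element in root.iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            pairs.append((name, (element.text or "").strip()))
    return pairs


for name, value in dublin_core_pairs(SAMPLE_OPF):
    print(f"{name}: {value}")
```

A list of tuples is used rather than a dict because elements such as creator may legitimately appear more than once.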

manifest - Display the manifest information from the package file

# Show manifest with syntax highlighting
epub-utils book.epub manifest

# Show manifest as raw content
epub-utils book.epub manifest --format raw

spine - Display the spine information from the package file

# Show spine with syntax highlighting
epub-utils book.epub spine

# Show spine as raw content
epub-utils book.epub spine --format raw

content - Display the content of a document by its manifest item ID

# Show content with syntax highlighting
epub-utils book.epub content chapter1

# Show raw HTML/XML content
epub-utils book.epub content chapter1 --format raw

# Show plain text content (HTML tags stripped)
epub-utils book.epub content chapter1 --format plain
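
For readers curious what stripping HTML tags involves, the sketch below does it with Python's built-in `html.parser` and then counts words. It is a simplified stand-in, not the extraction code epub-utils uses; `TextExtractor`, `to_plain_text`, and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def to_plain_text(xhtml):
    """Return the visible text of an (X)HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(xhtml)
    parser.close()
    # Joining with a space keeps block elements apart; the trade-off is that
    # inline markup inside a single word would split that word in two.
    return " ".join(" ".join(parser.parts).split())


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Chapter 1</h1><p>It was a <em>dark</em> night.</p></body></html>"
)
text = to_plain_text(sample)
print(text)               # Chapter 1 It was a dark night.
print(len(text.split()))  # 7
```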

files - List all files in the EPUB archive or display content of a specific file

# List all files in table format (default)
epub-utils book.epub files

# List all files as simple paths
epub-utils book.epub files --format raw

# Display content of a specific file by path
epub-utils book.epub files OEBPS/chapter1.xhtml

# Display XHTML file content in different formats
epub-utils book.epub files OEBPS/chapter1.xhtml --format raw
epub-utils book.epub files OEBPS/chapter1.xhtml --format xml --pretty-print
epub-utils book.epub files OEBPS/chapter1.xhtml --format plain

# Display non-XHTML files (CSS, images, etc.)
epub-utils book.epub files OEBPS/styles/main.css
epub-utils book.epub files META-INF/container.xml

Options

-h, --help - Show help message and exit
-v, --version - Show program version and exit
-fmt, --format - Output format (default: xml)
- xml - Display with XML syntax highlighting (default)
- raw - Display raw content without formatting
- plain - Display plain text content (HTML tags stripped; for the content command and for XHTML files shown with the files command)
- kv - Display key-value pairs (where supported)

-pp, --pretty-print - Pretty-print XML output (applies to xml and raw formats only)

# Display as raw content
epub-utils book.epub package --format raw

# Display with XML syntax highlighting (default)
epub-utils book.epub package --format xml

# Display as key-value pairs (for supported commands)
epub-utils book.epub metadata --format kv

# Display plain text content (HTML tags stripped)
epub-utils book.epub content chapter1 --format plain

# Pretty-print XML with proper indentation
epub-utils book.epub package --pretty-print

# Combine format and pretty-print options
epub-utils book.epub metadata --format raw --pretty-print

Use as a Python library

from epub_utils import Document

# Load an EPUB document
doc = Document("path/to/book.epub")

Basic Document Access

Access the main components of an EPUB document:

# Get container information
container = doc.container
print(container.to_xml())  # Formatted XML with syntax highlighting
print(container.to_str())  # Raw XML content

# Get package information  
package = doc.package
print(package.to_xml())    # Formatted XML with syntax highlighting
print(package.to_str())    # Raw XML content

# Get table of contents
toc = doc.toc
if toc:  # TOC might be None if not present
    print(toc.to_xml())    # Formatted XML with syntax highlighting
    print(toc.to_str())    # Raw XML content

# Access specific navigation formats
ncx = doc.ncx  # NCX format (EPUB 2 or EPUB 3 with NCX)
if ncx:
    print("NCX navigation available")
    print(ncx.to_xml())

nav = doc.nav  # Navigation Document (EPUB 3 only)
if nav:
    print("Navigation Document available")
    print(nav.to_xml())
    print(nav.to_str())    # Raw XML content

Working with Metadata

Access and format metadata information:

# Access package metadata
metadata = doc.package.metadata

# Basic Dublin Core elements
print(f"Title: {metadata.title}")
print(f"Creator: {metadata.creator}")
print(f"Identifier: {metadata.identifier}")
print(f"Language: {metadata.language}")
print(f"Publisher: {metadata.publisher}")
print(f"Date: {metadata.date}")

# Dynamic attribute access for any metadata field
isbn = getattr(metadata, 'isbn', 'Not available')
series = getattr(metadata, 'series', 'Not available')

# Get formatted metadata output
print(metadata.to_xml())     # Formatted XML with syntax highlighting
print(metadata.to_str())     # Raw XML content  
print(metadata.to_kv())      # Key-value format for easy parsing

Working with Manifest

Access the manifest to see all files in the EPUB:

# Get manifest information
manifest = doc.package.manifest

# Access all manifest items
for item in manifest.items:
    print(f"ID: {item['id']}")
    print(f"File: {item['href']}")
    print(f"Type: {item['media_type']}")
    print(f"Properties: {item['properties']}")

# Find specific items
nav_item = manifest.find_by_property('nav')
chapter = manifest.find_by_id('chapter1')
xhtml_items = manifest.find_by_media_type('application/xhtml+xml')

# Get formatted manifest output
print(manifest.to_xml())     # Formatted XML with syntax highlighting
print(manifest.to_str())     # Raw XML content

Working with Spine

Access the spine to see the reading order:

# Get spine information
spine = doc.package.spine

# Access spine properties
print(f"TOC reference: {spine.toc}")
print(f"Page progression: {spine.page_progression_direction}")

# Access spine items in reading order
for itemref in spine.itemrefs:
    print(f"ID: {itemref['idref']}")
    print(f"Linear: {itemref['linear']}")
    print(f"Properties: {itemref['properties']}")

# Find specific spine item
spine_item = spine.find_by_idref('chapter1')

# Get formatted spine output
print(spine.to_xml())        # Formatted XML with syntax highlighting
print(spine.to_str())        # Raw XML content
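
The spine only lists manifest IDs, so turning it into a list of files means looking each idref up in the manifest. A minimal sketch of that join follows, using plain dictionaries shaped like the items shown above. `reading_order` is a hypothetical helper, and the "yes"/"no" strings for linear mirror the OPF attribute values, which is an assumption about how the library exposes them.

```python
def reading_order(manifest_items, spine_itemrefs, include_non_linear=False):
    """Map spine itemrefs to manifest hrefs, preserving the spine's order."""
    href_by_id = {item["id"]: item["href"] for item in manifest_items}
    order = []
    for itemref in spine_itemrefs:
        # linear="no" marks auxiliary content outside the main reading flow.
        if itemref.get("linear") == "no" and not include_non_linear:
            continue
        idref = itemref["idref"]
        if idref not in href_by_id:
            raise ValueError(f"spine references unknown manifest item: {idref}")
        order.append(href_by_id[idref])
    return order


# Sample data standing in for manifest.items and spine.itemrefs.
manifest_items = [
    {"id": "cover", "href": "cover.xhtml"},
    {"id": "chapter1", "href": "chapter1.xhtml"},
    {"id": "notes", "href": "notes.xhtml"},
]
spine_itemrefs = [
    {"idref": "cover", "linear": "yes"},
    {"idref": "chapter1", "linear": "yes"},
    {"idref": "notes", "linear": "no"},
]

print(reading_order(manifest_items, spine_itemrefs))
```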

Content Extraction

Extract content from specific documents within the EPUB:

# Access content by manifest item ID
try:
    content = doc.find_content_by_id('chapter1')
    
    # Get content in different formats
    print(content.to_xml())      # Formatted XHTML with syntax highlighting
    print(content.to_str())      # Raw XHTML content
    print(content.to_plain())    # Plain text with HTML tags stripped
    
    # Access the parsed content tree for advanced processing
    tree = content.tree
    inner_text = content.inner_text
    
except ValueError as e:
    print(f"Content not found: {e}")

# Find publication resources by ID (for non-spine items)
try:
    resource = doc.find_pub_resource_by_id('cover-image')
except ValueError as e:
    print(f"Resource not found: {e}")

File Operations

List and access files directly by their paths in the EPUB archive:

# Get information about all files
files_info = doc.get_files_info()
for file_info in files_info:
    print(f"Path: {file_info['path']}")
    print(f"Size: {file_info['size']} bytes")
    print(f"Compressed: {file_info['compressed_size']} bytes")
    print(f"Modified: {file_info['modified']}")

# Access specific file by path
try:
    # For XHTML files, returns XHTMLContent object
    xhtml_content = doc.get_file_by_path('OEBPS/chapter1.xhtml')
    print(xhtml_content.to_xml())
    print(xhtml_content.to_plain())
    
    # For other files, returns raw string content
    css_content = doc.get_file_by_path('OEBPS/styles/main.css')
    print(css_content)
    
except ValueError as e:
    print(f"File not found: {e}")
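
The size and compressed-size figures come straight from the ZIP directory, and a compression ratio can be derived from them. The sketch below shows one way to compute it with the standard `zipfile` module. The `ratio` key, the `list_archive_entries` helper, and the in-memory sample archive are illustrative assumptions and not part of the epub-utils API.

```python
import io
import zipfile


def list_archive_entries(epub_file):
    """Return path, sizes, and space saved (0.0 to 1.0) for each archive entry."""
    entries = []
    with zipfile.ZipFile(epub_file) as archive:
        for info in archive.infolist():
            # Guard against empty files to avoid dividing by zero.
            ratio = 1 - info.compress_size / info.file_size if info.file_size else 0.0
            entries.append(
                {
                    "path": info.filename,
                    "size": info.file_size,
                    "compressed_size": info.compress_size,
                    "ratio": ratio,
                }
            )
    return entries


# Build a small in-memory archive so the sketch runs without a real book.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # EPUB stores the mimetype entry uncompressed.
    archive.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
    archive.writestr(
        "OEBPS/chapter1.xhtml",
        "<p>hello</p>" * 200,
        compress_type=zipfile.ZIP_DEFLATED,
    )
buffer.seek(0)

entries = list_archive_entries(buffer)
for entry in entries:
    print(f"{entry['path']}: {entry['size']} bytes, {entry['ratio']:.0%} saved")
```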

Output Formatting Options

All document components support flexible output formatting:

# Pretty-printed XML output
print(metadata.to_str(pretty_print=True))
print(manifest.to_xml(pretty_print=True))

# Syntax highlighting can be controlled
print(package.to_xml(highlight_syntax=True))   # With highlighting (default)
print(package.to_xml(highlight_syntax=False))  # Without highlighting
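
As a point of comparison, pretty-printing XML can be done with the standard library alone. This sketch uses `xml.etree.ElementTree.indent`, which requires Python 3.9 or later. It is not the formatter epub-utils uses, and `pretty_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def pretty_xml(xml_text, indent="  "):
    """Re-indent an XML string so that each element sits on its own line."""
    root = ET.fromstring(xml_text)
    ET.indent(root, space=indent)  # Modifies the tree in place (Python 3.9+).
    return ET.tostring(root, encoding="unicode")


compact = '<spine toc="ncx"><itemref idref="cover"/><itemref idref="chapter1"/></spine>'
formatted = pretty_xml(compact)
print(formatted)
```

Note that ElementTree re-serializes the document, so details such as attribute quoting or namespace prefixes may differ from the original file.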

Industry Standards & Compliance

epub-utils supports the published EPUB specifications and the metadata standards they rely on, for compatibility across the digital publishing ecosystem.

Supported EPUB Standards

EPUB 2.0.1 (IDPF, 2010)
- Complete OPF 2.0 package document support
- NCX navigation control file support
- Dublin Core metadata extraction
- Legacy EPUB compatibility

EPUB 3.0+ (IDPF/W3C, 2011-present)
- EPUB 3.3 specification compliance
- HTML5-based content documents
- Navigation document (nav.xhtml) support
- Enhanced accessibility features
- Media overlays and scripting support

Metadata Standards

Dublin Core Metadata Initiative (DCMI)
- Dublin Core Metadata Element Set v1.1
- Dublin Core Metadata Terms (DCTERMS)

Open Packaging Format (OPF)
- OPF 2.0 specification (EPUB 2.0.1)
- OPF 3.0 specification (EPUB 3.0+)

The library maintains strict adherence to published specifications while providing robust handling of real-world EPUB variations commonly found in commercial and open-source reading applications.
