[format] ParquetInputStream doesn't support vectored reads #6657

@tchivs

Description

ParquetInputStream currently doesn't override readVectored(). Because it extends DelegatingSeekableInputStream, calls fall back to the default implementation inherited from org.apache.parquet.io.SeekableInputStream, which throws UnsupportedOperationException.

Problem

When ParquetFileReader attempts to use vectored reads for improved I/O performance, the operation fails because:

  1. ParquetFileReader calls readVectored() on the underlying stream (line 667 in ParquetFileReader.java)
  2. The stream is typed as org.apache.parquet.io.SeekableInputStream, whose default readVectored() throws UnsupportedOperationException
  3. ParquetInputStream (Paimon's wrapper) extends DelegatingSeekableInputStream but doesn't override readVectored()
  4. The call therefore reaches the throwing default, and the vectored read never happens
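
The failure chain above can be reproduced in miniature. The following is a minimal sketch using stand-in classes (BaseStreamSketch, DelegatingStreamSketch, and WrapperStreamSketch are hypothetical names, not the real Parquet or Paimon types); it only illustrates how a wrapper that never overrides the method ends up on the throwing default.

```java
import java.util.Collections;
import java.util.List;

// Stand-in for a base stream whose vectored read is unsupported by default.
abstract class BaseStreamSketch {
    void readVectored(List<long[]> ranges) {
        throw new UnsupportedOperationException("Vectored IO is not supported");
    }
}

// Stand-in for a delegating stream that does not override readVectored().
class DelegatingStreamSketch extends BaseStreamSketch {}

// Stand-in for the format wrapper; it inherits the throwing default as well.
class WrapperStreamSketch extends DelegatingStreamSketch {}

final class InheritedDefaultDemo {
    // Returns true when the inherited default rejects the vectored read.
    static boolean vectoredReadFails() {
        try {
            // One range encoded as {offset, length}, purely for illustration.
            new WrapperStreamSketch()
                    .readVectored(Collections.singletonList(new long[] {0L, 16L}));
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```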

Impact

  • Performance: Cannot leverage Parquet's vectored read optimization for parallel I/O
  • Efficiency: Falls back to sequential reads even when the underlying FileIO supports vectored reads
  • Cloud Storage: Missing optimization opportunities for S3, OSS, and other cloud storage systems that benefit from batch reads

Root Cause

The gap exists between two interface systems:

Paimon's Interface:

  • VectoredReadable interface with readVectored(List<FileRange>)
  • FileRange uses CompletableFuture<byte[]> for async results

Parquet's Interface:

  • SeekableInputStream.readVectored(List<ParquetFileRange>, ByteBufferAllocator)
  • ParquetFileRange uses CompletableFuture<ByteBuffer> for async results

ParquetInputStream needs to bridge these two interfaces.
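
The core of the type mismatch is the future's payload. As a minimal sketch using only JDK classes (FutureBridge and toByteBuffer are hypothetical names for illustration), a byte[] future can be adapted to a ByteBuffer future without blocking:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

final class FutureBridge {
    // Adapts a byte[] future to a ByteBuffer future. The array is wrapped,
    // not copied, and an exceptional completion of the source propagates
    // to the returned future.
    static CompletableFuture<ByteBuffer> toByteBuffer(CompletableFuture<byte[]> bytes) {
        return bytes.thenApply(ByteBuffer::wrap);
    }
}
```

Wrapping produces a heap buffer backed by the original array. If the caller requires buffers from a specific allocator, a copy into an allocated buffer would be needed instead; that choice is left open here.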

Proposed Solution

Implement readVectored() in ParquetInputStream to:

  1. Check capability: Detect if underlying stream supports VectoredReadable
  2. Convert ranges: Transform ParquetFileRange to FileRange
  3. Delegate to Paimon: Use Paimon's VectoredReadable.readVectored()
  4. Transform data: Convert CompletableFuture<byte[]> to CompletableFuture<ByteBuffer>
  5. Fallback: Use serial reads when vectored reads are unavailable
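
The five steps above could look roughly like the following. This is a sketch under stated assumptions, not the actual patch: every type here (SketchFileRange, SketchParquetRange, SketchStream, SketchVectoredReadable, SketchInMemoryStream) is a hypothetical stand-in for the corresponding Paimon or Parquet class so the example compiles against the JDK alone, and the ByteBufferAllocator parameter of the real Parquet signature is omitted.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for Paimon's FileRange (byte[] future).
final class SketchFileRange {
    final long offset;
    final int length;
    final CompletableFuture<byte[]> data = new CompletableFuture<>();

    SketchFileRange(long offset, int length) {
        this.offset = offset;
        this.length = length;
    }
}

// Hypothetical stand-in for Parquet's ParquetFileRange (ByteBuffer future).
final class SketchParquetRange {
    final long offset;
    final int length;
    CompletableFuture<ByteBuffer> dataReadFuture;

    SketchParquetRange(long offset, int length) {
        this.offset = offset;
        this.length = length;
    }
}

// Hypothetical stand-in for the wrapped stream's positioned serial read.
interface SketchStream {
    void readFully(long position, byte[] dst) throws IOException;
}

// Hypothetical stand-in for Paimon's VectoredReadable capability.
interface SketchVectoredReadable {
    void readVectored(List<SketchFileRange> ranges) throws IOException;
}

final class VectoredBridgeSketch {
    static void readVectored(SketchStream in, List<SketchParquetRange> ranges)
            throws IOException {
        if (in instanceof SketchVectoredReadable) { // step 1: check capability
            List<SketchFileRange> converted = new ArrayList<>(ranges.size());
            for (SketchParquetRange r : ranges) {
                // step 2: convert the range
                SketchFileRange fr = new SketchFileRange(r.offset, r.length);
                // step 4: expose the byte[] result as a ByteBuffer future
                r.dataReadFuture = fr.data.thenApply(ByteBuffer::wrap);
                converted.add(fr);
            }
            // step 3: delegate the whole batch
            ((SketchVectoredReadable) in).readVectored(converted);
        } else { // step 5: serial fallback, one positioned read per range
            for (SketchParquetRange r : ranges) {
                byte[] buf = new byte[r.length];
                in.readFully(r.offset, buf);
                r.dataReadFuture = CompletableFuture.completedFuture(ByteBuffer.wrap(buf));
            }
        }
    }
}

// In-memory test double that supports both the serial and the vectored path.
final class SketchInMemoryStream implements SketchStream, SketchVectoredReadable {
    private final byte[] bytes;
    int vectoredCalls = 0;

    SketchInMemoryStream(byte[] bytes) {
        this.bytes = bytes;
    }

    @Override
    public void readFully(long position, byte[] dst) {
        System.arraycopy(bytes, (int) position, dst, 0, dst.length);
    }

    @Override
    public void readVectored(List<SketchFileRange> ranges) {
        vectoredCalls++;
        for (SketchFileRange r : ranges) {
            byte[] buf = new byte[r.length];
            readFully(r.offset, buf);
            r.data.complete(buf);
        }
    }
}
```

In this sketch the ByteBuffer futures are attached before the batch is delegated, so each Parquet-side range completes as soon as its Paimon-side counterpart does. The fallback branch completes every future eagerly, which keeps callers on a single code path regardless of what the underlying FileIO supports.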

Benefits

  • Performance: Enable vectored reads for Parquet files in Paimon
  • Compatibility: Work with both vectored and non-vectored FileIO implementations
  • Cloud Optimization: Better I/O performance on S3, OSS, Azure, GCS
  • Backward Compatible: Graceful fallback for older FileIO implementations

Testing

Comprehensive test coverage will include:

  • Vectored reads with VectoredReadable support
  • Fallback to serial reads without vectored support
  • Empty ranges handling
  • End-to-end testing with real Parquet files

Related Files

  • paimon-format/src/main/java/org/apache/paimon/format/parquet/ParquetInputStream.java
  • paimon-format/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
  • paimon-common/src/main/java/org/apache/paimon/fs/VectoredReadable.java
  • paimon-common/src/main/java/org/apache/paimon/fs/FileRange.java
