Description
ParquetInputStream currently doesn't override the readVectored() method from its parent class DelegatingSeekableInputStream, causing it to fall back to the default implementation that throws UnsupportedOperationException.
Problem
When ParquetFileReader attempts to use vectored reads for improved I/O performance, the operation fails because:
- ParquetFileReader calls readVectored() on the underlying stream (line 667 in ParquetFileReader.java)
- The stream is org.apache.parquet.io.SeekableInputStream, whose default implementation throws UnsupportedOperationException
- ParquetInputStream (Paimon's wrapper) extends DelegatingSeekableInputStream but doesn't override readVectored()
- This causes the exception to be thrown, preventing vectored reads from working
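The failure mode can be reproduced in isolation. The sketch below uses hypothetical stand-in classes (BaseStream and Wrapper are illustrative names, not the real Parquet or Paimon types) to show how a wrapper that inherits a throwing default ends up rejecting vectored reads:

```java
import java.util.Collections;
import java.util.List;

public class VectoredReadFailureSketch {

    // Hypothetical stand-in for a stream base class whose vectored read is opt-in:
    // the default implementation throws instead of reading.
    static abstract class BaseStream {
        void readVectored(List<long[]> ranges) {
            throw new UnsupportedOperationException("Vectored IO is not supported");
        }
    }

    // Hypothetical stand-in for a wrapper that does not override readVectored(),
    // so it inherits the throwing default.
    static class Wrapper extends BaseStream { }

    // Returns true when the inherited default rejects the vectored read.
    static boolean vectoredReadFails() {
        try {
            new Wrapper().readVectored(Collections.singletonList(new long[] {0L, 16L}));
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("vectored read fails: " + vectoredReadFails());
    }
}
```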
Impact
- Performance: Cannot leverage Parquet's vectored read optimization for parallel I/O
- Efficiency: Falls back to sequential reads even when the underlying FileIO supports vectored reads
- Cloud Storage: Missing optimization opportunities for S3, OSS, and other cloud storage systems that benefit from batch reads
Root Cause
The gap exists between two interface systems:
Paimon's Interface:
- VectoredReadable interface with readVectored(List<FileRange>)
- FileRange uses CompletableFuture<byte[]> for async results
Parquet's Interface:
- SeekableInputStream.readVectored(List<ParquetFileRange>, ByteBufferAllocator)
- ParquetFileRange uses CompletableFuture<ByteBuffer> for async results
ParquetInputStream needs to bridge these two interfaces.
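The core of the mismatch is the result type. A minimal sketch of one way to adapt it, using only the JDK (the class and method names here are illustrative, not part of either project):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

public class FutureBridgeSketch {

    // Adapts a byte[] future to a ByteBuffer future. ByteBuffer.wrap shares the
    // backing array, so no copy is made when the source future completes.
    static CompletableFuture<ByteBuffer> toByteBufferFuture(CompletableFuture<byte[]> source) {
        return source.thenApply(ByteBuffer::wrap);
    }

    public static void main(String[] args) {
        CompletableFuture<byte[]> paimonSide = new CompletableFuture<>();
        CompletableFuture<ByteBuffer> parquetSide = toByteBufferFuture(paimonSide);

        // Completing the byte[] future completes the derived ByteBuffer future.
        paimonSide.complete("PAR1".getBytes(StandardCharsets.US_ASCII));
        System.out.println("bytes available: " + parquetSide.join().remaining());
    }
}
```

This sketch ignores the ByteBufferAllocator argument and always produces a heap buffer. Whether the real implementation should copy into an allocator-provided buffer instead is a design decision this example does not settle.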
Proposed Solution
Implement readVectored() in ParquetInputStream to:
- Check capability: Detect if underlying stream supports VectoredReadable
- Convert ranges: Transform ParquetFileRange to FileRange
- Delegate to Paimon: Use Paimon's VectoredReadable.readVectored()
- Transform data: Convert CompletableFuture<byte[]> to CompletableFuture<ByteBuffer>
- Fallback: Use serial reads when vectored reads are unavailable
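The steps above can be sketched end to end as follows. Every type here is a hypothetical stand-in: PaimonRange and ParquetRange play the roles of FileRange and ParquetFileRange, and Source and VectoredSource play the roles of the underlying stream and VectoredReadable. The real classes have their own accessors and signatures, so treat this as an outline of the control flow rather than the actual patch:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ReadVectoredBridgeSketch {

    // Stand-in for Paimon's range type: results arrive as byte[].
    static final class PaimonRange {
        final long offset;
        final int length;
        final CompletableFuture<byte[]> data = new CompletableFuture<>();
        PaimonRange(long offset, int length) { this.offset = offset; this.length = length; }
    }

    // Stand-in for Parquet's range type: results are expected as ByteBuffer.
    static final class ParquetRange {
        final long offset;
        final int length;
        CompletableFuture<ByteBuffer> data;
        ParquetRange(long offset, int length) { this.offset = offset; this.length = length; }
    }

    // Stand-in for a stream that supports positioned reads only.
    interface Source { byte[] read(long offset, int length); }

    // Stand-in for a stream that additionally supports vectored reads.
    interface VectoredSource extends Source { void readVectored(List<PaimonRange> ranges); }

    static void readVectored(Source in, List<ParquetRange> ranges) {
        if (ranges.isEmpty()) {
            return; // nothing to read
        }
        if (in instanceof VectoredSource) {
            // Convert ranges, chain the result futures, then delegate once.
            List<PaimonRange> converted = new ArrayList<>(ranges.size());
            for (ParquetRange r : ranges) {
                PaimonRange p = new PaimonRange(r.offset, r.length);
                r.data = p.data.thenApply(ByteBuffer::wrap);
                converted.add(p);
            }
            ((VectoredSource) in).readVectored(converted);
        } else {
            // Fallback: read each range serially and hand back completed futures.
            for (ParquetRange r : ranges) {
                r.data = CompletableFuture.completedFuture(
                        ByteBuffer.wrap(in.read(r.offset, r.length)));
            }
        }
    }

    // In-memory source without vectored support, for demonstration only.
    static class MemorySource implements Source {
        final byte[] bytes;
        MemorySource(byte[] bytes) { this.bytes = bytes; }
        public byte[] read(long offset, int length) {
            return Arrays.copyOfRange(bytes, (int) offset, (int) offset + length);
        }
    }

    // In-memory source with vectored support; counts how often it is used.
    static class VectoredMemorySource extends MemorySource implements VectoredSource {
        int vectoredCalls = 0;
        VectoredMemorySource(byte[] bytes) { super(bytes); }
        public void readVectored(List<PaimonRange> ranges) {
            vectoredCalls++;
            for (PaimonRange r : ranges) {
                r.data.complete(read(r.offset, r.length));
            }
        }
    }

    public static void main(String[] args) {
        byte[] file = {10, 11, 12, 13, 14, 15};
        List<ParquetRange> ranges = Arrays.asList(new ParquetRange(1, 2), new ParquetRange(4, 2));
        VectoredMemorySource source = new VectoredMemorySource(file);
        readVectored(source, ranges);
        System.out.println("vectored calls: " + source.vectoredCalls);
        System.out.println("first byte of range 0: " + ranges.get(0).data.join().get(0));
    }
}
```

Chaining the futures before delegating means the caller sees each range complete as soon as the underlying read finishes, without blocking inside readVectored() itself.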
Benefits
✅ Performance: Enable vectored reads for Parquet files in Paimon
✅ Compatibility: Work with both vectored and non-vectored FileIO implementations
✅ Cloud Optimization: Better I/O performance on S3, OSS, Azure, GCS
✅ Backward Compatible: Graceful fallback for older FileIO implementations
Testing
Comprehensive test coverage will include:
- Vectored reads with VectoredReadable support
- Fallback to serial reads without vectored support
- Empty ranges handling
- End-to-end testing with real Parquet files
Related Files
- paimon-format/src/main/java/org/apache/paimon/format/parquet/ParquetInputStream.java
- paimon-format/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
- paimon-common/src/main/java/org/apache/paimon/fs/VectoredReadable.java
- paimon-common/src/main/java/org/apache/paimon/fs/FileRange.java