fix: reduce catalog round-trips in IcebergDocument.hasNext() to improve result read performance #4293
kunwp1 wants to merge 4 commits into apache:main
@Xiao-zhen-Liu @bobbai00 Please review it.
bobbai00
left a comment
Scala side LGTM! Can you also check the corresponding logic in the Python side's Iceberg document, change it, and test it?
I checked the Python side, but it seems we don't have any corresponding logic there. Can you confirm?
Under this folder: https://github.com/apache/texera/tree/main/amber/src/main/python/core/storage/iceberg
I checked those files and they don't contain the problematic logic. It seems the implementation on the Python side is different, so we don't have to fix the Python side.
What changes were proposed in this PR?
This PR addresses #4289 by optimizing IcebergDocument.hasNext() to minimize redundant catalog round-trips. By introducing a guard condition, we ensure that seekToUsableFile() and its subsequent catalog calls are triggered only when the current record iterator is fully exhausted:
- If the current record iterator still has records, return true immediately.
- Otherwise, move on to the next file from usableFileIterator.
- Only if usableFileIterator is also empty, call seekToUsableFile().
Any related issues, documentation, discussions?
Fix #4289
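For readers following the linked issue, the guard order described in the proposed changes can be sketched as below. This is a minimal, self-contained illustration and not the actual IcebergDocument code: the class name GuardedRecordIterator, the loadFilesFromCatalog parameter, the catalogCalls counter, and the assumption that seekToUsableFile() returns an iterator over files are all hypothetical. Only hasNext, usableFileIterator, and seekToUsableFile are names taken from the PR description.

```scala
// Hypothetical stand-in for the reader inside IcebergDocument.
// Each "file" is modeled as a Seq of records; loadFilesFromCatalog simulates
// one catalog round-trip that returns the files that became visible.
class GuardedRecordIterator[T](loadFilesFromCatalog: () => Seq[Seq[T]])
    extends Iterator[T] {

  // Counts simulated catalog round-trips (illustration only).
  var catalogCalls: Int = 0

  private var usableFileIterator: Iterator[Seq[T]] = Iterator.empty
  private var currentRecordIterator: Iterator[T] = Iterator.empty

  // Assumed shape: one catalog round-trip yielding an iterator over files.
  private def seekToUsableFile(): Iterator[Seq[T]] = {
    catalogCalls += 1
    loadFilesFromCatalog().iterator
  }

  // Called only once the current record iterator is exhausted.
  @scala.annotation.tailrec
  private def advance(): Boolean =
    if (usableFileIterator.hasNext) {
      // Prefer files that are already known before touching the catalog.
      currentRecordIterator = usableFileIterator.next().iterator
      if (currentRecordIterator.hasNext) true else advance()
    } else {
      // Both iterators are empty: only now pay for a catalog round-trip.
      usableFileIterator = seekToUsableFile()
      if (usableFileIterator.hasNext) advance() else false
    }

  override def hasNext: Boolean =
    // Guard: records remain in the current file, so skip all catalog work.
    if (currentRecordIterator.hasNext) true else advance()

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("no more records")
    currentRecordIterator.next()
  }
}
```

In this sketch, the number of simulated catalog calls depends on how often both iterators run dry, not on the number of records read, which is the effect the guard is meant to achieve.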
How was this PR tested?
Set storage.iceberg.table.commit.batch-size to 1M (matching the total record count).
Was this PR authored or co-authored using generative AI tooling?
No.