Skip to content

fix: reduce catalog round-trips in IcebergDocument.hasNext() to improve result read performance#4293

Open
kunwp1 wants to merge 4 commits intoapache:mainfrom
kunwp1:chris-fix-4289
Open

fix: reduce catalog round-trips in IcebergDocument.hasNext() to improve result read performance#4293
kunwp1 wants to merge 4 commits intoapache:mainfrom
kunwp1:chris-fix-4289

Conversation

@kunwp1
Copy link
Contributor

@kunwp1 kunwp1 commented Mar 13, 2026

What changes were proposed in this PR?

This PR addresses #4289 by optimizing IcebergDocument.hasNext() to minimize redundant catalog round-trips. By introducing a guard condition, we ensure seekToUsableFile() and its subsequent catalog calls are only triggered when the current record iterator is fully exhausted.

  1. If the current file has more records, return true immediately.
  2. Only if the current file is exhausted, check usableFileIterator.
  3. Only if usableFileIterator is also empty, call seekToUsableFile().

Any related issues, documentation, discussions?

Fix #4289

How was this PR tested?

  1. Import and use Untitled workflow (9).json.
  2. Use a CSV file containing 1M records.
  3. Set storage.iceberg.table.commit.batch-size to 1M (matching the total record count).
  4. Compare the performance before fix and after fix. For me it was 2m 45s vs 36s.

Was this PR authored or co-authored using generative AI tooling?

No.

@kunwp1 kunwp1 requested a review from bobbai00 March 13, 2026 21:49
@kunwp1 kunwp1 self-assigned this Mar 13, 2026
@chenlica
Copy link
Contributor

@Xiao-zhen-Liu @bobbai00 Please review it.

@bobbai00 bobbai00 changed the title fix: Optimized IcebergDocument.hasNext() to reduce catalog round-trips fix: reduce catalog round-trips in IcebergDocument.hasNext() to improve result writes performance Mar 14, 2026
Copy link
Contributor

@bobbai00 bobbai00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scala side LGTM! Can you also check the python side's iceberg document's corresponding logic, change it and teset it?

@kunwp1
Copy link
Contributor Author

kunwp1 commented Mar 16, 2026

Scala side LGTM! Can you also check the python side's iceberg document's corresponding logic, change it and teset it?

I checked the python side but seems like we don't have such corresponding logic. Can you confirm?

@kunwp1 kunwp1 changed the title fix: reduce catalog round-trips in IcebergDocument.hasNext() to improve result writes performance fix: reduce catalog round-trips in IcebergDocument.hasNext() to improve result read performance Mar 16, 2026
@bobbai00
Copy link
Contributor

Scala side LGTM! Can you also check the python side's iceberg document's corresponding logic, change it and teset it?

I checked the python side but seems like we don't have such corresponding logic. Can you confirm?

Under this folder: https://github.com/apache/texera/tree/main/amber/src/main/python/core/storage/iceberg

For example:
https://github.com/apache/texera/blob/main/amber/src/main/python/core/storage/iceberg/iceberg_document.py and

@kunwp1
Copy link
Contributor Author

kunwp1 commented Mar 16, 2026

Scala side LGTM! Can you also check the python side's iceberg document's corresponding logic, change it and teset it?

I checked the python side but seems like we don't have such corresponding logic. Can you confirm?

Under this folder: https://github.com/apache/texera/tree/main/amber/src/main/python/core/storage/iceberg

For example: https://github.com/apache/texera/blob/main/amber/src/main/python/core/storage/iceberg/iceberg_document.py and

I checked those files and they don't have the problematic logic. Seems like the implementation on the python side is different. We don't have to fix the python side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Excessive Catalog Round-trips in IcebergDocument.hasNext()

3 participants