fix: fix puffin file reader by chenzl25 · Pull Request #2513 · apache/iceberg-rust

chenzl25 · 2026-05-26T11:27:04Z

What

Fix deletion vector reads from Puffin files by using the manifest-provided blob range for direct access.

For Puffin position delete files, Iceberg manifest entries carry content_offset, content_size_in_bytes, and referenced_data_file. These identify the deletion-vector blob inside the Puffin file. The reader now uses that range directly instead of first parsing the Puffin footer.

Why

The previous path tried to parse Puffin file metadata before reading the deletion-vector blob. That fails when the read path is already scoped to the DV blob range, because the blob payload does not start with the Puffin file magic PFA1.

This could produce errors like:

Bad magic value: [1, 0, 0, 0] should be [80, 70, 65, 49]

Spark handles deletion vectors through the manifest-provided blob offset/size, so this aligns iceberg-rust with the Iceberg direct-access model for deletion vectors.
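
To make the failure mode concrete, here is a minimal sketch of a magic check that produces an error of the same shape as the one quoted above. `PUFFIN_MAGIC` and `check_puffin_magic` are hypothetical names for illustration, not the identifiers used in iceberg-rust. The only facts taken from the description are the expected bytes `[80, 70, 65, 49]` (ASCII `PFA1`) and the format of the message.

```rust
/// Magic bytes expected at the start of a Puffin file: ASCII "PFA1".
const PUFFIN_MAGIC: [u8; 4] = [0x50, 0x46, 0x41, 0x31];

/// Hypothetical helper (not the iceberg-rust API): checks whether `bytes`
/// begins with the Puffin magic and reports a mismatch in the same shape
/// as the error quoted in the description.
fn check_puffin_magic(bytes: &[u8]) -> Result<(), String> {
    match bytes.get(..4) {
        Some(head) if head == &PUFFIN_MAGIC[..] => Ok(()),
        Some(head) => Err(format!(
            "Bad magic value: {:?} should be {:?}",
            head, PUFFIN_MAGIC
        )),
        None => Err("input shorter than 4 bytes".to_string()),
    }
}
```

A buffer that holds only the blob range begins with blob bytes rather than `PFA1`, so a check like this rejects it even though the blob itself is valid.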

Changes:

Add referenced_data_file to FileScanTask.
Propagate referenced_data_file from delete manifest entries into scan tasks.
For Puffin deletion vectors, read content_offset..content_offset + content_size_in_bytes directly.
Construct a deletion-vector-v1 blob from the direct range and parse it with DeleteVector::from_puffin_blob.
Keep the existing Puffin footer parsing path as fallback when referenced_data_file is unavailable.
Use path#offset:length as the positional delete load key for Puffin files, so multiple DV blobs in one Puffin file are handled independently.
Add a focused test covering direct blob-range reads from a real Puffin file.
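
The range computation, the fallback decision, and the `path#offset:length` load key from the list above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `DeleteFileEntry`, `DvSource`, `dv_source`, and `load_key` are hypothetical names, offsets are modeled as `u64` for simplicity, and the sketch also falls back to the footer path when the offset or size is missing or the end offset would overflow, which the description does not specify.

```rust
use std::ops::Range;

/// Hypothetical subset of the manifest-entry fields named in the description.
struct DeleteFileEntry {
    file_path: String,
    referenced_data_file: Option<String>,
    content_offset: Option<u64>,
    content_size_in_bytes: Option<u64>,
}

/// How the reader should locate the deletion-vector blob.
#[derive(Debug, PartialEq)]
enum DvSource {
    /// Read exactly this byte range from the Puffin file.
    DirectRange(Range<u64>),
    /// Parse the Puffin footer to find the blob (the pre-existing path).
    FooterFallback,
}

/// Chooses direct access when the manifest entry carries everything needed.
fn dv_source(entry: &DeleteFileEntry) -> DvSource {
    match (
        &entry.referenced_data_file,
        entry.content_offset,
        entry.content_size_in_bytes,
    ) {
        (Some(_), Some(offset), Some(size)) => match offset.checked_add(size) {
            // content_offset..content_offset + content_size_in_bytes
            Some(end) => DvSource::DirectRange(offset..end),
            // Assumption: treat an overflowing range like missing metadata.
            None => DvSource::FooterFallback,
        },
        _ => DvSource::FooterFallback,
    }
}

/// Builds the load key: `path#offset:length` for direct ranges, so several
/// blobs in one Puffin file get distinct keys; the bare path otherwise.
fn load_key(entry: &DeleteFileEntry) -> String {
    match dv_source(entry) {
        DvSource::DirectRange(r) => {
            format!("{}#{}:{}", entry.file_path, r.start, r.end - r.start)
        }
        DvSource::FooterFallback => entry.file_path.clone(),
    }
}
```

Keying on offset and length as well as the path means two deletion vectors stored in the same Puffin file do not collide in whatever map tracks loaded deletes.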

--------- Signed-off-by: xxchan <xxchan22f@gmail.com> Co-authored-by: ZENOTME <43447882+ZENOTME@users.noreply.github.com> Co-authored-by: ZENOTME <st810918843@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> (cherry picked from commit 3201d8c)

* fix: delete file lost wake Signed-off-by: xxchan <xxchan22f@gmail.com> * . Signed-off-by: xxchan <xxchan22f@gmail.com> * . Signed-off-by: xxchan <xxchan22f@gmail.com> * revert * typo --------- Signed-off-by: xxchan <xxchan22f@gmail.com> Co-authored-by: xxchan <xxchan22f@gmail.com> Co-authored-by: Li0k <yuli@singularity-data.com>

* feat: manifest filter feat: manifest filter * feat: add ut * feat: add snapshot-id * feat: integrate filter manager and snapshot producer * feat: remove drop partition * feat: intergrate snapshot producer and filter manager * address comments * fix: address comments and UT * typo * update UT * feat: add configuration to enable filter manager * typo fix: fix drop dangling delete files (#88)

* feat: expose task writer * feat: impl IcebergWriter * add new_with_partition_splitter for TaskWriter * support equality delta writer * fix CurrentFileStatus so that it can be called when uninitialized * fix task writer close * fmt * expose position delete schema and genreate snapshot id * better validate_partition_value error msg * rename equality_delta_writer to delta_writer --------- Co-authored-by: Dylan Chen <zilin@singularity-data.com>

* feat: Add `iceberg.schema` to footer for compatibility (#126) * mock change * remove fmt * add unit tests * fix tests * format * commit * feat: fixed formatting --------- Co-authored-by: Jonathan Chen <chenleejonathan@gmail.com>

* feat: Add `iceberg.schema` to footer for compatibility (#126) * mock change * remove fmt * add unit tests * fix tests * format * commit * fix: embed iceberg.schema in Arrow schema metadata for Snowflake comp… (#134) * fix: embed iceberg.schema in Arrow schema metadata for Snowflake compatibility The ParquetWriter was only writing the iceberg.schema JSON into the Parquet footer key-value metadata (WriterProperties). Downstream readers like Snowflake also expect it in the Arrow schema metadata map, which is encoded in the ARROW:schema IPC section of the Parquet file. Inject the iceberg.schema JSON into the Arrow schema metadata during writer initialization so it is present in both locations, matching the behavior of the Java Iceberg implementation. * feat: fixed formatting * fix: normalize nested field names in RecordBatchTransformer (#133) * fix: normalize nested field names in RecordBatchTransformer Parquet files use "item" as the List inner field name (Parquet spec) while Iceberg uses "element" (Iceberg spec). Similarly, Parquet uses "entries" for Map inner fields while Iceberg uses "key_value". The RecordBatchTransformer previously used equals_datatype() (which ignores field names) to decide between PassThrough and Promote. This meant columns with mismatched nested field names were passed through unchanged, causing downstream consumers that use strict schema validation (like DataFusion's concat_batches) to fail with: "column types must match schema types, expected List(Field { name: element ..." Fix: use a 3-way comparison in generate_transform_operations: 1. Strict == match → PassThrough (no cast needed) 2. equals_datatype() but != (field names differ) → Promote (cast to normalize names) 3. Neither → Promote (actual type promotion) * style: apply nightly rustfmt formatting * style: fix cargo fmt formatting in parquet_writer.rs --------- Co-authored-by: Jonathan Chen <chenleejonathan@gmail.com>

* Add delete file filtering heuristics * fix: adapt delete file index tests to case_sensitive context * style: format delete file index tests * chore: remove unused parquet metadata reader import * test: cover delete file index pruning edge cases * test: reduce delete file index test boilerplate

* feat: Add `iceberg.schema` to footer for compatibility (#126) * mock change * remove fmt * add unit tests * fix tests * format * commit * fix: embed iceberg.schema in Arrow schema metadata for Snowflake comp… (#134) * fix: embed iceberg.schema in Arrow schema metadata for Snowflake compatibility The ParquetWriter was only writing the iceberg.schema JSON into the Parquet footer key-value metadata (WriterProperties). Downstream readers like Snowflake also expect it in the Arrow schema metadata map, which is encoded in the ARROW:schema IPC section of the Parquet file. Inject the iceberg.schema JSON into the Arrow schema metadata during writer initialization so it is present in both locations, matching the behavior of the Java Iceberg implementation. * feat: fixed formatting * fix: normalize nested field names in RecordBatchTransformer (#133) * fix: normalize nested field names in RecordBatchTransformer Parquet files use "item" as the List inner field name (Parquet spec) while Iceberg uses "element" (Iceberg spec). Similarly, Parquet uses "entries" for Map inner fields while Iceberg uses "key_value". The RecordBatchTransformer previously used equals_datatype() (which ignores field names) to decide between PassThrough and Promote. This meant columns with mismatched nested field names were passed through unchanged, causing downstream consumers that use strict schema validation (like DataFusion's concat_batches) to fail with: "column types must match schema types, expected List(Field { name: element ..." Fix: use a 3-way comparison in generate_transform_operations: 1. Strict == match → PassThrough (no cast needed) 2. equals_datatype() but != (field names differ) → Promote (cast to normalize names) 3. Neither → Promote (actual type promotion) * style: apply nightly rustfmt formatting * chore: add CODEOWNERS for automatic PR reviewer assignment --------- Co-authored-by: Jonathan Chen <chenleejonathan@gmail.com>

) ObjectCache::get_manifest_list panics on .unwrap() when building the cache key for snapshots without a schema_id (Iceberg v1 format tables don't require this field). Fall back to table_metadata.current_schema_id() when snapshot.schema_id() is None.
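
The fallback described in this commit message amounts to replacing an `.unwrap()` with a default. A minimal sketch, assuming schema ids are plain `i32` values; `cache_schema_id` is a hypothetical helper and not the actual `ObjectCache` code:

```rust
/// Hypothetical helper: picks the schema id used in the cache key.
/// Snapshots without a schema id (as can happen on v1 tables) fall back to
/// the table's current schema id instead of panicking on `unwrap()`.
fn cache_schema_id(snapshot_schema_id: Option<i32>, current_schema_id: i32) -> i32 {
    snapshot_schema_id.unwrap_or(current_schema_id)
}
```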

* feat: add rewrite_manifests transaction action and integration tests Add RewriteManifestsAction for reorganizing manifest files without changing underlying data files. Delegates snapshot construction to SnapshotProducer via SnapshotProduceOperation for consistency with other transaction actions. Features: - Custom clustering functions via cluster_by() - Manifest predicates via rewrite_if() - Manual manifest add/delete operations - Path-based manifest identity matching - No-op detection to avoid redundant snapshots - Delete-type manifest rejection - Duplicate manifest path validation - V1-safe file count validation (skips when counts are None) - Deterministic manifest ordering via BTreeMap - Internal summary metrics protected from user override Includes 13 integration tests covering clustering, predicates, multi-round rewrites, no-op scenarios, snapshot properties, partition spec preservation, partitioned tables, and error cases. * fix: reject manifests with unknown (None) file counts in add_manifest The validation in add_manifest previously used is_some_and() which silently allowed None counts through. V1 manifests have None for file counts, meaning a V1 manifest with Added/Deleted entries could bypass validation. Now uses has_added_files()/has_deleted_files() which treat None as non-zero per the Iceberg spec, correctly rejecting manifests with unknown counts. * fix: validate partition_spec_id exists in table metadata for add_manifest Reject manifests whose partition_spec_id does not correspond to any partition spec in the table metadata. Without this check, a snapshot could reference a manifest written with an unknown spec id, breaking downstream readers. * fix: reject rewrite_manifests for V3 tables to prevent row lineage corruption Rewriting manifests on V3 tables creates new ManifestFiles with first_row_id unset, causing ManifestListWriter to assign fresh row IDs and advance next_row_id even though no new rows were added. 
This breaks row lineage semantics by incorrectly bumping table.metadata().next_row_id(). Return FeatureUnsupported early in RewriteManifestsAction::commit() for tables with format version >= MIN_FORMAT_VERSION_ROW_LINEAGE (V3) until a strategy to preserve row IDs through manifest rewrites is implemented. * feat: add rewrite_manifests transaction action * Update crates/iceberg/src/transaction/rewrite_manifests.rs Co-authored-by: Li0k <yuli@singularity-data.com> * fix: resolve cargo fmt violations and remove duplicate code in rewrite_manifests Fix import ordering to match nightly rustfmt 2024 edition style (SCREAMING_CASE constants sorted with regular identifiers) and remove duplicate variable declarations in RewriteManifestsAction::commit(). * fix: correct cargo fmt import ordering to match nightly rustfmt Reorder imports to match the nightly-2025-06-23 rustfmt style where SCREAMING_CASE constants sort before regular identifiers and parent module imports sort before submodule imports. Also fix multi-line assert! macro formatting. * fix: add Apache license header to CODEOWNERS * fix: resolve clippy warnings for uninlined format args and collapsible if * fix: correct import ordering in rewrite_manifests.rs for nightly rustfmt * fix: resolve clippy warnings in integration tests * fix: correct import ordering in integration test for nightly rustfmt --------- Co-authored-by: Li0k <yuli@singularity-data.com>

* fix: make RemoveOrphanFilesAction::execute() future Send-safe The delete loop iterated over &orphan_files, creating borrowed &String references captured by async blocks. This caused a higher-ranked trait bound (HRTB) error when the execute() future was used inside tokio::spawn or any other Send-requiring context. Fix: iterate over owned Strings (via clone()) so each async task owns its data, eliminating the lifetime issue. * fix: correct import ordering for nightly rustfmt Reorder imports to match the project's rustfmt.toml config which uses group_imports = "StdExternalCrate" (nightly-only feature). The CI runs nightly-2025-06-23 rustfmt which enforces this ordering.

…ation cleanup (#143) * fix: skip DELETED entries when protecting files during snapshot expiration cleanup After compaction, data files replaced by compacted files are tracked as manifest entries with ManifestStatus::Deleted in the current snapshot's manifests. These are tombstone entries indicating the file was removed from the table. Previously, find_files_to_delete treated all entries in surviving manifests as still-referenced, including Deleted entries. This prevented the underlying data files from being removed from storage after their snapshots expired. Now only entries with Added or Existing status protect data files from deletion. Deleted entries are tombstones and no longer block cleanup.
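
The rule in this commit message (only Added and Existing entries protect a file) can be sketched with simplified stand-in types. `Status`, `Entry`, and `protected_paths` are hypothetical and much smaller than the real manifest types; the sketch only shows the filtering step, not the rest of `find_files_to_delete`.

```rust
use std::collections::HashSet;

/// Simplified stand-in for the manifest entry status.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    Added,
    Existing,
    Deleted,
}

/// Simplified stand-in for a manifest entry.
struct Entry {
    status: Status,
    file_path: String,
}

/// Collects the paths that surviving manifests still reference.
/// Deleted entries are tombstones, so they do not protect their file.
fn protected_paths(entries: &[Entry]) -> HashSet<&str> {
    entries
        .iter()
        .filter(|e| e.status != Status::Deleted)
        .map(|e| e.file_path.as_str())
        .collect()
}
```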

* perf: parallelize manifest loading in rewrite_manifests Load manifest files concurrently (up to 64 at a time) using futures::stream::buffer_unordered instead of sequentially in a for loop. For tables with thousands of manifests, sequential S3/R2 reads dominate runtime (~100ms latency each), causing 8000-manifest tables to take ~27 minutes on I/O alone. With 64-way concurrency the same I/O completes in ~25 seconds.
* refactor: reuse utils::load_manifests and load_manifest_lists for concurrent manifest loading Deduplicate manifest loading logic across the codebase by replacing inline stream/buffer_unordered patterns and sequential for-loops with the shared utils::load_manifests and utils::load_manifest_lists helpers.
- rewrite_manifests: replace inline stream::iter().buffer_unordered() with load_manifests, remove local MANIFEST_LOAD_CONCURRENCY constant
- snapshot (validate_added_files_not_existing): replace sequential manifest loading with concurrent load_manifests
- snapshot (merge_bin): replace sequential manifest loading with concurrent load_manifests
- remove_snapshots: replace sequential manifest list loading with concurrent load_manifest_lists, resolving the existing TODO comment

* deps: upgrade DataFusion to 53.0, Arrow to 58 (apache#2206) - Closes #. - Bump DataFusion to 53.0.0, Arrow/Parquet to 58, sqllogictest to 0.29, pyo3 to 0.28. - Adapt to DataFusion 53 API changes in physical plan executors and python bindings. - Update SLT expected test output. Existing tests. --------- Co-authored-by: Xander <zander181@googlemail.com> * fix: restore workspace manifest consistency * fix: make df53 upgrade pass lint checks * chore: minimize non-df lockfile drift --------- Co-authored-by: Matt Butrovich <mbutrovich@users.noreply.github.com> Co-authored-by: Xander <zander181@googlemail.com>

…#151) * fix(transaction): correct previous-snapshot lookup for summary rollup SnapshotProducer::summary resolved the previous snapshot via snapshot_by_id(self.snapshot_id), but self.snapshot_id is the *new* snapshot ID being created by this transaction and is not yet present in table_metadata. The lookup therefore always returned None, which was propagated into update_snapshot_summaries as previous_summary = None. update_totals then fell back to previous_total = 0 and recomputed every cumulative field from scratch: total-records = 0 + added_records - removed_records total-data-files = 0 + added_files - removed_files total-delete-files = 0 + added_delete - removed_delete total-position-deletes = 0 + added_pos - removed_pos total-equality-deletes = 0 + added_eq - removed_eq For any snapshot past the first on a table, this produced totals that reflected only the current commit rather than the full post-commit table state. REPLACE operations (e.g. compaction RewriteFiles and RewriteManifestsAction) and FastAppend past the first snapshot were most visibly affected: a compaction that replaced 2 of 4 files would report total-data-files=2 / total-records=<sum of 2 new files> in the new snapshot summary, while scans via manifest lists correctly observed all 4 files. Table data and scan planning are unaffected — readers walk the manifest list rather than trusting the summary. Consumers that rely on the snapshot summary for accounting (dashboards, cost and size reporting, change-data tracking) would see incorrect values. The correct previous snapshot is the current tip of the target branch, which is exactly what the companion commit() call selects 40 lines later to write as parent_snapshot_id on the manifest list. Resolve it the same way here using snapshot_for_ref(&self.target_branch) so the two code paths agree on which snapshot is the parent, and the rollup advances the totals rather than resetting them. 
* test(transaction): cover previous-total rollup on fast append Adds a regression test that builds a V2 table whose main-branch snapshot already carries cumulative totals in its summary (total-data-files = 5, total-records = 100), runs a FastAppend that adds one data file with 10 records, and asserts that the new snapshot summary reports total-data-files = 6 and total-records = 110 rather than 1 and 10. On the previous implementation this test fails with total-records = 10, reproducing the 'totals reset to zero every commit' behavior that affected every snapshot past the first on a table with summary totals set. With the fix applied it passes. The test is self-contained: it seeds the parent snapshot and main-ref in memory via TableMetadata::into_builder().add_snapshot().set_ref(), and stages an empty V2 manifest list in the in-memory FileIO so that the FastAppend commit path can load the parent's manifests without needing a real object store.

* fix(transaction): account for removed files in snapshot summary SnapshotProducer::summary only fed the snapshot summary collector with added_data_files. Files removed by the action (e.g. on RewriteFiles / OverwriteFiles commits) were never reported, so deleted-records, removed-files-size, deleted-data-files, removed-*-deletes, and the matching removed-file-size stayed at zero on the produced summary. update_snapshot_summaries rolls cumulative totals forward from the parent as new_total = previous_total + added - removed so with removed stuck at 0 the totals advance as if every REPLACE or OVERWRITE were a pure append: total-data-files = previous_total_data_files + added_data_files total-records = previous_total_records + added_records total-files-size = previous_total_files_size + added_file_size A compaction that rewrites N files into M files of the same row count therefore inflates total-data-files by M and total-records by the row count on each commit, and the summary never records that any file was removed. Table data and scan planning are unaffected (readers walk the manifest list), but consumers that rely on the snapshot summary for accounting (dashboards, cost and size reporting, change-data tracking) observe incorrect values that diverge further from reality with every rewrite/overwrite. Retain the full DataFile objects for removed data files on SnapshotProducer (previously only the paths were kept, for manifest filtering) and feed both removed_data_files and removed_delete_files through SnapshotSummaryCollector::remove_file alongside added files. The collector already emits all removed-* / deleted-* properties through UpdateMetrics::to_map, so this propagates correctly into update_snapshot_summaries and produces the expected totals. 
test(transaction): cover rewrite summary rollup for removed files Adds a regression test that constructs a V2 table whose main-branch snapshot already reports cumulative totals (total-data-files = 5, total-records = 100), then exercises SnapshotProducer::summary directly with one added and one removed data file (each 10 records) under RewriteFilesOperation. It asserts the resulting summary contains deleted-records = 10 and deleted-data-files = 1, and that total-records and total-data-files stay at 100 and 5 (zero-delta rewrite). On the previous implementation this fails on deleted-records (None vs Some("10")) and, consequently, the totals roll forward as 110 / 6 rather than 100 / 5 — reproducing the 'removed files are invisible to the summary' behavior. With the fix it passes. The test is self-contained: it seeds the parent snapshot and main-ref in memory via TableMetadata::into_builder().add_snapshot().set_ref() and never touches the manifest-list walk, which is unrelated to the summary path. * fix(transaction): skip removed-file summary updates on Overwrite The previous commit on this branch fed SnapshotProducer's removed_data_files / removed_delete_files through SnapshotSummaryCollector::remove_file so that REPLACE commits emit correct deleted-* / removed-* fields and update_totals rolls totals down rather than treating every rewrite as a pure append. That fix is correct for Replace and Delete, but it breaks Overwrite. update_snapshot_summaries routes Overwrite through truncate_table_summary, which implements full-table truncate semantics: set all current total-* to 0 and copy the parent's total-* into the current summary as removed-* / deleted-*. The truncate step guards each copy with `if value != 0`, so any removed-* that summary() pre-populated is silently kept when the parent totals are 0 (i.e. the parent was itself produced by a truncating overwrite), and clobbered otherwise. 
The 'silently kept' case then flows into update_totals as: new_total = previous_total(0) + added(0) - removed(>0) which underflows on u64 and panics with 'attempt to subtract with overflow'. This happens in practice whenever Overwrite is used iteratively — e.g. the overwrite_files_test::test_partition_spec_id_in_manifest integration test appends 10 files, then calls overwrite_files(). delete_files(one_file) in 10 separate commits. Commit #1 runs truncate successfully (parent totals were non-zero, truncate clobbers the per-file removed-* we pre-populated). Commit #2 panics on the u64 subtraction. Skip the remove_file feed-through for Overwrite. truncate_table_ summary already derives removed-* from the parent's total-*, which is the semantics the JVM Iceberg reference uses for Overwrite, so dropping our per-file accounting there restores the original behavior while leaving Replace / Delete (the operations this branch actually needed to fix) unaffected. test(transaction): cover overwrite summary underflow after prior truncate Adds a regression test that constructs a V2 table whose main-branch parent has total-* already at 0 (i.e. prior commit was a truncating overwrite), then runs SnapshotProducer::summary with OverwriteFilesOperation and one removed data file and asserts the call does not panic with 'attempt to subtract with overflow' and emits total-records=0 / total-data-files=0. Before this fix the summary() call panics at snapshot_summary.rs:513 in update_totals. With the fix it returns the expected zero-rollup summary, matching JVM reference behavior for a no-op overwrite on an empty (post-truncate) table.
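
The rollup formula quoted in these commit messages, new_total = previous_total + added - removed, and the `u64` underflow it can hit are easy to reproduce in isolation. The sketch below uses a hypothetical `roll_total` helper with checked arithmetic purely to show where the panic comes from. It is not the fix applied in these commits, which instead skips the per-file removed accounting for Overwrite.

```rust
/// Hypothetical helper mirroring the formula quoted above:
/// new_total = previous_total + added - removed.
/// Returns None where plain `u64` arithmetic would overflow or underflow
/// (the latter is the "attempt to subtract with overflow" panic).
fn roll_total(previous_total: u64, added: u64, removed: u64) -> Option<u64> {
    previous_total.checked_add(added)?.checked_sub(removed)
}
```

With the figures from the regression tests described above: a fast append of one file onto a parent total of 5 gives `roll_total(5, 1, 0) == Some(6)`, a zero-delta rewrite gives `roll_total(100, 10, 10) == Some(100)`, and a removal against a parent total of 0 gives `roll_total(0, 0, 10) == None`.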

…153) The object_cache uses moka for its cache. It stores Manifests and ManifestLists. It currently only checks the size of the values on the stack, whereas these data structures point to potentially large amounts of data on the heap. This means the cache does not correctly enforce its limit (default of 32MB). For example, an Iceberg table with more than 100K data files can lead to large ManifestLists held in memory. The DataFiles also maintain 6 hashmaps containing column stats. If the table has a large number of columns, this can also increase the amount of data held on the heap. This change adds weighing functions which traverse the data structures to accurately estimate their size on the heap.

xxchan and others added 30 commits March 3, 2026 14:30

feat: support incremental scan between 2 snapshots (#13)

1dbb595

(cherry picked from commit ad87946)

expose data file serialized

2d45340

(cherry picked from commit 2d07dcd)

support set snapshot id for fast append

ed47fb4

(cherry picked from commit f040f26)

feat(iceberg): introduce remove snapshot action

04b8ecf

address comments (cherry picked from commit 8c60928)

feat: support append delete file

3eb9f1b

Signed-off-by: xxchan <xxchan22f@gmail.com> Co-authored-by: ZENOTME <43447882+ZENOTME@users.noreply.github.com> Co-authored-by: ZENOTME <st810918843@gmail.com> (cherry picked from commit 6d4339e)

feat: support merge append

3bf3297

Signed-off-by: xxchan <xxchan22f@gmail.com> Co-authored-by: ZENOTME <43447882+ZENOTME@users.noreply.github.com> Co-authored-by: ZENOTME <st810918843@gmail.com> (cherry picked from commit 6519c98)

chore: pick public function generate_unique_snapshot_id for exactly o…

3cea15e

…nce sink (#82) (cherry picked from commit 494ca90)

feat(iceberg): rewrite files action (#47) (#86)

da40f9f

* feat(iceberg): introduce rewrite files action * fix(iceberg): add test * fix test (cherry picked from commit a2b6cc4)

fix: cherry-pick #27

b2a6e80

Co-authored-by: Dylan <chenzilin25@gmail.com> (cherry picked from commit cee1fa7)

feat: optimize plan files memory consumption (#64)

73e30a1

(cherry picked from commit ef44e88)

fix(test): adapt delete task initializers to Arc<FileScanTask>

5ed3eee

azblob

35b8c3e

fix(iceberg): Introduce new data sequence for RewriteFilesAction (#51)

3c17dbd

* feat(iceberg): rewrite_files support use_starting_sequence_number * chore(test): add test_sequence_number_in_manifest_entry

fix(cherry-pick): remove duplicate file_size_in_bytes assignments

e2eb873

fix(iceberg): fix rewrite-files partition-spec-id (#54)

ab94c27

* fix(iceberg): fix rewrite-files partition-spec-id * fix(docker): update docker file * add test * update minio * Revert "update minio" This reverts commit 4464d90.

Feature: Optionally configure consistent chunk sizes for multi-part u…

0855e4f

…ploads (#57)

feat: support write to branch (#62)

37d925f

* feat: support to branch * fix: fix ref name * fix: current_snapshot_id * refactor: refactor interface * fmt

feat: support overwrite files action (#63)

3cee389

* fix: fix snapshot for entries * refactor: refactor starting_sequence_number * feat: OverwriteFiles Action * fix: fix compile * fix: add more test * typo * fix: fix ut * fix: fix check

fix(test): update split_offsets to Option in manifest_filter tests

36911e6

fix(azdls): enable append mode for AZDLS write operations (#89)

735b002

* fix(azdls): enable append mode for AZDLS write operations * fix: fix doc

feat: check file existence (#92)

e03a49a

fix: use branch_snapshot instead of current snashot (#94)

feat: support position delete writer (#95)

ae80e04

* support position writer * fmt * fix: fix integration-tests and bug * fix --------- Co-authored-by: Li0k <yuli@singularity-data.com>

fix: align position delete writer builder with current trait signature

53d4e14

fix: adapt task and delta writers to current main interfaces

6f62381

fix: fix record to struct deserialization (#102)

70d7b37

* fix record to struct deserialization * refine the test case

public upper bounds (#103)

f027bb4

chenzl25 and others added 29 commits March 3, 2026 17:34

fix: remove unused null buffer import in record batch transformer

4030d86

fix record batch transform timestamp type

ab5bfe9

ci: enable workflow for dev_rebase_main_20260303

3f2f2b2

chore: format (#130)

cc6e6b6

* fmt * cargo clippy * fix test * cargo clippy * fix tests

skip_serializing the partitioin fields from FileScanTask

96a69c0

(cherry picked from commit d650906)

feat(writer): support background close in rolling file writer (#141)

baaa9c7

* feat(writer): support background close in rolling file writer * test(writer): cover pending close wait path * fix: fmt

feat: support add sort id for data file writer (#139)

1a07913

* feat: support sort order id in data file writer * chore: remove extra test from add-sort-id port

chore: expose delete vector (#146)

645f02a

* chore: expose delete vector Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com> * make clippy happy Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com> --------- Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>

feat: add write_with_position for IcebergWriter (#149)

b55e080

perf: avoid extra copy in DeletionVectorWriter::write (#150)

40dfa51

* perf: avoid extra copy in `DeletionVectorWriter::write`

feat: add DataFile::set_partition (#154)

45a80d1

Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>

fix: handle all Iceberg types when filling missing columns files afte…

d6bd578

…r schema evolution (#156)

fix(rest): refresh token after unauthorized response (#157)

fc28b1f

feat: support dropping schema fields (#155)

ea0ab2d

Co-authored-by: xxhx <xxhx@xxhxdeMacBook-Pro.local>

fix: skip opendal TimeoutLayer under madsim (#160)

8f7c952

Signed-off-by: Mingzhuo Yin <yinmingzhuo@gmail.com>

fix puffin file reader

1ea2c6e

chenzl25 closed this May 26, 2026

fix: fix puffin file reader #2513
chenzl25 wants to merge 76 commits into apache:main from risingwavelabs:dylan/fix_puffin_reader

9 participants
