
Improve REST Support for lazy snapshot loading #16207

Open

grantatspothero wants to merge 1 commit into apache:main from grantatspothero:gn/lazySnapshotLog

Conversation

@grantatspothero
Contributor

grantatspothero commented May 4, 2026

A previous PR added support for lazy snapshot loading: https://github.com/apache/iceberg/pull/6850/changes

This PR improves on that by also lazily loading the snapshotLog.

For tables with high numbers of snapshots (e.g. tables with low-latency commits), this can result in significant memory savings.

Considerations:

  • Wanted to maintain backwards compatibility, so setSnapshotsSupplier was kept and deprecated rather than removed. A sketch of the deferral pattern follows below.
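
As a rough illustration, here is a minimal sketch of the deferral pattern (the class and member names are stand-ins for illustration, not the PR's actual API): the snapshot log is materialized only on first access, mirroring how the existing snapshots supplier defers the snapshots list.

```java
import java.util.List;
import java.util.function.Supplier;

class LazySnapshotLog {
  // Stand-in for org.apache.iceberg.HistoryEntry: two longs per entry.
  record HistoryEntry(long timestampMillis, long snapshotId) {}

  private final Supplier<List<HistoryEntry>> snapshotLogSupplier;
  private volatile List<HistoryEntry> snapshotLog; // null until first access

  LazySnapshotLog(Supplier<List<HistoryEntry>> snapshotLogSupplier) {
    this.snapshotLogSupplier = snapshotLogSupplier;
  }

  List<HistoryEntry> snapshotLog() {
    // Double-checked locking: only the first caller pays the cost of
    // fetching and parsing the log from the REST catalog.
    List<HistoryEntry> result = snapshotLog;
    if (result == null) {
      synchronized (this) {
        result = snapshotLog;
        if (result == null) {
          snapshotLog = result = snapshotLogSupplier.get();
        }
      }
    }
    return result;
  }
}
```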

github-actions bot added the core label May 4, 2026
grantatspothero changed the title from Gn/lazy snapshot log to Improve REST Support for lazy snapshot loading May 4, 2026
grantatspothero force-pushed the gn/lazySnapshotLog branch 2 times, most recently from c9bc366 to c756769 May 4, 2026 17:11
@gaborkaszab
Contributor

Hi @grantatspothero,
I see this is a draft PR, but it grabbed my attention since I was investigating the lazy snapshot loading area recently. Could you help me understand what exactly it is about the snapshot log that worries you? Is it network traffic, memory consumption on the client side, or something else?
The reason I ask is that initially I'd think there isn't much to gain by lazily loading the snapshot log, because each log entry is just 2 longs. So even with thousands of snapshots we're still in low-kilobytes territory in terms of memory usage. Above that number of snapshots you're doomed anyway :)

@grantatspothero
Contributor Author

grantatspothero commented May 4, 2026

Our problem was excessive memory usage due to caching TableMetadata on the client side.

Storing a List<HistoryEntry> in memory is fine for small numbers of snapshots, but each entry takes ~32 bytes, and this grows quickly when a single coordinator service caches Iceberg metadata for many tables in memory.

Example:

  • 1,000 table metadata objects cached in memory
  • each table commits every 30s, with 30 days of snapshot retention = 2*60*24*30 = 86,400, call it ~100K snapshots in the Iceberg metadata
  • 32 bytes * 100K ≈ 3.2 MB of snapshotLog per table
  • 3.2 MB/table * 1,000 tables ≈ 3.2 GB

Note: this is "resident set size", not "total allocations"; the latter tends to be significantly higher due to intermediate allocations while parsing JSON.
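
For concreteness, here is the same arithmetic as a runnable sketch; the 32 bytes/entry figure is an assumption (two long fields plus object header and padding on a typical 64-bit JVM), not a measured value:

```java
public class SnapshotLogFootprint {
  public static void main(String[] args) {
    long snapshotsPerTable = 2L * 60 * 24 * 30; // one commit every 30s for 30 days = 86,400
    long bytesPerEntry = 32;                    // assumed per-HistoryEntry footprint
    long bytesPerTable = snapshotsPerTable * bytesPerEntry;
    long tables = 1000;
    // prints: per table: 2.8 MB, total: 2.8 GB (≈3.2 GB with the ~100K rounding)
    System.out.printf("per table: %.1f MB, total: %.1f GB%n",
        bytesPerTable / 1e6, bytesPerTable * tables / 1e9);
  }
}
```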

For multi-tenant coordinator services (e.g. query engines, cache services) this memory usage is a problem. The biggest memory hog is by far the snapshots array, but snapshotLog is the next biggest. Since Iceberg already defers snapshots, it seemed reasonable to defer snapshotLog as well.

@grantatspothero
Contributor Author

grantatspothero commented May 4, 2026

> Above that number of snapshots you're doomed anyway :)

It is becoming more common to have large numbers of snapshots in Iceberg due to the prevalence of streaming ingestion and low-latency commits.

See this mailing list discussion: https://www.mail-archive.com/dev@iceberg.apache.org/msg12764.html
Examples: the kafka-connect Iceberg sink, Confluent Tableflow, Starburst streaming ingestion.

This doesn't solve the full problem raised in that mailing list thread (writers still pay the full cost of writing snapshots/snapshotLog), but it does solve it for readers. And for query engine and caching use cases, reads >> writes, so this could be beneficial.

Previously only lazily loaded snapshots
grantatspothero marked this pull request as ready for review May 4, 2026 22:03
@gaborkaszab
Contributor

Thank you for the explanation, @grantatspothero!
I feel that ~100K-snapshot tables are at the very extreme end of use cases. I'm also wondering: if the table changes every 30 seconds, is there any point in storing it in a cache? We'd need to reload it frequently anyway.
I was going to advise you to reach out to the dev@ list for wider community feedback on this, but I see you've already done so, thanks!

I can take a look at the code; if the improvement is simple enough, I don't see why not to include it. If it's messy or complicated, we might need some community support to get it through.

@grantatspothero
Contributor Author

grantatspothero commented May 5, 2026

> If the table changes every 30 seconds, is there any point in storing it in a cache?

Two different definitions of cache:

  1. "Within query metadata caching". Within a single query's lifetime, TableMetadata must live in coordinator memory. Queries are usually short but sometimes can take hours, wasting coordinator memory for hours for long running queries. This wasted memory is exacerbated by: # of concurrent queries and # of tables per query. Compare this to the hive table model where coordinator memory is mostly bounded.
  2. "Cross-query metadata caching". I believe this is what you are talking about. Trino does not support cross-query table metadata caching today, but some engines do and have problems. With a cross-query cache it is difficult to control caching at a fine granularity. "Cache these long lived table metadatas but not these constantly changing ones"
