Improve REST Support for lazy snapshot loading #16207
grantatspothero wants to merge 1 commit into apache:main
Hi @grantatspothero,
Our problem was excessive memory usage due to caching TableMetadata on the client side. Storing a Example:
Note: this is resident set size, not total allocations, which tend to be significantly higher because of the intermediate allocations made while parsing JSON. For multi-tenant coordinator services (e.g. query engines, cache services) this memory usage is a problem. The biggest memory hog is by far the snapshots array, with snapshotLog the next biggest. Since Iceberg already defers snapshots, it seemed reasonable to defer snapshotLog as well.
It is becoming more common for Iceberg tables to have large numbers of snapshots, due to the prevalence of streaming ingestion and low-latency commits. See the mailing list discussion: https://www.mail-archive.com/dev@iceberg.apache.org/msg12764.html This doesn't solve the full problem mentioned in that thread (writes still pay the full cost of writing snapshots/snapshotLog), but it does solve the problem for readers. For query engine and caching use cases, reads >> writes, so this could be beneficial.
Previously only lazily loaded snapshots
Thank you for the explanation, @grantatspothero! I can take a look at the code. If the improvement is simple enough, I don't see why we shouldn't include it. If it's messy or complicated, we might need some community support to get it through.
Two different definitions of cache:
Previously, this PR added support for lazy snapshot loading: https://github.com/apache/iceberg/pull/6850/changes
This PR improves the lazy loading by also supporting lazy loading of snapshotLog. For tables with high numbers of snapshots (e.g. tables with low-latency commits), this can result in significant memory savings.
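For readers unfamiliar with the pattern, here is a minimal sketch of deferring snapshotLog behind a supplier. It is illustrative only and is not the code in this PR: `LazyMetadataSketch`, `LogEntry`, and `isSnapshotLogLoaded` are hypothetical names, and the real `TableMetadata` wiring may differ.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical holder used for illustration; this is NOT Iceberg's TableMetadata. */
final class LazyMetadataSketch {

  /** Hypothetical stand-in for one snapshot log entry. */
  static final class LogEntry {
    final long timestampMillis;
    final long snapshotId;

    LogEntry(long timestampMillis, long snapshotId) {
      this.timestampMillis = timestampMillis;
      this.snapshotId = snapshotId;
    }
  }

  // Assumption: the supplier performs the expensive work (e.g. a follow-up load and parse).
  private final Supplier<List<LogEntry>> snapshotLogSupplier;

  // Stays null until the log is first requested, so unused metadata holds no log entries.
  private volatile List<LogEntry> snapshotLog;

  LazyMetadataSketch(Supplier<List<LogEntry>> snapshotLogSupplier) {
    this.snapshotLogSupplier = snapshotLogSupplier;
  }

  /** Reports whether the deferred load has already happened. */
  boolean isSnapshotLogLoaded() {
    return snapshotLog != null;
  }

  /** Loads the log on first access and returns the cached immutable copy afterwards. */
  List<LogEntry> snapshotLog() {
    List<LogEntry> loaded = snapshotLog;
    if (loaded == null) {
      synchronized (this) {
        loaded = snapshotLog;
        if (loaded == null) {
          loaded = List.copyOf(snapshotLogSupplier.get());
          snapshotLog = loaded;
        }
      }
    }
    return loaded;
  }
}
```

In this sketch, a cached metadata object that never has its log read keeps only the supplier reference, which is where the memory saving for read-heavy callers would come from. The double-checked locking ensures the supplier runs at most once even under concurrent access.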
Considerations:
setSnapshotsSupplier