[Feature Request] Log Compaction in Delta #2072
This PR adds read support for log compactions described here: delta-io#2072 Closes delta-io#2073 GitOrigin-RevId: 6f4a09c3fa09c303cdeb747c382cedcfda5a2a4c
Protocol changes for log compaction Issue: delta-io#2072 Closes delta-io#2122 GitOrigin-RevId: c15ff24a2a4242520f5cf8ffdb8604a4ffc36805
Creating a checkpoint whenever a large commit lands is useful in many ways:
@prakharjain09 currently Delta only supports reading Log Compaction? Any plans for writing?
Feature request
Which Delta project/connector is this regarding?
Overview
This issue describes problems related to checkpointing and how they could be handled using Log Compactions.
Motivation
Frequent checkpointing is a major problem, especially for large tables. Today we checkpoint every 10th commit. Frequent checkpointing causes the following issues:
Why frequent checkpoints are needed:
We try to keep the total number of delta files below a certain threshold so that performance doesn’t degrade.
More deltas => more files to read for state reconstruction => higher tail latencies
Requirements
Further details
Proposal sketch
Today the DeltaLog has the following files:
Commit File
00000000000000000005.json -> Represents a Delta file for a given version.
Checkpoint File
00000000000000000010.checkpoint.parquet -> Represents a Checkpoint file which captures everything from commit 0.
We could introduce a new "Log Compaction" file.
Log Compaction
A Log Compaction represents the compaction of the deltas between version ‘X’ and version ‘Y’ (both bounds inclusive). It can have the following structure:
x.y.compact.json -> Represents all the changes from commit ‘X’ through commit ‘Y’. e.g. 00000000000000000100.00000000000000000200.compact.json
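The naming scheme above can be sketched in a few lines of Python. This is only an illustration of the proposed file name layout, not Delta code: the helper names `compaction_file_name` and `parse_compaction_file_name` are hypothetical, and the 20-digit zero padding is assumed from the example file names in this issue.

```python
import re
from typing import NamedTuple, Optional

# Assumption: both versions are zero-padded to 20 digits, as in the examples above.
_COMPACT_RE = re.compile(r"^(\d{20})\.(\d{20})\.compact\.json$")


class CompactionRange(NamedTuple):
    start: int  # first commit version covered (inclusive)
    end: int    # last commit version covered (inclusive)


def compaction_file_name(start: int, end: int) -> str:
    """Build the proposed x.y.compact.json name for commits start..end."""
    if start < 0 or end < start:
        raise ValueError(f"invalid compaction range: {start}..{end}")
    return f"{start:020d}.{end:020d}.compact.json"


def parse_compaction_file_name(name: str) -> Optional[CompactionRange]:
    """Return the covered range, or None if the name is not a compaction file."""
    match = _COMPACT_RE.match(name)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    if end < start:
        return None
    return CompactionRange(start, end)
```

Ordinary commit files and checkpoint files do not match the pattern, so `parse_compaction_file_name` returns `None` for them and they can keep flowing through whatever listing logic already handles them.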
Instead of doing a full checkpoint every 10 commits, we could do a mix of minor compactions and full checkpoints. We could use different heuristics/policies to decide when to do a minor compaction rather than a full checkpoint.
We could use a post-commit hook to trigger creating the minor compactions.
When to trigger MinorCompaction, Checkpointing
Old Rule
New Rule
Metadata Cleanup
We could apply LOG_RETENTION (defaults to 30 days) to compacted delta files as well, i.e. when we delete all old checkpoints before a given version ‘X’, we also delete all compacted deltas that have startVersion <= X.
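A minimal sketch of the version-based part of this cleanup rule, in Python. The function name `compactions_to_delete` is hypothetical, the file name pattern is assumed from the examples in this issue, and the retention-time check (LOG_RETENTION) is deliberately left out so that only the startVersion <= X condition is shown.

```python
import re
from typing import Iterable, List

# Assumption: compaction files are named <start>.<end>.compact.json with 20-digit versions.
_COMPACT_RE = re.compile(r"^(\d{20})\.(\d{20})\.compact\.json$")


def compactions_to_delete(file_names: Iterable[str], cutoff_version: int) -> List[str]:
    """Select compacted delta files whose startVersion is <= cutoff_version.

    Names that are not compaction files (commits, checkpoints) are never selected.
    """
    selected = []
    for name in file_names:
        match = _COMPACT_RE.match(name)
        if match is not None and int(match.group(1)) <= cutoff_version:
            selected.append(name)
    return selected
```

In a real cleanup the same pass would also have to honor the retention period before deleting anything.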
Compatibility with older Delta versions
Older Delta versions won’t be able to read or write compacted delta files. They will ignore such files and create a Snapshot backed by the last available checkpoint and the individual delta files.
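To make the difference between the two kinds of reader concrete, here is a rough Python sketch of how the files after the last checkpoint could be chosen. It is an assumption-laden illustration rather than the actual Delta implementation: `files_to_replay` is a hypothetical helper, it assumes every commit file after the checkpoint is still present, and it greedily picks the widest usable compaction starting at each version. Passing `use_compactions=False` models an older reader that ignores compaction files.

```python
from typing import Dict, Iterable, List, Tuple


def files_to_replay(
    checkpoint_version: int,
    target_version: int,
    compactions: Iterable[Tuple[int, int]],
    use_compactions: bool = True,
) -> List[str]:
    """List log files covering versions checkpoint_version+1 .. target_version.

    compactions holds (start, end) ranges, both bounds inclusive.
    """
    # For each start version, remember the widest compaction that fits entirely
    # between the checkpoint and the target version.
    widest_end: Dict[int, int] = {}
    if use_compactions:
        for start, end in compactions:
            if checkpoint_version < start <= end <= target_version:
                widest_end[start] = max(widest_end.get(start, start), end)

    files: List[str] = []
    version = checkpoint_version + 1
    while version <= target_version:
        if version in widest_end:
            end = widest_end[version]
            files.append(f"{version:020d}.{end:020d}.compact.json")
            version = end + 1  # skip all commits covered by the compaction
        else:
            files.append(f"{version:020d}.json")
            version += 1
    return files
```

With a checkpoint at version 10, a target version of 25 and a compaction covering 11 through 20, a compaction-aware reader would read one compaction file plus five commit files, while an older reader would read all fifteen commit files and still arrive at the same state.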
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?