[Feature Request] Log Compaction in Delta #2072

Open
2 of 8 tasks
prakharjain09 opened this issue Sep 18, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@prakharjain09
Collaborator

prakharjain09 commented Sep 18, 2023

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other: Delta Protocol

Overview

This issue discusses problems related to checkpointing and how they could be handled using Log Compactions.

Motivation

Frequent checkpointing is a major issue, especially for large tables. Today we checkpoint every 10th commit. Frequent checkpointing causes the following issues:

  • Latency - Checkpoints involve full state reconstruction, which is a slow operation. It can take more than 3-4 minutes for large Delta tables, even on large Spark clusters.
  • Write Amplification - Each checkpoint duplicates the majority of actions again. Amplification is O(N) per checkpoint, where N is the number of files in the table, so the total size of all checkpoints is O(N**2). Checkpoints have a lifespan of 30 days by default.
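
To make the write-amplification point concrete, here is a minimal sketch under simplifying assumptions that are not part of the proposal: every commit adds the same number of files, nothing is ever removed, and a checkpoint writes one action per live file. The function name and its parameters are hypothetical.

```python
def total_checkpoint_actions(num_commits: int, files_per_commit: int, interval: int) -> int:
    """Total actions written across all checkpoints of an append-only table.

    Assumption: after `c` commits the table holds `c * files_per_commit` files,
    and a checkpoint is written after every `interval`-th commit.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    total = 0
    for commits_so_far in range(interval, num_commits + 1, interval):
        # Each checkpoint rewrites one action for every file added so far.
        total += commits_so_far * files_per_commit
    return total


if __name__ == "__main__":
    for commits in (100, 200):
        print(commits, total_checkpoint_actions(commits, 1, 10))
    print(100, total_checkpoint_actions(100, 1, 20))
```

With one file per commit and a checkpoint every 10 commits, 100 commits write 550 checkpoint actions in total and 200 commits write 2100, so doubling the table size roughly quadruples the total. Raising the interval to 20 brings the 100-commit total down to 300.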

Why frequent checkpoints are needed:

We try to keep the total number of delta files below a certain threshold so that performance doesn’t degrade.

More deltas => more files to read for state reconstruction => higher tail latencies

Requirements

  • Increase the Checkpoint Interval to 20 without affecting the performance of Read/Write operations
  • Avoid any Protocol/TableFeature upgrades

Further details

Proposal sketch

Today the DeltaLog has the following files:

Commit File

00000000000000000005.json -> Represents a Delta file for a given version.

Checkpoint File

00000000000000000010.checkpoint.parquet -> Represents a Checkpoint file which captures everything from commit 0.

We could introduce a new "Log Compaction" file.

Log Compaction

A Log Compaction represents the compaction of deltas between version ‘X’ and version ‘Y’ (both bounds inclusive). It can have the following structure:

x.y.compact.json -> Represents all the changes from commit ‘X’ through commit ‘Y’. e.g. 00000000000000000100.00000000000000000200.compact.json
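
The naming scheme can be sketched as a pair of helpers. This is an illustration only: the function names are hypothetical, and the 20-digit zero padding is inferred from the example file names in this issue.

```python
import re
from typing import Optional, Tuple

# Two 20-digit, zero-padded versions followed by ".compact.json".
_COMPACT_RE = re.compile(r"(\d{20})\.(\d{20})\.compact\.json")


def compaction_file_name(start: int, end: int) -> str:
    """Build the file name for a compaction covering [start, end] inclusive."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    return f"{start:020d}.{end:020d}.compact.json"


def parse_compaction_file_name(name: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) for a compaction file name, or None for anything else."""
    match = _COMPACT_RE.fullmatch(name)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    return (start, end) if start <= end else None


if __name__ == "__main__":
    name = compaction_file_name(100, 200)
    print(name)
    print(parse_compaction_file_name(name))
```

Commit files and checkpoint files do not match the pattern, so `parse_compaction_file_name` returns `None` for them.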

Instead of doing a full checkpoint every 10 commits, we could do a mix of minor compactions and full checkpoints. We could use different heuristics/policies to decide when to do a minor compaction as opposed to a full checkpoint.
We could use a post-commit hook to trigger creation of the minor compactions.

When to trigger MinorCompaction, Checkpointing

Old Rule

  • Trigger a checkpoint every ‘x’ commits

New Rule

  • Trigger a checkpoint every ‘y’ commits
  • Trigger a minor-compaction every ‘x’ commits if all individual ‘x’ commits are small
  • Trigger a checkpoint if a large commit lands
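
One way the new rule could be evaluated in a post-commit hook is sketched below. Everything here is an assumption made for illustration: commit size is measured in bytes, "large" means at or above a hypothetical `large_commit_bytes` threshold, intervals are checked with a modulo on the version, and a large commit triggers a checkpoint immediately (the rule itself leaves that timing open).

```python
from enum import Enum
from typing import Sequence


class Action(Enum):
    NONE = "none"
    MINOR_COMPACTION = "minor_compaction"
    CHECKPOINT = "checkpoint"


def choose_action(
    version: int,
    recent_commit_sizes: Sequence[int],
    compaction_interval: int = 10,          # 'x' in the rule (hypothetical default)
    checkpoint_interval: int = 20,          # 'y' in the rule (hypothetical default)
    large_commit_bytes: int = 10 * 1024 * 1024,  # hypothetical threshold
) -> Action:
    """Decide what to do after committing `version`.

    `recent_commit_sizes` holds commit sizes in bytes, oldest first, ending
    with the commit that was just written.
    """
    if not recent_commit_sizes:
        raise ValueError("recent_commit_sizes must include the latest commit")
    # Assumption: a large commit triggers a checkpoint right away.
    if recent_commit_sizes[-1] >= large_commit_bytes:
        return Action.CHECKPOINT
    if version > 0 and version % checkpoint_interval == 0:
        return Action.CHECKPOINT
    if version > 0 and version % compaction_interval == 0:
        window = recent_commit_sizes[-compaction_interval:]
        all_small = all(size < large_commit_bytes for size in window)
        if len(window) == compaction_interval and all_small:
            return Action.MINOR_COMPACTION
    return Action.NONE


if __name__ == "__main__":
    print(choose_action(10, [1024] * 10))
    print(choose_action(20, [1024] * 10))
```

In this sketch a window that contains an earlier large commit yields no minor compaction, on the assumption that the large commit already triggered a checkpoint when it landed.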

Metadata Cleanup

We could apply LOG_RETENTION (defaults to 30 days) to compacted delta files as well, i.e. when we delete all old checkpoints before a given version ‘X’, we also delete all compacted deltas that have startVersion <= X.
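
The cleanup rule can be written down literally as a filter over compaction ranges. This is a minimal sketch, not actual Delta code; the function name is hypothetical and each compaction is represented as a `(startVersion, endVersion)` tuple.

```python
from typing import Iterable, List, Tuple


def compactions_to_delete(
    compactions: Iterable[Tuple[int, int]], cutoff_version: int
) -> List[Tuple[int, int]]:
    """Select compactions whose startVersion <= cutoff_version ('X' in the text)."""
    return [(start, end) for start, end in compactions if start <= cutoff_version]


if __name__ == "__main__":
    existing = [(1, 9), (11, 19), (21, 29)]
    print(compactions_to_delete(existing, cutoff_version=15))
```

Under the rule as written, a compaction that starts at or before X is selected even if it ends after X, as `(11, 19)` is for X = 15 in the example.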

Compatibility with older Delta versions

Older Delta versions won’t have the capability to read/write compacted delta files. They will ignore such files and create a Snapshot backed by the last available checkpoint and delta files.
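
The difference between the two kinds of reader can be sketched as a file-selection function over a log directory listing. This is an illustration under assumptions, not the actual Delta implementation: the function name and the `use_compactions` flag are hypothetical, and the compaction-aware path simply picks the longest usable compaction that starts at the next needed version.

```python
import re
from typing import Dict, Iterable, List, Tuple

_COMMIT_RE = re.compile(r"(\d{20})\.json")
_COMPACT_RE = re.compile(r"(\d{20})\.(\d{20})\.compact\.json")


def files_to_replay(
    listing: Iterable[str],
    checkpoint_version: int,
    target_version: int,
    use_compactions: bool,
) -> List[str]:
    """Pick the files to replay on top of the checkpoint to reach target_version."""
    commits: Dict[int, str] = {}
    compactions: Dict[int, Tuple[int, str]] = {}  # start -> (end, file name)
    for name in listing:
        commit = _COMMIT_RE.fullmatch(name)
        if commit:
            commits[int(commit.group(1))] = name
            continue
        compact = _COMPACT_RE.fullmatch(name)
        if compact and use_compactions:
            start, end = int(compact.group(1)), int(compact.group(2))
            best_end = compactions.get(start, (-1, ""))[0]
            if start <= end <= target_version and end > best_end:
                compactions[start] = (end, name)
        # Any other file is ignored, including compactions for an older reader.
    selected: List[str] = []
    version = checkpoint_version + 1
    while version <= target_version:
        if version in compactions:
            end, name = compactions[version]
            selected.append(name)
            version = end + 1
        else:
            selected.append(commits[version])  # KeyError if a commit is missing
            version += 1
    return selected


if __name__ == "__main__":
    listing = [f"{v:020d}.json" for v in range(21)]
    listing.append(f"{10:020d}.checkpoint.parquet")
    listing.append(f"{11:020d}.{15:020d}.compact.json")
    print(len(files_to_replay(listing, 10, 18, use_compactions=False)))
    print(len(files_to_replay(listing, 10, 18, use_compactions=True)))
```

With a checkpoint at version 10 and a compaction covering versions 11 through 15, a reader that ignores compactions replays eight commit files to reach version 18, while a compaction-aware reader replays four files.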

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
@prakharjain09 prakharjain09 added the enhancement New feature or request label Sep 18, 2023
@prakharjain09 prakharjain09 changed the title [Feature Request] [Feature Request] Log Compaction in Delta Sep 18, 2023
@felipepessoto
Contributor

New Rule

Trigger a checkpoint every ‘y’ commits
Trigger a minor-compaction every ‘x’ commits if all individual ‘x’ commits are small
Trigger a checkpoint if a large commit lands

In “Trigger a checkpoint if a large commit lands”, do we want to trigger a checkpoint whenever we have a large commit, or only after 'x' commits?

vkorukanti pushed a commit that referenced this issue Oct 2, 2023
This PR adds read support for log compactions described here: #2072

Closes #2073

GitOrigin-RevId: 6f4a09c3fa09c303cdeb747c382cedcfda5a2a4c
vkorukanti pushed a commit that referenced this issue Oct 2, 2023
Protocol changes for log compaction
Issue: #2072

Closes #2122

GitOrigin-RevId: c15ff24a2a4242520f5cf8ffdb8604a4ffc36805
@prakharjain09
Collaborator Author

Creating a checkpoint whenever a large commit lands is useful in several ways:

  1. If the implementation can do predicate pushdown, it can reduce the amount of the Delta log it reads in a READ query.
  2. Reading a Parquet file might be more efficient than reading a big commit JSON file due to the 8-10X compression in Parquet.

@felipepessoto
Contributor

@prakharjain09 does Delta currently only support reading Log Compaction files? Are there any plans for write support?
