
Temporary files filling up _delta_log folder - increasing table load time #2351

Closed
mortnstak opened this issue Mar 27, 2024 · 9 comments · Fixed by #2356
Labels
bug Something isn't working

Comments

@mortnstak

Environment

Delta-rs version: 0.17.1

Binding: Python

Environment:

  • Cloud provider: Azure
  • OS: Windows 11
  • Other:

Bug

What happened:

We are streaming data from distributed writers. The main issue is that the time to run a DeltaTable() call for reading or writing grows over time, eventually reaching several minutes.

First, the source of the increasing load time was identified as the growing number of JSON files in the _delta_log folder. The DeltaTable call apparently performs a file listing operation, which takes longer as the number of files increases. This was fixed by setting delta.logRetentionDuration to 1 hour and running cleanup_metadata() on schedule.
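
For reference, a minimal sketch of that scheduled cleanup (the table URI is illustrative, and credentials would go in storage_options):

```python
from deltalake import DeltaTable

# Illustrative table URI; storage_options would carry the Azure credentials.
dt = DeltaTable("abfss://container@account.dfs.core.windows.net/events")

# Removes expired log entries older than the last checkpoint,
# honoring delta.logRetentionDuration (set to 1 hour on our table).
dt.cleanup_metadata()
```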

The second identified issue was a high number of temporary files filling up the _delta_log folder. The effect is the same: after three days of streaming, the number of .tmp files was above 30,000, and running DeltaTable() took 5 minutes. Manually deleting the files reduces load time to 5 seconds.

We do not know why these files are sometimes generated and not deleted. I manually triggered a CommitFailedError via concurrent compact operations, but this did not produce a .tmp file.

What you expected to happen:
I would expect the files to be removed by the cleanup_metadata() operation. And if they are not needed, they should be deleted as part of the operation that generates them.

How to reproduce it:

More details:
The files have names of the following pattern: _commit_00014eab-4847-44c3-91a1-3f06a2f7f5f5.json.tmp. They appear to be part of compact operations; a snippet from one of the files: {"commitInfo":{"timestamp":1711099511948,"operation":"OPTIMIZE" ...

The current workaround is to delete the temporary files on schedule.
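
Roughly what that scheduled deletion looks like, sketched for a local table (on Azure the same loop would run against an ADLS filesystem client such as adlfs instead of pathlib):

```python
import pathlib

# Illustrative local table root.
log_dir = pathlib.Path("/data/events/_delta_log")

# The orphaned files follow the pattern _commit_<uuid>.json.tmp.
for tmp in log_dir.glob("_commit_*.json.tmp"):
    tmp.unlink()
```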

Some more info on the workload: we are generating checkpoints about every 10th write, and we are running frequent compact operations on individual partitions, some of which conflict and generate CommitFailedErrors due to concurrent deletes.
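
A sketch of roughly how those compact calls are issued (the partition column name is illustrative):

```python
from deltalake import DeltaTable
from deltalake.exceptions import CommitFailedError

dt = DeltaTable("abfss://container@account.dfs.core.windows.net/events")

try:
    # Compact one partition; "date" is an illustrative partition column.
    dt.optimize.compact(partition_filters=[("date", "=", "2024-03-26")])
except CommitFailedError as err:
    # Concurrent compact/delete commits can lose the race.
    print(f"compaction failed: {err}")
```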

@mortnstak mortnstak added the bug Something isn't working label Mar 27, 2024
@ion-elgreco
Collaborator

You could likely include .json.tmp in the regex parsing of files that can be removed as long as they have a creation time before a checkpoint.
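
Something along those lines, sketched in Python for illustration (the actual cleanup logic lives in the Rust crate, so the real pattern and time check may differ):

```python
import re

# Illustrative pattern: a temp commit file is a UUID-named JSON with a
# .tmp suffix, e.g. _commit_00014eab-4847-44c3-91a1-3f06a2f7f5f5.json.tmp
TMP_COMMIT = re.compile(r"^_commit_[0-9a-fA-F-]{36}\.json\.tmp$")

def is_stale_tmp(name: str, modified_ts: float, checkpoint_ts: float) -> bool:
    # Only treat temp files created before the latest checkpoint as removable.
    return bool(TMP_COMMIT.match(name)) and modified_ts < checkpoint_ts
```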

@ion-elgreco
Collaborator

Also, in the future, if we switch to put-if-absent (#2296), those tmp files won't be written anyway.

@ion-elgreco
Collaborator

@mortnstak I added a PR to fix the issue.

ion-elgreco added a commit that referenced this issue Mar 28, 2024
# Description
We didn't clean up failed commits when possible.

# Related Issue(s)
- fixes #2351
@mortnstak
Author

Hi - I updated to Python v0.16.4 today, which explicitly refers to #2356 in the release notes. However, the issue does not appear to be fixed: .tmp files are still being generated and are not removed by cleanup_metadata().

@ion-elgreco
Collaborator

@mortnstak they are only removed up to a checkpoint
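
In other words, a sketch of forcing a checkpoint before cleaning up (same illustrative table URI as above):

```python
from deltalake import DeltaTable

dt = DeltaTable("abfss://container@account.dfs.core.windows.net/events")

# .tmp commits only become eligible for removal once a newer checkpoint
# exists, so write one before cleaning up.
dt.create_checkpoint()
dt.cleanup_metadata()
```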

@mortnstak
Author

@ion-elgreco - in our streaming data, the last checkpoint was created at 14:10. When running cleanup_metadata(), all .tmp files generated prior to 14:10 should be deleted, right? This does not happen.

@ion-elgreco
Collaborator

@mortnstak did you also set a logRetentionDuration? cleanup_metadata respects that as well.

@mortnstak
Author

@ion-elgreco - yes, this is set to 10 minutes, and JSON files are deleted according to this value.

@ion-elgreco
Collaborator

@mortnstak in that case, please provide a reproducible example so I can look into it. I added a test that checks that the .tmp files are deleted, and it passed.
