occasional delay in deletion of compacted files #488
Labels: kind/enhancement (This is an enhancement of an existing feature)

kmuthukk added the kind/enhancement label on Sep 28, 2018
yugabyte-ci pushed a commit that referenced this issue on Dec 16, 2018:
Summary:

Here is an excerpt from the RocksDB wiki describing the "Version" data structure:

> The list of files in an LSM tree is kept in a data structure called version. At the end of a compaction or a memtable flush, a new version is created for the updated LSM tree. At any time, there is only one "current" version that represents the files in the up-to-date LSM tree. New get requests or new iterators will use the current version through the whole read process or life cycle of the iterator. All versions used by gets or iterators need to be kept. An out-of-date version that is not used by any get or iterator needs to be dropped. All files not used by any version need to be deleted.
> ...
> Both an SST file and a version have a reference count. When we create a version, we increment the reference counts of all its files. If a version is no longer needed, all of the version's files have their reference counts decremented. If a file's reference count drops to 0, the file can be deleted.
> In a similar way, each version has a reference count. When a version is created, it is the up-to-date one, so it has reference count 1. If the version is no longer up-to-date, its reference count is decremented. Anyone who needs to work on the version increments its reference count by 1, and decrements it by 1 when finished using it. As long as a version is either up-to-date or in use, its reference count is not 0, so it will be kept. When a version's reference count reaches 0, it is removed.

A compaction job doesn't simply delete its input files. Instead, it finds obsolete files (ignoring its list of input files) and deletes those, skipping live SST files and pending output files. There were several cases in which deletion of compacted files was delayed:

1) A concurrent flush job holds the input version and therefore all files from that version.
2) At the end of a flush job, RocksDB can schedule a compaction, which then holds its input version together with all files from that version (not only the input files of the scheduled compaction).
3) The `DBImpl::FindObsoleteFiles` and `DBImpl::PurgeObsoleteFiles` functions don't delete unreferenced SST files with numbers greater than or equal to `min_pending_output`. This means that if some job is still writing file #4, already-compacted and unused files #5, #6, and #7 can't be deleted until the next compaction triggers another round of obsolete-file deletion.

This diff includes the following changes to address the issue:

1) Don't hold a version during flush.
2) In the case of universal compaction, we don't actually need to hold the whole input version, so we hold only the input files and store some minimal information from the input version.
3) Instead of relying on `min_pending_output`, the utility classes `FileNumbersHolder` and `FileNumbersProvider` were implemented to track the exact set of pending output files, so that deletion of other unreferenced SST files is not blocked.

Test Plan:
- Jenkins.
- Long-running test with the CassandraKeyValue workload.
- Use debug checks and logs to make sure SST files are deleted no later than 1 second after they were compacted.
- Added unit tests for all 3 cases.

Reviewers: mikhail, venkatesh, amitanand, sergei

Reviewed By: sergei

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5526
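To make the quoted mechanism concrete, here is a minimal sketch (hypothetical types, not RocksDB's actual `Version`/`FileMetaData` classes) of the two-level reference counting described above: creating a version pins its files, readers pin the version, and a file becomes deletable only when no version references it.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch of RocksDB-style version/file reference counting.
struct SstFile {
  int number;
  int refs = 0;
};

struct Version {
  int refs = 0;
  std::vector<SstFile*> files;

  explicit Version(std::vector<SstFile*> fs) : files(std::move(fs)) {
    for (auto* f : files) ++f->refs;  // creating a version pins its files
  }
  void Ref() { ++refs; }
  // Drops one reference; returns the numbers of files that became
  // deletable because this version's death released their last reference.
  std::vector<int> Unref() {
    std::vector<int> deletable;
    if (--refs == 0) {
      for (auto* f : files) {
        if (--f->refs == 0) deletable.push_back(f->number);
      }
    }
    return deletable;
  }
};
```

Note that any job taking a `Ref()` on a version pins every file in that version, not just the files the job cares about; this is exactly why the flush and scheduled-compaction cases (1 and 2 above) delayed deletion of already-compacted files.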
Excellent work @ttyusupov on identifying the root cause and taking care of the tricky cases, with test cases to go with.
In some situations, at the end of a compaction, the input files are not deleted in a timely fashion (it took 6 minutes in this case), even though there are no outstanding IOs against those files.
Details:

Attached are two files, relevant_files.txt and compact_flush_events (1).txt, that are related to a particular tablet: tablet-73ad3ef2ca8b4e63912c1b1187a1b337.
Additional notes on things seen in the logs:
a) Job 2079 is the start of a compaction at 18:25...
b) When job 2079 finishes, around 19:10, it produces file "2084".
c) Notice that the files job 2079 deletes after its compaction are not the files that were inputs to its compaction. Instead, they are files with newer numbers (numbers > 2084), from compactions that started after job 2079 and finished before job 2079 ended.
d) The files that were inputs to compaction job 2079 are deleted much later: six minutes later, when an unrelated compaction finishes.
So during this 6-minute window, if the compaction was, for example, a major compaction, we are using twice the space. And we might start other major compactions during this time as well, causing much more temporary/garbage space to be used than necessary.
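Part of this delay comes from the `min_pending_output` watermark described in the commit message above: a single lowest pending file number blocks deletion of every unreferenced file with a higher number. The exact-pending-set idea behind the fix (item 3 of the diff) can be sketched roughly as follows; the class below is a simplified illustration, not the actual `FileNumbersProvider` implementation.

```cpp
#include <cassert>
#include <set>

// Simplified sketch of tracking the exact set of pending output file
// numbers, instead of a single min_pending_output watermark.
class PendingFileTracker {
 public:
  // Allocates a new file number and marks it as pending (being written).
  int NewFileNumber() {
    int n = next_number_++;
    pending_.insert(n);
    return n;
  }
  // Marks file n as fully written; it is no longer pending.
  void Finish(int n) { pending_.erase(n); }
  // With exact tracking, an unreferenced file is deletable unless it is
  // itself still being written, regardless of other jobs' file numbers.
  bool CanDelete(int n) const { return pending_.count(n) == 0; }

 private:
  int next_number_ = 4;  // arbitrary starting number for illustration
  std::set<int> pending_;
};
```

Under the old watermark scheme, a job still writing file #4 would keep `min_pending_output` at 4 and block deletion of finished files #5, #6, and #7; with an exact pending set, only file #4 itself is protected.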