
occasional delay in deletion of compacted files #488

Closed
kmuthukk opened this issue Sep 28, 2018 · 1 comment
Labels: kind/enhancement (This is an enhancement of an existing feature)

@kmuthukk (Collaborator) commented:

In some situations, at the end of a compaction, the input files are not deleted in a timely fashion (it took 6 minutes in this case), even though there are no outstanding IOs against those files.

Details:
Attached are two files, relevant_files.txt and compact_flush_events (1).txt, that relate to a particular tablet: tablet-73ad3ef2ca8b4e63912c1b1187a1b337

Additional notes on things seen in the logs:

a) Job 2079 is the start of a compaction at 18:25:

I0906 18:25:23.904989 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536258323904962, "job": 2079, "event": "compaction_started", "files_L0": [2080, 2079, 2078, 2077, 2076, 2071, 2060, 2039, 1997, 1915, 1757, 1483,1018], "score": 2.6, "input_data_size": 93617473293}

b) when job 2079 finishes, around 19:10, it produces file "2084"

I0906 19:10:00.190961 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000190935, "cf_name": "default", "job": 2079, "event": "table_file_creation", "file_number": 2084, "file_size": 86248165981, "table_properties": {"data_size": 85800931365, "data_index_size": 144715064, "filter_size": 374949932, "filter_index_size": 163671, "raw_key_size": 27596363009, "raw_average_key_size": 44, "raw_value_size": 80726824602, "raw_average_value_size": 129, "num_data_blocks": 3067595, "num_entries": 625789338, "num_filter_blocks": 5726, "num_data_index_blocks": 4422, "filter_policy_name": "DocKeyHashedComponentsFilter", "kDeletedKeys": "0"}}
I0906 19:10:00.216228 22483 event_logger.cc:77] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000206121, "job": 2079, "event": "compaction_finished", "compaction_time_micros": 2676291620, "output_level": 0, "num_output_files": 1, "total_output_size": 86248165981, "num_input_records": 625789338, "num_output_records": 625789338, "num_subcompactions": 1, "lsm_state": [8]}
I0906 19:10:00.238049 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000238041, "job": 2079, "event": "table_file_deletion", "file_number": 2096}
I0906 19:10:00.248080 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000248073, "job": 2079, "event": "table_file_deletion", "file_number": 2094}
I0906 19:10:00.255952 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000255946, "job": 2079, "event": "table_file_deletion", "file_number": 2093}
I0906 19:10:00.266968 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000266961, "job": 2079, "event": "table_file_deletion", "file_number": 2092}
I0906 19:10:00.287042 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000287033, "job": 2079, "event": "table_file_deletion", "file_number": 2091}
I0906 19:10:00.297960 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000297954, "job": 2079, "event": "table_file_deletion", "file_number": 2090}
I0906 19:10:00.312986 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000312978, "job": 2079, "event": "table_file_deletion", "file_number": 2089}
I0906 19:10:00.334969 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000334962, "job": 2079, "event": "table_file_deletion", "file_number": 2088}
I0906 19:10:00.345963 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000345958, "job": 2079, "event": "table_file_deletion", "file_number": 2087}
I0906 19:10:00.365905 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000365898, "job": 2079, "event": "table_file_deletion", "file_number": 2086}
I0906 19:10:00.376067 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261000376060, "job": 2079, "event": "table_file_deletion", "file_number": 2085}
I0906 19:16:52.336963 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412336956, "job": 2100, "event": "table_file_deletion", "file_number": 2079}

c) Notice that the files job 2079 deletes after the compaction are not the files that were inputs to its compaction. Instead, they are files with newer numbers (> 2084), produced by compactions that started after job 2079 began and finished before it did.

d) The files that were inputs to compaction job 2079 are deleted much later: six minutes later, when an unrelated compaction (job 2100) finishes.

I0906 19:16:52.019984 22483 event_logger.cc:77] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412015140, "job": 2100, "event": "compaction_finished", "compaction_time_micros": 25217153, "output_level": 0, "num_output_files": 1, "total_output_size": 960934045, "num_input_records": 5964734, "num_output_records": 5964734, "num_subcompactions": 1, "lsm_state": [5]}
I0906 19:16:52.063949 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412063939, "job": 2100, "event": "table_file_deletion", "file_number": 2101}
I0906 19:16:52.131935 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412131929, "job": 2100, "event": "table_file_deletion", "file_number": 2100}
I0906 19:16:52.150996 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412150990, "job": 2100, "event": "table_file_deletion", "file_number": 2099}
I0906 19:16:52.173065 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412173056, "job": 2100, "event": "table_file_deletion", "file_number": 2098}
I0906 19:16:52.183951 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412183944, "job": 2100, "event": "table_file_deletion", "file_number": 2097}
I0906 19:16:52.302953 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412302944, "job": 2100, "event": "table_file_deletion", "file_number": 2095}
I0906 19:16:52.315939 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412315931, "job": 2100, "event": "table_file_deletion", "file_number": 2080}
I0906 19:16:52.336963 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412336956, "job": 2100, "event": "table_file_deletion", "file_number": 2079}
I0906 19:16:52.351933 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412351923, "job": 2100, "event": "table_file_deletion", "file_number": 2078}
I0906 19:16:52.369967 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412369958, "job": 2100, "event": "table_file_deletion", "file_number": 2077}
I0906 19:16:52.392042 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412392035, "job": 2100, "event": "table_file_deletion", "file_number": 2076}
I0906 19:16:52.419409 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412419399, "job": 2100, "event": "table_file_deletion", "file_number": 2071}
I0906 19:16:52.461997 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412461987, "job": 2100, "event": "table_file_deletion", "file_number": 2060}
I0906 19:16:52.514164 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412514155, "job": 2100, "event": "table_file_deletion", "file_number": 2039}
I0906 19:16:52.581976 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412581964, "job": 2100, "event": "table_file_deletion", "file_number": 1997}
I0906 19:16:52.714953 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412714944, "job": 2100, "event": "table_file_deletion", "file_number": 1915}
I0906 19:16:52.916260 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261412916253, "job": 2100, "event": "table_file_deletion", "file_number": 1757}
I0906 19:16:53.351264 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261413351255, "job": 2100, "event": "table_file_deletion", "file_number": 1483}
I0906 19:16:54.065317 22483 event_logger.cc:67] T 73ad3ef2ca8b4e63912c1b1187a1b337: EVENT_LOG_v1 {"time_micros": 1536261414065311, "job": 2100, "event": "table_file_deletion", "file_number": 1018}

So during this six-minute window, if the compaction was, for example, a major compaction, we are using twice the space. We might also start other major compactions during this time, causing much more temporary/garbage space to be used than necessary.

@kmuthukk kmuthukk added the kind/enhancement This is an enhancement of an existing feature label Sep 28, 2018
@kmuthukk kmuthukk added this to To Do in YBase features via automation Sep 28, 2018
@ttyusupov ttyusupov moved this from To Do to In progress in YBase features Dec 16, 2018
yugabyte-ci pushed a commit that referenced this issue Dec 16, 2018
Summary:
Here is an excerpt from RocksDB wiki with a description of "Version" data structure:
> The list of files in an LSM tree is kept in a data structure called version. In the end of a compaction or a mem table flush, a new version is created for the updated LSM tree. At one time, there is only one "current" version that represents the files in the up-to-date LSM tree. New get requests or new iterators will use the current version through the whole read process or life cycle of iterator. All versions used by get or iterators need to be kept. An out-of-date version that is not used by any get or iterator needs to be dropped. All files not used by any other version need to be deleted.
> ...
> Both an SST file and a version have a reference count. When we create a version, we increment the reference counts for all files. If a version is not needed, all files of the version have their reference counts decremented. If a file’s reference count drops to 0, the file can be deleted.
> In a similar way, each version has a reference count. When a version is created, it is an up-to-date one, so it has reference count 1. If the version is not up-to-date anymore, its reference count is decremented. Anyone who needs to work on the version has its reference count incremented by 1, and decremented by 1 when finishing using it. When a version’s reference count is 0, it should be removed. Either a version is up-to-date or someone is using it, its reference count is not 0, so it will be kept.
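The reference-counting scheme the wiki describes can be modeled with a minimal sketch (hypothetical Python classes, not RocksDB's actual C++ types): a file becomes deletable only when every version that references it has been released, which is exactly why a pinned version delays deletion of its files.

```python
# Minimal sketch of version/file reference counting (illustrative only).

class SstFile:
    def __init__(self, number):
        self.number = number
        self.refs = 0

class Version:
    def __init__(self, files):
        self.files = files
        self.refs = 1            # a newly created version is "current"
        for f in files:
            f.refs += 1          # the version holds a ref on each file

    def ref(self):
        self.refs += 1

    def unref(self):
        """Release one reference; return file numbers that became deletable."""
        self.refs -= 1
        deletable = []
        if self.refs == 0:       # version dropped: release its file refs
            for f in self.files:
                f.refs -= 1
                if f.refs == 0:
                    deletable.append(f.number)
        return deletable

f1, f2 = SstFile(1), SstFile(2)
v1 = Version([f1, f2])           # current version
v1.ref()                         # e.g. a concurrent job pins it
v2 = Version([f2])               # compaction output: file 1 now obsolete
assert v1.unref() == []          # v1 no longer current, but still pinned
assert v1.unref() == [1]         # last ref released -> file 1 deletable
```

This is the mechanism behind delay cases 1 and 2 below: as long as some job holds a reference to an old version, every file in that version stays on disk.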

A compaction job doesn't simply delete its input files. Instead, it finds obsolete files (ignoring the list of input files) and deletes them. When deleting obsolete files, it skips live SST files and pending output files.

There were several cases in which deletion of compacted files was delayed:
1) A concurrent flush job is holding the input version and therefore all files from that version.
2) At the end of a flush job, RocksDB can schedule a compaction, which starts holding its input version together with all files from that version (not only the input files of the scheduled compaction).
3) The `DBImpl::FindObsoleteFiles` and `DBImpl::PurgeObsoleteFiles` functions don't delete unreferenced SST files with numbers greater than or equal to `min_pending_output`, which means that if some job is still writing file #4, already-compacted and unused files #5, #6, and #7 cannot be deleted until the next compaction triggers deletion of obsolete files.
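Case 3 can be illustrated with a small sketch (hypothetical helper functions, not the actual `DBImpl` code): a single low-watermark blocks every unreferenced file numbered at or above the lowest pending output, whereas tracking the exact pending set blocks only the files actually being written.

```python
# Hypothetical illustration of the min_pending_output watermark problem.

def deletable_with_watermark(obsolete, pending_outputs):
    # Watermark approach: skip anything >= the lowest file still being written.
    min_pending = min(pending_outputs)
    return [n for n in obsolete if n < min_pending]

def deletable_with_exact_set(obsolete, pending_outputs):
    # Exact-set approach: skip only the files actually being written.
    return [n for n in obsolete if n not in pending_outputs]

obsolete = [5, 6, 7]        # already compacted, unreferenced
pending = {4}               # some job is still writing file #4

assert deletable_with_watermark(obsolete, pending) == []        # all blocked
assert deletable_with_exact_set(obsolete, pending) == [5, 6, 7] # none blocked
```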

This diff includes the following changes to address the issue:
1) Don't hold a version during flush.
2) In the case of universal compaction, we don't actually need to hold the whole input version, so we hold only the input files and store some minimal information from the input version.
3) Instead of relying on `min_pending_output`, the utility classes `FileNumbersHolder` and `FileNumbersProvider` were implemented to track the exact set of pending output files without blocking deletion of other unreferenced SST files.
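A minimal sketch of the idea behind `FileNumbersHolder`/`FileNumbersProvider` (the names come from the diff; this Python model is illustrative, not the actual C++ implementation): a job registers its output file numbers only for the duration of the write, and obsolete-file deletion consults that exact set rather than a watermark.

```python
import contextlib

class FileNumbersProvider:
    """Tracks the exact set of file numbers currently being written."""
    def __init__(self):
        self.pending = set()

    @contextlib.contextmanager
    def holding(self, number):
        # The "holder": pins the number only while the job is writing it.
        self.pending.add(number)
        try:
            yield number
        finally:
            self.pending.discard(number)

    def can_delete(self, number):
        return number not in self.pending

provider = FileNumbersProvider()
with provider.holding(4):            # a job is writing file #4
    assert not provider.can_delete(4)
    assert provider.can_delete(7)    # obsolete #7 is no longer blocked
assert provider.can_delete(4)        # released when the job completes
```

With a scheme like this, an obsolete file such as #7 can be deleted immediately even while a lower-numbered file is still being written, which is the behavior the watermark approach could not provide.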

Test Plan:
- Jenkins.
- Long-running test with CassandraKeyValue workload.
- Use debug check and logs to make sure SST files are deleted no later than 1 second after they were compacted.
- Added unit tests for all 3 cases.

Reviewers: mikhail, venkatesh, amitanand, sergei

Reviewed By: sergei

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5526
@ttyusupov ttyusupov moved this from In progress to Done in YBase features Dec 16, 2018
@kmuthukk (Collaborator, Author) commented:

Excellent work @ttyusupov on identifying the root cause and taking care of the tricky cases with test cases to go with.
