-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip deleted WALs during recovery #3765
Conversation
Summary: This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Test Plan: 1. Modify and run existing transaction_test; 2. manual test forward compatibility with DB generated with the newer release.
a920384
to
44fd82c
Compare
@siying has updated the pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@siying has updated the pull request. View: changes, changes since last import |
@siying has updated the pull request. View: changes, changes since last import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@siying has updated the pull request. View: changes, changes since last import |
assert(log != 0); | ||
std::lock_guard<std::mutex> lock(prepared_section_completed_mutex_); | ||
auto it = prepared_section_completed_.find(log); | ||
if (UNLIKELY(it == prepared_section_completed_.end())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is unlikely removed when it is moved to the new class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment addressed?
db/memtable_list.h
Outdated
@@ -243,7 +245,8 @@ class MemTableList { | |||
|
|||
size_t* current_memory_usage() { return ¤t_memory_usage_; } | |||
|
|||
uint64_t GetMinLogContainingPrepSection(); | |||
uint64_t PrecomputeMinLogContainingPrepSection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add this inline comment:
Returns the min log containing the prep section after memtables listsed in memtables_to_flush are flushed and their status is persisted in manifest.
db/db_impl.h
Outdated
@@ -727,7 +730,6 @@ class DBImpl : public DB { | |||
uint64_t* seq_used = nullptr, size_t batch_cnt = 0, | |||
PreReleaseCallback* pre_release_callback = nullptr); | |||
|
|||
uint64_t FindMinLogContainingOutstandingPrep(); | |||
uint64_t FindMinPrepLogReferencedByMemTable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function is not implemented. should be removed?
db/db_impl_write.cc
Outdated
@@ -1033,7 +1033,8 @@ Status DBImpl::SwitchWAL(WriteContext* write_context) { | |||
} | |||
|
|||
auto oldest_alive_log = alive_log_files_.begin()->number; | |||
auto oldest_log_with_uncommited_prep = FindMinLogContainingOutstandingPrep(); | |||
auto oldest_log_with_uncommited_prep = | |||
logs_with_prep_tracker_.FindMinLogContainingOutstandingPrep(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be moved to inside the next if-then clause? if allow_2pc is false, we do not have to invoke this function.
db/version_set.cc
Outdated
@@ -3647,6 +3668,11 @@ Status VersionSet::DumpManifest(Options& options, std::string& dscname, | |||
cfd->SetLogNumber(edit.log_number_); | |||
} | |||
|
|||
if (cfd != nullptr && edit.has_min_log_number_to_keep_ && | |||
edit.min_log_number_to_keep_ > latest_min_log_number_to_keep_) { | |||
latest_min_log_number_to_keep_ = edit.min_log_number_to_keep_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few lines below we call MarkMinLogNumberToKeep which does the same thing. The only difference is that over there it is not protected by (cfd != nullptr). What is the reason for the duplication?
db/version_set.h
Outdated
// REQUIRED: this is only called during single-threaded recovery or repair, or | ||
// from ::LogAndApply where the global mutex is held. | ||
void MarkMinLogNumberToKeep(uint64_t number); | ||
|
||
// Return the log file number for the log file that is currently | ||
// being compacted, or zero if there is no such log file. | ||
uint64_t prev_log_number() const { return prev_log_number_; } | ||
|
||
// Returns the minimum log number such that all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment correct? Logs lower than MinLogNumber might not be eligible for deletion because they have an uncommitted prepare record on them.
db/memtable_list.cc
Outdated
// Return the earliest log file to keep after the memtable flush is | ||
// finalized. | ||
uint64_t PrecomputeMinLogNumberToKeep( | ||
ColumnFamilyData* cfd, autovector<VersionEdit*> edit_list, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename cfd to cfd_to_flush. Also can we add this comment here:
cfd is the column family whose memtable will be flushed and thus will not depend on any WAL file.
db/memtable_list.cc
Outdated
LogsWithPrepTracker* prep_tracker, VersionSet* vset) { | ||
assert(prep_tracker != nullptr); | ||
// Calculate updated min_log_number_to_keep | ||
// Only do it in 2PC mode, because in non-2pc mode, log number in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see a branch on is_2pc() here. Did you meant that the entire PrecomputeMinLogNumberToKeep should be called only in 2pc? if yes can we move this comment up to the function definition?
db/memtable_list.cc
Outdated
// Calculate updated min_log_number_to_keep | ||
// Only do it in 2PC mode, because in non-2pc mode, log number in | ||
// the version edit should be sufficient. | ||
uint64_t cf_min_log_number_to_keep = cfd->GetLogNumber(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we initialize it with cfd->GetLogNumber()? Is not cfd going to be flushed anyway? After that do we need to hold on the to the log number of cfd?
db/version_set.h
Outdated
@@ -802,6 +802,10 @@ class VersionSet { | |||
|
|||
uint64_t current_next_file_number() const { return next_file_number_.load(); } | |||
|
|||
uint64_t latest_min_log_number_to_keep() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for not following the naming convention for functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
db/db_impl_files.cc
Outdated
} | ||
return versions_->latest_min_log_number_to_keep(); | ||
} else { | ||
return versions_->MinLogNumber(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name-wise latest_min_log_number_to_keep and MinLogNumber are suggesting similar things and that is confusing. Can we use a different name to distinguish what each is doing? We might even consider to put "2pc" in the name of latest_min_log_number_to_keep().
db/db_impl_files.cc
Outdated
@@ -350,6 +242,8 @@ bool CompareCandidateFile(const JobContext::CandidateFileInfo& first, | |||
}; // namespace | |||
|
|||
// Delete obsolete files and log status and information of file deletion | |||
// Note: All WAL files must be deleted through this function (unelss they are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this note necessary? Seems to be a left-over from the previously reverted commit.
db/memtable_list.cc
Outdated
@@ -319,12 +319,95 @@ void MemTableList::RollbackMemtableFlush(const autovector<MemTable*>& mems, | |||
imm_flush_needed.store(true, std::memory_order_release); | |||
} | |||
|
|||
namespace { | |||
uint64_t FindMinPrepLogReferencedByMemTable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is a good idea to move this function memtable_list. This has resulted into duplicating the logic into TEST_FindMinPrepLogReferencedByMemTable when needed by tests. In long term this could result into having tests testing a code that will not be run in production.
@siying has updated the pull request. View: changes, changes since last import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
db/db_impl.h
Outdated
// Return the earliest log file to keep after the memtable flush is | ||
// finalized. | ||
// `cfd_to_flush` is the column family whose memtable (specified in | ||
// `memtables_to_flush` will be flushed and thus will not depend on any WAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: missing parentheses after memtables_to_flush
db/db_impl.h
Outdated
// will not depend on any WAL file. nullptr means no memtable is being flushed. | ||
// The function is only applicable to 2pc mode. | ||
extern uint64_t FindMinPrepLogReferencedByMemTable( | ||
VersionSet* vset, ColumnFamilyData* cfd_to_flush, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: VersionSet* in this function is before ColumnFamilyData* but in PrecomputeMinLogNumberToKeep it is after. It would be better if the order are consistent between the two functions. Also it would be great to add a comment here telling the difference between FindMinPrepLogReferencedByMemTable and PrecomputeMinLogNumberToKeep.
db/db_impl_write.cc
Outdated
immutable_db_options_.info_log, | ||
"Unable to release oldest log due to uncommited transaction"); | ||
unable_to_release_oldest_log_ = true; | ||
flush_will_release_oldest_log = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this var name very confusing. Did it mean to say "flush_will_not_release_oldest_log"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Sorry for the confusion. I'll fix.
db/db_impl_write.cc
Outdated
logs_with_prep_tracker_.FindMinLogContainingOutstandingPrep(); | ||
|
||
if (oldest_log_with_uncommited_prep > 0 && | ||
oldest_log_with_uncommited_prep <= oldest_alive_log) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if alive_log_files_ is the set of whole log files known to rocksdb, then how could oldest_log_with_uncommited_prep be possibility less than oldest_alive_log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's existing code. I don't remember. It's probably won't Since it's not wrong to have "<=" I would keep it there. I dare not change it. Who knows whether changing it will trigger a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rethought about it. We can take the risk. Let me change it to an assert.
db/db_impl_files.cc
Outdated
if (cf_min_log_number_to_keep == 0) { | ||
// No version edit contains information on log number. The log number | ||
// should stay the same as it is. | ||
cf_min_log_number_to_keep = cfd_to_flush->GetLogNumber(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still confused why log number of cfd_to_flush gets special treatment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to change it to vset->MinLogNumberWithUnflushedData()
.
db/db_impl_write.cc
Outdated
|
||
if (oldest_log_with_uncommited_prep > 0 && | ||
oldest_log_with_uncommited_prep <= oldest_alive_log) { | ||
if (unable_to_release_oldest_log_) { | ||
// we already attempted to flush all column families dependent on | ||
// the oldest alive log but the log still contained uncommited transactions. | ||
// the oldest alive log STILL contains uncommited transaction so there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 2nd and 3rd lines are duplicates.
@siying has updated the pull request. View: changes, changes since last import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@siying has updated the pull request. View: changes, changes since last import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siying has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siying is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
oops, land failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siying is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction) This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2. Closes #3765 Differential Revision: D7747618 Pulled By: siying fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
. PR facebook#3488 introduced a new format of `VersionEdit` encoding which is not understandable older versions. Thus, the change broke forward compatibility and has been reverted in PR facebook#3762. Later, PR facebook#3765 reiterated the concept but in a way that does not provide compatibility with the very short-living format of PR facebook#3488. This change tries to address the issue of accessing DBs in format of facebook#3488 by ignoring the special entries. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Test Plan: 1. Modify and run existing transaction_test; 2. manual test forward compatibility with DB generated with the newer release.