-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix mempurge crash reported in #8958 #9671
Fix mempurge crash reported in #8958 #9671
Conversation
@bjlemaire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
It looks like the only test failure ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you build a regression test for this?
db/memtable_list.cc
Outdated
@@ -366,21 +366,17 @@ void MemTableList::PickMemtablesToFlush(uint64_t max_memtable_id, | |||
imm_flush_needed.store(false, std::memory_order_release); | |||
} | |||
m->flush_in_progress_ = true; // flushing will start very soon | |||
if (max_next_log_number) { | |||
*max_next_log_number = m->GetNextLogNumber() > *max_next_log_number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use std::max
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
3 similar comments
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Regression test: no performance impact detected. Details: Results over 10 experiments: |
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
04622c5
to
862e70f
Compare
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
1 similar comment
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
8b613ca
to
eb5804d
Compare
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM. Are you ready to re-enable in stress test?
Ready to re-enable Mempurge in the stress test - I will re-enable it once this PR lands. |
@bjlemaire has updated the pull request. You must reimport the pull request before landing. |
@bjlemaire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Set `experimental_mempurge_threshold` back to `lambda: 10.0*random.random()` in crash test, reverting #8958 after fix provided in #9671 . Pull Request resolved: #9684 Reviewed By: pdillinger Differential Revision: D34820257 Pulled By: bjlemaire fbshipit-source-id: 1e5ae8c872c4ac4c4267c990ac5e3e793d77908c
Change the
MemPurge
code to address a failure during a crash test reported in #8958.Details and results of the crash investigation:
These failures happened in a specific scenario where the list of immutable tables was composed of 2 or more memtables, and the last memtable was the output of a previous
Mempurge
operation. Because thePickMemtablesToFlush
function included a sorting of the memtables (previous PR related to the Mempurge project), and because theVersionEdit
of the flush class is piggybacked onto a single one of these memtables, theVersionEdit
was not properly selected and applied to theVersionSet
of the DB. Since theVersionSet
was not edited properly, the database was losing track of the SST file created during the flush process, which was subsequently deleted (and as you can expect, caused the tests to crash).The following command consistently failed, which was quite convenient to investigate the issue:
$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done
Solution proposed
The memtables are no longer sorted based on their
memtableID
in thePickMemtablesToFlush
function. Additionally, thenext_log_number
of the memtable created as an output of theMempurge
function now takes in the correct value (the log number of the first memtable being mempurged). Finally, the VersionEdit object of the flush class now takes the maximumnext_log_number
of the stack of memtables being flushed, which doesnt change anything when Mempurge isoff
but becomes necessary when Mempurge ison
.Testing of the solution
The following command no longer fails:
$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done
Additionally, I ran
db_crashtest
(whitebox
andblackbox
) for 2.5 hours with MemPurge on and did not observe any crash.