Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mempurge crash reported in #8958 #9671

Closed

Conversation

bjlemaire
Copy link
Contributor

Change the MemPurge code to address a failure during a crash test reported in #8958.

Details and results of the crash investigation:

These failures happened in a specific scenario where the list of immutable tables was composed of 2 or more memtables, and the last memtable was the output of a previous Mempurge operation. Because the PickMemtablesToFlush function included a sorting of the memtables (previous PR related to the Mempurge project), and because the VersionEdit of the flush class is piggybacked onto a single one of these memtables, the VersionEdit was not properly selected and applied to the VersionSet of the DB. Since the VersionSet was not edited properly, the database was losing track of the SST file created during the flush process, which was subsequently deleted (and as you can expect, caused the tests to crash).
The following command consistently failed, which was quite convenient to investigate the issue:
$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done

Solution proposed

The memtables are no longer sorted based on their memtableID in the PickMemtablesToFlush function. Additionally, the next_log_number of the memtable created as an output of the Mempurge function now takes in the correct value (the log number of the first memtable being mempurged). Finally, the VersionEdit object of the flush class now takes the maximum next_log_number of the stack of memtables being flushed, which doesnt change anything when Mempurge is off but becomes necessary when Mempurge is on.

Testing of the solution

The following command no longer fails:
$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done
Additionally, I ran db_crashtest (whitebox and blackbox) for 2.5 hours with MemPurge on and did not observe any crash.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@bjlemaire
Copy link
Contributor Author

It looks like the only test failure (ci/circleci: build-windows-vs2017) isn't related to this PR. I see other PRs with the same test failure.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you build a regression test for this?

@@ -366,21 +366,17 @@ void MemTableList::PickMemtablesToFlush(uint64_t max_memtable_id,
imm_flush_needed.store(false, std::memory_order_release);
}
m->flush_in_progress_ = true; // flushing will start very soon
if (max_next_log_number) {
*max_next_log_number = m->GetNextLogNumber() > *max_next_log_number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use std::max

@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

3 similar comments
@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@bjlemaire
Copy link
Contributor Author

Regression test: no performance impact detected.

Details:
The regression impact of this PR was assessed by comparing this branch with main (commit 565fcea )
The following command was used to create the db_bench executable:
make clean; make -j$(nproc) release
The following command was used to create a single db_bench experiment:
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq -num=20000000 -compression_type=none

Results over 10 experiments:
This branch:
2.513 +/- 0.041 micros/ops; 3.981e5 +/- 0.65e4 ops/sec; 44.05 +/- 0.72 MB/s
Main:
2.518 +/- 0.116 micros/ops; 3.978e5 +/- 1.66e4 ops/sec; 44.00 +/- 1.84 MB/s

@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@bjlemaire bjlemaire force-pushed the fix_mempurge_multithread_conflict branch 2 times, most recently from 04622c5 to 862e70f Compare March 9, 2022 23:25
@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@bjlemaire bjlemaire force-pushed the fix_mempurge_multithread_conflict branch from 8b613ca to eb5804d Compare March 10, 2022 17:21
@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM. Are you ready to re-enable in stress test?

db/db_flush_test.cc Outdated Show resolved Hide resolved
@bjlemaire
Copy link
Contributor Author

Ready to re-enable Mempurge in the stress test - I will re-enable it once this PR lands.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@bjlemaire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Mar 11, 2022
Summary:
Set `experimental_mempurge_threshold` back to `lambda: 10.0*random.random()` in crash test, reverting #8958 after fix provided in #9671 .

Pull Request resolved: #9684

Reviewed By: pdillinger

Differential Revision: D34820257

Pulled By: bjlemaire

fbshipit-source-id: 1e5ae8c872c4ac4c4267c990ac5e3e793d77908c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants