Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: data race in save_stages_controller #2647

Merged
merged 6 commits into from
Feb 23, 2024
Merged

fix: data race in save_stages_controller #2647

merged 6 commits into from
Feb 23, 2024

Conversation

kostasrim
Copy link
Contributor

@kostasrim kostasrim commented Feb 22, 2024

fixes #2644

  • Initialize resources atomically to avoid race condition
  • Finalize/cleanup resources atomically to avoid race condition
  • Refactor SaveStagesController::Save

@kostasrim kostasrim self-assigned this Feb 22, 2024
}

detail::SaveInfo save_info = save_controller_->Save();
save_controller_->WaitAllSnapshots();
Copy link
Contributor Author

@kostasrim kostasrim Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bonus points because now the function clearly suggests what's happening under the hood. @romange we can turn this function async easily. We could add a background fiber for that, which would turn BGSAVE from synchronous to asynchronous (bringing us on par with redis). This is not a priority but since it was mentioned in the issues I am bringing it up :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, this answers my first question.


SaveInfo Finalize() {
RunStage(&SaveStagesController::CloseCb);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done atomically as well

dranikpg
dranikpg previously approved these changes Feb 22, 2024
Copy link
Contributor

@dranikpg dranikpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If to follow it fully, the mechanism behind GetTotalBuffersSize() is quite complicated

For DFS we dispatch immediately from SaveStagesController and assume we're placed on each shard correctly

For RDB we call RdbSnapshot::Impl::GetSaveBuffersSize transparently and dispatch only there on all the snapshot threads

So we have branching twice, but on different levels 🤯

Left some nits

Comment on lines 73 to 80
SaveStagesController(SaveStagesInputs&& input);
std::optional<SaveInfo> InitResourcesAndStart();

~SaveStagesController();

SaveInfo Save();
void WaitAllSnapshots();

SaveInfo Finalize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever a type has a multi-step interface, please provide proper comments on how to use it correctly (and why actually those are different steps).

If people won't find comments, they'll misuse it very quickly. Proven by experience 😆

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially for a type that has outside synchronization which is sometimes difficult to follow. I'd make the case for internal synchronization, but we also need external for updating the pointer, so it's not worth dividing those

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also comment what returning optional means.... It's not evident that this is an error in fact

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the interface a bit non-intuitive. I also agree that this PR should not do lots of changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added comments. I wanted to push a refactor as I wrote somewhere else but I thought for not it's best to only fix the issue (even if it's not perfect)

Comment on lines 1303 to 1307
auto set_last_save_error = [this](const auto& save_info) mutable {
last_save_info_.last_error = save_info.error;
last_save_info_.last_error_time = save_info.save_time;
last_save_info_.failed_duration_sec = save_info.duration_sec;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If to speak strictly, it should really be an function on last safe info 😅 Actually, we have two types LastSaveInfo and SaveInfo, whereas the former should clearly be a subclass of the latter. We should add it as a refactoring todo (if nobody objects 😼)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree and since it's a very quick change, I did it now :) The only todo I will leave is a major TODO :refactor 🤣

Comment on lines 157 to 158
return GetSaveInfo();
SaveInfo save = GetSaveInfo();
return std::optional<SaveInfo>(std::move(save));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'm almost sure return GetSaveInfo() works as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my defense, I was experimenting with a false positive from you complete me. Basically I had it like this and I got a warning on Vim so I changed it because I was curious if the warning would go away (it didn't) and I left forgotten as is 👀

auto res = save_controller_->InitResourcesAndStart();

if (res) {
CHECK_EQ(res->error, true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this invariant won't hold will we reach UB? I do not think so.
If not, we should replace CHECK with DCHECK - to reduce the risk of crashing because of minor bugs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see why arrogant checks are evil: #2649

Copy link
Contributor

@dranikpg dranikpg Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It asserts interface compatibility. The best way to do it is with proper typing, i.e. nonstd::expected. If this path is not covered in tests, the DCHECK will be never checked and we'll get puzzling behaviour in case someone breaks the interface. But I'd also favour a DCHECK (if it ever runs), in the future we should look at #2647 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made it a DCHECK

save_controller_ = make_unique<SaveStagesController>(detail::SaveStagesInputs{
new_version, basename, trans, &service_, fq_threadpool_.get(), snapshot_storage_});

auto res = save_controller_->InitResourcesAndStart();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have "Init()" function that can fail and then keep Save() function that does Start and Wait as before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly what we do. But I assume you know this from:

ok, this answers my first question.

Copy link
Contributor

@dranikpg dranikpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please wait to see if Roman has any objections because I'm as usually an early-approver 😆

Comment on lines +74 to +78
// Objects of this class are used concurrently. Call this function
// in a mutually exlusive context to avoid data races.
// Also call this function before any call to `WaitAllSnapshots`
// Returns empty optional on success and SaveInfo on failure
std::optional<SaveInfo> InitResourcesAndStart();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be a little more specific (what means used concurrently) like: the save controller can be queried for stats from another threads at any time, so all setup and teardown operations must be guarded, etc...

@romange romange merged commit 3d27913 into main Feb 23, 2024
10 checks passed
@romange romange deleted the fix_info_all_race branch February 23, 2024 15:35
romange pushed a commit that referenced this pull request Feb 23, 2024
* fix: data race in save_stages_controller
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

race condition on INFO ALL && BGSAVE
3 participants