-
Notifications
You must be signed in to change notification settings - Fork 876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: data race in save_stages_controller #2647
Conversation
} | ||
|
||
detail::SaveInfo save_info = save_controller_->Save(); | ||
save_controller_->WaitAllSnapshots(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bonus points because now the function clearly suggests what's happening under the hood. @romange we can turn this function async easily. We could add a background fiber for that, which would turn BGSAVE
from synchronous to asynchronous (bringing us on par with redis). This is not a priority but since it was mentioned in the issues I am bringing it up :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, this answers my first question.
|
||
SaveInfo Finalize() { | ||
RunStage(&SaveStagesController::CloseCb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be done atomically as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If to follow it fully, the mechanism behind GetTotalBuffersSize()
is quite complicated
For DFS we dispatch immediately from SaveStagesController
and assume we're placed on each shard correctly
For RDB we call RdbSnapshot::Impl::GetSaveBuffersSize
transparently and dispatch only there on all the snapshot threads
So we have branching twice, but on different levels 🤯
Left some nits
SaveStagesController(SaveStagesInputs&& input); | ||
std::optional<SaveInfo> InitResourcesAndStart(); | ||
|
||
~SaveStagesController(); | ||
|
||
SaveInfo Save(); | ||
void WaitAllSnapshots(); | ||
|
||
SaveInfo Finalize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever a type has a multi-step interface, please provide proper comments on how to use it correctly (and why actually those are different steps).
If people won't find comments, they'll misuse it very quickly. Proven by experience 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Especially for a type that has outside synchronization which is sometimes difficult to follow. I'd make the case for internal synchronization, but we also need external for updating the pointer, so it's not worth dividing those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also comment what returning optional means.... It's not evident that this is an error in fact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, the interface a bit non-intuitive. I also agree that this PR should not do lots of changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added comments. I wanted to push a refactor as I wrote somewhere else but I thought for not it's best to only fix the issue (even if it's not perfect)
src/server/server_family.cc
Outdated
auto set_last_save_error = [this](const auto& save_info) mutable { | ||
last_save_info_.last_error = save_info.error; | ||
last_save_info_.last_error_time = save_info.save_time; | ||
last_save_info_.failed_duration_sec = save_info.duration_sec; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: If to speak strictly, it should really be an function on last safe info 😅 Actually, we have two types LastSaveInfo
and SaveInfo
, whereas the former should clearly be a subclass of the latter. We should add it as a refactoring todo (if nobody objects 😼)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree and since it's a very quick change, I did it now :) The only todo I will leave is a major TODO :refactor 🤣
return GetSaveInfo(); | ||
SaveInfo save = GetSaveInfo(); | ||
return std::optional<SaveInfo>(std::move(save)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I'm almost sure return GetSaveInfo()
works as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On my defense, I was experimenting with a false positive from you complete me. Basically I had it like this and I got a warning on Vim so I changed it because I was curious if the warning would go away (it didn't) and I left forgotten as is 👀
src/server/server_family.cc
Outdated
auto res = save_controller_->InitResourcesAndStart(); | ||
|
||
if (res) { | ||
CHECK_EQ(res->error, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this invariant won't hold will we reach UB? I do not think so.
If not, we should replace CHECK with DCHECK - to reduce the risk of crashing because of minor bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see why arrogant checks are evil: #2649
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It asserts interface compatibility. The best way to do it is with proper typing, i.e. nonstd::expected. If this path is not covered in tests, the DCHECK will be never checked and we'll get puzzling behaviour in case someone breaks the interface. But I'd also favour a DCHECK (if it ever runs), in the future we should look at #2647 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made it a DCHECK
save_controller_ = make_unique<SaveStagesController>(detail::SaveStagesInputs{ | ||
new_version, basename, trans, &service_, fq_threadpool_.get(), snapshot_storage_}); | ||
|
||
auto res = save_controller_->InitResourcesAndStart(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have "Init()" function that can fail and then keep Save()
function that does Start and Wait as before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exactly what we do. But I assume you know this from:
ok, this answers my first question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please wait to see if Roman has any objections because I'm as usually an early-approver 😆
// Objects of this class are used concurrently. Call this function | ||
// in a mutually exlusive context to avoid data races. | ||
// Also call this function before any call to `WaitAllSnapshots` | ||
// Returns empty optional on success and SaveInfo on failure | ||
std::optional<SaveInfo> InitResourcesAndStart(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be a little more specific (what means used concurrently) like: the save controller can be queried for stats from another threads at any time, so all setup and teardown operations must be guarded, etc...
* fix: data race in save_stages_controller
fixes #2644