
feat: add current_fork_perc in info all command #2640

Merged · 10 commits · Feb 26, 2024
Conversation

@kostasrim (Contributor) commented Feb 21, 2024

fixes #2523

  • add field current_snapshot_perc
  • add field current_save_keys_processed
  • add field current_save_keys_total

if (has_save_controller) {
  append("save_buffer_bytes", save_controller_->GetSaveBuffersSize());
}
Collaborator:
You need to lock save_mu_ before accessing save_controller_.
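The locking fix requested above can be sketched as follows. This is a minimal stand-in, not the real Dragonfly classes: the `SaveStagesController` body and the helper name `SafeSaveBuffersSize` are assumptions; only the member names `save_mu_` and `save_controller_` come from the thread.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the real SaveStagesController; the size is illustrative.
struct SaveStagesController {
  std::size_t GetSaveBuffersSize() const { return 1024; }
};

struct ServerFamily {
  std::mutex save_mu_;
  std::unique_ptr<SaveStagesController> save_controller_;

  // Read the buffer size under save_mu_, so a concurrent save completing
  // (and resetting save_controller_) cannot race with this access.
  std::size_t SafeSaveBuffersSize() {
    std::lock_guard<std::mutex> lk(save_mu_);
    return save_controller_ ? save_controller_->GetSaveBuffersSize() : 0;
  }
};
```

Checking the pointer and reading through it inside the same critical section is what makes the access safe; checking outside the lock would reintroduce the race.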

Contributor Author:

omg 😱 😱 good catch!

size_t keys_total = 0;
};

CurrentSaveStats current_save_stats;
Collaborator:

Why don't you put these stats in the SliceSnapshot class and add a function in SaveStagesController that aggregates them, as we do in GetSaveBuffersSize?

Collaborator:

If you add a new function GetSaveProgress, you will not need save_mode_ either. You can always set the counters and report them in INFO PERSISTENCE only if save_controller_ exists.

Contributor Author:

The thing is that we need save_mode_ because otherwise a new SliceSnapshot (one that is not part of the bgsave/save operation) might overwrite the thread local. We exercise almost the same path in replication, so replication and a bgsave operation could happen at the same time, and both command paths would happily increment the thread locals, breaking correctness.

Second, with what I have here we save a hop: we don't dispatch a callback to fetch and aggregate the thread locals. We are on a non-critical path, so this is less important, but still.

Collaborator:

I am suggesting adding the stats to the SliceSnapshot class, not using the thread local.

Contributor Author:

I made the change and, to be honest, I am not very happy with it:

  1. It's more code and introduces a much more convoluted code path.
  2. I needed 3 different proxy functions to reach the SliceSnapshot 😢
  3. There is code repetition, because we need to handle different formats/requirements explicitly.

Contributor Author:

It's fine for now, but at some point I would consider a refactor. It's too much work to plug in stat tracking; the 3-proxy call chain in GetCurrentSnapshotProgress killed it for me.

Collaborator:

I agree there are lots of proxy functions in this flow and it looks messy. We can check later whether we can do some refactoring here to make it simpler.

total = 0;
}

append("current_fork_perc", perc);
Collaborator:

I believe we can change the name here and use a different one than Redis does, since we don't have a fork.

Contributor Author:

I agree; I just opted for consistency instead of adding a different name. I will rename accordingly.
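For illustration, the percentage appended as "current_fork_perc" in the hunk above would need a division-by-zero guard for the case where the total is reset to 0. A minimal sketch; the function name is an assumption, not code from the PR:

```cpp
#include <cstddef>

// Guard against division by zero when no snapshot is in progress or the
// key total is not yet known (the "total = 0" case in the hunk above).
double SaveProgressPercentage(std::size_t processed, std::size_t total) {
  if (total == 0)
    return 0.0;
  return 100.0 * static_cast<double>(processed) / static_cast<double>(total);
}
```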

@@ -99,6 +99,8 @@ struct Metrics {
// Max length of the all the tx shard-queues.
uint32_t tx_queue_len = 0;

ServerState::CurrentSaveStats current_save_stats;
Collaborator:

not used anymore

Contributor Author:

already removed :)

@@ -360,4 +364,8 @@ size_t SliceSnapshot::GetTotalChannelCapacity() const {
return dest_->GetSize();
}

std::pair<size_t, size_t> SliceSnapshot::GetCurrentSnapshotProgress() const {
return {stats_.loop_serialized, stats_.keys_total};
Collaborator:

side_saved + loop_serialized is the number of entries serialized.

Contributor Author:

nice catch!
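The corrected accessor, per the comment above, sums both counters. This is a sketch under assumptions: the struct below models only the three fields named in the thread (loop_serialized, side_saved, keys_total), not the real SliceSnapshot::Stats.

```cpp
#include <cstddef>
#include <utility>

// Assumed shape of the snapshot counters named in the thread.
struct SnapshotCounters {
  std::size_t loop_serialized = 0;
  std::size_t side_saved = 0;  // entries serialized outside the main loop
  std::size_t keys_total = 0;
};

// Progress = all entries serialized so far vs. the total number of keys.
std::pair<std::size_t, std::size_t> GetCurrentSnapshotProgress(
    const SnapshotCounters& stats) {
  return {stats.loop_serialized + stats.side_saved, stats.keys_total};
}
```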

lock_guard lk{save_mu_};
if (save_controller_) {
has_save_controller = true;
save_buffer_bytes = save_controller_->GetSaveBuffersSize();
Collaborator:

I think each of these calls should be inside the relevant if; otherwise you invoke the functions even if the user didn't ask for memory/persistence stats.

Contributor Author (@kostasrim, Feb 22, 2024):

But now we need to lock twice... and in the worst path we also do an extra two hops... The added overhead and complexity are piling up...

Collaborator:

You also already have a lock on save_mu_ below, for
curent_durration_sec = save_controller_->GetCurrentSaveDuration();
I think it's better to have them inside the relevant path. You are doing the hops right now regardless of taking the mutex once or twice.

Contributor Author (@kostasrim, Feb 22, 2024):

One more comment here. Let's zoom out: on one side we are saying we care about the potential two extra dispatches to the shards and their cost, but at the same time we don't care that we added two extra shard dispatches (for GetCurrentSnapshotProgress) when in fact we did not need them in the first place.

Anyway, I made the change; just thinking out loud :)

Collaborator:

There is only one hop for GetCurrentSnapshotProgress, right? Not two. Or did I misunderstand what you mean?

@@ -129,6 +131,11 @@ size_t RdbSnapshot::GetSaveBuffersSize() {
return saver_->GetTotalBuffersSize();
}

std::pair<size_t, size_t> RdbSnapshot::GetCurrentSnapshotProgress() const {
Collaborator:

How about returning a struct SnapshotStats? It will be clearer to the reader, and you won't need a comment explaining what the first and second elements are.

Contributor Author:

will do
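The suggested named-struct return might look like this. The field names are assumptions modeled on the INFO fields this PR adds, not the struct the author actually committed:

```cpp
#include <cstddef>

// Named fields document themselves, unlike pair.first / pair.second.
struct SnapshotStats {
  std::size_t current_save_keys_processed = 0;
  std::size_t current_save_keys_total = 0;
};

// Illustrative accessor returning the named struct instead of a pair.
SnapshotStats GetCurrentSnapshotProgress(std::size_t processed,
                                         std::size_t total) {
  return SnapshotStats{processed, total};
}
```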

Contributor Author (@kostasrim):

The test failure is fixed by #2647.

{
lock_guard lk{save_mu_};
if (save_controller_) {
auto res = save_controller_->GetCurrentSnapshotProgress();
Collaborator:

You can move this to line 1967, right? We already have this code there:
lock_guard lk{save_mu_};
if (save_controller_) {

@kostasrim kostasrim merged commit d54f220 into main Feb 26, 2024
10 checks passed
@kostasrim kostasrim deleted the save_progress_tracking branch February 26, 2024 09:17
Successfully merging this pull request may close these issues.

SAVE progress tracking
2 participants