
fix(server): saving is not a server state #2613

Merged: 5 commits merged into main on Feb 19, 2024

Conversation

adiholden (Collaborator):

Fixes #2597

The bug:
The server crashed because, when replicaof no one was executed, the replica was in the middle of creating a snapshot, and therefore the server state was not ACTIVE but SAVING.
Check failure: server_family.cc:2142] Check failed: service_.SwitchState(GlobalState::LOADING, GlobalState::ACTIVE).first == GlobalState::ACTIVE Server is set to replica no one, yet state is not active!
The fix:

  1. Remove SAVING from the set of server states.
  2. Move the save stages controller to be a member of ServerFamily, guarded by a mutex (see the sketch below).
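
A minimal sketch of the resulting ownership, assuming simplified names (std::mutex instead of the fiber mutex, a trimmed-down controller); the real SaveStagesController and ServerFamily interfaces are richer:

// Sketch only: shows the shape of the fix, not the actual Dragonfly code.
#include <memory>
#include <mutex>

enum class GlobalState { ACTIVE, LOADING, SHUTTING_DOWN };  // no SAVING value anymore

class SaveStagesController {
 public:
  bool IsSaving() const { return saving_; }  // drives and reports the snapshot stages
 private:
  bool saving_ = true;
};

class ServerFamily {
 public:
  // "Is a save in progress?" is now answered by the controller owned by the
  // family, instead of by a dedicated SAVING value in the global state.
  bool IsSaving() const {
    std::lock_guard lk(save_mu_);
    return save_controller_ && save_controller_->IsSaving();
  }

 private:
  mutable std::mutex save_mu_;
  std::unique_ptr<SaveStagesController> save_controller_;  // non-null while a save runs
};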

Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
@@ -264,6 +256,10 @@ void SaveStagesController::SaveRdb() {
trans_->ScheduleSingleHop(std::move(cb));
}

uint32_t SaveStagesController::GetCurrentSaveDuratio() {
Collaborator:

Typo: should be GetCurrentSaveDuration.

StrCat(GlobalStateName(new_state), " - can not save database")};
}
auto state = service_.GetGlobalState();
if (!ignore_state && (state != GlobalState::ACTIVE)) {
Collaborator:

Please add a comment explaining the context for this check and for the ignore_state variable.
I know you did not add ignore_state, but I'm curious to understand why we need this check.
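
For context, a self-contained sketch of what such an explanatory comment could look like; the rationale for ignore_state given below is an assumption made for illustration, not something confirmed in this thread:

// Hypothetical illustration only; names and rationale are assumptions.
#include <optional>
#include <string>

enum class GlobalState { ACTIVE, LOADING, SHUTTING_DOWN };

std::string GlobalStateName(GlobalState s) {
  switch (s) {
    case GlobalState::ACTIVE:  return "ACTIVE";
    case GlobalState::LOADING: return "LOADING";
    default:                   return "SHUTTING_DOWN";
  }
}

// Returns an error message, or std::nullopt if the save may start.
std::optional<std::string> CheckCanSave(GlobalState state, bool ignore_state) {
  // A snapshot is only well defined while the server is serving traffic
  // normally, so refuse to start one during LOADING or SHUTTING_DOWN.
  // ignore_state lets internal callers bypass the guard deliberately
  // (assumed example: a final save while the server shuts down).
  if (!ignore_state && state != GlobalState::ACTIVE) {
    return GlobalStateName(state) + " - can not save database";
  }
  return std::nullopt;
}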

@@ -19,16 +17,28 @@ class Service;

namespace detail {

class SnapshotStorage;

struct LastSaveInfo {
Collaborator:

Nit: since we return it from a ServerFamily function, it can be declared outside the detail namespace.


// Load snapshot from file (.rdb file or summary.dfs file) and return
// future with error_code.
util::fb2::Future<GenericError> Load(const std::string& file_name);

bool IsSaving() const {
-    return is_saving_.load(std::memory_order_relaxed);
+    std::lock_guard lk(save_mu_);
+    return save_controller_ && save_controller_->IsSaving();
Collaborator:

It seems this function is now used only for tests.
I think returning bool(save_controller_) is enough for this.

Collaborator Author:

Actually, I debugged the test for a while and found that I do need it: it takes some time for the executed save command to be scheduled in the transaction queue. If is_saving_ is not set only after the scheduling, the SaveFlush test fails because it schedules the flushall before the save.

Contributor:

Let's not complicate real code for testing reasons (especially without comments or a TEST prefix); we have lots of other ways to check if saving is in progress. For example, check tl_slice_snapshots, which should be the easiest. We can also check transaction metrics.

Collaborator Author:

This change was not introduced in this PR; is_saving_ was already set there before. I only changed the logic there.

private:
absl::Time start_time_;
std::filesystem::path full_path_;
bool is_cloud_;
bool is_saving_ = false;
Collaborator:

I think it's not really needed; see my comment below.


@@ -125,21 +125,21 @@ async def test_dbfilenames(

@pytest.mark.slow
@dfly_args({**BASIC_ARGS, "dbfilename": "test-cron", "snapshot_cron": "* * * * *"})
-async def test_cron_snapshot(tmp_path: Path, async_client: aioredis.Redis):
+async def test_cron_snapshot(tmp_dir: Path, async_client: aioredis.Redis):
Contributor:

my bad 😅 ... but why did it pass 🤔

Collaborator Author:

It did not. The slow tests are not run in the regression tests.

Comment on lines 40 to 41
LastSaveInfo* last_save_info_ ABSL_GUARDED_BY(save_mu_);
util::fb2::Mutex* save_mu_;
Contributor:

Just a thought I realize now (and not while extracting this class): it would be more convenient to return LastSaveInfo by value from Save() instead of passing a mutex by pointer.
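
A minimal sketch of that alternative, with assumed and simplified names (the real Save() flow is asynchronous and the real types carry more fields):

// Sketch only: return the save result by value so the caller owns all locking.
#include <ctime>
#include <mutex>
#include <string>
#include <utility>

struct LastSaveInfo {
  std::time_t save_time = 0;
  std::string file_name;
};

class SaveStagesController {
 public:
  // No LastSaveInfo* / Mutex* members: the controller just reports what it did.
  LastSaveInfo Save() {
    LastSaveInfo info;
    // ... run the save stages and fill in info ...
    return info;
  }
};

class ServerFamily {
 public:
  void DoSave() {
    SaveStagesController controller;
    LastSaveInfo info = controller.Save();  // plain value, no shared state
    std::lock_guard lk(save_mu_);           // locking stays in one place
    last_save_info_ = std::move(info);
  }

 private:
  std::mutex save_mu_;
  LastSaveInfo last_save_info_;
};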

Comment on lines 2022 to 2025
save_task = asyncio.create_task(save_replica())
await asyncio.sleep(0.1) # let save start
await c_replica.execute_command("replicaof no one")
await save_task
Contributor:

Once we assert replicaof no one returns OK and doesn't crash, the test is over. We can kill the replica and cancel the save task (it'll be knocked out with an error either way)

dranikpg (Contributor), Feb 18, 2024:

Maybe we can busy-loop on saving_metric > 0 so the test starts quicker? Not to save time here, but to be able to use far fewer than 100k keys (we rarely use values as large as 100k in pytests + debug builds). Starting the snapshot shouldn't take more than a few milliseconds even on a slow machine, and a slow machine won't manage to save even 10k large values before the replicaof that arrives immediately after.

Collaborator Author:

I updated the test to use 10k keys and a busy loop. I do want to wait for the save to finish and not just exit after issuing replicaof no one; this way the test checks that saving finishes after promotion to master, and doesn't only reproduce the specific crash we had.

Signed-off-by: adi_holden <adi@dragonflydb.io>
romange previously approved these changes on Feb 19, 2024
Comment on lines 177 to 180
bool TEST_IsSaving() const {
std::lock_guard lk(save_mu_);
return save_controller_ && save_controller_->TEST_IsSaving();
}
Contributor:

This is obviously not safe, and that's why the test segfaults. I meant checking the thread-local snapshots, which are not tied to any saver instance. Either way, I don't think it's that important; no matter how it works, we can clean it up later 😅

Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
adiholden enabled auto-merge (squash) on February 19, 2024, 15:16
adiholden merged commit 15b3fb1 into main on Feb 19, 2024
10 checks passed
adiholden deleted the fix_2597 branch on February 19, 2024, 15:20
adiholden added a commit that referenced this pull request Feb 20, 2024
* fix(server): saving is not a server state

Signed-off-by: adi_holden <adi@dragonflydb.io>
Linked issue: Server crashes after issuing replicaof no one (#2597)
3 participants