fix(server): saving is not a server state #2613
Conversation
Signed-off-by: adi_holden <adi@dragonflydb.io>
@@ -264,6 +256,10 @@ void SaveStagesController::SaveRdb() {
  trans_->ScheduleSingleHop(std::move(cb));
}

uint32_t SaveStagesController::GetCurrentSaveDuratio() {
GetCurrentSaveDuration
      StrCat(GlobalStateName(new_state), " - can not save database")};
}
auto state = service_.GetGlobalState();
if (!ignore_state && (state != GlobalState::ACTIVE)) {
Please add a comment explaining the context for this check and for the ignore_state variable. I know that you did not add ignore_state, but I'm curious to understand why we need this check.
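For context, the pattern being discussed can be sketched as follows. This is an illustrative example only, with simplified stand-in types (the enum values, function name, and error shape are assumptions, not the real Dragonfly code): a save is refused unless the global state is ACTIVE, except when the caller explicitly opts out via ignore_state.

```cpp
#include <optional>
#include <string>

// Simplified stand-in for Dragonfly's global server state.
enum class GlobalState { ACTIVE, LOADING, SAVING, SHUTTING_DOWN };

// Hypothetical helper: returns an error message when saving is not allowed,
// std::nullopt otherwise. ignore_state lets a caller force the save anyway.
std::optional<std::string> CheckSaveAllowed(GlobalState state, bool ignore_state) {
  if (!ignore_state && state != GlobalState::ACTIVE)
    return std::string("can not save database");
  return std::nullopt;  // ok to proceed with the save
}
```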
@@ -19,16 +17,28 @@ class Service;

namespace detail {

class SnapshotStorage;

struct LastSaveInfo {
nit: since we return it from a ServerFamily function, it can be declared outside the detail namespace
src/server/server_family.h
Outdated
  // Load snapshot from file (.rdb file or summary.dfs file) and return
  // future with error_code.
  util::fb2::Future<GenericError> Load(const std::string& file_name);

  bool IsSaving() const {
-   return is_saving_.load(std::memory_order_relaxed);
+   std::lock_guard lk(save_mu_);
+   return save_controller_ && save_controller_->IsSaving();
Seems that this function is now used only for tests. I think that returning bool(save_controller_) is enough for this.
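The suggestion can be sketched like this. It is a minimal mock, not the real Dragonfly classes (StartSave/FinishSave are hypothetical helpers): if the save controller object only exists while a save is in flight, its mere presence under the mutex answers "is a save running?" without a separate flag.

```cpp
#include <memory>
#include <mutex>

struct SaveStagesController {};  // stand-in; the real class drives the save

class ServerFamily {
 public:
  void StartSave() {
    std::lock_guard<std::mutex> lk(save_mu_);
    save_controller_ = std::make_unique<SaveStagesController>();
  }
  void FinishSave() {
    std::lock_guard<std::mutex> lk(save_mu_);
    save_controller_.reset();
  }
  bool IsSaving() const {
    std::lock_guard<std::mutex> lk(save_mu_);
    return bool(save_controller_);  // non-null controller == save in progress
  }

 private:
  mutable std::mutex save_mu_;
  std::unique_ptr<SaveStagesController> save_controller_;
};
```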
Actually, I debugged the test for a while to understand that I do need it, as it takes some time for the executed save command to be scheduled in the transaction queue. If I don't have is_saving_ set after the scheduling, the SaveFlush test fails, as it will schedule flushall before the save.
Let's not complicate real code for testing reasons (especially without comments or a TEST prefix); we have lots of other ways to check if saving is in progress. For example, check tl_slice_snapshots, which should be easiest. We can also check transaction metrics.
This change was not introduced in this PR; is_saving_ was set there before. I changed the logic there.
 private:
  absl::Time start_time_;
  std::filesystem::path full_path_;
  bool is_cloud_;
  bool is_saving_ = false;
I think it's not really needed; see my comment below.
@@ -125,21 +125,21 @@ async def test_dbfilenames(

@pytest.mark.slow
@dfly_args({**BASIC_ARGS, "dbfilename": "test-cron", "snapshot_cron": "* * * * *"})
-async def test_cron_snapshot(tmp_path: Path, async_client: aioredis.Redis):
+async def test_cron_snapshot(tmp_dir: Path, async_client: aioredis.Redis):
my bad 😅 ... but why did it pass 🤔
It did not. The slow tests are not run in the regression tests.
LastSaveInfo* last_save_info_ ABSL_GUARDED_BY(save_mu_);
util::fb2::Mutex* save_mu_;
Just a thought I realized now (and not while extracting this class): it'd be more convenient if we returned LastSaveInfo by value from Save() instead of passing a mutex by pointer.
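The by-value idea could look roughly like this. This is a hedged sketch with illustrative field names (the real struct differs): Save() builds a LastSaveInfo locally and returns it by value, so callers need neither a shared mutex nor an out-pointer.

```cpp
#include <ctime>
#include <string>

// Illustrative stand-in for the real LastSaveInfo.
struct LastSaveInfo {
  time_t save_time = 0;   // epoch seconds of the completed save
  std::string file_name;  // path of the produced snapshot file
};

// Hypothetical Save(): fills the result locally and returns it by value,
// so no mutex pointer crosses the API boundary.
LastSaveInfo Save(const std::string& path) {
  LastSaveInfo info;
  // ... snapshotting work would happen here ...
  info.save_time = std::time(nullptr);
  info.file_name = path;
  return info;  // moved out to the caller
}
```

The caller then owns its own copy, and any locking stays an internal detail of the saving code.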
tests/dragonfly/replication_test.py
Outdated
save_task = asyncio.create_task(save_replica())
await asyncio.sleep(0.1)  # let save start
await c_replica.execute_command("replicaof no one")
await save_task
Once we assert replicaof no one returns OK and doesn't crash, the test is over. We can kill the replica and cancel the save task (it'll be knocked out with an error either way).
Maybe we can busy loop on saving_metric > 0 to start quicker? Not to save time here, but to be able to use much less than 100k keys (we rarely use values as large as 100k in pytests + debug). The snapshot start shouldn't take more than a few milliseconds on a slow machine, and a slow machine won't be quick enough to save even 10k large values before the replicaof arrives immediately after.
I updated the test to use 10k keys and a busy loop. I do want to wait for the save to finish and not just exit after issuing replicaof no one; this test checks that saving finishes after promotion to master, rather than only reproducing a specific crash we had.
Signed-off-by: adi_holden <adi@dragonflydb.io>
src/server/server_family.h
Outdated
bool TEST_IsSaving() const {
  std::lock_guard lk(save_mu_);
  return save_controller_ && save_controller_->TEST_IsSaving();
}
This is obviously not safe and that's why the test segfaults. I meant to check thread local snapshots not tied to any saver instance. Either way, I don't think it's that important, no matter how it works, we can clean up later 😅
Signed-off-by: adi_holden <adi@dragonflydb.io>
* fix(server): saving is not a server state Signed-off-by: adi_holden <adi@dragonflydb.io>
fixes #2597
The bug:
The server crashed because, when replicaof no one was executed, the replica was creating a snapshot, and therefore the server state was not ACTIVE but SAVING.
The check failure: server_family.cc:2142] Check failed: service_.SwitchState(GlobalState::LOADING, GlobalState::ACTIVE).first == GlobalState::ACTIVE Server is set to replica no one, yet state is not active!
The fix: