Skip to content

Commit

Permalink
gossiper: report error when waiting too long for endpoint lock
Browse files Browse the repository at this point in the history
In a longevity test reported in scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. Which is the case couldn't be
determined from the logs, and attempts at creating a local reproducer
failed.

One hypothesis is that `gossiper` is stuck on `lock_endpoint`. We dealt
with gossiper deadlocks in the past (e.g. scylladb#7127).

Modify the code so it reports an error if `lock_endpoint` waits for the
lock for more than a minute. When the issue reproduces again in
longevity, we will see if `lock_endpoint` got stuck.
  • Loading branch information
kbr-scylla authored and dgarcia360 committed Apr 30, 2024
1 parent 7f7fe72 commit f516dac
Show file tree
Hide file tree
Showing 2 changed files with 42 additions and 1 deletion.
39 changes: 38 additions & 1 deletion gms/gossiper.cc
Expand Up @@ -782,6 +782,10 @@ gossiper::endpoint_permit::endpoint_permit(endpoint_locks_map::entry_ptr&& ptr,
, _caller(std::move(caller))
{
++_ptr->holders;
if (!_ptr->first_holder) {
_ptr->first_holder = _caller;
}
_ptr->last_holder = _caller;
logger.debug("{}: lock_endpoint {}: acquired: permit_id={} holders={}", _caller.function_name(), _addr, _permit_id, _ptr->holders);
}

Expand All @@ -804,6 +808,7 @@ bool gossiper::endpoint_permit::release() noexcept {
logger.debug("{}: lock_endpoint {}: released: permit_id={}", _caller.function_name(), _addr, _permit_id);
ptr->units.return_all();
ptr->pid = null_permit_id;
ptr->first_holder = ptr->last_holder = std::nullopt;
_permit_id = null_permit_id;
return true;
}
Expand Down Expand Up @@ -833,7 +838,39 @@ future<gossiper::endpoint_permit> gossiper::lock_endpoint(inet_address ep, permi
}
pid = permit_id::create_random_id();
logger.debug("{}: lock_endpoint {}: waiting: permit_id={}", l.function_name(), ep, pid);
eptr->units = co_await get_units(eptr->sem, 1, _abort_source);
while (true) {
_abort_source.check();
static constexpr auto duration = std::chrono::minutes{1};
abort_on_expiry aoe(lowres_clock::now() + duration);
auto sub = _abort_source.subscribe([&aoe] () noexcept {
aoe.abort_source().request_abort();
});
assert(sub); // due to check() above
try {
eptr->units = co_await get_units(eptr->sem, 1, aoe.abort_source());
break;
} catch (const abort_requested_exception&) {
if (_abort_source.abort_requested()) {
throw;
}

// If we didn't rethrow above, the abort had to come from `abort_on_expiry`'s timer.

static constexpr auto fmt_loc = [] (const seastar::compat::source_location& l) {
return fmt::format("{}({}:{}) `{}`", l.file_name(), l.line(), l.column(), l.function_name());
};
static constexpr auto fmt_loc_opt = [] (const std::optional<seastar::compat::source_location>& l) {
if (!l) {
return "null"s;
}
return fmt_loc(*l);
};
logger.error(
"{}: waiting for endpoint lock (ep={}) took more than {}, signifying possible deadlock;"
" holders: {}, first holder: {}, last holder (might not be current): {}",
fmt_loc(l), ep, duration, eptr->holders, fmt_loc_opt(eptr->first_holder), fmt_loc_opt(eptr->last_holder));
}
}
eptr->pid = pid;
if (eptr->holders) {
on_internal_error_noexcept(logger, fmt::format("{}: lock_endpoint {}: newly held endpoint_lock_entry has {} holders", l.function_name(), ep, eptr->holders));
Expand Down
4 changes: 4 additions & 0 deletions gms/gossiper.hh
Expand Up @@ -152,6 +152,10 @@ public:
permit_id pid;
semaphore_units<> units;
size_t holders = 0;
std::optional<seastar::compat::source_location> first_holder;
// last_holder is the caller of endpoint_permit who last took this entry,
// it might not be a current holder (the permit might've been destroyed)
std::optional<seastar::compat::source_location> last_holder;

endpoint_lock_entry() noexcept;
};
Expand Down

0 comments on commit f516dac

Please sign in to comment.