notify standbys to resubscribe when forward entry missing by MrGuin · Pull Request #399 · eloqdata/tx_service

MrGuin · 2026-02-04T08:48:12Z

Summary by CodeRabbit

New Features
- Automatic detection and notification when standby nodes fall out-of-sync.
- Triggers resubscription and recovery signaling so standbys can be re-synchronized promptly.
Bug Fixes
- Centralizes out-of-sync handling to ensure consistent notifications during forwarding failures.
- Improves housekeeping and recovery reliability to reduce missed updates and stale subscriptions.

coderabbitai · 2026-02-04T08:48:42Z

Walkthrough

Adds CcShard::NotifyStandbyOutOfSync(uint32_t) and replaces inline out-of-sync handling with calls to it when ForwardEntry is missing or resend fails; the helper ensures stream sender init, sends a KeyObjectStandbyForwardRequest, marks standby seq_id = UINT64_MAX, removes heartbeat targets, and dispatches per-core subscription updates.

Changes

Cohort / File(s)	Summary
CcShard public API & impl `tx_service/include/cc/cc_shard.h`, `tx_service/src/cc/cc_shard.cpp`	Add public method `NotifyStandbyOutOfSync(uint32_t)` and implement it. Move previously inlined out-of-sync logic into this helper: ensure stream sender initialized, send KeyObjectStandbyForwardRequest, set standby.seq_id = UINT64_MAX, remove heartbeat target, and broadcast per-core tasks to clear subscriptions. ResendFailedForwardMessages now calls the helper.
Forward-entry guards in object CC map `tx_service/include/cc/object_cc_map.h`	In `PostWriteCc` and `Execute` paths, capture `subscribed_standbys` locally and, if `ForwardEntry` is null, log and call `shard_->NotifyStandbyOutOfSync(node_id)` for each subscribed standby to trigger resubscription/out-of-sync handling.
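
The guard described for `PostWriteCc` and `Execute` can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in `object_cc_map.h`: `MockShard`, the empty `ForwardEntry` stub, and the helper `NotifyAllIfForwardEntryMissing` are hypothetical stand-ins. Only the method name `NotifyStandbyOutOfSync` is taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real ForwardEntry type.
struct ForwardEntry
{
};

// Hypothetical stand-in for CcShard. It only records which standbys were
// notified, so the control flow can be observed.
struct MockShard
{
    std::vector<uint32_t> notified;

    void NotifyStandbyOutOfSync(uint32_t node_id)
    {
        notified.push_back(node_id);
    }
};

// Returns true if a forward entry exists and forwarding can proceed.
// Returns false if the entry is missing, after asking every subscribed
// standby to resubscribe.
inline bool NotifyAllIfForwardEntryMissing(
    MockShard &shard,
    const ForwardEntry *entry,
    const std::vector<uint32_t> &subscribed_standbys)
{
    if (entry != nullptr)
    {
        return true;
    }
    for (uint32_t node_id : subscribed_standbys)
    {
        shard.NotifyStandbyOutOfSync(node_id);
    }
    return false;
}
```

Taking the standby list as a parameter mirrors the summary's note that `subscribed_standbys` is captured locally before the null check.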

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CcShard as CcShard
    participant Stream as StreamSender
    participant Standby as StandbyNode
    participant Heartbeat as HeartbeatManager
    participant Scheduler as PerCoreScheduler

    CcShard->>CcShard: detect missing ForwardEntry or resend failure
    CcShard->>Stream: ensure_initialized()
    Stream-->>CcShard: initialized
    CcShard->>Stream: send(KeyObjectStandbyForwardRequest(node_id, out_of_sync))
    Stream->>Standby: deliver out-of-sync request
    CcShard->>CcShard: set standby.seq_id = UINT64_MAX
    CcShard->>Heartbeat: remove_target(node_id)
    CcShard->>Scheduler: broadcast(adjust_subscriptions_for_node(node_id))
    Note right of Scheduler: per-core tasks clear subscriptions for that node

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Fix CcStream SendMessageToNode #282: Modifies send paths for standby messages and aligns with refactoring of resend/send logic in cc_shard.cpp.
fix standby cce eviction and forward msg #384: Changes standby-forward handling in object_cc_map.h; closely related to the forward/standby eviction and resubscription behavior updated here.

Suggested reviewers

liunyl

Poem

🐰 I sniffed a missing forward and gave a hop,

Sent "out of sync" where the messages stop.
Seq climbed to MAX, heartbeats took flight,
Cores cleared their lists through the soft, moonlit night.
A tiny fix, now standbys sleep tight.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	No pull request description was provided by the author, but the repository template requires sections like tests, documentation, issue/RFC references, and test suite completion.	Add a pull request description following the template: include test additions, documentation updates, issue reference (fixes #), RFC link if applicable, and test suite results.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'notify standbys to resubscribe when forward entry missing' clearly and specifically describes the main change: adding functionality to notify standbys to resubscribe when a forward entry is missing.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@tx_service/src/cc/cc_shard.cpp`:
- Around line 3079-3141: NotifyStandbyOutOfSync sets the node's seq id to
UINT64_MAX but leaves the entry in subscribed_standby_nodes_, causing
ForwardStandbyMessage to treat it as a send target and trigger retries; fix by
changing forwarding logic in ForwardStandbyMessage to skip entries whose seq id
(seq_id_and_term.first / the subscribed_standby_nodes_ value) equals UINT64_MAX
(or otherwise mark them as non-sendable), or alternatively remove the node from
subscribed_standby_nodes_ in NotifyStandbyOutOfSync; update the iteration/send
path in ForwardStandbyMessage to check for UINT64_MAX and not add
history/trigger retries for those nodes.
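
One way to picture the first option in this comment (skipping sentinel entries in the forwarding path) is the sketch below. It assumes `subscribed_standby_nodes_` maps a node id to a `(seq id, term)` pair, as the diff suggests. The alias `StandbyMap` and the helper `SendableStandbys` are hypothetical names introduced for illustration and are not part of `ForwardStandbyMessage`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed shape of subscribed_standby_nodes_: node id -> (seq id, term).
using StandbyMap = std::unordered_map<uint32_t, std::pair<uint64_t, int64_t>>;

// Hypothetical helper: returns the standbys that should still receive
// forwarded messages, in ascending node-id order.
inline std::vector<uint32_t> SendableStandbys(const StandbyMap &nodes)
{
    std::vector<uint32_t> targets;
    for (const auto &kv : nodes)
    {
        // UINT64_MAX marks a standby that was told it is out of sync.
        // Skipping it here avoids sending to it or scheduling retries.
        if (kv.second.first == UINT64_MAX)
        {
            continue;
        }
        targets.push_back(kv.first);
    }
    std::sort(targets.begin(), targets.end());
    return targets;
}
```

The alternative named in the comment, erasing the node from the map inside `NotifyStandbyOutOfSync`, would make this filter unnecessary but would also discard the stored term.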

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tx_service/src/cc/cc_shard.cpp`:
- Around line 3098-3160: NotifyStandbyOutOfSync currently updates state (setting
seq_id to UINT64_MAX and removing heartbeat targets) immediately after calling
stream_sender_->SendMessageToNode without checking the SendMessageResult; change
it to capture the return value from stream_sender_->SendMessageToNode
(SendMessageResult), and if both sent and queued_for_retry are false, log a
warning and return early so you don't mark the standby out-of-sync; only proceed
to modify seq_node_iter->second, call local_shards_.RemoveHeartbeatTargetNode,
and dispatch the unsubscribe updates to other cores after the send_result
indicates success or queued retry.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48cad9a and ae93f8a.

📒 Files selected for processing (3)

tx_service/include/cc/cc_shard.h
tx_service/include/cc/object_cc_map.h
tx_service/src/cc/cc_shard.cpp

🚧 Files skipped from review as they are similar to previous changes (1)

tx_service/include/cc/cc_shard.h

coderabbitai · 2026-02-24T04:14:35Z

tx_service/src/cc/cc_shard.cpp

+void CcShard::NotifyStandbyOutOfSync(uint32_t node_id)
+{
+    auto seq_node_iter = subscribed_standby_nodes_.find(node_id);
+    if (seq_node_iter == subscribed_standby_nodes_.end())
+    {
+        return;
+    }
+
+    if (!stream_sender_)
+    {
+        stream_sender_ = Sharder::Instance().GetCcStreamSender();
+    }
+    if (!stream_sender_)
+    {
+        LOG(WARNING) << "Failed to notify standby " << node_id
+                     << " of out of sync state because stream sender is null";
+        return;
+    }
+
+    // Message not found in map - it has been evicted. Notify standby that it
+    // has already fallen behind so it can resubscribe to the primary node.
+    remote::CcMessage cc_msg;
+    cc_msg.set_type(remote::CcMessage_MessageType::
+                        CcMessage_MessageType_KeyObjectStandbyForwardRequest);
+    auto req = cc_msg.mutable_key_obj_standby_forward_req();
+    req->set_forward_seq_grp(core_id_);
+    req->set_forward_seq_id(next_forward_sequence_id_ - 1);
+    req->set_primary_leader_term(
+        Sharder::Instance().LeaderTerm(Sharder::Instance().NativeNodeGroup()));
+    req->set_out_of_sync(true);
+    stream_sender_->SendMessageToNode(node_id, cc_msg);
+
+    auto &seq_id_and_term = seq_node_iter->second;
+    seq_id_and_term.first = UINT64_MAX;
+    // Remove heartbeat target node
+    local_shards_.RemoveHeartbeatTargetNode(node_id, seq_id_and_term.second);
+
+    int64_t unsubscribe_standby_term = seq_id_and_term.second;
+    for (size_t core_idx = 0; core_idx < core_cnt_; ++core_idx)
+    {
+        if (core_idx != core_id_)
+        {
+            DispatchTask(
+                core_idx,
+                [node_id, unsubscribe_standby_term](CcShard &ccs) -> bool
+                {
+                    auto subscribe_node_iter =
+                        ccs.subscribed_standby_nodes_.find(node_id);
+                    if (subscribe_node_iter !=
+                        ccs.subscribed_standby_nodes_.end())
+                    {
+                        if (subscribe_node_iter->second.second <=
+                            unsubscribe_standby_term)
+                        {
+                            subscribe_node_iter->second.first = UINT64_MAX;
+                        }
+                    }
+
+                    return true;
+                });
+        }
+    }
+}
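
The term comparison inside the dispatched lambda can be isolated as in the sketch below. `StandbyMap` and `ApplyUnsubscribe` are hypothetical names, and unlike the lambda in the diff (which always returns `true`), this helper returns whether it changed anything so the behavior is easy to check. The guard presumably keeps a standby that has already resubscribed under a newer term from being marked out of sync again, though the PR does not state that rationale.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Assumed shape of subscribed_standby_nodes_: node id -> (seq id, term).
using StandbyMap = std::unordered_map<uint32_t, std::pair<uint64_t, int64_t>>;

// Hypothetical helper mirroring the per-core task: mark the standby out of
// sync only if its stored term is not newer than the term captured by the
// core that sent the notification. Returns true if the entry was updated.
inline bool ApplyUnsubscribe(StandbyMap &nodes,
                             uint32_t node_id,
                             int64_t unsubscribe_standby_term)
{
    auto it = nodes.find(node_id);
    if (it == nodes.end())
    {
        return false;  // This core has no subscription for the node.
    }
    if (it->second.second > unsubscribe_standby_term)
    {
        return false;  // Newer subscription term: leave it untouched.
    }
    it->second.first = UINT64_MAX;
    return true;
}
```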


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find SendMessageToNode method definition and its return type
rg -n "SendMessageToNode" -t cpp -t h -t hh -C 3

Repository: eloqdata/tx_service

Length of output: 89

🏁 Script executed:

#!/bin/bash
# Find SendMessageToNode method definition and its return type
rg -n "SendMessageToNode" -t cpp -t h -C 3

Repository: eloqdata/tx_service

Length of output: 25623

🏁 Script executed:

#!/bin/bash
# Find SendMessageResult struct definition
rg -n "struct SendMessageResult|class SendMessageResult" -t cpp -t h -A 10

Repository: eloqdata/tx_service

Length of output: 809

🏁 Script executed:

#!/bin/bash
# Get more details about the SendMessageResult and understand the implementation
rg -n "SendMessageResult" tx_service/include/remote/cc_stream_sender.h -A 30 | head -60

Repository: eloqdata/tx_service

Length of output: 2387

🏁 Script executed:

#!/bin/bash
# Check the NotifyStandbyOutOfSync context and see if SendStandbyMessageToNode could be used
rg -n "SendStandbyMessageToNode" -t cpp -t h -C 2 | head -50

Repository: eloqdata/tx_service

Length of output: 2445

🏁 Script executed:

#!/bin/bash
# Check the implementation of SendMessageToNode to see what happens on failure
cat -n tx_service/src/remote/cc_stream_sender.cpp | sed -n '158,250p'

Repository: eloqdata/tx_service

Length of output: 4286

🏁 Script executed:

#!/bin/bash
# Continue reading SendMessageToNode implementation
cat -n tx_service/src/remote/cc_stream_sender.cpp | sed -n '250,290p'

Repository: eloqdata/tx_service

Length of output: 1948

🏁 Script executed:

#!/bin/bash
# Continue reading to see what happens on non-resend failures
cat -n tx_service/src/remote/cc_stream_sender.cpp | sed -n '290,310p'

Repository: eloqdata/tx_service

Length of output: 1443

🏁 Script executed:

#!/bin/bash
# See the final return value
cat -n tx_service/src/remote/cc_stream_sender.cpp | sed -n '310,325p'

Repository: eloqdata/tx_service

Length of output: 633

🏁 Script executed:

#!/bin/bash
# Check the exact context around line 3128 in cc_shard.cpp
cat -n tx_service/src/cc/cc_shard.cpp | sed -n '3115,3165p'

Repository: eloqdata/tx_service

Length of output: 2377

Check the send result before marking standby as out-of-sync.

NotifyStandbyOutOfSync modifies state (setting seq_id to UINT64_MAX and removing heartbeat targets) without verifying the notification was successfully sent or queued for retry. SendMessageToNode returns a SendMessageResult struct with fields sent and queued_for_retry. If the result indicates the message failed and wasn't queued (both false), the standby will never resubscribe and will silently lag behind.

Check the result before modifying state:

const auto send_result = stream_sender_->SendMessageToNode(node_id, cc_msg);
if (!send_result.sent && !send_result.queued_for_retry)
{
    LOG(WARNING) << "Failed to notify standby " << node_id
                 << " of out-of-sync state; will retry later";
    return;
}
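
The ordering this comment asks for, checking the send result before touching any state, can be exercised in isolation with the sketch below. The two field names on `SendMessageResult` come from the review comment, but the real struct may carry more. `SeqIdAndTerm` and `MarkOutOfSyncIfNotified` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Field names follow the review comment; the real struct may differ.
struct SendMessageResult
{
    bool sent{false};
    bool queued_for_retry{false};
};

// Assumed standby record, matching the (seq id, term) pair in the diff.
using SeqIdAndTerm = std::pair<uint64_t, int64_t>;

// Hypothetical helper: marks the standby out of sync only when the
// notification was delivered or queued for retry. Returns true if the
// record was modified.
inline bool MarkOutOfSyncIfNotified(const SendMessageResult &send_result,
                                    SeqIdAndTerm &seq_id_and_term)
{
    if (!send_result.sent && !send_result.queued_for_retry)
    {
        // The standby never learned it fell behind, so keep its state
        // unchanged and let a later attempt notify it.
        return false;
    }
    seq_id_and_term.first = UINT64_MAX;
    return true;
}
```

In the real function the heartbeat-target removal and the per-core dispatch would sit behind the same check.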


coderabbitai bot reviewed Feb 4, 2026

MrGuin requested a review from liunyl February 5, 2026 03:47

liunyl approved these changes Feb 5, 2026

MrGuin force-pushed the notify_resub branch from b689249 to 48cad9a on February 6, 2026 08:38

MrGuin added 2 commits February 24, 2026 12:07

notify standby to resubscribe when forward entry missing

cf5902b

format

ae93f8a

MrGuin force-pushed the notify_resub branch from 48cad9a to ae93f8a on February 24, 2026 04:08

coderabbitai bot reviewed Feb 24, 2026

MrGuin merged commit 95edda9 into main Feb 24, 2026
4 checks passed

MrGuin deleted the notify_resub branch February 24, 2026 09:16

coderabbitai bot mentioned this pull request Mar 11, 2026

Enforce subscription barrier for standby snapshot sync #436

Merged
