Skip to content

Fix FindSliceKeys crashed for data_sync_vec empty#172

Merged
lzxddz merged 3 commits intomainfrom
update_timeout_ts
Oct 27, 2025
Merged

Fix FindSliceKeys crashed for data_sync_vec empty#172
lzxddz merged 3 commits intomainfrom
update_timeout_ts

Conversation

@lzxddz
Copy link
Collaborator

@lzxddz lzxddz commented Oct 24, 2025

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the link of issue using fixes eloqdb/tx_service#issue_id
  • Reference the link of RFC if exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

  • Bug Fixes
    • Improved data synchronization robustness by detecting and warning when a merged data batch becomes empty and resetting sync state to avoid silent failures.
    • Abort processing and reset scan context when a batch is empty to prevent invalid downstream updates.
    • Added early-exit guards to skip unnecessary work for empty batches and unavailable channels, reducing wasted processing and noisy errors.

@lzxddz lzxddz self-assigned this Oct 24, 2025
@coderabbitai
Copy link

coderabbitai bot commented Oct 24, 2025

Walkthrough

Added defensive empty-batch and post-merge guards: DataSyncForRangePartition now captures merged vector size, aborts and resets range-scan context when the merged data_sync_vec becomes empty; RangeCacheSender/FindSliceKeys returns early with a warning when batch_records is empty.

Changes

Cohort / File(s) Change Summary
Data-sync guards
src/cc/local_cc_shards.cpp
After merging vectors in DataSyncForRangePartition, capture data_sync_vec_size, check for empty data_sync_vec after drains/erases, log a warning and reset range-scan context to abort further processing when empty; removed redundant end-of-batch per-shard cleanup.
Empty-batch guard in range cache
src/cc/local_cc_shards.cpp
In RangeCacheSender::FindSliceKeys added early-return with a warning when batch_records is empty to skip unnecessary processing.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant RangeCacheSender
    Note right of RangeCacheSender #EEF2FF: FindSliceKeys (empty-batch guard)
    Caller->>RangeCacheSender: FindSliceKeys(batch_records)
    alt batch_records empty
        RangeCacheSender-->>Caller: log warning and return
    else batch_records non-empty
        RangeCacheSender->>RangeCacheSender: compute slice keys
        RangeCacheSender-->>Caller: continue flow
    end
Loading
sequenceDiagram
    participant Caller
    participant DataSync
    Note right of DataSync #F6FFF0: DataSyncForRangePartition (post-merge guard)
    Caller->>DataSync: DataSyncForRangePartition()
    DataSync->>DataSync: merge vectors -> data_sync_vec
    DataSync->>DataSync: capture data_sync_vec_size
    alt data_sync_vec empty
        DataSync-->>Caller: log warning, reset range-scan context, abort
    else not empty
        DataSync->>DataSync: continue per-shard processing
        DataSync-->>Caller: complete batch
    end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • xiexiaoy

Poem

🐰 I nudged the vectors, merged with care,

When empty shadows lingered there.
I warned, I reset, then hopped away,
Safe batches now — no crash today. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description consists only of the unchecked template checklist with no substantive content. While the description follows the template structure, all items remain unchecked and there is no explanation of what problem is being fixed, why these changes are necessary, or how they address the issue. The description should include actual details about the bug being fixed, the rationale for the solution, and context about the changes made to local_cc_shards.cpp.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title "Fix FindSliceKeys crashed for data_sync_vec empty" is specific and directly related to the main changes in the pull request. The summary of changes confirms that the modifications add guards and checks in LocalCcShards::DataSyncForRangePartition to prevent crashes when data_sync_vec becomes empty, which aligns with the title's intent. The title clearly communicates the primary purpose of the changeset, allowing teammates to understand the fix at a glance.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch update_timeout_ts

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c1785e and f5cd09e.

📒 Files selected for processing (1)
  • src/cc/local_cc_shards.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
PR: eloqdata/tx_service#150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • src/cc/local_cc_shards.cpp

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@lzxddz lzxddz linked an issue Oct 24, 2025 that may be closed by this pull request
@lzxddz lzxddz requested review from liunyl and yi-xmu October 24, 2025 08:17
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cc/local_cc_shards.cpp (1)

3828-3887: Do not continue here; it breaks the batching pipeline

This early continue skips all of the clean-up that has to happen after each scan batch (scan_cc.Reset(), per-core vector clear()s, and the flush quota bookkeeping). On the next loop iteration we keep the old scan buffers and outstanding quota, so we double-count records, leak the reserved data_sync_mem_controller_ budget, and eventually deadlock the worker. Please restructure this block so that we still run the normal reset/clear path (and release the quota) even when the merged batch becomes empty, instead of jumping straight to the next iteration.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3cd3a0 and eb928b6.

📒 Files selected for processing (1)
  • src/cc/local_cc_shards.cpp (3 hunks)

Comment on lines +6158 to +6164
if (batch_records.empty())
{
LOG(WARNING)
<< "FindSliceKeys: Empty batch records. Skip sending cache.";
return;
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Empty batch guard leaves stale slice state

When batch_records is empty we return without clearing batch_slice_key_vec_, yet AppendSliceDataRequest is still called right after this, so it works with stale slice keys (and even hits its assert). Make the empty-batch path reset the slice vector and ensure the caller skips AppendSliceDataRequest in that case.

🤖 Prompt for AI Agents
In src/cc/local_cc_shards.cpp around lines 6158-6164, the early return on empty
batch_records leaves batch_slice_key_vec_ populated which causes
AppendSliceDataRequest to operate on stale slice keys; clear
batch_slice_key_vec_ before returning and update the caller to skip calling
AppendSliceDataRequest when batch_records is empty (or check
batch_slice_key_vec_.empty() after this function) so no stale keys are used and
the assert is not triggered.

@lzxddz lzxddz force-pushed the update_timeout_ts branch from 11d59ce to 9c1785e Compare October 24, 2025 09:38
Copy link
Collaborator

@yi-xmu yi-xmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/cc/local_cc_shards.cpp (1)

6164-6169: Empty batch still leaves stale slice state

This early return happens before you clear batch_slice_key_vec_ or reset paused_pos_, and the caller still proceeds to AppendSliceDataRequest. That recreates the stale-slice scenario from the previous review: append runs with the prior batch’s keys/pause state and can still trigger the same assert. Please reset the per-batch state here and signal the caller to skip the append (e.g. return a boolean and gate the call). As it stands the original bug is still present.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11d59ce and 9c1785e.

📒 Files selected for processing (1)
  • src/cc/local_cc_shards.cpp (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
PR: eloqdata/tx_service#150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

  • src/cc/local_cc_shards.cpp

Comment on lines 3889 to +3898
}

if (data_sync_vec->empty())
{
LOG(WARNING) << "data_sync_vec becomes empty after erase, old "
"size of data_sync_vec_size: "
<< data_sync_vec_size;
// Reset
scan_cc.Reset();
continue;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Release flush quota before early continue

We reserve flush_data_size via AllocateFlushDataMemQuota(...) just above, but if this branch trips we continue without ever giving the quota back (no flush task is enqueued to do it later). After a few hits the controller stays pegged and subsequent scans block forever waiting for capacity. Please release the quota before the continue.

             if (data_sync_vec->empty())
             {
                 LOG(WARNING) << "data_sync_vec becomes empty after erase, old "
                                 "size of data_sync_vec_size: "
                              << data_sync_vec_size;
+                data_sync_mem_controller_.DeallocateFlushMemQuota(
+                    flush_data_size);
                 // Reset
                 scan_cc.Reset();
                 continue;
             }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
}
if (data_sync_vec->empty())
{
LOG(WARNING) << "data_sync_vec becomes empty after erase, old "
"size of data_sync_vec_size: "
<< data_sync_vec_size;
// Reset
scan_cc.Reset();
continue;
}
if (data_sync_vec->empty())
{
LOG(WARNING) << "data_sync_vec becomes empty after erase, old "
"size of data_sync_vec_size: "
<< data_sync_vec_size;
data_sync_mem_controller_.DeallocateFlushMemQuota(
flush_data_size);
// Reset
scan_cc.Reset();
continue;

@lzxddz lzxddz merged commit f4bbfde into main Oct 27, 2025
4 checks passed
@lzxddz lzxddz deleted the update_timeout_ts branch October 27, 2025 03:16
@coderabbitai coderabbitai bot mentioned this pull request Dec 2, 2025
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Crashed at FindSliceKeys on test eloqdoc cluster (3 nodes).

2 participants