Fix FindSliceKeys crashed for data_sync_vec empty by lzxddz · Pull Request #172 · eloqdata/tx_service

lzxddz · 2025-10-24T08:15:02Z

Here are some reminders before you submit the pull request

Add tests for the change
Document changes
Reference the link of issue using fixes eloqdb/tx_service#issue_id
Reference the link of RFC if exists
Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

Bug Fixes
- Improved data synchronization robustness by detecting and warning when a merged data batch becomes empty and resetting sync state to avoid silent failures.
- Abort processing and reset scan context when a batch is empty to prevent invalid downstream updates.
- Added early-exit guards to skip unnecessary work for empty batches and unavailable channels, reducing wasted processing and noisy errors.

coderabbitai · 2025-10-24T08:15:12Z

Walkthrough

Added defensive empty-batch and post-merge guards: DataSyncForRangePartition now captures merged vector size, aborts and resets range-scan context when the merged data_sync_vec becomes empty; RangeCacheSender/FindSliceKeys returns early with a warning when batch_records is empty.

Changes

Cohort / File(s)	Change Summary
Data-sync guards `src/cc/local_cc_shards.cpp`	After merging vectors in `DataSyncForRangePartition`, capture `data_sync_vec_size`, check for empty `data_sync_vec` after drains/erases, log a warning and reset range-scan context to abort further processing when empty; removed redundant end-of-batch per-shard cleanup.
Empty-batch guard in range cache `src/cc/local_cc_shards.cpp`	In `RangeCacheSender::FindSliceKeys` added early-return with a warning when `batch_records` is empty to skip unnecessary processing.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant RangeCacheSender
    Note right of RangeCacheSender #EEF2FF: FindSliceKeys (empty-batch guard)
    Caller->>RangeCacheSender: FindSliceKeys(batch_records)
    alt batch_records empty
        RangeCacheSender-->>Caller: log warning and return
    else batch_records non-empty
        RangeCacheSender->>RangeCacheSender: compute slice keys
        RangeCacheSender-->>Caller: continue flow
    end

sequenceDiagram
    participant Caller
    participant DataSync
    Note right of DataSync #F6FFF0: DataSyncForRangePartition (post-merge guard)
    Caller->>DataSync: DataSyncForRangePartition()
    DataSync->>DataSync: merge vectors -> data_sync_vec
    DataSync->>DataSync: capture data_sync_vec_size
    alt data_sync_vec empty
        DataSync-->>Caller: log warning, reset range-scan context, abort
    else not empty
        DataSync->>DataSync: continue per-shard processing
        DataSync-->>Caller: complete batch
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

xiexiaoy

Poem

🐰 I nudged the vectors, merged with care,

When empty shadows lingered there.
I warned, I reset, then hopped away,
Safe batches now — no crash today. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	The pull request description consists only of the unchecked template checklist with no substantive content. While the description follows the template structure, all items remain unchecked and there is no explanation of what problem is being fixed, why these changes are necessary, or how they address the issue. The description should include actual details about the bug being fixed, the rationale for the solution, and context about the changes made to local_cc_shards.cpp.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title "Fix FindSliceKeys crashed for data_sync_vec empty" is specific and directly related to the main changes in the pull request. The summary of changes confirms that the modifications add guards and checks in LocalCcShards::DataSyncForRangePartition to prevent crashes when data_sync_vec becomes empty, which aligns with the title's intent. The title clearly communicates the primary purpose of the changeset, allowing teammates to understand the fix at a glance.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch update_timeout_ts

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c1785e and f5cd09e.

📒 Files selected for processing (1)

src/cc/local_cc_shards.cpp (2 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-10-09T03:56:58.811Z

Learnt from: thweetkomputer
PR: eloqdata/tx_service#150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

src/cc/local_cc_shards.cpp

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/cc/local_cc_shards.cpp (1)

3828-3887: Do not continue here; it breaks the batching pipeline

This early continue skips all of the clean-up that has to happen after each scan batch (scan_cc.Reset(), per-core vector clear()s, and the flush quota bookkeeping). On the next loop iteration we keep the old scan buffers and outstanding quota, so we double-count records, leak the reserved data_sync_mem_controller_ budget, and eventually deadlock the worker. Please restructure this block so that we still run the normal reset/clear path (and release the quota) even when the merged batch becomes empty, instead of jumping straight to the next iteration.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3cd3a0 and eb928b6.

📒 Files selected for processing (1)

src/cc/local_cc_shards.cpp (3 hunks)

coderabbitai · 2025-10-24T08:19:17Z

src/cc/local_cc_shards.cpp

+    if (batch_records.empty())
+    {
+        LOG(WARNING)
+            << "FindSliceKeys: Empty batch records. Skip sending cache.";
+        return;
+    }
+


⚠️ Potential issue | 🔴 Critical

Empty batch guard leaves stale slice state

When batch_records is empty we return without clearing batch_slice_key_vec_, yet AppendSliceDataRequest is still called right after this, so it works with stale slice keys (and even hits its assert). Make the empty-batch path reset the slice vector and ensure the caller skips AppendSliceDataRequest in that case.

🤖 Prompt for AI Agents

In src/cc/local_cc_shards.cpp around lines 6158-6164, the early return on empty batch_records leaves batch_slice_key_vec_ populated which causes AppendSliceDataRequest to operate on stale slice keys; clear batch_slice_key_vec_ before returning and update the caller to skip calling AppendSliceDataRequest when batch_records is empty (or check batch_slice_key_vec_.empty() after this function) so no stale keys are used and the assert is not triggered.

yi-xmu

+1

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/cc/local_cc_shards.cpp (1)

6164-6169: Empty batch still leaves stale slice state

This early return happens before you clear batch_slice_key_vec_ or reset paused_pos_, and the caller still proceeds to AppendSliceDataRequest. That recreates the stale-slice scenario from the previous review: append runs with the prior batch’s keys/pause state and can still trigger the same assert. Please reset the per-batch state here and signal the caller to skip the append (e.g. return a boolean and gate the call). As it stands the original bug is still present.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11d59ce and 9c1785e.

📒 Files selected for processing (1)

src/cc/local_cc_shards.cpp (3 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-10-09T03:56:58.811Z

Learnt from: thweetkomputer
PR: eloqdata/tx_service#150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.

Applied to files:

src/cc/local_cc_shards.cpp

coderabbitai · 2025-10-24T09:45:55Z

src/cc/local_cc_shards.cpp

            }

+            if (data_sync_vec->empty())
+            {
+                LOG(WARNING) << "data_sync_vec becomes empty after erase, old "
+                                "size of data_sync_vec_size: "
+                             << data_sync_vec_size;
+                // Reset
+                scan_cc.Reset();
+                continue;


⚠️ Potential issue | 🔴 Critical

Release flush quota before early continue

We reserve flush_data_size via AllocateFlushDataMemQuota(...) just above, but if this branch trips we continue without ever giving the quota back (no flush task is enqueued to do it later). After a few hits the controller stays pegged and subsequent scans block forever waiting for capacity. Please release the quota before the continue.

if (data_sync_vec->empty()) { LOG(WARNING) << "data_sync_vec becomes empty after erase, old " "size of data_sync_vec_size: " << data_sync_vec_size; + data_sync_mem_controller_.DeallocateFlushMemQuota( + flush_data_size); // Reset scan_cc.Reset(); continue; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

}

if (data_sync_vec->empty())

{

LOG(WARNING) << "data_sync_vec becomes empty after erase, old "

"size of data_sync_vec_size: "

<< data_sync_vec_size;

// Reset

scan_cc.Reset();

continue;

}

if (data_sync_vec->empty())

{

LOG(WARNING) << "data_sync_vec becomes empty after erase, old "

"size of data_sync_vec_size: "

<< data_sync_vec_size;

data_sync_mem_controller_.DeallocateFlushMemQuota(

flush_data_size);

// Reset

scan_cc.Reset();

continue;

fix FindSliceKeys crashed for vector empty

eb928b6

lzxddz self-assigned this Oct 24, 2025

lzxddz linked an issue Oct 24, 2025 that may be closed by this pull request

[Bug]: Crashed at FindSliceKeys on test eloqdoc cluster (3 nodes). #171

Closed

lzxddz requested review from liunyl and yi-xmu October 24, 2025 08:17

coderabbitai bot reviewed Oct 24, 2025

View reviewed changes

re-init data_sync_vecs after move

9c1785e

lzxddz force-pushed the update_timeout_ts branch from 11d59ce to 9c1785e Compare October 24, 2025 09:38

yi-xmu approved these changes Oct 24, 2025

View reviewed changes

update

f5cd09e

coderabbitai bot reviewed Oct 24, 2025

View reviewed changes

lzxddz merged commit f4bbfde into main Oct 27, 2025
4 checks passed

lzxddz deleted the update_timeout_ts branch October 27, 2025 03:16

yi-xmu mentioned this pull request Nov 25, 2025

[Bug]: LocalCcShards::UpdateSlices assert(!slice_split_keys.empty()); #95

Closed

coderabbitai bot mentioned this pull request Dec 2, 2025

optimze update range slices #254

Merged

5 tasks

coderabbitai bot mentioned this pull request Dec 18, 2025

[Bug]: LocalCcShards::UpdateSlices assert(!slice_split_keys.empty()) failed #301

Closed

This was referenced Jan 15, 2026

Fix recent bugs #362

Closed

Bug fix #368

Merged

Skip ccpage that has no dirty keys since last datasync during datasyn… #374

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix FindSliceKeys crashed for data_sync_vec empty#172

Fix FindSliceKeys crashed for data_sync_vec empty#172
lzxddz merged 3 commits intomainfrom
update_timeout_ts

lzxddz commented Oct 24, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 24, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Oct 24, 2025

Uh oh!

yi-xmu left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Oct 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lzxddz commented Oct 24, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Here are some reminders before you submit the pull request

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 24, 2025

Choose a reason for hiding this comment

Uh oh!

yi-xmu left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lzxddz commented Oct 24, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 24, 2025 •

edited

Loading