Skip to content

osd/ECUtil: Fix erase_after_ro_offset length calculation and add tests#66817

Merged
aainscow merged 1 commit intoceph:mainfrom
aainscow:bad_erase_after_ro_offset_fix
Feb 2, 2026
Merged

osd/ECUtil: Fix erase_after_ro_offset length calculation and add tests#66817
aainscow merged 1 commit intoceph:mainfrom
aainscow:bad_erase_after_ro_offset_fix

Conversation

@aainscow
Copy link
Copy Markdown
Contributor

@aainscow aainscow commented Jan 6, 2026

System test logs showed EC recovery failures with assertion errors when
recovering small objects (smaller than stripe width) in EC pools.
The recovery would fail with "shard_size >= tobj_size" assertions
because shards that should be empty incorrectly contained data.

The primary change in this commit fixes a bug in
shard_extent_map_t::erase_after_ro_offset() where the length
calculation was incorrect:

sinfo->ro_range_to_shard_extent_set(ro_offset, ro_end - ro_start, ...)

Should be:

sinfo->ro_range_to_shard_extent_set(ro_offset, ro_end - ro_offset, ...)

When ro_offset < ro_start, the incorrect calculation caused data that
should be erased to remain on shards, leading to recovery failures.

Additionally, this commit adds 13 comprehensive unit tests to TestECUtil
that thoroughly exercise erase_after_ro_offset across various edge cases,
including the critical scenario of objects smaller than stripe width where
some shards should remain empty. These tests successfully catch the bug
when it is re-introduced.

Note: The unit tests in this commit were generated with assistance from
an LLM (Large Language Model) and subsequently validated and refined.

Fixes: https://tracker.ceph.com/issues/74329

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

@aainscow aainscow requested a review from a team as a code owner January 6, 2026 16:11
@aainscow aainscow requested a review from bill-scales January 7, 2026 10:40
Copy link
Copy Markdown
Contributor

@bill-scales bill-scales left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@rzarzynski rzarzynski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix LGTM!

The comments are just about the primary unit test which, per the on-Slack discussion, needs some changes.

@aainscow aainscow force-pushed the bad_erase_after_ro_offset_fix branch from c6df2cd to 7e1c551 Compare January 8, 2026 12:59
 System test logs showed EC recovery failures with assertion errors when
 recovering small objects (smaller than stripe width) in EC pools.
 The recovery would fail with "shard_size >= tobj_size" assertions
 because shards that should be empty incorrectly contained data.

 The primary change in this commit fixes a bug in
 shard_extent_map_t::erase_after_ro_offset() where the length
 calculation was incorrect:

   sinfo->ro_range_to_shard_extent_set(ro_offset, ro_end - ro_start, ...)

 Should be:

   sinfo->ro_range_to_shard_extent_set(ro_offset, ro_end - ro_offset, ...)

 When ro_offset < ro_start, the incorrect calculation caused data that
 should be erased to remain on shards, leading to recovery failures.

 Additionally, this commit adds 13 comprehensive unit tests to TestECUtil
 that thoroughly exercise erase_after_ro_offset across various edge cases,
 including the critical scenario of objects smaller than stripe width where
 some shards should remain empty. These tests successfully catch the bug
 when it is re-introduced.

 Note: The unit tests in this commit were generated with assistance from
 an LLM (Large Language Model) and subsequently validated and refined.

 Fixes: https://tracker.ceph.com/issues/74329

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
@aainscow aainscow force-pushed the bad_erase_after_ro_offset_fix branch from 7e1c551 to a60c86a Compare January 8, 2026 14:01
@ljflores
Copy link
Copy Markdown
Member

jenkins test make check arm64

@ljflores
Copy link
Copy Markdown
Member

jenkins test windows

@ljflores
Copy link
Copy Markdown
Member

@aainscow
Copy link
Copy Markdown
Contributor Author

jenkins test make check

@aainscow aainscow merged commit 5642d83 into ceph:main Feb 2, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants