tentacle: osd: Do not remove objects with divergent logs if only partial writes. #66725

Merged
yuriw merged 1 commit into ceph:tentacle from aainscow:wip-74269-tentacle
Jan 29, 2026

Conversation

@aainscow
Contributor

backport tracker: https://tracker.ceph.com/issues/74269


backport of #66698
parent tracker: https://tracker.ceph.com/issues/74221

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

Fixes https://tracker.ceph.com/issues/74221

Note: An AI was used to assist generating unit tests for this commit.
      The production code was written by the author.

In the scenario we are fixing here, there is a divergent log, which needs to
be rolled back. The non-primary did not participate in the transaction on
the object, but its log contains an entry describing that transaction. The
primary has a different transaction and has correctly detected the divergence.

The primary correctly concludes that no recovery is needed for the object, since
only partial writes exist on the non-primary.

The non-primary observes its divergent log and incorrectly concludes that
recovery IS needed for the divergent write and prepares by removing that
object.

The consequence of this depends on the next operation:
1. A read will fail with -EIO.
2. An RMW involving a read from the removed object will detect the failure
   and reconstruct the necessary data.
3. An RMW that does not involve a read from the removed object, or an append,
   will recreate the object, but with zeros, and so will cause data corruption.

It is unusual for such a log entry to exist on the non-primary because
normally those are omitted from the non-primary log. The scenario that causes
this is when a partial write triggers a clone due to copy-on-write.  We now have
a clone operation which affects ALL shards and so the log entry is sent to
all shards.

This is unusual to see in the field. We must have all of the following:

1. A clone operation (these are infrequent)
2. A partial write.
3. A peering cycle must happen before this write is complete.

The combination of 1 and 3 makes this very unusual in teuthology, and it
will be even rarer in the field.

The fix ensures we skip divergent log entries for partial writes that the shard
did not participate in.
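
The rule above can be modelled with a small shell sketch. This is an illustration only, not the OSD code: the function names are hypothetical, parity shards are not modelled, and the 4 KiB chunk size used in the usage note is an assumption chosen to match the k=2 reproduction script.

```shell
# Toy model only; function names are hypothetical and this is not Ceph code.

# Print the data shard indices touched by a write of LENGTH bytes at OFFSET,
# for K data shards striped in CHUNK-byte units. Parity shards are ignored.
data_shards_touched() {
    local offset=$1 length=$2 chunk=$3 k=$4
    [ "$length" -gt 0 ] || { echo ""; return 0; }
    local first=$(( offset / chunk ))
    local last=$(( (offset + length - 1) / chunk ))
    local c=$first out=""
    # At most K consecutive chunks are needed to visit every data shard once.
    while [ "$c" -le "$last" ] && [ "$c" -lt $(( first + k )) ]; do
        out="$out$(( c % k )) "
        c=$(( c + 1 ))
    done
    echo "${out% }"
}

# Decide what a shard does with a divergent log entry.
# $1 = space-separated list of shards the entry wrote to, $2 = this shard.
divergent_entry_action() {
    local s
    for s in $1; do
        if [ "$s" = "$2" ]; then
            echo "process"   # shard took part: handle the divergence as before
            return 0
        fi
    done
    echo "skip"              # shard did not take part: leave the object alone
}
```

Under these assumptions, `data_shards_touched 0 4096 4096 2` prints `0`, and `divergent_entry_action "0" 1` prints `skip`: the shard holding the second chunk was not written, so its object is left in place.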

The following is a minimal script to recreate:

set -e -x

MDS=0 MON=1 OSD=4 MGR=1 ../src/vstart.sh --debug --new -x --localhost -o timeout=10000 -o session_timeout=10000 -o debug_osd=20

ceph osd pool set noautoscale
ceph balancer off
ceph osd set nodeep-scrub
ceph osd set noscrub
ceph osd set noout

ceph config set global bluestore_debug_inject_read_err true

dd if=/dev/random of=file_8k bs=8k count=1
dd if=/dev/random of=file_4k bs=4k count=1

ceph osd erasure-code-profile set alex k=2 m=2
ceph osd pool create mypool --pg_num=1 --pool_type=erasure alex
ceph osd pool set mypool allow_ec_overwrites true
ceph osd pool set mypool allow_ec_optimizations true
ceph osd pool set mypool min_size 2

rados put -p mypool test1 file_8k

acting_set=$(ceph osd map mypool test1 --format=json | jq -r '.acting[]')
acting_array=($acting_set)

shard_0_osd=${acting_array[0]}
shard_1_osd=${acting_array[1]}

echo "Shard 0 OSD: $shard_0_osd"
echo "Shard 1 OSD: $shard_1_osd"

ceph daemon osd.$shard_0_osd injectecwriteerr mypool "*" 2 1 0 1

rados -p mypool mksnap test1_snap
rados put -p mypool test1 file_4k --offset 0 &

ceph osd set noup
ceph osd down $shard_1_osd

wait

ceph osd unset noup

rados -p mypool mksnap test1_snap2
rados put -p mypool test1 file_4k --offset 0
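
A possible follow-up check, not part of the original script, is to read the object back and compare it with the content the writes should have produced. The sketch below assumes that `rados put --offset 0` overwrites the first 4k in place and leaves the rest of the object untouched; the helper names are hypothetical.

```shell
# Hypothetical helpers for checking the result of the script above.

# Build the expected object content: the 8k file with its first 4k replaced.
# $1 = original 8k file, $2 = 4k overwrite, $3 = output path
build_expected() {
    cp "$1" "$3"
    dd if="$2" of="$3" bs=4096 count=1 conv=notrunc 2>/dev/null
}

# Read an object back and compare it with the expected content.
# $1 = pool, $2 = object, $3 = expected file
# Returns 0 on match, 1 on mismatch, 2 if the read itself fails.
verify_object() {
    local readback
    readback=$(mktemp)
    if ! rados -p "$1" get "$2" "$readback"; then
        rm -f "$readback"
        return 2
    fi
    cmp -s "$3" "$readback"
    local rc=$?
    rm -f "$readback"
    [ "$rc" -eq 0 ] || return 1
}
```

Usage, once the script has finished: `build_expected file_8k file_4k expected && verify_object mypool test1 expected`. If the shard was recreated with zeros as described above, the comparison would be expected to fail; a failed read would instead surface as return code 2.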

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 65dce5e)
@aainscow aainscow requested a review from a team as a code owner December 23, 2025 13:40
@aainscow aainscow added this to the tentacle milestone Dec 23, 2025
@aainscow aainscow added the core label Dec 23, 2025
@github-actions github-actions bot added the tests label Dec 23, 2025
@aainscow
Contributor Author

jenkins test make check

@amathuria
Contributor

jenkins test api

@yuriw yuriw merged commit e368409 into ceph:tentacle Jan 29, 2026
13 of 14 checks passed
@batrick batrick modified the milestones: tentacle, v20.2.1 Feb 13, 2026