Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mds: relax divergent backtrace scrub failures for replicated ancestor inodes #57354

Merged
merged 1 commit into from
May 29, 2024

Conversation

vshankar
Copy link
Contributor

@vshankar vshankar commented May 8, 2024

scrub could be verifying backtrace for an inode for which some of its ancestors might be replicas, e.g. (from a custom debug build) some ancestors of an inode with divergent backtrace were replicas:

[inode 0x3000000502f [...122,head] /volumes/qa/sv_0/b98de6ea-ed40-40d0-8e1a-9433a337a387/client.0/tmp/payload.2/multiple_rsync_payload.190107/firmware/ rep@0.1 fragtree_t(*^3) v6663 f(v493 m2024-05-01T06:38:16.403080+0000 388=289+99) n(v139 rc2024-05-01T06:55:35.239345+0000 b467915716 4880=4534+346) old_inodes=24 (inest mix) (ifile mix) | lock=0 importing=0 dirfrag=1 0x55a85d244680]

In such cases, the backpointer version (inode_backpointer_t::version) of the in-memory (cache) inode can fall behind the on-disk version causing scrub to consider the inode backtrace as divergent (memory version < on-disk version).

Sample:

"ondisk_value":"(2)0x30000005bba:

[<0x3000000502f/mwl8k v2126>,
<0x30000005026/firmware v6663>,
<0x30000005025/multiple_rsync_payload.190107 v3041>,
<0x10000005894/payload.2 v4873>,
<0x10000000005/tmp v6193>,<0x10000000003/client.0 v5964>,
<0x10000000002/b98de6ea-ed40-40d0-8e1a-9433a337a387 v5817>
,<0x10000000001/sv_0 v5837>,
<0x10000000000/qa v6241>,
<0x1/volumes v4036>]

"memoryvalue":"(2)0x30000005bba:

[<0x3000000502f/mwl8k v2126>,
<0x30000005026/firmware v6663>,
<0x30000005025/multiple_rsync_payload.190107 v3041>,
<0x10000005894/payload.2 v4873>,
<0x10000000005/tmp v6081>,
<0x10000000003/client.0 v5942>,
<0x10000000002/b98de6ea-ed40-40d0-8e1a-9433a337a387 v5709>,
<0x10000000001/sv_0 v5819>,
<0x10000000000/qa v6121>,
<0x1/volumes v4022>]

Fixes: http://tracker.ceph.com/issues/64730

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added the cephfs Ceph File System label May 8, 2024
@vshankar vshankar requested a review from a team May 8, 2024 10:38
@gregsfortytwo
Copy link
Member

the in-memory (cache) inode can fall behind the on-disk version causing scrub to consider the inode backtrace as divergent (memory version > on-disk version).

Sorry, if the replicated metadata has fallen behind, how can it have a newer version than what's on disk?

@vshankar
Copy link
Contributor Author

vshankar commented May 8, 2024

the in-memory (cache) inode can fall behind the on-disk version causing scrub to consider the inode backtrace as divergent (memory version > on-disk version).

Sorry, if the replicated metadata has fallen behind, how can it have a newer version than what's on disk?

Typo in the message - should be lesser than what's on disk. I'll fix it up.

@gregsfortytwo
Copy link
Member

the in-memory (cache) inode can fall behind the on-disk version causing scrub to consider the inode backtrace as divergent (memory version > on-disk version).

Sorry, if the replicated metadata has fallen behind, how can it have a newer version than what's on disk?

Typo in the message - should be lesser than what's on disk. I'll fix it up.

👍 Ah now it makes so much more sense.

… inodes

scrub could be verifying backtrace for an inode for which some of its
ancestors might be replicas, e.g. (from a custom debug build) some
ancestors of an inode with divergent backtrace were replicas:

```
[inode 0x3000000502f [...122,head] /volumes/qa/sv_0/b98de6ea-ed40-40d0-8e1a-9433a337a387/client.0/tmp/payload.2/multiple_rsync_payload.190107/firmware/ rep@0.1 fragtree_t(*^3) v6663 f(v493 m2024-05-01T06:38:16.403080+0000 388=289+99) n(v139 rc2024-05-01T06:55:35.239345+0000 b467915716 4880=4534+346) old_inodes=24 (inest mix) (ifile mix) | lock=0 importing=0 dirfrag=1 0x55a85d244680]
```

In such cases, the backpointer version (inode_backpointer_t::version) of the
in-memory (cache) inode can fall behind the on-disk version causing scrub to
consider the inode backtrace as divergent (memory version < on-disk version).

Sample:

```
"ondisk_value":"(2)0x30000005bba:

[<0x3000000502f/mwl8k v2126>,
<0x30000005026/firmware v6663>,
<0x30000005025/multiple_rsync_payload.190107 v3041>,
<0x10000005894/payload.2 v4873>,
<0x10000000005/tmp v6193>,<0x10000000003/client.0 v5964>,
<0x10000000002/b98de6ea-ed40-40d0-8e1a-9433a337a387 v5817>
,<0x10000000001/sv_0 v5837>,
<0x10000000000/qa v6241>,
<0x1/volumes v4036>]

"memoryvalue":"(2)0x30000005bba:

[<0x3000000502f/mwl8k v2126>,
<0x30000005026/firmware v6663>,
<0x30000005025/multiple_rsync_payload.190107 v3041>,
<0x10000005894/payload.2 v4873>,
<0x10000000005/tmp v6081>,
<0x10000000003/client.0 v5942>,
<0x10000000002/b98de6ea-ed40-40d0-8e1a-9433a337a387 v5709>,
<0x10000000001/sv_0 v5819>,
<0x10000000000/qa v6121>,
<0x1/volumes v4022>]
```

Fixes: http://tracker.ceph.com/issues/64730
Signed-off-by: Venky Shankar <vshankar@redhat.com>
@vshankar
Copy link
Contributor Author

vshankar commented May 9, 2024

the in-memory (cache) inode can fall behind the on-disk version causing scrub to consider the inode backtrace as divergent (memory version > on-disk version).

Sorry, if the replicated metadata has fallen behind, how can it have a newer version than what's on disk?

Typo in the message - should be lesser than what's on disk. I'll fix it up.

👍 Ah now it makes so much more sense.

Commit message (+ PR description) fixed and added additional debug detail.

Comment on lines +4858 to +4862
} else if (in->is_any_ancestor_inode_a_replica()) {
results->backtrace.passed = true;
dout(20) << "divergent backtraces are acceptable when some "
"ancestor inodes are replicas " << *in << dendl;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if following any path below a replica dir inode is sensible at all. Shouldn't scrub follow the auth inode trail. But that would lead to a distributed algorithm and a complex one I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. So, one of the non-auth parent inodes could have a auth cdir (pinned directory) and therefore that subtree needs to be scrubbed. So, if scrub jumps off to the auth-mds of the inode (note that its cdir would be auth), then the entire subtree doesn't get scrubbed.

Copy link
Contributor

@kotreshhr kotreshhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@vshankar
Copy link
Contributor Author

vshankar commented May 9, 2024

jenkins test api

Copy link
Contributor

@mchangir mchangir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@batrick
Copy link
Member

batrick commented May 15, 2024

Please do not merge this yet. I'm looking more closely at the logs.

@vshankar
Copy link
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/66035.

Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is an unfortunate consequence of the scrub not acquiring nest locks which would update the iversion for the in-memory inode prior to comparing to on-disk values. I would like if the commit message and comment could expand on this. Here is what I found:

rank 4:

# last time iversion was updated:
2024-03-04T13:04:14.528+0000 7fdd4a200700 10 mds.4.cache.ino(0x1000000365d) decode_lock_inest taking inode rstat n(v213 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603) for [inode 0x1000000365d [...15a,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/ rep@0.1 v5737 f() n(v212 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603)/n(v0 1=0+1) (inest mix) (ifile lock) (iversion lock) | lock=0 dirfrag=1 dirwaiter=0 discoverbase=0 0x55ac89887b80]
...
# error:
2024-03-04T13:05:35.157+0000 7fdd42000700 -1 mds.4.scrubstack _validate_inode_done scrub error on inode [inode 0x10000004847 [...42,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/payload.1/multiple_rsync_payload.196784/modules/5.14.0-239.el9.x86_64/kernel/drivers/mmc/host/ auth v132 ap=1 f(v0 m2024-03-04T10:43:41.112915+0000 12=12+0) n(v0 rc2024-03-04T10:43:41.144163+0000 b231232 13=12+1) (inest lock) (ifile excl) (iversion lock) caps={15360=pAsLsXsFsx/AsLsXsFsx@1},l=15360 | importingcaps=0 dirfrag=1 caps=1 replicated=0 authpin=1 scrubqueue=0 0x55ac8a214100]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":0,"ondisk_value":"(2)0x10000004847:[<0x10000004790/host v132>,<0x10000004757/mmc v4410>,<0x1000000474f/drivers v1576>,<0x10000004749/kernel v624>,<0x10000003675/5.14.0-239.el9.x86_64 v1033>,<0x1000000365f/modules v7307>,<0x1000000365e/multiple_rsync_payload.196784 v3136>,<0x1000000365d/payload.1 v4404>,<0x10000000005/tmp v5746>,<0x10000000003/client.0 v5534>,<0x10000000002/8e048092-a055-46f2-bd35-bc12ad2511ae v5454>,<0x10000000001/sv_0 v5393>,<0x10000000000/qa v5396>,<0x1/volumes v5291>]//[]","memoryvalue":"(2)0x10000004847:[<0x10000004790/host v132>,<0x10000004757/mmc v4410>,<0x1000000474f/drivers v1574>,<0x10000004749/kernel v624>,<0x10000003675/5.14.0-239.el9.x86_64 v648>,<0x1000000365f/modules v5655>,<0x1000000365e/multiple_rsync_payload.196784 v3103>,<0x1000000365d/payload.1 v4404>,<0x10000000005/tmp v5742>,<0x10000000003/client.0 v5526>,<0x10000000002/8e048092-a055-46f2-bd35-bc12ad2511ae v5452>,<0x10000000001/sv_0 v5386>,<0x10000000000/qa v5396>,<0x1/volumes v5287>]//[]","error_str":"On-disk backtrace is divergent or newer"},"raw_stats":{"checked":true,"passed":true,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 m2024-03-04T10:43:41.112915+0000 12=12+0)","ondisk_value.rstat":"n(v0 rc2024-03-04T10:43:41.112915+0000 b231232 13=12+1)","memory_value.dirstat":"f(v0 m2024-03-04T10:43:41.112915+0000 12=12+0)","memory_value.rstat":"n(v0 rc2024-03-04T10:43:41.144163+0000 b231232 13=12+1)","error_str":""},"return_code":0}

For rank 0:

# last update to rank 4
2024-03-04T13:04:14.526+0000 7fb043a00700 10 mds.0.cache.ino(0x1000000365d) clear_dirty_scattered 128 on [inode 0x1000000365d [...15a,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/ auth{1=1,2=1,3=1,4=1} v5742 ap=1 f(v0 m2024-03-04T11:45:32.829119+0000 2=0+2) n(v213 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603) old_inodes=162 (inest mix->lock(2) w=1 flushed) (ifile excl) (iversion lock) caps={15360=pAsLsXsFsx/-@5},l=15360 | dirtyscattered=0 request=0 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 replicated=1 dirty=1 waiter=0 authpin=1 scrubqueue=0 0x5590791d4580]
2024-03-04T13:04:14.526+0000 7fb043a00700  1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] send_to--> mds [v2:172.21.15.88:6836/2374192948,v1:172.21.15.88:6837/2374192948] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- ?+0 0x55907b74e000
2024-03-04T13:04:14.526+0000 7fb043a00700  1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] --> [v2:172.21.15.88:6836/2374192948,v1:172.21.15.88:6837/2374192948] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- 0x55907b74e000 con 0x559074861400
2024-03-04T13:04:14.526+0000 7fb043a00700  1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] send_to--> mds [v2:172.21.15.88:6832/2821191379,v1:172.21.15.88:6833/2821191379] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- ?+0 0x55907c5d2400
2024-03-04T13:04:14.526+0000 7fb043a00700  1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] --> [v2:172.21.15.88:6832/2821191379,v1:172.21.15.88:6833/2821191379] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- 0x55907c5d2400 con 0x559074861800
2024-03-04T13:04:14.526+0000 7fb043a00700  1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] send_to--> mds [v2:172.21.15.138:6838/2340706881,v1:172.21.15.138:6839/2340706881] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- ?+0 0x55907b7b8200
2024-03-04T13:04:14.526+0000 7fb043a00700  1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] --> [v2:172.21.15.138:6838/2340706881,v1:172.21.15.138:6839/2340706881] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- 0x55907b7b8200 con 0x559074861c00
2024-03-04T13:04:14.526+0000 7fb043a00700  1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] send_to--> mds [v2:172.21.15.138:6836/3212400420,v1:172.21.15.138:6837/3212400420] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- ?+0 0x559076d78e00
2024-03-04T13:04:14.526+0000 7fb043a00700  1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] --> [v2:172.21.15.138:6836/3212400420,v1:172.21.15.138:6837/3212400420] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- 0x559076d78e00 con 0x55907485a000
...
# update via nestlock write but no flush (deliberately) to other rnaks
2024-03-04T13:04:21.445+0000 7fb043a00700  7 mds.0.locker local_wrlock_finish  on (iversion lock w=1) on [inode 0x1000000365d [...15a,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/ auth{1=1,2=1,3=1,4=1} v5744 ap=1 f(v0 m2024-03-04T11:45:32.829119+0000 2=0+2) n(v213 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603) old_inodes=162 (inest mix w=1) (ifile excl) (iversion lock w=1) caps={15360=pAsLsXsFsx/-@5},l=15360 | dirtyscattered=0 request=0 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 replicated=1 dirty=1 waiter=0 authpin=1 scrubqueue=0 0x5590791d4580]
2024-03-04T13:04:21.445+0000 7fb043a00700  7 mds.0.locker wrlock_finish on (inest mix w=1) on [inode 0x1000000365d [...15a,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/ auth{1=1,2=1,3=1,4=1} v5744 ap=1 f(v0 m2024-03-04T11:45:32.829119+0000 2=0+2) n(v213 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603) old_inodes=162 (inest mix w=1) (ifile excl) (iversion lock) caps={15360=pAsLsXsFsx/-@5},l=15360 | dirtyscattered=0 request=0 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 replicated=1 dirty=1 waiter=0 authpin=1 scrubqueue=0 0x5590791d4580]
2024-03-04T13:04:21.445+0000 7fb043a00700 10 mds.0.locker scatter_eval (inest mix) on [inode 0x1000000365d [...15a,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/ auth{1=1,2=1,3=1,4=1} v5744 ap=1 f(v0 m2024-03-04T11:45:32.829119+0000 2=0+2) n(v213 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603) old_inodes=162 (inest mix) (ifile excl) (iversion lock) caps={15360=pAsLsXsFsx/-@5},l=15360 | dirtyscattered=0 request=0 lock=0 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 replicated=1 dirty=1 waiter=0 authpin=1 scrubqueue=0 0x5590791d4580]

@vshankar
Copy link
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/66063.

@vshankar
Copy link
Contributor Author

This PR is under test in https://tracker.ceph.com/issues/66090.

Copy link
Contributor Author

@vshankar vshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar vshankar merged commit becedf2 into ceph:main May 29, 2024
14 of 17 checks passed
@vshankar
Copy link
Contributor Author

So this is an unfortunate consequence of the scrub not acquiring nest locks which would update the iversion for the in-memory inode prior to comparing to on-disk values. I would like if the commit message and comment could expand on this.

Sorry @batrick, I totally missed this comment before merging. It's a minor update to the commit message though, but I apologize for the same.

@lxbsz
Copy link
Member

lxbsz commented Jun 7, 2024

@vshankar Sorry to review this after it getting merged.

I have one question when I was going through the logs and code for my tracker: https://tracker.ceph.com/issues/57656

Currently this PR will treat the scrub as passed when any of the parents is in replica. But what if there really has divergent in the path and only one of the parent is in replica.

For example for path /AA/BB/CC/DD/EE/FF/GG/file, what if the BB is in replica, but EE and is really divergent or disk version is newer ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cephfs Ceph File System
Projects
None yet
6 participants