mds: relax divergent backtrace scrub failures for replicated ancestor inodes #57354
Conversation
Sorry, if the replicated metadata has fallen behind, how can it have a newer version than what's on disk?
Typo in the message - it should be less than what's on disk. I'll fix it up.
👍 Ah, now it makes much more sense.
… inodes

scrub could be verifying backtrace for an inode for which some of its ancestors might be replicas, e.g. (from a custom debug build) some ancestors of an inode with divergent backtrace were replicas:

```
[inode 0x3000000502f [...122,head] /volumes/qa/sv_0/b98de6ea-ed40-40d0-8e1a-9433a337a387/client.0/tmp/payload.2/multiple_rsync_payload.190107/firmware/ rep@0.1 fragtree_t(*^3) v6663 f(v493 m2024-05-01T06:38:16.403080+0000 388=289+99) n(v139 rc2024-05-01T06:55:35.239345+0000 b467915716 4880=4534+346) old_inodes=24 (inest mix) (ifile mix) | lock=0 importing=0 dirfrag=1 0x55a85d244680]
```

In such cases, the backpointer version (inode_backpointer_t::version) of the in-memory (cache) inode can fall behind the on-disk version, causing scrub to consider the inode backtrace as divergent (memory version < on-disk version). Sample:

```
"ondisk_value":"(2)0x30000005bba: [<0x3000000502f/mwl8k v2126>, <0x30000005026/firmware v6663>, <0x30000005025/multiple_rsync_payload.190107 v3041>, <0x10000005894/payload.2 v4873>, <0x10000000005/tmp v6193>, <0x10000000003/client.0 v5964>, <0x10000000002/b98de6ea-ed40-40d0-8e1a-9433a337a387 v5817>, <0x10000000001/sv_0 v5837>, <0x10000000000/qa v6241>, <0x1/volumes v4036>]
"memoryvalue":"(2)0x30000005bba: [<0x3000000502f/mwl8k v2126>, <0x30000005026/firmware v6663>, <0x30000005025/multiple_rsync_payload.190107 v3041>, <0x10000005894/payload.2 v4873>, <0x10000000005/tmp v6081>, <0x10000000003/client.0 v5942>, <0x10000000002/b98de6ea-ed40-40d0-8e1a-9433a337a387 v5709>, <0x10000000001/sv_0 v5819>, <0x10000000000/qa v6121>, <0x1/volumes v4022>]
```

Fixes: http://tracker.ceph.com/issues/64730
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Commit message (+ PR description) fixed and added additional debug detail. |
```cpp
} else if (in->is_any_ancestor_inode_a_replica()) {
  results->backtrace.passed = true;
  dout(20) << "divergent backtraces are acceptable when some "
              "ancestor inodes are replicas " << *in << dendl;
}
```
I was wondering if following any path below a replica dir inode is sensible at all. Shouldn't scrub follow the auth inode trail? But that would lead to a distributed algorithm, and a complex one, I guess.
Good question. So, one of the non-auth parent inodes could have an auth cdir (pinned directory), and that subtree therefore needs to be scrubbed. So, if scrub jumps off to the auth MDS of the inode (note that its cdir would be auth there), the entire subtree doesn't get scrubbed.
lgtm
jenkins test api
LGTM
Please do not merge this yet. I'm looking more closely at the logs.
This PR is under test in https://tracker.ceph.com/issues/66035.
So this is an unfortunate consequence of the scrub not acquiring the nest locks, which would update the iversion for the in-memory inode prior to comparing it to the on-disk values. I would like it if the commit message and comment could expand on this. Here is what I found:
rank 4:
# last time iversion was updated:
2024-03-04T13:04:14.528+0000 7fdd4a200700 10 mds.4.cache.ino(0x1000000365d) decode_lock_inest taking inode rstat n(v213 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603) for [inode 0x1000000365d [...15a,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/ rep@0.1 v5737 f() n(v212 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603)/n(v0 1=0+1) (inest mix) (ifile lock) (iversion lock) | lock=0 dirfrag=1 dirwaiter=0 discoverbase=0 0x55ac89887b80]
...
# error:
2024-03-04T13:05:35.157+0000 7fdd42000700 -1 mds.4.scrubstack _validate_inode_done scrub error on inode [inode 0x10000004847 [...42,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/payload.1/multiple_rsync_payload.196784/modules/5.14.0-239.el9.x86_64/kernel/drivers/mmc/host/ auth v132 ap=1 f(v0 m2024-03-04T10:43:41.112915+0000 12=12+0) n(v0 rc2024-03-04T10:43:41.144163+0000 b231232 13=12+1) (inest lock) (ifile excl) (iversion lock) caps={15360=pAsLsXsFsx/AsLsXsFsx@1},l=15360 | importingcaps=0 dirfrag=1 caps=1 replicated=0 authpin=1 scrubqueue=0 0x55ac8a214100]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":0,"ondisk_value":"(2)0x10000004847:[<0x10000004790/host v132>,<0x10000004757/mmc v4410>,<0x1000000474f/drivers v1576>,<0x10000004749/kernel v624>,<0x10000003675/5.14.0-239.el9.x86_64 v1033>,<0x1000000365f/modules v7307>,<0x1000000365e/multiple_rsync_payload.196784 v3136>,<0x1000000365d/payload.1 v4404>,<0x10000000005/tmp v5746>,<0x10000000003/client.0 v5534>,<0x10000000002/8e048092-a055-46f2-bd35-bc12ad2511ae v5454>,<0x10000000001/sv_0 v5393>,<0x10000000000/qa v5396>,<0x1/volumes v5291>]//[]","memoryvalue":"(2)0x10000004847:[<0x10000004790/host v132>,<0x10000004757/mmc v4410>,<0x1000000474f/drivers v1574>,<0x10000004749/kernel v624>,<0x10000003675/5.14.0-239.el9.x86_64 v648>,<0x1000000365f/modules v5655>,<0x1000000365e/multiple_rsync_payload.196784 v3103>,<0x1000000365d/payload.1 v4404>,<0x10000000005/tmp v5742>,<0x10000000003/client.0 v5526>,<0x10000000002/8e048092-a055-46f2-bd35-bc12ad2511ae v5452>,<0x10000000001/sv_0 v5386>,<0x10000000000/qa v5396>,<0x1/volumes v5287>]//[]","error_str":"On-disk backtrace is divergent or newer"},"raw_stats":{"checked":true,"passed":true,"read_ret_val":0,"ondisk_value.dirstat":"f(v0 m2024-03-04T10:43:41.112915+0000 12=12+0)","ondisk_value.rstat":"n(v0 rc2024-03-04T10:43:41.112915+0000 b231232 13=12+1)","memory_value.dirstat":"f(v0 
m2024-03-04T10:43:41.112915+0000 12=12+0)","memory_value.rstat":"n(v0 rc2024-03-04T10:43:41.144163+0000 b231232 13=12+1)","error_str":""},"return_code":0}
For rank 0:
# last update to rank 4
2024-03-04T13:04:14.526+0000 7fb043a00700 10 mds.0.cache.ino(0x1000000365d) clear_dirty_scattered 128 on [inode 0x1000000365d [...15a,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/ auth{1=1,2=1,3=1,4=1} v5742 ap=1 f(v0 m2024-03-04T11:45:32.829119+0000 2=0+2) n(v213 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603) old_inodes=162 (inest mix->lock(2) w=1 flushed) (ifile excl) (iversion lock) caps={15360=pAsLsXsFsx/-@5},l=15360 | dirtyscattered=0 request=0 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 replicated=1 dirty=1 waiter=0 authpin=1 scrubqueue=0 0x5590791d4580]
2024-03-04T13:04:14.526+0000 7fb043a00700 1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] send_to--> mds [v2:172.21.15.88:6836/2374192948,v1:172.21.15.88:6837/2374192948] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- ?+0 0x55907b74e000
2024-03-04T13:04:14.526+0000 7fb043a00700 1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] --> [v2:172.21.15.88:6836/2374192948,v1:172.21.15.88:6837/2374192948] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- 0x55907b74e000 con 0x559074861400
2024-03-04T13:04:14.526+0000 7fb043a00700 1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] send_to--> mds [v2:172.21.15.88:6832/2821191379,v1:172.21.15.88:6833/2821191379] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- ?+0 0x55907c5d2400
2024-03-04T13:04:14.526+0000 7fb043a00700 1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] --> [v2:172.21.15.88:6832/2821191379,v1:172.21.15.88:6833/2821191379] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- 0x55907c5d2400 con 0x559074861800
2024-03-04T13:04:14.526+0000 7fb043a00700 1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] send_to--> mds [v2:172.21.15.138:6838/2340706881,v1:172.21.15.138:6839/2340706881] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- ?+0 0x55907b7b8200
2024-03-04T13:04:14.526+0000 7fb043a00700 1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] --> [v2:172.21.15.138:6838/2340706881,v1:172.21.15.138:6839/2340706881] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- 0x55907b7b8200 con 0x559074861c00
2024-03-04T13:04:14.526+0000 7fb043a00700 1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] send_to--> mds [v2:172.21.15.138:6836/3212400420,v1:172.21.15.138:6837/3212400420] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- ?+0 0x559076d78e00
2024-03-04T13:04:14.526+0000 7fb043a00700 1 -- [v2:172.21.15.88:6838/265659833,v1:172.21.15.88:6839/265659833] --> [v2:172.21.15.138:6836/3212400420,v1:172.21.15.138:6837/3212400420] -- lock(a=lockflushed inest 0x1000000365d.head) v1 -- 0x559076d78e00 con 0x55907485a000
...
# update via nestlock write but no flush (deliberately) to other ranks
2024-03-04T13:04:21.445+0000 7fb043a00700 7 mds.0.locker local_wrlock_finish on (iversion lock w=1) on [inode 0x1000000365d [...15a,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/ auth{1=1,2=1,3=1,4=1} v5744 ap=1 f(v0 m2024-03-04T11:45:32.829119+0000 2=0+2) n(v213 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603) old_inodes=162 (inest mix w=1) (ifile excl) (iversion lock w=1) caps={15360=pAsLsXsFsx/-@5},l=15360 | dirtyscattered=0 request=0 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 replicated=1 dirty=1 waiter=0 authpin=1 scrubqueue=0 0x5590791d4580]
2024-03-04T13:04:21.445+0000 7fb043a00700 7 mds.0.locker wrlock_finish on (inest mix w=1) on [inode 0x1000000365d [...15a,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/ auth{1=1,2=1,3=1,4=1} v5744 ap=1 f(v0 m2024-03-04T11:45:32.829119+0000 2=0+2) n(v213 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603) old_inodes=162 (inest mix w=1) (ifile excl) (iversion lock) caps={15360=pAsLsXsFsx/-@5},l=15360 | dirtyscattered=0 request=0 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 replicated=1 dirty=1 waiter=0 authpin=1 scrubqueue=0 0x5590791d4580]
2024-03-04T13:04:21.445+0000 7fb043a00700 10 mds.0.locker scatter_eval (inest mix) on [inode 0x1000000365d [...15a,head] /volumes/qa/sv_0/8e048092-a055-46f2-bd35-bc12ad2511ae/client.0/tmp/ auth{1=1,2=1,3=1,4=1} v5744 ap=1 f(v0 m2024-03-04T11:45:32.829119+0000 2=0+2) n(v213 rc2024-03-04T13:04:00.157091+0000 b5411877886 62869=56266+6603) old_inodes=162 (inest mix) (ifile excl) (iversion lock) caps={15360=pAsLsXsFsx/-@5},l=15360 | dirtyscattered=0 request=0 lock=0 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 replicated=1 dirty=1 waiter=0 authpin=1 scrubqueue=0 0x5590791d4580]
This PR is under test in https://tracker.ceph.com/issues/66063.
This PR is under test in https://tracker.ceph.com/issues/66090.
Sorry @batrick, I totally missed this comment before merging. It's a minor update to the commit message, though; I apologize for that.
@vshankar Sorry to review this after it got merged. I have one question that came up while going through the logs and code for my tracker: https://tracker.ceph.com/issues/57656 Currently this PR will treat the scrub as For example for path