
mds: wait unlink to finish to avoid conflict when creating same dentries #46331

Merged: 2 commits into main, Oct 11, 2022

Conversation

@lxbsz (Member) commented May 19, 2022:

If a previous unlink request has been delayed for some reason, a subsequent create of the same dentry may fail, or a new open may succeed but the new contents written to the file will be lost.

The kernel client makes sure that, until it receives the first reply to the unlink, it will not send any subsequent create requests for the same dentry. Here we need to make sure that the dentry is marked as unlinking before that first reply is sent out.

Fixes: https://tracker.ceph.com/issues/55332
Signed-off-by: Xiubo Li <xiubli@redhat.com>
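
For illustration, a minimal C++ sketch of the approach (simplified stand-ins only: `Dentry`, `Server`, `handle_unlink`, `handle_create`, and `finish_unlink` approximate the real CDentry/Server machinery in the MDS; the actual state flag and waiter plumbing differ):

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified stand-in for CDentry: tracks an "unlinking" state plus
// waiters to be re-dispatched once the unlink finishes.
struct Dentry {
  bool unlinking = false;
  std::vector<std::function<void()>> waiters;
};

struct Server {
  std::unordered_map<std::string, Dentry> dentries;

  void handle_unlink(const std::string& name) {
    Dentry& dn = dentries[name];
    // Mark the dentry as unlinking *before* the first reply goes out,
    // so a create racing in right after the reply cannot slip past.
    dn.unlinking = true;
    send_first_reply();        // client may now issue a create for `name`
    journal_and_commit(name);  // eventually calls finish_unlink(name)
  }

  void handle_create(const std::string& name, std::function<void()> retry) {
    Dentry& dn = dentries[name];
    if (dn.unlinking) {
      // Unlink still in flight: park the request and retry it once the
      // unlink has finished, instead of failing or losing new contents.
      dn.waiters.push_back(std::move(retry));
      return;
    }
    do_create(name);
  }

  void finish_unlink(const std::string& name) {
    Dentry& dn = dentries[name];
    dn.unlinking = false;
    for (auto& w : dn.waiters) w();  // re-dispatch the queued creates
    dn.waiters.clear();
  }

  // Stubs standing in for the real reply/journal/create machinery.
  void send_first_reply() {}
  void journal_and_commit(const std::string&) {}
  void do_create(const std::string&) {}
};
```

The key ordering is that the unlinking flag is set before the first reply leaves the MDS, closing the window in which a racing create could observe the dentry as free.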

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@mchangir (Contributor) left a comment:

just a few nits

Review threads (outdated, resolved) on src/mds/MDCache.h and src/mds/Server.cc (×2).
@vshankar requested a review from a team May 25, 2022
@djgalloway changed the base branch from master to main May 25, 2022
@joscollin (Member) commented:

jenkins test make check

@vshankar self-assigned this Jun 16, 2022
Further review threads on src/msg/Message.h, src/mds/MDCache.h, src/mds/MDCache.cc (×3), and src/mds/Server.cc (×3); most were marked outdated and resolved.
@vshankar (Contributor) commented:

@lxbsz Is this ready for another round of review?

@lxbsz (Member, Author) commented Jun 29, 2022:

> @lxbsz Is this ready for another round of review?

Yeah, I have already fixed the issues mentioned in the comments.

@vshankar (Contributor) commented:

@lxbsz I'm seeing fsstress job failures in teuthology - https://pulpito.ceph.com/vshankar-2022-08-10_04:06:00-fs-wip-vshankar-testing-20220805-190751-testing-default-smithi/6965167/

This is the only MDS related PR with significant changes in the branch. Could you check if the above failure (stalled job) is related to this change?

@vshankar (Contributor) commented:

Another instance - https://pulpito.ceph.com/vshankar-2022-08-10_04:06:00-fs-wip-vshankar-testing-20220805-190751-testing-default-smithi/6965181/

The create call seems to be blocked for ~3hrs

2022-08-10T05:06:07.461 INFO:tasks.workunit.client.1.smithi130.stdout:8/727: dread d16/d17/d3f/d4f/f7b [0,4194304] 0
2022-08-10T05:06:07.464 INFO:tasks.workunit.client.1.smithi130.stdout:8/728: write d16/d6f/d71/d68/d86/fbc [986545,101947] 0
2022-08-10T05:06:07.467 INFO:tasks.workunit.client.1.smithi130.stdout:8/729: creat d16/d6f/d71/d68/d73/ff6 x:0 0 0
2022-08-10T08:03:58.198 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-08-10T08:04:00.533 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-08-10T08:04:58.241 DEBUG:teuthology.orchestra.run:timed out waiting; will kill: <Greenlet at 0x7f586ece3448: copy_file_to(<paramiko.ChannelFile from <paramiko.Channel 85 (E, <Logger tasks.workunit.client.1.smithi130.stderr (, None, False)>
2022-08-10T08:05:00.535 DEBUG:teuthology.orchestra.run:timed out waiting; will kill: <Greenlet at 0x7f586ece3e48: copy_file_to(<paramiko.ChannelFile from <paramiko.Channel 88 (E, <Logger tasks.workunit.client.0.smithi061.stderr (, None, False)>
2022-08-10T08:05:58.244 DEBUG:teuthology.orchestra.run:timed out waiting; will kill: <Greenlet at 0x7f586ece3648: copy_file_to(<paramiko.ChannelFile from <paramiko.Channel 85 (E, <Logger tasks.workunit.client.1.smithi130.stdout (, None, False)>
2022-08-10T08:05:58.246 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.1...

@lxbsz (Member, Author) commented Aug 11, 2022:

> Another instance - https://pulpito.ceph.com/vshankar-2022-08-10_04:06:00-fs-wip-vshankar-testing-20220805-190751-testing-default-smithi/6965181/
>
> The create call seems to be blocked for ~3hrs
>
> *(log excerpt quoted above)*

Sure, will check it.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
@lxbsz (Member, Author) commented Aug 12, 2022:

@vshankar Found the root cause and pushed a temporary commit 20c8646. Please review it, and if it's okay I will squash the commits.
I can reproduce the issue locally and the test passes after this fix; I am still running the teuthology tests.

> I'm dropping this change from my branch for now. Will pick it up in my next run.

Sure, I will run some of the teuthology tests that failed in your last run. But I am still hitting the usual Shaman build failure, as before; not sure when that will be fixed.

@lxbsz (Member, Author) commented Aug 15, 2022:

@vshankar (Contributor) commented:

> @vshankar Found the root cause and pushed a temporary commit 20c8646. Please review it, and if it's okay I will squash the commits.
>
> I can reproduce the issue locally and the test passes after this fix; I am still running the teuthology tests.

Checked the commit and it seems to do the right thing.

@vshankar (Contributor) commented Sep 5, 2022:

@vshankar (Contributor) commented:

@lxbsz Could you please take a look at ffsb failure - https://pulpito.ceph.com/vshankar-2022-09-13_09:03:58-fs-wip-vshankar-testing-20220909-120801-testing-default-smithi/7030431/

Excerpt from the log:

2022-09-13T09:39:27.816 INFO:tasks.workunit.client.0.smithi087.stdout:checking for main in -lm... yes
2022-09-13T09:39:27.970 INFO:tasks.workunit.client.0.smithi087.stdout:checking for main in -lpthread... yes
2022-09-13T09:39:27.976 INFO:tasks.workunit.client.0.smithi087.stdout:checking for ANSI C header files... (cached) yes
2022-09-13T09:39:28.082 INFO:tasks.workunit.client.0.smithi087.stdout:checking for sys/wait.h that is POSIX.1 compatible... yes
2022-09-13T09:39:31.321 DEBUG:teuthology.orchestra.run.smithi087:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2022-09-13T09:39:31.328 DEBUG:teuthology.orchestra.run.smithi132:> sudo logrotate /etc/logrotate.d/ceph-test.conf

The test times out and no activity is seen post 2022-09-13T09:39:28.082. This is similar to the failure I reported with my last run.

@lxbsz (Member, Author) commented Sep 14, 2022:

> @lxbsz Could you please take a look at ffsb failure - https://pulpito.ceph.com/vshankar-2022-09-13_09:03:58-fs-wip-vshankar-testing-20220909-120801-testing-default-smithi/7030431/
>
> Excerpt from the log: *(quoted above)*
>
> The test times out and no activity is seen post 2022-09-13T09:39:28.082. This is similar to the failure I reported with my last run.

@vshankar I think this is caused by another PR, #40787. From the client log:

2022-09-13T09:39:28.118+0000 7f7923fff700 20 client.4815 _lookup have 0x10000000002.head["conftest.err"] from mds.-1 ttl 0.000000 seq 0
2022-09-13T09:39:28.118+0000 7f7923fff700  1 client.4815 _lookup dir 0x10000000002.head(faked_ino=0 nref=39 ll_ref=788 cap_refs={} open={} mode=40775 size=0/0 nlink=1 btime=2022-09-13T09:39:00.428798+0000 mtime=2022-09-13T09:39:28.106629+0000 ctime=2022-09-13T09:39:28.106629+0000 caps=pAsLsXsFsx(0=pAsLsXsFsx) parents=0x10000000001.head["ffsb"] 0x7f794400b430) rename is on the way, will wait for dn 'conftest.err'
2022-09-13T09:39:28.909+0000 7f794b7f6700 20 client.4815 tick

Not this one.
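
For context, the stall in that log ("rename is on the way, will wait for dn 'conftest.err'") has roughly this shape; the sketch below is hypothetical (`DentryGate`, `rename_in_flight`, and `rename_finished` are illustrative names, not the actual client code from #40787):

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of a client-side dentry wait: a lookup blocks
// while a rename on the same dentry is still in flight.
struct DentryGate {
  std::mutex m;
  std::condition_variable cond;
  bool rename_in_flight = false;

  void wait_for_rename() {
    std::unique_lock<std::mutex> l(m);
    // Blocks until the rename completes and signals this gate.
    cond.wait(l, [this] { return !rename_in_flight; });
  }

  void rename_finished() {
    {
      std::lock_guard<std::mutex> l(m);
      rename_in_flight = false;
    }
    cond.notify_all();  // missing this wakeup is exactly the hang above
  }
};
```

If `rename_finished()` is never reached, or its wakeup is missed, the lookup blocks forever and every later create/open of the same name stalls behind it, which matches the timeout seen in the test.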

@vshankar (Contributor) commented:

> @lxbsz Could you please take a look at ffsb failure - https://pulpito.ceph.com/vshankar-2022-09-13_09:03:58-fs-wip-vshankar-testing-20220909-120801-testing-default-smithi/7030431/
>
> Excerpt from the log: *(quoted above)*
>
> The test times out and no activity is seen post 2022-09-13T09:39:28.082. This is similar to the failure I reported with my last run.

> @vshankar I think this is caused by another PR, #40787.

I commented on that PR -- there are timeouts seen in other tests too.

> *(client `_lookup` log excerpt quoted above)*
>
> Not this one.

@vshankar (Contributor) commented:

@lxbsz It's probably a good idea to drop #40787 from my branch and test this change. WDYT?

@lxbsz (Member, Author) commented Sep 14, 2022:

> @lxbsz It's probably a good idea to drop #40787 from my branch and test this change. WDYT?

Sounds good. I will fix that PR first and then run the test.

@vshankar added the wip-rishabh-testing label and removed the wip-vshankar-testing label Sep 15, 2022
@vshankar (Contributor) commented:

@rishabh-d-dave Please include this in your next run.

@rishabh-d-dave (Contributor) commented:

jenkins test make check arm64

@rishabh-d-dave (Contributor) commented:

jenkins test windows

@rishabh-d-dave (Contributor) left a review comment.

@rishabh-d-dave (Contributor) commented:

This PR is ready for merge. Waiting for the CI jobs to finish first.

@rishabh-d-dave (Contributor) commented:

The errors look unrelated - https://jenkins.ceph.com/job/ceph-pull-requests-arm64/33426/

@rishabh-d-dave (Contributor) commented:

jenkins test make check arm64

@lxbsz (Member, Author) commented Oct 11, 2022:

> jenkins test make check arm64

This test is optional, so IMO we can just ignore it.

@rishabh-d-dave (Contributor) commented:

#46331 (review)

https://tracker.ceph.com/projects/cephfs/wiki/Main#10-Oct-2022

Since the QA testing was fine, merging this PR now.

Labels: build/ops, cephfs (Ceph File System), core, tests, wip-rishabh-testing
5 participants