librbd: potential race between discard and writeback #21248

Merged
trociny merged 2 commits into ceph:master from dillaman:wip-23548 on Apr 8, 2018

4 participants
@dillaman
Contributor

commented Apr 4, 2018

No description provided.

@dillaman dillaman force-pushed the dillaman:wip-23548 branch from 5109336 to 6bf8589 Apr 4, 2018

@smithfarm

Contributor

commented Apr 4, 2018

Test   #1: run-rbd-unit-tests.sh ...................***Failed  147.49 sec

@batrick

Member

commented Apr 4, 2018

retest this please

@dillaman

Contributor Author

commented Apr 4, 2018

@smithfarm Thanks -- I saw the failure and have been trying to repeat a failure locally (since the make check jenkins builders now "conveniently" drop the end of the test run output).

@trociny

Contributor

commented Apr 5, 2018

@dillaman see [1] for "src/osdc/ObjectCacher.cc: 605: FAILED assert(bh->waitfor_read.empty())"

[1] http://pulpito.ceph.com/trociny-2018-04-05_10:40:08-rbd-wip-mgolub-testing-distro-basic-smithi/

@dillaman

Contributor Author

commented Apr 5, 2018

@trociny Thanks -- I reproduced it locally by running under a CPU-restricted cgroup. The readahead logic is racing with the IO.

dillaman added some commits Apr 4, 2018

osdc/ObjectCacher: allow discard to complete in-flight writeback
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
librbd: discard should wait for in-flight cache writeback to complete
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
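
The ordering described by the two commit titles above (a discard must not complete while cache writeback is still in flight) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyCache`, `start_writeback`, `finish_writeback`, and `discard` are hypothetical names invented for illustration and are not the ObjectCacher or librbd API. The real change is in the commits themselves.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy model only: every name here is hypothetical, not Ceph code.
class ToyCache {
public:
  // A dirty buffer starts being written back to the backing store.
  void start_writeback() { ++in_flight_; }

  // The backing store acknowledged one writeback.
  void finish_writeback() {
    assert(in_flight_ > 0);
    if (--in_flight_ == 0) {
      // Release every discard that was parked behind the writeback.
      std::vector<std::function<void()>> ready;
      ready.swap(waiters_);
      for (auto& fn : ready) {
        fn();
      }
    }
  }

  // Discard: on_finish runs only once no writeback is in flight.
  void discard(std::function<void()> on_finish) {
    if (in_flight_ == 0) {
      on_finish();
    } else {
      waiters_.push_back(std::move(on_finish));
    }
  }

private:
  int in_flight_ = 0;
  std::vector<std::function<void()>> waiters_;
};
```

In this model a discard issued during writeback is deferred rather than dropped or run early, so the late writeback acknowledgement cannot land after the discard has already completed.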

@dillaman dillaman force-pushed the dillaman:wip-23548 branch from 6bf8589 to 0e04de4 Apr 5, 2018

@trociny

Contributor

commented Apr 7, 2018

@ceph-jenkins retest this please

@dillaman

Contributor Author

commented Apr 8, 2018

retest this please

@trociny

trociny approved these changes Apr 8, 2018

Contributor

left a comment

@dillaman LGTM though I saw one fsx test failure [1]

I have failed to reproduce this, and given that it is for writethrough cache mode, it is probably not related.

[1] http://qa-proxy.ceph.com/teuthology/trociny-2018-04-06_07:49:11-rbd-wip-mgolub-testing-distro-basic-smithi/2361684/teuthology.log

@dillaman

Contributor Author

commented Apr 8, 2018

@trociny

Contributor

commented Apr 8, 2018

@dillaman Yes, actually I can reproduce this locally, using the same seed and config settings (some are probably not important), both on your branch and on master:

    rbd default data pool = datapool
    rbd skip partial discard = true
    rbd cache = true
    rbd cache max dirty = 0

    ceph_test_librbd_fsx -d -W -R -p 100 -P /tmp -r 1 -w 1 -t 1 -h 1 -l 250000000 -S 3206 -N 20000 pool_client.0 image_client.0

@trociny trociny merged commit 46df695 into ceph:master Apr 8, 2018

5 checks passed

Docs: build check: OK - docs built
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
make check (arm64): make check succeeded

@dillaman dillaman deleted the dillaman:wip-23548 branch Apr 8, 2018

@dillaman

Contributor Author

commented Apr 8, 2018

@trociny Thanks -- I suspect ObjectCacherObjectDispatch is not respecting the OBJECT_DISCARD_FLAG_SKIP_PARTIAL flag. I'm tracking it here [1].

[1] http://tracker.ceph.com/issues/23597

@trociny

Contributor

commented Apr 9, 2018

@dillaman Today I retested with different option combinations. It also fails when skip_partial_discard is disabled. It stops failing only when rbd cache = false; the other options from the list above look unimportant. So it is probably not related to the OBJECT_DISCARD_FLAG_SKIP_PARTIAL flag.

@dillaman
Contributor Author

commented Apr 9, 2018

@trociny It was related to my attempt to avoid initializing the ObjectCacher for every image in a clone chain. Once ObjectCacher receives an -ENOENT it will mark the whole object as missing so it can create odd edge conditions.
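
To illustrate why remembering a whole object as missing after an -ENOENT can produce surprising results, here is a minimal toy model. It is an assumption-laden sketch: `ToyStore`, `ToyObjectCache`, and `invalidate` are hypothetical names, not the ObjectCacher API, and the thread does not state that this exact sequence is what triggered the fsx failure.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <set>
#include <string>

// Toy model only: hypothetical names, not Ceph code.
struct ToyStore {
  std::map<std::string, std::string> objects;

  // Returns 0 and fills *out, or -ENOENT if the object is absent.
  int read(const std::string& oid, std::string* out) const {
    auto it = objects.find(oid);
    if (it == objects.end()) {
      return -ENOENT;
    }
    *out = it->second;
    return 0;
  }
};

class ToyObjectCache {
public:
  explicit ToyObjectCache(const ToyStore* store) : store_(store) {}

  int read(const std::string& oid, std::string* out) {
    if (missing_.count(oid)) {
      return -ENOENT;  // answered from the cached "missing" state
    }
    int r = store_->read(oid, out);
    if (r == -ENOENT) {
      missing_.insert(oid);  // remember the whole object as missing
    }
    return r;
  }

  // Forget the cached "missing" state for one object.
  void invalidate(const std::string& oid) { missing_.erase(oid); }

private:
  const ToyStore* store_;
  std::set<std::string> missing_;
};
```

In this model, if the object later appears in the store through a path that bypasses the cache, reads keep returning -ENOENT until the cached state is invalidated. That is the kind of stale-state edge condition a "whole object is missing" marker can create.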
