qa: add tests for persistent writeback cache #38921
@dillaman I am still testing these cases in my local teuthology environment; please review the PR in the meantime.
Force-pushed from d7b1216 to af539d0.
@lixiaoy1 We also need a PR to tweak the default
Force-pushed from af539d0 to 7a2105c.
@dillaman The test cases now pass. Please review.
Force-pushed from 7a2105c to 8e3cc14.
- ceph:
    conf:
      client:
        rbd_persistent_cache_mode: rwl
Nit: should probably be in a different directory from `pool` (i.e. `cache_mode`) and we would want a file to also exercise the SSD mode.
I created a new folder for rwl. The file for SSD mode will be added in a later PR, as it needs to be tested first.
It would be good to run the `xfstest` workload as well.
    conf:
      client:
        rbd_persistent_cache_mode: rwl
        rbd_rwl_path: /tmp
I suspect that this path will be a `tmpfs` and will fail since it doesn't support direct IO. Might need to use a directory under `~ubuntu/cephtest` and make sure you delete it at the end of the test run.
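The tmpfs concern above can be checked directly before settling on a cache path. The following is a minimal sketch, assuming Linux; `supports_direct_io` is a hypothetical helper written for illustration and is not part of teuthology or Ceph. It tries to open a scratch file in the candidate directory with `O_DIRECT` and treats `EINVAL` as "direct IO not supported here". Whether a given tmpfs mount rejects `O_DIRECT` may depend on the kernel version, so the result is best treated as a per-host probe.

```python
import errno
import os
import tempfile


def supports_direct_io(directory: str) -> bool:
    """Return True if a file in `directory` can be opened with O_DIRECT.

    Hypothetical helper for illustration; not a teuthology or Ceph API.
    """
    o_direct = getattr(os, "O_DIRECT", None)
    if o_direct is None:
        # This platform does not expose O_DIRECT; report it as unsupported.
        return False

    # Create a scratch file with a normal open, then reopen it with O_DIRECT.
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        direct_fd = os.open(path, os.O_RDWR | o_direct)
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            # The filesystem refused the O_DIRECT flag.
            return False
        raise
    else:
        os.close(direct_fd)
        return True
    finally:
        # Always remove the scratch file, whatever the outcome.
        os.unlink(path)


if __name__ == "__main__":
    candidate = tempfile.gettempdir()
    print(candidate, "supports direct IO:", supports_direct_io(candidate))
```

Running the probe against both `/tmp` and the intended test directory on a test node would show whether the two actually differ.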
DONE.
    "bench",
    "--io-type",
    "write"
]
Nit: maybe `--io-pattern rand` to avoid the simple sequential case.
DONE
@@ -0,0 +1,76 @@
"""
Nit: not sure this really needs to be a new task. You can probably just write a small script (or use the built-in `exec` task) to start `timeout <random>s rbd bench --io-type write ...`
Deleted the new task and used `timeout` instead.
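For readers who want to see what the suggested `timeout <random>s` approach amounts to, here is a rough Python equivalent. It is only a sketch: `run_with_random_timeout` is a hypothetical helper, the timeout bounds are made up, and a sleeping Python child process stands in for `rbd bench --io-type write testimage` so the example runs without a cluster. It assumes the goal is to stop the writer partway through rather than let it finish.

```python
import random
import subprocess
import sys


def run_with_random_timeout(cmd, min_seconds, max_seconds, rng=random):
    """Run `cmd` and kill it after a random timeout.

    Returns (timed_out, timeout_used). Hypothetical helper for illustration.
    """
    timeout = rng.uniform(min_seconds, max_seconds)
    try:
        # check=False: a non-zero exit status is not treated as an error,
        # similar in spirit to appending `|| true` in a shell command.
        subprocess.run(cmd, timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        # Being cut off mid-run is the expected outcome here.
        return True, timeout
    return False, timeout


if __name__ == "__main__":
    # Stand-in for the real workload: a process that outlives the timeout.
    long_running = [sys.executable, "-c", "import time; time.sleep(30)"]
    timed_out, used = run_with_random_timeout(long_running, 0.2, 0.5)
    print(f"interrupted={timed_out} after {used:.2f}s")
```

In the shell form, `timeout` exits with status 124 when the limit is hit, which is why a job that expects the interruption needs something like `|| true` to keep the step from being reported as failed.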
Force-pushed from 1b2cbdb to a2b39bc.
Force-pushed from a2b39bc to 167a11f.
        image_format: 2
- exec:
    client.0:
      - "timeout 10s rbd bench --io-type write testimage || true"
Nit: --io-pattern rand
@lixiaoy1 Looks like lots of failures
From the log, it failed to load the pwl_cache library.
Force-pushed from 167a11f to 65f64c8.
@dillaman I found that it is not correct to remove the cache folder in '7-cleanup/cleanup.yaml', because that step runs before the RBD images are deleted. As a result, the 'rbd remove' command fails, since it needs to load the cache folder.
Sounds like an issue w/ the RPM spec file that the library wasn't built? Running
What image is automatically deleted at the end of the test? I would expect any image created by the QEMU task to be deleted by the end of that task and not after the full test runs. Worst case, you could always disable the cache on the image before removing it. Also, realistically speaking, if you are removing the image, why would the cache care that it failed to open the cache? It's not like you are going to care about replaying the log.
https://github.com/ceph/ceph/blob/master/ceph.spec.in#L1276 The
From the code, removing an image acquires the exclusive lock in pre_remove_image(). After the exclusive lock is acquired, the image's write-back cache is initialized. Because the cache folder has already been removed, the initialization fails.
Force-pushed from e508922 to 86ae486.
The issue is fixed by using exec and exec_on_cleanup.
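The ordering problem behind this fix can be pictured with a small model. The sketch below is not teuthology code: it uses Python's `contextlib.ExitStack` purely as an analogy, the step names are invented, and it assumes that cleanup steps unwind in reverse order of setup. Under that assumption, registering the cache directory's removal at the moment the directory is created guarantees it runs after the image removal that was registered later.

```python
from contextlib import ExitStack


def run_job(log):
    """Simulate setup and teardown ordering; appends step names to `log`."""
    with ExitStack() as stack:
        # Registered first, so this cleanup runs last.
        log.append("create cache dir")
        stack.callback(log.append, "remove cache dir")

        # Registered later, so this cleanup runs before the cache dir's.
        log.append("create image")
        stack.callback(log.append, "remove image")

        log.append("run workload")


events = []
run_job(events)
print(events)
# → ['create cache dir', 'create image', 'run workload',
#    'remove image', 'remove cache dir']
```

A standalone cleanup step placed in the middle of the job has no such guarantee, which matches the failure reported earlier in the thread.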
I'd prefer to see "remove" function correctly and as expected should the cache be corrupted / missing. In a perfect world the
The spec will need to default it to on -- our upstream builders will not be changed to flip the switch.
Signed-off-by: Li, Xiaoyan <xiaoyan.li@intel.com>
#39539 has been raised to enable the write-back cache by default.
@dillaman Thank you for your advice. This item will be implemented in a standalone PR.
Found another issue: #39567
@lixiaoy1 It looks like PWL isn't passing the XFS tests:
and
Also deadlocking here [1] but now w/ debug logs enabled.
@lixiaoy1 Looks like this is where it hangs (note the 137 second timestamp delta between log entries after the PWL cache complained about the large IO):
Thank you for the info. The PR is ongoing: #39603
Going to merge this just so the existing bug doesn't get lost.
Hi @dillaman, I'm going to take over this work, but I don't quite understand the scenario. Can you give me an example? Why pass a flag, and what issue does it solve?
@CongMinYin In general, when removing an image, we don't care about the state or availability of data. We just want to ensure everything possible has been removed. Therefore, when removing there is no need to attempt to open the cache, writeback the cache, etc. Since the remove state machine acquires the exclusive lock (via
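To make the scenario in this exchange concrete, here is a toy model of the behavior being discussed. Everything in it is hypothetical: `ToyImage`, `CacheInitError`, and the `skip_cache_init` flag are invented for illustration and do not correspond to librbd classes or options. It only assumes what the thread states, namely that removal acquires the exclusive lock, that acquiring the lock initializes the write-back cache, and that initialization fails when the cache directory is gone.

```python
import os
import tempfile


class CacheInitError(Exception):
    """Raised when the toy cache cannot be initialized."""


class ToyImage:
    """Hypothetical stand-in for an image with a persistent write-back cache."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.removed = False

    def _init_cache(self):
        if not os.path.isdir(self.cache_dir):
            raise CacheInitError(f"cache dir missing: {self.cache_dir}")

    def acquire_exclusive_lock(self, skip_cache_init=False):
        # In this model, taking the lock normally also brings up the cache.
        if not skip_cache_init:
            self._init_cache()

    def remove(self, skip_cache_init=False):
        # Removal takes the lock first, so a cache failure blocks it
        # unless the caller asks to skip cache initialization.
        self.acquire_exclusive_lock(skip_cache_init=skip_cache_init)
        self.removed = True


if __name__ == "__main__":
    # Create and delete a directory so the path is guaranteed to be missing.
    cache_dir = tempfile.mkdtemp()
    os.rmdir(cache_dir)

    image = ToyImage(cache_dir)
    try:
        image.remove()
    except CacheInitError as exc:
        print("remove failed:", exc)

    image.remove(skip_cache_init=True)
    print("removed:", image.removed)
```

In the model, the flag exists only so that the removal path can take the lock without depending on cache state it is about to discard anyway.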
Signed-off-by: Li, Xiaoyan xiaoyan.li@intel.com