New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mgr: enhance the rados service map #39290
Conversation
I think using [1] https://github.com/ceph/ceph/blob/master/src/pybind/mgr/dashboard/services/tcmu_service.py |
Yeah, make sense and done.
Let's keep the 'tcmu-runner' string in the tcmu-runner project patches later.
|
src/test/librados/service.cc
Outdated
string name = string("portal/gw") + stringify(i % 8) + string(":rbd/image") + tid; | ||
ASSERT_EQ(0, rados_service_register(cluster, "iscsi", name.c_str(), | ||
"foo\0bar\0this\0that\0")); | ||
sleep(20); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this the threads will be exit immediately, which will unregister the services too.
We need to make all the threads to keep running for a while, or the ceph -s
won't see the 128 services at the same time when checking them in StatusFormat
later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sleeps worry me that (1) it unnecessarily delays the test and (2) we might hit race conditions in the lab from time-to-time. Why not bootstrap the connections up in the StatusFormat
test and keep the connections alive for the entirety of the test. Then you can loop querying for the status to match your expectation or fail if it fails to converge within like 60 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this approach sounds cool, will try it later.
src/test/librados/service.cc
Outdated
string name = string("portal/gw") + stringify(i % 8) + string(":rbd/image") + tid; | ||
ASSERT_EQ(0, rados_service_register(cluster, "iscsi", name.c_str(), | ||
"foo\0bar\0this\0that\0")); | ||
sleep(20); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sleeps worry me that (1) it unnecessarily delays the test and (2) we might hit race conditions in the lab from time-to-time. Why not bootstrap the connections up in the StatusFormat
test and keep the connections alive for the entirety of the test. Then you can loop querying for the status to match your expectation or fail if it fails to converge within like 60 seconds.
src/mgr/ServiceMap.cc
Outdated
// by the "prefix" instead of "daemon_name". | ||
// | ||
// For exmaple for iscsi gateways, it will be something likes: | ||
// "portal/hostX:rbd/imageX" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I wonder if it would be cleaner to just inject this prefix into the daemon metadata? That would ensure that we don't break backwards compatibility for dashboard + ceph-iscsi (which can be upgraded independently) and would also allow the "tcmu-runner" daemon to be listed as "iscsi" or similar via a second key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is feasible.
@lxbsz are you still working on this change? |
Yeah, will update it tomorrow after my test finishes. |
make check failure fixed by #39713 |
jenkins test make check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
jenkins test make check |
@lxbsz Looks like the new test is taking so long that it's causing the RADOS API test case to timeout and fail. |
For some use cases, like the tcmu-runner, there maybe handreds or thousands of LUNs, and then for each LUN it will register one service daemon, then in the `ceph -s` output will be full of useless info. This will allow to classify the sevices service daemons in one specified format by adding two pairs in metadata: "daemon_type" : "${TYPE}" "daemon_prefix" : "${PREFIX}" TYPE: will be used to replace the default "daemon(s)" showed in `ceph -s`. If absent, the "daemon" will be used. PREFIX: if present the active members will be classified by the prefix instead of "daemon_name". For exmaple for iscsi gateways, it will be something likes: "daemon_type" : "portal" "daemon_prefix" : "gw${N}" Then the `ceph -s` output will be: ... services: mon: 3 daemons, quorum a,b,c (age 50m) mgr: x(active, since 49m) mds: a:1 {0=c=up:active} 2 up:standby osd: 3 osds: 3 up (since 49m), 3 in (since 49m) iscsi: 8 portals active (gw0, gw1, gw2, gw3, gw4, gw5, gw6, gw7) ... Fixes: https://tracker.ceph.com/issues/49057 Signed-off-by: Xiubo Li <xiubli@redhat.com>
Checked it, I didn't find why this PR could cause that failure. |
The failed ones are not related as above. And there has another test, which I only fired 3 jobs and all succeed: |
Hi Jason, Checked the log, all the service related test passed:
|
From your test, there are full of the following errors:
|
@lxbsz Thanks! |
For some use cases, like the tcmu-runner, there maybe handreds or
thousands of LUNs, and then for each LUN it will register one service
daemon, then in the
ceph -s
output will be full of useless info, suchas there are 128 devices in 8 gateways:
This will allow to classify the sevices service daemons in one
specified format by add one key:value pair in metadata:
If the type is "portal" and prefix is "gw${N}", the
ceph -s
outputwill be like:
Fixes: https://tracker.ceph.com/issues/49057
Signed-off-by: Xiubo Li xiubli@redhat.com
The tcmu-runner related change: open-iscsi/tcmu-runner#648
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox