mgr/dashboard: cephadm e2e job: display info on error & other improvements #44384

alfonsomthd · 2021-12-22T09:17:32Z

Fix: ensure that on_error trap is called (display more info on error).
Set static IPs to VMs.
Remove domain in cluster definition to avoid side effects of potential dns misconfiguration.
Minor improvements.

Fixes: https://tracker.ceph.com/issues/53991

Signed-off-by: Alfonso Martínez <almartin@redhat.com>

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox

github-actions · 2021-12-30T07:11:38Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

nizamial09

@alfonsomthd Can we also enable the debug logs? I think it'd help to debug the cephadm issues really well.

alfonsomthd · 2022-01-24T10:10:17Z

@alfonsomthd Can we also enable the debug logs? I think it'd help to debug the cephadm issues really well.

Is it really needed? There is a warning saying: The debug messages are very verbose!
So maybe we will clutter the job output...

nizamial09 · 2022-01-24T10:42:21Z

So maybe we will clutter the job output

Yes, it'll clutter the job output. But for figuring out issues like the failure that's currently happening in the cephadm job (rgw accesskey related), I thought it'd be helpful. Anyway, I did a little digging in the available logs to find out what's going wrong with the rgw deployment, and this was all I could find:

Jan 21 15:19:21 ceph-node-00 ceph-mgr[11741]: [cephadm INFO cephadm.serve] Checking dashboard <-> RGW credentials
Jan 21 15:19:21 ceph-node-00 ceph-mgr[11741]: log_channel(cephadm) log [INF] : Checking dashboard <-> RGW credentials
Jan 21 15:19:21 ceph-node-00 ceph-mgr[11741]: [dashboard INFO rgw_client] Configuring dashboard RGW credentials
Jan 21 15:19:21 ceph-node-00 ceph-mgr[11741]: [dashboard INFO orchestrator] is orchestrator available: True, 
Jan 21 15:19:21 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.015s] [admin] [385.0B] /api/service
Jan 21 15:19:22 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v74: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:23 ceph-node-00 ceph-mgr[11741]: [progress INFO root] Writing back 2 completed events
Jan 21 15:19:23 ceph-node-00 ceph-mgr[11741]: [progress INFO root] Processing OSDMap change 5..5
Jan 21 15:19:24 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v75: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:26 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v76: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:26 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.005s] [admin] [22.0B] /api/prometheus/notifications
Jan 21 15:19:26 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.011s] [admin] [877.0B] /api/summary
Jan 21 15:19:26 ceph-node-00 ceph-mgr[11741]: [dashboard INFO orchestrator] is orchestrator available: True, 
Jan 21 15:19:26 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.011s] [admin] [385.0B] /api/service
Jan 21 15:19:28 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v77: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:28 ceph-node-00 ceph-mgr[11741]: [progress INFO root] Processing OSDMap change 5..5
Jan 21 15:19:30 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v78: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.007s] [admin] [22.0B] /api/prometheus/notifications
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.012s] [admin] [877.0B] /api/summary
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard ERROR root] Timeout (10s) executing radosgw-admin ['radosgw-admin', '-c', '/etc/ceph/ceph.conf', '-k', '/var/lib/ceph/mgr/ceph-ceph-node-00.njzmln/keyring', '-n', 'mgr.ceph-node-00.njzmln', 'realm', 'list']
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard ERROR rgw_client] Command '['radosgw-admin', '-c', '/etc/ceph/ceph.conf', '-k', '/var/lib/ceph/mgr/ceph-ceph-node-00.njzmln/keyring', '-n', 'mgr.ceph-node-00.njzmln', 'realm', 'list']' timed out after 10 seconds
                                              Traceback (most recent call last):
                                                File "/lib64/python3.6/subprocess.py", line 425, in run
                                                  stdout, stderr = process.communicate(input, timeout=timeout)
                                                File "/lib64/python3.6/subprocess.py", line 863, in communicate
                                                  stdout, stderr = self._communicate(input, endtime, timeout)
                                                File "/lib64/python3.6/subprocess.py", line 1535, in _communicate
                                                  self._check_timeout(endtime, orig_timeout)
                                                File "/lib64/python3.6/subprocess.py", line 891, in _check_timeout
                                                  raise TimeoutExpired(self.args, orig_timeout)
                                              subprocess.TimeoutExpired: Command '['radosgw-admin', '-c', '/etc/ceph/ceph.conf', '-k', '/var/lib/ceph/mgr/ceph-ceph-node-00.njzmln/keyring', '-n', 'mgr.ceph-node-00.njzmln', 'realm', 'list']' timed out after 10 seconds
                                              
                                              During handling of the above exception, another exception occurred:
                                              
                                              Traceback (most recent call last):
                                                File "/usr/share/ceph/mgr/dashboard/services/rgw_client.py", line 243, in configure_rgw_credentials
                                                  _, out, err = mgr.send_rgwadmin_command(['realm', 'list'])
                                                File "/usr/share/ceph/mgr/mgr_module.py", line 2235, in send_rgwadmin_command
                                                  timeout=10,
                                                File "/lib64/python3.6/subprocess.py", line 430, in run
                                                  stderr=stderr)
                                              subprocess.TimeoutExpired: Command '['radosgw-admin', '-c', '/etc/ceph/ceph.conf', '-k', '/var/lib/ceph/mgr/ceph-ceph-node-00.njzmln/keyring', '-n', 'mgr.ceph-node-00.njzmln', 'realm', 'list']' timed out after 10 seconds
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [progress WARNING root] complete: ev a17c58b8-dd16-4393-8694-6c4c78723bd8 does not exist
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [progress WARNING root] complete: ev a092e721-a7cb-47d5-a218-4ef2698477b6 does not exist
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [progress WARNING root] complete: ev 609ee482-a1d1-49d6-8a6b-9e5076d0b978 does not exist
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [progress WARNING root] complete: ev 25dd2cd7-80c6-4450-bd53-d9d8384616ca does not exist
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [cephadm INFO cephadm.serve] Purge service mds.test
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: log_channel(cephadm) log [INF] : Purge service mds.test
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard INFO orchestrator] is orchestrator available: True, 
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.014s] [admin] [336.0B] /api/service
Jan 21 15:19:32 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v79: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:33 ceph-node-00 ceph-mgr[11741]: [progress INFO root] Processing OSDMap change 5..5
Jan 21 15:19:34 ceph-node-00 ceph-mgr[11741]: [cephadm INFO cephadm.serve] Checking dashboard <-> RGW credentials
Jan 21 15:19:34 ceph-node-00 ceph-mgr[11741]: log_channel(cephadm) log [INF] : Checking dashboard <-> RGW credentials
Jan 21 15:19:34 ceph-node-00 ceph-mgr[11741]: [dashboard INFO rgw_client] Configuring dashboard RGW credentials
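
For readers following the traceback above: the `TimeoutExpired` comes from `subprocess.run` being called with `timeout=10`. The snippet below is only a minimal sketch of that behaviour, not the code in `mgr_module.py`. The helper name `run_with_timeout` is hypothetical, and a sleeping child Python process stands in for the hanging `radosgw-admin` call.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_with_timeout(cmd: List[str], timeout: float) -> Tuple[bool, Optional[str]]:
    """Run cmd and return (timed_out, stdout); stdout is None on timeout.

    Hypothetical helper for illustration only.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising,
        # which is the exception seen in the log above.
        return True, None
    return False, result.stdout


# Stand-in for a command that hangs: a child process that just sleeps.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
# Stand-in for a command that answers promptly.
fast_cmd = [sys.executable, "-c", "print('ok')"]

slow_result = run_with_timeout(slow_cmd, timeout=0.5)
fast_result = run_with_timeout(fast_cmd, timeout=30)
```

With a caller shaped like this, a hang shows up as a timed-out result rather than blocking the caller indefinitely, which matches the 10-second gap between the "Configuring dashboard RGW credentials" line and the error in the log.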

nizamial09 · 2022-01-24T10:44:35Z

@adk3798 FYI ^^

alfonsomthd · 2022-01-24T10:57:11Z

@adk3798 FYI ^^

@nizamial09 (Although not related to this very PR) usually when the radosgw-admin command hangs, it is because the cluster is not healthy yet (not enough OSDs up yet, ...)

…ments

- Fix: ensure that on_error trap is called (display more info on error).
- Set static IPs to VMs.
- Remove domain in cluster definition to avoid side effects of potential dns misconfiguration.
- Minor improvements.

Fixes: https://tracker.ceph.com/issues/53991

Signed-off-by: Alfonso Martínez <almartin@redhat.com>

alfonsomthd · 2022-01-26T16:17:33Z

jenkins test api

alfonsomthd · 2022-01-26T16:17:44Z

jenkins test make check

alfonsomthd · 2022-01-26T16:17:53Z

jenkins test dashboard

nizamial09 · 2022-01-27T06:38:19Z

@nizamial09 (Although not related to this very PR) usually when the radosgw-admin command hangs, it is because the cluster is not healthy yet (not enough OSDs up yet, ...)

Thanks @alfonsomthd. I raised a tracker for adapting the tests so that the cluster is healthy when rgw is created: https://tracker.ceph.com/issues/54030. I'll take it up in a follow-up PR.
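
One way to picture "wait until the cluster is healthy before creating rgw" is a simple polling loop. This is only a sketch under assumptions, not what the follow-up PR does: `wait_until` and `cluster_is_healthy` are hypothetical names, and the health probe assumes the `ceph health` CLI is on the PATH and prints a line starting with `HEALTH_OK` when the cluster is healthy.

```python
import subprocess
import time
from typing import Callable


def cluster_is_healthy() -> bool:
    """Assumption: `ceph health` is available and prints HEALTH_OK when healthy."""
    try:
        result = subprocess.run(
            ["ceph", "health"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # CLI missing or not answering: treat as "not healthy yet".
        return False
    return result.returncode == 0 and result.stdout.startswith("HEALTH_OK")


def wait_until(check: Callable[[], bool], timeout: float, interval: float = 1.0) -> bool:
    """Poll check() until it returns True or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test setup step could then call something like `wait_until(cluster_is_healthy, timeout=300, interval=5)` and only create the rgw service once it returns `True`. The timeout and interval values here are placeholders.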

alfonsomthd · 2022-01-31T09:32:51Z

jenkins test dashboard

nizamial09 · 2022-01-31T16:03:39Z

jenkins test dashboard

alfonsomthd added tests dashboard skip-teuthology For PRs whose changes do not have an effect on QA runs/changes are not being tested in Teuthology labels Dec 22, 2021

alfonsomthd requested review from avanthakkar, pereman2, epuertat, Waadkh7, aaSharma14 and nizamial09 December 22, 2021 09:17

alfonsomthd added this to In progress in Dashboard via automation Dec 22, 2021

github-actions bot added the pybind label Dec 22, 2021

alfonsomthd force-pushed the cephadm-e2e-static-ip branch from 578b5b1 to 15cdf0a Compare December 22, 2021 12:05

github-actions bot added the needs-rebase label Dec 30, 2021

alfonsomthd force-pushed the cephadm-e2e-static-ip branch from 15cdf0a to 3d89ed0 Compare January 21, 2022 14:58

github-actions bot removed the needs-rebase label Jan 21, 2022

nizamial09 reviewed Jan 24, 2022

alfonsomthd added the needs-quincy-backport backport required for quincy label Jan 24, 2022

alfonsomthd marked this pull request as ready for review January 24, 2022 12:52

alfonsomthd requested a review from a team as a code owner January 24, 2022 12:52

alfonsomthd changed the title ~~[WIP-Draft] mgr/dashboard: cephadm e2e job: set static ip to VM~~ mgr/dashboard: cephadm e2e job: set static ip to VM Jan 24, 2022

alfonsomthd force-pushed the cephadm-e2e-static-ip branch from 3d89ed0 to 39af61e Compare January 24, 2022 12:59

alfonsomthd changed the title ~~mgr/dashboard: cephadm e2e job: set static ip to VM~~ mgr/dashboard: cephadm e2e job: display info on error & other improvements Jan 25, 2022

pereman2 approved these changes Jan 27, 2022

Dashboard automation moved this from In progress to Reviewer approved Jan 27, 2022

nizamial09 approved these changes Jan 27, 2022

epuertat moved this from Reviewer approved to Ready-to-merge in Dashboard Jan 31, 2022

epuertat merged commit 5f32eb2 into ceph:master Jan 31, 2022

Dashboard automation moved this from Ready-to-merge to Done Jan 31, 2022

epuertat deleted the cephadm-e2e-static-ip branch January 31, 2022 21:33

epuertat removed the needs-quincy-backport backport required for quincy label Jan 31, 2022

nizamial09 added the needs-quincy-backport backport required for quincy label Feb 7, 2022

epuertat mentioned this pull request Feb 7, 2022

quincy: dashboard: 2nd backport batch #44899

Merged

nizamial09 mentioned this pull request Feb 8, 2022

pacific: mgr/dashboard: cephadm e2e job improvements #44938

Merged

epuertat removed the needs-quincy-backport backport required for quincy label Feb 8, 2022
