mgr/dashboard: cephadm e2e job: display info on error & other improvements #44384

Merged
merged 1 commit into from Jan 31, 2022

Conversation

alfonsomthd
Contributor

@alfonsomthd alfonsomthd commented Dec 22, 2021

  • Fix: ensure that on_error trap is called (display more info on error).
  • Set static IPs to VMs.
  • Remove domain in cluster definition to avoid side effects of potential DNS misconfiguration.
  • Minor improvements.

Fixes: https://tracker.ceph.com/issues/53991

Signed-off-by: Alfonso Martínez <almartin@redhat.com>

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@alfonsomthd alfonsomthd added tests dashboard skip-teuthology For PRs whose changes do not have an effect on QA runs/changes are not being tested in Teuthology labels Dec 22, 2021
@alfonsomthd alfonsomthd added this to In progress in Dashboard via automation Dec 22, 2021
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Member

@nizamial09 nizamial09 left a comment

@alfonsomthd Can we also enable the debug logs? I think it'd help to debug the cephadm issues really well.

@alfonsomthd
Contributor Author

alfonsomthd commented Jan 24, 2022

@alfonsomthd Can we also enable the debug logs? I think it'd help to debug the cephadm issues really well.

Is it really needed? There is a warning saying: The debug messages are very verbose!
So maybe we will clutter the job output...

@nizamial09
Member

So maybe we will clutter the job output

Yes, it'll clutter the job output. But for figuring out issues like the failure that's currently happening in the cephadm job (RGW access key related), I thought it'd be helpful. Anyway, I did a little digging in the available logs to find out what's going wrong with the RGW deployment, and this was all I could find:

Jan 21 15:19:21 ceph-node-00 ceph-mgr[11741]: [cephadm INFO cephadm.serve] Checking dashboard <-> RGW credentials
Jan 21 15:19:21 ceph-node-00 ceph-mgr[11741]: log_channel(cephadm) log [INF] : Checking dashboard <-> RGW credentials
Jan 21 15:19:21 ceph-node-00 ceph-mgr[11741]: [dashboard INFO rgw_client] Configuring dashboard RGW credentials
Jan 21 15:19:21 ceph-node-00 ceph-mgr[11741]: [dashboard INFO orchestrator] is orchestrator available: True, 
Jan 21 15:19:21 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.015s] [admin] [385.0B] /api/service
Jan 21 15:19:22 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v74: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:23 ceph-node-00 ceph-mgr[11741]: [progress INFO root] Writing back 2 completed events
Jan 21 15:19:23 ceph-node-00 ceph-mgr[11741]: [progress INFO root] Processing OSDMap change 5..5
Jan 21 15:19:24 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v75: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:26 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v76: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:26 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.005s] [admin] [22.0B] /api/prometheus/notifications
Jan 21 15:19:26 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.011s] [admin] [877.0B] /api/summary
Jan 21 15:19:26 ceph-node-00 ceph-mgr[11741]: [dashboard INFO orchestrator] is orchestrator available: True, 
Jan 21 15:19:26 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.011s] [admin] [385.0B] /api/service
Jan 21 15:19:28 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v77: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:28 ceph-node-00 ceph-mgr[11741]: [progress INFO root] Processing OSDMap change 5..5
Jan 21 15:19:30 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v78: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.007s] [admin] [22.0B] /api/prometheus/notifications
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.012s] [admin] [877.0B] /api/summary
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard ERROR root] Timeout (10s) executing radosgw-admin ['radosgw-admin', '-c', '/etc/ceph/ceph.conf', '-k', '/var/lib/ceph/mgr/ceph-ceph-node-00.njzmln/keyring', '-n', 'mgr.ceph-node-00.njzmln', 'realm', 'list']
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard ERROR rgw_client] Command '['radosgw-admin', '-c', '/etc/ceph/ceph.conf', '-k', '/var/lib/ceph/mgr/ceph-ceph-node-00.njzmln/keyring', '-n', 'mgr.ceph-node-00.njzmln', 'realm', 'list']' timed out after 10 seconds
                                              Traceback (most recent call last):
                                                File "/lib64/python3.6/subprocess.py", line 425, in run
                                                  stdout, stderr = process.communicate(input, timeout=timeout)
                                                File "/lib64/python3.6/subprocess.py", line 863, in communicate
                                                  stdout, stderr = self._communicate(input, endtime, timeout)
                                                File "/lib64/python3.6/subprocess.py", line 1535, in _communicate
                                                  self._check_timeout(endtime, orig_timeout)
                                                File "/lib64/python3.6/subprocess.py", line 891, in _check_timeout
                                                  raise TimeoutExpired(self.args, orig_timeout)
                                              subprocess.TimeoutExpired: Command '['radosgw-admin', '-c', '/etc/ceph/ceph.conf', '-k', '/var/lib/ceph/mgr/ceph-ceph-node-00.njzmln/keyring', '-n', 'mgr.ceph-node-00.njzmln', 'realm', 'list']' timed out after 10 seconds
                                              
                                              During handling of the above exception, another exception occurred:
                                              
                                              Traceback (most recent call last):
                                                File "/usr/share/ceph/mgr/dashboard/services/rgw_client.py", line 243, in configure_rgw_credentials
                                                  _, out, err = mgr.send_rgwadmin_command(['realm', 'list'])
                                                File "/usr/share/ceph/mgr/mgr_module.py", line 2235, in send_rgwadmin_command
                                                  timeout=10,
                                                File "/lib64/python3.6/subprocess.py", line 430, in run
                                                  stderr=stderr)
                                              subprocess.TimeoutExpired: Command '['radosgw-admin', '-c', '/etc/ceph/ceph.conf', '-k', '/var/lib/ceph/mgr/ceph-ceph-node-00.njzmln/keyring', '-n', 'mgr.ceph-node-00.njzmln', 'realm', 'list']' timed out after 10 seconds
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [progress WARNING root] complete: ev a17c58b8-dd16-4393-8694-6c4c78723bd8 does not exist
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [progress WARNING root] complete: ev a092e721-a7cb-47d5-a218-4ef2698477b6 does not exist
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [progress WARNING root] complete: ev 609ee482-a1d1-49d6-8a6b-9e5076d0b978 does not exist
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [progress WARNING root] complete: ev 25dd2cd7-80c6-4450-bd53-d9d8384616ca does not exist
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [cephadm INFO cephadm.serve] Purge service mds.test
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: log_channel(cephadm) log [INF] : Purge service mds.test
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard INFO orchestrator] is orchestrator available: True, 
Jan 21 15:19:31 ceph-node-00 ceph-mgr[11741]: [dashboard INFO request] [::ffff:192.168.100.1:40492] [GET] [200] [0.014s] [admin] [336.0B] /api/service
Jan 21 15:19:32 ceph-node-00 ceph-mgr[11741]: log_channel(cluster) log [DBG] : pgmap v79: 1 pgs: 1 unknown; 0 B data, 0 B used, 0 B / 0 B avail
Jan 21 15:19:33 ceph-node-00 ceph-mgr[11741]: [progress INFO root] Processing OSDMap change 5..5
Jan 21 15:19:34 ceph-node-00 ceph-mgr[11741]: [cephadm INFO cephadm.serve] Checking dashboard <-> RGW credentials
Jan 21 15:19:34 ceph-node-00 ceph-mgr[11741]: log_channel(cephadm) log [INF] : Checking dashboard <-> RGW credentials
Jan 21 15:19:34 ceph-node-00 ceph-mgr[11741]: [dashboard INFO rgw_client] Configuring dashboard RGW credentials
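
For readers unfamiliar with the traceback above: it is the standard behaviour of Python's `subprocess.run` when the `timeout` argument expires. The following is only a minimal sketch of that mechanism, not the actual `send_rgwadmin_command` code; `run_with_timeout`, `SLOW_CMD` and `FAST_CMD` are hypothetical names, and a sleeping Python child process stands in for the hanging `radosgw-admin realm list` call.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout):
    """Run cmd and return (returncode, stdout, stderr), or None on timeout.

    Hypothetical helper for illustration only; it is not the mgr module code.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before re-raising, so no stray
        # process is left behind. exc.cmd and exc.timeout carry the details
        # that appear in the log message.
        print(f"Timeout ({exc.timeout}s) executing {exc.cmd}")
        return None
    return result.returncode, result.stdout.decode(), result.stderr.decode()


# Stand-ins (assumptions): a child that sleeps longer than the timeout plays
# the role of the hanging command, and a trivial child plays a healthy one.
SLOW_CMD = [sys.executable, "-c", "import time; time.sleep(5)"]
FAST_CMD = [sys.executable, "-c", "print('ok')"]
```

Calling `run_with_timeout(SLOW_CMD, 0.5)` returns `None` after roughly half a second, while `run_with_timeout(FAST_CMD, 30)` returns the exit code and captured output. Catching `TimeoutExpired` at the call site is what lets a caller log one line instead of a chained traceback.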

@nizamial09
Member

@adk3798 FYI ^^

@alfonsomthd
Contributor Author

@adk3798 FYI ^^

@nizamial09 (Although not related to this PR) when a radosgw-admin command hangs, it is usually because the cluster is not healthy yet (not enough OSDs up yet, ...)
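
One generic way to guard against this kind of hang is to poll for cluster health before issuing the command. The sketch below is an assumption about how such a guard could look, not what this PR or the e2e job does; `wait_until` and `health_is_ok` are hypothetical helpers, and the health text is assumed to come from the output of `ceph health`.

```python
import time


def health_is_ok(health_output):
    """Return True if the health text reports HEALTH_OK.

    Assumption: health_output is the text printed by `ceph health`,
    e.g. "HEALTH_OK" or "HEALTH_WARN ...".
    """
    return health_output.strip().startswith("HEALTH_OK")


def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns True or timeout elapses.

    Returns True if check() succeeded, False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In a test setup, `check` would be a small function that fetches the current health text and passes it to `health_is_ok`; only once `wait_until` returns `True` would the RGW-related step run. Keeping the health parsing separate from the polling loop makes both pieces easy to test without a running cluster.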

@alfonsomthd alfonsomthd added the needs-quincy-backport backport required for quincy label Jan 24, 2022
@alfonsomthd alfonsomthd marked this pull request as ready for review January 24, 2022 12:52
@alfonsomthd alfonsomthd requested a review from a team as a code owner January 24, 2022 12:52
@alfonsomthd alfonsomthd changed the title [WIP-Draft] mgr/dashboard: cephadm e2e job: set static ip to VM mgr/dashboard: cephadm e2e job: set static ip to VM Jan 24, 2022
…ments

- Fix: ensure that on_error trap is called (display more info on error).
- Set static IPs to VMs.
- Remove domain in cluster definition to avoid side effects of potential DNS misconfiguration.
- Minor improvements.

Fixes: https://tracker.ceph.com/issues/53991

Signed-off-by: Alfonso Martínez <almartin@redhat.com>
@alfonsomthd alfonsomthd changed the title mgr/dashboard: cephadm e2e job: set static ip to VM mgr/dashboard: cephadm e2e job: display info on error & other improvements Jan 25, 2022
@alfonsomthd
Contributor Author

jenkins test api

@alfonsomthd
Contributor Author

jenkins test make check

@alfonsomthd
Contributor Author

jenkins test dashboard

@nizamial09
Member

@nizamial09 (Although not related to this PR) when a radosgw-admin command hangs, it is usually because the cluster is not healthy yet (not enough OSDs up yet, ...)

Thanks @alfonsomthd. I raised a tracker for adapting the tests so that the cluster is healthy by the time RGW is created: https://tracker.ceph.com/issues/54030. I'll take it up in a follow-up PR.

Dashboard automation moved this from In progress to Reviewer approved Jan 27, 2022
@alfonsomthd
Contributor Author

jenkins test dashboard

@epuertat epuertat moved this from Reviewer approved to Ready-to-merge in Dashboard Jan 31, 2022
@nizamial09
Member

jenkins test dashboard

@epuertat epuertat merged commit 5f32eb2 into ceph:master Jan 31, 2022
Dashboard automation moved this from Ready-to-merge to Done Jan 31, 2022
@epuertat epuertat deleted the cephadm-e2e-static-ip branch January 31, 2022 21:33
@epuertat epuertat removed the needs-quincy-backport backport required for quincy label Jan 31, 2022
@nizamial09 nizamial09 added the needs-quincy-backport backport required for quincy label Feb 7, 2022
@epuertat epuertat removed the needs-quincy-backport backport required for quincy label Feb 8, 2022
Labels
dashboard pybind skip-teuthology For PRs whose changes do not have an effect on QA runs/changes are not being tested in Teuthology tests
Projects
Archived in project
Dashboard
  
Done
4 participants