Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

octopus: mgr: update mon metadata when monmap is updated #39219

Merged
merged 2 commits into from Apr 6, 2021

Conversation

k0ste
Copy link
Contributor

@k0ste k0ste commented Feb 2, 2021

@ideepika
Copy link
Member

ideepika commented Mar 4, 2021

jenkins test api

Copy link
Member

@ideepika ideepika left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@epuertat
Copy link
Member

epuertat commented Mar 4, 2021

Failed API tests:

FAIL: test_inventory_list (tasks.mgr.dashboard.test_orchestrator.OrchestratorControllerTest)

2021-03-04 16:59:50,637.637 INFO:__main__:----------------------------------------------------------------------
2021-03-04 16:59:50,637.637 INFO:__main__:Traceback (most recent call last):
2021-03-04 16:59:50,638.638 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/test_orchestrator.py", line 120, in test_inventory_list
2021-03-04 16:59:50,638.638 INFO:__main__:    self.assertStatus(200)
2021-03-04 16:59:50,638.638 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/helper.py", line 407, in assertStatus
2021-03-04 16:59:50,638.638 INFO:__main__:    self.assertEqual(self._resp.status_code, status)
2021-03-04 16:59:50,638.638 INFO:__main__:AssertionError: 500 != 200
2021-03-04 16:59:50,638.638 INFO:__main__:
2021-03-04 16:59:50,638.638 INFO:__main__:----------------------------------------------------------------------

@epuertat
Copy link
Member

epuertat commented Mar 4, 2021

jenkins test api

@yuriw
Copy link
Contributor

yuriw commented Mar 4, 2021

@epuertat FYI

@epuertat
Copy link
Member

epuertat commented Mar 4, 2021

😓 Different failure:

Collecting teuthology[test]
  Cloning https://github.com/ceph/teuthology to /tmp/pip-install-ybk409o5/teuthology_41923afde31a4a7495602519d342ad6a
  Running command git clone -q https://github.com/ceph/teuthology /tmp/pip-install-ybk409o5/teuthology_41923afde31a4a7495602519d342ad6a
ERROR: Could not find a version that satisfies the requirement apache-libcloud (from teuthology[test])
ERROR: No matching distribution found for apache-libcloud

@epuertat
Copy link
Member

epuertat commented Mar 4, 2021

jenkins test api

2 similar comments
@yuriw
Copy link
Contributor

yuriw commented Mar 4, 2021

jenkins test api

@yuriw
Copy link
Contributor

yuriw commented Mar 4, 2021

jenkins test api

@yuriw
Copy link
Contributor

yuriw commented Mar 4, 2021

jenkins test this please

@epuertat
Copy link
Member

epuertat commented Mar 5, 2021

Mmm, it's failing with the same error as before:

FAIL: test_inventory_list (tasks.mgr.dashboard.test_orchestrator.OrchestratorControllerTest)

2021-03-04 23:41:03,707.707 INFO:__main__:----------------------------------------------------------------------
2021-03-04 23:41:03,707.707 INFO:__main__:Traceback (most recent call last):
2021-03-04 23:41:03,707.707 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/test_orchestrator.py", line 120, in test_inventory_list
2021-03-04 23:41:03,707.707 INFO:__main__:    self.assertStatus(200)
2021-03-04 23:41:03,707.707 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/dashboard/helper.py", line 407, in assertStatus
2021-03-04 23:41:03,708.708 INFO:__main__:    self.assertEqual(self._resp.status_code, status)
2021-03-04 23:41:03,708.708 INFO:__main__:AssertionError: 500 != 200
2021-03-04 23:41:03,708.708 INFO:__main__:
2021-03-04 23:41:03,708.708 INFO:__main__:----------------------------------------------------------------------

In fact it reported 3 times the same issue [1], [2], [3]. This doesn't like like a 'flapping' test but a real change (I don't yet mean bug, but at least the test is behaving in a different way between master and octopus).

Main difference regarding the test itself is that in octopus it uses the test_orchestrator, but not in master/pacific. Do you have any idea, @sebastian-philipp?

That test consumes mgr.get('osd_metadata') mgr API call and the Orchestrator client invetory list call.

Additionally I see in the Jenkins console log the following trace, but it appears in successful runs too:
Error EINVAL: DaemonDescription: __init__() got an unexpected keyword argument 'nodename'

@k0ste k0ste requested review from a team, alfonsomthd and aaSharma14 and removed request for a team March 5, 2021 13:12
@epuertat
Copy link
Member

epuertat commented Mar 5, 2021

Just pushed a commit to increase verbosity in api tests output

@epuertat
Copy link
Member

epuertat commented Mar 5, 2021

jenkins retest this please

@epuertat epuertat force-pushed the wip-49004-octopus branch 4 times, most recently from 646e4f5 to 0cdf54b Compare March 5, 2021 22:05
@epuertat
Copy link
Member

epuertat commented Mar 6, 2021

jenkins test api

@epuertat epuertat force-pushed the wip-49004-octopus branch 2 times, most recently from 37f2823 to 1c71ddb Compare March 6, 2021 17:54
@epuertat epuertat added the DNM label Mar 6, 2021
@epuertat
Copy link
Member

epuertat commented Mar 6, 2021

"Traceback (most recent call last):
  File \"/home/jenkins-build/build/workspace/ceph-api/src/pybind/mgr/dashboard/controllers/orchestrator.py\", line 111, in list
    device_osd_map = get_device_osd_map()
  File \"/home/jenkins-build/build/workspace/ceph-api/src/pybind/mgr/dashboard/controllers/orchestrator.py\", line 48, in get_device_osd_map
    result[hostname][device].append(int(osd_id))
ValueError: invalid literal for int() with base 10: 'a'

@epuertat
Copy link
Member

epuertat commented Mar 6, 2021

The mgr.get('osd_metadata') API method, apart from the expected numeric OSD IDs (0, 1, 2, 3), it now returns 3 additional keys ('a', 'b' and 'c'), I guess corresponding to the 3 mon daemons.

{
   "0":{
      "hostname":"braggi16",
      "": "...",
      "rotational":"0"
   },
   "1":{
      "hostname":"braggi16",
      "": "...",
      "rotational":"0"
   },
   "2":{
      "hostname":"braggi16",
      "": "...",
      "rotational":"0"
   },
   "3":{
      "hostname":"braggi16",
      "": "...",
      "rotational":"0"
   },
   "a":{
      "hostname":"braggi16",
      "addrs":"[v2:172.21.2.16:40564/0,v1:172.21.2.16:40565/0]",
      "": "...",
      "os":"Linux"
   },
   "b":{
      "hostname":"braggi16",
      "addrs":"[v2:172.21.2.16:40566/0,v1:172.21.2.16:40567/0]",

      "os":"Linux"
   },
   "c":{
      "hostname":"braggi16",
      "addrs":"[v2:172.21.2.16:40568/0,v1:172.21.2.16:40569/0]",
      "": "...",
      "os":"Linux"
   }
}

@tchaikov
Copy link
Contributor

tchaikov commented Mar 9, 2021

The mgr.get('osd_metadata') API method, apart from the expected numeric OSD IDs (0, 1, 2, 3), it now returns 3 additional keys ('a', 'b' and 'c'), I guess corresponding to the 3 mon daemons.

should be fixed by #39937

@k0ste
Copy link
Contributor Author

k0ste commented Mar 10, 2021

Kefu, I'll cherry-pick #39937 tomorrow

@yuriw
Copy link
Contributor

yuriw commented Mar 10, 2021

@k0ste add needs-qa when done and I will retest

@k0ste k0ste force-pushed the wip-49004-octopus branch 2 times, most recently from eda3796 to aa7c5ee Compare March 11, 2021 08:25
there is chance that some monitor(s) is updated / upgraded in a single
monmap update without being removed from cluster state's metata first,
so, without this change, we will not update the metadata associated with
that monitor, hence the mgr modules which consumes the metadata is not
updated accordingly and keep reporting the stale information.

in this change, we always update the metadata associated with all monitor
included by the latest monmap. multiple "mon metadata" commands are sent
to monitor for retrieving their updated metadata, instead of sending a
single one, so that we can reuse "MetadataUpdate" to update the metadata
of a given daemon. as the number of monitors in a typical cluster is
relatively small, and the frequency of monmap update is low, so this
overhead should be fine.

unlike other places where we ask mon for metadata in Mgr class, the code
sending the mon command for updated monitor metata is located outside of
`cluster_state.with_monmap()` block, the reason is that `with_monmap()`
is guraded by the monc_lock under the hood, while `start_mon_command()`
also need to acquire the monc_lock, which is not a recursive lock. so we
have to do this out of the `with_monmap()` block.

Fixes: https://tracker.ceph.com/issues/48905
Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit c037f4c)

backport:
  - path: src/mgr/Mgr.cc
    comment: octopus don't declared `fmt`
this change addresses a regression introduced by
c037f4c

also remove the "P" before the json command.

see also: https://tracker.ceph.com/issues/48905

Fixes: https://tracker.ceph.com/issues/49661
Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit 8fc290b)
@k0ste
Copy link
Contributor Author

k0ste commented Mar 26, 2021

needs-qa

@yuriw yuriw merged commit a9655fd into ceph:octopus Apr 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants