Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "mgr: use un-deprecated APIs to initialize Python interpretor" #56617

Merged
merged 1 commit into from
Apr 2, 2024

Conversation

adk3798
Copy link
Contributor

@adk3798 adk3798 commented Apr 1, 2024

This reverts commit 8dffa87, reversing
changes made to 80374da.

This commit broke the import of the "mgr_module" module
within the python modules in the mgr at least on python 3.6.8
that we currently use in our centos 8 stream based containers

Failures would look like (removing beginning of log lines)

Loading python module 'alerts'
Module not found: 'mgr_module'
Class not found in module 'alerts'
Error loading module 'alerts': (22) Invalid argument


Larger set of mgr logs after bootstrapping a cluster with a main branch CI build container image

Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] Loading python module 'alerts'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.513+0000 7fc482b39200 -1 mgr[py] Module not found: 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] Module not found: 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.513+0000 7fc482b39200 -1 mgr[py] ModuleNotFoundError: No module named 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] ModuleNotFoundError: No module named 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.530+0000 7fc482b39200 -1 mgr[py] Class not found in module 'alerts'
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] Class not found in module 'alerts'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.530+0000 7fc482b39200 -1 mgr[py] Error loading module 'alerts': (22) Invalid argument
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] Error loading module 'alerts': (22) Invalid argument
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] Loading python module 'balancer'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.580+0000 7fc482b39200 -1 mgr[py] Module not found: 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] Module not found: 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.580+0000 7fc482b39200 -1 mgr[py] ModuleNotFoundError: No module named 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] ModuleNotFoundError: No module named 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.590+0000 7fc482b39200 -1 mgr[py] Class not found in module 'balancer'
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] Class not found in module 'balancer'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.590+0000 7fc482b39200 -1 mgr[py] Error loading module 'balancer': (22) Invalid argument
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] Error loading module 'balancer': (22) Invalid argument
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] Loading python module 'cephadm'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.652+0000 7fc482b39200 -1 mgr[py] Module not found: 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] Module not found: 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.652+0000 7fc482b39200 -1 mgr[py] ModuleNotFoundError: No module named 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 
Apr 01 11:05:40 vm-00 ceph-mgr[19125]: mgr[py] ModuleNotFoundError: No module named 'mgr_module'
Apr 01 11:05:40 vm-00 ceph-46e92f9a-f039-11ee-af20-525400946525-mgr-vm-00-bttboc[19121]: 2024-04-01T15:05:40.665+0000 7fc482b39200 -1 mgr[py] Class not found in module 'cephadm'

which appeared in main very recently. Opened this PR to test the revert.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@adk3798
Copy link
Contributor Author

adk3798 commented Apr 1, 2024

just realized I have the wrong branch name. This is meant to test reverting #55436 not PR 55346

@adk3798
Copy link
Contributor Author

adk3798 commented Apr 1, 2024

https://shaman.ceph.com/builds/ceph/wip-adk3-testing-2024-04-01-1147/b934c1ac93a8ebac4d5bca3f12de07b4770054a1/default/389620/

tested with the container image from that build that is main + the reversion this PR does and the issue described in #56617 (comment) no longer occurred

@adk3798 adk3798 changed the title Wip revert 55346 Wip revert 55436 Apr 1, 2024
@adk3798
Copy link
Contributor Author

adk3798 commented Apr 1, 2024

https://shaman.ceph.com/builds/ceph/wip-adk3-testing-2024-04-01-1147/b934c1ac93a8ebac4d5bca3f12de07b4770054a1/default/389620/

tested with the container image from that build that is main + the reversion this PR does and the issue described in #56617 (comment) no longer occurred

@tchaikov FYI

@tchaikov
Copy link
Contributor

tchaikov commented Apr 2, 2024

@adk3798 hi Adam, could you share the link of the failure? so i can learn, for instance, the python version and os version/distro where we have the regression?

@adk3798
Copy link
Contributor Author

adk3798 commented Apr 2, 2024

@adk3798 hi Adam, could you share the link of the failure? so i can learn, for instance, the python version and os version/distro where we have the regression?

this wasn't in a test instance. Just taking the main branch build of cephadm (created locally via https://docs.ceph.com/en/latest/dev/cephadm/developing-cephadm/#compiling-cephadm) and then bootstrapping a cluster using the quay.ceph.io/ceph-ci/ceph:main image. The container image is based on centos 8 stream, and would have python 3.6.8. During bootstrap, the error presents as

RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.ceph.io/ceph-ci/ceph:main -e NODE_NAME=vm-01 -v /var/log/ceph/5b22f080-f05d-11ee-99f0-5254008acfb1:/var/log/ceph:z -v /tmp/ceph-tmp1fx1zvkf:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpkkzhz_z9:/etc/ceph/ceph.conf:z quay.ceph.io/ceph-ci/ceph:main mgr module enable cephadm: Error ENOENT: module 'cephadm' reports that it cannot run on the active manager daemon: No module named 'mgr_module' (pass --force to force enablement)

what I posted before was from the mgr daemon logs after the bootstrap failed.

This reverts commit 8dffa87, reversing
changes made to 80374da.

This commit broke the import of the "mgr_module" module
within the python modules in the mgr at least on python 3.6.8
that we currently use in our centos 8 stream based containers

Failures would look like (removing beginning of log lines)

Loading python module 'alerts'
Module not found: 'mgr_module'
Class not found in module 'alerts'
Error loading module 'alerts': (22) Invalid argument

Signed-off-by: Adam King <adking@redhat.com>
@adk3798 adk3798 changed the title Wip revert 55436 Revert "mgr: use un-deprecated APIs to initialize Python interpretor" Apr 2, 2024
@adk3798 adk3798 marked this pull request as ready for review April 2, 2024 12:54
@adk3798 adk3798 requested a review from a team as a code owner April 2, 2024 12:54
Copy link
Contributor

@phlogistonjohn phlogistonjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I worked with Adam on the initial testing that these changes caused the mgr module load failures yesterday and this pr apprears to corretly remove the problem changes.

@adk3798
Copy link
Contributor Author

adk3798 commented Apr 2, 2024

tracker for the idea of unit tests that would catch this sort of issue https://tracker.ceph.com/issues/65266

@adk3798
Copy link
Contributor Author

adk3798 commented Apr 2, 2024

jenkins test submodules

@adk3798 adk3798 merged commit 57e9375 into ceph:main Apr 2, 2024
14 of 17 checks passed
@tchaikov
Copy link
Contributor

tchaikov commented Apr 3, 2024

@adk3798 @phlogistonjohn do we have teuthology test or any integration tests so we can catch this kind of failures with our CI

@tchaikov
Copy link
Contributor

tchaikov commented Apr 3, 2024

the tracker above fails to mention python 3.6.8. and that's the crux of the problem, i think.

@tchaikov
Copy link
Contributor

tchaikov commented Apr 3, 2024

also, reverting my change does not address the problem the #55436 intended to address -- one day, the tree will fail to build with a newer CPython due to the removed APIs. might want to create a tracker ticket for it as well.

@rkachach
Copy link
Contributor

rkachach commented Apr 3, 2024

@adk3798 @phlogistonjohn do we have teuthology test or any integration tests so we can catch this kind of failures with our CI

We have been discussing this and as I understood running RADOS suite (which includes some cephadm tests) would have detected this issue. Adam opened a ticket to see if adding some basic UT for mgr can also help in this case

@adk3798
Copy link
Contributor Author

adk3798 commented Apr 3, 2024

@adk3798 @phlogistonjohn do we have teuthology test or any integration tests so we can catch this kind of failures with our CI

this pops up in the test_cephadm task at the least. Here's an instance https://pulpito.ceph.com/yuriw-2024-04-01_20:57:46-rados-wip-yuri3-testing-2024-04-01-0837-squid-distro-default-smithi/7634416/. Ignore that the test is running on the squid branch. I just haven't updated that test to bootstrap with the squid image yet. What it actually ran was sudo /usr/sbin/cephadm --image quay.ceph.io/ceph-ci/ceph:main bootstrap ... where it hit the issue

also, reverting my change does not address the problem the #55436 intended to address -- one day, the tree will fail to build with a newer CPython due to the removed APIs. might want to create a tracker ticket for it as well.

I still would like to have the change in as well. We just would need to make sure to test it this time. I felt the need to push the reversion as this was blocking development on the cephadm side and causing CI failures in some other places. The patch conceptually is still okay. If we just make sure to run the test_cephadm task mentioned above on it before merging this time, I think we can still move forward with it.

@tchaikov
Copy link
Contributor

tchaikov commented Apr 3, 2024

@adk3798 hi Adam, thank you for pointing to this link. my apologies. so it was me who misinterpreted https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrellocomcPB1LsGKm1970-wip-yuri8-testing-2024-03-11-1138-old-wip-yuri8-testing-2024-03-06-1329-old-wip-yuri8-testing-2024-03-04-1226-old-wip-yuri8-test . after double checking https://pulpito.ceph.com/yuriw-2024-03-12_18:29:22-rados-wip-yuri8-testing-2024-03-11-1138-distro-default-smithi/7594638/, i realized that the batch actually failed with my change.

I felt the need to push the reversion as this was blocking development on the cephadm side and causing CI failures in some other places.

i fully understand. just wanted to make sure our workflow is sane. i need to revisit the build process or the code, to understand why it failed to work if i have more bandwidth.

@tchaikov
Copy link
Contributor

tchaikov commented Apr 3, 2024

@adk3798 @phlogistonjohn do we have teuthology test or any integration tests so we can catch this kind of failures with our CI

We have been discussing this and as I understood running RADOS suite (which includes some cephadm tests) would have detected this issue. Adam opened a ticket to see if adding some basic UT for mgr can also help in this case

you are referring https://tracker.ceph.com/issues/65266, i think. but to test this very issue, we need python 3.6. that's the difficult part, i guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants