
mgr/cephadm: fix rgw spec migration with simple specs #52301

Merged
merged 1 commit into ceph:main from the cephadm-fix-rgw-migration branch on Jul 7, 2023

Conversation

@adk3798 (Contributor) commented Jul 3, 2023

The rgw spec migration code, intended to formalize the rgw_frontend_type spec option, doesn't work with simple specs such as

service_type: rgw
service_id: rgw.1
service_name: rgw.rgw.1
placement:
  label: rgw

because the migration code assumes there will always be a "spec" section inside the spec. This is the case for more involved rgw specs such as

service_type: rgw
service_id: foo
placement:
  label: rgw
  count_per_host: 2
spec:
  rgw_realm: myrealm
  rgw_zone: myzone
  rgw_frontend_type: "beast"
  rgw_frontend_port: 5000

which is what the migration is actually concerned about (verification of the rgw_frontend_type in these specs).
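
To make the failure mode concrete, here is a minimal, hypothetical sketch (plain Python dicts, not the actual mgr/cephadm migration code) of why unconditionally indexing the nested "spec" section raises a KeyError for the simple spec:

# Hypothetical illustration only: the two spec shapes as plain dicts.
simple_spec = {
    'service_type': 'rgw',
    'service_id': 'rgw.1',
    'service_name': 'rgw.rgw.1',
    'placement': {'label': 'rgw'},
}
full_spec = {
    'service_type': 'rgw',
    'service_id': 'foo',
    'placement': {'label': 'rgw', 'count_per_host': 2},
    'spec': {
        'rgw_realm': 'myrealm',
        'rgw_zone': 'myzone',
        'rgw_frontend_type': 'beast',
        'rgw_frontend_port': 5000,
    },
}

print(full_spec['spec']['rgw_frontend_type'])   # 'beast'
try:
    print(simple_spec['spec']['rgw_frontend_type'])
except KeyError as e:
    # The simple spec has no nested "spec" section at all.
    print(f'unhandled in the buggy migration: KeyError {e}')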

In the case where the spec is simpler, we should just leave the spec alone and move on. Unfortunately, the current code assumes the field will always be there and hits an unhandled KeyError when trying to migrate the simpler specs. This causes the cephadm module to crash shortly after starting an upgrade to a version that includes this migration, and the root cause is very difficult to find. It can be worked around by adding fields to the rgw spec before the upgrade so that the "spec" field exists and the migration works as intended.

This commit fixes the migration in the simple case and adds tests for that case to both the unit tests and the orch/cephadm teuthology upgrade tests.

Fixes: https://tracker.ceph.com/issues/61889
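
The fix described above amounts to bailing out early when there is no nested "spec" section. A minimal sketch of that behavior, using an illustrative migrate_rgw_spec helper on plain dicts (the function name, signature, and the quoting cleanup are assumptions for illustration, not the actual mgr/cephadm code):

def migrate_rgw_spec(spec: dict) -> dict:
    """Normalize rgw_frontend_type inside an rgw spec, if present.

    Illustrative only: simple specs without a nested 'spec' section are
    returned unchanged instead of raising KeyError.
    """
    section = spec.get('spec')
    if not section:
        # Simple spec (only service_type/service_id/placement):
        # nothing to verify, leave it alone and move on.
        return spec
    frontend_type = section.get('rgw_frontend_type')
    if isinstance(frontend_type, str):
        # Hypothetical cleanup of the value, e.g. stray whitespace/quotes.
        section['rgw_frontend_type'] = frontend_type.strip().strip('"').strip("'")
    return spec

# A unit-test style check for the simple-spec case (hypothetical):
simple = {'service_type': 'rgw', 'service_id': 'rgw.1',
          'placement': {'label': 'rgw'}}
assert migrate_rgw_spec(dict(simple)) == simple

The assert mirrors the kind of simple-spec coverage the commit says it adds to the unit tests.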

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@rkachach (Contributor) left a comment

LGTM

@adk3798 (Contributor, Author) commented Jul 4, 2023

https://pulpito.ceph.com/adking-2023-07-03_21:57:12-orch:cephadm-wip-adk4-testing-2023-07-03-1544-distro-default-smithi/

1 failure, 3 dead jobs

  • The 3 dead jobs hit the infra/teuthology "Error reimaging machines" issue that has been frequent recently. I didn't feel a rerun of these jobs was necessary here.
  • The 1 failed job is a known issue with the jaeger-tracing test: https://tracker.ceph.com/issues/59704

Nothing to block merging

@adk3798 (Contributor, Author) commented Jul 4, 2023

jenkins retest this please

@adk3798 (Contributor, Author) commented Jul 5, 2023

jenkins test api

1 similar comment
@adk3798 (Contributor, Author) commented Jul 5, 2023

jenkins test api

@phlogistonjohn (Contributor) left a comment

LGTM

@adk3798 (Contributor, Author) commented Jul 5, 2023

From the api test failure:

2023-07-05 14:32:12,809.809 INFO:__main__:Starting test: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
That module-handled commands have appropriate  behavior on
2023-07-05T14:32:37.588+0000 7f351ef17640 -1 WARNING: all dangerous and experimental features are enabled.
2023-07-05T14:32:37.628+0000 7f351ef17640 -1 WARNING: all dangerous and experimental features are enabled.
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'selftest' is not enabled/loaded (required by command 'mgr self-test run'): use `ceph mgr module enable selftest` to enable it
2023-07-05T14:32:42.984+0000 7f81e970b640 -1 WARNING: all dangerous and experimental features are enabled.
2023-07-05T14:32:43.032+0000 7f81e970b640 -1 WARNING: all dangerous and experimental features are enabled.
no valid command found; 10 closest matches:
osd perf
osd df [<output_method:plain|tree>] [<filter_by:class|name>] [<filter>]
osd blocked-by
osd pool stats [<pool_name>]
osd pool scrub <who>...
osd pool deep-scrub <who>...
osd pool repair <who>...
osd pool force-recovery <who>...
osd pool force-backfill <who>...
osd pool cancel-force-recovery <who>...
Error EINVAL: invalid command
2023-07-05T14:32:53.384+0000 7fa8618da640 -1 WARNING: all dangerous and experimental features are enabled.
2023-07-05T14:32:53.416+0000 7fa8618da640 -1 WARNING: all dangerous and experimental features are enabled.
Error EIO: Module 'selftest' has experienced an error and cannot handle commands: Synthetic exception in serve
2023-07-05 14:32:53,684.684 INFO:tasks.ceph_test_case:waiting 30s for health warning matching Module 'selftest' has failed: Synthetic exception in serve
2023-07-05 14:33:36,855.855 INFO:__main__:Stopped test: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
That module-handled commands have appropriate  behavior on in 84.045979s
2023-07-05 14:33:36,855.855 INFO:__main__:test_diskprediction_local (tasks.mgr.test_module_selftest.TestModuleSelftest) ... ok
2023-07-05 14:33:36,855.855 INFO:__main__:test_influx (tasks.mgr.test_module_selftest.TestModuleSelftest) ... ok
2023-07-05 14:33:36,855.855 INFO:__main__:test_iostat (tasks.mgr.test_module_selftest.TestModuleSelftest) ... ok
2023-07-05 14:33:36,855.855 INFO:__main__:test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
2023-07-05 14:33:36,855.855 INFO:__main__:That module-handled commands have appropriate  behavior on ... ERROR
2023-07-05 14:33:36,855.855 INFO:__main__:
2023-07-05 14:33:36,856.856 INFO:__main__:
2023-07-05 14:33:36,856.856 INFO:__main__:======================================================================
2023-07-05 14:33:36,856.856 INFO:__main__:ERROR: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
2023-07-05 14:33:36,856.856 INFO:__main__:That module-handled commands have appropriate  behavior on
2023-07-05 14:33:36,856.856 INFO:__main__:----------------------------------------------------------------------
2023-07-05 14:33:36,856.856 INFO:__main__:Traceback (most recent call last):
2023-07-05 14:33:36,856.856 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/test_module_selftest.py", line 204, in test_module_commands
2023-07-05 14:33:36,856.856 INFO:__main__:    self.wait_for_health_clear(timeout=30)
2023-07-05 14:33:36,856.856 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 194, in wait_for_health_clear
2023-07-05 14:33:36,856.856 INFO:__main__:    self.wait_until_true(is_clear, timeout)
2023-07-05 14:33:36,856.856 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 231, in wait_until_true
2023-07-05 14:33:36,856.856 INFO:__main__:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2023-07-05 14:33:36,856.856 INFO:__main__:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
2023-07-05 14:33:36,856.856 INFO:__main__:
Cannot find device "ceph-brx"
2023-07-05 14:33:36,876.876 INFO:__main__:
2023-07-05 14:33:36,877.877 INFO:__main__:----------------------------------------------------------------------
2023-07-05 14:33:36,877.877 INFO:__main__:Ran 297 tests in 3274.992s
2023-07-05 14:33:36,877.877 INFO:__main__:
2023-07-05 14:33:36,877.877 INFO:__main__:FAILED (errors=1)
2023-07-05 14:33:36,877.877 INFO:__main__:
2023-07-05 14:33:36,877.877 INFO:__main__:
2023-07-05 14:33:36,877.877 INFO:__main__:======================================================================
2023-07-05 14:33:36,877.877 INFO:__main__:ERROR: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
2023-07-05 14:33:36,877.877 INFO:__main__:That module-handled commands have appropriate  behavior on
2023-07-05 14:33:36,877.877 INFO:__main__:----------------------------------------------------------------------
2023-07-05 14:33:36,877.877 INFO:__main__:Traceback (most recent call last):
2023-07-05 14:33:36,878.878 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/test_module_selftest.py", line 204, in test_module_commands
2023-07-05 14:33:36,878.878 INFO:__main__:    self.wait_for_health_clear(timeout=30)
2023-07-05 14:33:36,878.878 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 194, in wait_for_health_clear
2023-07-05 14:33:36,878.878 INFO:__main__:    self.wait_until_true(is_clear, timeout)
2023-07-05 14:33:36,878.878 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 231, in wait_until_true
2023-07-05 14:33:36,878.878 INFO:__main__:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2023-07-05 14:33:36,878.878 INFO:__main__:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
2023-07-05 14:33:36,878.878 INFO:__main__:
Using guessed paths /home/jenkins-build/build/workspace/ceph-api/build/lib/ ['/home/jenkins-build/build/workspace/ceph-api/qa', '/home/jenkins-build/build/workspace/ceph-api/build/lib/cython_modules/lib.3', '/home/jenkins-build/build/workspace/ceph-api/src/pybind']
2023-07-05 14:33:36,879.879 INFO:__main__:
2023-07-05 14:33:36,879.879 INFO:__main__:

doesn't look related to this PR imo

@adk3798 (Contributor, Author) commented Jul 5, 2023

jenkins test api

@adk3798 force-pushed the cephadm-fix-rgw-migration branch from 42e4a03 to 1860ef8 on July 5, 2023 15:01
@adk3798 (Contributor, Author) commented Jul 5, 2023

rebased to see if that would help the failures somehow

@adk3798 (Contributor, Author) commented Jul 5, 2023

jenkins test api

1 similar comment
@adk3798 (Contributor, Author) commented Jul 6, 2023

jenkins test api

@adk3798 (Contributor, Author) commented Jul 7, 2023

jenkins test api

@adk3798 merged commit b985fad into ceph:main on Jul 7, 2023
9 of 12 checks passed