
mgr/cephadm: HA for RGW endpoints #38615

Merged 1 commit into ceph:master on Jan 11, 2021
Conversation

@adk3798 (Contributor) commented Dec 16, 2020

New HA_RGW service provides HA for RGW endpoints using keepalived and haproxy

Signed-off-by: Adam King <adking@redhat.com>

Fixes: https://tracker.ceph.com/issues/45116
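For orientation, a hedged sketch of what an ha-rgw service spec might look like. Only service_type and service_id (quoted later in this conversation) and the two container-image override attributes (visible in a diff below) come from this PR; the placement block is the standard cephadm form, and the virtual-IP and port field names are assumptions for illustration:

service_type: ha-rgw
service_id: haproxy_for_rgw
placement:
  hosts:
    - vm-00
    - vm-01
spec:
  # hypothetical field names and values, for illustration only
  virtual_ip_interface: eth0
  virtual_ip_address: 192.168.1.100/24
  frontend_port: 8080
  # these two override attributes appear in a diff below
  haproxy_container_image: haproxy:2.3
  keepalived_container_image: keepalived:2.1

A spec like this would be applied through the usual cephadm flow, e.g. ceph orch apply -i ha-rgw.yaml.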


@batrick (Member) commented Dec 16, 2020

jenkins test make check

@jmolmo (Member) left a comment

LGTM.

  • Next task: a teuthology test to cover the ha-rgw service

@jmolmo (Member) commented Dec 18, 2020

A new round of integration tests passed:
https://pulpito.ceph.com/jolmomar-2020-12-17_23:04:59-rados:cephadm-wip-jmolmo-testing-2020-12-17-2237-distro-basic-smithi/
4 errors: 3 are the well-known old ones (the docker pull limit and the mutually exclusive container options), and a new one that does not seem to be related to this change.

The new error is probably related to the modifications introduced yesterday to avoid using passwords in dashboard commands. Investigation ongoing:
TypeError: ac_user_create_cmd() got an unexpected keyword argument 'inbuf'
https://pulpito.ceph.com/jolmomar-2020-12-17_23:04:59-rados:cephadm-wip-jmolmo-testing-2020-12-17-2237-distro-basic-smithi/5715758/

Error "CapAdd and privileged are mutually exclusive options"
https://pulpito.ceph.com/jolmomar-2020-12-17_23:04:59-rados:cephadm-wip-jmolmo-testing-2020-12-17-2237-distro-basic-smithi/5715703/
https://pulpito.ceph.com/jolmomar-2020-12-17_23:04:59-rados:cephadm-wip-jmolmo-testing-2020-12-17-2237-distro-basic-smithi/5715742/

See: https://tracker.ceph.com/issues/48142

Error: Docker.io "reached your pull rate limit"
https://pulpito.ceph.com/jolmomar-2020-12-17_23:04:59-rados:cephadm-wip-jmolmo-testing-2020-12-17-2237-distro-basic-smithi/5715699/

See: https://tracker.ceph.com/issues/48157

@jmolmo removed the wip-jmolmo-testing (CI testing) label on Dec 22, 2020
@sebastian-philipp (Contributor) left a comment
lgtm, nice work, just some minor nits.

doc/mgr/orchestrator.rst (review thread outdated, resolved)
Comment on lines +222 to +231
def populate_files(config_dir, config_files, uid, gid):
    # type: (str, Dict, int, int) -> None
    """create config files for different services"""
    for fname in config_files:
        config_file = os.path.join(config_dir, fname)
        config_content = dict_get_join(config_files, fname)
        logger.info('Write file: %s' % (config_file))
        with open(config_file, 'w') as f:
            os.fchown(f.fileno(), uid, gid)
            os.fchmod(f.fileno(), 0o600)
            f.write(config_content)
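For context, a hedged usage sketch of the hunk above (the path and file contents are hypothetical; dict_get_join and the uid/gid lookup come from the surrounding cephadm code):

# hypothetical invocation: write rendered config files owned by the daemon's user
config_files = {
    'haproxy.cfg': '# rendered haproxy configuration ...\n',
    'keepalived.conf': '# rendered keepalived configuration ...\n',
}
populate_files('/var/lib/ceph/<fsid>/haproxy.rgw/configs', config_files, uid, gid)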
@sebastian-philipp (Contributor):

Can you please make this refactoring a separate commit? I really want to avoid merging a single HUGE commit.

@jmolmo (Member):

@sebastian-philipp: We are in a hurry with this feature now. We have been discussing it (across different PRs) for almost two months. In the first revision you raised several important points, which I think have been addressed.
This refactoring was one of the first things you suggested to us (not as a separate PR). It would have been easy to move it to another PR before Christmas, but at this point that move could cause further delays and leave this nice feature out of our release.
We have tested this several times manually and it passed the integration tests, so I have to ask that we make only the minimum changes needed.

src/cephadm/cephadm (review thread outdated, resolved)
src/pybind/mgr/cephadm/module.py (review thread outdated, resolved)
Comment on lines +1898 to +2067
if daemon_spec.daemon_type == 'haproxy':
    haspec = cast(HA_RGWSpec, daemon_spec.spec)
    if haspec.haproxy_container_image:
        image = haspec.haproxy_container_image

if daemon_spec.daemon_type == 'keepalived':
    haspec = cast(HA_RGWSpec, daemon_spec.spec)
    if haspec.keepalived_container_image:
        image = haspec.keepalived_container_image
@sebastian-philipp (Contributor):

Can we please avoid adding those `if daemon_type == foobar` hacks here? I really invested a lot of time in cleaning this up; having those hacks creep back in is super painful.

@adk3798 (Contributor, Author):

I agree this is a problem, but with haproxy and keepalived being the 4th and 5th daemon types (after container, osd, and cephadm-exporter) to need something like this (special daemon-specific operations during _create_daemon), it feels like there should be some generic way to support it without cluttering up the function. I would personally rather that refactor be its own PR than be included here, with this one already being so large.
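To illustrate the kind of generic mechanism being suggested, a hypothetical sketch (not code from this PR; only the two attribute names are taken from the diff above):

# table-driven lookup that would replace per-daemon-type branches
IMAGE_OVERRIDE_ATTRS = {
    'haproxy': 'haproxy_container_image',
    'keepalived': 'keepalived_container_image',
}

def resolve_container_image(daemon_spec, default_image):
    attr = IMAGE_OVERRIDE_ATTRS.get(daemon_spec.daemon_type)
    override = getattr(daemon_spec.spec, attr, None) if attr else None
    # fall back to the default image unless the spec carries an override
    return override or default_image

New daemon types would then register an entry in the table instead of adding another branch to _create_daemon.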

Comment on lines +538 to +544
daemon_id = self.mgr.get_unique_name(daemon_type, host, daemons,
                                     prefix=spec.service_id,
                                     forcename=name)
@sebastian-philipp (Contributor):

Interesting: service_id starts with rgw-ha, so what does daemon_id look like here?

@adk3798 (Contributor, Author):

The log statement a few lines later, which prints daemon_type.daemon_id, came out like:

Placing haproxy.haproxy_for_rgw.vm-00.yaljjn on host vm-00 for haproxy
Placing keepalived.haproxy_for_rgw.vm-00.ymborn on host vm-00 for keepalived

coming from a spec file with:

service_type: ha-rgw
service_id: haproxy_for_rgw

It doesn't seem like ha-rgw appears in the id anywhere unless you explicitly include it as part of your service_id in the spec.
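A simplified sketch of how those names appear to compose, inferred from the log output above (the suffix generation is hypothetical; the real get_unique_name also guards against name collisions):

import random
import string

def sketch_daemon_name(daemon_type: str, prefix: str, host: str) -> str:
    # daemon_id looks like <service_id>.<host>.<random suffix>;
    # the full daemon name prepends the daemon type
    suffix = ''.join(random.choices(string.ascii_lowercase, k=6))
    daemon_id = f'{prefix}.{host}.{suffix}'
    return f'{daemon_type}.{daemon_id}'

print(sketch_daemon_name('haproxy', 'haproxy_for_rgw', 'vm-00'))
# e.g. haproxy.haproxy_for_rgw.vm-00.yaljjn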

Comment on lines +695 to +715
# ha-rgw needs the definitive host list to create keepalived config files.
# if the definitive host list has changed, all ha-rgw daemons must get new
# config, including those that are already on the correct host and not
# going to be deployed
def update_ha_rgw_definitive_hosts(self, spec: ServiceSpec, hosts: List[HostPlacementSpec],
                                   add_hosts: Set[HostPlacementSpec]) -> HA_RGWSpec:
    spec = cast(HA_RGWSpec, spec)
    if not (set(hosts) == set(spec.definitive_host_list)):
        spec.definitive_host_list = hosts
        ha_rgw_daemons = self.mgr.cache.get_daemons_by_service(spec.service_name())
        for daemon in ha_rgw_daemons:
            if daemon.hostname in [h.hostname for h in hosts] and daemon.hostname not in add_hosts:
                self.mgr.cache.schedule_daemon_action(
                    daemon.hostname, daemon.name(), 'reconfig')
    return spec
@sebastian-philipp (Contributor):

I thought this is why we added

self.requires_post_actions: Set[str] = set()

?

@adk3798 (Contributor, Author):

The definitive hosts are used in creating the keepalived config file (each keepalived needs to know where the others are being placed) that the daemon uses on deployment. If this were done with post actions, every single keepalived daemon would need to be reconfigured right after its initial deployment just to get a config file with the locations of the other daemons. Doing it the way it is here lets new keepalived daemons get this info before they are created, so already-existing daemons only need a reconfig when the set of hosts ha-rgw runs on changes. It would still be functional using post actions, but every new keepalived daemon would effectively be deployed twice. This is an odd case: ideally we would like to set this attribute in the small window between when placement is done and when the daemons are actually created.
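To see why each keepalived daemon needs the full host list before deployment, a hedged sketch of the kind of config rendering involved (the helper and its parameters are hypothetical; the vrrp_instance/unicast_peer structure is standard keepalived syntax, and the addresses are made up):

from typing import List

def render_keepalived_conf(my_ip: str, peer_ips: List[str],
                           virtual_ip: str, interface: str) -> str:
    # each instance must enumerate its VRRP peers, so one daemon's
    # config depends on where every other daemon is placed
    peers = '\n        '.join(ip for ip in peer_ips if ip != my_ip)
    return f"""vrrp_instance RGW_HA {{
    interface {interface}
    unicast_src_ip {my_ip}
    unicast_peer {{
        {peers}
    }}
    virtual_ipaddress {{
        {virtual_ip}
    }}
}}"""

Because the peer list spans all placed hosts, it only becomes known once placement is complete, which is exactly the double-deploy problem with post actions described above.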

src/pybind/mgr/cephadm/serve.py (review thread outdated, resolved)
src/pybind/mgr/orchestrator/module.py (review thread outdated, resolved)
@sebastian-philipp (Contributor) commented:

Or, should we just merge it as it is?

@sebastian-philipp added the wip-swagner-testing (My Teuthology tests) label on Jan 7, 2021
github-actions bot commented Jan 7, 2021

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions bot commented Jan 8, 2021

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Cephadm deploys keepalived and HAproxy to provide high availability for RGW endpoints

Fixes: https://tracker.ceph.com/issues/45116

Signed-off-by: Daniel-Pivonka <dpivonka@redhat.com>
Signed-off-by: Adam King <adking@redhat.com>
Signed-off-by: Juan Miguel Olmo Martínez <jolmomar@redhat.com>
@sebastian-philipp (Contributor) commented:

thank you @jmolmo

@sebastian-philipp merged commit 8d80817 into ceph:master on Jan 11, 2021
@adk3798 mentioned this pull request on Jan 28, 2021