
Ensuring restarts are cluster friendly #126

Closed
gingerwizard opened this issue Jul 22, 2016 · 6 comments

Comments

@gingerwizard

Many actions in the role require a node restart. Currently we don't issue a synced flush or disable shard allocation before restarting nodes, per the recommendations here.

Installation of plugins could therefore cause significant cluster disruption and potentially very long recoveries. This ticket proposes:

  1. Issuing a synced flush on all indices before any restart.
  2. Disabling allocation on the cluster before restarting a node.
  3. If (2), we could also wait for the cluster to be green before exiting. Alternatively, we could only operate on a green cluster.

Or possibly a combination of the above.
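
For illustration, a minimal sketch of what those steps could look like as Ansible tasks against the Elasticsearch HTTP API. The host/port, task names and retry values are assumptions, not anything the role currently does:

---
# Sketch only -- not part of the role. Assumes a node reachable on localhost:9200.
- name: Disable shard allocation for the cluster
  uri:
    url: "http://localhost:9200/_cluster/settings"
    method: PUT
    body_format: json
    body: { "transient": { "cluster.routing.allocation.enable": "none" } }

- name: Perform a synced flush on all indices
  uri:
    url: "http://localhost:9200/_flush/synced"
    method: POST
    # 409 means some shards could not be sync-flushed; not fatal before a restart
    status_code: [200, 409]

# ... restart the node here ...

- name: Re-enable shard allocation
  uri:
    url: "http://localhost:9200/_cluster/settings"
    method: PUT
    body_format: json
    body: { "transient": { "cluster.routing.allocation.enable": "all" } }

- name: Wait for the cluster to recover to green
  uri:
    url: "http://localhost:9200/_cluster/health"
    return_content: yes
  register: cluster_health
  until: cluster_health.json.status == "green"
  retries: 30
  delay: 10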

Whilst ideally we would minimise cluster disruption, I see several challenges with the above:

  1. We can't guarantee that we are only operating on one node at a time (i.e. 1 fork), which the above process requires. Maybe we could only perform the above if a single fork is used, and otherwise just skip it.
  2. Waiting for a green state is likely to result in long pauses, e.g. if the customer is indexing or has half-applied roles. Users may therefore cancel. This shouldn't matter if it's definitely idempotent.
  3. We begin to blur the lines between configuration management and cluster management by adding the above.
  4. The user may not want a synced flush operation to occur or allocation to be disabled - it makes assumptions as to what they are doing with the cluster. If we provide this, it should probably be possible to disable it.

@tylerjl @elasticdog @dliappis @jakommo @barryib

A relevant feature may be http://docs.ansible.com/ansible/playbooks_delegation.html

@robin13
Contributor

robin13 commented Jul 22, 2016

I strongly agree with "We begin to blur the lines between configuration management and cluster management by adding the above." and am not sure this level of automation and "cleverness" in a configuration management script is a good idea - it may cause more confusion and errors than it prevents.

I would tend not to implement this functionality, on the grounds that use cases are too different to make this a standard feature.

If implemented at all, it should be an optional feature (disabled by default) which can be enabled with a variable.
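
For what it's worth, a sketch of how such a guard could look; the variable and file names below are purely illustrative, not existing parts of the role:

---
# defaults/main.yml -- hypothetical flag, disabled by default
es_cluster_friendly_restart: false

---
# tasks/main.yml -- only run the extra restart handling when explicitly enabled
- include: cluster-friendly-restart.yml
  when: es_cluster_friendly_restart | bool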

@tylerjl

tylerjl commented Jul 22, 2016

I can't speak for what's typical or atypical within Ansible modules, but offering input from the Puppet side, I lean toward @robin13's perspective in that this treads into "cluster management" territory, and there's essentially unbounded potential for functionality here. There was actually user input in the Puppet module to change the restart_on_change default to stop restarting the service when plugins/configs/package versions changed, for more predictable behavior (obviously the setting can be user-controlled, but the concern was unexpected cluster restarts when, for example, a slightly altered config file triggered cluster-wide node restarts).

I do think there are some cases where managing indices/stateful data within Elasticsearch is just too useful not to handle via config management, like managing index templates. It talks to the API and reads/writes data in a document store, but it's just way too useful a feature for users to ignore.
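
For the Ansible side of this repo, a minimal sketch of that kind of API-driven management, using the uri module to push an index template; the template name, host and file path are purely illustrative:

---
# Illustrative only: installing an index template through the API from Ansible
- name: Install a logstash index template
  uri:
    url: "http://localhost:9200/_template/logstash"
    method: PUT
    body_format: json
    body: "{{ lookup('file', 'files/logstash-template.json') }}"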

It's probably worth granting greater weight to user preference in either case; I don't have as close of a pulse on what's needed as end-users do (for example, I went into the puppet module thinking adding additional supported OSes was high-priority, but getting Shield supported dwarfed all other requests from users).

@barryib
Contributor

barryib commented Jul 22, 2016

I don't really understand what you mean by "cluster management", but what I do understand is that the feature we are talking about is more like orchestration for a rolling upgrade. And in the Ansible world, that's described by playbooks.

But in my opinion, we shouldn't implement it, because ansible-elasticsearch is a role and should stay that way. It's more generic and does a good job of deploying Elasticsearch.

We can add an option as proposed by @tylerjl (restart_on_change, for example) to let users choose whether they want to restart nodes or not (this option would be very helpful).
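
A sketch of how that could look when applying the role; the hosts group and role name are illustrative, and es_restart_on_change is the variable @gingerwizard confirms below:

---
# Example play opting out of automatic restarts on change
- hosts: elasticsearch
  roles:
    - { role: elasticsearch, es_restart_on_change: false }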

For rolling upgrades, we can propose a generic playbook as an example (in the docs?). This would fix serial to 1 and upgrade nodes one by one, as recommended for a rolling restart.

I'm actually trying to write a rolling-upgrade playbook in our environment.

PS: An example I found in the @ekhoinc GitHub repo

@strootman
Contributor

I agree with the general sentiment here. This role is already pretty feature-dense.
The ability to compose multiple plays through - include: statements is probably the way to go.

The example posted by @barryib could be reworked to separate the pre-upgrade and post-upgrade tasks:

---
- name: Elasticsearch rolling upgrade
  ...
  pre_tasks:
    ...
    - name: Disable shard allocation for the cluster
  tasks:
    - include: your_es_playbook.yml
  post_tasks:
    ...

Or

---
- name: Elasticsearch rolling upgrade
  ...
  tasks:
    - include: pre_rolling_restart.yml
    - include: your_es_playbook.yml
    - include: post_rolling_restart.yml
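
One way the play header could be filled in, pinning serial to 1 so only one node is handled at a time (per @barryib's point above); the hosts group and included file names are illustrative:

---
- name: Elasticsearch rolling upgrade
  hosts: elasticsearch
  serial: 1
  tasks:
    - include: pre_rolling_restart.yml    # e.g. synced flush + disable allocation
    - include: your_es_playbook.yml
    - include: post_rolling_restart.yml   # e.g. re-enable allocation + wait for green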

@strootman
Contributor

I should add that once you start introducing playbook and/or role composition, you will inevitably come face to face with Ansible's variable precedence and override shenanigans.
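
For instance (generic Ansible behaviour, not specific to this role): role defaults sit at the bottom of the precedence order, so a play var or an extra var passed with -e will silently win over them once playbooks start composing each other.

---
# Generic precedence illustration: role defaults < group_vars < play vars < -e extra vars
- hosts: elasticsearch
  vars:
    es_restart_on_change: false    # play var overrides the role default
  roles:
    - elasticsearch
# ansible-playbook site.yml -e "es_restart_on_change=true"   # -e overrides everything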

@gingerwizard
Author

We already support an es_restart_on_change option.
Closing. All in agreement. Thanks all.
