
Ensuring restarts are cluster friendly #126

Closed
gingerwizard opened this issue Jul 22, 2016 · 6 comments

Comments

@gingerwizard

Many actions in the role require a node restart. Currently we don't issue a synced flush or disable shard allocation before restarting nodes, per the recommendations here.

Installation of plugins could therefore cause significant cluster disruption and potentially very long recoveries. This ticket proposes:

  1. Issuing a synced flush on all indices before any restart.
  2. Disabling allocation on the cluster before restarting a node.
  3. If (2), we could also wait for the cluster to be green before exiting. Alternatively, we could only operate on a green cluster.

Or possibly a combination of the above.
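
For illustration, a minimal sketch of what those steps could look like as Ansible tasks against the Elasticsearch HTTP API. The host/port, task names and retry values are assumptions, not anything the role currently does:

---
# Sketch only -- not part of the role. Assumes a node reachable on localhost:9200.
- name: Disable shard allocation for the cluster
  uri:
    url: "http://localhost:9200/_cluster/settings"
    method: PUT
    body_format: json
    body: { "transient": { "cluster.routing.allocation.enable": "none" } }

- name: Perform a synced flush on all indices
  uri:
    url: "http://localhost:9200/_flush/synced"
    method: POST
    # 409 means some shards could not be sync-flushed; not fatal before a restart
    status_code: [200, 409]

# ... restart the node here ...

- name: Re-enable shard allocation
  uri:
    url: "http://localhost:9200/_cluster/settings"
    method: PUT
    body_format: json
    body: { "transient": { "cluster.routing.allocation.enable": "all" } }

- name: Wait for the cluster to recover to green
  uri:
    url: "http://localhost:9200/_cluster/health"
    return_content: yes
  register: cluster_health
  until: cluster_health.json.status == "green"
  retries: 30
  delay: 10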

Whilst ideally we would minimise cluster disruption, I see several challenges with the above:

  1. We can't guarantee that we are only operating on one node at a time (i.e. 1 fork), which the above process requires. Maybe we could only perform the above if a single fork is used, and otherwise just skip it.
  2. Waiting for a green state is likely to result in long pauses, e.g. if the customer is indexing or has half-applied roles. Users may therefore cancel. This shouldn't matter if it's definitely idempotent.
  3. We begin to blur the lines between configuration management and cluster management by adding the above.
  4. The user may not want a synced flush operation to occur or allocation to be disabled - it makes assumptions as to what they are doing with the cluster. If we provide this, it should probably be possible to disable it.

@tylerjl @elasticdog @dliappis @jakommo @barryib

A relevant feature may be http://docs.ansible.com/ansible/playbooks_delegation.html

@robin13
Contributor

robin13 commented Jul 22, 2016

I strongly agree with "We begin to blur the lines between configuration management and cluster management by adding the above." and am not sure this level of automation and "cleverness" in a configuration management script is a good idea - it may cause more confusion and errors than it prevents.

I would tend not to implement this functionality, on the grounds that use cases are too different to make this a standard feature.

If implemented at all, it should be an optional feature (disabled by default) which can be enabled with a variable.
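
For what it's worth, a sketch of how such a guard could look; the variable and file names below are purely illustrative, not existing parts of the role:

---
# defaults/main.yml -- hypothetical flag, disabled by default
es_cluster_friendly_restart: false

---
# tasks/main.yml -- only run the extra restart handling when explicitly enabled
- include: cluster-friendly-restart.yml
  when: es_cluster_friendly_restart | bool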

@tylerjl

tylerjl commented Jul 22, 2016

I can't speak for what's typical or atypical within Ansible modules, but offering input from the Puppet side, I lean toward @robin13's perspective in that this treads into "cluster management" territory, and there's essentially unbounded potential for functionality here. There was actually user input in the Puppet module to change the restart_on_change default to stop restarting the service when plugins/configs/package versions changed, for more predictable behavior (obviously the setting can be user-controlled, but the concern was unexpected cluster restarts when, for example, a slightly altered config file triggered cluster-wide node restarts).

I do think there are some cases where managing indices/stateful data within Elasticsearch is just too useful not to handle via config management, like managing index templates. It talks to the API and reads/writes data in a document store, but it's just way too useful a feature for users to ignore.
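
For the Ansible side of this repo, a minimal sketch of that kind of API-driven management, using the uri module to push an index template; the template name, host and file path are purely illustrative:

---
# Illustrative only: installing an index template through the API from Ansible
- name: Install a logstash index template
  uri:
    url: "http://localhost:9200/_template/logstash"
    method: PUT
    body_format: json
    body: "{{ lookup('file', 'files/logstash-template.json') }}"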

It's probably worth granting greater weight to user preference in either case; I don't have as close of a pulse on what's needed as end-users do (for example, I went into the puppet module thinking adding additional supported OSes was high-priority, but getting Shield supported dwarfed all other requests from users).

@barryib
Contributor

barryib commented Jul 22, 2016

I don't really understand what you mean by "cluster management", but what I do understand is that the feature we are talking about is more like orchestration for a rolling upgrade. And in the Ansible world, that's described by playbooks.

But in my opinion, we shouldn't implement it, because ansible-elasticsearch is a role and should stay that way. It's more generic and does a good job of deploying Elasticsearch.

We can add an option as proposed by @tylerjl (restart_on_change, for example) to let users choose whether they want to restart nodes or not (this option would be very helpful).
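
A sketch of how that could look when applying the role; the hosts group and role name are illustrative, and es_restart_on_change is the variable @gingerwizard confirms below:

---
# Example play opting out of automatic restarts on change
- hosts: elasticsearch
  roles:
    - { role: elasticsearch, es_restart_on_change: false }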

For rolling upgrades, we can propose a generic playbook as an example (in the docs?). This would fix serial to 1 and upgrade nodes one by one, as recommended for a rolling restart.

I'm actually trying to write a rolling-upgrade playbook in our environment.

PS: An example I found in the @ekhoinc GitHub repo

@strootman
Contributor

I agree with the general sentiment here. This role is already pretty feature-dense.
The ability to compose multiple plays through - include: statements is probably the way to go.

The example posted by @barryib could be reworked to separate the pre-upgrade and post-upgrade tasks:

---
- name: Elasticsearch rolling upgrade
  ...
  pre_tasks:
    ...
    - name: Disable shard allocation for the cluster
  tasks:
    - include: your_es_playbook.yml
  post_tasks:
    ...

Or

---
- name: Elasticsearch rolling upgrade
  ...
  tasks:
    - include: pre_rolling_restart.yml
    - include: your_es_playbook.yml
    - include: post_rolling_restart.yml
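
One way the play header could be filled in, pinning serial to 1 so only one node is handled at a time (per @barryib's point above); the hosts group and included file names are illustrative:

---
- name: Elasticsearch rolling upgrade
  hosts: elasticsearch
  serial: 1
  tasks:
    - include: pre_rolling_restart.yml    # e.g. synced flush + disable allocation
    - include: your_es_playbook.yml
    - include: post_rolling_restart.yml   # e.g. re-enable allocation + wait for green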

@strootman
Contributor

I should add that once you start introducing playbook and/or role composition, you will inevitably come face to face with Ansible's variable precedence and override shenanigans.
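
For instance (generic Ansible behaviour, not specific to this role): role defaults sit at the bottom of the precedence order, so a play var or an extra var passed with -e will silently win over them once playbooks start composing each other.

---
# Generic precedence illustration: role defaults < group_vars < play vars < -e extra vars
- hosts: elasticsearch
  vars:
    es_restart_on_change: false    # play var overrides the role default
  roles:
    - elasticsearch
# ansible-playbook site.yml -e "es_restart_on_change=true"   # -e overrides everything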

@gingerwizard
Author

We already support an es_restart_on_change option.
Closing. All in agreement. Thanks all.
