:parent_page_id: elasticsearch-specification
:page_id: advanced-node-scheduling
ifdef::env-github[]
****
link:https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-{parent_page_id}.html#k8s-{page_id}[View this document on the Elastic website]
****
endif::[]
[id="{p}-{page_id}"]
= Advanced Elasticsearch node scheduling
Elastic Cloud on Kubernetes (ECK) offers full control over the scheduling of Elasticsearch cluster nodes by combining Elasticsearch configuration with Kubernetes scheduling options:

* <<{p}-define-elasticsearch-nodes-roles,Define Elasticsearch nodes roles>>
* <<{p}-affinity-options,Pod affinity and anti-affinity>>
* <<{p}-availability-zone-awareness,Topology spread constraints and availability zone awareness>>
* <<{p}-hot-warm-topologies,Hot-warm topologies>>

You can combine these features to deploy a production-grade Elasticsearch cluster.

[id="{p}-define-elasticsearch-nodes-roles"]
== Define Elasticsearch nodes roles
You can configure Elasticsearch nodes with link:https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html[one or multiple roles].

TIP: You can use link:https://yaml.org/spec/1.2/spec.html#id2765878[YAML anchors] to declare the configuration change once and reuse it across all the node sets.

This allows you to describe an Elasticsearch cluster with 3 dedicated master nodes, for example:

[source,yaml,subs="attributes"]
----
apiVersion: elasticsearch.k8s.elastic.co/{eck_crd_version}
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: {version}
  nodeSets:
  # 3 dedicated master nodes
  - name: master
    count: 3
    config:
      node.roles: ["master"]
      node.remote_cluster_client: false
  # 3 ingest-data nodes
  - name: ingest-data
    count: 3
    config:
      node.roles: ["data", "ingest"]
----
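
As noted in the tip, a YAML anchor can declare a shared configuration block once and reuse it across node sets. The following is a minimal sketch: the anchor name `common-config`, the node set names, and the `node.store.allow_mmap` setting are only illustrative.

[source,yaml,subs="attributes"]
----
apiVersion: elasticsearch.k8s.elastic.co/{eck_crd_version}
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: {version}
  nodeSets:
  - name: group-a
    count: 3
    # declare the shared configuration once, under a YAML anchor
    config: &common-config
      node.store.allow_mmap: false
  - name: group-b
    count: 3
    # reuse the same configuration through the alias
    config: *common-config
----
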
[id="{p}-affinity-options"]
== Affinity options
You can set up various link:https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity[affinity and anti-affinity options] through the `podTemplate` section of the Elasticsearch resource specification.

=== A single Elasticsearch node per Kubernetes host (default)

To avoid scheduling several Elasticsearch nodes from the same cluster on the same host, use a `podAntiAffinity` rule based on the hostname and the cluster name label:

[source,yaml,subs="attributes"]
----
apiVersion: elasticsearch.k8s.elastic.co/{eck_crd_version}
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: {version}
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    elasticsearch.k8s.elastic.co/cluster-name: quickstart
                topologyKey: kubernetes.io/hostname
----
This is the default ECK behaviour if you don't specify any `affinity` option. To explicitly disable it, set an empty affinity object:

[source,yaml,subs="attributes"]
----
apiVersion: elasticsearch.k8s.elastic.co/{eck_crd_version}
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: {version}
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        affinity: {}
----
The default affinity uses `preferredDuringSchedulingIgnoredDuringExecution`, which is best effort and won't prevent an Elasticsearch node from being scheduled on a host if no other host is available. For example, scheduling a 4-node Elasticsearch cluster on a 3-host Kubernetes cluster would still succeed, placing 2 Elasticsearch nodes on the same host. To enforce a strict single Elasticsearch node per host, specify `requiredDuringSchedulingIgnoredDuringExecution` instead:

[source,yaml,subs="attributes"]
----
apiVersion: elasticsearch.k8s.elastic.co/{eck_crd_version}
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: {version}
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  elasticsearch.k8s.elastic.co/cluster-name: quickstart
              topologyKey: kubernetes.io/hostname
----
=== Local Persistent Volume constraints
By default, volumes can be bound to a Pod before the Pod gets scheduled to a particular Kubernetes node. This can be a problem if the PersistentVolume can only be accessed from a particular host or set of hosts. Local persistent volumes are a good example: they are accessible from a single host. If the Pod gets scheduled to a different host based on any affinity or anti-affinity rule, the volume may not be available.
To solve this problem, you can link:https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode[set the Volume Binding Mode] of the `StorageClass` you are using. Make sure `volumeBindingMode: WaitForFirstConsumer` is set, link:https://kubernetes.io/docs/concepts/storage/volumes/#local[especially if you are using local persistent volumes].

[source,yaml]
----
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
----
=== Node affinity
To restrict the scheduling to a particular set of Kubernetes nodes based on labels, use a link:https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector[NodeSelector].
The following example schedules Elasticsearch Pods on Kubernetes nodes tagged with both labels `diskType: ssd` and `environment: production`.

[source,yaml,subs="attributes"]
----
apiVersion: elasticsearch.k8s.elastic.co/{eck_crd_version}
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: {version}
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        nodeSelector:
          diskType: ssd
          environment: production
----
You can achieve the same (and more) with link:https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#node-affinity-beta-feature[node affinity]:

[source,yaml,subs="attributes"]
----
apiVersion: elasticsearch.k8s.elastic.co/{eck_crd_version}
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: {version}
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: environment
                  operator: In
                  values:
                  - e2e
                  - production
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                - key: diskType
                  operator: In
                  values:
                  - ssd
----
This example restricts Elasticsearch nodes so they are only scheduled on Kubernetes hosts tagged with `environment: e2e` or `environment: production`. It favors nodes tagged with `diskType: ssd`.

[id="{p}-availability-zone-awareness"]
== Topology spread constraints and availability zone awareness
Starting with ECK 2.0, the operator can make Kubernetes node labels available as Pod annotations. This can be used to make information, such as logical failure domains, available in a running Pod. Combined with link:https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-awareness.html#allocation-awareness[Elasticsearch shard allocation awareness] and link:https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/[Kubernetes topology spread constraints], you can create an availability zone-aware Elasticsearch cluster.

[id="{p}-availability-zone-awareness-downward-api"]
=== Exposing Kubernetes node topology labels in Pods
. First, ensure that the operator's flag `exposed-node-labels` contains the list of the Kubernetes node labels that should be exposed in the Elasticsearch Pods. If you are using the provided installation manifest or the Helm chart, this flag is already preset with two wildcard patterns matching the well-known node labels that describe Kubernetes cluster topology: `topology.kubernetes.io/.*` and `failure-domain.beta.kubernetes.io/.*`. A configuration sketch follows this list.
. On the Elasticsearch resource, set the `eck.k8s.elastic.co/downward-node-labels` annotation to the list of the Kubernetes node labels that should be copied as Pod annotations.
. Use the link:https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/[Kubernetes downward API] in the `podTemplate` to make those annotations available as environment variables in Elasticsearch Pods.

Refer to the next section or to the link:{eck_github}/tree/{eck_release_branch}/config/samples/elasticsearch/elasticsearch.yaml[Elasticsearch sample resource in the ECK source repository] for a complete example.
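
If you need to set or extend the `exposed-node-labels` flag yourself, you can provide it through the operator configuration. The following is only a sketch: it assumes the default installation manifests, where the operator reads its settings from an `eck.yaml` key in the `elastic-operator` ConfigMap of the `elastic-system` namespace, and it shows the two default wildcard patterns from step 1. Merge this with your existing operator configuration and adapt it to your installation method (for example, Helm chart values) as needed.

[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  # name and namespace assume the default ECK installation manifests
  name: elastic-operator
  namespace: elastic-system
data:
  eck.yaml: |-
    # Kubernetes node labels exposed to Elasticsearch Pods, as a comma-separated list of patterns
    exposed-node-labels: "topology.kubernetes.io/.*,failure-domain.beta.kubernetes.io/.*"
----
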
[id="{p}-availability-zone-awareness-example"]
=== Using node topology labels, Kubernetes topology spread constraints, and Elasticsearch shard allocation awareness
The following example demonstrates how to use the `topology.kubernetes.io/zone` node labels to spread a NodeSet across the availability zones of a Kubernetes cluster.
Note that by default ECK creates a `k8s_node_name` attribute with the name of the Kubernetes node running the Pod, and configures Elasticsearch to use this attribute. This ensures that Elasticsearch allocates primary and replica shards to Pods running on different Kubernetes nodes and never to Pods that are scheduled onto the same Kubernetes node. To preserve this behavior while making Elasticsearch aware of the availability zone, include the `k8s_node_name` attribute in the comma-separated `cluster.routing.allocation.awareness.attributes` list.

[source,yaml,subs="attributes"]
----
apiVersion: elasticsearch.k8s.elastic.co/{eck_crd_version}
kind: Elasticsearch
metadata:
  annotations:
    eck.k8s.elastic.co/downward-node-labels: "topology.kubernetes.io/zone"
  name: quickstart
spec:
  version: {version}
  nodeSets:
  - name: default
    count: 3
    config:
      node.attr.zone: ${ZONE}
      cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
    podTemplate:
      spec:
        # The initContainers section is only required if using secureSettings in conjunction with zone awareness.
        # initContainers:
        # - name: elastic-internal-init-keystore
        #   env:
        #   - name: ZONE
        #     valueFrom:
        #       fieldRef:
        #         fieldPath: metadata.annotations['topology.kubernetes.io/zone']
        containers:
        - name: elasticsearch
          env:
          - name: ZONE
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['topology.kubernetes.io/zone']
        topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              elasticsearch.k8s.elastic.co/cluster-name: quickstart
              elasticsearch.k8s.elastic.co/statefulset-name: quickstart-es-default
----
This example relies on:

- Kubernetes nodes in each zone being labeled accordingly. `topology.kubernetes.io/zone` link:https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#interlude-built-in-node-labels[is standard], but any label can be used.
- link:https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/[Pod topology spread constraints] to spread the Pods across availability zones in the Kubernetes cluster.
- Elasticsearch configured to link:https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-awareness.html#allocation-awareness[allocate shards based on node attributes]. Here we specified `node.attr.zone`, but any attribute name can be used. `node.attr.rack_id` is another common example.

[id="{p}-hot-warm-topologies"]
== Hot-warm topologies
By combining link:https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-awareness.html#allocation-awareness[Elasticsearch shard allocation awareness] with link:https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#node-affinity-beta-feature[Kubernetes node affinity], you can set up an Elasticsearch cluster with hot-warm topology:

[source,yaml,subs="attributes"]
----
apiVersion: elasticsearch.k8s.elastic.co/{eck_crd_version}
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: {version}
  nodeSets:
  # hot nodes, with high CPU and fast IO
  - name: hot
    count: 3
    config:
      node.attr.data: hot
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 16Gi
              cpu: 4
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: beta.kubernetes.io/instance-type
                  operator: In
                  values:
                  - highio
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Ti
        storageClassName: local-storage
  # warm nodes, with high storage
  - name: warm
    count: 3
    config:
      node.attr.data: warm
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            limits:
              memory: 16Gi
              cpu: 2
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: beta.kubernetes.io/instance-type
                  operator: In
                  values:
                  - highstorage
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Ti
        storageClassName: local-storage
----
In this example, we configure two groups of Elasticsearch nodes:

- The first group has the `data` attribute set to `hot`. It is intended to run on hosts with high CPU resources and fast IO (SSD). Pods can only be scheduled on Kubernetes nodes labeled with `beta.kubernetes.io/instance-type: highio` (adapt this to the labels of your own Kubernetes nodes).
- The second group has the `data` attribute set to `warm`. It is intended to run on hosts with larger, but possibly slower, storage. Pods can only be scheduled on nodes labeled with `beta.kubernetes.io/instance-type: highstorage`.

NOTE: This example uses link:https://kubernetes.io/docs/concepts/storage/volumes/#local[Local Persistent Volumes] for both groups, but it can be adapted to use high-performance volumes for `hot` Elasticsearch nodes and high-storage volumes for `warm` Elasticsearch nodes.

Finally, set up link:https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html[Index Lifecycle Management] policies on your indices, link:https://www.elastic.co/blog/implementing-hot-warm-cold-in-elasticsearch-with-index-lifecycle-management[optimizing for hot-warm architectures].
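
For instance, a policy created with `PUT _ilm/policy/hot-warm-policy` could use the `allocate` action to move indices to the warm nodes by requiring the `data: warm` attribute defined above. The following request body is only a sketch: the policy name, rollover thresholds, and phase timings are placeholders to adapt to your own data.

[source,json]
----
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "7d",
            "max_size": "50gb"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "warm"
            }
          }
        }
      }
    }
  }
}
----

Indices managed by such a policy are written on the `hot` nodes first, then relocated to the `warm` nodes once they enter the warm phase.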