
Handle image, env and args fields updates in DaskCluster in k8s operator #895

Open

Fogapod opened this issue Jun 14, 2024 · 3 comments

@Fogapod commented Jun 14, 2024

I have a permanent Dask cluster in Kubernetes. The current operator ignores all changes to the manifest.
There was previously an issue about supporting spec updates; it was closed as resolved after support for the scale field was implemented: #636.

The only fields that cause changes to the deployment after applying an updated manifest are spec.worker.replicas and the DaskAutoscaler minimum/maximum.

Is it possible to support other fields, specifically image, args, env, and volumes/volumeMounts?
If not, what would be the optimal way to gracefully shut down and update the cluster?

Cluster manifest (mostly copied from the example):

---
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: dask-primary
spec:
  worker:
    replicas: 1
    spec:
      containers:
      - name: worker
        image: "//backend/dask:image"
        imagePullPolicy: Always
        args:
          - worker
          - --name
          - $(DASK_WORKER_NAME)
          - --dashboard
          - --dashboard-address
          - "8788"
        ports:
          - name: http-dashboard
            containerPort: 8788
            protocol: TCP
        env:
          - name: ENV_1
            value: "1"  # env var values must be strings in Kubernetes
          - name: ENV_2
            value: "2"
        volumeMounts:
          - name: kafka-certs
            mountPath: /etc/ssl/kafka/ca.crt
            subPath: ca.crt
            readOnly: true

      volumes:
        - name: kafka-certs
          configMap:
            name: kafka-certs

  scheduler:
    spec:
      containers:
      - name: scheduler
        image: "//backend/dask:image"
        imagePullPolicy: Always
        args:
          - scheduler
        ports:
          - name: tcp-comm
            containerPort: 8786
            protocol: TCP
          - name: http-dashboard
            containerPort: 8787
            protocol: TCP
        readinessProbe:
          httpGet:
            port: http-dashboard
            path: /health
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            port: http-dashboard
            path: /health
          initialDelaySeconds: 15
          periodSeconds: 20
      imagePullSecrets:
        - name: regcred
    service:
      type: ClusterIP
      selector:
        dask.org/cluster-name: dask-primary
        dask.org/component: scheduler
      ports:
      - name: tcp-comm
        protocol: TCP
        port: 8786
        targetPort: "tcp-comm"
      - name: http-dashboard
        protocol: TCP
        port: 8787
        targetPort: "http-dashboard"

---
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: dask-primary
spec:
  cluster: dask-primary
  minimum: 1
  maximum: 10

Operator version: helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name --version 2024.5.0 dask-kubernetes-operator
Dask version: custom-built image that uses the following deps:

dask = "^2024.5.2"
bokeh = "^3.4.1"
distributed = "^2024.5.2"

The behaviour is the same with the 2024.5.2-py3.11 image, though.

@jacobtomlinson (Member)

Permanent clusters are far less common than ephemeral clusters, so I'm not surprised this hasn't come up before.

I would be happy to propagate other changes if that's valuable to you. Do you have any interest in raising a PR to handle this?

Alternatively, you could just delete and recreate the resource?
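For reference, a minimal sketch of the delete-and-recreate approach (assuming the updated manifest above is saved as dask-cluster.yaml):

# Delete the DaskCluster resource; the operator tears down its Pods and Services.
kubectl delete daskcluster dask-primary

# Re-apply the updated manifest to recreate the cluster.
kubectl apply -f dask-cluster.yaml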

@Fogapod (Author) commented Jun 14, 2024

Deleting daskcluster.kubernetes.dask.org/dask-primary is what I do now. I am concerned about graceful shutdown, because the scheduler and workers might have pending tasks. Is there a way to do this gracefully?

@jacobtomlinson (Member)

When you delete the cluster, all the Pods will be sent a SIGTERM. At this point the Dask scheduler and workers should shut down gracefully. If they take too long to shut down, Kubernetes will follow up with a SIGKILL, but this timeout is configurable via terminationGracePeriodSeconds.
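For example, a sketch of where that field would go in the worker spec from the manifest above (300 seconds is an arbitrary example value):

spec:
  worker:
    replicas: 1
    spec:
      # Allow workers up to 5 minutes to finish on SIGTERM before Kubernetes
      # sends SIGKILL (the default grace period is 30 seconds).
      terminationGracePeriodSeconds: 300
      containers:
      - name: worker
        image: "//backend/dask:image"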

In a long-lived deployment I expect you have some application that runs work on the Dask cluster. If the Dask cluster restarts without completing a computation, it should be the job of the application to resubmit the work.
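For instance, a rough application-side sketch; the scheduler address dask-primary-scheduler:8786 is an assumption based on the Service in the manifest above, and run_with_resubmit is a hypothetical helper:

from distributed import Client

# Assumed Service DNS name for the scheduler (cluster name + "-scheduler").
SCHEDULER = "tcp://dask-primary-scheduler:8786"

def run_with_resubmit(fn, *args, attempts=3):
    """Resubmit work from scratch if the cluster restarts mid-computation."""
    for attempt in range(attempts):
        try:
            with Client(SCHEDULER) as client:
                return client.submit(fn, *args).result()
        except Exception:
            # Connection lost or scheduler restarted: reconnect and resubmit.
            if attempt == attempts - 1:
                raise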
