
Dask Auto scaler failing to create #774

Open · LuanAraldi opened this issue Jul 20, 2023 · 8 comments

Comments

@LuanAraldi

I'm trying to set up a simple DaskAutoscaler on Kubernetes using YAML files, but the autoscaler fails to be created with the following error:

  Error  Logging  45s  kopf  Timer 'daskautoscaler_adapt' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 850, in daskautoscaler_adapt
    scheduler = await Pod.get(
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 186, in get
    raise NotFoundError(f"Could not find {cls.kind} {name}.")
kr8s._exceptions.NotFoundError: Could not find Pod None.
  Error  Logging  2s  kopf  Handler 'daskautoscaler_create' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 841, in daskautoscaler_create
    autoscaler = await DaskAutoscaler(body)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 45, in __init__
    raise ValueError("resource must be a dict or a string")
ValueError: resource must be a dict or a string

The autoscaler.yaml file that I am using is this one:

apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  namespace: dask
  name: autoscaled
spec:
  cluster: autoscaled
  minimum: 1 
  maximum: 5 

The cluster YAML definition is as follows:

apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: autoscaled
  namespace: dask
spec:
  worker:
    replicas: 0
    spec:
      serviceAccountName: dask-operator-sa
      tolerations:  
      - key: dedicated
        operator: Equal
        value: dask-worker
      nodeSelector: 
        dedicated: dask-worker
      containers:
      - name: worker
        image: "ghcr.io/dask/dask:latest"
        imagePullPolicy: "IfNotPresent"
        args:
          - dask-worker
          - --name
          - $(DASK_WORKER_NAME)
          - --dashboard
          - --dashboard-address
          - "8788"
        ports:
          - name: http-dashboard
            containerPort: 8788
            protocol: TCP
        env:
          - name: EXTRA_PIP_PACKAGES
            value: pyarrow s3fs
        resources:
          limits:
            cpu: "2"
            memory: "18G"
          requests:
            cpu: "1"
            memory: "16G"
        
  scheduler:
    spec:
      containers:
      - name: scheduler
        image: "ghcr.io/dask/dask:latest"
        imagePullPolicy: "IfNotPresent"
        args:
          - dask-scheduler
        ports:
          - name: tcp-comm
            containerPort: 8786
            protocol: TCP
          - name: http-dashboard
            containerPort: 8787
            protocol: TCP
        readinessProbe:
          httpGet:
            port: http-dashboard
            path: /health
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            port: http-dashboard
            path: /health
          initialDelaySeconds: 15
          periodSeconds: 20
        resources:
          limits:
            cpu: "1"
            memory: "3G"
          requests:
            cpu: "1"
            memory: "2G"
        env:
          - name: EXTRA_PIP_PACKAGES
            value: pyarrow s3fs
          - name: DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION
            value: "1.0"
    service:
      type: NodePort
      selector:
        dask.org/cluster-name: autoscaled
        dask.org/component: scheduler
      ports:
      - name: tcp-comm
        protocol: TCP
        port: 8786
        targetPort: "tcp-comm"
      - name: http-dashboard
        protocol: TCP
        port: 8787
        targetPort: "http-dashboard"

Environment:

  • Dask version: 2023.7.0
  • Python version: 3.10.9
@LuanAraldi (Author)

I've found out what the issue is: the Dask Kubernetes Operator cannot find the scheduler in other namespaces. I had the operator pod running in the dask-operator namespace and this cluster in the autoscaled namespace. When the operator tries to get the cluster's scheduler it cannot find it, because the namespace is not passed to the lookup.
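
For illustration, here is a minimal sketch of what a namespace-aware scheduler lookup could look like. This is not the actual controller code: it assumes kr8s's async Pod.get accepts label_selector and namespace keyword arguments, and that the scheduler pod carries the dask.org/cluster-name and dask.org/component labels used in the service selectors above.

from kr8s.asyncio.objects import Pod  # assumed import path for kr8s's async Pod object


async def get_scheduler_pod(cluster_name: str, namespace: str) -> Pod:
    # Passing namespace= explicitly is the key part: without it the lookup
    # falls back to the client's default namespace and fails with
    # "Could not find Pod None." when the cluster lives elsewhere.
    return await Pod.get(
        label_selector=(
            f"dask.org/cluster-name={cluster_name},"
            "dask.org/component=scheduler"
        ),
        namespace=namespace,
    )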

@jacobtomlinson (Member)

Ah that makes sense, thanks for following up. We should handle these errors better and give the user some more useful guidance.

In this situation we should probably put the DaskAutoscaler into a "Pending" state with a reason of "Cannot find cluster in current namespace to autoscale".
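
A rough sketch of that behaviour could look like the following. This is not the operator's actual handler: it assumes the DaskAutoscaler CRD accepts status writes through kopf's patch kwarg, and it inlines the same hypothetical label-selector lookup as the sketch above.

import kopf
from kr8s._exceptions import NotFoundError  # exception path shown in the tracebacks above
from kr8s.asyncio.objects import Pod  # assumed import path


@kopf.timer("kubernetes.dask.org", "v1", "daskautoscaler", interval=5.0)
async def daskautoscaler_adapt(spec, name, namespace, patch, **kwargs):
    try:
        scheduler = await Pod.get(
            label_selector=(
                f"dask.org/cluster-name={spec['cluster']},"
                "dask.org/component=scheduler"
            ),
            namespace=namespace,
        )
    except NotFoundError:
        # Surface a human-readable state instead of a bare traceback.
        patch.status["phase"] = "Pending"
        patch.status["reason"] = "Cannot find cluster in current namespace to autoscale"
        # TemporaryError asks kopf to retry the timer later rather than fail hard.
        raise kopf.TemporaryError(
            f"No scheduler pod for cluster {spec['cluster']!r} in namespace {namespace!r}",
            delay=15,
        )
    ...  # continue with adaptive scaling against the scheduler pod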

@chetankumar-patel-aera commented Aug 14, 2023

@jacobtomlinson I am facing a similar issue. I have installed the dask-operator in one namespace and the cluster and autoscaler in another namespace. Does this mean the operator cannot find the scheduler in a different namespace to autoscale? How do I resolve this issue?

@jacobtomlinson (Member)

@chetankumar-patel-aera The cluster and autoscaler need to be in the same namespace. The operator can be installed anywhere as it watches all namespaces by default.
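
A quick way to sanity-check this is to confirm that the DaskCluster named in the autoscaler's spec.cluster exists in the autoscaler's own namespace. The snippet below is only a sketch using the standard Kubernetes Python client, not part of dask-kubernetes, and it assumes the CRD plurals are daskclusters and daskautoscalers under the kubernetes.dask.org/v1 API shown in the manifests in this thread.

from kubernetes import client, config
from kubernetes.client.rest import ApiException

GROUP, VERSION = "kubernetes.dask.org", "v1"


def check_same_namespace(namespace: str, autoscaler_name: str) -> None:
    config.load_kube_config()
    api = client.CustomObjectsApi()

    # Look up the DaskAutoscaler and read the cluster it points at.
    autoscaler = api.get_namespaced_custom_object(
        GROUP, VERSION, namespace, "daskautoscalers", autoscaler_name
    )
    cluster_name = autoscaler["spec"]["cluster"]

    # The referenced DaskCluster must exist in the *same* namespace.
    try:
        api.get_namespaced_custom_object(
            GROUP, VERSION, namespace, "daskclusters", cluster_name
        )
    except ApiException as exc:
        if exc.status == 404:
            print(
                f"DaskCluster {cluster_name!r} was not found in namespace {namespace!r}; "
                "the cluster and autoscaler must live in the same namespace."
            )
            return
        raise
    print(f"DaskCluster {cluster_name!r} found in namespace {namespace!r}.")


# e.g. for the manifests in the original post:
check_same_namespace("dask", "autoscaled")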

@chetankumar-patel-aera

@jacobtomlinson Thanks for your response. My Dask cluster and autoscaler are in the same namespace, while the dask-operator is in a different one, but I am still facing the same issue. I tried the simple Dask cluster mentioned in the documentation and still get the same problem.

apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: simple
  namespace: platform
spec:
  worker:
    replicas: 1
    spec:
      containers:
        - name: worker
          image: "ghcr.io/dask/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - dask-worker
            - --name
            - $(DASK_WORKER_NAME)
            - --dashboard
            - --dashboard-address
            - "8788"
          ports:
            - name: http-dashboard
              containerPort: 8788
              protocol: TCP
  scheduler:
    spec:
      containers:
        - name: scheduler
          image: "ghcr.io/dask/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - dask-scheduler
          ports:
            - name: tcp-comm
              containerPort: 8786
              protocol: TCP
            - name: http-dashboard
              containerPort: 8787
              protocol: TCP
          readinessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 15
            periodSeconds: 20
    service:
      type: ClusterIP
      selector:
        dask.org/cluster-name: simple
        dask.org/component: scheduler
      ports:
        - name: tcp-comm
          protocol: TCP
          port: 8786
          targetPort: "tcp-comm"
        - name: http-dashboard
          protocol: TCP
          port: 8787
          targetPort: "http-dashboard"

---

apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: simple
  namespace: platform
spec:
  cluster: "simple"
  minimum: 5  # we recommend always having a minimum of 1 worker so that an idle cluster can start working on tasks immediately
  maximum: 10 # you can place a hard limit on the number of workers regardless of what the scheduler requests

dask version: latest
dask-operator version: ghcr.io/dask/dask-kubernetes-operator:2023.7.3

Error:
kubectl describe daskautoscaler simple

  Normal  Logging  27s   kopf  Handler 'daskautoscaler_create' succeeded.
  Error   Logging  25s   kopf  Timer 'daskautoscaler_adapt' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 765, in daskautoscaler_adapt
    scheduler = await Pod.get(
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 194, in get
    raise NotFoundError(f"Could not find {cls.kind} {name}.")
kr8s._exceptions.NotFoundError: Could not find Pod None.

Can you please help with what the issue could be?

@jacobtomlinson (Member)

Can you try the latest release of dask-kubernetes?

@chetankumar-patel-aera

OK, I will try the ghcr.io/dask/dask-kubernetes-operator:2023.8.0 version of the dask-operator. Was there a fix related to this in that version?

@chetankumar-patel-aera

@jacobtomlinson It is working after updating to 2023.8.0. Thanks for your help.
