
First mlflow-sklearn-e2e workflow run stuck for several minutes in the build step #134

Closed
stefannica opened this issue Jun 7, 2021 · 2 comments · Fixed by #172
Labels: bug (Something isn't working)

stefannica (Member) commented:
I'm running the MLflow sklearn e2e workflow example, as documented, and the first workflow run always gets stuck in the build step, remaining in a pending state for several minutes.

Here's some relevant output from the k8s cluster:

> kubectl -n fuseml-workloads get all -o wide
NAME                                                                  READY   STATUS      RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
pod/affinity-assistant-3df3914653-0                                   1/1     Running     0          8m38s   10.42.0.246   phys01   <none>           <none>
pod/el-mlflow-sklearn-e2e-6c85fbb7dd-rfxvz                            1/1     Running     0          8m47s   10.42.5.94    phys06   <none>           <none>
pod/fuseml-mlflow-project-01-mlflow-test-mlprp-builder-prep-j-k75f7   0/1     Completed   0          8m6s    10.42.0.248   phys01   <none>           <none>
pod/fuseml-mlflow-project-01-mlflow-test-mlprp-clone-fp9pk-po-vhzkh   0/3     Completed   0          8m38s   10.42.0.247   phys01   <none>           <none>
pod/mlflow-86745bb4c8-wdpxp                                           1/1     Running     0          4h50m   10.42.4.69    phys05   <none>           <none>
pod/mlflow-minio-5f7457c965-kjqfr                                     1/1     Running     0          4h50m   10.42.4.70    phys05   <none>           <none>
pod/mlflow-mysql-0                                                    1/1     Running     0          4h50m   10.42.4.68    phys05   <none>           <none>

NAME                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE     SELECTOR
service/el-mlflow-sklearn-e2e   ClusterIP   10.43.252.99    <none>        8080/TCP   8m47s   app.kubernetes.io/managed-by=EventListener,app.kubernetes.io/part-of=Triggers,eventlistener=mlflow-sklearn-e2e
service/mlflow                  ClusterIP   10.43.119.189   <none>        80/TCP     4h50m   app.kubernetes.io/instance=mlflow,app.kubernetes.io/name=mlflow
service/mlflow-minio            ClusterIP   10.43.242.94    <none>        9000/TCP   4h50m   app=minio,release=mlflow
service/mlflow-mysql            ClusterIP   10.43.93.226    <none>        3306/TCP   4h50m   app.kubernetes.io/component=primary,app.kubernetes.io/instance=mlflow,app.kubernetes.io/name=mysql
service/mlflow-mysql-headless   ClusterIP   None            <none>        3306/TCP   4h50m   app.kubernetes.io/component=primary,app.kubernetes.io/instance=mlflow,app.kubernetes.io/name=mysql

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS       IMAGES                                                                                                                                                      SELECTOR
deployment.apps/el-mlflow-sklearn-e2e   1/1     1            1           8m47s   event-listener   gcr.io/tekton-releases/github.com/tektoncd/triggers/cmd/eventlistenersink:v0.12.1@sha256:ec8a946b294ebb8e9752b2dd16bc8ec73f5db1a32e06fb21f97fbda135a7f39c   app.kubernetes.io/managed-by=EventListener,app.kubernetes.io/part-of=Triggers,eventlistener=mlflow-sklearn-e2e
deployment.apps/mlflow                  1/1     1            1           4h50m   mlflow           ghcr.io/fuseml/mlflow:1.14.1                                                                                                                                app.kubernetes.io/instance=mlflow,app.kubernetes.io/name=mlflow
deployment.apps/mlflow-minio            1/1     1            1           4h50m   minio            minio/minio:RELEASE.2021-02-14T04-01-33Z                                                                                                                    app=minio,release=mlflow

NAME                                               DESIRED   CURRENT   READY   AGE     CONTAINERS       IMAGES                                                                                                                                                      SELECTOR
replicaset.apps/el-mlflow-sklearn-e2e-6c85fbb7dd   1         1         1       8m47s   event-listener   gcr.io/tekton-releases/github.com/tektoncd/triggers/cmd/eventlistenersink:v0.12.1@sha256:ec8a946b294ebb8e9752b2dd16bc8ec73f5db1a32e06fb21f97fbda135a7f39c   app.kubernetes.io/managed-by=EventListener,app.kubernetes.io/part-of=Triggers,eventlistener=mlflow-sklearn-e2e,pod-template-hash=6c85fbb7dd
replicaset.apps/mlflow-86745bb4c8                  1         1         1       4h50m   mlflow           ghcr.io/fuseml/mlflow:1.14.1                                                                                                                                app.kubernetes.io/instance=mlflow,app.kubernetes.io/name=mlflow,pod-template-hash=86745bb4c8
replicaset.apps/mlflow-minio-5f7457c965            1         1         1       4h50m   minio            minio/minio:RELEASE.2021-02-14T04-01-33Z                                                                                                                    app=minio,pod-template-hash=5f7457c965,release=mlflow

NAME                                             READY   AGE     CONTAINERS           IMAGES
statefulset.apps/affinity-assistant-3df3914653   1/1     8m38s   affinity-assistant   gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/nop:v0.22.0@sha256:45500af4d68be54b99c7c84f99a61595cf73e3e761c69a8009fe1adcc8ccd2af
statefulset.apps/mlflow-mysql                    1/1     4h50m   mysql                docker.io/bitnami/mysql:8.0.23-debian-10-r0

You can see from the pod list that the builder-prep step has completed, but the builder (kaniko) step hasn't started yet.

> kubectl -n fuseml-workloads get task
NAME           AGE
builder-prep   4h48m
clone          4h48m
kaniko         4h48m
> kubectl -n fuseml-workloads get taskrun
NAME                                                            SUCCEEDED   REASON      STARTTIME   COMPLETIONTIME
fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6                                            
fuseml-mlflow-project-01-mlflow-test-mlprp-builder-prep-jnsw2   True        Succeeded   11m         11m
fuseml-mlflow-project-01-mlflow-test-mlprp-clone-fp9pk          True        Succeeded   12m         11m
> kubectl -n fuseml-workloads get taskrun fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6 -o yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  annotations:
    pipeline.tekton.dev/affinity-assistant: affinity-assistant-3df3914653
  creationTimestamp: "2021-06-07T16:40:32Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: tekton-pipelines
    fuseml/codeset-name: mlflow-test
    fuseml/codeset-project: mlflow-project-01
    fuseml/codeset-version: main
    fuseml/workflow-ref: mlflow-sklearn-e2e
    tekton.dev/pipeline: mlflow-sklearn-e2e
    tekton.dev/pipelineRun: fuseml-mlflow-project-01-mlflow-test-mlprp
    tekton.dev/pipelineTask: builder
  managedFields:
  - apiVersion: tekton.dev/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:pipeline.tekton.dev/affinity-assistant: {}
        f:labels:
          .: {}
          f:fuseml/codeset-name: {}
          f:fuseml/codeset-project: {}
          f:fuseml/codeset-version: {}
          f:fuseml/workflow-ref: {}
          f:tekton.dev/pipeline: {}
          f:tekton.dev/pipelineRun: {}
          f:tekton.dev/pipelineTask: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"12b6345d-b8b0-4d9b-9ae0-ce47e33a788a"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        .: {}
        f:params: {}
        f:resources: {}
        f:serviceAccountName: {}
        f:taskRef:
          .: {}
          f:kind: {}
          f:name: {}
        f:timeout: {}
        f:workspaces: {}
      f:status:
        .: {}
        f:podName: {}
    manager: controller
    operation: Update
    time: "2021-06-07T16:40:32Z"
  name: fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6
  namespace: fuseml-workloads
  ownerReferences:
  - apiVersion: tekton.dev/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: PipelineRun
    name: fuseml-mlflow-project-01-mlflow-test-mlprp
    uid: 12b6345d-b8b0-4d9b-9ae0-ce47e33a788a
  resourceVersion: "324588700"
  selfLink: /apis/tekton.dev/v1beta1/namespaces/fuseml-workloads/taskruns/fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6
  uid: 743968b6-fb17-45aa-8865-1346bac708d6
spec:
  params:
  - name: IMAGE
    value: registry.fuseml-registry/mlflow-builder/mlflow-test:main
  - name: DOCKERFILE
    value: .fuseml/Dockerfile
  resources: {}
  serviceAccountName: staging-triggers-admin
  taskRef:
    kind: Task
    name: kaniko
  timeout: 1h0m0s
  workspaces:
  - name: source
    persistentVolumeClaim:
      claimName: pvc-3df3914653
stefannica added the bug label on Jun 7, 2021
stefannica (Member, Author) commented:

More context: this is a section of the Tekton operator log from around the time the workflow is stuck. It shows that the TaskRun is created properly, but nothing happens after that (i.e. there is no reconciliation to actually realize the TaskRun):

{"level":"info","ts":"2021-06-07T16:40:32.465Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:745","msg":"Creating a new TaskRun object fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6 for pipeline task builder","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"e471d1af-f493-4fe8-a403-8a936b2101f2","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp"}
{"level":"info","ts":"2021-06-07T16:40:32.471Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:582","msg":"PipelineRun fuseml-mlflow-project-01-mlflow-test-mlprp status is being set to &{Succeeded Unknown  {2021-06-07 16:40:32.471492461 +0000 UTC m=+16734.295609773} Running Tasks Completed: 2 (Failed: 0, Cancelled 0), Incomplete: 3, Skipped: 0}","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"e471d1af-f493-4fe8-a403-8a936b2101f2","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp"}
{"level":"info","ts":"2021-06-07T16:40:32.471Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"fuseml-workloads\", Name:\"fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6\", UID:\"743968b6-fb17-45aa-8865-1346bac708d6\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"324588700\", FieldPath:\"\"}): type: 'Normal' reason: 'Started' ","commit":"10870df"}
{"level":"info","ts":"2021-06-07T16:40:32.471Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"PipelineRun\", Namespace:\"fuseml-workloads\", Name:\"fuseml-mlflow-project-01-mlflow-test-mlprp\", UID:\"12b6345d-b8b0-4d9b-9ae0-ce47e33a788a\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"324588633\", FieldPath:\"\"}): type: 'Normal' reason: 'Running' Tasks Completed: 2 (Failed: 0, Cancelled 0), Incomplete: 3, Skipped: 0","commit":"10870df"}
{"level":"info","ts":"2021-06-07T16:40:32.473Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:363","msg":"Cloud Events: []","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"d891de44-fa5b-4c0f-b5b5-0a0fe35eeef4","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6"}
{"level":"info","ts":"2021-06-07T16:40:32.510Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"e471d1af-f493-4fe8-a403-8a936b2101f2","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp","duration":0.052808256}
{"level":"info","ts":"2021-06-07T16:40:32.524Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:582","msg":"PipelineRun fuseml-mlflow-project-01-mlflow-test-mlprp status is being set to &{Succeeded Unknown  {2021-06-07 16:40:32 +0000 UTC} Running Tasks Completed: 2 (Failed: 0, Cancelled 0), Incomplete: 3, Skipped: 0}","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"cecd9852-9a9c-4388-9e54-8f6577cb5054","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp"}
{"level":"info","ts":"2021-06-07T16:40:32.540Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"cecd9852-9a9c-4388-9e54-8f6577cb5054","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp","duration":0.029458687}

Much later, the Tekton operator "gets unstuck" and continues, but it prints a strange error about AWS credentials just before it resumes:

E0607 16:57:59.228117       1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
{"level":"info","ts":"2021-06-07T16:58:00.214Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:461","msg":"Successfully reconciled taskrun fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6/fuseml-workloads with status: &apis.Condition{Type:\"Succeeded\", Status:\"Unknown\", Severity:\"\", LastTransitionTime:apis.VolatileTime{Inner:v1.Time{Time:time.Time{wall:0xc027b1e60cc33ab8, ext:17782038238456, loc:(*time.Location)(0x32f6280)}}}, Reason:\"Pending\", Message:\"Pending\"}","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"d891de44-fa5b-4c0f-b5b5-0a0fe35eeef4","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6"}
{"level":"info","ts":"2021-06-07T16:58:00.214Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"fuseml-workloads\", Name:\"fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6\", UID:\"743968b6-fb17-45aa-8865-1346bac708d6\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"324588700\", FieldPath:\"\"}): type: 'Normal' reason: 'Pending' Pending","commit":"10870df"}
{"level":"info","ts":"2021-06-07T16:58:00.245Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:582","msg":"PipelineRun fuseml-mlflow-project-01-mlflow-test-mlprp status is being set to &{Succeeded Unknown  {2021-06-07 16:40:32 +0000 UTC} Running Tasks Completed: 2 (Failed: 0, Cancelled 0), Incomplete: 3, Skipped: 0}","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"71319dd7-556f-4d10-8336-e00b09e7b784","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp"}

stefannica added this to Backlog in FuseML Project Board via automation on Jun 7, 2021
stefannica moved this from Backlog to V2 in FuseML Project Board on Jun 21, 2021
stefannica added the v2 label on Jun 21, 2021
flaviodsr self-assigned this on Jul 5, 2021
flaviodsr (Member) commented:

I have taken a look at this, and it seems to be an issue with GCR, the registry the kaniko image is pulled from.

According to the Tekton container contract, when a step specifies no command or script, Tekton retrieves the container entrypoint from the image metadata provided by the registry, and it is at that point that the task gets stuck. I came to that conclusion by adding a command entry to the kaniko task, after which I was no longer able to reproduce the issue.
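
As a minimal sketch of that change, assuming the upstream kaniko executor image and the IMAGE/DOCKERFILE parameters visible in the TaskRun above (the actual FuseML task in the fix may differ in details):

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: kaniko
spec:
  params:
    - name: IMAGE
      description: Reference of the image kaniko will build and push
    - name: DOCKERFILE
      description: Path to the Dockerfile, relative to the source workspace
      default: ./Dockerfile
  workspaces:
    - name: source
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.6.0
      # With an explicit command, Tekton no longer needs to fetch the image
      # config from GCR to resolve the entrypoint, which is the lookup
      # that was hanging here.
      command:
        - /kaniko/executor
      args:
        - --dockerfile=$(params.DOCKERFILE)
        - --context=$(workspaces.source.path)
        - --destination=$(params.IMAGE)

With command set, the controller can create the step pod immediately instead of waiting on the registry metadata lookup.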

flaviodsr added a series of commits to flaviodsr/fuseml that referenced this issue between Jul 6 and Jul 13, 2021:

This change fixes: fuseml#134 by calling kaniko through
`command` instead of its entrypoint.

Also updates the kaniko task image to v1.6.0
stefannica moved this from V2 to In progress in FuseML Project Board on Jul 13, 2021
FuseML Project Board automation moved this from In progress to Done on Jul 14, 2021