
First mlflow-sklearn-e2e workflow run stuck for several minutes in the build step #134

Closed
stefannica opened this issue Jun 7, 2021 · 2 comments · Fixed by #172
Labels: bug (Something isn't working)

stefannica (Member) commented:
I'm running the MLflow sklearn e2e workflow example, as documented, and the first workflow run always gets stuck in the build step, remaining in a pending state for several minutes.

Here's some relevant output from the k8s cluster:

> kubectl -n fuseml-workloads get all -o wide
NAME                                                                  READY   STATUS      RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
pod/affinity-assistant-3df3914653-0                                   1/1     Running     0          8m38s   10.42.0.246   phys01   <none>           <none>
pod/el-mlflow-sklearn-e2e-6c85fbb7dd-rfxvz                            1/1     Running     0          8m47s   10.42.5.94    phys06   <none>           <none>
pod/fuseml-mlflow-project-01-mlflow-test-mlprp-builder-prep-j-k75f7   0/1     Completed   0          8m6s    10.42.0.248   phys01   <none>           <none>
pod/fuseml-mlflow-project-01-mlflow-test-mlprp-clone-fp9pk-po-vhzkh   0/3     Completed   0          8m38s   10.42.0.247   phys01   <none>           <none>
pod/mlflow-86745bb4c8-wdpxp                                           1/1     Running     0          4h50m   10.42.4.69    phys05   <none>           <none>
pod/mlflow-minio-5f7457c965-kjqfr                                     1/1     Running     0          4h50m   10.42.4.70    phys05   <none>           <none>
pod/mlflow-mysql-0                                                    1/1     Running     0          4h50m   10.42.4.68    phys05   <none>           <none>

NAME                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE     SELECTOR
service/el-mlflow-sklearn-e2e   ClusterIP   10.43.252.99    <none>        8080/TCP   8m47s   app.kubernetes.io/managed-by=EventListener,app.kubernetes.io/part-of=Triggers,eventlistener=mlflow-sklearn-e2e
service/mlflow                  ClusterIP   10.43.119.189   <none>        80/TCP     4h50m   app.kubernetes.io/instance=mlflow,app.kubernetes.io/name=mlflow
service/mlflow-minio            ClusterIP   10.43.242.94    <none>        9000/TCP   4h50m   app=minio,release=mlflow
service/mlflow-mysql            ClusterIP   10.43.93.226    <none>        3306/TCP   4h50m   app.kubernetes.io/component=primary,app.kubernetes.io/instance=mlflow,app.kubernetes.io/name=mysql
service/mlflow-mysql-headless   ClusterIP   None            <none>        3306/TCP   4h50m   app.kubernetes.io/component=primary,app.kubernetes.io/instance=mlflow,app.kubernetes.io/name=mysql

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS       IMAGES                                                                                                                                                      SELECTOR
deployment.apps/el-mlflow-sklearn-e2e   1/1     1            1           8m47s   event-listener   gcr.io/tekton-releases/github.com/tektoncd/triggers/cmd/eventlistenersink:v0.12.1@sha256:ec8a946b294ebb8e9752b2dd16bc8ec73f5db1a32e06fb21f97fbda135a7f39c   app.kubernetes.io/managed-by=EventListener,app.kubernetes.io/part-of=Triggers,eventlistener=mlflow-sklearn-e2e
deployment.apps/mlflow                  1/1     1            1           4h50m   mlflow           ghcr.io/fuseml/mlflow:1.14.1                                                                                                                                app.kubernetes.io/instance=mlflow,app.kubernetes.io/name=mlflow
deployment.apps/mlflow-minio            1/1     1            1           4h50m   minio            minio/minio:RELEASE.2021-02-14T04-01-33Z                                                                                                                    app=minio,release=mlflow

NAME                                               DESIRED   CURRENT   READY   AGE     CONTAINERS       IMAGES                                                                                                                                                      SELECTOR
replicaset.apps/el-mlflow-sklearn-e2e-6c85fbb7dd   1         1         1       8m47s   event-listener   gcr.io/tekton-releases/github.com/tektoncd/triggers/cmd/eventlistenersink:v0.12.1@sha256:ec8a946b294ebb8e9752b2dd16bc8ec73f5db1a32e06fb21f97fbda135a7f39c   app.kubernetes.io/managed-by=EventListener,app.kubernetes.io/part-of=Triggers,eventlistener=mlflow-sklearn-e2e,pod-template-hash=6c85fbb7dd
replicaset.apps/mlflow-86745bb4c8                  1         1         1       4h50m   mlflow           ghcr.io/fuseml/mlflow:1.14.1                                                                                                                                app.kubernetes.io/instance=mlflow,app.kubernetes.io/name=mlflow,pod-template-hash=86745bb4c8
replicaset.apps/mlflow-minio-5f7457c965            1         1         1       4h50m   minio            minio/minio:RELEASE.2021-02-14T04-01-33Z                                                                                                                    app=minio,pod-template-hash=5f7457c965,release=mlflow

NAME                                             READY   AGE     CONTAINERS           IMAGES
statefulset.apps/affinity-assistant-3df3914653   1/1     8m38s   affinity-assistant   gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/nop:v0.22.0@sha256:45500af4d68be54b99c7c84f99a61595cf73e3e761c69a8009fe1adcc8ccd2af
statefulset.apps/mlflow-mysql                    1/1     4h50m   mysql                docker.io/bitnami/mysql:8.0.23-debian-10-r0

You can see from the pod list that the builder-prep step has completed, but the builder (kaniko) step hasn't started yet.

> kubectl -n fuseml-workloads get task
NAME           AGE
builder-prep   4h48m
clone          4h48m
kaniko         4h48m
> kubectl -n fuseml-workloads get taskrun
NAME                                                            SUCCEEDED   REASON      STARTTIME   COMPLETIONTIME
fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6                                            
fuseml-mlflow-project-01-mlflow-test-mlprp-builder-prep-jnsw2   True        Succeeded   11m         11m
fuseml-mlflow-project-01-mlflow-test-mlprp-clone-fp9pk          True        Succeeded   12m         11m
> kubectl -n fuseml-workloads get taskrun fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6 -o yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  annotations:
    pipeline.tekton.dev/affinity-assistant: affinity-assistant-3df3914653
  creationTimestamp: "2021-06-07T16:40:32Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: tekton-pipelines
    fuseml/codeset-name: mlflow-test
    fuseml/codeset-project: mlflow-project-01
    fuseml/codeset-version: main
    fuseml/workflow-ref: mlflow-sklearn-e2e
    tekton.dev/pipeline: mlflow-sklearn-e2e
    tekton.dev/pipelineRun: fuseml-mlflow-project-01-mlflow-test-mlprp
    tekton.dev/pipelineTask: builder
  managedFields:
  - apiVersion: tekton.dev/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:pipeline.tekton.dev/affinity-assistant: {}
        f:labels:
          .: {}
          f:fuseml/codeset-name: {}
          f:fuseml/codeset-project: {}
          f:fuseml/codeset-version: {}
          f:fuseml/workflow-ref: {}
          f:tekton.dev/pipeline: {}
          f:tekton.dev/pipelineRun: {}
          f:tekton.dev/pipelineTask: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"12b6345d-b8b0-4d9b-9ae0-ce47e33a788a"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        .: {}
        f:params: {}
        f:resources: {}
        f:serviceAccountName: {}
        f:taskRef:
          .: {}
          f:kind: {}
          f:name: {}
        f:timeout: {}
        f:workspaces: {}
      f:status:
        .: {}
        f:podName: {}
    manager: controller
    operation: Update
    time: "2021-06-07T16:40:32Z"
  name: fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6
  namespace: fuseml-workloads
  ownerReferences:
  - apiVersion: tekton.dev/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: PipelineRun
    name: fuseml-mlflow-project-01-mlflow-test-mlprp
    uid: 12b6345d-b8b0-4d9b-9ae0-ce47e33a788a
  resourceVersion: "324588700"
  selfLink: /apis/tekton.dev/v1beta1/namespaces/fuseml-workloads/taskruns/fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6
  uid: 743968b6-fb17-45aa-8865-1346bac708d6
spec:
  params:
  - name: IMAGE
    value: registry.fuseml-registry/mlflow-builder/mlflow-test:main
  - name: DOCKERFILE
    value: .fuseml/Dockerfile
  resources: {}
  serviceAccountName: staging-triggers-admin
  taskRef:
    kind: Task
    name: kaniko
  timeout: 1h0m0s
  workspaces:
  - name: source
    persistentVolumeClaim:
      claimName: pvc-3df3914653
stefannica added the bug label on Jun 7, 2021
stefannica (Member, Author) commented:

More context: this is a section of the Tekton operator log from around the time the workflow is stuck. It shows that the TaskRun is created properly, but nothing happens after that (i.e. there is no reconciliation to actually realize the TaskRun):

{"level":"info","ts":"2021-06-07T16:40:32.465Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:745","msg":"Creating a new TaskRun object fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6 for pipeline task builder","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"e471d1af-f493-4fe8-a403-8a936b2101f2","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp"}
{"level":"info","ts":"2021-06-07T16:40:32.471Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:582","msg":"PipelineRun fuseml-mlflow-project-01-mlflow-test-mlprp status is being set to &{Succeeded Unknown  {2021-06-07 16:40:32.471492461 +0000 UTC m=+16734.295609773} Running Tasks Completed: 2 (Failed: 0, Cancelled 0), Incomplete: 3, Skipped: 0}","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"e471d1af-f493-4fe8-a403-8a936b2101f2","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp"}
{"level":"info","ts":"2021-06-07T16:40:32.471Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"fuseml-workloads\", Name:\"fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6\", UID:\"743968b6-fb17-45aa-8865-1346bac708d6\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"324588700\", FieldPath:\"\"}): type: 'Normal' reason: 'Started' ","commit":"10870df"}
{"level":"info","ts":"2021-06-07T16:40:32.471Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"PipelineRun\", Namespace:\"fuseml-workloads\", Name:\"fuseml-mlflow-project-01-mlflow-test-mlprp\", UID:\"12b6345d-b8b0-4d9b-9ae0-ce47e33a788a\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"324588633\", FieldPath:\"\"}): type: 'Normal' reason: 'Running' Tasks Completed: 2 (Failed: 0, Cancelled 0), Incomplete: 3, Skipped: 0","commit":"10870df"}
{"level":"info","ts":"2021-06-07T16:40:32.473Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:363","msg":"Cloud Events: []","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"d891de44-fa5b-4c0f-b5b5-0a0fe35eeef4","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6"}
{"level":"info","ts":"2021-06-07T16:40:32.510Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"e471d1af-f493-4fe8-a403-8a936b2101f2","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp","duration":0.052808256}
{"level":"info","ts":"2021-06-07T16:40:32.524Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:582","msg":"PipelineRun fuseml-mlflow-project-01-mlflow-test-mlprp status is being set to &{Succeeded Unknown  {2021-06-07 16:40:32 +0000 UTC} Running Tasks Completed: 2 (Failed: 0, Cancelled 0), Incomplete: 3, Skipped: 0}","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"cecd9852-9a9c-4388-9e54-8f6577cb5054","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp"}
{"level":"info","ts":"2021-06-07T16:40:32.540Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"cecd9852-9a9c-4388-9e54-8f6577cb5054","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp","duration":0.029458687}

Much later, the Tekton operator "gets unstuck" and continues, but it prints a strange error about AWS credentials just before it resumes:

E0607 16:57:59.228117       1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
{"level":"info","ts":"2021-06-07T16:58:00.214Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:461","msg":"Successfully reconciled taskrun fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6/fuseml-workloads with status: &apis.Condition{Type:\"Succeeded\", Status:\"Unknown\", Severity:\"\", LastTransitionTime:apis.VolatileTime{Inner:v1.Time{Time:time.Time{wall:0xc027b1e60cc33ab8, ext:17782038238456, loc:(*time.Location)(0x32f6280)}}}, Reason:\"Pending\", Message:\"Pending\"}","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"d891de44-fa5b-4c0f-b5b5-0a0fe35eeef4","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6"}
{"level":"info","ts":"2021-06-07T16:58:00.214Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"fuseml-workloads\", Name:\"fuseml-mlflow-project-01-mlflow-test-mlprp-builder-44hc6\", UID:\"743968b6-fb17-45aa-8865-1346bac708d6\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"324588700\", FieldPath:\"\"}): type: 'Normal' reason: 'Pending' Pending","commit":"10870df"}
{"level":"info","ts":"2021-06-07T16:58:00.245Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:582","msg":"PipelineRun fuseml-mlflow-project-01-mlflow-test-mlprp status is being set to &{Succeeded Unknown  {2021-06-07 16:40:32 +0000 UTC} Running Tasks Completed: 2 (Failed: 0, Cancelled 0), Incomplete: 3, Skipped: 0}","commit":"10870df","knative.dev/controller":"github.com.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"71319dd7-556f-4d10-8336-e00b09e7b784","knative.dev/key":"fuseml-workloads/fuseml-mlflow-project-01-mlflow-test-mlprp"}

stefannica added this to Backlog in FuseML Project Board via automation on Jun 7, 2021
stefannica moved this from Backlog to V2 in FuseML Project Board on Jun 21, 2021
stefannica added the v2 label on Jun 21, 2021
flaviodsr self-assigned this on Jul 5, 2021
flaviodsr (Member) commented:

I have taken a look at this, and it seems to be an issue with GCR, the registry the kaniko image is pulled from.

According to the Tekton container contract, when a step specifies no command or script, Tekton retrieves the container entrypoint from the image metadata provided by the registry, and it is at that point that the task gets stuck. I came to that conclusion by adding a command entry to the kaniko task, after which I was no longer able to reproduce the issue.
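
As a minimal sketch of that change, assuming the upstream kaniko executor image and the IMAGE/DOCKERFILE parameters visible in the TaskRun above (the actual FuseML task in the fix may differ in details):

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: kaniko
spec:
  params:
    - name: IMAGE
      description: Reference of the image kaniko will build and push
    - name: DOCKERFILE
      description: Path to the Dockerfile, relative to the source workspace
      default: ./Dockerfile
  workspaces:
    - name: source
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.6.0
      # With an explicit command, Tekton no longer needs to fetch the image
      # config from GCR to resolve the entrypoint, which is the lookup
      # that was hanging here.
      command:
        - /kaniko/executor
      args:
        - --dockerfile=$(params.DOCKERFILE)
        - --context=$(workspaces.source.path)
        - --destination=$(params.IMAGE)

With command set, the controller can create the step pod immediately instead of waiting on the registry metadata lookup.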

flaviodsr added a series of commits to flaviodsr/fuseml that referenced this issue between Jul 6 and Jul 13, 2021:

This change fixes: fuseml#134 by calling kaniko through
`command` instead of its entrypoint.

Also updates the kaniko task image to v1.6.0
stefannica moved this from V2 to In progress in FuseML Project Board on Jul 13, 2021
FuseML Project Board automation moved this from In progress to Done on Jul 14, 2021