Experiencing issues with the PNS executor #1256

Closed · Opened by Tomcli on Mar 12, 2019 · 36 comments · Fixed by #4983 or #4954

@Tomcli commented Mar 12, 2019

Hi @jessesuen, we are experimenting with the Argo PNS executor from PR #1214, running it as the Kubeflow Pipelines backend. The workflow runs smoothly for most of the containers, but we are hitting a race condition with the last container in every workflow. Below are our workflow definition and the corresponding error logs from argoexec.

Failed wait container logs:

$ kubectl logs pipeline-flip-coin-8fn8h-3450342589 -n kubeflow wait
time="2019-03-11T18:44:36Z" level=info msg="Creating PNS executor (namespace: kubeflow, pod: pipeline-flip-coin-8fn8h-3450342589)"
time="2019-03-11T18:44:36Z" level=info msg="Executor (version: v2.3.0+83942fc.dirty, build_date: 2019-03-11T18:38:46Z) initialized with template:\narchiveLocation:\n  archiveLogs: true\n  s3:\n    accessKeySecret:\n      key: accesskey\n      name: mlpipeline-minio-artifact\n    bucket: mlpipeline\n    endpoint: minio-service.kubeflow:9000\n    insecure: true\n    key: artifacts/pipeline-flip-coin-8fn8h/pipeline-flip-coin-8fn8h-3450342589\n    secretKeySecret:\n      key: secretkey\n      name: mlpipeline-minio-artifact\ncontainer:\n  command:\n  - echo\n  - tails and 13 <= 15!\n  image: alpine:3.6\n  name: \"\"\n  resources: {}\ninputs:\n  parameters:\n  - name: random-number-2-output\n    value: \"13\"\nmetadata: {}\nname: print-2-3-4\noutputs:\n  artifacts:\n  - name: mlpipeline-ui-metadata\n    path: /mlpipeline-ui-metadata.json\n    s3:\n      accessKeySecret:\n        key: accesskey\n        name: mlpipeline-minio-artifact\n      bucket: mlpipeline\n      endpoint: minio-service.kubeflow:9000\n      insecure: true\n      key: runs/ac598e95-442d-11e9-a431-ba1598362405/pipeline-flip-coin-8fn8h-3450342589/mlpipeline-ui-metadata.tgz\n      secretKeySecret:\n        key: secretkey\n        name: mlpipeline-minio-artifact\n  - name: mlpipeline-metrics\n    path: /mlpipeline-metrics.json\n    s3:\n      accessKeySecret:\n        key: accesskey\n        name: mlpipeline-minio-artifact\n      bucket: mlpipeline\n      endpoint: minio-service.kubeflow:9000\n      insecure: true\n      key: runs/ac598e95-442d-11e9-a431-ba1598362405/pipeline-flip-coin-8fn8h-3450342589/mlpipeline-metrics.tgz\n      secretKeySecret:\n        key: secretkey\n        name: mlpipeline-minio-artifact\n"
time="2019-03-11T18:44:36Z" level=info msg="Waiting on main container"
time="2019-03-11T18:44:36Z" level=warning msg="Polling root processes (1m0s)"
time="2019-03-11T18:44:37Z" level=warning msg="Failed to open /proc/20/root: open /proc/20/root: permission denied"
time="2019-03-11T18:44:37Z" level=warning msg="Failed to open /proc/20/root: open /proc/20/root: permission denied"
time="2019-03-11T18:44:37Z" level=warning msg="Failed to open /proc/20/root: open /proc/20/root: permission denied"
time="2019-03-11T18:44:37Z" level=info msg="main container started with container ID: fbecd8dcbd49bc961e2462fb76552546d7c5d3739faba488ccd6f3158bec7aad"
time="2019-03-11T18:44:37Z" level=info msg="Starting annotations monitor"
time="2019-03-11T18:44:37Z" level=info msg="Starting deadline monitor"
time="2019-03-11T18:44:37Z" level=error msg="executor error: Could not find associated pid for containerID fbecd8dcbd49bc961e2462fb76552546d7c5d3739faba488ccd6f3158bec7aad\ngithub.com/argoproj/argo/errors.New\n\t/go/src/github.com/argoproj/argo/errors/errors.go:49\ngithub.com/argoproj/argo/errors.Errorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:55\ngithub.com/argoproj/argo/errors.InternalErrorf\n\t/go/src/github.com/argoproj/argo/errors/errors.go:65\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).getContainerPID\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:257\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:147\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).Wait\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:839\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:32\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"
time="2019-03-11T18:44:37Z" level=info msg="No sidecars"
time="2019-03-11T18:44:37Z" level=info msg="Saving logs"
time="2019-03-11T18:44:37Z" level=info msg="Annotations monitor stopped"
time="2019-03-11T18:44:37Z" level=info msg="S3 Save path: /argo/outputs/logs/main.log, key: artifacts/pipeline-flip-coin-8fn8h/pipeline-flip-coin-8fn8h-3450342589/main.log"
time="2019-03-11T18:44:37Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2019-03-11T18:44:37Z" level=info msg="Saving from /argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/pipeline-flip-coin-8fn8h/pipeline-flip-coin-8fn8h-3450342589/main.log)"
time="2019-03-11T18:44:38Z" level=info msg="Deadline monitor stopped"
time="2019-03-11T18:44:39Z" level=info msg="No output parameters"
time="2019-03-11T18:44:39Z" level=info msg="Saving output artifacts"
time="2019-03-11T18:44:39Z" level=info msg="Staging artifact: mlpipeline-ui-metadata"
time="2019-03-11T18:44:39Z" level=info msg="Copying /mlpipeline-ui-metadata.json from container base image layer to /argo/outputs/artifacts/mlpipeline-ui-metadata.tgz"
time="2019-03-11T18:44:39Z" level=error msg="executor error: could not chroot into main for artifact collection: container may have exited too quickly\ngithub.com/argoproj/argo/errors.Wrap\n\t/go/src/github.com/argoproj/argo/errors/errors.go:88\ngithub.com/argoproj/argo/errors.InternalWrapError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:71\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:122\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:329\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:234\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveArtifacts\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:220\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:54\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"
time="2019-03-11T18:44:39Z" level=info msg="Alloc=3802 TotalAlloc=14044 Sys=70078 NumGC=6 Goroutines=10"
time="2019-03-11T18:44:39Z" level=fatal msg="could not chroot into main for artifact collection: container may have exited too quickly\ngithub.com/argoproj/argo/errors.Wrap\n\t/go/src/github.com/argoproj/argo/errors/errors.go:88\ngithub.com/argoproj/argo/errors.InternalWrapError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:71\ngithub.com/argoproj/argo/workflow/executor/pns.(*PNSExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/pns/pns.go:122\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:329\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:234\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).SaveArtifacts\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:220\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:54\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"

Workflow yaml file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pipeline-flip-coin-
spec:
  arguments:
    parameters: []
  entrypoint: pipeline-flip-coin
  serviceAccountName: pipeline-runner
  templates:
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: flip-output
            value: '{{inputs.parameters.flip-output}}'
          - name: random-number-output
            value: '{{tasks.random-number.outputs.parameters.random-number-output}}'
        dependencies:
        - random-number
        name: condition-2
        template: condition-2
        when: '{{tasks.random-number.outputs.parameters.random-number-output}} > 5'
      - arguments:
          parameters:
          - name: flip-output
            value: '{{inputs.parameters.flip-output}}'
          - name: random-number-output
            value: '{{tasks.random-number.outputs.parameters.random-number-output}}'
        dependencies:
        - random-number
        name: condition-3
        template: condition-3
        when: '{{tasks.random-number.outputs.parameters.random-number-output}} <=
          5'
      - arguments:
          parameters:
          - name: flip-output
            value: '{{inputs.parameters.flip-output}}'
        name: random-number
        template: random-number
    inputs:
      parameters:
      - name: flip-output
    name: condition-1
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: flip-output
            value: '{{inputs.parameters.flip-output}}'
          - name: random-number-output
            value: '{{inputs.parameters.random-number-output}}'
        name: print
        template: print
    inputs:
      parameters:
      - name: flip-output
      - name: random-number-output
    name: condition-2
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: flip-output
            value: '{{inputs.parameters.flip-output}}'
          - name: random-number-output
            value: '{{inputs.parameters.random-number-output}}'
        name: print-2
        template: print-2
    inputs:
      parameters:
      - name: flip-output
      - name: random-number-output
    name: condition-3
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: flip-output
            value: '{{inputs.parameters.flip-output}}'
          - name: random-number-2-output
            value: '{{tasks.random-number-2.outputs.parameters.random-number-2-output}}'
        dependencies:
        - random-number-2
        name: condition-5
        template: condition-5
        when: '{{tasks.random-number-2.outputs.parameters.random-number-2-output}}
          > 15'
      - arguments:
          parameters:
          - name: flip-output
            value: '{{inputs.parameters.flip-output}}'
          - name: random-number-2-output
            value: '{{tasks.random-number-2.outputs.parameters.random-number-2-output}}'
        dependencies:
        - random-number-2
        name: condition-6
        template: condition-6
        when: '{{tasks.random-number-2.outputs.parameters.random-number-2-output}}
          <= 15'
      - arguments:
          parameters:
          - name: flip-output
            value: '{{inputs.parameters.flip-output}}'
        name: random-number-2
        template: random-number-2
    inputs:
      parameters:
      - name: flip-output
    name: condition-4
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: flip-output
            value: '{{inputs.parameters.flip-output}}'
          - name: random-number-2-output
            value: '{{inputs.parameters.random-number-2-output}}'
        name: print-2-3
        template: print-2-3
    inputs:
      parameters:
      - name: flip-output
      - name: random-number-2-output
    name: condition-5
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: flip-output
            value: '{{inputs.parameters.flip-output}}'
          - name: random-number-2-output
            value: '{{inputs.parameters.random-number-2-output}}'
        name: print-2-3-4
        template: print-2-3-4
    inputs:
      parameters:
      - name: flip-output
      - name: random-number-2-output
    name: condition-6
  - container:
      args:
      - python -c "import random; result = 'heads' if random.randint(0,1) == 0 else
        'tails'; print(result)" | tee /tmp/output
      command:
      - sh
      - -c
      image: python:alpine3.6
    name: flip
    outputs:
      artifacts:
      - name: mlpipeline-ui-metadata
        path: /mlpipeline-ui-metadata.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-ui-metadata.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
      - name: mlpipeline-metrics
        path: /mlpipeline-metrics.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-metrics.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
      parameters:
      - name: flip-output
        valueFrom:
          path: /tmp/output
  - dag:
      tasks:
      - arguments:
          parameters:
          - name: flip-output
            value: '{{tasks.flip.outputs.parameters.flip-output}}'
        dependencies:
        - flip
        name: condition-1
        template: condition-1
        when: '{{tasks.flip.outputs.parameters.flip-output}} == heads'
      - arguments:
          parameters:
          - name: flip-output
            value: '{{tasks.flip.outputs.parameters.flip-output}}'
        dependencies:
        - flip
        name: condition-4
        template: condition-4
        when: '{{tasks.flip.outputs.parameters.flip-output}} == tails'
      - name: flip
        template: flip
    name: pipeline-flip-coin
  - container:
      command:
      - echo
      - heads and {{inputs.parameters.random-number-output}} > 5!
      image: alpine:3.6
    inputs:
      parameters:
      - name: random-number-output
    name: print
    outputs:
      artifacts:
      - name: mlpipeline-ui-metadata
        path: /mlpipeline-ui-metadata.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-ui-metadata.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
      - name: mlpipeline-metrics
        path: /mlpipeline-metrics.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-metrics.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
  - container:
      command:
      - echo
      - heads and {{inputs.parameters.random-number-output}} <= 5!
      image: alpine:3.6
    inputs:
      parameters:
      - name: random-number-output
    name: print-2
    outputs:
      artifacts:
      - name: mlpipeline-ui-metadata
        path: /mlpipeline-ui-metadata.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-ui-metadata.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
      - name: mlpipeline-metrics
        path: /mlpipeline-metrics.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-metrics.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
  - container:
      command:
      - echo
      - tails and {{inputs.parameters.random-number-2-output}} > 15!
      image: alpine:3.6
    inputs:
      parameters:
      - name: random-number-2-output
    name: print-2-3
    outputs:
      artifacts:
      - name: mlpipeline-ui-metadata
        path: /mlpipeline-ui-metadata.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-ui-metadata.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
      - name: mlpipeline-metrics
        path: /mlpipeline-metrics.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-metrics.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
  - container:
      command:
      - echo
      - tails and {{inputs.parameters.random-number-2-output}} <= 15!
      image: alpine:3.6
    inputs:
      parameters:
      - name: random-number-2-output
    name: print-2-3-4
    outputs:
      artifacts:
      - name: mlpipeline-ui-metadata
        path: /mlpipeline-ui-metadata.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-ui-metadata.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
      - name: mlpipeline-metrics
        path: /mlpipeline-metrics.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-metrics.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
  - container:
      args:
      - python -c "import random; print(random.randint(0,9))" | tee /tmp/output
      command:
      - sh
      - -c
      image: python:alpine3.6
    name: random-number
    outputs:
      artifacts:
      - name: mlpipeline-ui-metadata
        path: /mlpipeline-ui-metadata.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-ui-metadata.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
      - name: mlpipeline-metrics
        path: /mlpipeline-metrics.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-metrics.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
      parameters:
      - name: random-number-output
        valueFrom:
          path: /tmp/output
  - container:
      args:
      - python -c "import random; print(random.randint(10,19))" | tee /tmp/output
      command:
      - sh
      - -c
      image: python:alpine3.6
    name: random-number-2
    outputs:
      artifacts:
      - name: mlpipeline-ui-metadata
        path: /mlpipeline-ui-metadata.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-ui-metadata.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
      - name: mlpipeline-metrics
        path: /mlpipeline-metrics.json
        s3:
          accessKeySecret:
            key: accesskey
            name: mlpipeline-minio-artifact
          bucket: mlpipeline
          endpoint: minio-service.kubeflow:9000
          insecure: true
          key: runs/{{workflow.uid}}/{{pod.name}}/mlpipeline-metrics.tgz
          secretKeySecret:
            key: secretkey
            name: mlpipeline-minio-artifact
      parameters:
      - name: random-number-2-output
        valueFrom:
          path: /tmp/output

Related issues: #970
cc: @animeshsingh

Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT

What happened:
The file handle is not secured before the main container starts.

What you expected to happen:
The file handle should be secured before the main container starts.

How to reproduce it (as minimally and precisely as possible):
Run the workflow definition above with the PNS executor
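
For reference, a minimal sketch of selecting the PNS executor via the controller config, assuming the stock workflow-controller-configmap layout (the namespace and key names here are the usual defaults, not taken from this issue):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: kubeflow
data:
  config: |
    # switch argoexec from the default docker executor to PNS
    containerRuntimeExecutor: pns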

Anything else we need to know?:

Environment:

  • Argo version: 2.3.0
  • Kubernetes version: 1.13.4
@Tomcli (Author) commented Mar 14, 2019

Update:
It looks like the shared process namespace does not give the "wait" container enough permission to copy files from "main" for containers at the end of a workflow. My temporary workaround is adding SYS_PTRACE and setting privileged: true on the "wait" container. However, I still occasionally hit a race condition where the "wait" container tries to copy the output files before they are ready. A sketch of that securityContext change is below, followed by the error logs from when the race condition happened.
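
For illustration only, a minimal sketch of what that workaround looks like as a container-level securityContext (the capability and flag are the ones named above; applying it to the generated wait container is the environment-specific part):

securityContext:
  privileged: true        # temporary workaround described above
  capabilities:
    add:
    - SYS_PTRACE          # lets the wait container follow the main container's /proc entry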

time="2019-03-14T20:29:00Z" level=info msg="Creating PNS executor (namespace: kubeflow, pod: pipeline-flip-coin-fzmf4-745333038)"
time="2019-03-14T20:29:00Z" level=info msg="Executor (version: v2.3.0+83942fc.dirty, build_date: 2019-03-14T20:25:57Z) initialized with template:\narchiveLocation:\n  archiveLogs: true\n  s3:\n    accessKeySecret:\n      key: accesskey\n      name: mlpipeline-minio-artifact\n    bucket: mlpipeline\n    endpoint: minio-service.kubeflow:9000\n    insecure: true\n    key: artifacts/pipeline-flip-coin-fzmf4/pipeline-flip-coin-fzmf4-745333038\n    secretKeySecret:\n      key: secretkey\n      name: mlpipeline-minio-artifact\ncontainer:\n  args:\n  - python -c \"import random; result = 'heads' if random.randint(0,1) == 0 else 'tails';\n    print(result)\" | tee /tmp/output\n  command:\n  - sh\n  - -c\n  image: python:alpine3.6\n  name: \"\"\n  resources: {}\ninputs: {}\nmetadata: {}\nname: flip\noutputs:\n  artifacts:\n  - name: mlpipeline-ui-metadata\n    path: /mlpipeline-ui-metadata.json\n    s3:\n      accessKeySecret:\n        key: accesskey\n        name: mlpipeline-minio-artifact\n      bucket: mlpipeline\n      endpoint: minio-service.kubeflow:9000\n      insecure: true\n      key: runs/cb43580d-4697-11e9-a431-ba1598362405/pipeline-flip-coin-fzmf4-745333038/mlpipeline-ui-metadata.tgz\n      secretKeySecret:\n        key: secretkey\n        name: mlpipeline-minio-artifact\n  - name: mlpipeline-metrics\n    path: /mlpipeline-metrics.json\n    s3:\n      accessKeySecret:\n        key: accesskey\n        name: mlpipeline-minio-artifact\n      bucket: mlpipeline\n      endpoint: minio-service.kubeflow:9000\n      insecure: true\n      key: runs/cb43580d-4697-11e9-a431-ba1598362405/pipeline-flip-coin-fzmf4-745333038/mlpipeline-metrics.tgz\n      secretKeySecret:\n        key: secretkey\n        name: mlpipeline-minio-artifact\n  parameters:\n  - name: flip-output\n    valueFrom:\n      path: /tmp/output\n"
time="2019-03-14T20:29:00Z" level=info msg="Waiting on main container"
time="2019-03-14T20:29:00Z" level=warning msg="Polling root processes (1m0s)"
time="2019-03-14T20:29:00Z" level=info msg="Secured filehandle on /proc/18/root"
time="2019-03-14T20:29:00Z" level=info msg="containerID 3130eaf160ff149e0b35aaef5718001026b52413476615e41d82ef6ac92b86f2 mapped to pid 18"
time="2019-03-14T20:29:00Z" level=info msg="main container started with container ID: 3130eaf160ff149e0b35aaef5718001026b52413476615e41d82ef6ac92b86f2"
time="2019-03-14T20:29:00Z" level=info msg="Starting annotations monitor"
time="2019-03-14T20:29:00Z" level=info msg="Main pid identified as 18"
time="2019-03-14T20:29:00Z" level=info msg="Successfully secured file handle on main container root filesystem"
time="2019-03-14T20:29:00Z" level=info msg="Waiting for main pid 18 to complete"
time="2019-03-14T20:29:00Z" level=info msg="Starting deadline monitor"
time="2019-03-14T20:29:00Z" level=info msg="Stopped root processes polling due to successful securing of main root fs"
time="2019-03-14T20:29:01Z" level=info msg="Main pid 18 completed"
time="2019-03-14T20:29:01Z" level=info msg="Main container completed"
time="2019-03-14T20:29:01Z" level=info msg="No sidecars"
time="2019-03-14T20:29:01Z" level=info msg="Saving logs"
time="2019-03-14T20:29:01Z" level=info msg="Deadline monitor stopped"
time="2019-03-14T20:29:01Z" level=info msg="Annotations monitor stopped"
time="2019-03-14T20:29:01Z" level=info msg="S3 Save path: /argo/outputs/logs/main.log, key: artifacts/pipeline-flip-coin-fzmf4/pipeline-flip-coin-fzmf4-745333038/main.log"
time="2019-03-14T20:29:01Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2019-03-14T20:29:01Z" level=info msg="Saving from /argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/pipeline-flip-coin-fzmf4/pipeline-flip-coin-fzmf4-745333038/main.log)"
time="2019-03-14T20:29:03Z" level=info msg="Saving output parameters"
time="2019-03-14T20:29:03Z" level=info msg="Saving path output parameter: flip-output"
time="2019-03-14T20:29:03Z" level=info msg="Copying /tmp/output from base image layer"
time="2019-03-14T20:29:03Z" level=error msg="executor error: open /tmp/output: no such file or directory"
time="2019-03-14T20:29:03Z" level=info msg="Alloc=3858 TotalAlloc=12562 Sys=70078 NumGC=5 Goroutines=9"
time="2019-03-14T20:29:03Z" level=fatal msg="open /tmp/output: no such file or directory"

@jessesuen (Member) commented:

@Tomcli - is it possible to construct a smaller, portable workflow which can reproduce this? Also, there are some caveats to PNS that people need to be aware of: collection of artifacts from the base image layer is subject to race conditions when the main container exits too quickly.

Basically, the main container needs to be running for a few seconds for the wait sidecar to reliably secure the file handle on its root filesystem. If the main container exits too quickly, the wait sidecar may not have been able to secure the file handle in time to successfully collect artifacts.

My temporary work around is adding SYS_PTRACE and set privileged: true to the "wait" container. However, occasionally I still experiencing some race condition where the "wait" container is trying to copy the output files when it's not ready

Yes, I don't expect privileged mode to help.

However, an alternative workaround is to output the artifacts into an emptyDir volume mounted in the main container. In v2.3, when volumes are used, they are mirrored into the wait sidecar, which eliminates the race with artifact collection because the wait sidecar has access to the volume long after the main container has completed.
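
A minimal sketch of that workaround, with a hypothetical template that writes its output onto an emptyDir mount instead of the base image layer (names and paths here are illustrative only):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: emptydir-output-
spec:
  entrypoint: produce
  volumes:
  - name: out
    emptyDir: {}             # mirrored into the wait sidecar in v2.3+
  templates:
  - name: produce
    container:
      image: python:alpine3.6
      command:
      - sh
      - -c
      args:
      - python -c "import random; print(random.randint(0,9))" | tee /out/output
      volumeMounts:
      - name: out
        mountPath: /out      # write outputs here instead of e.g. /tmp
    outputs:
      parameters:
      - name: result
        valueFrom:
          path: /out/output  # lives on the emptyDir, so collection does not race with container exit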

@jessesuen (Member) commented Apr 5, 2019

My temporary work around is adding SYS_PTRACE and set privileged: true to the "wait" container.

Actually, I'm wrong. SYS_PTRACE is indeed needed when the user id of the main container is different from that of the wait sidecar.

However, occasionally I still experiencing some race condition where the "wait" container is trying to copy the output files when it's not ready

I'm also experiencing this race condition. I'm trying to find a solution, but it does seem timing-related.

@Tomcli (Author) commented Apr 5, 2019

Hi @jessesuen, thanks for the reply. Since adding SYS_PTRACE and privileged mode is not what we want, and we don't have a better way to work around PNS, we have switched to the k8sapi executor with emptyDir as our temporary solution.

@jessesuen (Member) commented:

Just to be clear, privileged is unnecessary, but SYS_PTRACE is. The latter is much more secure than having privileged pods.

@jessesuen (Member) commented:

@Tomcli I fixed the SYS_PTRACE issue, and also figured out the timing-related issue with failing to upload the artifact. PNS should be working much more reliably now in the latest version of the PR:

#1214

Given that privileged pods are unnecessary, I think you may want to reconsider PNS.

@animeshsingh commented:

Thanks @jessesuen - we will give it a try. With respect to the k8sapi executor - do you have a viewpoint? Ideally that should be the solution to use with CRI-O?

@jessesuen (Member) commented:

@animeshsingh there are pros and cons to each executor:

  1. Docker
     + supports all workflow examples
     + most reliable and well tested
     + very scalable; communicates with the Docker daemon for heavy lifting
     - least secure; requires the host's docker.sock to be mounted (often rejected by OPA)
  2. Kubelet
     + secure; cannot escape the privileges of the pod's service account
     + medium scalability; log retrieval and container polling are done against the kubelet
     - additional kubelet configuration may be required
     - can only save params/artifacts in volumes (e.g. emptyDir), not the base image layer (e.g. /tmp)
  3. K8s API
     + secure; cannot escape the privileges of the pod's service account
     + no extra configuration
     - least scalable; log retrieval and container polling are done against the k8s API server
     - can only save params/artifacts in volumes (e.g. emptyDir), not the base image layer (e.g. /tmp)
  4. PNS
     + secure; cannot escape the privileges of the service account
     + artifacts can be collected from the base image layer
     + scalable; process polling is done over procfs rather than the kubelet/k8s API
     - processes no longer run with pid 1
     - artifact collection from the base image layer may fail for containers that complete too quickly
     - cannot capture artifact directories from the base image layer that have a volume mounted under them
     - immature

IMO, PNS is the closest thing to the docker executor without the security concerns, and it is what I recommend, except for the fact that it is the least mature.

@animeshsingh commented Apr 10, 2019

Thanks @jessesuen for this comparison. Would going through the k8s APIs, despite the overhead, avoid the drawbacks introduced by the PNS randomness? Given that workflows are expected to be long-running jobs, as opposed to a serverless model where bypassing the k8s API has its merits with respect to response time, would it matter too much? Also, how important is it to store artifacts in the base image layer?

@jessesuen (Member) commented:

My feeling is PNS is the best compromise between security and functionality.

Would the overhead of going through k8s apis bypass the demerits introduced through some randomness using PNS?

The "randomness" of failing to collect artifacts is usually a non-issue unless containers are completing too quick. Even then, you can mitigate this by outputting the artifact to an emptyDir, and this would never be an issue.

Also how important it is to store the artifacts in base image layer?

Not necessary at all. It's just slightly more convenient not to have to define an emptyDir volume to collect artifacts.

Closing this bug since PNS has been merged.

@shimmerjs commented:

@jessesuen

  • artifact collection from base image may fail for containers which complete too fast

This is causing a bunch of race conditions for us. Should we open a separate issue for this on PNS, or do you have any recommendations on how to deal with it properly?

@jessesuen (Member) commented:

@booninite yes. To ensure that the wait sidecar is able to collect outputs, instead of writing outputs into the base image layer (such as /tmp), output artifacts into an emptyDir (which gets mirrored into the wait sidecar). This ensures that the wait sidecar can collect the artifact without being subject to timing problems.

@aeweidne (Contributor) commented Jul 2, 2019

@jessesuen we are still experiencing intermittent artifact-passing issues using emptyDir. Does the emptyDir additionally need to be mounted at a path that does not exist in the base image?

@animeshsingh commented:

@jessesuen the emptyDir isn't a foolproof solution - are there folks actually using the PNS executor in real-world scenarios?

@aeweidne (Contributor) commented Aug 5, 2019

We are running ~5k workflows per month that all use PNS. We only see consistent issues with extremely short duration steps, under 15 seconds.

@animeshsingh commented:

Tying this to some other folks raising these issues in the Kubeflow community:
kubeflow/pipelines#1654

@Kampe commented Jun 17, 2020

I see this same issue when trying to pass a single file between my workflows. Is the volume mount the solution?

@guoweis-work commented:

Yeah, seeing this with PNS as well. Not sure what to do here...

@ggogel commented Aug 12, 2020

Having the same issue. I'm running K3OS with CRI-O, so I can't use the docker executor. The other two, kubelet and k8sapi, simply won't work: kubelet gives me a certificate error, which the Helm chart doesn't give an option to ignore, and k8sapi gives me errors like "function not found"...

@alexec (Contributor) commented Aug 12, 2020

@sarabala1979 is the workaround for this emptyDir?

@ggogel commented Aug 12, 2020

I was finally able to get it running using the k8sapi executor with the following volume:

spec:
  volumes:
    - name: source
      emptyDir: {}

Sadly this breaks the functionality of the built-in git artifact support, because apparently it cannot write into a volume. I had to write my own git clone script. Also, this kind of makes artifact passing redundant, since I could just use this volume in every stage.
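
For illustration, a rough sketch of that kind of hand-rolled clone step writing into the mounted volume (assuming the volume above is named source; the image and repository URL are placeholders):

  - name: clone
    container:
      image: alpine/git:latest
      command:
      - sh
      - -c
      args:
      - git clone https://github.com/example/repo.git /src    # placeholder URL
      volumeMounts:
      - name: source
        mountPath: /src      # later steps mount the same volume to read the checkout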

stale bot commented Dec 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Dec 8, 2020
@bharathguvvala commented:

@alexec Is this issue also fixed, given that I see #4230 is fixed/closed?

stale bot removed the wontfix label Dec 8, 2020
@alexec (Contributor) commented Dec 8, 2020

Test it and see?

@foobarbecue (Contributor) commented Dec 21, 2020

I'm experiencing a similar problem on some of my containers, using af03a74 with PNS. Other containers doing almost identical work succeed, and if I keep retrying the workflow, everything succeeds eventually. It seems particular to PNS. Here's an example wait container log:

time="2020-12-20T23:40:18.772Z" level=info msg="Waiting on main container"
time="2020-12-20T23:40:24.920Z" level=info msg="main container started with container ID: 053703ca0cd2eceaf68f64286fe641f09f7bedfc3547f887f8f1eca5f8617706"
time="2020-12-20T23:40:24.920Z" level=info msg="Starting annotations monitor"
time="2020-12-20T23:40:24.927Z" level=info msg="containerID 053703ca0cd2eceaf68f64286fe641f09f7bedfc3547f887f8f1eca5f8617706 mapped to pid 41"
time="2020-12-20T23:40:24.927Z" level=info msg="Main pid identified as 41"
time="2020-12-20T23:40:24.927Z" level=warning msg="Failed to secure file handle on main container's root filesystem. Output artifacts from base image layer will fail"
time="2020-12-20T23:40:24.927Z" level=info msg="Waiting for main pid 41 to complete"
time="2020-12-20T23:40:24.927Z" level=info msg="Starting deadline monitor"
time="2020-12-20T23:41:53.967Z" level=info msg="Main pid 41 completed"
time="2020-12-20T23:41:54.006Z" level=info msg="Main container completed"
time="2020-12-20T23:41:54.012Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-12-20T23:41:54.012Z" level=info msg="Capturing script exit code"
time="2020-12-20T23:41:54.018Z" level=info msg="Getting exit code of 053703ca0cd2eceaf68f64286fe641f09f7bedfc3547f887f8f1eca5f8617706"
time="2020-12-20T23:41:54.080Z" level=info msg="Annotations monitor stopped"
time="2020-12-20T23:41:54.118Z" level=info msg="Deadline monitor stopped"
time="2020-12-20T23:41:57.058Z" level=error msg="executor error: could not get container status: timed out waiting for the condition"
time="2020-12-20T23:41:57.058Z" level=info msg="Killing sidecars"
time="2020-12-20T23:41:57.079Z" level=info msg="Alloc=5590 TotalAlloc=15483 Sys=70848 NumGC=5 Goroutines=10"
time="2020-12-20T23:41:57.159Z" level=fatal msg="could not get container status: timed out waiting for the condition"

@alexec (Contributor) commented Dec 21, 2020

I think "timed out waiting for the condition" might be a new issue. Has anyone compared the v2.11.8 executor vs the v2.12.2 executor? It could be caused by #4253.

@alexec (Contributor) commented Dec 21, 2020

We do not see "Failed to get main PID", so that rules out #4523.

@alexec (Contributor) commented Dec 21, 2020

OK. Diagnosis: there is a timeout when trying to determine whether the pod has finished. We allow three attempts at 1-second intervals. The main container has completed (which we determine using the shared process namespace), but we ask the Kubernetes API for the actual result, and the API has not been updated yet. This could be mitigated by increasing the amount of time we allow the executor to poll for on line 375 of pns.go. The core team is short-staffed through until 2021. @foobarbecue would you be interested in submitting a fix?
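
Not the actual Argo code, but a rough Go sketch of the kind of polling described above, using the k8s.io/apimachinery wait package (whose timeout error is the familiar "timed out waiting for the condition"); getContainerStatus is a hypothetical stand-in for the real Kubernetes API lookup:

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// getContainerStatus is a hypothetical stand-in for the executor's lookup of
// the terminated container's status via the Kubernetes API.
func getContainerStatus(containerID string) (bool, error) {
    // ... query the pod and inspect status.containerStatuses for a
    // terminated state; return (false, nil) while the API still lags ...
    return false, nil
}

// waitForContainerStatus retries the lookup. The diagnosis above is roughly
// three attempts at 1-second intervals; allowing more steps (or a longer
// interval) gives the API server time to catch up before the retries give up.
func waitForContainerStatus(containerID string) error {
    backoff := wait.Backoff{
        Duration: time.Second, // initial interval between attempts
        Factor:   2.0,         // grow the interval on each retry
        Steps:    6,           // more attempts than the original three
    }
    return wait.ExponentialBackoff(backoff, func() (bool, error) {
        return getContainerStatus(containerID)
    })
}

func main() {
    if err := waitForContainerStatus("053703ca0cd2"); err != nil {
        fmt.Println("could not get container status:", err)
    }
}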

@foobarbecue (Contributor) commented Dec 21, 2020

@alexec Sure, I can play with the timing and see if I come up with a good PR-worthy solution. Thanks for the detailed analysis.

@alexec (Contributor) commented Dec 21, 2020

Thank you!

@0-duke commented Dec 22, 2020

Hi guys,

I was experiencing the same issue a lot recently. Following the comment from @alexec above, I tried installing a previous Argo version and everything works well as usual.

The downgraded version I installed uses workflow-controller and argoexec v2.11.8.

@alexec (Contributor) commented Jan 28, 2021

It appears to me today that in some cases you must grant privileged mode for PNS to work with output artifacts.

Provider             Works
AWS                  Yes
Docker for Desktop   Yes
GCP                  No
K3D                  Yes
K3S                  No

@alexec (Contributor) commented Jan 28, 2021

Maybe fixed in #4954.

@alexec (Contributor) commented Feb 2, 2021

v3.0 will have a controller environment variable named PNS_PRIVILEGED.
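
A minimal sketch of that env addition on the workflow-controller Deployment (patch-style, not a complete manifest; the deployment and namespace names are the usual defaults, not confirmed in this thread):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
      - name: workflow-controller
        env:
        - name: PNS_PRIVILEGED
          value: "true"    # assumption: forces privileged executor containers when using PNS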

alexec closed this as completed Feb 2, 2021
alexec added this to the v3.0 milestone Feb 2, 2021
icecoffee531 pushed a commit to icecoffee531/argo-workflows that referenced this issue Jan 5, 2022