Workflow Retry Does Not Work for Plugin #11489

Closed
3 tasks done
sid8489 opened this issue Jul 31, 2023 · 8 comments · Fixed by #12620

Labels
area/plugins
area/retry-manual: Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries
P2: Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important
type/bug

Comments

sid8489 commented Jul 31, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

What happened:
When using plugins, if a plugin node goes into the Failed phase, workflow retry does not work. On retry, the plugin node is not retried and remains in the Failed phase.

What you expected to happen:
The plugin node should be retried.

Self-diagnosis:
Based on exploring the codebase, I see that the workflow controller loads plugin node state from the WorkflowTaskSet status. Ref.
On retry, the Argo Workflows server does remove the failed plugin node from the workflow's node status, but it does not patch the WorkflowTaskSet. Because of this, the workflow controller loads the failed plugin node status from the WorkflowTaskSet again.

IMO, on retry the Argo Workflows server should also patch the WorkflowTaskSet (remove the status for the failed plugin node).
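
To make the diagnosis above concrete, here is a minimal Python sketch of the suspected interaction. It is a toy model only: plain dicts stand in for the Workflow node status and the WorkflowTaskSet status, and the names `reconcile`, `retry`, and `patch_taskset` are hypothetical and not taken from the Argo codebase (which is written in Go).

```python
# Toy model only: plain dicts stand in for the Workflow node status and the
# WorkflowTaskSet status. None of these names come from the Argo codebase.

def reconcile(workflow_nodes, taskset_status):
    """Controller step: copy any node results found in the task set
    status back into the workflow's node status."""
    merged = dict(workflow_nodes)
    merged.update(taskset_status)
    return merged


def retry(workflow_nodes, taskset_status, patch_taskset):
    """Server-side retry: drop failed nodes from the workflow status.
    When patch_taskset is True, drop them from the task set status too."""
    failed = {name for name, phase in workflow_nodes.items() if phase == "Failed"}
    nodes = {n: p for n, p in workflow_nodes.items() if n not in failed}
    if patch_taskset:
        taskset_status = {n: p for n, p in taskset_status.items() if n not in failed}
    return nodes, taskset_status


nodes = {"add-one": "Failed"}
taskset = {"add-one": "Failed"}

# Behaviour described in this report: the stale result is loaded again.
n1, t1 = retry(nodes, taskset, patch_taskset=False)
after_current = reconcile(n1, t1)

# Proposed behaviour: nothing stale is left, so the node can be re-run.
n2, t2 = retry(nodes, taskset, patch_taskset=True)
after_proposed = reconcile(n2, t2)

print(after_current)   # {'add-one': 'Failed'}
print(after_proposed)  # {}
```

In the first case the node reappears as Failed right after the retry, which matches what I observe. In the second case the node is absent after reconciliation, so the controller would be free to schedule it again.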

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

---
# Plugin Config Map
# This is an auto-generated file. DO NOT EDIT
apiVersion: v1
data:
  sidecar.container: |
    args:
    - |
      import json
      import time
      from http.server import BaseHTTPRequestHandler, HTTPServer


      class Plugin(BaseHTTPRequestHandler):

          def args(self):
              return json.loads(self.rfile.read(int(self.headers.get('Content-Length'))))

          def reply(self, reply):
              self.send_response(200)
              self.end_headers()
              self.wfile.write(json.dumps(reply).encode("UTF-8"))

          def unsupported(self):
              self.send_response(404)
              self.end_headers()

          def do_POST(self):
              if self.path == '/api/v1/template.execute':
                  args = self.args()

                  template = args['template']
                  plugin = template.get('plugin', {})
                  print(args['template'])
                  print(args['workflow'])
    
                  if 'python' in plugin:
                      spec = plugin['python']
                      exit(-1)
                      # convert parameters into easy to use dict
                      # artifacts are not supported
                      parameters = {}
                      for parameter in template.get('inputs', {}).get('parameters', []):
                          parameters[parameter['name']] = parameter['value']

                      try:
                          code = compile(spec['expression'], "<string>", "eval")

                          # only allow certain names (primitive sand-boxing)
                          allowed_names = {
                              # allow common type conversions
                              'bool': bool,
                              'float': float,
                              'int': int,
                              'str': str,
                              # TODO - do  we want iterable built-ins, e.g. len, min, max
                              # allow input parameters
                              'parameters': parameters
                          }
                          if code.co_names:
                              for name in code.co_names:
                                  if name not in allowed_names:
                                      raise NameError(f"Use of name '{name}' not allowed")

                          result = eval(code, {"__builtins__": {}}, allowed_names)

                          # convert parameters back from easy to use dict
                          # artifacts are not supported
                          parameters = []
                          for key, value in result.items():
                              parameters.append({'name': key, 'value': value})

                          self.reply({'node': {'phase': 'Succeeded', 'outputs': {'parameters': parameters}}})

                      except Exception as ex:
                          self.reply({'node': {'phase': 'Failed', 'message': repr(ex)}})
                  else:
                      self.reply({})
              else:
                  self.unsupported()


      if __name__ == '__main__':
          httpd = HTTPServer(('', 7984), Plugin)
          httpd.serve_forever()
    command:
    - python
    - -c
    image: python:alpine
    name: python-executor-plugin
    ports:
    - containerPort: 7984
    resources:
      limits:
        cpu: 200m
        memory: 64Mi
      requests:
        cpu: 100m
        memory: 32Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
kind: ConfigMap
metadata:
  annotations:
    workflows.argoproj.io/description: |
      This plugin runs trusted Python expressions.

      Do not use it to run untrusted Python expressions.

      This plugin makes attempts to sandbox the expression. It removes built-ins that would allow disk or network access.
      The plugin itself is allowed limited CPU and memory, and is, of course, contained.
    workflows.argoproj.io/version: '>= v3.3'
  creationTimestamp: null
  labels:
    workflows.argoproj.io/configmap-type: ExecutorPlugin
  name: python-executor-plugin
---
# Workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: python-example-
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: value
        value: "1"
  templates:
    - name: main
      steps:
        - - name: add-one
            template: add-one
            arguments:
              parameters:
                - name: value
                  value: "{{workflow.parameters.value}}"
        - - name: print-sum
            template: print-sum
            arguments:
              parameters:
                - name: sum
                  value: "{{steps.add-one.outputs.parameters.sum}}"
        - - name: sleep
            template: sleep
        - - name: add-two
            template: add-one
            arguments:
              parameters:
                - name: value
                  value: "{{workflow.parameters.value}}"

    - name: add-one
      inputs:
        parameters:
          - name: value
      plugin:
        python:
          expression: |
            {"sum": int(parameters["value"]) + 1}
      outputs:
        parameters:
          - name: sum
            # you must specify "value" or "valueFrom", but what you specify does not matter
            valueFrom:
              supplied: { }

    - name: print-sum
      inputs:
        parameters:
          - name: sum
      container:
        image: alpine:3.16
        command: [sh, -c]
        args:
        - echo "{{inputs.parameters.sum}}"
    - name: sleep
      container:
        image: alpine:3.16
        command: [sh, -c]
        args:
        - sleep 30
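
For reference, here is a small standalone Python sketch of what the plugin's `do_POST` handler would compute for the `add-one` template if it got past the `exit(-1)` call. It reuses the expression and the allowed-names check from the ConfigMap above, without the HTTP server. The input value `"1"` is the workflow parameter from the Workflow above. In this reproduction the plugin appears to exit before replying, which is consistent with the "context deadline exceeded" message in the controller logs.

```python
# Same steps as the plugin's do_POST handler, without the HTTP server
# and without the exit(-1) call.
expression = '{"sum": int(parameters["value"]) + 1}'
parameters = {"value": "1"}  # workflow.parameters.value from the Workflow above

code = compile(expression, "<string>", "eval")

# Only these names may appear in the expression (primitive sandboxing).
allowed_names = {
    "bool": bool,
    "float": float,
    "int": int,
    "str": str,
    "parameters": parameters,
}
for name in code.co_names:
    if name not in allowed_names:
        raise NameError(f"Use of name '{name}' not allowed")

result = eval(code, {"__builtins__": {}}, allowed_names)

# Convert the result dict back into the parameter list the reply carries.
outputs = [{"name": key, "value": value} for key, value in result.items()]
reply = {"node": {"phase": "Succeeded", "outputs": {"parameters": outputs}}}
print(reply)
```

So a healthy plugin would report the node as Succeeded with `sum` set to 2. The reproduction only needs the node to end up Failed, which the early exit achieves.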

Logs from the workflow controller

time="2023-07-31T12:41:01Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2023-07-31T12:41:01Z" level=info msg="cron config" cronSyncPeriod=10s
time="2023-07-31T12:41:01Z" level=info msg="Memoization caches will be garbage-collected if they have not been hit after" gcAfterNotHitDuration=30s
time="2023-07-31T12:41:01.291Z" level=info msg="not enabling pprof debug endpoints"
time="2023-07-31T12:41:01.293Z" level=info msg="config map" name=weave-argo-workflows-controller-configmap
time="2023-07-31T12:41:01.300Z" level=info msg="Get configmaps 200"
time="2023-07-31T12:41:01.303Z" level=info msg="Configuration:\nartifactRepository: {}\ncontainerRuntimeExecutor: emissary\ninitialDelay: 0s\nmetricsConfig:\n  enabled: true\n  path: /metrics\n  port: 9090\nnodeEvents: {}\npersistence:\n  archive: true\n  connectionPool:\n    maxIdleConns: 100\n  postgresql:\n    database: argo_workflows\n    host: weave-postgresql-hl\n    passwordSecret:\n      key: password\n      name: argo-postgres-config\n    port: 5432\n    tableName: argo_workflows\n    userNameSecret:\n      key: username\n      name: argo-postgres-config\npodSpecLogStrategy: {}\ntelemetryConfig: {}\n"
time="2023-07-31T12:41:01.303Z" level=info msg="Persistence configuration enabled"
time="2023-07-31T12:41:01.303Z" level=info msg="Creating DB session"
time="2023-07-31T12:41:01.307Z" level=info msg="Get secrets 200"
time="2023-07-31T12:41:01.311Z" level=info msg="Get secrets 200"
time="2023-07-31T12:41:01.320Z" level=info msg="Persistence Session created successfully"
time="2023-07-31T12:41:01.322Z" level=info msg="Migrating database schema" clusterName=default dbType=postgres
time="2023-07-31T12:41:01.443Z" level=info msg="Node status offloading is disabled"
time="2023-07-31T12:41:01.443Z" level=info msg="Workflow archiving is enabled"
I0731 12:41:01.444261       1 leaderelection.go:248] attempting to acquire leader lease weavelocal/workflow-controller...
time="2023-07-31T12:41:01.446Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:01.447Z" level=info msg="new leader" leader=weave-argo-workflows-controller-5f69bc78c4-dflnq
time="2023-07-31T12:41:12.034Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:19.819Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:19.829Z" level=info msg="Update leases 200"
I0731 12:41:19.832726       1 leaderelection.go:258] successfully acquired lease weavelocal/workflow-controller
time="2023-07-31T12:41:19.833Z" level=info msg="new leader" leader=weave-argo-workflows-controller-5f69bc78c4-bh2qn
time="2023-07-31T12:41:19.833Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Lease\", Namespace:\"weavelocal\", Name:\"workflow-controller\", UID:\"92c65600-2a9c-4f02-806d-5a76dabafee9\", APIVersion:\"coordination.k8s.io/v1\", ResourceVersion:\"22790\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' weave-argo-workflows-controller-5f69bc78c4-bh2qn became leader"
time="2023-07-31T12:41:19.833Z" level=info msg="Starting Workflow Controller" defaultRequeueTime=10s version=v3.3.8
time="2023-07-31T12:41:19.834Z" level=info msg="Current Worker Numbers" podCleanup=4 workflow=32 workflowTtl=4
time="2023-07-31T12:41:19.834Z" level=info msg="Watching task results" labelSelector="!workflows.argoproj.io/controller-instanceid,workflows.argoproj.io/workflow"
time="2023-07-31T12:41:19.834Z" level=info msg=Plugins executorPlugins=true
time="2023-07-31T12:41:19.842Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:19.860Z" level=info msg="Create events 201"
time="2023-07-31T12:41:19.861Z" level=info msg="Update leases 200"
time="2023-07-31T12:41:19.868Z" level=info msg="List workflows 200"
time="2023-07-31T12:41:19.871Z" level=info msg="Manager initialized successfully"
time="2023-07-31T12:41:19.871Z" level=debug msg=CanI name= namespace=weavelocal resource=clusterworkflowtemplates verb=get
time="2023-07-31T12:41:19.874Z" level=info msg="List workflowtaskresults 200"
time="2023-07-31T12:41:19.874Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2023-07-31T12:41:19.875Z" level=debug msg=CanI name= namespace=weavelocal resource=clusterworkflowtemplates status="{true false RBAC: allowed by ClusterRoleBinding \"weave-argo-workflows-controller-cluster-template\" of ClusterRole \"weave-argo-workflows-controller-cluster-template\" to ServiceAccount \"weave-argo-workflows-controller/weavelocal\" }" verb=get
time="2023-07-31T12:41:19.875Z" level=debug msg=CanI name= namespace=weavelocal resource=clusterworkflowtemplates verb=list
time="2023-07-31T12:41:19.875Z" level=info msg="List workflowtemplates 200"
time="2023-07-31T12:41:19.875Z" level=info msg="Watch workflowtaskresults 200"
time="2023-07-31T12:41:19.875Z" level=info msg="List configmaps 200"
time="2023-07-31T12:41:19.875Z" level=info msg="List configmaps 200"
time="2023-07-31T12:41:19.876Z" level=info msg="Executor plugin added" name=python-executor-plugin namespace=weavelocal
time="2023-07-31T12:41:19.877Z" level=info msg="Watch workflowtemplates 200"
time="2023-07-31T12:41:19.877Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2023-07-31T12:41:19.877Z" level=info msg="List workflowtasksets 200"
time="2023-07-31T12:41:19.880Z" level=info msg="Watch configmaps 200"
time="2023-07-31T12:41:19.880Z" level=debug msg=CanI name= namespace=weavelocal resource=clusterworkflowtemplates status="{true false RBAC: allowed by ClusterRoleBinding \"weave-argo-workflows-controller-cluster-template\" of ClusterRole \"weave-argo-workflows-controller-cluster-template\" to ServiceAccount \"weave-argo-workflows-controller/weavelocal\" }" verb=list
time="2023-07-31T12:41:19.880Z" level=debug msg=CanI name= namespace=weavelocal resource=clusterworkflowtemplates verb=watch
time="2023-07-31T12:41:19.880Z" level=info msg="List pods 200"
time="2023-07-31T12:41:19.881Z" level=info msg="List workflows 200"
time="2023-07-31T12:41:19.880Z" level=info msg="Watch configmaps 200"
time="2023-07-31T12:41:19.880Z" level=info msg="Watch configmaps 200"
time="2023-07-31T12:41:19.881Z" level=debug msg="received config map weavelocal/artifact-repositories update"
time="2023-07-31T12:41:19.884Z" level=debug msg="received config map weavelocal/weave-minio update"
time="2023-07-31T12:41:19.884Z" level=debug msg="received config map weavelocal/olympus-cluster-properties-configmap update"
time="2023-07-31T12:41:19.884Z" level=debug msg="received config map weavelocal/weave-argo-workflows-controller-configmap update"
time="2023-07-31T12:41:19.884Z" level=debug msg="received config map weavelocal/python-executor-plugin update"
time="2023-07-31T12:41:19.886Z" level=info msg="Watch pods 200"
time="2023-07-31T12:41:19.886Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2023-07-31T12:41:19.887Z" level=info msg="Watch workflowtasksets 200"
time="2023-07-31T12:41:19.887Z" level=debug msg="received config map weavelocal/kube-root-ca.crt update"
time="2023-07-31T12:41:19.888Z" level=debug msg=CanI name= namespace=weavelocal resource=clusterworkflowtemplates status="{true false RBAC: allowed by ClusterRoleBinding \"weave-argo-workflows-controller-cluster-template\" of ClusterRole \"weave-argo-workflows-controller-cluster-template\" to ServiceAccount \"weave-argo-workflows-controller/weavelocal\" }" verb=watch
time="2023-07-31T12:41:19.891Z" level=info msg="List clusterworkflowtemplates 200"
time="2023-07-31T12:41:19.892Z" level=info msg="Watch clusterworkflowtemplates 200"
time="2023-07-31T12:41:19.897Z" level=info msg="Watch workflows 200"
time="2023-07-31T12:41:19.989Z" level=info msg="Performing periodic GC" periodicity=5m0s
time="2023-07-31T12:41:19.989Z" level=info msg="Archived workflows TTL zero - so archived workflow GC disabled - you must restart the controller if you enable this"
time="2023-07-31T12:41:19.989Z" level=info msg="Starting workflow garbage collector controller (retentionWorkers 4)"
time="2023-07-31T12:41:19.989Z" level=info msg="Started workflow garbage collection"
time="2023-07-31T12:41:19.989Z" level=info msg="Starting CronWorkflow controller"
time="2023-07-31T12:41:19.989Z" level=debug msg="Check the workflow existence"
time="2023-07-31T12:41:19.989Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
W0731 12:41:19.996808       1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
time="2023-07-31T12:41:20.000Z" level=debug msg="Syncing all CronWorkflows"
time="2023-07-31T12:41:20.003Z" level=info msg="List workflows 200"
time="2023-07-31T12:41:20.011Z" level=info msg="List cronworkflows 200"
time="2023-07-31T12:41:20.014Z" level=info msg="Watch workflows 200"
time="2023-07-31T12:41:20.014Z" level=info msg="Watch cronworkflows 200"
time="2023-07-31T12:41:24.869Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:24.881Z" level=info msg="Update leases 200"
time="2023-07-31T12:41:29.888Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:29.893Z" level=info msg="Update leases 200"
time="2023-07-31T12:41:30.002Z" level=debug msg="Syncing all CronWorkflows"
time="2023-07-31T12:41:34.902Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:34.909Z" level=info msg="Update leases 200"
time="2023-07-31T12:41:39.919Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:39.927Z" level=info msg="Update leases 200"
time="2023-07-31T12:41:40.003Z" level=debug msg="Syncing all CronWorkflows"
time="2023-07-31T12:41:44.935Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:44.944Z" level=info msg="Update leases 200"
time="2023-07-31T12:41:47.826Z" level=info msg="Processing workflow" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.829Z" level=info msg="Task-result reconciliation" namespace=weavelocal numObjs=0 workflow=python-example-vxtdv
time="2023-07-31T12:41:47.829Z" level=debug msg="Evaluating node python-example-vxtdv: template: *v1alpha1.WorkflowStep (main), boundaryID: " namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.829Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (main)"
time="2023-07-31T12:41:47.829Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (main)"
time="2023-07-31T12:41:47.829Z" level=debug msg="Getting the template by name" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (main)"
time="2023-07-31T12:41:47.829Z" level=debug msg="Executing node python-example-vxtdv of Steps is Running" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.830Z" level=debug msg="Evaluating node python-example-vxtdv[0].add-one: template: *v1alpha1.WorkflowStep (add-one), boundaryID: python-example-vxtdv" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.830Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (add-one)"
time="2023-07-31T12:41:47.830Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (add-one)"
time="2023-07-31T12:41:47.830Z" level=debug msg="Getting the template by name" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (add-one)"
time="2023-07-31T12:41:47.834Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.NodeStatus (main)"
time="2023-07-31T12:41:47.835Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.NodeStatus (main)"
time="2023-07-31T12:41:47.835Z" level=debug msg="Getting the template by name" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.NodeStatus (main)"
time="2023-07-31T12:41:47.835Z" level=debug msg="Initializing node python-example-vxtdv[0].add-one: template: *v1alpha1.WorkflowStep (add-one), boundaryID: python-example-vxtdv" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.835Z" level=info msg="Plugin node python-example-vxtdv-3605899385 initialized Pending" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.835Z" level=info msg="Workflow step group node python-example-vxtdv-1868005945 not yet completed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.835Z" level=info msg="TaskSet Reconciliation" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.835Z" level=info msg="Creating TaskSet" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.836Z" level=debug msg="creating new taskset" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.854Z" level=info msg="Create workflowtasksets 409"
time="2023-07-31T12:41:47.855Z" level=debug msg="patching the exiting taskset" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.864Z" level=info msg="Patch workflowtasksets 200"
time="2023-07-31T12:41:47.865Z" level=info msg=reconcileAgentPod namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.872Z" level=info msg="Get secrets 404"
time="2023-07-31T12:41:47.881Z" level=info msg="Get serviceaccounts 200"
time="2023-07-31T12:41:47.883Z" level=debug msg="Creating Agent pod" namespace=weavelocal podName=python-example-vxtdv-1340600742-agent workflow=python-example-vxtdv
time="2023-07-31T12:41:47.995Z" level=info msg="Create pods 201"
time="2023-07-31T12:41:47.998Z" level=info msg="Created Agent pod" namespace=weavelocal podName=python-example-vxtdv-1340600742-agent workflow=python-example-vxtdv
time="2023-07-31T12:41:47.998Z" level=info msg=updateAgentPodStatus namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:47.998Z" level=info msg=assessAgentPodStatus namespace=weavelocal podName=python-example-vxtdv-1340600742-agent
time="2023-07-31T12:41:48.005Z" level=debug msg="Log changes patch: {\"status\":{\"nodes\":{\"python-example-vxtdv-1868005945\":{\"children\":[\"python-example-vxtdv-3605899385\"]},\"python-example-vxtdv-3605899385\":{\"boundaryID\":\"python-example-vxtdv\",\"displayName\":\"add-one\",\"finishedAt\":\"2023-07-31T12:41:47Z\",\"id\":\"python-example-vxtdv-3605899385\",\"inputs\":{\"parameters\":[{\"name\":\"value\",\"value\":\"1\"}]},\"message\":\"Post \\\"http://localhost:7984/api/v1/template.execute\\\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\",\"name\":\"python-example-vxtdv[0].add-one\",\"phase\":\"Failed\",\"startedAt\":\"2023-07-31T12:41:47Z\",\"templateName\":\"add-one\",\"templateScope\":\"local/python-example-vxtdv\",\"type\":\"Plugin\"}}}}"
time="2023-07-31T12:41:48.006Z" level=info msg="Workflow to be dehydrated" Workflow Size=2497
time="2023-07-31T12:41:48.059Z" level=info msg="Update workflows 200"
time="2023-07-31T12:41:48.075Z" level=info msg="Workflow update successful" namespace=weavelocal phase=Running resourceVersion=22809 workflow=python-example-vxtdv
time="2023-07-31T12:41:48.077Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Workflow\", Namespace:\"weavelocal\", Name:\"python-example-vxtdv\", UID:\"f9e024d8-188c-4e37-9623-1b73523fe0cb\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"22809\", FieldPath:\"\"}): type: 'Normal' reason: 'WorkflowNodeRunning' Running node python-example-vxtdv[0].add-one: Post \"http://localhost:7984/api/v1/template.execute\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
time="2023-07-31T12:41:48.077Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Workflow\", Namespace:\"weavelocal\", Name:\"python-example-vxtdv\", UID:\"f9e024d8-188c-4e37-9623-1b73523fe0cb\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"22809\", FieldPath:\"\"}): type: 'Warning' reason: 'WorkflowNodeFailed' Failed node python-example-vxtdv[0].add-one: Post \"http://localhost:7984/api/v1/template.execute\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
time="2023-07-31T12:41:48.129Z" level=info msg="Patch workflowtasksets 200"
time="2023-07-31T12:41:48.196Z" level=info msg="Create events 201"
time="2023-07-31T12:41:48.235Z" level=info msg="Create events 201"
time="2023-07-31T12:41:49.956Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:49.965Z" level=info msg="Update leases 200"
time="2023-07-31T12:41:50.004Z" level=debug msg="Syncing all CronWorkflows"
time="2023-07-31T12:41:54.978Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:54.987Z" level=info msg="Update leases 200"
time="2023-07-31T12:41:57.871Z" level=info msg="Processing workflow" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.872Z" level=info msg="Task-result reconciliation" namespace=weavelocal numObjs=0 workflow=python-example-vxtdv
time="2023-07-31T12:41:57.873Z" level=info msg=updateAgentPodStatus namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.873Z" level=info msg=assessAgentPodStatus namespace=weavelocal podName=python-example-vxtdv-1340600742-agent
time="2023-07-31T12:41:57.873Z" level=debug msg="Evaluating node python-example-vxtdv: template: *v1alpha1.WorkflowStep (main), boundaryID: " namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.873Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (main)"
time="2023-07-31T12:41:57.873Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (main)"
time="2023-07-31T12:41:57.873Z" level=debug msg="Getting the template by name" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (main)"
time="2023-07-31T12:41:57.875Z" level=debug msg="Executing node python-example-vxtdv of Steps is Running" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.875Z" level=debug msg="Evaluating node python-example-vxtdv[0].add-one: template: *v1alpha1.WorkflowStep (add-one), boundaryID: python-example-vxtdv" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.875Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (add-one)"
time="2023-07-31T12:41:57.875Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (add-one)"
time="2023-07-31T12:41:57.875Z" level=debug msg="Getting the template by name" base="*v1alpha1.Workflow (namespace=weavelocal,name=python-example-vxtdv)" depth=0 tmpl="*v1alpha1.WorkflowStep (add-one)"
time="2023-07-31T12:41:57.875Z" level=debug msg="Node python-example-vxtdv[0].add-one already completed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.875Z" level=info msg="Step group node python-example-vxtdv-1868005945 deemed failed: child 'python-example-vxtdv-3605899385' failed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.875Z" level=info msg="node python-example-vxtdv-1868005945 phase Running -> Failed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.875Z" level=info msg="node python-example-vxtdv-1868005945 message: child 'python-example-vxtdv-3605899385' failed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.875Z" level=info msg="node python-example-vxtdv-1868005945 finished: 2023-07-31 12:41:57.875968749 +0000 UTC" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.875Z" level=info msg="step group python-example-vxtdv-1868005945 was unsuccessful: child 'python-example-vxtdv-3605899385' failed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="Outbound nodes of python-example-vxtdv-3605899385 is [python-example-vxtdv-3605899385]" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="Outbound nodes of python-example-vxtdv is [python-example-vxtdv-3605899385]" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="node python-example-vxtdv phase Running -> Failed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="node python-example-vxtdv message: child 'python-example-vxtdv-3605899385' failed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="node python-example-vxtdv finished: 2023-07-31 12:41:57.876082534 +0000 UTC" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="Checking daemoned children of python-example-vxtdv" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="TaskSet Reconciliation" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg=reconcileAgentPod namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="Updated phase Running -> Failed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="Updated message  -> child 'python-example-vxtdv-3605899385' failed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="Marking workflow completed" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="Marking workflow as pending archiving" namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.876Z" level=info msg="Checking daemoned children of " namespace=weavelocal workflow=python-example-vxtdv
time="2023-07-31T12:41:57.878Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Workflow\", Namespace:\"weavelocal\", Name:\"python-example-vxtdv\", UID:\"f9e024d8-188c-4e37-9623-1b73523fe0cb\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"22809\", FieldPath:\"\"}): type: 'Warning' reason: 'WorkflowFailed' child 'python-example-vxtdv-3605899385' failed"
time="2023-07-31T12:41:57.880Z" level=debug msg="Log changes patch: {\"metadata\":{\"labels\":{\"workflows.argoproj.io/completed\":\"true\",\"workflows.argoproj.io/phase\":\"Failed\",\"workflows.argoproj.io/workflow-archiving-status\":\"Pending\"}},\"status\":{\"conditions\":[{\"status\":\"False\",\"type\":\"PodRunning\"},{\"status\":\"True\",\"type\":\"Completed\"}],\"finishedAt\":\"2023-07-31T12:41:57Z\",\"message\":\"child 'python-example-vxtdv-3605899385' failed\",\"nodes\":{\"python-example-vxtdv\":{\"finishedAt\":\"2023-07-31T12:41:57Z\",\"message\":\"child 'python-example-vxtdv-3605899385' failed\",\"outboundNodes\":[\"python-example-vxtdv-3605899385\"],\"phase\":\"Failed\"},\"python-example-vxtdv-1868005945\":{\"finishedAt\":\"2023-07-31T12:41:57Z\",\"message\":\"child 'python-example-vxtdv-3605899385' failed\",\"phase\":\"Failed\"},\"python-example-vxtdv-3605899385\":{\"finishedAt\":\"2023-07-31T12:41:57Z\"}},\"phase\":\"Failed\"}}"
time="2023-07-31T12:41:57.880Z" level=info msg="Workflow to be dehydrated" Workflow Size=2789
time="2023-07-31T12:41:57.881Z" level=info msg="cleaning up pod" action=deletePod key=weavelocal/python-example-vxtdv-1340600742-agent/deletePod
time="2023-07-31T12:41:57.898Z" level=info msg="Create events 201"
time="2023-07-31T12:41:57.905Z" level=info msg="Update workflows 200"
time="2023-07-31T12:41:57.906Z" level=info msg="Workflow update successful" namespace=weavelocal phase=Failed resourceVersion=22832 workflow=python-example-vxtdv
time="2023-07-31T12:41:57.906Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Workflow\", Namespace:\"weavelocal\", Name:\"python-example-vxtdv\", UID:\"f9e024d8-188c-4e37-9623-1b73523fe0cb\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"22832\", FieldPath:\"\"}): type: 'Warning' reason: 'WorkflowNodeFailed' Failed node python-example-vxtdv: child 'python-example-vxtdv-3605899385' failed"
time="2023-07-31T12:41:57.906Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Workflow\", Namespace:\"weavelocal\", Name:\"python-example-vxtdv\", UID:\"f9e024d8-188c-4e37-9623-1b73523fe0cb\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"22832\", FieldPath:\"\"}): type: 'Warning' reason: 'WorkflowNodeFailed' Failed node python-example-vxtdv[0]: child 'python-example-vxtdv-3605899385' failed"
time="2023-07-31T12:41:57.918Z" level=info msg="Delete pods 200"
time="2023-07-31T12:41:57.936Z" level=info msg="Create events 201"
time="2023-07-31T12:41:57.941Z" level=info msg="Patch workflowtasksets 200"
time="2023-07-31T12:41:57.952Z" level=info msg="Create events 201"
time="2023-07-31T12:41:57.961Z" level=info msg="DeleteCollection workflowtaskresults 200"
time="2023-07-31T12:41:57.969Z" level=info msg="Patch workflowtasksets 200"
time="2023-07-31T12:41:57.969Z" level=info msg="archiving workflow" namespace=weavelocal uid=f9e024d8-188c-4e37-9623-1b73523fe0cb workflow=python-example-vxtdv
time="2023-07-31T12:41:57.969Z" level=debug msg="Archiving workflow" labels="map[workflows.argoproj.io/completed:true workflows.argoproj.io/phase:Failed workflows.argoproj.io/workflow-archiving-status:Pending]" uid=f9e024d8-188c-4e37-9623-1b73523fe0cb
time="2023-07-31T12:41:58.002Z" level=info msg="Patch workflows 200"
time="2023-07-31T12:41:59.993Z" level=info msg="Get leases 200"
time="2023-07-31T12:41:59.999Z" level=info msg="Update leases 200"
time="2023-07-31T12:42:00.005Z" level=debug msg="Syncing all CronWorkflows"

Logs from your workflow's wait container

init time="2023-07-31T14:57:10.844Z" level=info msg="creating token file for plugin" filename=/var/run/argo/python-executor-plugin/token plugin=python-executor-plugin
main time="2023-07-31T14:57:12.102Z" level=info msg="Starting Workflow Executor" version=v3.3.8
main time="2023-07-31T14:57:12.105Z" level=info msg="loading token file for plugin" filename=/var/run/argo/python-executor-plugin/token plugin=python-executor-plugin
main time="2023-07-31T14:57:12.105Z" level=info msg="Starting Agent" requeueTime=10s taskWorkers=16 workflow=python-example-vxtdv
main time="2023-07-31T14:57:12.115Z" level=info msg="Watch workflowtasksets 200"
main time="2023-07-31T14:57:12.118Z" level=info msg="TaskSet Event" event_type=ADDED workflow=python-example-vxtdv
@sid8489
Author

sid8489 commented Jul 31, 2023

To put the plugin node into the Failed state, I deliberately added exit -1 to the plugin script.

@terrytangyuan
Member

terrytangyuan commented Aug 2, 2023

  • I'd like to contribute the fix myself

Looking forward to your PR!

@juliev0 juliev0 added the P3 Low priority label Aug 3, 2023
@agilgur5 agilgur5 added area/plugins area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Aug 10, 2023
@stale

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the problem/stale This has not had a response in some time label Sep 17, 2023
@terrytangyuan terrytangyuan removed the problem/stale This has not had a response in some time label Sep 20, 2023
@jswxstw
Member

jswxstw commented Feb 2, 2024

IMO Upon Retry Argo Workflow Server should patch the WorkflowTaskSet (Remove the status for the failed Plugin Node).

I wonder why the controller does not clean up the WorkflowTaskSet when the workflow completes. If it did, patching the failed node status during a manual retry could be skipped.

@jswxstw
Member

jswxstw commented Feb 4, 2024

if woc.wf.Status.Fulfilled() {
	err := woc.completeTaskSet(ctx)
	if err != nil {
		log.WithError(err).Warn("error to complete the taskset")
	}
}

err = woc.removeCompletedTaskSetStatus(ctx)

func (woc *wfOperationCtx) getDeleteTaskAndNodePatch() map[string]interface{} {
	deletedNode := make(map[string]interface{})
	for _, node := range woc.wf.Status.Nodes {
		if (node.Type == wfv1.NodeTypeHTTP || node.Type == wfv1.NodeTypePlugin) && node.Fulfilled() {
			deletedNode[node.ID] = nil
		}
	}
	// Delete the completed Tasks and nodes status
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"tasks": deletedNode,
		},
		"status": map[string]interface{}{
			"nodes": deletedNode,
		},
	}
	return patch
}

The controller tries to delete completed nodes of HTTP/Plugin type.
However, the node delete patch has no effect on the status field.
I also tried a JSON patch like '[{"op": "remove", "path": "/status/nodes/plugin-demo-g5qfg"}]', but that did not work either.
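
For readers following along, here is a minimal, standalone sketch of what the patch built by getDeleteTaskAndNodePatch looks like on the wire. It only mirrors the map shape from the snippet above; buildDeletePatch is a hypothetical helper name, not a function in the Argo codebase. The point is that a nil map value marshals to JSON null, which a JSON merge patch interprets as "delete this key".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildDeletePatch (hypothetical helper) mirrors the shape of the patch in
// the snippet above: every node ID maps to nil, which encodes as JSON null.
// In a JSON merge patch, null means "remove this key from the target".
func buildDeletePatch(nodeIDs []string) ([]byte, error) {
	deleted := make(map[string]interface{}, len(nodeIDs))
	for _, id := range nodeIDs {
		deleted[id] = nil
	}
	patch := map[string]interface{}{
		"spec":   map[string]interface{}{"tasks": deleted},
		"status": map[string]interface{}{"nodes": deleted},
	}
	return json.Marshal(patch)
}

func main() {
	data, err := buildDeletePatch([]string{"plugin-demo-g5qfg"})
	if err != nil {
		panic(err)
	}
	// Prints: {"spec":{"tasks":{"plugin-demo-g5qfg":null}},"status":{"nodes":{"plugin-demo-g5qfg":null}}}
	fmt.Println(string(data))
}
```

The patch body itself is well-formed, so the reason the status entry survives has to lie in how the API server handles the request rather than in the patch content.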

@jswxstw
Member

jswxstw commented Feb 4, 2024

As the WorkflowTaskSet CRD below shows, the status field is enabled as a subresource.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apiextensions.k8s.io/v1","kind":"CustomResourceDefinition","metadata":{"annotations":{},"labels":{"app.kubernetes.io/part-of":"argo"},"name":"workflowtasksets.argoproj.io"},"spec":{"group":"argoproj.io","names":{"kind":"WorkflowTaskSet","listKind":"WorkflowTaskSetList","plural":"workflowtasksets","shortNames":["wfts"],"singular":"workflowtaskset"},"scope":"Namespaced","versions":[{"name":"v1alpha1","schema":{"openAPIV3Schema":{"properties":{"apiVersion":{"type":"string"},"kind":{"type":"string"},"metadata":{"type":"object"},"spec":{"type":"object","x-kubernetes-map-type":"atomic","x-kubernetes-preserve-unknown-fields":true},"status":{"type":"object","x-kubernetes-map-type":"atomic","x-kubernetes-preserve-unknown-fields":true}},"required":["metadata","spec"],"type":"object"}},"served":true,"storage":true,"subresources":{"status":{}}}]}}
  creationTimestamp: "2024-01-26T03:22:45Z"
  generation: 1
  labels:
    app.kubernetes.io/part-of: argo
  name: workflowtasksets.argoproj.io
  resourceVersion: "1011"
  uid: 7fc3e411-435d-4b29-a0ea-75fc25510304
spec:
  conversion:
    strategy: None
  group: argoproj.io
  names:
    kind: WorkflowTaskSet
    listKind: WorkflowTaskSetList
    plural: workflowtasksets
    shortNames:
    - wfts
    singular: workflowtaskset
  scope: Namespaced
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        properties:
          apiVersion:
            type: string
          kind:
            type: string
          metadata:
            type: object
          spec:
            type: object
            x-kubernetes-map-type: atomic
            x-kubernetes-preserve-unknown-fields: true
          status: 
            type: object
            x-kubernetes-map-type: atomic
            x-kubernetes-preserve-unknown-fields: true
        required:
        - metadata
        - spec
        type: object
    served: true
    storage: true
    subresources:
      status: {}
status:
  acceptedNames:
    kind: WorkflowTaskSet
    listKind: WorkflowTaskSetList
    plural: workflowtasksets
    shortNames:
    - wfts
    singular: workflowtaskset
  conditions:
  - lastTransitionTime: "2024-01-26T03:22:45Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: "2024-01-26T03:22:45Z"
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  storedVersions:
  - v1alpha1
So, the `status` field now needs to be modified with a command like `kubectl patch wfts ${workflowTaskSetName} --subresource='status'`.
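
To illustrate the behavior described here, the following is a small in-memory simulation, not real API server code. It assumes the documented Kubernetes rule that, once the status subresource is enabled, a PATCH to the main resource ignores changes under status, and a PATCH to the status subresource ignores everything except status. All function names (mergePatch, patchMain, patchStatus, newTaskSet, deletePatch, hasNode) are made up for this sketch.

```go
package main

import "fmt"

// mergePatch applies a simplified JSON merge patch in place:
// nil deletes a key, nested maps merge recursively, other values replace.
func mergePatch(target, patch map[string]interface{}) map[string]interface{} {
	if target == nil {
		target = map[string]interface{}{}
	}
	for k, v := range patch {
		if v == nil {
			delete(target, k)
			continue
		}
		if pm, ok := v.(map[string]interface{}); ok {
			tm, _ := target[k].(map[string]interface{})
			target[k] = mergePatch(tm, pm)
			continue
		}
		target[k] = v
	}
	return target
}

// patchMain models a PATCH on the main resource: "status" changes are dropped.
func patchMain(obj, patch map[string]interface{}) {
	filtered := map[string]interface{}{}
	for k, v := range patch {
		if k != "status" {
			filtered[k] = v
		}
	}
	mergePatch(obj, filtered)
}

// patchStatus models a PATCH on the status subresource: only "status" applies.
func patchStatus(obj, patch map[string]interface{}) {
	if s, ok := patch["status"]; ok {
		mergePatch(obj, map[string]interface{}{"status": s})
	}
}

// newTaskSet builds a toy WorkflowTaskSet-like object with one failed node.
func newTaskSet(id string) map[string]interface{} {
	return map[string]interface{}{
		"spec":   map[string]interface{}{"tasks": map[string]interface{}{id: map[string]interface{}{}}},
		"status": map[string]interface{}{"nodes": map[string]interface{}{id: map[string]interface{}{"phase": "Failed"}}},
	}
}

// deletePatch has the same shape as the controller patch quoted in this thread.
func deletePatch(id string) map[string]interface{} {
	deleted := map[string]interface{}{id: nil}
	return map[string]interface{}{
		"spec":   map[string]interface{}{"tasks": deleted},
		"status": map[string]interface{}{"nodes": deleted},
	}
}

// hasNode reports whether obj[section][field] still contains id.
func hasNode(obj map[string]interface{}, section, field, id string) bool {
	sec, _ := obj[section].(map[string]interface{})
	items, _ := sec[field].(map[string]interface{})
	_, ok := items[id]
	return ok
}

func main() {
	ts, p := newTaskSet("plugin-demo-g5qfg"), deletePatch("plugin-demo-g5qfg")
	patchMain(ts, p)
	fmt.Println("after main patch, status node present:", hasNode(ts, "status", "nodes", "plugin-demo-g5qfg"))
	patchStatus(ts, p)
	fmt.Println("after status patch, status node present:", hasNode(ts, "status", "nodes", "plugin-demo-g5qfg"))
}
```

In this model the combined patch removes the task from spec but leaves the failed node in status until the same patch is sent through the status path, which matches the symptom reported in this issue.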

@jswxstw
Member

jswxstw commented Feb 4, 2024

func (woc *wfOperationCtx) getDeleteTaskAndNodePatch() map[string]interface{} {
	deletedNode := make(map[string]interface{})
	for _, node := range woc.wf.Status.Nodes {
		if (node.Type == wfv1.NodeTypeHTTP || node.Type == wfv1.NodeTypePlugin) && node.Fulfilled() {
			deletedNode[node.ID] = nil
		}
	}
	// Delete the completed Tasks and nodes status
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"tasks": deletedNode,
		},
		"status": map[string]interface{}{
			"nodes": deletedNode,
		},
	}
	return patch
}

So spec and status cannot both be modified in a single JSON patch; status has to be patched through the status subresource.
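
One possible way to act on this, sketched under assumptions: split the combined patch into two bodies, one for the main resource and one for the status subresource. splitPatch is a hypothetical helper and is not taken from the Argo codebase or from the eventual fix. With client-go's dynamic client, the second body could be sent by passing "status" as the trailing subresource argument to Patch, but that call is left out here so the sketch stays self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// splitPatch (hypothetical helper) separates a combined merge patch into
// a body for the main resource and a body for the status subresource.
func splitPatch(patch map[string]interface{}) (mainPatch, statusPatch []byte, err error) {
	mainPart := map[string]interface{}{}
	statusPart := map[string]interface{}{}
	for k, v := range patch {
		if k == "status" {
			statusPart[k] = v
		} else {
			mainPart[k] = v
		}
	}
	if mainPatch, err = json.Marshal(mainPart); err != nil {
		return nil, nil, err
	}
	if statusPatch, err = json.Marshal(statusPart); err != nil {
		return nil, nil, err
	}
	return mainPatch, statusPatch, nil
}

func main() {
	deleted := map[string]interface{}{"plugin-demo-g5qfg": nil}
	combined := map[string]interface{}{
		"spec":   map[string]interface{}{"tasks": deleted},
		"status": map[string]interface{}{"nodes": deleted},
	}
	mainPatch, statusPatch, err := splitPatch(combined)
	if err != nil {
		panic(err)
	}
	fmt.Println("main resource patch:  ", string(mainPatch))
	fmt.Println("status subresource patch:", string(statusPatch))
}
```

This costs two API calls instead of one, which is part of the trade-off against simply deleting the WorkflowTaskSet.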

@jswxstw
Member

jswxstw commented Feb 4, 2024

func (woc *wfOperationCtx) getDeleteTaskAndNodePatch() map[string]interface{} {
	deletedNode := make(map[string]interface{})
	for _, node := range woc.wf.Status.Nodes {
		if (node.Type == wfv1.NodeTypeHTTP || node.Type == wfv1.NodeTypePlugin) && node.Fulfilled() {
			deletedNode[node.ID] = nil
		}
	}
	// Delete the completed Tasks and nodes status
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"tasks": deletedNode,
		},
		"status": map[string]interface{}{
			"nodes": deletedNode,
		},
	}
	return patch
}

So spec and status cannot both be modified in a single JSON patch; status has to be patched through the status subresource.

Instead of fixing the patch bug, it would be much simpler to delete the entire WorkflowTaskSet.
I don't quite understand the purpose of keeping the corresponding WorkflowTaskSet after the workflow is completed.

jswxstw pushed a commit to jswxstw/argo-workflows that referenced this issue Feb 5, 2024
…ixes argoproj#11489

Signed-off-by: oninowang <oninowang@tencent.com>
@agilgur5 agilgur5 added P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important and removed P3 Low priority labels Feb 12, 2024