Can't stop a workflow, when this workflow uses a plugins #12333

Closed
2 of 3 tasks
mio4kon opened this issue Dec 8, 2023 · 6 comments · Fixed by #12441

mio4kon commented Dec 8, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

I wrote a demo plugin to simulate a time-consuming task. I want to stop the plugin's execution with the workflow's stop command, but the plugin never receives any stop instruction from the workflow.

The following is my plugin configmap: plugin/hello-executor-plugin-configmap.yaml

# This is an auto-generated file. DO NOT EDIT
apiVersion: v1
data:
  sidecar.automountServiceAccountToken: "false"
  sidecar.container: |
    args:
    - |
      import json
      import random
      import time
      from http.server import BaseHTTPRequestHandler, HTTPServer


      class Plugin(BaseHTTPRequestHandler):

          def args(self):
              return json.loads(self.rfile.read(int(self.headers.get('Content-Length'))))

          def reply(self, reply):
              self.send_response(200)
              self.end_headers()
              self.wfile.write(json.dumps(reply).encode("UTF-8"))

          def forbidden(self):
              self.send_response(403)
              self.end_headers()

          def unsupported(self):
              self.send_response(404)
              self.end_headers()

          def do_POST(self):
              print("=======================================")
              print("self.path: ", self.path)
              if self.path == '/api/v1/template.execute':
                  args = self.args()
                  random_num = random.randint(1, 10)
                  print("random_num: ", random_num)
                  print(args)
                  if 'hello' in args['template'].get('plugin', {}):
                      if random_num > 0:
                          self.reply(
                              {
                                  "node": {
                                      "phase": "Running",
                                      "message": "Long-running task started"
                                  },
                                  "requeue": "5m"
                              }
                          )
                      else:
                          self.reply(
                              {
                                  "node": {
                                      "phase": "Succeeded",
                                      "message": "finish job~~~"
                                  }
                              }
                          )
                  else:
                      self.reply({})
              else:
                  self.unsupported()


      if __name__ == '__main__':
          httpd = HTTPServer(('', 4355), Plugin)
          httpd.serve_forever()
    command:
    - python
    - -u
    - -c
    image: python:alpine3.6
    name: hello-executor-plugin
    ports:
    - containerPort: 4355
    resources:
      limits:
        cpu: 500m
        memory: 128Mi
      requests:
        cpu: 250m
        memory: 64Mi
    securityContext:
      runAsNonRoot: true
      runAsUser: 65534
kind: ConfigMap
metadata:
  creationTimestamp: null
  labels:
    workflows.argoproj.io/configmap-type: ExecutorPlugin
  name: hello-executor-plugin
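
For readers following the plugin logic: the branch taken in do_POST can be sketched as a pure function that is easy to test without an HTTP server. This is only an illustration of the configmap above; the names execute_reply and finished are hypothetical and not part of the plugin. Note that in the demo, random.randint(1, 10) is always greater than 0, so the plugin always takes the "Running" branch and asks to be requeued.

```python
import json


def execute_reply(template: dict, finished: bool) -> dict:
    """Build the reply body for /api/v1/template.execute.

    `finished` is a hypothetical flag standing in for the demo's
    `random_num > 0` check (which is always true in the original).
    """
    # Templates that do not use the "hello" plugin get an empty reply.
    if "hello" not in template.get("plugin", {}):
        return {}
    if finished:
        return {"node": {"phase": "Succeeded", "message": "finish job~~~"}}
    # Still running: report progress and ask to be called again in 5 minutes.
    return {
        "node": {"phase": "Running", "message": "Long-running task started"},
        "requeue": "5m",
    }


template = {"name": "hello", "plugin": {"hello": {"jobName": "iosBuild"}}}
running = execute_reply(template, finished=False)
done = execute_reply(template, finished=True)
ignored = execute_reply({"name": "other", "plugin": {}}, finished=False)

print(json.dumps(running))
print(json.dumps(done))
print(json.dumps(ignored))
```

With the reply built this way, the only requests the plugin sees are the periodic template.execute calls; nothing in this sketch (or in the demo) handles a stop signal.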

Here I simulate a user terminating the workflow, but my plugin never receives the instruction.

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

argo executor-plugin build plugin/.
kubectl -n argo apply -f  plugin/hello-executor-plugin-configmap.yaml
argo submit -n argo plugin-hello.yaml --watch --serviceaccount default
argo terminate hello-25wjm

Logs from the workflow controller

time="2023-12-08T10:34:07.212Z" level=info msg=assessAgentPodStatus namespace=argo podName=hello-7wwcp-1340600742-agent

 time="2023-12-08T10:34:07.212Z" level=error msg="was unable to obtain node for hello-7wwcp-2166136261" namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:07.212Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:07.212Z" level=info msg="Creating TaskSet" namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:12.333Z" level=warning msg="Waited for 5.120860946s, request: Create:https://10.233.0.1:443/apis/argoproj.io/v1alpha1/namespaces/argo/workflowtasksets"

 time="2023-12-08T10:34:12.545Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:12.545Z" level=info msg=updateAgentPodStatus namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:12.545Z" level=info msg=assessAgentPodStatus namespace=argo podName=hello-7wwcp-1340600742-agent

 time="2023-12-08T10:34:12.545Z" level=info msg="Workflow to be dehydrated" Workflow Size=895

 time="2023-12-08T10:34:12.572Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=117253554 workflow=hello-7wwcp

 time="2023-12-08T10:34:22.573Z" level=info msg="Processing workflow" namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:22.574Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-7wwcp

 time="2023-12-08T10:34:22.574Z" level=info msg=updateAgentPodStatus namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:22.574Z" level=info msg=assessAgentPodStatus namespace=argo podName=hello-7wwcp-1340600742-agent

 time="2023-12-08T10:34:22.574Z" level=error msg="was unable to obtain node for hello-7wwcp-2166136261" namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:22.574Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:22.574Z" level=info msg="Creating TaskSet" namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:23.102Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:23.102Z" level=info msg=updateAgentPodStatus namespace=argo workflow=hello-7wwcp

 time="2023-12-08T10:34:23.102Z" level=info msg=assessAgentPodStatus namespace=argo podName=hello-7wwcp-1340600742-agent

 time="2023-12-08T10:34:23.103Z" level=info msg="Workflow to be dehydrated" Workflow Size=895

 time="2023-12-08T10:34:23.108Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=117253580 workflow=hello-7wwcp

 time="2023-12-08T10:38:33.967Z" level=info msg="Alloc=6957 TotalAlloc=232719 Sys=30565 NumGC=110 Goroutines=177"

Logs from in your workflow's wait container

No wait container.

plugin container:
 =======================================

 self.path:  /api/v1/template.execute

 random_num:  4

 {'workflow': {'metadata': {'name': 'hello-7wwcp', 'namespace': 'argo', 'uid': '5ecf8bbe-8027-4999-8b4a-c84f8ac9fc17'}}, 'template': {'name': 'hello', 'inputs': {}, 'outputs': {}, 'metadata': {}, 'plugin': {'hello': {'jobName': 'iosBuild', 'jobParams': {'param1': 'value1', 'param2': 'value2'}}}}}

 127.0.0.1 - - [08/Dec/2023 10:30:40] "POST /api/v1/template.execute HTTP/1.1" 200 -

 =======================================

 self.path:  /api/v1/template.execute

 random_num:  4

 {'workflow': {'metadata': {'name': 'hello-7wwcp', 'namespace': 'argo', 'uid': '5ecf8bbe-8027-4999-8b4a-c84f8ac9fc17'}}, 'template': {'name': 'hello', 'inputs': {}, 'outputs': {}, 'metadata': {}, 'plugin': {'hello': {'jobName': 'iosBuild', 'jobParams': {'param1': 'value1', 'param2': 'value2'}}}}}

 127.0.0.1 - - [08/Dec/2023 10:35:40] "POST /api/v1/template.execute HTTP/1.1" 200 -

 =======================================

 self.path:  /api/v1/template.execute

 random_num:  4

 {'workflow': {'metadata': {'name': 'hello-7wwcp', 'namespace': 'argo', 'uid': '5ecf8bbe-8027-4999-8b4a-c84f8ac9fc17'}}, 'template': {'name': 'hello', 'inputs': {}, 'outputs': {}, 'metadata': {}, 'plugin': {'hello': {'jobName': 'iosBuild', 'jobParams': {'param1': 'value1', 'param2': 'value2'}}}}}

 127.0.0.1 - - [08/Dec/2023 10:40:40] "POST /api/v1/template.execute HTTP/1.1" 200 -

 =======================================

 self.path:  /api/v1/template.execute

 random_num:  5

 {'workflow': {'metadata': {'name': 'hello-7wwcp', 'namespace': 'argo', 'uid': '5ecf8bbe-8027-4999-8b4a-c84f8ac9fc17'}}, 'template': {'name': 'hello', 'inputs': {}, 'outputs': {}, 'metadata': {}, 'plugin': {'hello': {'jobName': 'iosBuild', 'jobParams': {'param1': 'value1', 'param2': 'value2'}}}}}

 127.0.0.1 - - [08/Dec/2023 10:45:40] "POST /api/v1/template.execute HTTP/1.1" 200 -
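
The spacing of the requests in the plugin log above can be checked against the "requeue": "5m" value in the plugin's reply. The sketch below only parses the four access-log lines shown here; the regular expression and the MONTHS table are helpers written for this excerpt, not part of any Argo tooling.

```python
import re
from datetime import datetime

# Only the month that appears in the excerpt is needed.
MONTHS = {"Dec": 12}

ACCESS_LOG = '''
127.0.0.1 - - [08/Dec/2023 10:30:40] "POST /api/v1/template.execute HTTP/1.1" 200 -
127.0.0.1 - - [08/Dec/2023 10:35:40] "POST /api/v1/template.execute HTTP/1.1" 200 -
127.0.0.1 - - [08/Dec/2023 10:40:40] "POST /api/v1/template.execute HTTP/1.1" 200 -
127.0.0.1 - - [08/Dec/2023 10:45:40] "POST /api/v1/template.execute HTTP/1.1" 200 -
'''

STAMP = re.compile(r"\[(\d{2})/(\w{3})/(\d{4}) (\d{2}):(\d{2}):(\d{2})\]")

# Extract one timestamp per access-log line.
times = [
    datetime(int(y), MONTHS[mon], int(d), int(h), int(m), int(s))
    for d, mon, y, h, m, s in STAMP.findall(ACCESS_LOG)
]

# Seconds between consecutive template.execute calls.
intervals = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
print(intervals)
```

Every gap is 300 seconds, so the plugin keeps being polled on the 5-minute requeue schedule and only ever sees template.execute requests.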
@terrytangyuan
Member

 time="2023-12-08T10:34:07.212Z" level=error msg="was unable to obtain node for hello-7wwcp-2166136261" namespace=argo workflow=hello-7wwcp

This seems like a bug to me. Relevant to #12132

cc @isubasinghe

agilgur5 added the area/plugins and P3 (Low priority) labels Dec 12, 2023
@isubasinghe
Member

@terrytangyuan I will have a look at this. I don't think it is related to #12132, which is a special case where it looks for a boundary node.

I suspect the node doesn't actually exist here somehow.

isubasinghe self-assigned this Dec 25, 2023
sarabala1979 pushed a commit that referenced this issue Jan 4, 2024
Signed-off-by: xin04.zhang <xin04.zhang@horizon.ai>
Co-authored-by: xin04.zhang <xin04.zhang@horizon.ai>
@jswxstw
Member

jswxstw commented Feb 23, 2024

> @terrytangyuan I will have a look at this, this doesn't seem to be relevant to #12132 I think, #12132 is a special case where it looks for a boundary node.
>
> I suspect the node doesn't actually exist here somehow.

Yes, it is not relevant to #12132 and I have found the root cause of this problem. The error log is printed here:

woc.log.Errorf("was unable to obtain node for %s", nodeID)

HTTP and plugin nodes run in the agent pod, which is shared within the same workflow. Therefore, woc.nodeID(agentPod) generates a node ID that never exists among the workflow's nodes.
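
One possible reading of the ID in the error log, offered as an assumption rather than a confirmed detail of the controller: if node IDs are formed as "<workflow name>-<FNV-1a 32-bit hash of the node name>", then the suffix 2166136261 is exactly what hashing an empty node name yields (it is the FNV-1a offset basis, 0x811C9DC5). That would be consistent with the agent pod not carrying a node name at all. The functions and the nodes map below are a toy model written for this thread, not the controller's code.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


def node_id(workflow_name: str, node_name: str) -> str:
    # ASSUMED ID scheme: the root node reuses the workflow name,
    # every other node gets "<workflow>-<hash of node name>".
    if node_name == workflow_name:
        return workflow_name
    return f"{workflow_name}-{fnv1a_32(node_name.encode('utf-8'))}"


# Hypothetical node map for the workflow in the logs: one plugin node.
nodes = {node_id("hello-7wwcp", "hello-7wwcp"): {"type": "Plugin"}}

# The shared agent pod does not map to any single node, so model it
# as having an empty node name.
agent_id = node_id("hello-7wwcp", "")

print(agent_id)           # hello-7wwcp-2166136261
print(agent_id in nodes)  # False
```

Under these assumptions the lookup can never succeed for the agent pod, which matches the "was unable to obtain node" message repeating on every reconciliation.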

@isubasinghe Should I open a new issue for it, or will you fix it together with #12132?

agilgur5 added this to the v3.6.0 milestone May 4, 2024
@agilgur5
Member

agilgur5 commented May 4, 2024

@jswxstw was that ever fixed? If not, please submit a PR. In general it's better to write down or fix the issue in an open issue rather than a closed one.

And when it's for a specific case, a PR has particular value, as it documents the case well compared to a generic fix. #12132 is also incredibly, overly broad. As Isitha and I wrote there, some cases are extra logs that can be removed, but others are actual, previously silent bugs that need to be fixed. #12997 is an example of that.

@jswxstw
Member

jswxstw commented May 4, 2024

> @jswxstw was that ever fixed? If not, please submit a PR. In general it's better to write down or fix the issue in an open issue rather than a closed one.

I have opened a new issue, #12726, for it, and a PR has already been submitted.

@agilgur5
Member

agilgur5 commented May 4, 2024

Ah great, thanks for that! I thought you fixed something related but hadn't found it in a search.
I try to back-link all my issues & PRs (and Slack threads) if they were related to something so there's a clear history and link chain


5 participants