
Pod hangs when container in ContainerSet is OOM Killed #10063

Closed
rajaie-sg opened this issue Nov 18, 2022 · 9 comments · Fixed by #11484
Labels
area/templates/container-set · P1 (high priority) · type/bug

Comments

rajaie-sg commented Nov 18, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I have a containerSet with 2 containers. The first container runs a script that results in it being OOM Killed. I expect the Pod and Workflow to fail at that point, but they just hang in the "Running" state until they reach their timeoutSeconds deadline.

I found the related issue #8680, but I am on v3.4.3 and still running into it.

I am using the emissary executor.

Version

v3.4.3

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: example
  namespace: argo
  labels:
spec:
  automountServiceAccountToken: false
  dnsConfig:
    nameservers:
      - 10.33.0.2 # AWS VPC Default
      - 1.1.1.1 # CloudFlare
      - 8.8.8.8 # Google
  executor:
    serviceAccountName: "argo-workflow"
  templates:
    - name: entrypoint
      # max workflow duration https://argoproj.github.io/argo-workflows/fields/#workflowspec
      retryStrategy: # https://argoproj.github.io/argo-workflows/retries/
        limit: "3"
        retryPolicy: "OnTransientError"
      metadata:
      containerSet:
        containers:
          - name: one
            imagePullPolicy: Always
            resources:
              requests:
                memory: "50Mi"
                cpu: "50m"
              limits:
                memory: "50Mi"
            image: "ubuntu"
            command:
              - bash
              - '-c'
            args:
              - |
                /bin/bash <<'EOF'
                echo "hello one"
                apt update -y
                apt install stress -y
                echo 'stress --vm 1 --vm-bytes 512M --vm-hang 100' > abc.sh
                bash abc.sh
                EOF
          - name: two # if changing, update 'eks.amazonaws.com/skip-containers'
            imagePullPolicy: Always
            resources:
              requests:
                memory: "150Mi"
                cpu: "50m"
              limits:
                memory: "250Mi"
            image: "ubuntu"
            command:
              - bash
              - '-c'
            args:
              - |
                /bin/bash <<'EOF'
                echo "hello world"
                EOF
            dependencies:
              - one
  entrypoint: entrypoint


Logs from the workflow controller

n/a

Logs from your workflow's wait container

time="2022-11-18T17:07:11.708Z" level=info msg="Starting Workflow Executor" version=v3.4.3
time="2022-11-18T17:07:11.711Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-11-18T17:07:11.711Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=postman-test-fxd4q-entrypoint-1784721547 template="templatehere"
time="2022-11-18T17:07:11.712Z" level=info msg="Starting deadline monitor"

rajaie-sg commented Nov 20, 2022

Looking at the controller logs, I can see that it detects the container was killed, but the Pod and Workflow remain in the "Running" phase.

time="2022-11-20T16:42:43.766Z" level=info msg="node postman-test-h2r44-49770529 phase Running -> Failed" namespace=argo workflow=postman-test-h2r44
time="2022-11-20T16:42:43.766Z" level=info msg="node postman-test-h2r44-49770529 message: OOMKilled (exit code 137): " namespace=argo workflow=postman-test-h2r44
time="2022-11-20T16:42:43.767Z" level=info msg="node postman-test-h2r44-49770529 finished: 2022-11-20 16:42:43.76700631 +0000 UTC" namespace=argo workflow=postman-test-h2r44
time="2022-11-20T16:42:43.767Z" level=info msg="node unchanged" namespace=argo nodeID=postman-test-h2r44-1403540372 workflow=postman-test-h2r44
time="2022-11-20T16:42:43.767Z" level=debug msg="Evaluating node postman-test-h2r44: template: *v1alpha1.WorkflowStep (entrypoint), boundaryID: " namespace=argo workflow=postman-test-h2r44

(workflow name is different from the example I shared)

More relevant log lines:

time="2022-11-20T16:42:43.770Z" level=debug msg="Log changes patch: {\"status\":{\"conditions\":[{\"status\":\"False\",\"type\":\"PodRunning\"}],\"nodes\":{\"postman-test-h2r44-49770529\":{\"finishedAt\":\"2022-11-20T16:42:43Z\",\"message\":\"OOMKilled (exit code 137): \",\"phase\":\"Failed\"}}}}"
time="2022-11-20T16:42:43.770Z" level=info msg="Workflow to be dehydrated" Workflow Size=11223
time="2022-11-20T16:42:43.790Z" level=info msg="Update workflows 200"
time="2022-11-20T16:42:43.794Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=44830908 workflow=postman-test-h2r44
time="2022-11-20T16:42:43.797Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Workflow\", Namespace:\"argo\", Name:\"postman-test-h2r44\", UID:\"72c7ca6b-3e25-4bbd-88f4-2c5b82af521b\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"44830908\", FieldPath:\"\"}): type: 'Warning' reason: 'WorkflowNodeFailed' Failed node postman-test-h2r44(0).init-application: OOMKilled (exit code 137): "
time="2022-11-20T16:42:43.806Z" level=info msg="Create events 201"
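
For reference, the OOMKilled reason and exit code 137 in these logs come from the terminated container state in the pod status. Below is a minimal client-go sketch (the namespace and pod name are placeholders, not taken from this issue) that reads the same fields the controller reports:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (placeholder setup; in-cluster config works too).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Placeholder namespace and pod name; substitute the real workflow pod.
	pod, err := clientset.CoreV1().Pods("argo").Get(context.Background(),
		"postman-test-h2r44-entrypoint-1234567890", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// The "OOMKilled (exit code 137)" message is derived from this terminated state.
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.State.Terminated; t != nil && t.Reason == "OOMKilled" {
			fmt.Printf("container %q was OOMKilled (exit code %d)\n", cs.Name, t.ExitCode)
		}
	}
}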


@rajaie-sg (Author)

This issue is still happening

@sarabala1979 (Member)

@rajaie-sg Would you like to submit a PR for this issue?

@rajaie-sg (Author)

Hi @sarabala1979, it looks like there was a fix for this issue, but it was reverted at some point? #8456 (comment)

@caelan-io added the P1 (high priority) label and removed the P3 (low priority) label Feb 23, 2023

The stale bot added the problem/stale label Mar 25, 2023

rajaie-sg commented Mar 27, 2023

This is still an issue (comment to avoid the stale tag)

@RyanDevlin

I've noticed this same issue on v3.4.7 using the emissary executor

The stale bot removed the problem/stale label Jun 21, 2023

alexec commented Jul 30, 2023

I think there are two issues:

  • A wrong condition for killing containers means a container set does not get killed.
  • The signal is only sent to the process, not to the process group (see the sketch below).
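
A minimal Go sketch of the second point, assuming the child is started in its own process group via Setpgid. This is an illustration of process-group signalling only, not the executor's actual code:

//go:build linux

package main

import (
	"os/exec"
	"syscall"
	"time"
)

func main() {
	// The child spawns its own subprocess, similar to `bash -c` running `bash abc.sh` above.
	cmd := exec.Command("bash", "-c", "sleep 300 & wait")
	// Start the child in a new process group so the whole tree can be signalled as a unit.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	time.Sleep(time.Second)

	// A negative PID targets the entire process group, not just the direct child.
	if err := syscall.Kill(-cmd.Process.Pid, syscall.SIGTERM); err != nil {
		panic(err)
	}
	_ = cmd.Wait()
}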

alexec added a commit that referenced this issue Jul 31, 2023
@alexec linked a pull request Jul 31, 2023 that will close this issue
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this issue Sep 6, 2023
terrytangyuan pushed a commit that referenced this issue Sep 6, 2023: …nly (#11757)
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this issue May 9, 2024: …3.4.11 only (argoproj#11757)