Transient database errors with offload enabled cause workflow to fail #4464

Closed
markterm opened this issue Nov 5, 2020 · 16 comments · Fixed by #4482

markterm commented Nov 5, 2020

Summary

If there is a transient database connection error with offload enabled, then active workflows are marked as failed. I would at least expect some kind of retry.

Diagnostics

What Kubernetes provider are you using?
GKE

What version of Argo Workflows are you running?
2.11.0-rc1

Paste the logs from the workflow controller:
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name) | grep ${workflow}
 time="2020-11-02T13:32:42Z" level=error msg="hydration failed: dial tcp 10.X.X.X:5432: connect: connection refused" namespace=default workflow=XXXX
 time="2020-11-02T13:32:42Z" level=error msg="Failed to archive workflow" err="dial tcp 10.X.X.X:5432: connect: connection refused" namespace=default workflow=XXXX

This seems to be caused by controller.go:517, which marks a workflow as 'error' if hydration fails.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

alexec self-assigned this Nov 5, 2020

alexec commented Nov 5, 2020

This should be fixed. I think we should create a RetryingOffloadNodeStatusRepo that delegates to another OffloadNodeStatusRepo, but adds retries using wait.ExponentialBackoff. Would you be interested in submitting a PR?

markterm commented Nov 6, 2020 via email

alexec added a commit to alexec/argo-workflows that referenced this issue Nov 6, 2020


Signed-off-by: Alex Collins <alex_collins@intuit.com>

alexec commented Nov 6, 2020

ee84892

alexec commented Nov 6, 2020

Can you try with v2.11.7?

markterm commented Nov 9, 2020

I tried turning off incoming connections for the DB for a few seconds while a workflow was running, and got this, which looks like a hard fail on dehydration:

workflow-controller-59d5969b9d-9j568 workflow-controller time="2020-11-09T16:37:01Z" level=warning msg="Failed to dehydrate: pq: database \"staging_codec\" is not currently accepting connections" namespace=staging workflow= XXXX-7ldnh
workflow-controller-59d5969b9d-9j568 workflow-controller time="2020-11-09T16:37:01Z" level=info msg="Updated phase Running -> Error" namespace=staging workflow= XXXX-7ldnh
workflow-controller-59d5969b9d-9j568 workflow-controller time="2020-11-09T16:37:01Z" level=info msg="Updated message  -> pq: database \"staging_codec\" is not currently accepting connections" namespace=staging workflow=XXXX-7ldnh

alexec commented Nov 9, 2020

How long was your database offline for?

markterm commented Nov 10, 2020 via email

alexec commented Nov 10, 2020

Maybe a transient error should not fail the workflow and should instead leave it running?

markterm commented

Yes, I think that if there's an error accessing the database on hydration, we should leave the workflow running and try again later. However, if the error is on dehydration, we could end up with a set of workflow updates that we can't persist. I don't see a good way out of that, so failing the workflow seems reasonable.

alexec commented Nov 10, 2020

@jessesuen interesting point - SQL database could be offline for minutes.

  • For hydration, we can just re-queue the workflow and try again.
  • For dehydration, data might be lost.

What should we do - keep retrying for minutes?

alexec commented Nov 10, 2020

@markterm I've created an engineering build for you to test:

docker pull argoproj/workflow-controller:nr

markterm commented

It looked like this worked :)

markterm commented

Just checking if this is going to be merged?

alexec commented Nov 23, 2020

I've just re-implemented this as a simpler version. It's targeted for v2.12, which is at rc-3 today.

markterm commented Nov 23, 2020 via email

alexec added a commit that referenced this issue Nov 23, 2020
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexcapras pushed a commit to alexcapras/argo that referenced this issue Dec 2, 2020
Signed-off-by: github@finnesand.no <github@finnesand.no>

feat(ui): Add Template/Cron workflow filter to workflow page. Closes argoproj#4532 (argoproj#4543)

Signed-off-by: Tianchu Zhao <evantczhao@gmail.com>

feat(executor): Auto create s3 bucket if not present.

Signed-off-by: Alex Capras <alexcapras@gmail.com>

Apply codegen

Signed-off-by: Alex Capras <alexcapras@gmail.com>

Add argo-e2e label to test wf

Signed-off-by: Alex Capras <alexcapras@gmail.com>

chore: Updated stress test YAML (argoproj#4569)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

docs: Updated kubectl apply command in manifests README (argoproj#4577)

Signed-off-by: Stefan Gloutnikov <stefan@gloutnikov.com>

feat(controller): Make MAX_OPERATION_TIME configurable. Close argoproj#4239 (argoproj#4562)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

docs: Fix a typo in example (argoproj#4590)

Signed-off-by: Takayoshi Nishida <takayoshi.nishida@gmail.com>

feat(controller): Retry transient offload errors. Resolves argoproj#4464 (argoproj#4482)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

fix(server): use the correct name when downloading artifacts (argoproj#4579)

Signed-off-by: Daniel Herman <dherman@factset.com>

fix(server): serve artifacts directly from disk to support large artifacts (argoproj#4589)

Signed-off-by: Daniel Herman <dherman@factset.com>

fix(executor): Handle sidecar killing in a process-namespace-shared pod (argoproj#4575)

Signed-off-by: Daisuke Taniwaki <daisuketaniwaki@gmail.com>

docs: Add JSON schema for IDE validation (argoproj#4581)

Signed-off-by: Paul Brabban <paul.brabban@gmail.com>

refactor: Use polling model for workflow phase metric (argoproj#4557)

Signed-off-by: Simon Behar <simbeh7@gmail.com>

Addressing reviewers comments

Signed-off-by: Alex Capras <alexcapras@gmail.com>

Addressing reviewers comments

docs: Minor typo fix (argoproj#4610)

Signed-off-by: Paavo Pokkinen <paavo.pokkinen@vaimo.com>

fix(controller): Prevent tasks with names starting with digit to use either 'depends' or 'dependencies' (argoproj#4598)

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

fix(docs): Bring minio chart instructions up to date (argoproj#4586)

Signed-off-by: Ranga Krishnan <ranga@bei.re>

fix(executor): Fixed waitMainContainerStart returning prematurely. Closes argoproj#4599 (argoproj#4601)

Signed-off-by: fsiegmund <siegmund@slb.com>

feat(controller): Enhanced artifact repository ref. See argoproj#3184 (argoproj#4458)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

fix: Null check pagination variable (argoproj#4617)

Signed-off-by: Simon Behar <simbeh7@gmail.com>

fix: Perform fields filtering server side (argoproj#4595)

Signed-off-by: Simon Behar <simbeh7@gmail.com>

fix(server): Correct webhook event payload marshalling. Fixes argoproj#4572 (argoproj#4594)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

feat(ui): Add columns--narrower-height to AttributeRow (argoproj#4371)

fix: Fix TestCleanFieldsExclude (argoproj#4625)

Signed-off-by: Simon Behar <simbeh7@gmail.com>

fix(argo-server): fix global variable validation error with reversed dag.tasks (argoproj#4369)

Signed-off-by: chenyu.zheng <chenyu.zheng@hulu.com>

fix: derive jsonschema and fix up issues, validate examples dir… (argoproj#4611)

Signed-off-by: Paul Brabban <paul.brabban@gmail.com>

fix(ui): Reference secrets in EnvVars. Fixes argoproj#3973  (argoproj#4419)

Signed-off-by: Alejandro Tejera <aletepe@gmail.com>

fix(ui): Fix Snyk issues (argoproj#4631)

Signed-off-by: Alex Collins <alex_collins@intuit.com>

feat(executor): More informative log when executors do not support output param from base image layer (argoproj#4620)

Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>

Codegen patch. Signed off by alexcapras@gmail.com

Codegen patch. Signed off by alexcapras@gmail.com

Delete test.patch
alexec added a commit that referenced this issue Dec 3, 2020
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexec added this to the v2.12 milestone Dec 3, 2020

alexec commented Dec 3, 2020
