New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Transient database errors with offload enabled cause workflow to fail #4464
Comments
This should be fixed. I think we should create a |
Thanks - I’m afraid I don’t have capacity to work on this in the immediate
future.
…On Thu, 5 Nov 2020 at 22:47, Alex Collins ***@***.***> wrote:
This should be fixed. I think we should create a
RetryingOffloadNodeStatusRepo that delegates to another
OffloadNodeStatusRepo - but adds retry using wait.ExponentialBackoff.
Would you be interested in submitting a PR?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#4464 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABO7LQMZGTQPAZVYYKLT3TSOMTRXANCNFSM4TLJAWZQ>
.
|
Can you try with |
I tried turning off incoming connections for the DB for a few seconds while a workflow was running, and got this, which looks like a hard fail on dehydration:
|
How long was your database offline for? |
Ah - about a minute and I can see that the maximum retry is under a second.
How about having more retries for DB connection issues? eg single-AZ
database deployments with Amazon RDS can have outages of several minutes
during maintenance.
…On Mon, 9 Nov 2020 at 19:34, Alex Collins ***@***.***> wrote:
How long was your database offline for?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#4464 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABO7LSI2BXTVUU2WJSLWELSPA747ANCNFSM4TLJAWZQ>
.
|
Maybe a transient error should not fail the workflow, instead, it should leave it running? |
Yes - I think if there's an error accessing the database on hydration then we should leave it running and try again later. However if the error is on dehydration then we could end up with a set of workflow updates that we can't persist, I don't see a good way out of that so failing the workflow seems reasonable. |
@jessesuen interesting point - SQL database could be offline for minutes.
What should we do - keep retrying for minutes? |
@markterm I've created an engineering build for you to test:
|
It looked like this worked :) |
Just checking if this is going to be merged? |
I've just re-implemented this is a simpler version. It's targetted for v2.12 which is rc-3 today. |
great - thanks!
…On Mon, 23 Nov 2020 at 18:21, Alex Collins ***@***.***> wrote:
I've just re-implemented this is a simpler version. It's targetted for
v2.12 which is rc-3 today.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4464 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABO7LXJU7ZBIYJ5S6Z2D2TSRKRZZANCNFSM4TLJAWZQ>
.
|
Signed-off-by: github@finnesand.no <github@finnesand.no> feat(ui): Add Template/Cron workflow filter to workflow page. Closes argoproj#4532 (argoproj#4543) Signed-off-by: Tianchu Zhao <evantczhao@gmail.com> feat(executor): Auto create s3 bucket if not present. Signed-off-by: Alex Capras <alexcapras@gmail.com> Apply codegen Signed-off-by: Alex Capras <alexcapras@gmail.com> Add argo-e2e label to test wf Signed-off-by: Alex Capras <alexcapras@gmail.com> chore: Updated stress test YAML (argoproj#4569) Signed-off-by: Alex Collins <alex_collins@intuit.com> docs: Updated kubectl apply command in manifests README (argoproj#4577) Signed-off-by: Stefan Gloutnikov <stefan@gloutnikov.com> feat(controller): Make MAX_OPERATION_TIME configurable. Close argoproj#4239 (argoproj#4562) Signed-off-by: Alex Collins <alex_collins@intuit.com> docs: Fix a typo in example (argoproj#4590) Signed-off-by: Takayoshi Nishida <takayoshi.nishida@gmail.com> feat(controller): Retry transient offload errors. Resolves argoproj#4464 (argoproj#4482) Signed-off-by: Alex Collins <alex_collins@intuit.com> fix(server): use the correct name when downloading artifacts (argoproj#4579) Signed-off-by: Daniel Herman <dherman@factset.com> fix(server): serve artifacts directly from disk to support large artifacts (argoproj#4589) Signed-off-by: Daniel Herman <dherman@factset.com> fix(executor): Handle sidecar killing in a process-namespace-shared pod (argoproj#4575) Signed-off-by: Daisuke Taniwaki <daisuketaniwaki@gmail.com> docs: Add JSON schema for IDE validation (argoproj#4581) Signed-off-by: Paul Brabban <paul.brabban@gmail.com> refactor: Use polling model for workflow phase metric (argoproj#4557) Signed-off-by: Simon Behar <simbeh7@gmail.com> Addressing reviewers comments Signed-off-by: Alex Capras <alexcapras@gmail.com> Addressing reviewers comments docs: Minor typo fix (argoproj#4610) Signed-off-by: Paavo Pokkinen <paavo.pokkinen@vaimo.com> fix(controller): Prevent tasks with names starting with digit to use either 'depends' or 'dependencies' (argoproj#4598) Signed-off-by: terrytangyuan <terrytangyuan@gmail.com> fix(docs): Bring minio chart instructions up to date (argoproj#4586) Signed-off-by: Ranga Krishnan <ranga@bei.re> fix(executor): Fixed waitMainContainerStart returning prematurely. Closes argoproj#4599 (argoproj#4601) Signed-off-by: fsiegmund <siegmund@slb.com> feat(controller): Enhanced artifact repository ref. See argoproj#3184 (argoproj#4458) Signed-off-by: Alex Collins <alex_collins@intuit.com> fix: Null check pagination variable (argoproj#4617) Signed-off-by: Simon Behar <simbeh7@gmail.com> fix: Perform fields filtering server side (argoproj#4595) Signed-off-by: Simon Behar <simbeh7@gmail.com> fix(server): Correct webhook event payload marshalling. Fixes argoproj#4572 (argoproj#4594) Signed-off-by: Alex Collins <alex_collins@intuit.com> feat(ui): Add columns--narrower-height to AttributeRow (argoproj#4371) fix: Fix TestCleanFieldsExclude (argoproj#4625) Signed-off-by: Simon Behar <simbeh7@gmail.com> fix(argo-server): fix global variable validation error with reversed dag.tasks (argoproj#4369) Signed-off-by: chenyu.zheng <chenyu.zheng@hulu.com> fix: derive jsonschema and fix up issues, validate examples dir… (argoproj#4611) Signed-off-by: Paul Brabban <paul.brabban@gmail.com> fix(ui): Reference secrets in EnvVars. Fixes argoproj#3973 (argoproj#4419) Signed-off-by: Alejandro Tejera <aletepe@gmail.com> fix(ui): Fix Snyk issues (argoproj#4631) Signed-off-by: Alex Collins <alex_collins@intuit.com> feat(executor): More informative log when executors do not support output param from base image layer (argoproj#4620) Signed-off-by: terrytangyuan <terrytangyuan@gmail.com> Codegen patch. Signed off by alexcapras@gmail.com Codegen patch. Signed off by alexcapras@gmail.com Delete test.patch
Signed-off-by: github@finnesand.no <github@finnesand.no> feat(ui): Add Template/Cron workflow filter to workflow page. Closes argoproj#4532 (argoproj#4543) Signed-off-by: Tianchu Zhao <evantczhao@gmail.com> feat(executor): Auto create s3 bucket if not present. Signed-off-by: Alex Capras <alexcapras@gmail.com> Apply codegen Signed-off-by: Alex Capras <alexcapras@gmail.com> Add argo-e2e label to test wf Signed-off-by: Alex Capras <alexcapras@gmail.com> chore: Updated stress test YAML (argoproj#4569) Signed-off-by: Alex Collins <alex_collins@intuit.com> docs: Updated kubectl apply command in manifests README (argoproj#4577) Signed-off-by: Stefan Gloutnikov <stefan@gloutnikov.com> feat(controller): Make MAX_OPERATION_TIME configurable. Close argoproj#4239 (argoproj#4562) Signed-off-by: Alex Collins <alex_collins@intuit.com> docs: Fix a typo in example (argoproj#4590) Signed-off-by: Takayoshi Nishida <takayoshi.nishida@gmail.com> feat(controller): Retry transient offload errors. Resolves argoproj#4464 (argoproj#4482) Signed-off-by: Alex Collins <alex_collins@intuit.com> fix(server): use the correct name when downloading artifacts (argoproj#4579) Signed-off-by: Daniel Herman <dherman@factset.com> fix(server): serve artifacts directly from disk to support large artifacts (argoproj#4589) Signed-off-by: Daniel Herman <dherman@factset.com> fix(executor): Handle sidecar killing in a process-namespace-shared pod (argoproj#4575) Signed-off-by: Daisuke Taniwaki <daisuketaniwaki@gmail.com> docs: Add JSON schema for IDE validation (argoproj#4581) Signed-off-by: Paul Brabban <paul.brabban@gmail.com> refactor: Use polling model for workflow phase metric (argoproj#4557) Signed-off-by: Simon Behar <simbeh7@gmail.com> Addressing reviewers comments Signed-off-by: Alex Capras <alexcapras@gmail.com> Addressing reviewers comments docs: Minor typo fix (argoproj#4610) Signed-off-by: Paavo Pokkinen <paavo.pokkinen@vaimo.com> fix(controller): Prevent tasks with names starting with digit to use either 'depends' or 'dependencies' (argoproj#4598) Signed-off-by: terrytangyuan <terrytangyuan@gmail.com> fix(docs): Bring minio chart instructions up to date (argoproj#4586) Signed-off-by: Ranga Krishnan <ranga@bei.re> fix(executor): Fixed waitMainContainerStart returning prematurely. Closes argoproj#4599 (argoproj#4601) Signed-off-by: fsiegmund <siegmund@slb.com> feat(controller): Enhanced artifact repository ref. See argoproj#3184 (argoproj#4458) Signed-off-by: Alex Collins <alex_collins@intuit.com> fix: Null check pagination variable (argoproj#4617) Signed-off-by: Simon Behar <simbeh7@gmail.com> fix: Perform fields filtering server side (argoproj#4595) Signed-off-by: Simon Behar <simbeh7@gmail.com> fix(server): Correct webhook event payload marshalling. Fixes argoproj#4572 (argoproj#4594) Signed-off-by: Alex Collins <alex_collins@intuit.com> feat(ui): Add columns--narrower-height to AttributeRow (argoproj#4371) fix: Fix TestCleanFieldsExclude (argoproj#4625) Signed-off-by: Simon Behar <simbeh7@gmail.com> fix(argo-server): fix global variable validation error with reversed dag.tasks (argoproj#4369) Signed-off-by: chenyu.zheng <chenyu.zheng@hulu.com> fix: derive jsonschema and fix up issues, validate examples dir… (argoproj#4611) Signed-off-by: Paul Brabban <paul.brabban@gmail.com> fix(ui): Reference secrets in EnvVars. Fixes argoproj#3973 (argoproj#4419) Signed-off-by: Alejandro Tejera <aletepe@gmail.com> fix(ui): Fix Snyk issues (argoproj#4631) Signed-off-by: Alex Collins <alex_collins@intuit.com> feat(executor): More informative log when executors do not support output param from base image layer (argoproj#4620) Signed-off-by: terrytangyuan <terrytangyuan@gmail.com> Codegen patch. Signed off by alexcapras@gmail.com Codegen patch. Signed off by alexcapras@gmail.com Delete test.patch Signed-off-by: Alex Capras <alexcapras@gmail.com>
Summary
If there is a transient database connection error with offload enabled, then active workflows are marked as failed. I would at least expect some kind of retry.
Diagnostics
What Kubernetes provider are you using?
GKE
What version of Argo Workflows are you running?
2.11.0-rc1
Seems to be caused by controller.go:517 which marks any workflows as 'error' if hydration fails.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
The text was updated successfully, but these errors were encountered: