upper: no more rows in this result set #2333

Closed
alexec opened this issue Feb 28, 2020 · 20 comments · Fixed by #4864

Comments

@alexec
Contributor

alexec commented Feb 28, 2020

Still seeing this issue locally. We should never see this for offloaded workflows.

@alexec alexec added this to the v2.5 milestone Feb 28, 2020
@alexec
Contributor Author

alexec commented Feb 28, 2020

The GC should never delete any UID+version that is currently live. I.e. if a UID+version can be found in etcd, then we should never delete it. For e2e tests, this is set as follows:

OFFLOAD_NODE_STATUS_TTL=30s
WORKFLOW_GC_PERIOD=30s

I.e. every 30s we try to delete any wf older than 30s. This means that any running workflow should be GC'd after 1m, i.e. any problems should appear after 1m.

However, at some point greater than 1m, they appear to be getting deleted.

I think this may be happening after some kind of restart - maybe the lister returns zero records?

level=info msg="Deleting old offloads that are not live" len_old_offloads=3 len_wfs=74

If len_wfs=0, then this would be suspicious.
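
To make the intended rule concrete, here is a rough sketch in Go (illustrative only, not the actual controller code; oldOffloads, liveVersions and del are hypothetical stand-ins):

// Every WORKFLOW_GC_PERIOD, delete offload records older than
// OFFLOAD_NODE_STATUS_TTL, but never a UID+version that is still live in etcd.
func gcOldOffloads(oldOffloads map[string][]string, liveVersions map[string]string, del func(uid, version string) error) error {
	for uid, versions := range oldOffloads {
		for _, version := range versions {
			if liveVersions[uid] == version {
				continue // still live: must never be deleted
			}
			if err := del(uid, version); err != nil {
				return err
			}
		}
	}
	return nil
}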

@alexec alexec added this to In progress in Argo Workflows OSS Kanban Board Mar 9, 2020
@alexec alexec closed this as completed Mar 25, 2020
@alexec
Contributor Author

alexec commented Mar 25, 2020

Bug in tests.

@duongnt

duongnt commented Nov 23, 2020

@alexec we're still seeing this issue in 2.10.0

@alexec
Contributor Author

alexec commented Nov 23, 2020

Please upgrade to v2.11.8

@markterm
Contributor

I have been seeing this issue in v2.12.2; it continued happening after increasing OFFLOAD_NODE_STATUS_TTL to 30m, but I haven't seen it since increasing it to 360m.

@alexec alexec reopened this Jan 11, 2021
@alexec
Contributor Author

alexec commented Jan 11, 2021

Interesting...

You should NEVER see this error.

OFFLOAD_NODE_STATUS_TTL is 5m by default. This is so that watches (which can only be 5m long) work.

My question for @markterm is: did you see this in the controller or the argo-server?

@markterm
Contributor

I got the error from the argo-server, but I have confirmed that the workflow-controller deleted the record prematurely.

I did a run with some extra logging, and in this case saw that 'fnv:900512843' was deleted while the workflow was still running, and the liveOffloadNodeStatusVersions for that workflow in the workflowGarbageCollector function was set to an empty string.

@markterm
Contributor

markterm commented Jan 11, 2021

It appears that if I change workflow/controller/controller.go:442 to:

if !ok || (nodeStatusVersion != record.Version && nodeStatusVersion != "") {

Then the problem doesn't occur.
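
For context, this is roughly where that guard sits (a sketch only; the identifiers below approximate, rather than copy, workflow/controller/controller.go):

// Delete an old offload record only if the workflow is gone, or it points at a
// different, non-empty version; an empty version is treated as unknown, not stale.
func shouldDelete(liveVersions map[string]string, uid, recordVersion string) bool {
	nodeStatusVersion, ok := liveVersions[uid]
	return !ok || (nodeStatusVersion != recordVersion && nodeStatusVersion != "")
}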

Also, I have logged the liveOffloadNodeStatusVersions value for one of the workflows in progress; here it is:

02:15:06.584 eclipse workflow-controller time="2021-01-11T02:15:06.584Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version="fnv:890200010"
02:20:05.263 eclipse workflow-controller time="2021-01-11T02:20:04.653Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version=
02:25:04.504 eclipse workflow-controller time="2021-01-11T02:25:04.489Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version=
02:30:04.495 eclipse workflow-controller time="2021-01-11T02:30:04.495Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version=
02:35:04.470 eclipse workflow-controller time="2021-01-11T02:35:04.470Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version=
02:40:04.521 eclipse workflow-controller time="2021-01-11T02:40:04.521Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version=

However I did kubectl -o yaml on this same workflow at 02:38, and got:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
...
  uid: 88e182ad-d01b-4e30-8b16-34f9a311f364
...
status:
  finishedAt: null
  offloadNodeStatusVersion: fnv:2438327164
  phase: Running
  progress: 2/2
  resourcesDuration:
    cpu: 46
    memory: 28
  startedAt: "2021-01-11T02:11:22Z"

As you can see, the problem is that the workflow has an offload version in etcd, but the workflow-controller doesn't seem to see it.

This is from the argo_workflows table in the DB:

[screenshot of the argo_workflows table omitted]

@alexec alexec self-assigned this Jan 11, 2021
@alexec
Contributor Author

alexec commented Jan 11, 2021

I'm not sure that is the right fix.

It is possible (but very unlikely) for a workflow to have a non-empty value for nodeStatusVersion, and subsequently for that to be set to empty. This might happen if there was an update conflict. That may be what happened to you.

But that doesn't explain to me why we would delete the current version.

For the controller, we only ever care about the most current version. If we can't get it, we'll error the workflow.

Do you see errored workflows?

@markterm
Contributor

markterm commented Jan 11, 2021 via email

@alexec
Contributor Author

alexec commented Jan 11, 2021

always offload,

This is only intended for dev environments - you might be better off using the workflow archive.

nodeStatusVersion should never be empty.

so there presumably is a bug

@markterm
Contributor

markterm commented Jan 12, 2021 via email

@alexec
Contributor Author

alexec commented Jan 12, 2021

Ah - so maybe you're using argo node set and there is a bug there? You must use that with ARGO_SERVER.

@markterm
Contributor

markterm commented Jan 12, 2021 via email

@alexec
Contributor Author

alexec commented Jan 12, 2021

This sounds like an edge case bug to me.

Can I ask why you need to have ALWAYS_OFFLOAD_NODE_STATUS=true? This isn't intended for production usage.

@markterm
Contributor

We're using it to avoid the bug in argo node set. I didn't realise it wasn't intended for production usage, so identifying and fixing that bug is probably more important, but we haven't had luck tracking it down so far.

@alexec
Contributor Author

alexec commented Jan 12, 2021

It is expensive to run using ALWAYS_OFFLOAD_NODE_STATUS: CPU, memory, network and disk costs will all be much higher, so your AWS bills will be higher too.

That one change may stop the issue in most cases - and reduce your bills.

I think your problem could be caused by this, in fact. Do you have ALWAYS_OFFLOAD_NODE_STATUS set when you call argo node set?

@simster7 I've inspected SetWorkflow and I can see a minor bug on line 480: we do not hydrate the workflow on this line.
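
For clarity, a rough sketch of what hydrating at that point would mean (illustrative only; Workflow, hydrator and setNode below are simplified stand-ins, not the actual server code):

// Load the offloaded node status back into the workflow before mutating it,
// then dehydrate (re-offload) before persisting.
type Workflow struct {
	Nodes map[string]string // simplified stand-in for the node status map
}

type hydrator interface {
	Hydrate(wf *Workflow) error   // pull offloaded nodes from the DB into wf.Nodes
	Dehydrate(wf *Workflow) error // offload wf.Nodes again if it is too large
}

func setNode(h hydrator, wf *Workflow, mutate func(*Workflow) error) error {
	if err := h.Hydrate(wf); err != nil { // the step that is missing on line 480
		return err
	}
	if err := mutate(wf); err != nil {
		return err
	}
	return h.Dehydrate(wf)
}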

@markterm
Contributor

markterm commented Jan 12, 2021 via email

@alexec
Contributor Author

alexec commented Jan 12, 2021

You must use ALWAYS_OFFLOAD_NODE_STATUS with the CLI too:

env ALWAYS_OFFLOAD_NODE_STATUS=true argo node set ...

But as I said, you should not be using this option.

alexec added a commit to alexec/argo-workflows that referenced this issue Jan 12, 2021
@markterm
Contributor

markterm commented Jan 12, 2021 via email

alexec added a commit that referenced this issue Jan 12, 2021
…2333 (#4864)

Signed-off-by: Alex Collins <alex_collins@intuit.com>