upper: no more rows in this result set #2333

Closed
alexec opened this issue Feb 28, 2020 · 20 comments · Fixed by #4864

Comments

@alexec
Contributor

alexec commented Feb 28, 2020

Still seeing this issue locally. We should never see this for offloaded workflows.

@alexec alexec added this to the v2.5 milestone Feb 28, 2020
@alexec
Contributor Author

alexec commented Feb 28, 2020

The GC should never delete any UID+version that is currently live. I.e. if a UID+version can be found in etcd, then we should never delete it. For e2e tests, this is set as follows:

OFFLOAD_NODE_STATUS_TTL=30s
WORKFLOW_GC_PERIOD=30s

I.e. every 30s we try to delete any wf older than 30s. This means that any running workflow should be GC'd after 1m, i.e. any problems should appear after 1m.

However, at some point greater than 1m, they appear to be getting deleted.

I think this may be happening after some kind of restart - maybe the lister returns zero records?

level=info msg="Deleting old offloads that are not live" len_old_offloads=3 len_wfs=74

If len_wfs=0, then this would be suspicious.
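
To make the intended rule concrete, here is a rough sketch in Go (illustrative only, not the actual controller code; oldOffloads, liveVersions and del are hypothetical stand-ins):

// Every WORKFLOW_GC_PERIOD, delete offload records older than
// OFFLOAD_NODE_STATUS_TTL, but never a UID+version that is still live in etcd.
func gcOldOffloads(oldOffloads map[string][]string, liveVersions map[string]string, del func(uid, version string) error) error {
	for uid, versions := range oldOffloads {
		for _, version := range versions {
			if liveVersions[uid] == version {
				continue // still live: must never be deleted
			}
			if err := del(uid, version); err != nil {
				return err
			}
		}
	}
	return nil
}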

@alexec alexec added this to In progress in Argo Workflows OSS Kanban Board Mar 9, 2020
@alexec alexec closed this as completed Mar 25, 2020
@alexec
Contributor Author

alexec commented Mar 25, 2020

Bug in tests.

@duongnt

duongnt commented Nov 23, 2020

@alexec we're still seeing this issue in 2.10.0

@alexec
Contributor Author

alexec commented Nov 23, 2020

Please upgrade to v2.11.8

@markterm
Contributor

I have been seeing this issue in v2.12.2; it continued happening after increasing OFFLOAD_NODE_STATUS_TTL to 30m, but I haven't seen it since increasing it to 360m.

@alexec alexec reopened this Jan 11, 2021
@alexec
Contributor Author

alexec commented Jan 11, 2021

Interesting...

You should NEVER see this error.

OFFLOAD_NODE_STATUS_TTL is 5m by default. This is so that watches (which can only be 5m long) work.

My question for @markterm is: did you see this in the controller or the argo-server?

@markterm
Contributor

I got the error from the argo-server, but I have confirmed that the workflow-controller deleted the record prematurely.

I did a run with some extra logging, and in this case saw that 'fnv:900512843' was deleted while the workflow was still running, and the liveOffloadNodeStatusVersions for that workflow in the workflowGarbageCollector function was set to an empty string.

@markterm
Contributor

markterm commented Jan 11, 2021

It appears that if I change workflow/controller/controller.go:442 to:

if !ok || (nodeStatusVersion != record.Version && nodeStatusVersion != "") {

Then the problem doesn't occur.
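
For context, this is roughly where that guard sits (a sketch only; the identifiers below approximate, rather than copy, workflow/controller/controller.go):

// Delete an old offload record only if the workflow is gone, or it points at a
// different, non-empty version; an empty version is treated as unknown, not stale.
func shouldDelete(liveVersions map[string]string, uid, recordVersion string) bool {
	nodeStatusVersion, ok := liveVersions[uid]
	return !ok || (nodeStatusVersion != recordVersion && nodeStatusVersion != "")
}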

Also, I have logged the liveOffloadNodeStatusVersions value for one of the workflows in progress; here it is:

02:15:06.584 eclipse workflow-controller time="2021-01-11T02:15:06.584Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version="fnv:890200010"
02:20:05.263 eclipse workflow-controller time="2021-01-11T02:20:04.653Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version=
02:25:04.504 eclipse workflow-controller time="2021-01-11T02:25:04.489Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version=
02:30:04.495 eclipse workflow-controller time="2021-01-11T02:30:04.495Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version=
02:35:04.470 eclipse workflow-controller time="2021-01-11T02:35:04.470Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version=
02:40:04.521 eclipse workflow-controller time="2021-01-11T02:40:04.521Z" level=info msg=OffloadNodeStatusVersion UID=88e182ad-d01b-4e30-8b16-34f9a311f364 Version=

However I did kubectl -o yaml on this same workflow at 02:38, and got:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
...
  uid: 88e182ad-d01b-4e30-8b16-34f9a311f364
...
status:
  finishedAt: null
  offloadNodeStatusVersion: fnv:2438327164
  phase: Running
  progress: 2/2
  resourcesDuration:
    cpu: 46
    memory: 28
  startedAt: "2021-01-11T02:11:22Z"

As you can see, the problem is that the workflow has an offload version in etcd, but the workflow-controller doesn't seem to see it.

This is from the argo_workflows table in the DB:

[screenshot of the argo_workflows table omitted]

@alexec alexec self-assigned this Jan 11, 2021
@alexec
Contributor Author

alexec commented Jan 11, 2021

I'm not sure that is the right fix.

It is possible (but very unlikely) for a workflow to have a non-empty value for nodeStatusVersion, and subsequently for that to be set to empty. This might happen if there was an update conflict. That may be what happened to you.

But that doesn't explain to me why we would delete the current version.

For the controller, we only ever care about the most current version. If we can't get it, we'll error the workflow.

Do you see errored workflows?

@markterm
Contributor

markterm commented Jan 11, 2021 via email

@alexec
Contributor Author

alexec commented Jan 11, 2021

always offload,

This is only intended for dev environments - you might be better off using the workflow archive.

nodeStatusVersion should never be empty.

so there presumably is a bug

@markterm
Contributor

markterm commented Jan 12, 2021 via email

@alexec
Contributor Author

alexec commented Jan 12, 2021

Ah - so maybe you're using argo node set and there is a bug there? You must use that with ARGO_SERVER.

@markterm
Contributor

markterm commented Jan 12, 2021 via email

@alexec
Contributor Author

alexec commented Jan 12, 2021

This sounds like an edge case bug to me.

Can I ask why you need to have ALWAYS_OFFLOAD_NODE_STATUS=true? This isn't intended for production usage.

@markterm
Contributor

We're using it to avoid the bug in argo node set. I didn't realise it wasn't intended for production usage, so identifying and fixing that bug is probably more important, but we haven't had luck tracking it down so far.

@alexec
Contributor Author

alexec commented Jan 12, 2021

It is expensive to run using ALWAYS_OFFLOAD_NODE_STATUS: CPU, memory, network and disk costs will all be much higher, so your AWS bills will be higher too.

That one change may stop the issue in most cases - and reduce your bills.

I think your problem could be caused by this, in fact. Do you have ALWAYS_OFFLOAD_NODE_STATUS set when you call argo node set?

@simster7 I've inspected SetWorkflow and I can see a minor bug on line 480: we do not hydrate the workflow on this line.
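
For clarity, a rough sketch of what hydrating at that point would mean (illustrative only; Workflow, hydrator and setNode below are simplified stand-ins, not the actual server code):

// Load the offloaded node status back into the workflow before mutating it,
// then dehydrate (re-offload) before persisting.
type Workflow struct {
	Nodes map[string]string // simplified stand-in for the node status map
}

type hydrator interface {
	Hydrate(wf *Workflow) error   // pull offloaded nodes from the DB into wf.Nodes
	Dehydrate(wf *Workflow) error // offload wf.Nodes again if it is too large
}

func setNode(h hydrator, wf *Workflow, mutate func(*Workflow) error) error {
	if err := h.Hydrate(wf); err != nil { // the step that is missing on line 480
		return err
	}
	if err := mutate(wf); err != nil {
		return err
	}
	return h.Dehydrate(wf)
}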

@markterm
Contributor

markterm commented Jan 12, 2021 via email

@alexec
Contributor Author

alexec commented Jan 12, 2021

You must use ALWAYS_OFFLOAD_NODE_STATUS with the CLI too:

env ALWAYS_OFFLOAD_NODE_STATUS=true argo node set ...

But as I said, you should not be using this option.

alexec added a commit to alexec/argo-workflows that referenced this issue Jan 12, 2021
@markterm
Contributor

markterm commented Jan 12, 2021 via email

alexec added a commit that referenced this issue Jan 12, 2021
…2333 (#4864)

Signed-off-by: Alex Collins <alex_collins@intuit.com>