
feat(usage): Report resource duration. Closes #1066 #2219

Merged: 34 commits into argoproj:master on Mar 19, 2020

Conversation

@alexec (Contributor) commented Feb 12, 2020

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issue number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • Optional. My organization is added to the README.
  • I've signed the CLA and required builds are green.

Closes #1066

Screen Shot 2020-02-12 at 4 47 01 PM

Screen Shot 2020-02-12 at 4 47 32 PM

As it stands, this computes the usage and stores it in the status field for pods only. It then provides a method to get the total usage.

The usage of steps/DAGs (which would be the sum of the pods within them) is not calculated, but could be useful.
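
Roughly, the per-pod computation described here could be sketched as follows. This is an illustrative sketch only: the package, function name, and the 1 CPU / 1Gi denominators are assumptions, not the PR's actual code.

```go
package usage

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podResourceDuration estimates resource*time for one container as
// duration * (requested quantity / denominator quantity), per resource.
func podResourceDuration(requests corev1.ResourceList, duration time.Duration) map[corev1.ResourceName]time.Duration {
	// Denominators chosen for this illustration only: 1 CPU and 1Gi of memory.
	denominators := map[corev1.ResourceName]resource.Quantity{
		corev1.ResourceCPU:    resource.MustParse("1"),
		corev1.ResourceMemory: resource.MustParse("1Gi"),
	}
	out := map[corev1.ResourceName]time.Duration{}
	for name, q := range requests {
		denom, ok := denominators[name]
		if !ok || denom.IsZero() {
			continue
		}
		// Scale the run duration by the ratio of the requested quantity to the denominator,
		// e.g. a 2000m CPU request running for 3m yields 6m of "cpu duration".
		out[name] = time.Duration(float64(duration) * float64(q.MilliValue()) / float64(denom.MilliValue()))
	}
	return out
}
```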

@codecov bot commented Feb 12, 2020

Codecov Report

Merging #2219 into master will decrease coverage by 0.43%.
The diff coverage is 22.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2219      +/-   ##
==========================================
- Coverage   11.92%   11.48%   -0.44%     
==========================================
  Files          53       72      +19     
  Lines       26566    28148    +1582     
==========================================
+ Hits         3169     3234      +65     
- Misses      22993    24507    +1514     
- Partials      404      407       +3
| Impacted Files | Coverage Δ |
|---|---|
| ...kg/apis/workflow/v1alpha1/zz_generated.deepcopy.go | 0% <ø> (ø) ⬆️ |
| workflow/common/util.go | 6.1% <ø> (+0.12%) ⬆️ |
| cmd/argo/commands/root.go | 0% <ø> (ø) |
| workflow/validate/validate.go | 75.42% <ø> (+0.15%) ⬆️ |
| pkg/apis/workflow/v1alpha1/generated.pb.go | 0.45% <ø> (ø) ⬆️ |
| persist/sqldb/null_workflow_archive.go | 0% <0%> (ø) ⬆️ |
| persist/sqldb/workflow_archive.go | 0% <0%> (ø) ⬆️ |
| workflow/util/util.go | 16.04% <0%> (ø) ⬆️ |
| persist/sqldb/migrate.go | 0% <0%> (ø) ⬆️ |
| workflow/controller/config_controller.go | 14.72% <0%> (-0.72%) ⬇️ |

... and 29 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aecb13d...5f83f18.

@ddseapy (Contributor) commented Feb 12, 2020

So my understanding from reading this is that a user configures a "utilization per resource" in the workflow: how much they want to count for each resource per minute (i.e. 1Gi RAM/min = 4 utilization points, 1 CPU/min = 6 utilization points).

Argo looks at the duration between container start/end times, along with the container's resource request. It takes that and the user's "utilization per resource" to compute an integer that is representative of the utilization.

If this is correct, then this satisfies our use case (other than needing a GPU resource type, for which I can create a subsequent merge request if that makes sense).
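
If that reading is right, the downstream weighting could be as simple as multiplying each reported resource duration by a user-chosen points-per-minute rate. A minimal sketch, where the function name, map keys, and weights are hypothetical and not part of this PR:

```go
package main

import (
	"fmt"
	"time"
)

// utilizationPoints applies user-chosen per-resource weights (points per minute)
// to per-resource durations of the kind this PR reports.
func utilizationPoints(durations map[string]time.Duration, pointsPerMinute map[string]float64) float64 {
	total := 0.0
	for res, d := range durations {
		total += d.Minutes() * pointsPerMinute[res]
	}
	return total
}

func main() {
	// 1Gi of memory and 1 CPU, each accounted for 3 minutes.
	durations := map[string]time.Duration{"memory(Gi)": 3 * time.Minute, "cpu": 3 * time.Minute}
	// Weights from the example above: 1Gi/min = 4 points, 1 CPU/min = 6 points.
	weights := map[string]float64{"memory(Gi)": 4, "cpu": 6}
	fmt.Println(utilizationPoints(durations, weights)) // 12 + 18 = 30 points
}
```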

@ddseapy (Contributor) commented Feb 12, 2020

This looks great @alexec, and thanks for adding the gpu metric too! I'm happy to test this out whenever you think it's ready.

(Three outdated review comments on docs/usage.md, now resolved)
@alexec marked this pull request as ready for review on February 13, 2020 02:24
@alexec added this to the v2.7 milestone on Feb 13, 2020
@alexec (Contributor, Author) commented Feb 13, 2020

@jessesuen this is a PoC for creating an indicative, but not accurate, estimate of resource usage.

This is a white-space project.

I'm in two minds about it. While I like its indicative nature and the cheap cost of implementation, I'm quite aware that the way the usage is estimated is really only very rough.

The question in my mind is therefore this: is the benefit of this metric greater or less than the cost of people misunderstanding it?

Thoughts?

@alexec (Contributor, Author) commented Feb 13, 2020

> The question in my mind is therefore this: is the benefit of this metric greater or less than the cost of people misunderstanding it?

One mitigation would be to change it from "usage" to "guesstimate usage".

@alexec (Contributor, Author) commented Feb 13, 2020

Let's call a spade a spade:

`/spec/nodes/-/usageIndicator`

@alexec changed the title from "feat(usage): Report usage. Closes #1066" to "feat(usage): Report usage indication. Closes #1066" on Feb 13, 2020
@hnykda commented Feb 14, 2020

As a naive observer and lurker of the repo, but a user of Argo: a rough estimate is better than no estimate if I know it's rough. So if it's well documented (and maybe has a tooltip), it's better to have it.

@simster7 (Member) left a comment:

I have some strong reservations about this feature overall.

Unless I am misunderstanding this, it seems like all this feature does is multiply a container's resource requests/limits by the number of seconds it has run. Since the resource requests/limits are static, why are we going through all this trouble to essentially tell users how long a container ran for, just in different units? The users provide the resource requests/limits themselves, so they can estimate this themselves if they need to.

If users are concerned about seeing how long different containers in a pod take to run, perhaps a breakdown of run times between different containers in a pod would be a more concrete metric to surface.

I may be misunderstanding this, but in my opinion this feature is esoteric and has more potential to confuse and mislead than to provide useful information.

Should we discuss with @jessesuen?

Comment on lines 21 to 31
## Limitations & Assumptions

To calculate the usage we assume that the request/limit/default for a resource is a good enough representation of the pod's average usage.

This is **never** actually the case:

* The pod will probably use more than the request and less than the limit.
* The pod may use more than the limit or less than the request.

This is why the usage is **indicative, but not accurate**.

Review comment (Member):

It seems like this feature is very limited and that understanding this doc is required to accurately interpret what the usage indicators report. Therefore I strongly suggest making this feature opt-in as opposed to on by default.

Comment on lines 862 to 865
// An indicator (i.e. indicative but not accurate) amount of resource * time in seconds, e.g.
// CPU 1000m * 1m = 1m
// memory 1Gi * 2m = 2Gi
// this is represented as duration in seconds, so can be converted to and from duration (with loss of precision below seconds)
Review comment (Member):

Shouldn't the units here be of type resource * time? I.e. 1 gibibyte * 2 minutes = 2 gibibyte-minutes? Also, this specifically says time in seconds, but minutes are used?
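
For concreteness, one possible reading of the quoted comment is that the resource quantity is normalized by a per-resource denominator, so the resource dimension cancels and a plain duration remains. In this sketch the 1Gi memory denominator is an assumption for illustration only, not the PR's code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	request := resource.MustParse("1Gi")     // the container's memory request
	denominator := resource.MustParse("1Gi") // assumed normalization unit for memory
	duration := 2 * time.Minute              // how long the container ran

	// (1Gi * 2m) / 1Gi = 2m: the resource units cancel and a plain duration remains.
	normalized := time.Duration(float64(duration) * float64(request.MilliValue()) / float64(denominator.MilliValue()))
	fmt.Println(normalized) // 2m0s
}
```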


A pod that runs for 3m, with a CPU limit of 2000m, no memory request and an `nvidia.com/gpu` resource limit of 1:

* CPU: 3 * 60s * 2000m / 1000m = 6m*cpu
Review comment (Member):

Units are not clearly expressed. It seems like here you define 1 cpu := 1000 millicores, but that is not explicitly defined anywhere. The concept of a millicore is K8s-specific; we shouldn't assume it is known.

A pod that runs for 3m, with a CPU limit of 2000m, no memory request and an `nvidia.com/gpu` resource limit of 1:

* CPU: 3 * 60s * 2000m / 1000m = 6m*cpu
* Memory: 3 * 60s * 100m / 1Gi = 0s*memory
Review comment (Member):

This makes no sense to me.

What is m in this context? Millicores? Mebibytes?

What is a memory unit defined as?

If I take this expression literally, it says to me either "18,000 second-millicores per gibibyte" or "180 seconds per kibibyte", depending on what m is here. Neither of these makes sense per se or as a measure of memory usage.

How do we end up at "0 second memories" here?


* CPU: 3 * 60s * 2000m / 1000m = 6m*cpu
* Memory: 3 * 60s * 100m / 1Gi = 0s*memory
* GPU: 2 * 60s * = 3m*nvidia.com/gpu
Review comment (Member):

This appears to be malformed. How is an nvidia.com/gpu unit defined? How do we arrive at this conclusion?

i = i.Add(wfv1.UsageIndicator{r: wfv1.NewResourceUsageIndicator(time.Duration(q.Value() * duration.Nanoseconds() / resourceDenominator(r).Value()))})
}
}
return i
Review comment (Member):

I see some issues with this file -- let's discuss in person?

(Outdated review comments on pkg/apis/workflow/v1alpha1/workflow_types.go, util/usage/indicator_test.go, and util/usage/indicator.go, now resolved)
@crenshaw-dev (Member) commented:

I was going to take this PR for a spin. But after building it, I'm getting this from the UI:

Request has been terminated Possible causes: the network is offline, Origin is not allowed by Access-Control-Allow-Origin, the page is being unloaded, etc.

I'm not getting the error from master. Any ideas?

@crenshaw-dev (Member) commented:

I was on cost instead of usage. Disregard. :-)

@crenshaw-dev (Member) commented:

@alexec we're hoping to incorporate this into an upcoming release of a downstream project. I can create a branch to work on some of the cosmetic changes I mentioned. Is there anything else you'd like me to work on while I'm in there?

@alexec (Contributor, Author) commented Mar 9, 2020

@jessesuen thoughts? I think we should aim for v2.7. This has a feature flag, as requested.

@alexec modified the milestones: Backlog, v2.7 on Mar 9, 2020
@alexec requested a review from jessesuen on March 9, 2020 20:23
@crenshaw-dev (Member) commented:

@jessesuen I'm happy to contribute any changes you need. Also happy to rebase to help free up Alex's time.

@alexec merged commit 53a1056 into argoproj:master on Mar 19, 2020
@alexec deleted the usage branch on March 19, 2020 18:37
@alexec modified the milestones: v2.7, v2.8 on Mar 19, 2020
@alexec modified the milestones: v2.8, v2.7 on Apr 3, 2020
Successfully merging this pull request may close: Feature Request: Calculate workflow resource usage

6 participants