
feat(usage): Report resource duration. Closes #1066 #2219

Merged: 34 commits into argoproj:master on Mar 19, 2020

Conversation

@alexec (Contributor) commented Feb 12, 2020

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issue number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • Optional. My organization is added to the README.
  • I've signed the CLA and required builds are green.

Closes #1066

Screen Shot 2020-02-12 at 4 47 01 PM

Screen Shot 2020-02-12 at 4 47 32 PM

As it stands, this computes the usage and stores it in the status field for pods only. It then provides a method to get the total usage.

The usage of steps/DAGs (which would be the sum of the pods within them) is not calculated, but could be useful.
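
Roughly, the per-pod computation described here could be sketched as follows. This is an illustrative sketch only: the package, function name, and the 1 CPU / 1Gi denominators are assumptions, not the PR's actual code.

```go
package usage

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podResourceDuration estimates resource*time for one container as
// duration * (requested quantity / denominator quantity), per resource.
func podResourceDuration(requests corev1.ResourceList, duration time.Duration) map[corev1.ResourceName]time.Duration {
	// Denominators chosen for this illustration only: 1 CPU and 1Gi of memory.
	denominators := map[corev1.ResourceName]resource.Quantity{
		corev1.ResourceCPU:    resource.MustParse("1"),
		corev1.ResourceMemory: resource.MustParse("1Gi"),
	}
	out := map[corev1.ResourceName]time.Duration{}
	for name, q := range requests {
		denom, ok := denominators[name]
		if !ok || denom.IsZero() {
			continue
		}
		// Scale the run duration by the ratio of the requested quantity to the denominator,
		// e.g. a 2000m CPU request running for 3m yields 6m of "cpu duration".
		out[name] = time.Duration(float64(duration) * float64(q.MilliValue()) / float64(denom.MilliValue()))
	}
	return out
}
```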

@codecov bot commented Feb 12, 2020

Codecov Report

Merging #2219 into master will decrease coverage by 0.43%.
The diff coverage is 22.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2219      +/-   ##
==========================================
- Coverage   11.92%   11.48%   -0.44%     
==========================================
  Files          53       72      +19     
  Lines       26566    28148    +1582     
==========================================
+ Hits         3169     3234      +65     
- Misses      22993    24507    +1514     
- Partials      404      407       +3
| Impacted Files | Coverage Δ |
|---|---|
| ...kg/apis/workflow/v1alpha1/zz_generated.deepcopy.go | 0% <ø> (ø) ⬆️ |
| workflow/common/util.go | 6.1% <ø> (+0.12%) ⬆️ |
| cmd/argo/commands/root.go | 0% <ø> (ø) |
| workflow/validate/validate.go | 75.42% <ø> (+0.15%) ⬆️ |
| pkg/apis/workflow/v1alpha1/generated.pb.go | 0.45% <ø> (ø) ⬆️ |
| persist/sqldb/null_workflow_archive.go | 0% <0%> (ø) ⬆️ |
| persist/sqldb/workflow_archive.go | 0% <0%> (ø) ⬆️ |
| workflow/util/util.go | 16.04% <0%> (ø) ⬆️ |
| persist/sqldb/migrate.go | 0% <0%> (ø) ⬆️ |
| workflow/controller/config_controller.go | 14.72% <0%> (-0.72%) ⬇️ |

... and 29 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aecb13d...5f83f18.

@ddseapy (Contributor) commented Feb 12, 2020

So my understanding from reading this is that a user configures a "utilization per resource" in the workflow: how much they want to count for each resource per minute (i.e. 1Gi RAM/min = 4 utilization points, 1 CPU/min = 6 utilization points).

Argo looks at the duration between container start/end times, along with the container's resource request. It takes that and the user's "utilization per resource" to compute an integer that is representative of the utilization.

If this is correct, then this satisfies our use case (other than needing a GPU resource type, for which I can create a subsequent merge request if that makes sense).
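
If that reading is right, the downstream weighting could be as simple as multiplying each reported resource duration by a user-chosen points-per-minute rate. A minimal sketch, where the function name, map keys, and weights are hypothetical and not part of this PR:

```go
package main

import (
	"fmt"
	"time"
)

// utilizationPoints applies user-chosen per-resource weights (points per minute)
// to per-resource durations of the kind this PR reports.
func utilizationPoints(durations map[string]time.Duration, pointsPerMinute map[string]float64) float64 {
	total := 0.0
	for res, d := range durations {
		total += d.Minutes() * pointsPerMinute[res]
	}
	return total
}

func main() {
	// 1Gi of memory and 1 CPU, each accounted for 3 minutes.
	durations := map[string]time.Duration{"memory(Gi)": 3 * time.Minute, "cpu": 3 * time.Minute}
	// Weights from the example above: 1Gi/min = 4 points, 1 CPU/min = 6 points.
	weights := map[string]float64{"memory(Gi)": 4, "cpu": 6}
	fmt.Println(utilizationPoints(durations, weights)) // 12 + 18 = 30 points
}
```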

@ddseapy (Contributor) commented Feb 12, 2020

This looks great @alexec, and thanks for adding the gpu metric too! I'm happy to test this out whenever you think it's ready.

(Three outdated review comments on docs/usage.md, now resolved)
@alexec marked this pull request as ready for review on February 13, 2020 02:24
@alexec added this to the v2.7 milestone on Feb 13, 2020
@alexec (Contributor, Author) commented Feb 13, 2020

@jessesuen this is a PoC for creating an indicative, but not accurate, estimate of resource usage.

This is a white-space project.

I'm in two minds about it. While I like its indicative nature and the cheap cost of implementation, I'm quite aware that the way the usage is estimated is really only very rough.

The question in my mind is therefore this: is the benefit of this metric greater or less than the cost of people misunderstanding it?

Thoughts?

@alexec (Contributor, Author) commented Feb 13, 2020

> The question in my mind is therefore this: is the benefit of this metric greater or less than the cost of people misunderstanding it?

One mitigation would be to change it from "usage" to "guesstimate usage".

@alexec (Contributor, Author) commented Feb 13, 2020

Let's call a spade a spade:

`/spec/nodes/-/usageIndicator`

@alexec changed the title from "feat(usage): Report usage. Closes #1066" to "feat(usage): Report usage indication. Closes #1066" on Feb 13, 2020
@hnykda commented Feb 14, 2020

As a naive observer and lurker of the repo, but a user of Argo: a rough estimate is better than no estimate if I know it's rough. So if it's well documented (and maybe has a tooltip), it's better to have it.

@simster7 (Member) left a comment:

I have some strong reservations about this feature overall.

Unless I am misunderstanding this, it seems like all this feature does is multiply a container's resource requests/limits by the number of seconds it has run. Since the resource requests/limits are static, why are we going through all this trouble to essentially tell users how long a container ran for, just in different units? The users provide the resource requests/limits themselves, so they can estimate this themselves if they need to.

If users are concerned about seeing how long different containers in a pod take to run, perhaps a breakdown of run times between different containers in a pod would be a more concrete metric to surface.

I may be misunderstanding this, but in my opinion this feature is esoteric and has more potential to confuse and mislead than to provide useful information.

Should we discuss with @jessesuen?

Comment on lines 21 to 31
## Limitations & Assumptions

To calculate the usage we assume that the request/limit/default for a resource is a good enough representation of the pod's average usage.

This is **never** actually the case:

* The pod will probably use more than the request and less than the limit.
* The pod may use more than the limit or less than the request.

This is why the usage is **indicative, but not accurate**.

Review comment (Member):

It seems like this feature is very limited and that understanding this doc is required to accurately interpret what the usage indicators report. Therefore I strongly suggest making this feature opt-in as opposed to on by default.

Comment on lines 862 to 865
// An indicator (i.e. indicative but not accurate) amount of resource * time in seconds, e.g.
// CPU 1000m * 1m = 1m
// memory 1Gi * 2m = 2Gi
// this is represented as duration in seconds, so can be converted to and from duration (with loss of precision below seconds)
Review comment (Member):

Shouldn't the units here be of type resource * time? I.e. 1 gibibyte * 2 minutes = 2 gibibyte-minutes? Also, this specifically says time in seconds, but minutes are used?
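
For concreteness, one possible reading of the quoted comment is that the resource quantity is normalized by a per-resource denominator, so the resource dimension cancels and a plain duration remains. In this sketch the 1Gi memory denominator is an assumption for illustration only, not the PR's code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	request := resource.MustParse("1Gi")     // the container's memory request
	denominator := resource.MustParse("1Gi") // assumed normalization unit for memory
	duration := 2 * time.Minute              // how long the container ran

	// (1Gi * 2m) / 1Gi = 2m: the resource units cancel and a plain duration remains.
	normalized := time.Duration(float64(duration) * float64(request.MilliValue()) / float64(denominator.MilliValue()))
	fmt.Println(normalized) // 2m0s
}
```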


A pod that runs for 3m, with a CPU limit of 2000m, no memory request and an `nvidia.com/gpu` resource limit of 1:

* CPU: 3 * 60s * 2000m / 1000m = 6m*cpu
Review comment (Member):

Units are not clearly expressed. It seems like here you define 1 cpu := 1000 millicores, but that is not explicitly defined anywhere. The concept of a millicore is K8s-specific; we shouldn't assume it is known.

A pod that runs for 3m, with a CPU limit of 2000m, no memory request and an `nvidia.com/gpu` resource limit of 1:

* CPU: 3 * 60s * 2000m / 1000m = 6m*cpu
* Memory: 3 * 60s * 100m / 1Gi = 0s*memory
Review comment (Member):

This makes no sense to me.

What is m in this context? Millicores? Mebibytes?

What is a memory unit defined as?

If I take this expression literally, it says to me either "18,000 second-millicores per gibibyte" or "180 seconds per kibibyte", depending on what m is here. Neither of these makes sense per se or as a measure of memory usage.

How do we end up at "0 second memories" here?


* CPU: 3 * 60s * 2000m / 1000m = 6m*cpu
* Memory: 3 * 60s * 100m / 1Gi = 0s*memory
* GPU: 2 * 60s * = 3m*nvidia.com/gpu
Review comment (Member):

This appears to be malformed. How is an nvidia.com/gpu unit defined? How do we arrive at this conclusion?

i = i.Add(wfv1.UsageIndicator{r: wfv1.NewResourceUsageIndicator(time.Duration(q.Value() * duration.Nanoseconds() / resourceDenominator(r).Value()))})
}
}
return i
Review comment (Member):

I see some issues with this file -- let's discuss in person?

(Outdated review comments on pkg/apis/workflow/v1alpha1/workflow_types.go, util/usage/indicator_test.go, and util/usage/indicator.go, now resolved)
@crenshaw-dev (Member) commented:

I was going to take this PR for a spin. But after building it, I'm getting this from the UI:

Request has been terminated Possible causes: the network is offline, Origin is not allowed by Access-Control-Allow-Origin, the page is being unloaded, etc.

I'm not getting the error from master. Any ideas?

@crenshaw-dev (Member) commented:

I was on cost instead of usage. Disregard. :-)

@crenshaw-dev (Member) commented:

@alexec we're hoping to incorporate this into an upcoming release of a downstream project. I can create a branch to work on some of the cosmetic changes I mentioned. Is there anything else you'd like me to work on while I'm in there?

@alexec (Contributor, Author) commented Mar 9, 2020

@jessesuen thoughts? I think we should aim for v2.7. This has a feature flag, as requested.

@alexec modified the milestones: Backlog, v2.7 on Mar 9, 2020
@alexec requested a review from jessesuen on March 9, 2020 20:23
@crenshaw-dev (Member) commented:

@jessesuen I'm happy to contribute any changes you need. Also happy to rebase to help free up Alex's time.

@alexec merged commit 53a1056 into argoproj:master on Mar 19, 2020
@alexec deleted the usage branch on March 19, 2020 18:37
@alexec modified the milestones: v2.7, v2.8 on Mar 19, 2020
@alexec modified the milestones: v2.8, v2.7 on Apr 3, 2020
Successfully merging this pull request may close: Feature Request: Calculate workflow resource usage

6 participants