feat(usage): Report resource duration. Closes #1066 #2219
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #2219      +/-   ##
==========================================
- Coverage   11.92%   11.48%    -0.44%
==========================================
  Files          53       72       +19
  Lines       26566    28148     +1582
==========================================
+ Hits         3169     3234       +65
- Misses      22993    24507     +1514
- Partials      404      407        +3
```
Continue to review full report at Codecov.
So my understanding from reading this is that a user configures a "utilization per resource" in the workflow: how much they want to count for each resource per minute (e.g. 1Gi RAM/min = 4 utilization points, 1 CPU/min = 6 utilization points). Argo looks at the duration between container start/end times, along with the container's resource request, and combines that with the user's "utilization per resource" to compute an integer representative of the utilization. If this is correct, then this satisfies our use case (other than needing a gpu resource type, which I can add in a subsequent merge request if that makes sense).
This looks great @alexec, and thanks for adding the gpu metric too! I'm happy to test this out whenever you think it's ready.
@jessesuen this is a PoC for creating an indicative, but not accurate, estimate of resource usage. This is a white-space project. I'm in two minds about it. While I like the indicative nature of it, and the cheap cost of implementation, I'm quite aware that the way the usage is estimated is only very rough. The question in my mind is therefore this: is the benefit of this metric greater or less than the cost of people misunderstanding it? Thoughts?
One mitigation for this is to change it from "usage" to "guesstimated usage". Let's call a spade a spade.
As a naive observer and lurker of the repo, but a user of argo: a rough estimate is better than no estimate if I know it's rough. So if it's well documented (and maybe has a tooltip), it's better to have it.
I have some strong reservations about this feature overall.
Unless I am misunderstanding this, it seems like all this feature does is multiply a container's resource requests/limits by the number of seconds it has run for. Since the resource requests/limits are static, why are we going through all this trouble to essentially tell users how long a container ran for, just in different units? The users provide the resource requests/limits themselves, so they can estimate this themselves if they need to.
If users are concerned about seeing how long different containers in a pod take to run, perhaps a breakdown of run times between the containers in a pod would be a more concrete metric to surface.
I may be misunderstanding this, but in my opinion this feature is esoteric and has more potential to confuse and misguide than to provide useful information.
Should we discuss with @jessesuen?
docs/usage-indicator.md
Outdated
```
## Limitations & Assumptions

To calculate the usage we assume that the request/limit/default for a resource is a good enough representative of the pod's average usage.

This is **never** actually the case:

* The pod will probably use more than the request and less than the limit.
* The pod may use more than the limit or less than the request.

This is why the usage is **indicative, but not accurate**.
```
It seems like this feature is very limited, and that understanding this doc is required to accurately interpret what the usage indicators report. I therefore strongly suggest making this feature opt-in, as opposed to on by default.
```go
// An indicator (i.e. indicative but not accurate) amount of resource * time in seconds, e.g.
// CPU 1000m * 1m = 1m
// memory 1Gi * 2m = 2Gi
// this is represented as duration in seconds, so can be converted to and from duration (with loss of precision below seconds)
```
Shouldn't units here be of type resource * time, i.e. 1 gibibyte * 2 minutes = 2 gibibyte-minutes? Also, this specifically says time in seconds, but minutes are used?
docs/usage-indicator.md
Outdated
```
A pod that runs for 3m, with a CPU limit of 2000m, no memory request and an `nvidia.com/gpu` resource limit of 1:

* CPU: 3 * 60s * 2000m / 1000m = 6m*cpu
```
Units are not clearly expressed. It seems that here you define 1 cpu := 1000 millicores, but that is not explicitly defined anywhere. The concept of a millicore is K8s-specific; we shouldn't assume it is known.
docs/usage-indicator.md
Outdated
```
A pod that runs for 3m, with a CPU limit of 2000m, no memory request and an `nvidia.com/gpu` resource limit of 1:

* CPU: 3 * 60s * 2000m / 1000m = 6m*cpu
* Memory: 3 * 60s * 100m / 1Gi = 0s*memory
```
This makes no sense to me.
What is m in this context? Millicores? Mebibytes? What is a memory unit defined as?
If I take this expression literally, it says to me either "18,000 second-millicores per gibibyte" or "180 seconds per kibibyte", depending on what m is here. Neither of these makes sense per se, or as a measure of memory usage.
How do we end up at "0 second-memories" here?
docs/usage-indicator.md
Outdated
```
* CPU: 3 * 60s * 2000m / 1000m = 6m*cpu
* Memory: 3 * 60s * 100m / 1Gi = 0s*memory
* GPU: 2 * 60s * = 3m*nvidia.com/gpu
```
This appears to be malformed. How is an nvidia.com/gpu unit defined? How do we arrive at this conclusion?
util/usage/indicator.go
Outdated
```go
			i = i.Add(wfv1.UsageIndicator{r: wfv1.NewResourceUsageIndicator(time.Duration(q.Value() * duration.Nanoseconds() / resourceDenominator(r).Value()))})
		}
	}
	return i
```
I see some issues with this file -- let's discuss in person?
I was going to take this PR for a spin, but after building it I'm getting this from the UI:

I'm not getting the error from master. Any ideas?
I was on
@alexec we're hoping to incorporate this into an upcoming release of a downstream project. I can create a branch to work on some of the cosmetic changes I mentioned. Is there anything else you'd like me to work on while I'm in there?
@jessesuen thoughts? I think we should aim for v2.7. This has a feature flag, as requested.
@jessesuen I'm happy to contribute any changes you need. Also happy to rebase to help free up Alex's time.
Checklist:

* "fix(controller): Updates such and such. Fixes #1234"

Closes #1066
As it stands, this computes the usage and stores it in the status field for pods only. It then provides a method to get the total usage.
The usages of steps/DAGs (which would be the sum of the pods within them) are not calculated, but could be useful.