New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add metrics for image pulling: error; in progress count; thoughput #7313
Conversation
Hi @pacoxu. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I will do some test locally later. |
afc79c3
to
e80b52c
Compare
The metrics now is like above. |
e80b52c
to
cde8646
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
pkg/cri/server/metrics.go
Outdated
@@ -54,5 +59,9 @@ func init() { | |||
containerStopTimer = ns.NewLabeledTimer("container_stop", "time to stop a container", "runtime") | |||
containerStartTimer = ns.NewLabeledTimer("container_start", "time to start a container", "runtime") | |||
|
|||
imagePullingError = ns.NewLabeledCounter("image_pulling_error", "error count of image pulling by image name", "image_name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using image name will make the cardinality too high. Prometheus' doc recommends to keep the cardinality lower than 10, so I guess registry name should be a better option here.
[1] https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imagePulls = ns.NewLabeledCounter("image_pulling", "count of image pulling by registry", "status", "registry")
I changed to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's reasonable to assume that containerd will only pull from 10 distinct domains during its lifetime (which appears to be the actual distinction, not registries?). Some container registry providers use many distinct domains (for example, Amazon ECR has a different domain name for each customer account).
pkg/cri/server/metrics.go
Outdated
@@ -34,6 +34,11 @@ var ( | |||
containerCreateTimer metrics.LabeledTimer | |||
containerStopTimer metrics.LabeledTimer | |||
containerStartTimer metrics.LabeledTimer | |||
|
|||
imagePullingError metrics.LabeledCounter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use a latency metric instead, with error
, registry
as labels? By doing that users will be able to
- see the image pull latencies,
- see both error count and success count, and therefore calculate the error ratio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add a count for both error pulling and success pulling, I agree.
imagePulls = ns.NewLabeledCounter("image_pulling", "count of image pulling by registry", "error", "registry")
For image pull latencies, do you mean the pulling duration? Could you explain?
- for the duration, I think imagePullThroughput would be better as the image size would 1MiB or 1GB. The latency does not make sense in some scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add the error as a label, some error message contains the image name and some detailed information. If so, the registry label is not that important. If not, we have to make the error more groupable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of making this metric something like this:
imagePullTimer = ns.NewLabeledTimer("image_pull", "time to pull an image", "error", "registry")
, where error is just a boolean indicating whether the pull succeeds or not.
By doing this:
- if user wants to get the overall pulling duration, regardless of each image size, they can get this metric with
error=='false'
. - if user wants to get the number of failed image pulls, they can get counts of this metric with
error=='true'
But yeah, I think what you are doing here (i.e. separating the counter metric from duration/throughput metric) is also fine. I don't have a strong preference here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added like
imagePulls = ns.NewLabeledCounter("image_pulls", "count of image pulling by registry", "status", "registry")
Two value is valid: status=failed
and status=succeed
cde8646
to
b8de211
Compare
pkg/cri/server/image_pull.go
Outdated
inProgressImagePulls.Inc() | ||
defer inProgressImagePulls.Dec() | ||
var domain string | ||
var err error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess a common practice of handling error in defer is to use named return value, instead of using a var like here, because in this case you can't be sure err
will always be the return value.
For example, if you look at
containerd/pkg/cri/server/image_pull.go
Line 196 in b8de211
if err := c.createImageReference(ctx, r, image.Target()); err != nil { |
the returned error is based on a local err
created in the for-loop. So your defer here will not catch that local error.
(It's similar to https://go.dev/play/p/LNkan9NkAUt)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, I have to wrap like below.
func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (*runtime.PullImageResponse, error) {
resp, domain, err := c.PullImageImp(ctx, r)
if err != nil {
imagePulls.WithValues("failed", domain).Inc()
} else {
imagePulls.WithValues("succeed", domain).Inc()
}
return resp, err
}
PullImageImp is the original function. Is this OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no you dont need to. Just use a named return value for the returned error and use the named return value in your defer, and golang will make sure it is always the returned error. Like this:
func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
defer func() {
if retErr != nil {
...
} else {
...
}
}()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I need the domain as a label of the metrics, and the image registry domain is not returned.
This still does not work.
So I kept the change, would you take a look again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply. I am just trying to minimize the number of wrapper methods. Can we do something like
func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
var domain string
defer func() {
if retErr != nil {
...
} else {
...
}
}()
and domain
will be empty if it errs when trying to parse it, and will be non-empty if the parsing succeeds.
Commit b8de211 is missing a the |
Thanks. I will sign off all commits. |
b8de211
to
0b70c38
Compare
0ab35e4
to
54b84ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments..
pkg/cri/server/image_pull.go
Outdated
imagePullThroughput.WithLabelValues(domain).Observe(imagePullingSpeed) | ||
|
||
log.G(ctx).Infof("Pulled image %q with image id %q, repo tag %q, repo digest %q, size %q in %v s", imageRef, imageID, | ||
repoTag, repoDigest, strconv.FormatInt(size, 10), time.Since(startTime).Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nod one metric per domain ref is probably enough..
If someone needs per image log data... maybe that should be retrieved from the log?
We might need to add the size and duration to the CRI PullImage Response..
pkg/cri/server/image_pull.go
Outdated
imagePullThroughput.WithLabelValues(domain).Observe(imagePullingSpeed) | ||
|
||
log.G(ctx).Infof("Pulled image %q with image id %q, repo tag %q, repo digest %q, size %q in %v s", imageRef, imageID, | ||
repoTag, repoDigest, strconv.FormatInt(size, 10), time.Since(startTime).Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think per domain by default is too granular/too high cardinality.
8e53068
to
83ba4b4
Compare
@mikebrow @samuelkarp Thanks for your review. |
Code LGTM, but can you squash your commits into one? |
…nt; thoughput Signed-off-by: Paco Xu <paco.xu@daocloud.io>
83ba4b4
to
c59f163
Compare
Squashed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM on green
apologies for commenting on an old MR, but any ideas if and how CRI metrics can be enabled for an AKS cluster running containerd? I'd like to measure image pull times for a cluster, and getting it from the CRI metrics is the only way I have found so far? |
Does |
Hi, If I implement |
We may discuss it in kubernetes/enhancements#2371. @shivam99aa |
/cc @cpuguy83
Fixes #7241
This pr tries to add some metrics for image pulling.
image throughout may be related to CPU/disk io/network.
Further options to add for image pulling:
TODO(some unfinished work in this PR)
already exist layers or image
into account. (This means the metrics will show a quick pulling for existing images. As user always want to know which image pulling is slow, this can be fixed later.)