
add metrics for image pulling: error; in progress count; throughput #7313

Merged
merged 1 commit into from Dec 13, 2022

Conversation

Contributor

@pacoxu pacoxu commented Aug 21, 2022

/cc @cpuguy83
Fixes #7241

This PR adds some metrics for image pulling (a rough sketch of the metric definitions follows this list):

  1. Gauge: in-progress pull count
  2. Counter: image pull error count
  • Should this be counted by registry or by image name?
  • Should we add the failure reason as a metric label?
  3. Histogram: throughput (image pull time per 1 MiB), grouped by registry
    Image throughput may be affected by CPU, disk I/O, and network.
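Below is a rough standalone sketch of what these three metric shapes could look like, using prometheus/client_golang directly; the actual PR wires them through containerd's go-metrics namespace (see the diff hunks later in this thread), so the variable names, label sets, and buckets here are illustrative only.

package crimetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// 1. Gauge: number of image pulls currently in progress.
	inProgressImagePulls = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "containerd_cri_image_pulling_in_progress_total",
		Help: "in progress pulls",
	})

	// 2. Counter: image pull errors; the label set is still under discussion below.
	imagePullErrors = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "containerd_cri_image_pulling_error_total",
		Help: "error count of image pulling",
	}, []string{"image_name"})

	// 3. Histogram: seconds needed to pull 1 MiB, grouped by registry.
	imagePullThroughput = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "containerd_cri_image_pulling_throughput_seconds",
		Help:    "image pulling duration for 1 MiB",
		Buckets: prometheus.DefBuckets,
	}, []string{"registry"})
)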

Further options to consider for image pulling:

  1. max_concurrent_downloads (default 3): this may affect image pulling speed. See "Add image pull concurrency limit" #2920.
  2. Histogram: [Needs Discussion] image pull time grouped by registry (maybe not appropriate, since image sizes differ).
  3. Histogram: image size
  4. image_pull_progress_timeout in the containerd configuration (default 1m0s)

TODO (some unfinished work in this PR)

  • add repo as a label
  • the current throughput calculation does not take already existing layers or images into account. (This means the metric will show a fast pull for images that already exist locally. Since users mainly want to know which image pulls are slow, this can be fixed later.)

@k8s-ci-robot

Hi @pacoxu. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor Author

pacoxu commented Aug 21, 2022

I will do some tests locally later.

@pacoxu pacoxu force-pushed the image-pull-metrics branch 2 times, most recently from afc79c3 to e80b52c Compare August 22, 2022 14:18
Contributor Author

pacoxu commented Aug 22, 2022

# HELP containerd_cri_image_pulling_error_total error count of image pulling by image name
# TYPE containerd_cri_image_pulling_error_total counter
containerd_cri_image_pulling_error_total{image_name="daocloud.io/centos:7.3"} 1
containerd_cri_image_pulling_error_total{image_name="daocloud.io/centos:8"} 1
containerd_cri_image_pulling_error_total{image_name="daocloud.io/daocloud/dce-engine:4.0.8"} 1
containerd_cri_image_pulling_error_total{image_name="daocloud.io/daocloud/sprint-boot"} 1
containerd_cri_image_pulling_error_total{image_name="daocloud.io/ubuntu1"} 1
# HELP containerd_cri_image_pulling_in_progress_total in progress pulls
# TYPE containerd_cri_image_pulling_in_progress_total gauge
containerd_cri_image_pulling_in_progress_total 1
# HELP containerd_cri_image_pulling_thoughtput_seconds image pulling duration for 1MiB
# TYPE containerd_cri_image_pulling_thoughtput_seconds histogram
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.005"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.01"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.025"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.05"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.1"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.25"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.5"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="1"} 2
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="2.5"} 2
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="5"} 2
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="10"} 2
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="+Inf"} 2
containerd_cri_image_pulling_thoughtput_seconds_sum{registry="daocloud.io"} 1
containerd_cri_image_pulling_thoughtput_seconds_count{registry="daocloud.io"} 2

The metrics now look like the above.

Contributor

@Jenkins-J Jenkins-J left a comment


This looks good to me.

Contributor Author

pacoxu commented Aug 31, 2022

/cc @mikebrow @fuweid @dmcgowan

@@ -54,5 +59,9 @@ func init() {
containerStopTimer = ns.NewLabeledTimer("container_stop", "time to stop a container", "runtime")
containerStartTimer = ns.NewLabeledTimer("container_start", "time to start a container", "runtime")

imagePullingError = ns.NewLabeledCounter("image_pulling_error", "error count of image pulling by image name", "image_name")
Member


Using the image name will make the cardinality too high. Prometheus's documentation recommends keeping the cardinality below 10 [1], so registry name would be a better option here.

[1] https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels

Contributor Author


	imagePulls = ns.NewLabeledCounter("image_pulling", "count of image pulling by registry", "status", "registry")

I changed it to this.
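For the registry label, the value has to be derived from the image reference. A minimal sketch, assuming the github.com/distribution/reference package (containerd vendors an equivalent package); the helper name registryLabel is hypothetical, not from the PR:

package crimetrics

import "github.com/distribution/reference"

// registryLabel derives the registry domain from an image reference, returning
// "" when the reference cannot be parsed so the metric still gets a stable label.
func registryLabel(imageRef string) string {
	named, err := reference.ParseDockerRef(imageRef) // normalizes e.g. "centos:7" to "docker.io/library/centos:7"
	if err != nil {
		return ""
	}
	return reference.Domain(named)
}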

Member


I don't think it's reasonable to assume that containerd will only pull from 10 distinct domains during its lifetime (which appears to be the actual distinction, not registries?). Some container registry providers use many distinct domains (for example, Amazon ECR has a different domain name for each customer account).

@@ -34,6 +34,11 @@ var (
containerCreateTimer metrics.LabeledTimer
containerStopTimer metrics.LabeledTimer
containerStartTimer metrics.LabeledTimer

imagePullingError metrics.LabeledCounter
Member


Should we use a latency metric instead, with error and registry as labels? By doing that, users would be able to

  1. see the image pull latencies,
  2. see both error count and success count, and therefore calculate the error ratio.

Contributor Author


I agree with adding a count for both failed and successful pulls.

	imagePulls = ns.NewLabeledCounter("image_pulling", "count of image pulling by registry", "error", "registry")

For image pull latencies, do you mean the pulling duration? Could you explain?

  • for the duration, I think imagePullThroughput would be better, since the image size could be 1 MiB or 1 GB; raw latency does not make sense in some scenarios (see the sketch below).
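A minimal sketch of what the observed throughput value could be: seconds spent per MiB pulled, so slow registries are comparable regardless of image size. observeThroughput is a hypothetical helper (the PR computes an equivalent value inline), and imagePullThroughput refers to the histogram sketched earlier.

import "time"

// observeThroughput records how many seconds were needed per MiB of pulled data.
func observeThroughput(startTime time.Time, sizeBytes int64, domain string) {
	sizeMiB := float64(sizeBytes) / 1024.0 / 1024.0
	if sizeMiB <= 0 {
		return // avoid division by zero for unknown or empty sizes
	}
	imagePullThroughput.WithLabelValues(domain).Observe(time.Since(startTime).Seconds() / sizeMiB)
}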

Contributor Author


If we add the error as a label, some error messages contain the image name and other detailed information; in that case the registry label is not that important. Otherwise, we would have to make the error messages more groupable.

Member


I was thinking of making this metric something like this:

imagePullTimer = ns.NewLabeledTimer("image_pull", "time to pull an image", "error", "registry")

, where error is just a boolean indicating whether the pull succeeds or not.

By doing this:

  1. if a user wants the overall pulling duration, regardless of image size, they can get this metric with error=='false'.
  2. if a user wants the number of failed image pulls, they can get counts of this metric with error=='true'.

But yeah, I think what you are doing here (i.e. separating the counter metric from duration/throughput metric) is also fine. I don't have a strong preference here.
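For completeness, a partial sketch of how that single labeled timer might be wired up, assuming the docker/go-metrics LabeledTimer style already used in this file; imagePullTimer is the hypothetical metric from the comment above, and registryLabel is the helper sketched earlier (imports of time and strconv elided).

func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
	start := time.Now()
	registry := registryLabel(r.GetImage().GetImage())
	defer func() {
		// "error" is a boolean-like label; one timer series then covers both
		// latency (the histogram) and success/failure counts (its _count per label set).
		imagePullTimer.WithValues(strconv.FormatBool(retErr != nil), registry).UpdateSince(start)
	}()
	// ... existing pull logic ...
}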

Contributor Author


I added it like this:

	imagePulls = ns.NewLabeledCounter("image_pulls", "count of image pulling by registry", "status", "registry")

Two values are valid: status=failed and status=succeed.

@samuelkarp samuelkarp added this to New in Code Review via automation Sep 29, 2022
@samuelkarp samuelkarp added the area/cri Container Runtime Interface (CRI) label Sep 29, 2022
inProgressImagePulls.Inc()
defer inProgressImagePulls.Dec()
var domain string
var err error
Member


I guess a common practice for handling errors in a defer is to use a named return value, instead of a var like here, because otherwise you can't be sure err will always be the returned value.

For example, if you look at

if err := c.createImageReference(ctx, r, image.Target()); err != nil {

the returned error is based on a local err created in the for-loop. So your defer here will not catch that local error.

(It's similar to https://go.dev/play/p/LNkan9NkAUt)
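A tiny self-contained program, analogous to the linked playground (not reproduced here), illustrating the pitfall: the deferred function reads the outer err, which the := inside the loop shadows, so the actual returned error never reaches it.

package main

import (
	"errors"
	"fmt"
)

func pull() error {
	var err error
	defer func() {
		fmt.Println("deferred sees:", err) // prints <nil>: the loop's err shadowed the outer one
	}()
	for i := 0; i < 1; i++ {
		if err := errors.New("pull failed"); err != nil {
			return err // the returned error never updates the outer err seen by the defer
		}
	}
	return err
}

func main() {
	fmt.Println("returned:", pull())
}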

Contributor Author


If so, I have to wrap it like below.

func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (*runtime.PullImageResponse, error) {
	resp, domain, err := c.PullImageImp(ctx, r)
	if err != nil {
		imagePulls.WithValues("failed", domain).Inc()
	} else {
		imagePulls.WithValues("succeed", domain).Inc()
	}
	return resp, err
}

PullImageImp is the original function. Is this OK?

Member


No, you don't need to. Just use a named return value for the returned error and reference that name in your defer; Go will make sure it is always the returned error. Like this:

func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
	defer func() {
		if retErr != nil {
			...
		} else {
			...
		}
	}()

Contributor Author


Since I need the domain as a label for the metrics, and the image registry domain is not returned, this still does not work.

So I kept the change; would you take another look?

Member


Sorry for the late reply. I am just trying to minimize the number of wrapper methods. Can we do something like

func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
        var domain string
	defer func() {
		if retErr != nil {
			...
		} else {
			...
		}
	}()

and domain will be empty if parsing it fails, and non-empty if the parsing succeeds.
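Putting the pieces of this thread together, a sketch (under the assumptions discussed above) of the final shape: a named error return plus a domain variable captured by the defer, so no wrapper method is needed. inProgressImagePulls and imagePulls are the metrics added by this PR; parsing the reference and the actual pull logic are elided.

func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
	inProgressImagePulls.Inc()
	defer inProgressImagePulls.Dec()

	var domain string // set once the image reference is parsed; empty on parse failure
	defer func() {
		if retErr != nil {
			imagePulls.WithValues("failed", domain).Inc()
		} else {
			imagePulls.WithValues("succeed", domain).Inc()
		}
	}()

	// ... parse the image reference, set domain, pull the image ...
}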

@samuelkarp
Member

   * b8de211b "renames; image pull count all and add registry/status as label" ... FAIL
    - PASS - commit does not have any whitespace errors
    - FAIL - does not have a valid DCO
    - PASS - commit subject is 72 characters or less! *yay*

Commit b8de211 is missing the Signed-off-by line, which is causing the project checks in CI to fail. Please sign off on this commit as per the contribution guidelines.

Contributor Author

pacoxu commented Sep 30, 2022

Commit b8de211 is missing the Signed-off-by line, which is causing the project checks in CI to fail. Please sign off on this commit as per the contribution guidelines.

Thanks. I will sign off all commits.

@dmcgowan dmcgowan moved this from New to Needs Discussion in Code Review Oct 1, 2022
@pacoxu pacoxu requested review from mikebrow and samuelkarp and removed request for samuelkarp and mikebrow December 5, 2022 05:39
Member

@mikebrow mikebrow left a comment


See comments.

Outdated review comments (resolved) on pkg/cri/sbserver/image_pull.go, pkg/cri/server/image_pull.go, and pkg/cri/sbserver/metrics.go.
imagePullThroughput.WithLabelValues(domain).Observe(imagePullingSpeed)

log.G(ctx).Infof("Pulled image %q with image id %q, repo tag %q, repo digest %q, size %q in %v s", imageRef, imageID,
repoTag, repoDigest, strconv.FormatInt(size, 10), time.Since(startTime).Seconds())
Member


Nod, one metric per domain ref is probably enough.

If someone needs per-image pull data, maybe that should be retrieved from the log?

We might need to add the size and duration to the CRI PullImage response.

Outdated review comments (resolved) on pkg/cri/sbserver/metrics.go, pkg/cri/sbserver/image_pull.go, and pkg/cri/server/image_pull.go.
imagePullThroughput.WithLabelValues(domain).Observe(imagePullingSpeed)

log.G(ctx).Infof("Pulled image %q with image id %q, repo tag %q, repo digest %q, size %q in %v s", imageRef, imageID,
repoTag, repoDigest, strconv.FormatInt(size, 10), time.Since(startTime).Seconds())
Member


I think per domain by default is too granular/too high cardinality.

@pacoxu pacoxu force-pushed the image-pull-metrics branch 3 times, most recently from 8e53068 to 83ba4b4 Compare December 7, 2022 03:20
Contributor Author

pacoxu commented Dec 7, 2022

@mikebrow @samuelkarp Thanks for your review.
I removed the registry/domain label from the imagePullThroughput histogram metric.

@samuelkarp samuelkarp dismissed their stale review December 7, 2022 06:53

Changes made

@samuelkarp
Member

Code LGTM, but can you squash your commits into one?

…nt; thoughput

Signed-off-by: Paco Xu <paco.xu@daocloud.io>
Contributor Author

pacoxu commented Dec 7, 2022

Squashed.

Member

@mikebrow mikebrow left a comment


LGTM on green


smitphilip commented Sep 1, 2023

Apologies for commenting on an old PR, but any ideas on whether and how CRI metrics can be enabled for an AKS cluster running containerd?

I'd like to measure image pull times for a cluster, and getting them from the CRI metrics is the only way I have found so far.

Contributor Author

pacoxu commented Sep 1, 2023

Apologies for commenting on an old PR, but any ideas on whether and how CRI metrics can be enabled for an AKS cluster running containerd?

By CRI metrics, do you mean the kubelet metrics that use CRI data (kubernetes/enhancements#2371)?

@shivam99aa

Hi, if I enable the PodAndContainerStatsFromCRI feature gate, will that be enough for the kubelet to scrape these metrics from CRI directly?

Contributor Author

pacoxu commented Jan 23, 2024

Hi, if I enable the PodAndContainerStatsFromCRI feature gate, will that be enough for the kubelet to scrape these metrics from CRI directly?

We may discuss it in kubernetes/enhancements#2371. @shivam99aa
