
add metrics for image pulling: error; in progress count; throughput #7313

Merged
merged 1 commit into from Dec 13, 2022

Conversation

Contributor

@pacoxu pacoxu commented Aug 21, 2022

/cc @cpuguy83
Fixes #7241

This PR adds some metrics for image pulling (a rough sketch of the metric definitions follows this list):

  1. Gauge: in-progress pull count
  2. Counter: image pull error count
  • Should this be counted by registry or by image name?
  • Should we add the failure reason as a metric label?
  3. Histogram: throughput (image pull time per 1 MiB), grouped by registry
    Image throughput may be affected by CPU, disk I/O, and network.
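Below is a rough standalone sketch of what these three metric shapes could look like, using prometheus/client_golang directly; the actual PR wires them through containerd's go-metrics namespace (see the diff hunks later in this thread), so the variable names, label sets, and buckets here are illustrative only.

package crimetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// 1. Gauge: number of image pulls currently in progress.
	inProgressImagePulls = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "containerd_cri_image_pulling_in_progress_total",
		Help: "in progress pulls",
	})

	// 2. Counter: image pull errors; the label set is still under discussion below.
	imagePullErrors = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "containerd_cri_image_pulling_error_total",
		Help: "error count of image pulling",
	}, []string{"image_name"})

	// 3. Histogram: seconds needed to pull 1 MiB, grouped by registry.
	imagePullThroughput = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "containerd_cri_image_pulling_throughput_seconds",
		Help:    "image pulling duration for 1 MiB",
		Buckets: prometheus.DefBuckets,
	}, []string{"registry"})
)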

Further options to consider for image pulling:

  1. max_concurrent_downloads (default 3): this may affect image pulling speed. See "Add image pull concurrency limit" #2920.
  2. Histogram: [Needs Discussion] image pull time grouped by registry (maybe not appropriate, since image sizes differ).
  3. Histogram: image size
  4. image_pull_progress_timeout in the containerd configuration (default 1m0s)

TODO (some unfinished work in this PR)

  • add repo as a label
  • the current throughput calculation does not take already existing layers or images into account. (This means the metric will show a fast pull for images that already exist locally. Since users mainly want to know which image pulls are slow, this can be fixed later.)

@k8s-ci-robot

Hi @pacoxu. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor Author

pacoxu commented Aug 21, 2022

I will do some tests locally later.

@pacoxu pacoxu force-pushed the image-pull-metrics branch 2 times, most recently from afc79c3 to e80b52c Compare August 22, 2022 14:18
Contributor Author

pacoxu commented Aug 22, 2022

# HELP containerd_cri_image_pulling_error_total error count of image pulling by image name
# TYPE containerd_cri_image_pulling_error_total counter
containerd_cri_image_pulling_error_total{image_name="daocloud.io/centos:7.3"} 1
containerd_cri_image_pulling_error_total{image_name="daocloud.io/centos:8"} 1
containerd_cri_image_pulling_error_total{image_name="daocloud.io/daocloud/dce-engine:4.0.8"} 1
containerd_cri_image_pulling_error_total{image_name="daocloud.io/daocloud/sprint-boot"} 1
containerd_cri_image_pulling_error_total{image_name="daocloud.io/ubuntu1"} 1
# HELP containerd_cri_image_pulling_in_progress_total in progress pulls
# TYPE containerd_cri_image_pulling_in_progress_total gauge
containerd_cri_image_pulling_in_progress_total 1
# HELP containerd_cri_image_pulling_thoughtput_seconds image pulling duration for 1MiB
# TYPE containerd_cri_image_pulling_thoughtput_seconds histogram
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.005"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.01"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.025"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.05"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.1"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.25"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="0.5"} 1
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="1"} 2
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="2.5"} 2
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="5"} 2
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="10"} 2
containerd_cri_image_pulling_thoughtput_seconds_bucket{registry="daocloud.io",le="+Inf"} 2
containerd_cri_image_pulling_thoughtput_seconds_sum{registry="daocloud.io"} 1
containerd_cri_image_pulling_thoughtput_seconds_count{registry="daocloud.io"} 2

The metrics now look like the above.

Contributor

@Jenkins-J Jenkins-J left a comment


This looks good to me.

Contributor Author

pacoxu commented Aug 31, 2022

/cc @mikebrow @fuweid @dmcgowan

@@ -54,5 +59,9 @@ func init() {
containerStopTimer = ns.NewLabeledTimer("container_stop", "time to stop a container", "runtime")
containerStartTimer = ns.NewLabeledTimer("container_start", "time to start a container", "runtime")

imagePullingError = ns.NewLabeledCounter("image_pulling_error", "error count of image pulling by image name", "image_name")
Member


Using the image name will make the cardinality too high. Prometheus's documentation recommends keeping the cardinality below 10 [1], so registry name would be a better option here.

[1] https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels

Contributor Author


	imagePulls = ns.NewLabeledCounter("image_pulling", "count of image pulling by registry", "status", "registry")

I changed it to this.
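For the registry label, the value has to be derived from the image reference. A minimal sketch, assuming the github.com/distribution/reference package (containerd vendors an equivalent package); the helper name registryLabel is hypothetical, not from the PR:

package crimetrics

import "github.com/distribution/reference"

// registryLabel derives the registry domain from an image reference, returning
// "" when the reference cannot be parsed so the metric still gets a stable label.
func registryLabel(imageRef string) string {
	named, err := reference.ParseDockerRef(imageRef) // normalizes e.g. "centos:7" to "docker.io/library/centos:7"
	if err != nil {
		return ""
	}
	return reference.Domain(named)
}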

Member


I don't think it's reasonable to assume that containerd will only pull from 10 distinct domains during its lifetime (which appears to be the actual distinction, not registries?). Some container registry providers use many distinct domains (for example, Amazon ECR has a different domain name for each customer account).

@@ -34,6 +34,11 @@ var (
containerCreateTimer metrics.LabeledTimer
containerStopTimer metrics.LabeledTimer
containerStartTimer metrics.LabeledTimer

imagePullingError metrics.LabeledCounter
Member


Should we use a latency metric instead, with error and registry as labels? By doing that, users would be able to

  1. see the image pull latencies,
  2. see both error count and success count, and therefore calculate the error ratio.

Contributor Author


I agree with adding a count for both failed and successful pulls.

	imagePulls = ns.NewLabeledCounter("image_pulling", "count of image pulling by registry", "error", "registry")

For image pull latencies, do you mean the pulling duration? Could you explain?

  • for the duration, I think imagePullThroughput would be better, since the image size could be 1 MiB or 1 GB; raw latency does not make sense in some scenarios (see the sketch below).
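A minimal sketch of what the observed throughput value could be: seconds spent per MiB pulled, so slow registries are comparable regardless of image size. observeThroughput is a hypothetical helper (the PR computes an equivalent value inline), and imagePullThroughput refers to the histogram sketched earlier.

import "time"

// observeThroughput records how many seconds were needed per MiB of pulled data.
func observeThroughput(startTime time.Time, sizeBytes int64, domain string) {
	sizeMiB := float64(sizeBytes) / 1024.0 / 1024.0
	if sizeMiB <= 0 {
		return // avoid division by zero for unknown or empty sizes
	}
	imagePullThroughput.WithLabelValues(domain).Observe(time.Since(startTime).Seconds() / sizeMiB)
}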

Contributor Author


If we add the error as a label, some error messages contain the image name and other detailed information; in that case the registry label is not that important. Otherwise, we would have to make the error messages more groupable.

Member


I was thinking of making this metric something like this:

imagePullTimer = ns.NewLabeledTimer("image_pull", "time to pull an image", "error", "registry")

, where error is just a boolean indicating whether the pull succeeds or not.

By doing this:

  1. if a user wants the overall pulling duration, regardless of image size, they can get this metric with error=='false'.
  2. if a user wants the number of failed image pulls, they can get counts of this metric with error=='true'.

But yeah, I think what you are doing here (i.e. separating the counter metric from duration/throughput metric) is also fine. I don't have a strong preference here.
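For completeness, a partial sketch of how that single labeled timer might be wired up, assuming the docker/go-metrics LabeledTimer style already used in this file; imagePullTimer is the hypothetical metric from the comment above, and registryLabel is the helper sketched earlier (imports of time and strconv elided).

func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
	start := time.Now()
	registry := registryLabel(r.GetImage().GetImage())
	defer func() {
		// "error" is a boolean-like label; one timer series then covers both
		// latency (the histogram) and success/failure counts (its _count per label set).
		imagePullTimer.WithValues(strconv.FormatBool(retErr != nil), registry).UpdateSince(start)
	}()
	// ... existing pull logic ...
}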

Contributor Author


I added it like this:

	imagePulls = ns.NewLabeledCounter("image_pulls", "count of image pulling by registry", "status", "registry")

Two values are valid: status=failed and status=succeed.

@samuelkarp samuelkarp added this to New in Code Review via automation Sep 29, 2022
@samuelkarp samuelkarp added the area/cri Container Runtime Interface (CRI) label Sep 29, 2022
inProgressImagePulls.Inc()
defer inProgressImagePulls.Dec()
var domain string
var err error
Member


I guess a common practice for handling errors in a defer is to use a named return value, instead of a var like here, because otherwise you can't be sure err will always be the returned value.

For example, if you look at

if err := c.createImageReference(ctx, r, image.Target()); err != nil {

the returned error is based on a local err created in the for-loop. So your defer here will not catch that local error.

(It's similar to https://go.dev/play/p/LNkan9NkAUt)
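A tiny self-contained program, analogous to the linked playground (not reproduced here), illustrating the pitfall: the deferred function reads the outer err, which the := inside the loop shadows, so the actual returned error never reaches it.

package main

import (
	"errors"
	"fmt"
)

func pull() error {
	var err error
	defer func() {
		fmt.Println("deferred sees:", err) // prints <nil>: the loop's err shadowed the outer one
	}()
	for i := 0; i < 1; i++ {
		if err := errors.New("pull failed"); err != nil {
			return err // the returned error never updates the outer err seen by the defer
		}
	}
	return err
}

func main() {
	fmt.Println("returned:", pull())
}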

Contributor Author


If so, I have to wrap it like below.

func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (*runtime.PullImageResponse, error) {
	resp, domain, err := c.PullImageImp(ctx, r)
	if err != nil {
		imagePulls.WithValues("failed", domain).Inc()
	} else {
		imagePulls.WithValues("succeed", domain).Inc()
	}
	return resp, err
}

PullImageImp is the original function. Is this OK?

Member


No, you don't need to. Just use a named return value for the returned error and reference that name in your defer; Go will make sure it is always the returned error. Like this:

func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
	defer func() {
		if retErr != nil {
			...
		} else {
			...
		}
	}()

Contributor Author


Since I need the domain as a label for the metrics, and the image registry domain is not returned, this still does not work.

So I kept the change; would you take another look?

Member


Sorry for the late reply. I am just trying to minimize the number of wrapper methods. Can we do something like

func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
        var domain string
	defer func() {
		if retErr != nil {
			...
		} else {
			...
		}
	}()

and domain will be empty if parsing it fails, and non-empty if the parsing succeeds.
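Putting the pieces of this thread together, a sketch (under the assumptions discussed above) of the final shape: a named error return plus a domain variable captured by the defer, so no wrapper method is needed. inProgressImagePulls and imagePulls are the metrics added by this PR; parsing the reference and the actual pull logic are elided.

func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
	inProgressImagePulls.Inc()
	defer inProgressImagePulls.Dec()

	var domain string // set once the image reference is parsed; empty on parse failure
	defer func() {
		if retErr != nil {
			imagePulls.WithValues("failed", domain).Inc()
		} else {
			imagePulls.WithValues("succeed", domain).Inc()
		}
	}()

	// ... parse the image reference, set domain, pull the image ...
}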

@samuelkarp
Member

   * b8de211b "renames; image pull count all and add registry/status as label" ... FAIL
    - PASS - commit does not have any whitespace errors
    - FAIL - does not have a valid DCO
    - PASS - commit subject is 72 characters or less! *yay*

Commit b8de211 is missing the Signed-off-by line, which is causing the project checks in CI to fail. Please sign off on this commit as per the contribution guidelines.

Contributor Author

pacoxu commented Sep 30, 2022

Commit b8de211 is missing the Signed-off-by line, which is causing the project checks in CI to fail. Please sign off on this commit as per the contribution guidelines.

Thanks. I will sign off all commits.

@dmcgowan dmcgowan moved this from New to Needs Discussion in Code Review Oct 1, 2022
@pacoxu pacoxu requested review from mikebrow and samuelkarp and removed request for samuelkarp and mikebrow December 5, 2022 05:39
Member

@mikebrow mikebrow left a comment


See comments.

Outdated review comments (resolved) on pkg/cri/sbserver/image_pull.go, pkg/cri/server/image_pull.go, and pkg/cri/sbserver/metrics.go.
imagePullThroughput.WithLabelValues(domain).Observe(imagePullingSpeed)

log.G(ctx).Infof("Pulled image %q with image id %q, repo tag %q, repo digest %q, size %q in %v s", imageRef, imageID,
repoTag, repoDigest, strconv.FormatInt(size, 10), time.Since(startTime).Seconds())
Member


Nod, one metric per domain ref is probably enough.

If someone needs per-image pull data, maybe that should be retrieved from the log?

We might need to add the size and duration to the CRI PullImage response.

Outdated review comments (resolved) on pkg/cri/sbserver/metrics.go, pkg/cri/sbserver/image_pull.go, and pkg/cri/server/image_pull.go.
imagePullThroughput.WithLabelValues(domain).Observe(imagePullingSpeed)

log.G(ctx).Infof("Pulled image %q with image id %q, repo tag %q, repo digest %q, size %q in %v s", imageRef, imageID,
repoTag, repoDigest, strconv.FormatInt(size, 10), time.Since(startTime).Seconds())
Member


I think per domain by default is too granular/too high cardinality.

@pacoxu pacoxu force-pushed the image-pull-metrics branch 3 times, most recently from 8e53068 to 83ba4b4 Compare December 7, 2022 03:20
Contributor Author

pacoxu commented Dec 7, 2022

@mikebrow @samuelkarp Thanks for your review.
I removed the registry/domain label from the imagePullThroughput histogram metric.

@samuelkarp samuelkarp dismissed their stale review December 7, 2022 06:53

Changes made

@samuelkarp
Member

Code LGTM, but can you squash your commits into one?

…nt; thoughput

Signed-off-by: Paco Xu <paco.xu@daocloud.io>
Contributor Author

pacoxu commented Dec 7, 2022

Squashed.

Member

@mikebrow mikebrow left a comment


LGTM on green


smitphilip commented Sep 1, 2023

Apologies for commenting on an old PR, but any ideas on whether and how CRI metrics can be enabled for an AKS cluster running containerd?

I'd like to measure image pull times for a cluster, and getting them from the CRI metrics is the only way I have found so far.

Contributor Author

pacoxu commented Sep 1, 2023

Apologies for commenting on an old PR, but any ideas on whether and how CRI metrics can be enabled for an AKS cluster running containerd?

By CRI metrics, do you mean the kubelet metrics that use CRI data (kubernetes/enhancements#2371)?

@shivam99aa

Hi, if I enable the PodAndContainerStatsFromCRI feature gate, will that be enough for the kubelet to scrape these metrics from CRI directly?

Contributor Author

pacoxu commented Jan 23, 2024

Hi, if I enable the PodAndContainerStatsFromCRI feature gate, will that be enough for the kubelet to scrape these metrics from CRI directly?

We may discuss it in kubernetes/enhancements#2371. @shivam99aa
