metrics: collect the metrics about snapshotter #261

Merged (1 commit) on Dec 9, 2022

@sctb512 sctb512 (Member) commented Nov 24, 2022

Collect the metrics of preparing snapshots, mounting snapshots, and removing snapshots.
In addition, collect the metrics about snapshotter, including cache usage, cleanup time, cpu usage, memory usage, file descriptor counts, run time and thread counts.

Command:

curl -s --unix-socket /var/lib/containerd-nydus/api.sock  http://unix/metrics | grep -e "event" -e "cache_usage" -e "cleanup" -e "cpu" -e "memory" -e "fd_counts" -e "run_time" -e "thread"

Result:

# HELP nydusd_lifetime_event_counts The lifetime events of nydus daemon.
# TYPE nydusd_lifetime_event_counts counter
nydusd_lifetime_event_counts{nydusd_event="DESTROYED"} 1
nydusd_lifetime_event_counts{nydusd_event="RUNNING"} 12
# HELP snapshotter_cache_usage_kilobytes Disk usage of snapshotter local cache.
# TYPE snapshotter_cache_usage_kilobytes gauge
snapshotter_cache_usage_kilobytes 67444
# HELP snapshotter_cpu_system_time_seconds CPU time of snapshotter in system.
# TYPE snapshotter_cpu_system_time_seconds gauge
snapshotter_cpu_system_time_seconds 0
# HELP snapshotter_cpu_usage_percent Cpu usage percent of snapshotter.
# TYPE snapshotter_cpu_usage_percent gauge
snapshotter_cpu_usage_percent 0.02
# HELP snapshotter_cpu_user_time_seconds CPU time of snapshotter in user.
# TYPE snapshotter_cpu_user_time_seconds gauge
snapshotter_cpu_user_time_seconds 0.01
# HELP snapshotter_fd_counts Fd counts of snapshotter.
# TYPE snapshotter_fd_counts gauge
snapshotter_fd_counts 35
# HELP snapshotter_memory_usage_kilobytes Memory usage (RSS) of snapshotter.
# TYPE snapshotter_memory_usage_kilobytes gauge
snapshotter_memory_usage_kilobytes 31616
# HELP snapshotter_run_time_seconds Run time of snapshotter from starting.
# TYPE snapshotter_run_time_seconds gauge
snapshotter_run_time_seconds 120.03
# HELP snapshotter_snapshot_event_elapsed_milliseconds The elapsed time for snapshot events.
# TYPE snapshotter_snapshot_event_elapsed_milliseconds histogram
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="0.1"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="0.15"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="0.2"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="0.3"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="0.5"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="1"} 1
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="1.5"} 1
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="2"} 2
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="3"} 2
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="5"} 2
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="10"} 2
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="25"} 3
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="60"} 3
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="CLEANUP",le="+Inf"} 3
snapshotter_snapshot_event_elapsed_milliseconds_sum{snapshot_event="CLEANUP"} 13.231301
snapshotter_snapshot_event_elapsed_milliseconds_count{snapshot_event="CLEANUP"} 3
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="0.1"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="0.15"} 7
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="0.2"} 8
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="0.3"} 10
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="0.5"} 11
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="1"} 14
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="1.5"} 14
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="2"} 14
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="3"} 14
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="5"} 15
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="10"} 15
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="25"} 15
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="60"} 15
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="MOUNT",le="+Inf"} 22
snapshotter_snapshot_event_elapsed_milliseconds_sum{snapshot_event="MOUNT"} 1209.1838229999998
snapshotter_snapshot_event_elapsed_milliseconds_count{snapshot_event="MOUNT"} 22
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="0.1"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="0.15"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="0.2"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="0.3"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="0.5"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="1"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="1.5"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="2"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="3"} 4
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="5"} 13
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="10"} 31
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="25"} 47
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="60"} 53
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="PREPARE",le="+Inf"} 53
snapshotter_snapshot_event_elapsed_milliseconds_sum{snapshot_event="PREPARE"} 607.935701
snapshotter_snapshot_event_elapsed_milliseconds_count{snapshot_event="PREPARE"} 53
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="0.1"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="0.15"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="0.2"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="0.3"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="0.5"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="1"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="1.5"} 0
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="2"} 5
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="3"} 15
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="5"} 22
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="10"} 26
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="25"} 26
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="60"} 26
snapshotter_snapshot_event_elapsed_milliseconds_bucket{snapshot_event="REMOVE",le="+Inf"} 26
snapshotter_snapshot_event_elapsed_milliseconds_sum{snapshot_event="REMOVE"} 84.949005
snapshotter_snapshot_event_elapsed_milliseconds_count{snapshot_event="REMOVE"} 26
# HELP snapshotter_thread_counts Thread counts of snapshotter.
# TYPE snapshotter_thread_counts gauge
snapshotter_thread_counts 15
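
One way to read the histogram above is to divide each `_sum` by its `_count`, which gives the mean elapsed time per snapshot event. The following is a minimal sketch that does this for the four `_sum`/`_count` pairs copied from the output; the helper name `meanByEvent` and the simple quote-based parsing are illustrative assumptions, not part of this PR.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

const metric = "snapshotter_snapshot_event_elapsed_milliseconds"

// sample holds the _sum and _count lines copied from the output above.
const sample = `snapshotter_snapshot_event_elapsed_milliseconds_sum{snapshot_event="CLEANUP"} 13.231301
snapshotter_snapshot_event_elapsed_milliseconds_count{snapshot_event="CLEANUP"} 3
snapshotter_snapshot_event_elapsed_milliseconds_sum{snapshot_event="MOUNT"} 1209.1838229999998
snapshotter_snapshot_event_elapsed_milliseconds_count{snapshot_event="MOUNT"} 22
snapshotter_snapshot_event_elapsed_milliseconds_sum{snapshot_event="PREPARE"} 607.935701
snapshotter_snapshot_event_elapsed_milliseconds_count{snapshot_event="PREPARE"} 53
snapshotter_snapshot_event_elapsed_milliseconds_sum{snapshot_event="REMOVE"} 84.949005
snapshotter_snapshot_event_elapsed_milliseconds_count{snapshot_event="REMOVE"} 26
`

// meanByEvent returns sum/count in milliseconds for every snapshot_event
// label value that has both a _sum and a _count line in text.
func meanByEvent(text string) map[string]float64 {
	sums := map[string]float64{}
	counts := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		var target map[string]float64
		switch {
		case strings.HasPrefix(line, metric+"_sum{"):
			target = sums
		case strings.HasPrefix(line, metric+"_count{"):
			target = counts
		default:
			continue // comments, buckets, and unrelated metrics
		}
		// The label value sits between the first pair of double quotes.
		parts := strings.SplitN(line, `"`, 3)
		if len(parts) != 3 {
			continue
		}
		fields := strings.Fields(parts[2]) // e.g. ["}", "13.231301"]
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		target[parts[1]] = v
	}
	means := map[string]float64{}
	for ev, c := range counts {
		if s, ok := sums[ev]; ok && c > 0 {
			means[ev] = s / c
		}
	}
	return means
}

func main() {
	means := meanByEvent(sample)
	events := make([]string, 0, len(means))
	for ev := range means {
		events = append(events, ev)
	}
	sort.Strings(events)
	for _, ev := range events {
		fmt.Printf("%-8s mean %.2f ms\n", ev, means[ev])
	}
}
```

With these numbers the MOUNT mean comes out at roughly 55 ms, far above the other events, which agrees with the buckets: 7 of the 22 MOUNT observations fall beyond the largest finite bucket (`le="60"`).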

Signed-off-by: Bin Tang <tangbin.bin@bytedance.com>

@sctb512 sctb512 force-pushed the metrics branch 2 times, most recently from 1f2591c to f81be53, on November 24, 2022 06:34
codecov-commenter commented Nov 24, 2022

Codecov Report

Base: 36.03% // Head: 35.87% // Decreases project coverage by -0.16% ⚠️

Coverage data is based on head (f27811e) compared to base (44bf88f).
Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #261      +/-   ##
==========================================
- Coverage   36.03%   35.87%   -0.17%     
==========================================
  Files          28       28              
  Lines        2858     2871      +13     
==========================================
  Hits         1030     1030              
- Misses       1721     1734      +13     
  Partials      107      107              
Impacted Files Coverage Δ
pkg/manager/daemon_adaptor.go 0.00% <0.00%> (ø)
pkg/manager/manager.go 20.95% <0.00%> (-0.44%) ⬇️
pkg/metrics/ttl/gauge.go 100.00% <ø> (ø)

@sctb512 sctb512 force-pushed the metrics branch 3 times, most recently from f2ab06b to 7253281, on November 24, 2022 08:39
@sctb512 sctb512 marked this pull request as ready for review November 24, 2022 09:17
@sctb512 sctb512 force-pushed the metrics branch 5 times, most recently from 0d218cb to 67d8b08, on November 25, 2022 03:47
@changweige (Member) commented:

Can you rebase this PR? The snapshotter fails to start:

-- The job identifier is 98342.
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: panic: runtime error: invalid memory address or nil pointer dereference
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1225eb2]
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: goroutine 1 [running]:
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: github.com/containerd/nydus-snapshotter/pkg/daemon.(*Rafs).BootstrapFile(0x0)
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]:         /home/gechangwei/git_repo/nydus-snapshotter/pkg/daemon/rafs.go:161 +0x52
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: github.com/containerd/nydus-snapshotter/pkg/manager.(*Manager).buildDaemonCommand(0xc0005d8630, 0xc00048a7e0)
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]:         /home/gechangwei/git_repo/nydus-snapshotter/pkg/manager/daemon_adaptor.go:122 +0x67e
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: github.com/containerd/nydus-snapshotter/pkg/manager.(*Manager).StartDaemon(0xc0005d8630, 0xc00048a7e0)
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]:         /home/gechangwei/git_repo/nydus-snapshotter/pkg/manager/daemon_adaptor.go:25 +0x76
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: github.com/containerd/nydus-snapshotter/pkg/filesystem/fs.NewFileSystem({0x187df98, 0xc0001b5000}, {0xc0005fd860, 0xe, 0x1646694?})
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]:         /home/gechangwei/git_repo/nydus-snapshotter/pkg/filesystem/fs/fs.go:107 +0x29d
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: github.com/containerd/nydus-snapshotter/snapshot.NewSnapshotter({0x187df98?, 0xc0001b5000}, 0xc00059f820)
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]:         /home/gechangwei/git_repo/nydus-snapshotter/snapshot/snapshot.go:166 +0xad0
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: github.com/containerd/nydus-snapshotter/cmd/containerd-nydus-grpc/app/snapshotter.Start({_, _}, {{0x16846db, 0x30}, 0x0, {0x7fffc5466e60, 0x16}, {0x1672ac5, 0x27}, {0x165d482, ...}, ...})
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]:         /home/gechangwei/git_repo/nydus-snapshotter/cmd/containerd-nydus-grpc/app/snapshotter/main.go:23 +0x79
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: main.main.func1(0xc00056c400?)
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]:         /home/gechangwei/git_repo/nydus-snapshotter/cmd/containerd-nydus-grpc/main.go:70 +0x368
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: github.com/urfave/cli/v2.(*App).RunContext(0xc0005a24e0, {0x187dfd0?, 0xc0001ac000}, {0xc0001aa120, 0x6, 0x6})
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]:         /home/gechangwei/go/pkg/mod/github.com/urfave/cli/v2@v2.3.0/app.go:322 +0x953
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: github.com/urfave/cli/v2.(*App).Run(...)
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]:         /home/gechangwei/go/pkg/mod/github.com/urfave/cli/v2@v2.3.0/app.go:224
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]: main.main()
Nov 28 15:31:36 n227-089-202 containerd-nydus-grpc[2892852]:         /home/gechangwei/git_repo/nydus-snapshotter/cmd/containerd-nydus-grpc/main.go:73 +0x1f1
Nov 28 15:31:36 n227-089-202 systemd[1]: nydus-snapshotter.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

defer func(id *string) {
	metricTime := time.Since(metricBeginTime)

	if err := exporter.ExportMountTimeMetric(*id, metricTime.Seconds()); err != nil {
Member

Seconds are too coarse. Can we use milliseconds, or float values?

Member Author

It seems that metricTime.Seconds() already returns a float value in seconds:

// Seconds returns the duration as a floating point number of seconds.
func (d Duration) Seconds() float64 {
	sec := d / Second
	nsec := d % Second
	return float64(sec) + float64(nsec)/1e9
}

defer func(id *string) {
	metricTime := time.Since(metricBeginTime)

	if err := exporter.ExportPrepareTimeMetric(*id, metricTime.Seconds()); err != nil {
Member

It's better to record the snapshot key or ID as a label so we can analyze how much time containerd spends between calling the two APIs.

Member Author

@sctb512 sctb512 Nov 29, 2022

Yes, it records the snapshot ID now.

@sctb512 sctb512 changed the title from "metrics: collect metrics about snapshotter" to "metrics: collect the metrics about snapshotter" on Nov 29, 2022

func (s *SnapshotterMetricsCollector) Collect(snapshotID string, begin, end time.Time, method SnapshotterMethod) {
	beginTime := begin.Format("2006-01-02 15:04:05.000")
	endTime := end.Format("2006-01-02 15:04:05.000")
	elapsed, _ := strconv.ParseFloat(fmt.Sprintf("%.6f", end.Sub(begin).Seconds()*1000), 64)
Member

Call Milliseconds() rather than Seconds()*1000?

Member Author

Yes, replaced it.

Member Author

"Call Milliseconds() rather than Seconds()*1000?"

Shall we replace it with float64(Nanoseconds()) / 1e6 to keep nanosecond precision?
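
The difference between the options discussed in this thread can be seen in a small sketch. `Duration.Seconds()` returns a `float64`, `Duration.Milliseconds()` returns an `int64` and therefore drops the sub-millisecond part, and dividing `Nanoseconds()` by 1e6 keeps it. The helper names `truncatedMillis` and `elapsedMillis` are made up for illustration and do not come from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// truncatedMillis uses Duration.Milliseconds, which returns an integer
// and therefore discards anything below one millisecond.
func truncatedMillis(d time.Duration) int64 {
	return d.Milliseconds()
}

// elapsedMillis converts a duration to fractional milliseconds.
func elapsedMillis(d time.Duration) float64 {
	return float64(d.Nanoseconds()) / 1e6
}

func main() {
	d := 1500 * time.Microsecond // 1.5 ms

	fmt.Println("Seconds():          ", d.Seconds())
	fmt.Println("Milliseconds():     ", truncatedMillis(d))
	fmt.Println("Nanoseconds()/1e6:  ", elapsedMillis(d))
}
```

Since most of the observed PREPARE and REMOVE times above are only a few milliseconds, the fractional form avoids collapsing them onto whole-millisecond values.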

}

func (s *SnapshotterCacheUsageCollector) Collect(path string) {
c, b := exec.Command("du", "-bs", path), new(strings.Builder)
Member

This relies on the external du command. Please try fs.DiskUsage instead.
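
To illustrate the idea of measuring the cache directory in-process rather than spawning `du`, here is a minimal sketch that uses only the Go standard library. It is a stand-in for illustration, not the `fs.DiskUsage` helper the reviewer refers to: it sums apparent file sizes and, unlike `du`, does not account for hard links or allocated blocks. The names `dirSize` and `demo` are hypothetical.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// dirSize sums the apparent sizes (in bytes) of all regular files under root.
func dirSize(root string) (int64, error) {
	var total int64
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // propagate errors hit while walking
		}
		if !d.Type().IsRegular() {
			return nil // skip directories, symlinks, devices, ...
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		total += info.Size()
		return nil
	})
	if err != nil {
		return 0, fmt.Errorf("walk %s: %w", root, err)
	}
	return total, nil
}

// demo builds a throwaway directory holding 11 bytes of data and measures it.
func demo() (int64, error) {
	dir, err := os.MkdirTemp("", "cache-usage-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	if err := os.MkdirAll(filepath.Join(dir, "blobs"), 0o755); err != nil {
		return 0, err
	}
	if err := os.WriteFile(filepath.Join(dir, "a"), []byte("hello"), 0o644); err != nil {
		return 0, err
	}
	if err := os.WriteFile(filepath.Join(dir, "blobs", "b"), []byte("nydus!"), 0o644); err != nil {
		return 0, err
	}
	return dirSize(dir)
}

func main() {
	n, err := demo()
	if err != nil {
		fmt.Println("measure failed:", err)
		return
	}
	fmt.Println("bytes used:", n)
}
```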

Member

The function can be simplified like this:

func (s *SnapshotterMetricsCollector) Collect(snapshotID string, begin, end time.Time, method SnapshotterMethod) {
	beginTime := begin.Format("2006-01-02 15:04:05.000")
	endTime := end.Format("2006-01-02 15:04:05.000")
	elapsed, _ := strconv.ParseFloat(fmt.Sprintf("%.6f", end.Sub(begin).Seconds()*1000), 64)

	var collector *prometheus.GaugeVec

	switch method {
	case SnapshotterMethodPrepare:
		collector = data.PrepareTime
	case SnapshotterMethodMount:
		collector = data.MountTime
	case SnapshotterMethodRemove:
		collector = data.RemoveTime
	case SnapshotterMethodUnknown:
		fallthrough
	default:
		log.L.Warnf("Unknown method: %s", method)
	}

	if collector != nil {
		collector.WithLabelValues(snapshotID, beginTime, endTime).Set(elapsed)
	}
}

Member Author

Got it. Thanks.

SnapshotterMethodRemove SnapshotterMethod = "REMOVE"
)

func (s *SnapshotterMetricsCollector) Collect(snapshotID string, begin, end time.Time, method SnapshotterMethod) {
Member

@changweige changweige Nov 29, 2022

No need to pass the end time. Just take the current time as the end time inside this function; it will be neater.

Member Author

Fixed.

var (
	snapshotIDLabel = "snapshot_id"
	beginLabel      = "begin"
	endLabel        = "end"
Member

No need to record the end time. We already know the start time and the elapsed time.

Help: "The time to remove a snapshot.",
},
[]string{snapshotIDLabel, beginLabel, endLabel},
)
Member

Can we also add a metric for the Cleanup interface?

Member Author

Done.

Member

@changweige changweige left a comment

Otherwise, looks good

@@ -217,6 +219,10 @@ func NewSnapshotter(ctx context.Context, cfg *config.Config) (snapshots.Snapshot
}

func (o *snapshotter) Cleanup(ctx context.Context) error {
	metricBeginTime := time.Now()
	defer func() {
		go collector.CollectSnapshotterMetrics(metricBeginTime)
Member

Member Author

Done. Thanks.

Member

@changweige changweige left a comment

For snapshotter interfaces metrics, we need to consider how to evict long-lived metrics that are never fetched.

@sctb512 sctb512 force-pushed the metrics branch 3 times, most recently from db6926e to 7db79d4, on December 1, 2022 03:23
@sctb512 sctb512 force-pushed the metrics branch 6 times, most recently from 08c9888 to ba1744a, on December 1, 2022 07:47
@@ -365,6 +365,10 @@ func NewManager(opt Opt) (*Manager, error) {
return mgr, nil
}

func (m *Manager) GetCacheDir() string {
Member

There is already a similar method, CacheDir.

Member Author

Yes, but that method belongs to another package, cache. I will rename GetCacheDir to CacheDir.

Name: "snapshotter_prepare_snapshot_time_milliseconds",
Help: "The time to prepare a snapshot.",
},
[]string{snapshotIDLabel, beginLabel},
Member

Nice work! What worries me most is the ever-expanding gauge vector, which allocates more memory for every record with a different label set. This means a long-lived nydus-snapshotter will consume a lot of memory. We have to consider how to release old metric records or drop the labels.

Member Author

Shall we replace it with GaugeWithTTL? It will clean up expired metrics.

Member

In fact, I don't think the snapshot ID and begin-time labels are strictly necessary. We can record metrics only for snapshots that involve a nydus meta layer, which are much more time-consuming. Then a plain Gauge should meet our requirements.

Member Author

OK, got it.
Since we do not care about the snapshot ID and begin time, maybe a Summary is more suitable, because this is not a time-series metric?
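
The memory concern in this thread can be made concrete with a small standard-library sketch. A map keyed by snapshot ID (standing in for a label-per-snapshot gauge vector) grows with every new snapshot, while a fixed-bucket histogram keeps a constant number of counters. This is an illustration only: the `histogram` type and the bucket bounds 1, 10 and 60 ms are assumptions, and the code does not use the Prometheus client library.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram keeps one counter per bucket, so its size stays fixed no matter
// how many snapshots are observed.
type histogram struct {
	bounds []float64 // upper bounds in milliseconds, ascending
	counts []uint64  // counts[i] belongs to bounds[i]; the last slot is +Inf
	sum    float64
	count  uint64
}

func newHistogram(bounds ...float64) *histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// observe adds one sample to the first bucket whose upper bound is >= ms.
func (h *histogram) observe(ms float64) {
	i := sort.SearchFloat64s(h.bounds, ms) // len(bounds) means the +Inf bucket
	h.counts[i]++
	h.sum += ms
	h.count++
}

// cumulative returns running totals, the way "le" buckets are exposed.
func (h *histogram) cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

// demo records 1000 synthetic samples both ways and reports how many
// series each approach ends up holding.
func demo() (labelSeries, bucketSeries int, cum []uint64) {
	perSnapshot := map[string]float64{} // one entry per snapshot ID
	h := newHistogram(1, 10, 60)
	for i := 0; i < 1000; i++ {
		ms := float64(i % 100) // synthetic elapsed times 0..99 ms
		perSnapshot[fmt.Sprintf("snapshot-%d", i)] = ms
		h.observe(ms)
	}
	return len(perSnapshot), len(h.counts), h.cumulative()
}

func main() {
	labels, buckets, cum := demo()
	fmt.Println("per-snapshot series:", labels)
	fmt.Println("histogram buckets:  ", buckets)
	fmt.Println("cumulative counts:  ", cum)
}
```

The trade-off is that the histogram can no longer answer questions about one particular snapshot, which matches the conclusion above that the snapshot ID and begin time are not needed.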

}

// Collect snapshotter metrics.
s.snCollector.Collect()
Member

This is where the nydusd I/O metrics are collected. The snapshotter should have its metrics by default.

Member Author

Yes, I will fix this.

return prometheus.NewTimer(prometheus.ObserverFunc(f))
}

func CollectElapsedTimeWithBeginLabel(g *prometheus.GaugeVec) *prometheus.Timer {
Member

Can we unify the timers measuring Prepare/Mount/Remove/Cleanup by using this helper function?

Member Author

Yes, Done.
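
The pattern of one shared timer helper for all snapshot operations can be sketched without the Prometheus client. In the sketch below, `startTimer` plays the role that a timer wrapping an observer plays in the diff above: it captures the start time and returns a stop function suitable for `defer`. The `recorder` type, the event names, and the sleeps are hypothetical stand-ins; the recorder is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// startTimer records the start time and returns a stop function that
// reports the elapsed time in fractional milliseconds to observe.
func startTimer(observe func(ms float64)) (stop func()) {
	begin := time.Now()
	return func() {
		observe(float64(time.Since(begin).Nanoseconds()) / 1e6)
	}
}

// recorder is a stand-in for a metrics backend: it stores samples per event.
type recorder struct {
	samples map[string][]float64
}

// observer returns a callback that files samples under the given event.
func (r *recorder) observer(event string) func(ms float64) {
	return func(ms float64) {
		r.samples[event] = append(r.samples[event], ms)
	}
}

// prepare and cleanup simulate two snapshotter operations. Each one needs
// only a single deferred line to be timed.
func prepare(r *recorder) {
	defer startTimer(r.observer("PREPARE"))()
	time.Sleep(2 * time.Millisecond)
}

func cleanup(r *recorder) {
	defer startTimer(r.observer("CLEANUP"))()
	time.Sleep(time.Millisecond)
}

func demo() *recorder {
	r := &recorder{samples: map[string][]float64{}}
	prepare(r)
	cleanup(r)
	return r
}

func main() {
	r := demo()
	for event, s := range r.samples {
		fmt.Printf("%s: %d sample(s), first %.3f ms\n", event, len(s), s[0])
	}
}
```

Note that `startTimer(...)` runs at the `defer` statement, so the clock starts on function entry, and only the returned stop function is postponed until the function returns.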

}
infos := strings.Split(splitAfterStat[1], " ")

files, _ := os.ReadDir(path.Join("/proc", strconv.Itoa(pid), "fdinfo"))
Member

At least give an explicit error message; don't ignore errors.

Member Author

Sorry, fixed it.
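
A minimal sketch of the file descriptor count with the error surfaced instead of discarded might look like the following. It reads the same `/proc/<pid>/fdinfo` directory as the snippet above and is therefore Linux-specific; the function name `countFDs` is an assumption, not the name used in this PR.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// countFDs returns the number of entries in /proc/<pid>/fdinfo, which is
// one per open file descriptor. The error is wrapped with the path so the
// caller can log something explicit.
func countFDs(pid int) (int, error) {
	dir := filepath.Join("/proc", strconv.Itoa(pid), "fdinfo")
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, fmt.Errorf("read %s: %w", dir, err)
	}
	return len(entries), nil
}

func main() {
	n, err := countFDs(os.Getpid())
	if err != nil {
		// Expected on systems without procfs.
		fmt.Println("fd count unavailable:", err)
		return
	}
	fmt.Println("open fds:", n)
}
```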

@sctb512 sctb512 force-pushed the metrics branch 2 times, most recently from 9a7b431 to ddcd687, on December 7, 2022 08:31
@changweige (Member) commented:

From the metrics output, we'd better add buckets for le=100, le=150, le=200, le=300, and le=500. There is no need to create buckets smaller than 0.25.

floatVal, err := strconv.ParseFloat(val, 64)
if err != nil {
	log.L.Warnf("parse float failed, error: %v", err)
}
Member

We should ignore error

Member Author

Fixed.

statBytes, err := os.ReadFile(path.Join("/proc", strconv.Itoa(pid), "stat"))
if err != nil {
	log.L.Warnf("get stat failed: %v", err)
	return nil
Member

We should wrap the error and return it to caller

Member Author

Fixed.
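
As a sketch of what wrapping and returning the error can look like for the stat data, the function below parses one line in the `/proc/<pid>/stat` format and returns a wrapped error for every failure instead of logging and continuing. The field positions (utime 14, stime 15, num_threads 20, rss 24) follow the proc(5) manual page; the `procStat` type, the `parseStat` name, and the sample line are made up for illustration and are not taken from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// sampleStat is a fabricated line in /proc/<pid>/stat format. The comm
// field deliberately contains a space.
const sampleStat = "1234 (nydus grpc) S 1 1234 1234 0 -1 4194560 500 0 0 0 7 3 0 0 20 0 15 0 100 123456789 7904"

// procStat holds the few stat fields used for process metrics.
type procStat struct {
	UserTicks   uint64 // utime, field 14
	SystemTicks uint64 // stime, field 15
	Threads     int64  // num_threads, field 20
	RSSPages    int64  // rss, field 24
}

// parseStat extracts selected fields and returns a wrapped error on failure.
func parseStat(line string) (procStat, error) {
	// comm (field 2) may contain spaces or parentheses, so split after
	// the last ')' rather than on the first space.
	end := strings.LastIndex(line, ")")
	if end < 0 {
		return procStat{}, errors.New("parse stat: no closing ')' after comm")
	}
	f := strings.Fields(line[end+1:]) // f[0] is field 3 (state), so field n is f[n-3]
	if len(f) < 22 {
		return procStat{}, fmt.Errorf("parse stat: %d fields after comm, want at least 22", len(f))
	}
	utime, err := strconv.ParseUint(f[11], 10, 64)
	if err != nil {
		return procStat{}, fmt.Errorf("parse utime %q: %w", f[11], err)
	}
	stime, err := strconv.ParseUint(f[12], 10, 64)
	if err != nil {
		return procStat{}, fmt.Errorf("parse stime %q: %w", f[12], err)
	}
	threads, err := strconv.ParseInt(f[17], 10, 64)
	if err != nil {
		return procStat{}, fmt.Errorf("parse num_threads %q: %w", f[17], err)
	}
	rss, err := strconv.ParseInt(f[21], 10, 64)
	if err != nil {
		return procStat{}, fmt.Errorf("parse rss %q: %w", f[21], err)
	}
	return procStat{UserTicks: utime, SystemTicks: stime, Threads: threads, RSSPages: rss}, nil
}

func main() {
	s, err := parseStat(sampleStat)
	if err != nil {
		fmt.Println("stat unavailable:", err)
		return
	}
	fmt.Printf("%+v\n", s)
}
```

The tick and page counts still need converting (clock ticks per second, page size) before they can be reported as seconds and kilobytes.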

@sctb512 sctb512 force-pushed the metrics branch 3 times, most recently from c22f652 to 956cc01, on December 9, 2022 03:07
Collect the metrics of preparing snapshots, mounting snapshots, and removing snapshots.
In addition, collect the metrics about snapshotter, including cache usage,
cleanup time, cpu usage, memory usage, file descriptor counts, run time and thread counts.

Signed-off-by: Bin Tang <tangbin.bin@bytedance.com>
Member

@changweige changweige left a comment

Well done. Thanks.

@changweige changweige merged commit 0374a5b into containerd:main Dec 9, 2022
3 participants