-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: collect the metrics about snapshotter #261
Conversation
1f2591c
to
f81be53
Compare
Codecov ReportBase: 36.03% // Head: 35.87% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #261 +/- ##
==========================================
- Coverage 36.03% 35.87% -0.17%
==========================================
Files 28 28
Lines 2858 2871 +13
==========================================
Hits 1030 1030
- Misses 1721 1734 +13
Partials 107 107
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
f2ab06b
to
7253281
Compare
0d218cb
to
67d8b08
Compare
Can you rebase this PR, it can't start
|
snapshot/snapshot.go
Outdated
defer func(id *string) { | ||
metricTime := time.Since(metricBeginTime) | ||
|
||
if err := exporter.ExportMountTimeMetric(*id, metricTime.Seconds()); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seconds are too rough, can we use milliseconds or use float values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the metricTime.Seconds()
is already a float value for seconds.
// Seconds returns the duration as a floating point number of seconds.
func (d Duration) Seconds() float64 {
sec := d / Second
nsec := d % Second
return float64(sec) + float64(nsec)/1e9
}
snapshot/snapshot.go
Outdated
defer func(id *string) { | ||
metricTime := time.Since(metricBeginTime) | ||
|
||
if err := exporter.ExportPrepareTimeMetric(*id, metricTime.Seconds()); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to record the snapshot key or ID as a record label so we can analyze how long contained spans calling two APIes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it records the snapshot ID now.
pkg/metrics/collector/snapshotter.go
Outdated
func (s *SnapshotterMetricsCollector) Collect(snapshotID string, begin, end time.Time, method SnapshotterMethod) { | ||
beginTime := begin.Format("2006-01-02 15:04:05.000") | ||
endTime := end.Format("2006-01-02 15:04:05.000") | ||
elapsed, _ := strconv.ParseFloat(fmt.Sprintf("%.6f", end.Sub(begin).Seconds()*1000), 64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call Milliseconds
rather than Seconds/1000 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, replaced it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call
Milliseconds
rather than Seconds/1000 ?
Shall we replace it with Nanoseconds()) / 1e6
to show nanoseconds?
pkg/metrics/collector/snapshotter.go
Outdated
} | ||
|
||
func (s *SnapshotterCacheUsageCollector) Collect(path string) { | ||
c, b := exec.Command("du", "-bs", path), new(strings.Builder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It relies on bash shell, please try fs.DiskUsage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function can be simplified like:
func (s *SnapshotterMetricsCollector) Collect(snapshotID string, begin, end time.Time, method SnapshotterMethod) {
beginTime := begin.Format("2006-01-02 15:04:05.000")
endTime := end.Format("2006-01-02 15:04:05.000")
elapsed, _ := strconv.ParseFloat(fmt.Sprintf("%.6f", end.Sub(begin).Seconds()*1000), 64)
var collector *prometheus.GaugeVec
switch method {
case SnapshotterMethodPrepare:
collector = data.PrepareTime
case SnapshotterMethodMount:
collector = data.MountTime
case SnapshotterMethodRemove:
collector = data.RemoveTime
case SnapshotterMethodUnknown:
fallthrough
default:
log.L.Warnf("Unknown method: %s", method)
}
if collector != nil {
collector.WithLabelValues(snapshotID, beginTime, endTime).Set(elapsed)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks.
pkg/metrics/collector/snapshotter.go
Outdated
SnapshotterMethodRemove SnapshotterMethod = "REMOVE" | ||
) | ||
|
||
func (s *SnapshotterMetricsCollector) Collect(snapshotID string, begin, end time.Time, method SnapshotterMethod) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to pass end time. Just get the current time as the end time in this function. It will be neater
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
pkg/metrics/data/snapshotter.go
Outdated
var ( | ||
snapshotIDLabel = "snapshot_id" | ||
beginLabel = "begin" | ||
endLabel = "end" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to record end time. We already know the start time and eclapse
Help: "The time to remove a snapshot.", | ||
}, | ||
[]string{snapshotIDLabel, beginLabel, endLabel}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add a metric for Cleanup interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, looks good
snapshot/snapshot.go
Outdated
@@ -217,6 +219,10 @@ func NewSnapshotter(ctx context.Context, cfg *config.Config) (snapshots.Snapshot | |||
} | |||
|
|||
func (o *snapshotter) Cleanup(ctx context.Context) error { | |||
metricBeginTime := time.Now() | |||
defer func() { | |||
go collector.CollectSnapshotterMetrics(metricBeginTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try Prometheus native API to do this https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Timer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For snapshotter interfaces metrics, we need to consider how to evict long-lived metrics that are never fetched.
db6926e
to
7db79d4
Compare
08c9888
to
ba1744a
Compare
pkg/manager/manager.go
Outdated
@@ -365,6 +365,10 @@ func NewManager(opt Opt) (*Manager, error) { | |||
return mgr, nil | |||
} | |||
|
|||
func (m *Manager) GetCacheDir() string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a similar method CacheDir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emm... Yes, but this method is belong to another package cache
. I will change GetCacheDir
to CacheDir
.
pkg/metrics/data/snapshotter.go
Outdated
Name: "snapshotter_prepare_snapshot_time_milliseconds", | ||
Help: "The time to prepare a snapshot.", | ||
}, | ||
[]string{snapshotIDLabel, beginLabel}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! What worries me most is the expanding gauge vector which will allocate more memory for records with different labels. It means a long-lived nydus-snapshotter will consume much memory. We have to consider about how to release old metric records or ignore the labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we replace it with GaugeWithTTL
? It will clean up expired metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I think the labels snapshot id
and begin time
are not strongly necessary. We can only record metrics for a snapshot that involves nydus meta layer which is much more time-consuming. Than Gauge should meet our requirements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I got it.
We do not care about snapshot id
and begin time
. Maybe the Summary
is more suitable? Because it is not a time-series metric.
pkg/metrics/serve.go
Outdated
} | ||
|
||
// Collect snapshotter metrics. | ||
s.snCollector.Collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where collecting nydusd io metrics. Snapshotter should by default has metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will fix this.
pkg/metrics/tool/common.go
Outdated
return prometheus.NewTimer(prometheus.ObserverFunc(f)) | ||
} | ||
|
||
func CollectElapsedTimeWithBeginLabel(g *prometheus.GaugeVec) *prometheus.Timer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we unify the timers measuring Prepare/Mounts/Remove/Cleanup by using this helper function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Done.
pkg/metrics/tool/stat.go
Outdated
} | ||
infos := strings.Split(splitAfterStat[1], " ") | ||
|
||
files, _ := os.ReadDir(path.Join("/proc", strconv.Itoa(pid), "fdinfo")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least give a explicit error message, don't ignore errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, fixed it.
9a7b431
to
ddcd687
Compare
From the metrics output, we'd better have bucket le=100, le=150, le=200, le=300, le=500. No need to create buckets that are less than 0.25 |
pkg/metrics/tool/common.go
Outdated
floatVal, err := strconv.ParseFloat(val, 64) | ||
if err != nil { | ||
log.L.Warnf("parse float failed, error: %v", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should ignore error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
pkg/metrics/tool/stat.go
Outdated
statBytes, err := os.ReadFile(path.Join("/proc", strconv.Itoa(pid), "stat")) | ||
if err != nil { | ||
log.L.Warnf("get stat failed: %v", err) | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should wrap the error and return it to caller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
c22f652
to
956cc01
Compare
Collect the metrics of preparing snapshots, mounting snapshots, and removing snapshots. In addition, collect the metrics about snapshotter, including cache usage, cleanup time, cpu usage, memory usage, file descriptor counts, run time and thread counts. Signed-off-by: Bin Tang <tangbin.bin@bytedance.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well done. thanks
Collect the metrics of preparing snapshots, mounting snapshots, and removing snapshots.
In addition, collect the metrics about snapshotter, including cache usage, cleanup time, cpu usage, memory usage, file descriptor counts, run time and thread counts.
Command:
Result:
Signed-off-by: Bin Tang tangbin.bin@bytedance.com