metrics: add metrics for FS hang IOs #308

Merged
merged 1 commit into from
Jan 13, 2023

@sctb512 sctb512 commented Jan 10, 2023

Add the metric TotalHangIO to record the total number of hang IOs in the FS. The inflight IO data comes from the nydus daemon API /api/v1/metrics/inflight.

Output:

# HELP nydusd_hang_IO_counts Total number of hang IOs.
# TYPE nydusd_hang_IO_counts counter
nydusd_hang_IO_counts{opcode="OP_READ"} 2

Signed-off-by: Bin Tang tangbin.bin@bytedance.com

@sctb512 sctb512 force-pushed the add-nydusd-hang-IO-metric branch 6 times, most recently from da6d57c to e707b14 Compare January 11, 2023 02:15
@changweige
Member

Needs a rebase 🤣

pkg/metrics/data/fs.go (outdated)
"github.com/containerd/nydus-snapshotter/pkg/manager"
"github.com/containerd/nydus-snapshotter/pkg/metrics/collector"
"github.com/containerd/nydus-snapshotter/pkg/metrics/exporter"
)

// Default period to determine a hang IO.
const defaultHangIOPeriod = 10 * time.Second
Member

I think "interval" expresses this better than "period".

Member Author

Yes, I have replaced period with interval.

}

func (s *Server) StartCollectMetrics(ctx context.Context) error {
	// TODO(renzhen): make collect interval time configurable
	timer := time.NewTicker(time.Duration(1) * time.Minute)
	// The timer period is same as the period for determining hang IOs.
Member

Grammar mistake: it should be "is the same".

Member Author

Fixed.  😊

@sctb512 sctb512 force-pushed the add-nydusd-hang-IO-metric branch 2 times, most recently from f4a5d65 to 0cf593f Compare January 12, 2023 02:10
@sctb512
Member Author

sctb512 commented Jan 12, 2023

@changweige Is there any way to rerun the test by myself? 😌

@sctb512 sctb512 force-pushed the add-nydusd-hang-IO-metric branch 2 times, most recently from 2b90df1 to bd4701c Compare January 12, 2023 02:55
@changweige
Member

> @changweige Is there any way to rerun the test by myself? 😌

No, it's a GitHub limitation. Only maintainers can trigger a rerun of the action. Alternatively, you can force-push to the PR to trigger it again.

@sctb512
Member Author

sctb512 commented Jan 12, 2023

> > @changweige Is there any way to rerun the test by myself? 😌
>
> No, it's a GitHub limitation. Only maintainers can trigger a rerun of the action. Alternatively, you can force-push to the PR to trigger it again.

Ok, got it.

pkg/metrics/collector/collector.go (outdated)
	hungIOMap := make(map[string]uint64)
	recordedHungIOMap := make(map[uint64]uint64)
	nowTime := time.Now()
	for _, daemonInflightIOMetrics := range i.MetricsVec {
Member

Could you please briefly introduce the algorithm here that catches the hung IO?

Member

Better to add a comment that introduces and explains the algorithm.

Member Author

Done.

}
i.RecordedMetrics = recordedHungIOMap
for opcode, value := range hungIOMap {
data.TotalHungIO.WithLabelValues(opcode).Add(float64(value))
Member

The metric should be a Gauge. A hung IO can disappear once the backend remote request gets a response.

Member Author

Got it. I have replaced it with a Gauge.
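
To illustrate the reviewer's point, the toy example below feeds the same sequence of per-collection hung IO counts to a counter-like type and a gauge-like type. Both types are simplified stand-ins written for this sketch, not the Prometheus client library; they only mirror the general behavior that a counter accumulates via `Add` while a gauge can be overwritten via `Set`.

```go
package main

import "fmt"

// counter is a toy stand-in for a monotonically increasing metric.
type counter struct{ v float64 }

func (c *counter) Add(d float64) { c.v += d }

// gauge is a toy stand-in for a metric that can go up and down.
type gauge struct{ v float64 }

func (g *gauge) Set(x float64) { g.v = x }

// observe feeds each per-collection hung IO count to both metric kinds
// and returns their final values.
func observe(samples []float64) (counterValue, gaugeValue float64) {
	var c counter
	var g gauge
	for _, s := range samples {
		c.Add(s) // accumulates, so the same hung IO is counted repeatedly
		g.Set(s) // reflects only the latest observation
	}
	return c.v, g.v
}

func main() {
	// Two IOs are hung for two collection rounds, then both complete.
	c, g := observe([]float64{2, 2, 0})
	fmt.Println(c, g) // 4 0
}
```

After the hung requests complete, the gauge drops back to 0 while the counter stays at 4, which is why a gauge is a better match for a quantity that can shrink.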

pkg/metrics/data/fs.go (outdated)
12: "OP_RENAME",
13: "OP_LINK",
14: "OP_OPEN",
15: "OP_READ",
Member

Is it necessary to copy all the FUSE opcode definitions here? Counting only READ should meet the requirements.

Member Author

Yes. If only the READ operation is counted, the label is unnecessary.

Add the metric TotalHangIO to record the total hang IO counts of FS.
The inflight IOs data are from nydus daemon API /api/v1/metrics/inflight.

Signed-off-by: Bin Tang <tangbin.bin@bytedance.com>
@changweige changweige merged commit 38ec2ed into containerd:main Jan 13, 2023