Smart percentage_used alert should be clear that nvme device is reaching its end of life #60

Open
przemeklal opened this issue Aug 22, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@przemeklal
Member

It's easy to misunderstand the percentage_used nvme alert. The hw-health charm config has a pretty good description:

  Set thresholds for nvme percentage used check. Defaults to 80% warning threshold
  and 90% critical threshold. Percentage Used contains a vendor specific estimate
  of the percentage of NVM subsystem life used based on the actual usage and the
  manufacturer’s prediction of NVM life. A value of 100 indicates that the estimated
  endurance of the NVM in the NVM subsystem has been consumed, but may not indicate
  an NVM subsystem failure.

The alert should not suggest that the issue is related to the filesystem getting full; it should make clear that the NVMe device is close to the end of its life. E.g. Intel Optane drives start to throttle and report read-only mode after hitting 105% percentage_used, which means they've exceeded their expected lifetime.

Something like: "nvme drive is close to reaching its estimated lifetime" would help.
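A minimal sketch of how such a check could map percentage_used onto the default thresholds quoted above (80% warning, 90% critical) and produce an unambiguous message. The function name `classify_percentage_used` and the exact message wording are assumptions for illustration, not the charm's actual implementation.

```python
from typing import Tuple


def classify_percentage_used(
    value: int, warning: int = 80, critical: int = 90
) -> Tuple[str, str]:
    """Return (status, message) for an NVMe percentage_used reading.

    Hypothetical helper: thresholds default to the values described in the
    hw-health charm config; the message text is only a suggestion.
    """
    if value >= critical:
        status = "CRITICAL"
    elif value >= warning:
        status = "WARNING"
    else:
        return "OK", f"nvme percentage_used is {value}%"

    # Values of 100 or more mean the estimated endurance has been consumed,
    # which does not necessarily mean the device has failed.
    if value >= 100:
        detail = "has consumed its estimated endurance"
    else:
        detail = "is close to reaching its estimated lifetime"
    return status, f"nvme drive {detail} (percentage_used={value}%)"
```

For example, `classify_percentage_used(85)` yields a WARNING whose message talks about the drive's lifetime rather than about space usage.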

@przemeklal przemeklal added the enhancement New feature or request label Aug 22, 2023
@Pjack

Pjack commented Sep 25, 2023

We don't have an alert for nvme in hw-observer now. Did you submit this in the wrong place?

@przemeklal przemeklal changed the title from "Smart percentage_used alert should be clear that" to "Smart percentage_used alert should be clear that nvme device is reaching its end of life" Sep 25, 2023
@przemeklal
Member Author

przemeklal commented Sep 25, 2023

This is more of a feature request. I know this alert doesn't exist (yet) and I wanted to capture this requirement for when it is reimplemented in hw-observer.

@Pjack

Pjack commented Sep 26, 2023

Got it. This is a missing part in hardware-observer.
We will check whether the related metrics are available in grafana-agent (node-exporter). If they are, we will add the alert there; if not, we could implement it in hardware-observer. Thanks for the reminder.

@jneo8
Contributor

jneo8 commented Sep 26, 2023

Example to get the metrics:

Currently it's a missing part in node exporter.
We can consider implementing it in hardware-observer or using the textfile collector (I remember it's a missing part of the grafana-agent charm, and I prefer to use an exporter).
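For reference, a rough sketch of what the textfile-collector option might look like: it turns already-parsed SMART JSON into a Prometheus text-format gauge. It assumes the JSON layout produced by `smartctl --json -a <device>`, where the value sits under `nvme_smart_health_information_log.percentage_used`; the metric name `nvme_percentage_used` and the helper `render_percentage_used` are hypothetical.

```python
from typing import Any, Dict

# Hypothetical metric name, chosen only for this sketch.
METRIC_NAME = "nvme_percentage_used"


def render_percentage_used(device: str, smart_json: Dict[str, Any]) -> str:
    """Render one gauge in Prometheus text format, or "" if the value is absent.

    Assumes smart_json is the parsed output of `smartctl --json -a <device>`.
    """
    log = smart_json.get("nvme_smart_health_information_log", {})
    value = log.get("percentage_used")
    if value is None:
        # Not an NVMe device, or the field is missing: emit nothing.
        return ""
    return (
        f"# HELP {METRIC_NAME} NVMe percentage of estimated endurance used.\n"
        f"# TYPE {METRIC_NAME} gauge\n"
        f'{METRIC_NAME}{{device="{device}"}} {value}\n'
    )
```

The returned text could be written to a `.prom` file in the directory watched by the textfile collector, and an alert rule could then compare the gauge against the warning and critical thresholds.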

@jneo8
Contributor

jneo8 commented May 16, 2024

Need to check if https://github.com/prometheus-community/smartctl_exporter has this.


3 participants