Add some diagnostic information to be monitored in system_monitor #1770

Closed
3 tasks done
ito-san opened this issue Sep 2, 2022 · 1 comment
Assignees
Labels
component:system System design and integration. (auto-assigned) type:new-feature New functionalities or additions, feature requests.

Comments


ito-san commented Sep 2, 2022

Checklist

  • I've read the contribution guidelines.
  • I've searched other issues and no duplicate issues were found.
  • I've agreed with the maintainers that I can plan this task.

Description

Based on a safety analysis conducted by TIER IV, we have concluded that system_monitor should monitor some additional diagnostic information.

Purpose

This improves system stability and gives developers and operators more opportunities to detect or predict a hazardous event.
Here is the diagnostic information currently monitored by system_monitor.

image

Possible approaches

Add the following diagnostic information to be monitored (new items are written in red).

image

| Hardware | Potential Failure Mode | Potential Failure Effect | Potential Causes | Action Recommended | New Diagnostic Information to Be Added |
|---|---|---|---|---|---|
| Storage | Read/Write error, Read/Write delay, Corruption of recorded data | Data loss, Read/Write delay | Soft error (transient fault) | Observe a CRC error on the storage | Recovered Error |
| | | | High load on communication bandwidth | Observe Read data rate | Read Data Rate |
| | | | | Observe Write data rate | Write Data Rate |
| | | | | Observe input/output per second for read | Read IOPS |
| | | | | Observe input/output per second for write | Write IOPS |
| LAN | Error in transferred data | Communication failure with sensors/ECUs | Noise on the line | Observe a CRC error on the network interface | CRC Error |
| GPU | Wrong GPU instructions, processing delays, hang-up | Unintended behavior in 3D object detection | Input clock error | Observe if GPU clock is correct | Frequency |
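
The Read/Write Data Rate and Read/Write IOPS items can be derived from cumulative I/O counters sampled at a fixed interval. The following is a minimal sketch, not the system_monitor implementation: it assumes the counters come from Linux `/proc/diskstats` (which reports sectors in 512-byte units), and the names `DiskSample`, `parse_diskstats_line`, and `io_rates` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# /proc/diskstats reports sector counts in 512-byte units.
SECTOR_BYTES = 512


@dataclass(frozen=True)
class DiskSample:
    """Cumulative counters for one block device (hypothetical helper type)."""
    reads_completed: int
    sectors_read: int
    writes_completed: int
    sectors_written: int


def parse_diskstats_line(line: str) -> Tuple[str, DiskSample]:
    """Parse one line of /proc/diskstats into (device name, counters)."""
    parts = line.split()
    # Layout: major minor name reads_completed reads_merged sectors_read
    #         ms_reading writes_completed writes_merged sectors_written ...
    sample = DiskSample(
        reads_completed=int(parts[3]),
        sectors_read=int(parts[5]),
        writes_completed=int(parts[7]),
        sectors_written=int(parts[9]),
    )
    return parts[2], sample


def io_rates(prev: DiskSample, curr: DiskSample, interval_s: float) -> Dict[str, float]:
    """Turn two cumulative samples into per-second rates."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    read_bytes = (curr.sectors_read - prev.sectors_read) * SECTOR_BYTES
    write_bytes = (curr.sectors_written - prev.sectors_written) * SECTOR_BYTES
    return {
        "read_data_rate_mb_s": read_bytes / interval_s / 1e6,
        "write_data_rate_mb_s": write_bytes / interval_s / 1e6,
        "read_iops": (curr.reads_completed - prev.reads_completed) / interval_s,
        "write_iops": (curr.writes_completed - prev.writes_completed) / interval_s,
    }
```

In use, a monitor would read the file once per period, keep the previous sample per device, and compare each rate against a configured warning threshold; the thresholds themselves are not specified in this issue.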

HDD Connection

In addition, we will extend hdd_monitor to report diagnostic information on the HDD connection.

  • The current implementation reports Error both when a device is not connected and when a limit is exceeded, so the two causes of an error cannot be told apart.
  • Reporting the connection status separately will let us tell when an error is caused by an exceeded limit.
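
The separation described above can be sketched as two independent diagnostic entries, one for the connection and one for the limit check. This is only an illustration under assumptions: the function `evaluate_hdd`, the keys `connection` and `limit`, and the generic `value`/threshold parameters are hypothetical and are not taken from hdd_monitor.

```python
from enum import IntEnum
from typing import Dict, Optional


class Level(IntEnum):
    OK = 0
    WARN = 1
    ERROR = 2


def evaluate_hdd(
    connected: bool,
    value: Optional[float],
    warn_limit: float,
    error_limit: float,
) -> Dict[str, Level]:
    """Report connection status and limit status as separate entries.

    `value` stands for any monitored quantity with upper limits
    (hypothetical; the issue does not name a specific one).
    """
    result = {"connection": Level.OK if connected else Level.ERROR}

    # The limit check is only meaningful when the device is reachable,
    # so a disconnected device never produces a "limit" entry.
    if connected and value is not None:
        if value >= error_limit:
            result["limit"] = Level.ERROR
        elif value >= warn_limit:
            result["limit"] = Level.WARN
        else:
            result["limit"] = Level.OK
    return result
```

With this shape, an Error under `limit` can only mean that a threshold was exceeded, while a missing device shows up solely under `connection`.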

Related PRs

Definition of done

The newly added diagnostic information is correctly reported based on the system resource status.

ito-san added the type:new-feature and component:system labels Sep 2, 2022
ito-san self-assigned this Sep 2, 2022

ito-san commented Sep 21, 2022

All PRs have been closed. I'm closing this issue.

ito-san closed this as completed Sep 21, 2022