DAOS-16999 bio: Set LED on auto-faulty detection #17630
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Ticket title is 'LED does not transition to "ON" after auto set-faulty eviction'
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17630/1/display/redirect
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/2/execution/node/1343/log
sherintg
left a comment
@tanabarr any plans to address the issue below as well?
On HPE ProLiant systems, the location indicator is turned off automatically when the SSD is replaced. This is not reflected in the output of “dmg storage list-devices” after the “dmg storage replace nvme” command.
Features: nvme Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/3/execution/node/1280/log
src/bio/bio_recovery.c
Outdated
| DP_RC(rc)); | ||
| send_set_led(bbs, CTL__LED_STATE__ON); | ||
| } else { | ||
| send_set_led(bbs, CTL__LED_STATE__OFF); |
Should the LED be set to OFF only when the new state is NORMAL?
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17630/3/testReport/
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: nvme Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
src/bio/bio_recovery.c
Outdated
| } | ||
| uuid_copy(led_msg->dev_uuid, bbs->bb_dev->bb_uuid); | ||
| led_msg->xs = bbs->bb_owner_xs; |
Passing in the "owner xs" is unnecessary (and incorrect): set_led() runs on the "init xs".
| D_ASSERT(led_msg->xs != NULL); | ||
| rc = bio_led_manage(led_msg->xs, NULL, led_msg->dev_uuid, |
This "bio_xs_context" is the "device owner xs", not the "init xs".
Unfortunately there is no available interface to get the "init xstream" in the current code. You could replace "bd_init_thread" with "bd_init_xs" and provide a function to get the "init xs".
src/bio/bio_monitor.c
Outdated
| auto_faulty_detect(struct bio_blobstore *bbs) | ||
| { | ||
| struct smd_dev_info *dev_info; | ||
| struct smd_dev_info *dev_info; |
This change could be reverted.
src/bio/bio_recovery.c
Outdated
| bbs->bb_state != BIO_BS_STATE_SETUP) | ||
| rc = -DER_INVAL; | ||
| else | ||
| send_set_led(bbs, CTL__LED_STATE__ON); |
It would be better to move this call further down, after the faulty state has been persisted (that is, after smd_dev_set_state() has been called successfully).
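The ordering suggested here can be sketched in isolation. This is only an illustration, not the DAOS code: the callback types, `ordered_mark_faulty()`, and the fake context are hypothetical names standing in for smd_dev_set_state() and the LED request, and the sketch assumes the LED should stay untouched when persisting fails.

```c
#include <stdbool.h>

/* Hypothetical callbacks standing in for smd_dev_set_state() and the LED request. */
typedef int (*ordered_persist_fn)(void *ctx);
typedef void (*ordered_led_fn)(void *ctx, bool on);

/* Persist the faulty state first; request LED ON only if persisting succeeded. */
static int
ordered_mark_faulty(void *ctx, ordered_persist_fn persist, ordered_led_fn set_led)
{
	int rc = persist(ctx);

	if (rc != 0)
		return rc; /* state not persisted: leave the LED untouched */

	set_led(ctx, true);
	return 0;
}

/* Minimal fakes used to exercise the ordering. */
struct ordered_ctx {
	int  persist_rc; /* value the fake persist step returns */
	int  led_calls;  /* number of LED requests observed */
	bool led_on;     /* last requested LED state */
};

static int
ordered_fake_persist(void *ctx)
{
	return ((struct ordered_ctx *)ctx)->persist_rc;
}

static void
ordered_fake_led(void *ctx, bool on)
{
	struct ordered_ctx *c = ctx;

	c->led_calls++;
	c->led_on = on;
}
```

With the fakes above, a failing persist step results in no LED request at all, which is the property the review comment asks for.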
src/bio/bio_recovery.c
Outdated
| if (bbs->bb_state != BIO_BS_STATE_SETUP) | ||
| rc = -DER_INVAL; | ||
| else | ||
| send_set_led(bbs, CTL__LED_STATE__OFF); |
This won't take effect. It should be called in revive_dev() after the normal state has been persisted (that is, after smd_dev_set_state() has been called successfully).
| NULL, 0); | ||
| if (rc != 0) | ||
| DL_ERROR(rc, "Reset LED on device:" DF_UUID " failed", DP_UUID(d_bdev->bb_uuid)); | ||
The current xs is the "init xs", so the bio_led_manage() call should be kept to turn off the LED.
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/4/execution/node/1280/log
Features: nvme Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…function
src/bio/bio_internal.h - Added init_xs_context() declaration
src/bio/bio_recovery.c - Fixed LED message to use init xstream context
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/4/execution/node/1321/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/12/execution/node/1363/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/12/execution/node/1322/log
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
src/bio/bio_recovery.c
Outdated
| D_ASSERT(is_init_xstream(led_msg->xs)); | ||
| bdev = lookup_dev_by_id(led_msg->dev_uuid); | ||
| if (bdev != NULL && bdev->bb_led_identify_active) { |
The identify active flag doesn't seem to be read as intended here: the workflow identify->set_faulty results in LED ON, whereas set_faulty->identify->replace results in LED QUICK_BLINK. @NiuYawei any ideas?
Resolved: I had to remove the extra set LED call in the drpc set-faulty handler, which I had forgotten about.
kjacque
left a comment
My comments are minor and non-blocking. Once you have confirmed it's working correctly in your tests, I'll approve.
| /* Keep faulty and timer operations mutually exclusive */ | ||
| if (is_faulty != NULL) { | ||
| /* Set is_faulty return value */ | ||
| *is_faulty = (dev_info->bdi_flags & NVME_DEV_FL_FAULTY) ? true : false; |
Just a suggestion that feels a little more intuitive in C:
| *is_faulty = (dev_info->bdi_flags & NVME_DEV_FL_FAULTY) ? true : false; | |
| *is_faulty = (dev_info->bdi_flags & NVME_DEV_FL_FAULTY) != 0; |
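As a standalone illustration of the suggested form, the sketch below tests a flag bit with `!= 0` and returns the result as a `bool`. The flag value and the helper name are hypothetical and chosen only for this example; the real NVME_DEV_FL_FAULTY is defined in the DAOS headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value for illustration only, not the DAOS definition. */
#define EXAMPLE_DEV_FL_FAULTY (1U << 2)

/* Returns true when the faulty bit is set in flags, false otherwise. */
static bool
example_is_faulty(uint32_t flags)
{
	return (flags & EXAMPLE_DEV_FL_FAULTY) != 0;
}
```

Both spellings produce the same value; the comparison simply states the intent without a redundant ternary.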
| set_timer_or_check_faulty(struct bio_xs_context *xs_ctxt, struct spdk_pci_addr pci_addr, | ||
| uint64_t *expiry_time, bool *is_faulty) |
If they're now mutually exclusive, I think it would be better for them to be separate functions. Using helper functions to share the boilerplate would avoid excessive duplication. I won't block on that for now, but I think it would be an improvement to a function that does completely different things based on input params.
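One way the split could look, as a rough sketch under stated assumptions: the device table, `split_lookup()`, `split_check_faulty()`, and `split_set_timer()` are all hypothetical names invented for this example and do not reflect the actual set_timer_or_check_faulty() internals. The point is only that the shared lookup boilerplate lives in one helper while each public function does a single job.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag and per-device record, for illustration only. */
#define SPLIT_DEV_FL_FAULTY (1U << 0)

struct split_dev_info {
	uint32_t flags;
	uint64_t expiry_time;
};

/* Shared boilerplate: validate arguments and find the device. Returns 0 or -1. */
static int
split_lookup(struct split_dev_info *table, size_t count, size_t idx,
	     struct split_dev_info **out)
{
	if (table == NULL || out == NULL || idx >= count)
		return -1;
	*out = &table[idx];
	return 0;
}

/* Only reports whether the device is flagged faulty. */
static int
split_check_faulty(struct split_dev_info *table, size_t count, size_t idx,
		   bool *is_faulty)
{
	struct split_dev_info *dev;
	int                    rc;

	if (is_faulty == NULL)
		return -1;
	rc = split_lookup(table, count, idx, &dev);
	if (rc != 0)
		return rc;
	*is_faulty = (dev->flags & SPLIT_DEV_FL_FAULTY) != 0;
	return 0;
}

/* Only updates the expiry timer. */
static int
split_set_timer(struct split_dev_info *table, size_t count, size_t idx,
		uint64_t expiry_time)
{
	struct split_dev_info *dev;
	int                    rc;

	rc = split_lookup(table, count, idx, &dev);
	if (rc != 0)
		return rc;
	dev->expiry_time = expiry_time;
	return 0;
}
```

Callers then pick the function that matches what they want, rather than selecting behaviour by passing NULL for one of two output parameters.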
Features: nvme Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: nvme Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/14/execution/node/1362/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/15/execution/node/1363/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/16/execution/node/429/log
CI failures in the last few runs are intermittent and unrelated to the change.
@daos-stack/daos-gatekeeper ping!
Centralize LED state updates within the BIO module so that when the BS
state transitions to FAULTY, the LED turns ON, and when it transitions
to NORMAL, the LED turns OFF. This consolidation simplifies testing
and maintenance by ensuring that both manual and automatic set‑faulty
workflows follow the same LED‑related code paths.
Also update the RAS event list with missing entries, including LED-related ones.
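The mapping described above can be sketched as a single function that both the manual and automatic paths would consult. The enum and function names here are hypothetical stand-ins (the actual code uses the BIO_BS_STATE_* and CTL__LED_STATE__* identifiers), and leaving the LED unchanged for states other than FAULTY and NORMAL is an assumption of this sketch.

```c
/* Hypothetical stand-ins for the blobstore states and LED requests. */
enum led_bs_state {
	LED_BS_NORMAL,
	LED_BS_FAULTY,
	LED_BS_TEARDOWN,
	LED_BS_OUT,
	LED_BS_SETUP
};

enum led_action {
	LED_ACT_NONE, /* leave the LED as it is */
	LED_ACT_ON,
	LED_ACT_OFF
};

/* Single decision point: which LED request follows a transition to new_state. */
static enum led_action
led_action_for_state(enum led_bs_state new_state)
{
	switch (new_state) {
	case LED_BS_FAULTY:
		return LED_ACT_ON;
	case LED_BS_NORMAL:
		return LED_ACT_OFF;
	default:
		/* Assumption: intermediate states do not change the LED. */
		return LED_ACT_NONE;
	}
}
```

Keeping the decision in one place means a test only has to cover this mapping once, regardless of which workflow triggered the state change.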