Skip to content

DAOS-16999 bio: Set LED on auto-faulty detection#17630

Merged
mchaarawi merged 15 commits intomasterfrom
tanabarr/bio-autofaulty-setled
Mar 18, 2026
Merged

DAOS-16999 bio: Set LED on auto-faulty detection#17630
mchaarawi merged 15 commits intomasterfrom
tanabarr/bio-autofaulty-setled

Conversation

@tanabarr
Copy link
Contributor

@tanabarr tanabarr commented Mar 2, 2026

Centralize LED state updates within the BIO module so that when the BS
state transitions to FAULTY, the LED turns ON, and when it transitions
to NORMAL, the LED turns OFF. This consolidation simplifies testing
and maintenance by ensuring that both manual and automatic set‑faulty
workflows follow the same LED‑related code paths.

Also updates RAS event list with missing entries including LED-related.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@github-actions
Copy link

github-actions bot commented Mar 2, 2026

Ticket title is 'LED does not transition to "ON" after auto set-faulty eviction'
Status is 'In Review'
Labels: 'LED,SPDK,VMD,scrubbed_2.8'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-16999

@daosbuild3
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17630/1/display/redirect

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/2/execution/node/1343/log

sherintg
sherintg previously approved these changes Mar 3, 2026
Copy link
Collaborator

@sherintg sherintg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanabarr any plans to address the below issue aswell?
On HPE ProLiant systems when the SSD is replaced, the location indicator automatically gets turned off. This does not get reflected in “dmg storage list-devices” after “dmg storage replace nvme” command.

knard38
knard38 previously approved these changes Mar 3, 2026
Copy link
Contributor

@knard38 knard38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Features: nvme
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr dismissed stale reviews from knard38 and sherintg via 0ea56d3 March 4, 2026 23:01
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/3/execution/node/1280/log

DP_RC(rc));
send_set_led(bbs, CTL__LED_STATE__ON);
} else {
send_set_led(bbs, CTL__LED_STATE__OFF);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set LED to off only when new state is NORMAL?

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17630/3/testReport/

tanabarr added 2 commits March 5, 2026 17:02
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: nvme
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr marked this pull request as ready for review March 5, 2026 21:59
@tanabarr tanabarr requested review from a team as code owners March 5, 2026 21:59
}

uuid_copy(led_msg->dev_uuid, bbs->bb_dev->bb_uuid);
led_msg->xs = bbs->bb_owner_xs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary (and not correct) to pass in "owner xs". The set_led() is running on "init xs".


D_ASSERT(led_msg->xs != NULL);

rc = bio_led_manage(led_msg->xs, NULL, led_msg->dev_uuid,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "bio_xs_context" is the "device owner xs", it's not the "init xs".

Unfortunately there is an available interface to get "init xstream" in current code, you could replace the "bd_init_thread" with "bd_init_xs" and provide a function to get the "init xs".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

auto_faulty_detect(struct bio_blobstore *bbs)
{
struct smd_dev_info *dev_info;
struct smd_dev_info *dev_info;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change could be reverted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

bbs->bb_state != BIO_BS_STATE_SETUP)
rc = -DER_INVAL;
else
send_set_led(bbs, CTL__LED_STATE__ON);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to move downward after the faulty state being persistent (after smd_dev_set_state() is successfully called).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (bbs->bb_state != BIO_BS_STATE_SETUP)
rc = -DER_INVAL;
else
send_set_led(bbs, CTL__LED_STATE__OFF);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't take effect. It should be called revive_dev() after the normal state being persistent. (after smd_dev_set_state() is successfully called).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

NULL, 0);
if (rc != 0)
DL_ERROR(rc, "Reset LED on device:" DF_UUID " failed", DP_UUID(d_bdev->bb_uuid));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current xs is the "init xs", the bio_led_manage() call should be kept to turn off LED.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/4/execution/node/1280/log

tanabarr added 2 commits March 6, 2026 12:54
Features: nvme
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…function

src/bio/bio_internal.h - Added init_xs_context() declaration
src/bio/bio_recovery.c - Fixed LED message to use init xstream context

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/4/execution/node/1321/log

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/12/execution/node/1363/log

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/12/execution/node/1322/log

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
D_ASSERT(is_init_xstream(led_msg->xs));

bdev = lookup_dev_by_id(led_msg->dev_uuid);
if (bdev != NULL && bdev->bb_led_identify_active) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The identify active flag doesn't seem to be read as intended here, the workflow identify->set_faulty results in LED ON whereas set_faulty->identify->replace results in LED QUICK_BLINK. @NiuYawei any ideas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved, I had to remove the extra set LED call in drpc set-faulty handler that I had forgot

Copy link
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments are minor and non-blocking. Once you have confirmed it's working correctly in your tests, I'll approve.

/* Keep faulty and timer operations mutually exclusive */
if (is_faulty != NULL) {
/* Set is_faulty return value */
*is_faulty = (dev_info->bdi_flags & NVME_DEV_FL_FAULTY) ? true : false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion that feels a little more intuitive in C:

Suggested change
*is_faulty = (dev_info->bdi_flags & NVME_DEV_FL_FAULTY) ? true : false;
*is_faulty = (dev_info->bdi_flags & NVME_DEV_FL_FAULTY) != 0;

Comment on lines +839 to +840
set_timer_or_check_faulty(struct bio_xs_context *xs_ctxt, struct spdk_pci_addr pci_addr,
uint64_t *expiry_time, bool *is_faulty)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they're now mutually exclusive, I think it would be better for them to be separate functions. Using helper functions to share the boilerplate would avoid excessive duplication. I won't block on that for now, but I think it would be an improvement to a function that does completely different things based on input params.

Features: nvme
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr dismissed stale reviews from knard38 and NiuYawei via ee62ade March 13, 2026 00:54
Features: nvme
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/14/execution/node/1362/log

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/15/execution/node/1363/log

@tanabarr tanabarr self-assigned this Mar 16, 2026
@tanabarr tanabarr added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Mar 16, 2026
@github-actions github-actions bot added the priority Ticket has high priority (automatically managed) label Mar 16, 2026
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/16/execution/node/429/log

Copy link
Contributor

@knard38 knard38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last modification LGTM

@tanabarr tanabarr requested a review from a team March 17, 2026 10:21
@tanabarr
Copy link
Contributor Author

CI failures in the last few runs are intermittent and unrelated to the change. Features: nvme is being used to achieve appropriate coverage and probably the reason why a green run cannot be achieved.

@tanabarr
Copy link
Contributor Author

@daos-stack/daos-gatekeeper ping!

@mchaarawi
Copy link
Contributor

@daos-stack/daos-gatekeeper ping!

@tanabarr since @NiuYawei reviewed this PR, he should approve before you request gatekeeper

@mchaarawi
Copy link
Contributor

@daos-stack/daos-gatekeeper ping!

@tanabarr since @NiuYawei reviewed this PR, he should approve before you request gatekeeper

please add gatekeeper again after Niu approves. thanks.

@mchaarawi mchaarawi removed the request for review from a team March 18, 2026 12:23
@tanabarr tanabarr requested a review from a team March 18, 2026 13:00
@mchaarawi mchaarawi merged commit 4e50d3e into master Mar 18, 2026
45 of 47 checks passed
@mchaarawi mchaarawi deleted the tanabarr/bio-autofaulty-setled branch March 18, 2026 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. priority Ticket has high priority (automatically managed)

Development

Successfully merging this pull request may close these issues.

7 participants