DAOS-18361 chk: handle CHK engine side inconsistency in parallel by Nasf-Fan · Pull Request #17556 · daos-stack/daos

Nasf-Fan · 2026-02-13T13:23:25Z

On the CHK engine side, most inconsistencies can be handled in parallel. For each of them, create a dedicated ULT to handle the inconsistency and report (including interaction) to the CHK leader independently. That way, even if one ULT is blocked for some reason, such as waiting for interaction, the other inconsistencies can still be handled in parallel.
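The one-worker-per-inconsistency pattern described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: it uses POSIX threads as a stand-in for Argobots ULTs, and `demo_task`, `demo_handle_one`, and `demo_handle_all` are hypothetical names that do not exist in DAOS. Each worker writes only to its own result slot, so a slow or blocked worker does not hold back the others.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-inconsistency task; not a DAOS structure. */
struct demo_task {
	pthread_t	dt_thread;	/* stand-in for the dedicated ULT */
	int		dt_started;	/* non-zero once the worker was created */
	int		dt_input;	/* stand-in for the inconsistency to handle */
	int		dt_result;	/* 0 on success, negative on failure */
};

/* Handle one inconsistency; here a negative input simply means "failed". */
static void *
demo_handle_one(void *arg)
{
	struct demo_task *task = arg;

	task->dt_result = task->dt_input < 0 ? task->dt_input : 0;
	return NULL;
}

/*
 * Start one worker per task, then wait for all of them.
 * Returns 0 if every worker could be started, -1 otherwise.
 * Per-task results are left in tasks[i].dt_result.
 */
static int
demo_handle_all(struct demo_task *tasks, size_t nr)
{
	int	rc = 0;
	size_t	i;

	assert(tasks != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		tasks[i].dt_result = 0;
		tasks[i].dt_started = pthread_create(&tasks[i].dt_thread, NULL,
						     demo_handle_one,
						     &tasks[i]) == 0;
		if (!tasks[i].dt_started) {
			tasks[i].dt_result = -1;
			rc = -1;
		}
	}

	/* Join only the workers that were actually created. */
	for (i = 0; i < nr; i++) {
		if (tasks[i].dt_started)
			pthread_join(tasks[i].dt_thread, NULL);
	}

	return rc;
}
```

With real ULTs the workers would be scheduled cooperatively rather than preemptively, but the ownership rule is the same: one task, one worker, one result slot.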

Test-tag: recovery

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions · 2026-02-13T13:23:42Z

Ticket title is 'CR did not detect orphan container shards on Aurora'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-18361

daosbuild3 · 2026-02-13T14:21:02Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17556/1/testReport/

On the CHK engine side, most inconsistencies can be handled in parallel. For each of them, create a dedicated ULT to handle the inconsistency and report (including interaction) to the CHK leader independently. That way, even if one ULT is blocked for some reason, such as waiting for interaction, the other inconsistencies can still be handled in parallel.

Test-tag: recovery
Signed-off-by: Fan Yong <fan.yong@hpe.com>

Nasf-Fan · 2026-03-02T03:09:47Z

Ping reviewers, thanks!

knard38

Mostly LGTM from what I understand; I am not yet very familiar with this part of the code.

knard38 · 2026-03-09T13:18:54Z

src/chk/chk_engine.c

+			ABT_thread_free(&ult->ceu_ult);
+
+		if (rc == 0)
+			rc = ult->ceu_result;


From what I understand, if there are multiple errors, only the first error code is kept.
I am not sure whether this could be an issue compared to the original behavior.

Yes, when multiple CHK reports fail, only the first one's error code is returned to the caller. The caller does not need to know about every failure, because it does not distinguish the detailed failure reasons; it only checks the chk_engine_wait_ults() return value and decides whether to go ahead or fail out.

On the other hand, the other potential failures are not discarded: when such a failure happens, it is recorded via the related D_ERROR log. The user/admin still has a chance to find out about it by checking the log.
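As a rough illustration of the "first error wins" rule discussed in this thread, the fold over per-worker results might look like the sketch below. `demo_first_error` is a hypothetical helper, not the actual `chk_engine_wait_ults()`, and logging at the fold point with `fprintf` is a simplification: per the reply above, the real code records each failure via D_ERROR where it happens.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: fold per-worker results into a single return code,
 * keeping the first non-zero one. Later failures are logged, not returned.
 */
static int
demo_first_error(const int *results, size_t nr)
{
	int	rc = 0;
	size_t	i;

	assert(results != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (results[i] == 0)
			continue;

		/* Every failure leaves a trace, even if it is not returned. */
		fprintf(stderr, "worker %zu failed: %d\n", i, results[i]);
		if (rc == 0)
			rc = results[i];
	}

	return rc;
}
```

A caller that only branches on zero versus non-zero loses nothing from this fold; a caller that needed every error code would have to read the per-worker results directly.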

wangshilong · 2026-03-12T08:42:49Z

src/chk/chk_engine.c

-
-		seq = 0;
-		chk_engine_report(&cru, &seq, NULL);
+		chk_engine_handle_unknown(cpr, ccr, NULL, exp_tgt_nr);


[defect] The return value of chk_engine_handle_unknown is not checked here; other places check this function...

Fixed in the new commit.

knard38 · 2026-03-12T14:23:36Z

src/chk/chk_engine.c

+		rc = chk_engine_handle_unknown(cpr, ccr, NULL, exp_tgt_nr);
+		if (rc != 0)
+			goto out;


NIT: from my understanding, the following lines are not needed.

if (rc != 0) goto out;

We need the goto to break out of the for loop in the failure case.
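The point about the goto can be illustrated with a small standalone sketch. The names `demo_handle` and `demo_process` are hypothetical and unrelated to the DAOS code; the sketch only shows why the check matters inside a loop. If it were dropped here, the next iteration would overwrite `rc`, and a later success would hide the earlier failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-item handler: a negative item means failure. */
static int
demo_handle(int item)
{
	return item < 0 ? item : 0;
}

/*
 * Process items in order and stop at the first failure.
 * *handled reports how many items completed successfully.
 */
static int
demo_process(const int *items, size_t nr, size_t *handled)
{
	int	rc = 0;
	size_t	i;

	assert(handled != NULL);
	*handled = 0;

	for (i = 0; i < nr; i++) {
		rc = demo_handle(items[i]);
		if (rc != 0)
			goto out; /* leave the loop; later items are not touched */
		(*handled)++;
	}
out:
	return rc;
}
```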

daltonbohning · 2026-03-13T15:42:41Z

Removing gatekeeper until merge approval is granted

Base automatically changed from Nasf-Fan/DAOS-18587 to master February 26, 2026 08:02

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18361_1 branch from 7171952 to 1523016 Compare February 26, 2026 08:10

Nasf-Fan mentioned this pull request Feb 26, 2026

DAOS-18361 chk: handle CHK engine side inconsistency in parallel #17446

Closed

6 tasks

Nasf-Fan marked this pull request as ready for review February 27, 2026 08:18

Nasf-Fan requested review from gnailzenh and wangshilong February 27, 2026 08:18

Nasf-Fan requested a review from knard38 March 9, 2026 02:52

knard38 reviewed Mar 9, 2026

wangshilong reviewed Mar 12, 2026

Nasf-Fan added 2 commits March 12, 2026 21:27

Merge branch 'master' into Nasf-Fan/DAOS-18361_1

2255396

DAOS-18361 chk: for review feedback

b974d58

Test-tag: recovery
Signed-off-by: Fan Yong <fan.yong@hpe.com>

knard38 approved these changes Mar 12, 2026

Nasf-Fan requested a review from wangshilong March 13, 2026 00:47

wangshilong approved these changes Mar 13, 2026

Nasf-Fan requested a review from a team March 13, 2026 07:24

daltonbohning removed the request for review from a team March 13, 2026 15:42

gnailzenh approved these changes Mar 17, 2026

gnailzenh merged commit c5dafb4 into master Mar 17, 2026
40 checks passed

gnailzenh deleted the Nasf-Fan/DAOS-18361_1 branch March 17, 2026 04:37

gnailzenh merged 3 commits into master from Nasf-Fan/DAOS-18361_1
