Skip to content

DAOS-18361 chk: handle CHK engine side inconsistency in parallel#17556

Merged
gnailzenh merged 3 commits intomasterfrom
Nasf-Fan/DAOS-18361_1
Mar 17, 2026
Merged

DAOS-18361 chk: handle CHK engine side inconsistency in parallel#17556
gnailzenh merged 3 commits intomasterfrom
Nasf-Fan/DAOS-18361_1

Conversation

@Nasf-Fan
Copy link
Contributor

On CHK engine side, most of inconsistencies can be handled in parallel. For each of them, create dedicated ULT to handle the inconsistency and report (including interaction) to CHK leader independently. So even if some ULT was blocked for some reason, such as waiting for interaction, it will not affect the other inconsistencies to be handled in parallel.

Test-tag: recovery

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions
Copy link

Ticket title is 'CR did not detect orphan container shards on Aurora'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-18361

@daosbuild3
Copy link
Collaborator

Base automatically changed from Nasf-Fan/DAOS-18587 to master February 26, 2026 08:02
On CHK engine side, most of inconsistencies can be handled in parallel.
For each of them, create dedicated ULT to handle the inconsistency and
report (including interaction) to CHK leader independently. So even if
some ULT was blocked for some reason, such as waiting for interaction,
it will not affect the other inconsistencies to be handled in parallel.

Test-tag: recovery

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18361_1 branch from 7171952 to 1523016 Compare February 26, 2026 08:10
@Nasf-Fan Nasf-Fan marked this pull request as ready for review February 27, 2026 08:18
@Nasf-Fan
Copy link
Contributor Author

Nasf-Fan commented Mar 2, 2026

Ping reviewers, thanks!

@Nasf-Fan Nasf-Fan requested a review from knard38 March 9, 2026 02:52
Copy link
Contributor

@knard38 knard38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM for what I understand: I am not yet very familiar with this part of the code.

ABT_thread_free(&ult->ceu_ult);

if (rc == 0)
rc = ult->ceu_result;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, if we have mutltiple errors we will keep only the first error code appearing.
Not sure if, it could be an issue compared to the original behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for case of multiple CHK report failures, only the first one's err# will be returned to the caller. It is unnecessary to make the caller to know all the failures since the caller will not distinguish the detailed failure reason, instead, it only checks check chk_engine_wait_ults() return value and decides whether go ahead or fail out.

On the other hand, the other potential failures are not discarded, because when related failure happens, it will be recorded via related D_ERROR log. The user/admin still have chance to known that via checking the log.


seq = 0;
chk_engine_report(&cru, &seq, NULL);
chk_engine_handle_unknown(cpr, ccr, NULL, exp_tgt_nr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[defect] did not check chk_engine_handle_unknown return value, other places check this function...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix it in the new commit.

Test-tag: recovery

Signed-off-by: Fan Yong <fan.yong@hpe.com>
Comment on lines +1889 to +1891
rc = chk_engine_handle_unknown(cpr, ccr, NULL, exp_tgt_nr);
if (rc != 0)
goto out;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT, the following lines are not needed from my understanding.

		if (rc != 0)
			goto out;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need goto to break the for loop for failure case.

@Nasf-Fan Nasf-Fan requested a review from wangshilong March 13, 2026 00:47
@Nasf-Fan Nasf-Fan requested a review from a team March 13, 2026 07:24
@daltonbohning
Copy link
Contributor

Removing gatekeeper until merge approval is granted

@daltonbohning daltonbohning removed the request for review from a team March 13, 2026 15:42
@gnailzenh gnailzenh merged commit c5dafb4 into master Mar 17, 2026
40 checks passed
@gnailzenh gnailzenh deleted the Nasf-Fan/DAOS-18361_1 branch March 17, 2026 04:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

6 participants