DAOS-18361 chk: handle CHK engine side inconsistency in parallel #17556
Ticket title is 'CR did not detect orphan container shards on Aurora'
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17556/1/testReport/
On the CHK engine side, most inconsistencies can be handled in parallel. For each of them, create a dedicated ULT to handle the inconsistency and report (including interaction) to the CHK leader independently. So even if some ULT is blocked for some reason, such as waiting for interaction, it will not prevent the other inconsistencies from being handled in parallel.
Test-tag: recovery
Signed-off-by: Fan Yong <fan.yong@hpe.com>
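The idea in the description above can be illustrated with a minimal sketch. It is not the DAOS implementation: POSIX threads stand in for Argobots ULTs, and every name in it (`issue_worker`, `handle_issue`, `run_parallel_handlers`, `NR_ISSUES`) is hypothetical. One worker waits for a simulated interaction while the remaining workers finish on their own.

```c
#include <pthread.h>
#include <stdbool.h>

#define NR_ISSUES 4 /* hypothetical number of inconsistencies */

/* Hypothetical per-inconsistency bookkeeping (not a DAOS structure). */
struct issue_worker {
	pthread_t tid;
	bool      needs_interaction;
	int       result;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool            answered; /* set once the simulated interaction arrives */

/* Each worker handles exactly one inconsistency and records its own result. */
static void *handle_issue(void *arg)
{
	struct issue_worker *w = arg;

	if (w->needs_interaction) {
		/* Block until the interaction is answered. */
		pthread_mutex_lock(&lock);
		while (!answered)
			pthread_cond_wait(&cond, &lock);
		pthread_mutex_unlock(&lock);
	}
	w->result = 0; /* pretend the inconsistency was handled and reported */
	return NULL;
}

/*
 * Returns how many workers finished while worker 0 was still blocked,
 * or -1 if a thread could not be created.
 */
int run_parallel_handlers(void)
{
	struct issue_worker w[NR_ISSUES];
	int finished_while_blocked = 0;
	int created;
	int i;

	answered = false;
	for (created = 0; created < NR_ISSUES; created++) {
		w[created].needs_interaction = (created == 0);
		w[created].result = -1;
		if (pthread_create(&w[created].tid, NULL, handle_issue,
				   &w[created]) != 0)
			break;
	}

	/* The non-blocked workers complete without waiting for worker 0. */
	for (i = 1; i < created; i++) {
		pthread_join(w[i].tid, NULL);
		if (w[i].result == 0)
			finished_while_blocked++;
	}

	/* Only now answer the interaction and collect the blocked worker. */
	pthread_mutex_lock(&lock);
	answered = true;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	if (created > 0)
		pthread_join(w[0].tid, NULL);

	return created == NR_ISSUES ? finished_while_blocked : -1;
}
```

In this sketch, calling `run_parallel_handlers()` reports that the three workers without interaction finished before the blocked one was released, which is the independence property the description refers to.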
7171952 to 1523016
Ping reviewers, thanks!
knard38 left a comment
Mostly LGTM from what I understand; I am not yet very familiar with this part of the code.
ABT_thread_free(&ult->ceu_ult);

if (rc == 0)
	rc = ult->ceu_result;
From what I understand, if we have multiple errors, we keep only the first error code that appears.
I am not sure whether this could be an issue compared to the original behavior.
Yes, when multiple CHK reports fail, only the first one's error code is returned to the caller. The caller does not need to know about every failure, since it does not distinguish the detailed failure reason; it only checks the chk_engine_wait_ults() return value and decides whether to go ahead or fail out.
On the other hand, the other potential failures are not discarded: when a failure happens, it is recorded via the related D_ERROR log, so the user/admin still has a chance to learn about it by checking the log.
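The "first error wins" behavior discussed here can be sketched as follows. This is only an illustration under stated assumptions: `ult_record` and `wait_all` are hypothetical names, the results are precomputed instead of coming from real ULTs, and the `fprintf` call stands in for the D_ERROR logging that, per the reply above, happens where each failure occurs.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-ULT bookkeeping. */
struct ult_record {
	int id;
	int result; /* 0 on success, negative error code on failure */
};

/*
 * Visit every record, log each failure, and return only the first
 * non-zero result. Later failures are still logged, not returned.
 */
int wait_all(const struct ult_record *recs, size_t nr)
{
	int    rc = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (recs[i].result != 0)
			fprintf(stderr, "worker %d failed: rc = %d\n",
				recs[i].id, recs[i].result);

		/* Keep the first error; do not overwrite it afterwards. */
		if (rc == 0)
			rc = recs[i].result;
	}

	return rc;
}
```

With results `0, -5, -7` the sketch returns `-5`, and both failures appear in the log, so the caller can decide whether to continue while the details remain available to an administrator.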
src/chk/chk_engine.c
Outdated
seq = 0;
chk_engine_report(&cru, &seq, NULL);
chk_engine_handle_unknown(cpr, ccr, NULL, exp_tgt_nr);
[defect] The return value of chk_engine_handle_unknown is not checked here; other places check this function...
Fixed in the new commit.
Test-tag: recovery Signed-off-by: Fan Yong <fan.yong@hpe.com>
rc = chk_engine_handle_unknown(cpr, ccr, NULL, exp_tgt_nr);
if (rc != 0)
	goto out;
NIT: from my understanding, the following lines are not needed.
if (rc != 0)
goto out;
We need the goto to break out of the for loop in the failure case.
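A small sketch of why the check matters inside a loop. The names `handle_one` and `process_all` are hypothetical and unrelated to the DAOS code; the point is only that, without the early exit, a later successful iteration would overwrite `rc` and hide the failure.

```c
#include <stddef.h>

/* Hypothetical handler: negative items fail with their own value. */
static int handle_one(int item)
{
	return item < 0 ? item : 0;
}

/*
 * Process items in order and stop at the first failure.
 * *handled receives the number of items processed successfully.
 */
int process_all(const int *items, size_t nr, size_t *handled)
{
	int    rc = 0;
	size_t i;

	*handled = 0;
	for (i = 0; i < nr; i++) {
		rc = handle_one(items[i]);
		if (rc != 0)
			goto out; /* leave the loop so rc is not overwritten */
		(*handled)++;
	}

out:
	return rc;
}
```

For the input `1, -3, 2` the sketch returns `-3` with one item handled. If the `if (rc != 0) goto out;` lines were dropped, the third item would reset `rc` to 0 and the failure would be lost.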
Removing gatekeeper until merge approval is granted
On the CHK engine side, most inconsistencies can be handled in parallel. For each of them, create a dedicated ULT to handle the inconsistency and report (including interaction) to the CHK leader independently. So even if some ULT is blocked for some reason, such as waiting for interaction, it will not prevent the other inconsistencies from being handled in parallel.
Test-tag: recovery
Steps for the author:
After all prior steps are complete: