DAOS-623 test: add allowed error for FI by mchaarawi · Pull Request #17959 · daos-stack/daos

mchaarawi · 2026-04-09T14:37:27Z

Skip-func-hw-test: true
Skip-unit-test: true
Skip-unit-test-memcheck: true

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions · 2026-04-09T15:12:57Z

Ticket title is 'Generic ticket for minor code cleanup and improvement'
Status is 'Resolved'
Labels: 'request_for_2.6.5,request_for_2.6.6,request_for_2.8,scrubbed_2.6.5'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-623

daosbuild3 · 2026-04-09T23:38:32Z

Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17959/3/execution/node/747/log

grom72 · 2026-04-10T04:45:45Z

utils/node_local_test.py

+            if 'Sluggish EC boundary report from rank' in log_msg:
+                return False
+
+            if 'The progress callback was not called for too long' in log_msg:


Would it be better to increase the default value of the swim_prot_period_len parameter for NLT/FI tests using the SWIM_PROTOCOL_PERIOD_LEN environment variable?
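
For reference, a minimal sketch of what that suggestion could look like on the test side, assuming the harness builds an environment dictionary for the server process. The helper name `with_swim_period` and the value passed to it are hypothetical; only the `SWIM_PROTOCOL_PERIOD_LEN` variable name comes from the comment above.

```python
import os


def with_swim_period(base_env, period_len):
    """Return a copy of base_env with SWIM_PROTOCOL_PERIOD_LEN set.

    Hypothetical helper: the real NLT harness may assemble its
    environment differently.
    """
    env = dict(base_env)
    # Environment values must be strings.
    env['SWIM_PROTOCOL_PERIOD_LEN'] = str(period_len)
    return env


# Derive a server environment from the current process environment.
# 2000 is an arbitrary illustrative value, not a recommended setting.
server_env = with_swim_period(os.environ, 2000)
```

The copy keeps the caller's environment untouched, so only the process launched with `server_env` would see the longer period.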

mchaarawi

The variance in the numbers for this in CI VM tests has been too random, so we do not know where to start. This is fundamentally a CI networking issue, and this is a quick workaround for now so folks do not have to keep restarting their PRs.

ryon-jensen

@mchaarawi - We are investigating ways to improve the resources of the system the Fault Injection stage runs on, and I want to make sure we are focussing on the correct things. What do you mean by "CI networking issue"? From my understanding, there should be no networking with the test because it's all run on the same system, but honestly, I don't understand the test all that well, so I am probably missing something.

mchaarawi

Most of the issues we have seen on the FI test are network related:
RPCs failing with timeouts, SWIM not able to make progress, etc.
Even that Sluggish EC boundary warning indicates network slowness.
None of these issues used to happen at Intel, but they have been very common at FC. So yes, this has been a problem for many months, and I'm surprised you are asking since you were part of the discussions on these issues during triage.
Anyway, this is not the place for this discussion. This PR is just a workaround that can be removed later. If you would rather just have folks keep restarting PRs to get a green run, I can close this.

ryon-jensen

Not at all, I appreciate the workaround. I'm honestly just trying to understand, not push back on the change.

mchaarawi

No worries. I see my comment did not come across as I intended. I meant that if you would rather have a proper fix for this in the infrastructure and have folks keep restarting in the meantime, I can close this.

I do think it would be better to have this workaround in the short term, though.
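
A minimal sketch of the kind of allow-list this workaround adds, assuming a filter that returns `False` for messages that should not fail the run. The two message strings are taken from the diff above; the names `ALLOWED_FI_MESSAGES` and `should_flag` are hypothetical, and treating the second string the same way as the first is an assumption because the diff excerpt is cut off after that condition.

```python
# Substrings of log messages tolerated during Fault Injection runs.
ALLOWED_FI_MESSAGES = (
    'Sluggish EC boundary report from rank',
    # Assumed to be allowed as well; the diff excerpt does not show its body.
    'The progress callback was not called for too long',
)


def should_flag(log_msg):
    """Return False for allow-listed messages, True for everything else."""
    return not any(allowed in log_msg for allowed in ALLOWED_FI_MESSAGES)
```

Keeping the strings in one tuple would make the workaround easy to revert later: emptying the tuple restores the original behaviour.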

grom72 · 2026-04-10T06:15:59Z

utils/node_local_test.py

+            if 'Sluggish EC boundary report from rank' in log_msg:
+                return False


Wouldn't it be better to implement a solution similar to SWIM (see the next comment), which allows us to configure the timeout (600) introduced by the #17309 PR using an environment variable?

daos/src/container/srv_container.c, line 2206 in b0f8277:

    if (pool->sp_reclaim != DAOS_RECLAIM_DISABLED &&
        cur_ts > eph_ldr->cte_server_ephs[i].re_ec_agg_eph_update_ts + 600)

mchaarawi

Same answer as above, more or less.
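
To illustrate the suggestion in this thread, here is a sketch (in Python rather than C, purely for illustration) of a staleness check whose threshold can be overridden from the environment. The variable name `DAOS_EC_EPH_REPORT_TIMEOUT` is hypothetical and does not exist in DAOS as far as this thread shows; the default of 600 mirrors the hard-coded value in the quoted condition, and the `sp_reclaim` part of that condition is left out for brevity.

```python
import os

DEFAULT_REPORT_TIMEOUT = 600  # hard-coded value in the quoted condition


def report_timeout(environ=os.environ):
    """Read the threshold from a hypothetical environment variable.

    Falls back to the default when the variable is unset, not an
    integer, or not positive.
    """
    raw = environ.get('DAOS_EC_EPH_REPORT_TIMEOUT')  # hypothetical name
    if raw is None:
        return DEFAULT_REPORT_TIMEOUT
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_REPORT_TIMEOUT
    return value if value > 0 else DEFAULT_REPORT_TIMEOUT


def is_sluggish(cur_ts, last_update_ts, timeout=None):
    """Mirror `cur_ts > update_ts + 600` with a configurable threshold."""
    if timeout is None:
        timeout = report_timeout()
    return cur_ts > last_update_ts + timeout
```

The threshold is assumed to be in the same units as the timestamps being compared.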

ryon-jensen

I agree, let's move forward with this workaround. We will continue to investigate the infrastructure.
Thanks!

mchaarawi · 2026-04-10T16:27:23Z

There is still a bug in the PR where the errors that should be ignored are not being ignored. I will need to iterate more on this.

daosbuild3 · 2026-04-10T20:02:35Z

Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17959/9/display/redirect

Skip-func-hw-test: true Skip-unit-tests: true Skip-unit-test: true Skip-unit-test-memcheck: true Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

mchaarawi · 2026-04-11T04:18:52Z

OK, the PR is now ready for review. In the last run, FI passed and I do see the SWIM errors in the server log:
https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17959/11/artifact/nlt_logs/fault-injection/dnt_server_Server.no-debug_0_rept_r2x.log.bz2
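
A check like this can be repeated on a downloaded artifact. The sketch below counts lines in a bzip2-compressed log that contain given substrings; the function name `count_matches` and the local file path are hypothetical, and the search strings are whatever messages the reader wants to confirm.

```python
import bz2


def count_matches(log_path, needles):
    """Count lines in a .bz2 log that contain each substring in needles."""
    counts = {needle: 0 for needle in needles}
    # Text mode decompresses on the fly; undecodable bytes are replaced
    # rather than raising, since server logs may contain stray bytes.
    with bz2.open(log_path, 'rt', errors='replace') as log:
        for line in log:
            for needle in needles:
                if needle in line:
                    counts[needle] += 1
    return counts
```

For example, `count_matches('server.log.bz2', ['The progress callback was not called for too long'])` would return a dictionary with one count per search string.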

grom72 · 2026-04-13T11:51:44Z

> There is still a bug in the PR where the errors that should be ignored are not being ignored. I will need to iterate more on this.

Could we look at this problem from a slightly different perspective?
It will be very difficult to re-enable these bugs once they’ve been disabled.
A single PR isn’t enough to confirm that the bug has gone.

As I understand it, the main issue is the need to restart CI, primarily because further hardware tests haven’t been run.
This problem can be temporarily fixed by moving the FI tests to the Test Hardware stage in Jenkinsfile.
The validation status will be "UNSTABLE" but gatekeeper can easily check that it is only related to FI tests.

In the end, the FI tests should be moved to a more stable and predictable runtime environment. As we can see, running them in parallel in Docker containers does not guarantee this. All FI tests are actually run on the host where Jenkins' Java agents are run (up to 32), and up to 10 Fault Injection tests can be executed at the same time.

It seems right to use the same assumptions here as for NLT tests. These are run directly on bare metal on dedicated nodes. This is important for keeping NLT test results stable and reproducible.

mchaarawi · 2026-04-13T12:25:58Z

> > There is still a bug in the PR where the errors that should be ignored are not being ignored. I will need to iterate more on this.
>
> Could we look at this problem from a slightly different perspective? It will be very difficult to re-enable these bugs once they’ve been disabled. A single PR isn’t enough to confirm that the bug has gone.

I am not sure what you are saying (re-enable bugs). Maybe you misunderstand what FI is doing.
Those errors we see are all network related. Once we can confirm that the network is stable, we can re-enable failing on those error conditions. The FI test is not inserting any delays that cause those errors, so maybe you need to read more into what it's doing.
It is not difficult to re-enable those error condition checks: just revert this PR.

> As I understand it, the main issue is the need to restart CI, primarily because further hardware tests haven’t been run. This problem can be temporarily fixed by moving the FI tests to the Test Hardware stage in Jenkinsfile. The validation status will be "UNSTABLE" but gatekeeper can easily check that it is only related to FI tests.

This is not an acceptable solution IMO. We have a lot of gatekeepers who do not have force-landing privileges, and we cannot have only owners be gatekeepers. So no, allow-unstable should not be used.

> In the end, the FI tests should be moved to a more stable and predictable runtime environment. As we can see, running them in parallel in Docker containers does not guarantee this. All FI tests are actually run on the host where Jenkins' Java agents are run (up to 32), and up to 10 Fault Injection tests can be executed at the same time.
>
> It seems right to use the same assumptions here as for NLT tests. These are run directly on bare metal on dedicated nodes. This is important for keeping NLT test results stable and reproducible.

Sure, I see this as an SRE ticket and I'm not sure how long it will take. Once it is figured out, this PR can be reverted.

github-actions bot added the priority label (Ticket has high priority, automatically managed) Apr 9, 2026

mchaarawi force-pushed the mschaara/fi_allow_err branch 3 times, most recently from 15e151e to 8b2915c, April 9, 2026 23:37

mchaarawi force-pushed the mschaara/fi_allow_err branch 3 times, most recently from c242234 to e436ce2, April 10, 2026 04:03

grom72 reviewed Apr 10, 2026

ryon-jensen previously approved these changes Apr 10, 2026

mchaarawi dismissed ryon-jensen’s stale review via c504163 April 10, 2026 16:49

mchaarawi force-pushed the mschaara/fi_allow_err branch from e436ce2 to c504163, April 10, 2026 16:49

mchaarawi force-pushed the mschaara/fi_allow_err branch from c504163 to 2c864ea, April 10, 2026 20:13

DAOS-623 test: add allowed error for FI

3c3bb3d

Skip-func-hw-test: true Skip-unit-tests: true Skip-unit-test: true Skip-unit-test-memcheck: true Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

mchaarawi force-pushed the mschaara/fi_allow_err branch from 2c864ea to 3c3bb3d, April 10, 2026 23:32

mchaarawi marked this pull request as ready for review April 11, 2026 04:18

mchaarawi requested review from a team as code owners April 11, 2026 04:18

mchaarawi requested review from daltonbohning, phender and ryon-jensen April 11, 2026 04:19

mchaarawi requested a review from frostedcmos April 13, 2026 12:28

frostedcmos approved these changes Apr 13, 2026

daltonbohning approved these changes Apr 13, 2026

mchaarawi merged commit b03decb into master Apr 13, 2026
32 checks passed

mchaarawi deleted the mschaara/fi_allow_err branch April 13, 2026 15:48

mchaarawi added a commit that referenced this pull request Apr 13, 2026

DAOS-623 test: add allowed error for FI (#17959)

a85fb3f

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

mchaarawi added a commit that referenced this pull request Apr 13, 2026

DAOS-623 test: add allowed error for FI (#17959)

b336714

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

mchaarawi added a commit that referenced this pull request Apr 14, 2026

DAOS-623 test: add allowed error for FI (#17959)

cb7b940

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

mchaarawi added a commit that referenced this pull request Apr 14, 2026

DAOS-623 test: add allowed error for FI (#17959)

c827223

Skip-func-hw-test: true Skip-unit-test: true Skip-unit-test-memcheck: true Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

mchaarawi added a commit that referenced this pull request Apr 14, 2026

DAOS-623 test: add allowed error for FI (#17959)

dfa6f7c

Skip-func-hw-test: true Skip-unit-test: true Skip-unit-test-memcheck: true Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

mchaarawi added a commit that referenced this pull request Apr 14, 2026

DAOS-623 test: add allowed error for FI (#17959)

4ab0078

Skip-func-hw-test: true Skip-unit-test: true Skip-unit-test-memcheck: true Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

6 participants
