Skip to content

DAOS-623 test: add allowed error for FI#17959

Merged
mchaarawi merged 1 commit intomasterfrom
mschaara/fi_allow_err
Apr 13, 2026
Merged

DAOS-623 test: add allowed error for FI#17959
mchaarawi merged 1 commit intomasterfrom
mschaara/fi_allow_err

Conversation

@mchaarawi
Copy link
Copy Markdown
Contributor

Skip-func-hw-test: true
Skip-unit-test: true
Skip-unit-test-memcheck: true

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 9, 2026

Ticket title is 'Generic ticket for minor code cleanup and improvement'
Status is 'Resolved'
Labels: 'request_for_2.6.5,request_for_2.6.6,request_for_2.8,scrubbed_2.6.5'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-623

@github-actions github-actions bot added the priority Ticket has high priority (automatically managed) label Apr 9, 2026
@mchaarawi mchaarawi force-pushed the mschaara/fi_allow_err branch 3 times, most recently from 15e151e to 8b2915c Compare April 9, 2026 23:37
@daosbuild3
Copy link
Copy Markdown
Collaborator

@mchaarawi mchaarawi force-pushed the mschaara/fi_allow_err branch 3 times, most recently from c242234 to e436ce2 Compare April 10, 2026 04:03
if 'Sluggish EC boundary report from rank' in log_msg:
return False

if 'The progress callback was not called for too long' in log_msg:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to increase the default value of the swim_prot_period_len parameter for NLT/FI tests using the SWIM_PROTOCOL_PERIOD_LEN environment variable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the variance in the numbers for this in CI VM tests has been too random, so we do not know where to start.. t his is fundamentally a CI networking issue and this is a quick workaround for now so folks do not have to keep restarting their PRs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mchaarawi - We are investigating ways to improve the resources of the system the Fault Injection stage runs on, and I want to make sure we are focussing on the correct things. What do you mean by "CI networking issue"? From my understanding, there should be no networking with the test because it's all run on the same system, but honestly, I don't understand the test all that well so am probably missing something.

Copy link
Copy Markdown
Contributor Author

@mchaarawi mchaarawi Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the issues we have seen on the FI injection test are network related.
RPCs failing with timeouts, SWIM not able to make progress, etc.
even that Sluggish EC boundary warning indicates a network slowness.
all of these issues did not use to happen at Intel, but were very common at FC.. so yes this has been a problem for many months and im surprised you are asking since you were part of the discussions on these issues during triage..
anyway this is not the place for this discussion. this PR is just a workaround that can be removed later. if you would rather just have folks keep restarting PRs to get a green run, i can close this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all, I appreciate the workaround. I'm honestly just trying to understand, not push back on the change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no worries.. i see my comment did not sound as i intended it should. i meant that if you would rather have a proper fix for this in the infrastructure and have folks keep restarting in the meantime i can close this.

i do think it would be better to have this workaround in the short term though.

Comment on lines +5030 to +5031
if 'Sluggish EC boundary report from rank' in log_msg:
return False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to implement a solution similar to SWIM (see the next comment), which allows us to configure the timeout (600) introduced by the #17309 PR using an environment variable?

cur_ts > eph_ldr->cte_server_ephs[i].re_ec_agg_eph_update_ts + 600)

			if (pool->sp_reclaim != DAOS_RECLAIM_DISABLED &&
			    cur_ts > eph_ldr->cte_server_ephs[i].re_ec_agg_eph_update_ts + 600)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same answer as above more or less.

ryon-jensen
ryon-jensen previously approved these changes Apr 10, 2026
Copy link
Copy Markdown
Contributor

@ryon-jensen ryon-jensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, let's move forward with this workaround. We will continue to investigate the infrastructure.
Thanks!

@mchaarawi
Copy link
Copy Markdown
Contributor Author

there is still some bug in the PR where the ignore errors are not being ignored.. will need to iterate more on this

@daosbuild3
Copy link
Copy Markdown
Collaborator

Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17959/9/display/redirect

@mchaarawi mchaarawi force-pushed the mschaara/fi_allow_err branch from c504163 to 2c864ea Compare April 10, 2026 20:13
Skip-func-hw-test: true
Skip-unit-tests: true
Skip-unit-test: true
Skip-unit-test-memcheck: true

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
@mchaarawi mchaarawi force-pushed the mschaara/fi_allow_err branch from 2c864ea to 3c3bb3d Compare April 10, 2026 23:32
@mchaarawi
Copy link
Copy Markdown
Contributor Author

OK the PR is now ready to review. in the last run, FI passed and i do see the SWIM errors in the server log:
https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17959/11/artifact/nlt_logs/fault-injection/dnt_server_Server.no-debug_0_rept_r2x.log.bz2

@mchaarawi mchaarawi marked this pull request as ready for review April 11, 2026 04:18
@mchaarawi mchaarawi requested review from a team as code owners April 11, 2026 04:18
@grom72
Copy link
Copy Markdown
Contributor

grom72 commented Apr 13, 2026

there is still some bug in the PR where the ignore errors are not being ignored.. will need to iterate more on this

Could we look at this problem from a slightly different perspective?
It will be very difficult to re-enable these bugs once they’ve been disabled.
A single PR isn’t enough to confirm that the bug has gone.

As I understand it, the main issue is the need to restart CI, primarily because further hardware tests haven’t been run.
This problem can be temporarily fixed by moving the FI tests to the Test Hardware stage in Jenkinsfile.
The validation status will be "UNSTABLE" but gatekeeper can easily check that it is only related to FI tests.

In the end, the FI tests should be moved to a more stable and predictable runtime environment. As we can see, running them in parallel in Docker containers does not guarantee this. All FI tests are actually run on the host where Jenkins' Java agents are run (up to 32), and up to 10 Fault Injection tests can be executed at the same time.

It seems right to use the same assumptions here as for NLT tests. These are run on burn metal directly on dedicated nodes. This is important for keeping NLT test results stable and able to be reproduced.

@mchaarawi
Copy link
Copy Markdown
Contributor Author

mchaarawi commented Apr 13, 2026

there is still some bug in the PR where the ignore errors are not being ignored.. will need to iterate more on this

Could we look at this problem from a slightly different perspective? It will be very difficult to re-enable these bugs once they’ve been disabled. A single PR isn’t enough to confirm that the bug has gone.

i am not sure what you are saying (re-enable bugs). Maybe you misunderstand what FI is doing.
those errors we see are all network related. once we can confirm that the network is stable, then we can re-enable failing on those error conditions. the FI test is not actually inserting any delays that actually causes those errors, so maybe you need to read more into what it's doing.
It is not difficult to re-enable those error condition checks. just revert this PR.

As I understand it, the main issue is the need to restart CI, primarily because further hardware tests haven’t been run. This problem can be temporarily fixed by moving the FI tests to the Test Hardware stage in Jenkinsfile. The validation status will be "UNSTABLE" but gatekeeper can easily check that it is only related to FI tests.

this is not an acceptable solution IMO. We have a lot of gatekeeper that do not have force-landing privileges and we cannot have only owners be gatekeepers. so no, allow-unstable should not be used.

In the end, the FI tests should be moved to a more stable and predictable runtime environment. As we can see, running them in parallel in Docker containers does not guarantee this. All FI tests are actually run on the host where Jenkins' Java agents are run (up to 32), and up to 10 Fault Injection tests can be executed at the same time.

It seems right to use the same assumptions here as for NLT tests. These are run on burn metal directly on dedicated nodes. This is important for keeping NLT test results stable and able to be reproduced.

sure, i see this as an SRE ticket and im not sure how long it will take. once it is figured out, this PR can be reverted.

@mchaarawi mchaarawi requested a review from frostedcmos April 13, 2026 12:28
@mchaarawi mchaarawi merged commit b03decb into master Apr 13, 2026
32 checks passed
@mchaarawi mchaarawi deleted the mschaara/fi_allow_err branch April 13, 2026 15:48
mchaarawi added a commit that referenced this pull request Apr 13, 2026
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
mchaarawi added a commit that referenced this pull request Apr 13, 2026
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
mchaarawi added a commit that referenced this pull request Apr 14, 2026
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
mchaarawi added a commit that referenced this pull request Apr 14, 2026
Skip-func-hw-test: true
Skip-unit-test: true
Skip-unit-test-memcheck: true

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
mchaarawi added a commit that referenced this pull request Apr 14, 2026
Skip-func-hw-test: true
Skip-unit-test: true
Skip-unit-test-memcheck: true

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
mchaarawi added a commit that referenced this pull request Apr 14, 2026
Skip-func-hw-test: true
Skip-unit-test: true
Skip-unit-test-memcheck: true

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority Ticket has high priority (automatically managed)

Development

Successfully merging this pull request may close these issues.

6 participants