DAOS-623 test: add allowed error for FI#17959
Ticket title is 'Generic ticket for minor code cleanup and improvement'
15e151e to
8b2915c
Compare
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17959/3/execution/node/747/log
c242234 to
e436ce2
Compare
utils/node_local_test.py
Outdated
    if 'Sluggish EC boundary report from rank' in log_msg:
        return False

    if 'The progress callback was not called for too long' in log_msg:
Would it be better to increase the default value of the swim_prot_period_len parameter for NLT/FI tests using the SWIM_PROTOCOL_PERIOD_LEN environment variable?
The variance in the numbers for this in CI VM tests has been too random, so we do not know where to start. This is fundamentally a CI networking issue, and this is a quick workaround for now so folks do not have to keep restarting their PRs.
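For reference, the alternative raised in the question above (overriding the SWIM period through the `SWIM_PROTOCOL_PERIOD_LEN` environment variable) could look roughly like the sketch below. This is not code from the PR: the helper name `env_with_swim_period` is hypothetical, and the value `2000` is only a placeholder, since the thread notes that the CI timings vary too much to pick a good number.

```python
import os


def env_with_swim_period(period, base_env=None):
    """Return a copy of the environment with SWIM_PROTOCOL_PERIOD_LEN set.

    The helper name and the idea of passing the result to a test-launched
    server are assumptions made for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Environment values must be strings.
    env['SWIM_PROTOCOL_PERIOD_LEN'] = str(period)
    return env


# Placeholder value; no suitable setting is established in this thread.
example_env = env_with_swim_period(2000, base_env={'PATH': '/usr/bin'})
```

The returned dictionary could then be passed as the `env` argument of `subprocess.Popen` by whatever code launches the server, leaving the caller's own environment untouched.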
@mchaarawi - We are investigating ways to improve the resources of the system the Fault Injection stage runs on, and I want to make sure we are focussing on the correct things. What do you mean by "CI networking issue"? From my understanding, there should be no networking with the test because it's all run on the same system, but honestly, I don't understand the test all that well so am probably missing something.
Most of the issues we have seen in the FI test are network related:
RPCs failing with timeouts, SWIM not able to make progress, etc.
Even that Sluggish EC boundary warning indicates network slowness.
None of these issues used to happen at Intel, but they were very common at FC. So yes, this has been a problem for many months, and I'm surprised you are asking, since you were part of the discussions on these issues during triage.
Anyway, this is not the place for this discussion. This PR is just a workaround that can be removed later. If you would rather just have folks keep restarting PRs to get a green run, I can close this.
Not at all, I appreciate the workaround. I'm honestly just trying to understand, not push back on the change.
No worries. I see my comment did not sound the way I intended. I meant that if you would rather have a proper fix for this in the infrastructure and have folks keep restarting in the meantime, I can close this.
I do think it would be better to have this workaround in the short term, though.
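The workaround discussed in this thread amounts to matching known log messages by substring and not reporting them. A minimal sketch of that idea follows, using the two message fragments quoted in the diff. The function name `should_report`, the tuple `ALLOWED_SUBSTRINGS`, and the meaning of the return value are assumptions for illustration; the actual structure of `utils/node_local_test.py` is not shown in this conversation.

```python
# Message fragments quoted in the diff above. Everything else in this
# sketch (names, return convention) is hypothetical.
ALLOWED_SUBSTRINGS = (
    'Sluggish EC boundary report from rank',
    'The progress callback was not called for too long',
)


def should_report(log_msg, allowed=ALLOWED_SUBSTRINGS):
    """Return False if log_msg contains any allowed fragment, else True."""
    return not any(fragment in log_msg for fragment in allowed)
```

Keeping the fragments in one tuple means that removing the workaround later only requires deleting entries, not unpicking a chain of `if` statements.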
utils/node_local_test.py
Outdated
    if 'Sluggish EC boundary report from rank' in log_msg:
        return False
Wouldn't it be better to implement a solution similar to SWIM (see the next comment), which allows us to configure the timeout (600) introduced by the #17309 PR using an environment variable?
daos/src/container/srv_container.c
Line 2206 in b0f8277
if (pool->sp_reclaim != DAOS_RECLAIM_DISABLED &&
cur_ts > eph_ldr->cte_server_ephs[i].re_ec_agg_eph_update_ts + 600)
same answer as above more or less.
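The suggestion in this thread, making the hard-coded 600-second threshold configurable through an environment variable, can be illustrated with the sketch below. It is written in Python only to show the logic; the real check lives in C in `srv_container.c`. The variable name `EXAMPLE_EC_REPORT_TIMEOUT` is invented for this example and is not an existing DAOS setting, and the `sp_reclaim` part of the quoted condition is left out.

```python
import os

DEFAULT_TIMEOUT_SECS = 600  # the hard-coded value in the quoted C snippet


def report_timeout_secs(env=None, var_name='EXAMPLE_EC_REPORT_TIMEOUT'):
    """Read the timeout from the environment, falling back to the default.

    var_name is a hypothetical name used only for this illustration.
    """
    env = os.environ if env is None else env
    raw = env.get(var_name)
    if raw is None:
        return DEFAULT_TIMEOUT_SECS
    try:
        value = int(raw)
    except ValueError:
        # Ignore values that are not integers.
        return DEFAULT_TIMEOUT_SECS
    return value if value > 0 else DEFAULT_TIMEOUT_SECS


def is_sluggish(cur_ts, last_update_ts, timeout_secs):
    """Mirror the timestamp comparison from the quoted C condition."""
    return cur_ts > last_update_ts + timeout_secs
```

Falling back to the default on missing or invalid input keeps the existing behavior unchanged unless the variable is set deliberately.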
ryon-jensen
left a comment
I agree, let's move forward with this workaround. We will continue to investigate the infrastructure.
Thanks!
There is still a bug in the PR where the errors meant to be ignored are not being ignored. I will need to iterate more on this.
e436ce2 to
c504163
Compare
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17959/9/display/redirect
c504163 to
2c864ea
Compare
Skip-func-hw-test: true Skip-unit-tests: true Skip-unit-test: true Skip-unit-test-memcheck: true Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
2c864ea to
3c3bb3d
Compare
OK, the PR is now ready for review. In the last run, FI passed, and I do see the SWIM errors in the server log:
Could we look at this problem from a slightly different perspective? As I understand it, the main issue is the need to restart CI, primarily because further hardware tests haven’t been run. Ultimately, the FI tests should be moved to a more stable and predictable runtime environment. As we can see, running them in parallel in Docker containers does not guarantee this. All FI tests actually run on the host where Jenkins' Java agents run (up to 32), and up to 10 Fault Injection tests can execute at the same time. It seems right to apply the same assumptions here as for the NLT tests, which run directly on bare metal on dedicated nodes. This is important for keeping NLT test results stable and reproducible.
I am not sure what you are saying (re-enable bugs). Maybe you misunderstand what FI is doing.
This is not an acceptable solution IMO. We have a lot of gatekeepers who do not have force-landing privileges, and we cannot have only owners be gatekeepers. So no, allow-unstable should not be used.
Sure, I see this as an SRE ticket, and I'm not sure how long it will take. Once it is figured out, this PR can be reverted.
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
Skip-func-hw-test: true Skip-unit-test: true Skip-unit-test-memcheck: true Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
Skip-func-hw-test: true
Skip-unit-test: true
Skip-unit-test-memcheck: true