SRE-3703 ci: Fault injection testing stage on VM (#17953) 2.8 #18252

Open
grom72 wants to merge 4 commits into release/2.8 from grom72/SRE-3703-2.8

Contributor

@grom72 grom72 commented May 14, 2026

Backport of: #17953

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/SRE-3703

unitTestPost() already processes nlt-junit.xml via the testResults
parameter it receives. The bare 'junit testResults: nlt-junit.xml'
call that follows is redundant and has no failure protection: it uses
the default healthScaleFactor, so when fault injection tests
intentionally produce failures in nlt-junit.xml, it marks the build
FAILURE immediately, overriding the controlled result handling done
by unitTestPost().

When node_local_test.py runs with --no-root, DAOS logs are written to
/localhome/jenkins/build/nlt_logs/ instead of /tmp/. The existing rsync
only fetches from /tmp/, leaving nlt_logs/ empty and causing:

  No artifacts found that match the file pattern "nlt_logs/". Configuration error?

Add a second rsync from build/nlt_logs/ to collect logs from the --no-root
code path. The '|| true' ensures non-fatal behavior when the path does not
exist (plain NLT runs without --no-root).
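
The best-effort collection described above can be sketched in Python. This is only an illustration of the logic, not the CI script: it swaps in `shutil.copytree` for rsync, `collect_logs` is a hypothetical helper, and the source and destination directories are supplied by the caller rather than taken from the real job layout. It is also narrower than `|| true`, since it tolerates only a missing source directory, not other copy errors.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def collect_logs(sources: Iterable[Path], dest: Path) -> List[Path]:
    """Copy every existing source directory into dest and skip missing ones.

    Hypothetical helper for illustration; returns the sources that were copied.
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied: List[Path] = []
    for src in sources:
        if not src.is_dir():
            # Same intent as '|| true': an absent log directory is not an error.
            continue
        # dirs_exist_ok lets several sources merge into one destination (Python 3.8+).
        shutil.copytree(src, dest, dirs_exist_ok=True)
        copied.append(src)
    return copied
```

Calling it with both candidate log locations yields a populated destination whichever code path produced the logs, so the artifact pattern always has something to match.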

Jenkinsfile: simplify NLT fault injection recordIssues call

The vm_test/nlt-errors.json issue scanning for the 'NLT Fault injection
testing' stage is now handled by unitTestPost() in pipeline-lib, so
remove it from the explicit recordIssues call here.

fault_status fallback based only on PATH

- Add fallback `fault_status` detection: if the primary detection via `$PREFIX/bin` fails,
  try resolving `fault_status` via `$PATH`, improving robustness when the binary is
  installed via RPM rather than built in-tree.
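
A minimal sketch of this two-step lookup, assuming it is done from Python. `find_fault_status` is a hypothetical name and the actual detection code in this PR may be structured differently; only `os.path`, `os.access` and `shutil.which` from the standard library are used.

```python
import os
import shutil
from typing import Optional


def find_fault_status(prefix: Optional[str]) -> Optional[str]:
    """Return a path to an executable fault_status, or None if none is found.

    Hypothetical helper: checks $PREFIX/bin first, then falls back to $PATH.
    """
    if prefix:
        candidate = os.path.join(prefix, "bin", "fault_status")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Fallback, e.g. when the binary comes from an RPM rather than an in-tree build.
    return shutil.which("fault_status")
```

A caller could pass `os.environ.get("PREFIX")` and treat a `None` result as "fault injection status cannot be determined".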

nlt: remove ABT_STACK_OVERFLOW_CHECK=mprotect from nlt_server.yaml

mprotect-based Argobots ULT stack overflow checking causes a TLB
shootdown IPI on every stack allocation/deallocation. On KVM hosts
running multiple VMs in parallel this results in VM exits across all
vCPUs, significantly increasing latency under concurrent load.

Remove the setting to use the default (no overflow check), which is
acceptable for a CI/test environment where crashes are already caught
by the test harness.

ci: explicitly pass NLT/FI parameters to unitTest and unitTestPost

pipeline-lib now supports overriding NLT/FI defaults (always_script,
testResults, valgrind_pattern, with_valgrind, NLT, FI) via the config
map, taking priority over the values auto-detected from the stage name
by parseStageInfo.  Make the Jenkinsfile stages explicit to take
advantage of this and to make the stage configuration self-documenting.

NLT stage (unitTest call):
- Add with_valgrind: 'memcheck', valgrind_pattern: '*memcheck.xml',
  always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'

NLT stage (unitTestPost call):
- Remove always_script (now passed to unitTest above)
- Add NLT: true to explicitly activate the NLT post-processing block
  (recordIssues, discoverGitReferenceBuild) instead of relying on
  stage name detection
- Add valgrind_pattern: '*memcheck.xml' for the valgrind_stash

NLT Fault injection testing stage (unitTest call):
- Add always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'
- Add with_valgrind: '' to explicitly suppress valgrind for FI

NLT Fault injection testing stage (unitTestPost call):
- Replace always_script with FI: true to explicitly activate fault
  injection post-processing (nlt-client-leaks.json, 'Fault injection'
  naming, discoverGitReferenceBuild) instead of relying on the now-
  removed stage name auto-detection of FI in parseStageInfo
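
The override priority described above (explicit config map entries win over values auto-detected from the stage name) can be illustrated with a short Python sketch. pipeline-lib itself is Groovy; `detect_defaults` and `resolve_stage_config` are hypothetical stand-ins, and the detection rule shown is an assumption, not the behavior of parseStageInfo.

```python
from typing import Any, Dict


def detect_defaults(stage_name: str) -> Dict[str, Any]:
    """Hypothetical stand-in for name-based auto-detection (not parseStageInfo)."""
    defaults: Dict[str, Any] = {"NLT": False, "with_valgrind": ""}
    if "NLT" in stage_name:
        # Assumed rule for illustration only.
        defaults.update({"NLT": True, "with_valgrind": "memcheck"})
    return defaults


def resolve_stage_config(stage_name: str, config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge auto-detected defaults with the explicit config; explicit keys win."""
    resolved = detect_defaults(stage_name)
    # update() also honors explicit falsy values such as with_valgrind: ''.
    resolved.update(config)
    return resolved


fi_unit_test = resolve_stage_config(
    "NLT Fault injection testing",
    {
        "always_script": "ci/unit/test_nlt_post.sh",
        "testResults": "nlt-junit.xml",
        "with_valgrind": "",
    },
)
print(fi_unit_test)
```

Merging with `update()` rather than a `config.get(key) or default` pattern matters here: an explicit empty string, as used to suppress valgrind for the fault injection stage, would otherwise be replaced by the auto-detected default.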

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2
Cancel-prev-build: false
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true
Skip-func-test-leap15: true
@grom72 grom72 force-pushed the grom72/SRE-3703-2.8 branch from bbfd6e0 to 60494d6 Compare May 25, 2026 12:30
@daosbuild3
Collaborator

@grom72 grom72 changed the title from "SRE-3703 ci: Fault injection testing stage on VM" to "SRE-3703 ci: Fault injection testing stage on VM (#17953) 2.8" May 25, 2026
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2
Cancel-prev-build: false
Skip-python-bandit: true
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true
Skip-func-test-leap15: true
…3-2.8

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

Cancel-prev-build: false
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true
Skip-func-test-leap15: true
@grom72 grom72 marked this pull request as ready for review May 28, 2026 13:27
@grom72 grom72 requested review from a team as code owners May 28, 2026 13:27
@grom72 grom72 added the clean-cherry-pick (Cherry-pick from another branch that did not require additional edits) and forced-landing (The PR has known failures or has intentionally reduced testing, but should still be landed) labels May 28, 2026
daltonbohning previously approved these changes May 28, 2026
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2
Cancel-prev-build: false
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true
Skip-func-test-leap15: true

Labels

  • clean-cherry-pick: Cherry-pick from another branch that did not require additional edits
  • forced-landing: The PR has known failures or has intentionally reduced testing, but should still be landed.

3 participants