SRE-3703 ci: Fault injection testing stage on VM (#17953) 2.6 #18349
Draft
grom72 wants to merge 1 commit into
unitTestPost() already processes nlt-junit.xml via the testResults parameter it receives. The bare 'junit testResults: nlt-junit.xml' call that follows is redundant and has no failure protection: it uses the default healthScaleFactor, so when fault injection tests intentionally produce failures in nlt-junit.xml it marks the build FAILURE immediately, overriding the controlled result handling done by unitTestPost().

When node_local_test.py runs with --no-root, DAOS logs are written to /localhome/jenkins/build/nlt_logs/ instead of /tmp/. The existing rsync only fetches from /tmp/, leaving nlt_logs/ empty and causing:

No artifacts found that match the file pattern "nlt_logs/". Configuration error?

Add a second rsync from build/nlt_logs/ to collect logs from the --no-root code path. The '|| true' keeps the step non-fatal when the path does not exist (plain NLT runs without --no-root).

Jenkinsfile: simplify NLT fault injection recordIssues call

The vm_test/nlt-errors.json issue scanning for the 'NLT Fault injection testing' stage is now handled by unitTestPost() in pipeline-lib, so remove it from the explicit recordIssues call here.

fault_status fallback based only on PATH

- Add fallback `fault_status` detection: if the primary detection via `$PREFIX/bin` fails, try resolving `fault_status` via `$PATH`, improving robustness when the binary is installed via RPM rather than built in-tree.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2
Cancel-prev-build: false
Skip-python-bandit: true
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true

nlt: remove ABT_STACK_OVERFLOW_CHECK=mprotect from nlt_server.yaml

mprotect-based Argobots ULT stack overflow checking causes a TLB shootdown IPI on every stack allocation/deallocation. On KVM hosts running multiple VMs in parallel this results in VM exits across all vCPUs, significantly increasing latency under concurrent load.

Remove the setting to use the default (no overflow check), which is acceptable for a CI/test environment where crashes are already caught by the test harness.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2
Cancel-prev-build: false
Skip-python-bandit: true
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true

ci: explicitly pass NLT/FI parameters to unitTest and unitTestPost

pipeline-lib now supports overriding NLT/FI defaults (always_script, testResults, valgrind_pattern, with_valgrind, NLT, FI) via the config map, taking priority over the values auto-detected from the stage name by parseStageInfo. Make the Jenkinsfile stages explicit to take advantage of this and to make the stage configuration self-documenting.

NLT stage (unitTest call):
- Add with_valgrind: 'memcheck', valgrind_pattern: '*memcheck.xml', always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'

NLT stage (unitTestPost call):
- Remove always_script (now passed to unitTest above)
- Add NLT: true to explicitly activate the NLT post-processing block (recordIssues, discoverGitReferenceBuild) instead of relying on stage name detection
- Add valgrind_pattern: '*memcheck.xml' for the valgrind_stash

NLT Fault injection testing stage (unitTest call):
- Add always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'
- Add with_valgrind: '' to explicitly suppress valgrind for FI

NLT Fault injection testing stage (unitTestPost call):
- Replace always_script with FI: true to explicitly activate fault injection post-processing (nlt-client-leaks.json, 'Fault injection' naming, discoverGitReferenceBuild) instead of relying on the now-removed stage name auto-detection of FI in parseStageInfo

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Priority: 2
Cancel-prev-build: false
Skip-unit-test: true
Skip-unit-test-memcheck: true
Skip-func-vm-all: true
Skip-test-el-9-rpms: true
Skip-test-leap-15-rpms: true
Skip-func-hw-test: true
Skip-build-el8-gcc: true
Skip-build-leap15-gcc: true
Skip-func-test-el9: true
Skip-func-test-leap15: true
Errors are Unable to load ticket data

Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18349/1/testReport/

Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18349/2/testReport/
Backport of: #17953
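The best-effort log collection for the --no-root code path described in this PR could look roughly like the shell sketch below. The function name `collect_nlt_logs` and its two arguments are hypothetical, and the copy is shown between local directories for simplicity; the actual post script and its rsync options may differ.

```shell
# Sketch only: collect_nlt_logs and its arguments are hypothetical names,
# not taken from the DAOS CI scripts.
collect_nlt_logs() {
    src_root="$1"   # build directory, e.g. /localhome/jenkins/build
    dest="$2"       # directory that is later archived as nlt_logs/

    mkdir -p "$dest"

    # Best-effort copy. When "$src_root/nlt_logs/" does not exist (a plain
    # NLT run without --no-root), rsync exits non-zero; '|| true' turns that
    # into success so a script running under 'set -e' is not aborted.
    rsync -a "$src_root/nlt_logs/" "$dest/" 2>/dev/null || true
}
```

Calling `collect_nlt_logs /localhome/jenkins/build nlt_logs` after the existing /tmp/ fetch would then pick up logs from either location without failing when one of them is absent.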
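The lookup order for `fault_status` described in this PR (first `$PREFIX/bin`, then `$PATH`) can be illustrated with a small shell sketch. The helper name `find_fault_status` is hypothetical, and the real change may be implemented in a different language or place; only the order of the two checks is taken from the description.

```shell
# Sketch only: find_fault_status is a hypothetical helper illustrating the
# lookup order; it is not the code changed by this PR.
find_fault_status() {
    prefix="${1:-}"

    # Primary detection: an in-tree build installs the binary under PREFIX/bin.
    if [ -n "$prefix" ] && [ -x "$prefix/bin/fault_status" ]; then
        printf '%s\n' "$prefix/bin/fault_status"
        return 0
    fi

    # Fallback: resolve through PATH, which covers an RPM installation.
    # 'command -v' prints the resolved path and returns non-zero if the
    # binary cannot be found.
    command -v fault_status
}
```

A caller could treat a non-zero return as "fault injection support not available" and skip the corresponding tests, though how the result is consumed is an assumption here.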