-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
🏥 CI Failure Investigation - Run #1248
Summary
The Windows CI workflow failed due to 13 regression test benchmarks timing out during the "Run Regressions" step on the x64 build. This failure occurred after commit 62b3668 which removed set cardinality operators from array theory and introduced priority levels for final-check.
Failure Details
- Run: #19720419164
- Workflow: Windows.yml
- Commit: 62b3668
- Author:
@NikolajBjorner - Trigger: push to master
- Date: 2025-11-26T23:35:29Z
Root Cause Analysis
Primary Issue
Performance regression causing multiple SMT2 benchmarks to timeout that previously completed successfully.
Key Changes in Commit
The failing commit made significant changes intended to fix issue #7502:
-
Removed BAPA (Boolean Algebra with Presburger Arithmetic)
- Deleted
src/smt/theory_array_bapa.cpp(644 lines) - Deleted
src/smt/theory_array_bapa.h(43 lines) - Removed set cardinality operators from array theory
- Deleted
-
Introduced Priority Levels for Final-Check
- Modified
smt_theory.hto add priority parameter tofinal_check() - Updated
smt_context.cppto implement two-level final check - Changed all theory solver
final_check()methods to accept priority - Intent: Prevent nlsat from blocking quantifier instantiation
- Modified
-
Modified Theory Solvers
arith_solver.cpp/hnla_core.cpp/hnla_solver.cpp/h- Multiple theory files updated for new signature
Likely Cause
The priority level changes for final-check appear to have inadvertently affected solver performance. While the changes were intended to fix performance issues with nlsat blocking quantifier instantiation (issue #7502), they may have introduced a different performance bottleneck or altered the search strategy in a way that causes these specific benchmarks to timeout.
The removal of BAPA may have also eliminated optimization paths for certain array theory problems, though the timeouts suggest a solver strategy issue rather than missing functionality.
Failed Jobs and Errors
Job: build (x64)
- Conclusion: failure
- Failed Step: Run Regressions
- Duration: ~55 minutes
Timed Out Benchmarks (13 total)
z3test\regressions\smt2\2902.smt2
z3test\regressions\smt2\2943.smt2
z3test\regressions\smt2\3103.smt2
z3test\regressions\smt2\3123.smt2
z3test\regressions\smt2\3127.smt2
z3test\regressions\smt2\3158.smt2
z3test\regressions\smt2\3234.smt2
z3test\regressions\smt2\3523.smt2
z3test\regressions\smt2\3961.smt2
z3test\regressions\smt2\6196.smt2
z3test\regressions\smt2\segfault2.smt2
z3test\regressions\smt2\t151.smt2
z3test\regressions\smt2\t181.smt2
Other Build Results
- ✅ build (x86): SUCCESS (but skipped regressions)
- ✅ build (amd64_arm64): SUCCESS (but skipped regressions)
Investigation Findings
- Only x64 affected: The x86 and ARM64 builds succeeded, though they skip the regression test suite
- All failures are timeouts: No crashes, assertion failures, or incorrect results - only performance issues
- Related to issue regression with arith.solver=6 #7502: The commit message explicitly mentions fixing regression with arith.solver=6 #7502 about nlsat blocking quantifier instantiation
- Trade-off scenario: This appears to be a case where fixing one performance issue (nlsat blocking QI) introduced another (these specific benchmarks timing out)
Recommended Actions
-
Investigate priority level implementation (
src/smt/smt_context.cpp:840-850)- Review the two-level final-check implementation
- Check if the priority levels are being applied correctly
- Verify that the intended behavior (nlsat only at final level) isn't too restrictive
-
Profile the failing benchmarks
- Run the timed-out benchmarks locally with profiling enabled
- Compare solver behavior before and after the commit
- Identify which theory solver is consuming excessive time
-
Review final-check scheduling
- The changes to
final_check()signature affected multiple theories - Verify that all theories properly handle the priority parameter
- Check if some theories need to be more aggressive at lower priority levels
- The changes to
-
Consider selective BAPA restoration
- If the BAPA removal contributed to timeouts, consider partial restoration
- Alternatively, add regression tests for the removed functionality
-
Adjust timeout strategy
- The priority-based approach may need tuning
- Consider adaptive timeouts or resource limits per priority level
Prevention Strategies
To prevent similar issues in the future:
-
Extended Regression Testing
- Run full regression suite on all platforms before merging major solver changes
- Add performance benchmarks to CI with timeout thresholds
-
Incremental Changes
- Separate structural changes (BAPA removal) from algorithmic changes (priority levels)
- This makes bisecting performance issues easier
-
Performance Baselines
- Establish and track performance baselines for key benchmarks
- Alert on significant regressions (e.g., >50% slowdown)
-
Solver Configuration Testing
- Test with various solver configurations (arith.solver settings, etc.)
- Verify that fixes for one configuration don't break others
AI Team Self-Improvement
When working on Z3 or similar theorem provers, AI coding agents should:
- Always run regression tests locally before committing changes to core solver logic, especially changes affecting theory solvers or final-check mechanisms
- Profile benchmarks when making performance-related changes - use Z3's built-in profiling (
-stflag) to verify improvements don't cause regressions elsewhere - Separate refactoring from algorithmic changes - removing code (like BAPA) should be in separate commits from changing solver strategies (priority levels)
- Test multiple solver configurations when modifying core solver behavior - use different
arith.solversettings and solver modes - Document performance trade-offs - when fixing one performance issue, explicitly test that you haven't created others in different problem domains
- Add timeout monitoring - when modifying search strategies or final-check behavior, add specific test cases with known timeout characteristics
Historical Context
Related issues:
- regression with arith.solver=6 #7502: Performance regression with arith.solver=6 (the issue this commit intended to fix)
- Soundness: SetHasSize results in an incorrect
unsatanswer #4700: Previous issues with set cardinality operators and BAPA soundness - Unexpected performance regression on a simple-looking query #4702: Historical performance regression in arithmetic logics
Next Steps
@NikolajBjorner - This investigation shows that while the commit addresses #7502, it introduces timeout regressions in 13 benchmarks. Would you like to:
- Revert and redesign the priority level approach?
- Investigate and tune the priority thresholds?
- Add special handling for the affected benchmark types?
The trade-off between fixing #7502 and these new timeouts needs evaluation.
AI-generated content by CI Failure Doctor may contain mistakes.
Generated by Agentic Workflow Run