Skip to content

CI Failure Doctor🏥 CI Failure Investigation - Windows x64 Regression Test Timeouts (Run #1248) #8051

@github-actions

Description

@github-actions

🏥 CI Failure Investigation - Run #1248

Summary

The Windows CI workflow failed due to 13 regression test benchmarks timing out during the "Run Regressions" step on the x64 build. This failure occurred after commit 62b3668 which removed set cardinality operators from array theory and introduced priority levels for final-check.

Failure Details

  • Run: #19720419164
  • Workflow: Windows.yml
  • Commit: 62b3668
  • Author: @NikolajBjorner
  • Trigger: push to master
  • Date: 2025-11-26T23:35:29Z

Root Cause Analysis

Primary Issue

Performance regression causing multiple SMT2 benchmarks to timeout that previously completed successfully.

Key Changes in Commit

The failing commit made significant changes intended to fix issue #7502:

  1. Removed BAPA (Boolean Algebra with Presburger Arithmetic)

    • Deleted src/smt/theory_array_bapa.cpp (644 lines)
    • Deleted src/smt/theory_array_bapa.h (43 lines)
    • Removed set cardinality operators from array theory
  2. Introduced Priority Levels for Final-Check

    • Modified smt_theory.h to add priority parameter to final_check()
    • Updated smt_context.cpp to implement two-level final check
    • Changed all theory solver final_check() methods to accept priority
    • Intent: Prevent nlsat from blocking quantifier instantiation
  3. Modified Theory Solvers

    • arith_solver.cpp/h
    • nla_core.cpp/h
    • nla_solver.cpp/h
    • Multiple theory files updated for new signature

Likely Cause

The priority level changes for final-check appear to have inadvertently affected solver performance. While the changes were intended to fix performance issues with nlsat blocking quantifier instantiation (issue #7502), they may have introduced a different performance bottleneck or altered the search strategy in a way that causes these specific benchmarks to timeout.

The removal of BAPA may have also eliminated optimization paths for certain array theory problems, though the timeouts suggest a solver strategy issue rather than missing functionality.

Failed Jobs and Errors

Job: build (x64)

  • Conclusion: failure
  • Failed Step: Run Regressions
  • Duration: ~55 minutes

Timed Out Benchmarks (13 total)

z3test\regressions\smt2\2902.smt2
z3test\regressions\smt2\2943.smt2
z3test\regressions\smt2\3103.smt2
z3test\regressions\smt2\3123.smt2
z3test\regressions\smt2\3127.smt2
z3test\regressions\smt2\3158.smt2
z3test\regressions\smt2\3234.smt2
z3test\regressions\smt2\3523.smt2
z3test\regressions\smt2\3961.smt2
z3test\regressions\smt2\6196.smt2
z3test\regressions\smt2\segfault2.smt2
z3test\regressions\smt2\t151.smt2
z3test\regressions\smt2\t181.smt2

Other Build Results

  • build (x86): SUCCESS (but skipped regressions)
  • build (amd64_arm64): SUCCESS (but skipped regressions)

Investigation Findings

  1. Only x64 affected: The x86 and ARM64 builds succeeded, though they skip the regression test suite
  2. All failures are timeouts: No crashes, assertion failures, or incorrect results - only performance issues
  3. Related to issue regression with arith.solver=6 #7502: The commit message explicitly mentions fixing regression with arith.solver=6 #7502 about nlsat blocking quantifier instantiation
  4. Trade-off scenario: This appears to be a case where fixing one performance issue (nlsat blocking QI) introduced another (these specific benchmarks timing out)

Recommended Actions

  1. Investigate priority level implementation (src/smt/smt_context.cpp:840-850)

    • Review the two-level final-check implementation
    • Check if the priority levels are being applied correctly
    • Verify that the intended behavior (nlsat only at final level) isn't too restrictive
  2. Profile the failing benchmarks

    • Run the timed-out benchmarks locally with profiling enabled
    • Compare solver behavior before and after the commit
    • Identify which theory solver is consuming excessive time
  3. Review final-check scheduling

    • The changes to final_check() signature affected multiple theories
    • Verify that all theories properly handle the priority parameter
    • Check if some theories need to be more aggressive at lower priority levels
  4. Consider selective BAPA restoration

    • If the BAPA removal contributed to timeouts, consider partial restoration
    • Alternatively, add regression tests for the removed functionality
  5. Adjust timeout strategy

    • The priority-based approach may need tuning
    • Consider adaptive timeouts or resource limits per priority level

Prevention Strategies

To prevent similar issues in the future:

  1. Extended Regression Testing

    • Run full regression suite on all platforms before merging major solver changes
    • Add performance benchmarks to CI with timeout thresholds
  2. Incremental Changes

    • Separate structural changes (BAPA removal) from algorithmic changes (priority levels)
    • This makes bisecting performance issues easier
  3. Performance Baselines

    • Establish and track performance baselines for key benchmarks
    • Alert on significant regressions (e.g., >50% slowdown)
  4. Solver Configuration Testing

    • Test with various solver configurations (arith.solver settings, etc.)
    • Verify that fixes for one configuration don't break others

AI Team Self-Improvement

When working on Z3 or similar theorem provers, AI coding agents should:

  1. Always run regression tests locally before committing changes to core solver logic, especially changes affecting theory solvers or final-check mechanisms
  2. Profile benchmarks when making performance-related changes - use Z3's built-in profiling (-st flag) to verify improvements don't cause regressions elsewhere
  3. Separate refactoring from algorithmic changes - removing code (like BAPA) should be in separate commits from changing solver strategies (priority levels)
  4. Test multiple solver configurations when modifying core solver behavior - use different arith.solver settings and solver modes
  5. Document performance trade-offs - when fixing one performance issue, explicitly test that you haven't created others in different problem domains
  6. Add timeout monitoring - when modifying search strategies or final-check behavior, add specific test cases with known timeout characteristics

Historical Context

Related issues:

Next Steps

@NikolajBjorner - This investigation shows that while the commit addresses #7502, it introduces timeout regressions in 13 benchmarks. Would you like to:

  1. Revert and redesign the priority level approach?
  2. Investigate and tune the priority thresholds?
  3. Add special handling for the affected benchmark types?

The trade-off between fixing #7502 and these new timeouts needs evaluation.


AI-generated content by CI Failure Doctor may contain mistakes.

Generated by Agentic Workflow Run

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions