Problem Statement
AgentReady's current testing approach has too many tests with insufficient signal:
- Test failures: Unclear what broke and why
- Flaky tests: Tests fail intermittently without code changes
- Slow CI: Tests take too long, slowing development velocity
- Low coverage: ~37% coverage despite many tests
- GHA complexity: Multiple workflows with overlapping responsibilities
Root cause: Focus on quantity over quality. More tests ≠ better testing.
Proposed Solution: Signal-Focused Testing Strategy
Phase 1: Categorize and Audit Existing Tests (Week 1)
Goal: Understand what we have and what provides value.
- Inventory all tests:
  - Count tests by category (unit, integration, e2e)
  - Identify duplicate/overlapping tests
  - Find tests with unclear assertions
  - Flag flaky tests (fail >5% of runs)
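The inventory steps above can be sketched as two small helpers. This is an illustrative sketch, not AgentReady's tooling: it assumes tests live under `tests/unit/`, `tests/integration/`, and `tests/e2e/`, that node IDs come from `pytest --collect-only -q`, and that per-test pass/fail history is available from CI logs.

```python
from collections import Counter

def categorize(node_ids):
    """Count collected tests by top-level category directory
    (assumed layout: tests/unit/, tests/integration/, tests/e2e/)."""
    counts = Counter()
    for nid in node_ids:
        parts = nid.split("/")
        if len(parts) > 1 and parts[0] == "tests":
            counts[parts[1]] += 1
        else:
            counts["uncategorized"] += 1
    return counts

def flaky(history, threshold=0.05):
    """Flag tests failing on more than `threshold` of recorded runs.
    `history` maps test id -> list of bools (True = passed)."""
    return sorted(
        tid for tid, runs in history.items()
        if runs and (runs.count(False) / len(runs)) > threshold
    )
```

Feeding this a week of CI history gives both the category counts and the >5% flaky list for the audit report.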
- Measure signal quality:
  - Which tests catch real bugs?
  - Which tests provide clear failure messages?
  - Which tests are too brittle (fail on safe refactors)?
- Deliverable: Testing audit report
  - List of tests to keep/delete/refactor
  - Signal-to-noise ratio analysis
  - Recommended testing philosophy document
Phase 2: Simplify GitHub Actions (Week 1-2)
Goal: Reduce GHA complexity and improve CI speed.
Current state:
- Multiple workflows with overlapping responsibilities
- Tests run multiple times, wasting compute
- Hard to understand what failed and why
Proposed changes:
- Consolidate workflows:
  - Single PR workflow for all quality checks
  - Separate release workflow (keep existing)
  - Remove redundant/duplicate checks
- Optimize test execution:
  - Run E2E tests first (fast, high signal)
  - Run unit tests in parallel by module
  - Skip slow tests for draft PRs
  - Cache dependencies aggressively
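The "skip slow tests for draft PRs" step could live in a `conftest.py` collection hook. A minimal sketch, assuming tests carry a `slow` marker and the CI job exports `DRAFT_PR=true` for draft pull requests (both the marker name and the variable are assumptions, not existing AgentReady conventions):

```python
import os

def pytest_collection_modifyitems(config, items):
    """Drop `slow`-marked tests when running against a draft PR."""
    if os.environ.get("DRAFT_PR") != "true":
        return
    # In-place edit so pytest sees the reduced selection; a fuller
    # version would also call config.hook.pytest_deselected(items=dropped)
    # so the run summary reports the deselected count.
    items[:] = [item for item in items if "slow" not in item.keywords]
```

The workflow would set `DRAFT_PR` from `github.event.pull_request.draft`, so the full suite still runs once the PR is marked ready for review.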
- Improve failure reporting:
  - Clear job names that explain what they test
  - Fail-fast for E2E failures
  - Annotate PRs with specific failure context
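PR annotation can use GitHub Actions workflow commands: printing a `::error file=…,line=…::message` line from any job step surfaces the message inline on the diff. A small formatter, sketched here as a hypothetical helper:

```python
def annotation(level, message, file=None, line=None):
    """Format a GitHub Actions workflow command that annotates the PR.
    `level` is one of "notice", "warning", "error"; `file`/`line`
    attach the annotation to a specific location in the diff."""
    props = ",".join(
        f"{k}={v}" for k, v in (("file", file), ("line", line)) if v is not None
    )
    if props:
        return f"::{level} {props}::{message}"
    return f"::{level}::{message}"
```

A test-report step could map each pytest failure to one such line, so the failure context lands on the PR instead of being buried in the job log.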
- Deliverable: Simplified GHA configuration
  - Single .github/workflows/pr.yml for all checks
  - Clear job structure with descriptive names
  - <5 minute CI time for typical PRs
Phase 3: Refactor Test Suite (Week 2-3)
Goal: High-signal tests that catch real issues quickly.
Testing pyramid target:
E2E Tests (5-10 tests) ← Critical user journeys only
├─ Happy path: assess current repo
├─ Error handling: invalid config
├─ Security: sensitive directory blocking
└─ Performance: large repo (<5min timeout)
Integration Tests (20-30 tests) ← Module boundaries
├─ Scanner + Assessors
├─ Reporter + Templates
└─ CLI + Services
Unit Tests (100-150 tests) ← Core logic only
├─ Assessment scoring algorithm
├─ Pattern extraction
├─ Research report validation
└─ Edge cases and error handling
Principles:
- Each test has a clear purpose:
  - What does it test? (one thing)
  - What could break? (specific failure mode)
  - How do you fix it? (actionable error message)
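Applying that checklist to the "assessment scoring algorithm" row of the pyramid might look like the sketch below. `score_assessment` is a hypothetical stand-in, not AgentReady's real API; the point is one behavior per test, a named failure mode, and an assertion message that tells the reader how to fix it.

```python
def score_assessment(checks):
    """Weighted pass rate in [0, 100].
    `checks` maps check name -> (passed: bool, weight: float)."""
    total = sum(weight for _, weight in checks.values())
    if total == 0:
        return 0.0
    earned = sum(weight for passed, weight in checks.values() if passed)
    return round(100 * earned / total, 1)

def test_score_is_zero_when_no_checks_defined():
    # Failure mode: division by zero on repos with no applicable checks.
    assert score_assessment({}) == 0.0, (
        "Empty check set must score 0.0, not raise; guard the zero-weight case"
    )

def test_score_weights_failures_proportionally():
    # Failure mode: unweighted averaging that over-penalizes minor checks.
    checks = {"readme": (True, 3), "tests": (False, 1)}
    assert score_assessment(checks) == 75.0
```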
- Avoid testing implementation details:
  - Test behavior, not internal structure
  - Refactors shouldn't break tests
  - Mock only external dependencies
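A behavior-focused test asserts on the observable result and mocks only the external boundary. In this illustrative sketch (`build_report` and `fetch_repo_tree` are hypothetical, not AgentReady code), the internal structure of `build_report` can be refactored freely without breaking the test:

```python
from unittest import mock

def build_report(fetch_repo_tree, repo):
    """Summarize a repo from its file tree; `fetch_repo_tree` is the
    external (network) dependency, injected so tests can fake it."""
    files = fetch_repo_tree(repo)
    return {
        "repo": repo,
        "file_count": len(files),
        "has_readme": "README.md" in files,
    }

def test_report_counts_files_without_touching_the_network():
    # Mock only the external call; assert on the returned report,
    # not on how build_report arrived at it.
    fake_fetch = mock.Mock(return_value=["README.md", "src/main.py"])
    report = build_report(fake_fetch, "org/agentready")
    assert report == {"repo": "org/agentready", "file_count": 2, "has_readme": True}
```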
- Fast feedback:
  - E2E tests: <10s each (total <2min)
  - Integration tests: <1s each
  - Unit tests: <100ms each
  - Full suite: <5min
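The per-test budgets above can be enforced in CI by checking pytest's recorded durations (e.g. from `--durations` output or a JSON report) against per-category limits. A sketch of the core check, with the data shape assumed for illustration:

```python
# Per-test budgets in seconds, matching the targets above.
BUDGETS = {"e2e": 10.0, "integration": 1.0, "unit": 0.1}

def over_budget(durations, budgets=BUDGETS):
    """Return test ids that exceed their category's per-test budget.
    `durations` maps test id -> (category, seconds); unknown
    categories are not budgeted."""
    return sorted(
        tid for tid, (cat, secs) in durations.items()
        if secs > budgets.get(cat, float("inf"))
    )
```

A CI step could fail (or annotate the PR) when this returns a non-empty list, keeping the suite inside the <5min total.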
- Deliverable: Refactored test suite
  - Delete 50%+ of existing tests (low signal)
  - Rewrite 30% with clearer assertions
  - Keep 20% as-is (already good)
  - Target 70% coverage of critical paths
Phase 4: Documentation & Process (Week 3-4)
Goal: Prevent test suite from degrading again.
- Testing guidelines (TESTING.md):
  - When to write unit vs integration vs e2e tests
  - How to write high-signal tests
  - Common anti-patterns to avoid
- PR checklist template:
  - New feature = new test (which category?)
  - Bug fix = regression test first
  - Refactor = tests stay green
- Test review process:
  - Code reviews include test quality check
  - PRs with low-signal tests get feedback
  - Flaky test reports trigger investigation
- Deliverable: Testing culture documentation
  - TESTING.md guide
  - Updated CONTRIBUTING.md with test requirements
  - PR template with test checklist
Success Metrics
| Metric | Current | Target | Measure |
|---|---|---|---|
| CI time | ~15min | <5min | GitHub Actions duration |
| Test count | ~800 tests | 150-200 tests | pytest count |
| Coverage | ~37% | 70% (critical paths) | pytest-cov |
| Flakiness | Unknown | <1% failure rate | Track over 100 runs |
| Signal quality | Low | High | Failure investigation time |
Definition of "high signal":
- When test fails, developer knows what broke immediately
- Fix time: <30 minutes from failure to root cause identified
- False positive rate: <1% (tests fail only when code is broken)
Out of Scope (Not Changing)
- E2E test framework (pytest is fine)
- Assertion library (assert statements are fine)
- Test discovery mechanism (pytest auto-discovery works)
Related Issues
- Replaces fix: resolve 77 test failures across multiple modules #179 (test failures fix - too broad)
- Replaces [P0] Improve Test Coverage to Meet 90% Threshold #103 (coverage target - wrong focus)
- Incorporates Test Reliability: Configurable timeouts and sensitive directory E2E test #192 (test reliability - now completed)
- Blocks feat: Add Codecov integration to release workflow #156 (Codecov integration - wait for coverage improvements)
Acceptance Criteria
- Testing audit report completed
- GHA workflows consolidated to single PR workflow
- CI time reduced to <5 minutes for typical PRs
- Test count reduced to 150-200 high-signal tests
- Coverage reaches 70% of critical code paths
- Flakiness rate <1% (measured over 100 CI runs)
- TESTING.md guide created and reviewed
- PR template updated with test checklist
Priority: P0
Why P0: Testing is infrastructure. Bad tests slow down all development.
Timeline: 3-4 weeks (can be done incrementally in PRs)
Assignee: TBD (could be broken into multiple assignees for phases)
🤖 Generated with Claude Code