
Conversation

@chambridge
Collaborator

Description

Add 50 new unit tests covering CLI command functionality:

  • tests/unit/test_cli_benchmark.py: 27 tests for benchmark command (80% coverage)
  • tests/unit/test_cli_harbor.py: 23 tests for harbor command (96% coverage)

Key improvements:

  • Project coverage: 64.5% → 67.0% (+2.5pp)
  • cli/harbor.py: 20% → 96% (+76pp)
  • cli/benchmark.py: 20% → 80% (+60pp)
  • All 50 tests pass with comprehensive module documentation
  • Uses actual data models for type safety
  • Covers success paths, error handling, and edge cases

Also updates .gitignore to exclude coverage.json and .claude/settings.local.json

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Related Issues

Fixes #
Relates to #

Changes Made

  • tests/unit/test_cli_benchmark.py: 27 tests for benchmark command (80% coverage)
  • tests/unit/test_cli_harbor.py: 23 tests for harbor command (96% coverage)

Testing

  • Unit tests pass (pytest)
  • Integration tests pass
  • Manual testing performed
  • No new warnings or errors

Checklist

  • My code follows the project's code style
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

chambridge and others added 2 commits January 16, 2026 14:42
…ommands

Add 50 new unit tests covering CLI command functionality:
- tests/unit/test_cli_benchmark.py: 27 tests for benchmark command (80% coverage)
- tests/unit/test_cli_harbor.py: 23 tests for harbor command (96% coverage)

Key improvements:
- Project coverage: 64.5% → 67.0% (+2.5pp)
- cli/harbor.py: 20% → 96% (+76pp)
- cli/benchmark.py: 20% → 80% (+60pp)
- All 50 tests pass with comprehensive module documentation
- Uses actual data models for type safety
- Covers success paths, error handling, and edge cases

Also updates .gitignore to exclude coverage.json and .claude/settings.local.json

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Chris Hambridge <chambrid@redhat.com>
Format test files to comply with black code style requirements.

Signed-off-by: Jeremy Eder <jeder@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@github-actions
Contributor

AgentReady Code Review: PR #266

Overview

This PR adds comprehensive test coverage for CLI benchmark and harbor commands with 50 new unit tests. The implementation demonstrates strong testing practices with 80% coverage for cli/benchmark.py and 96% coverage for cli/harbor.py.

AgentReady Attribute Assessment

✅ Strengths (Passing Attributes)

1. Test Coverage (Tier 1) - EXCELLENT

  • Impact: +2.5pp project coverage (64.5% → 67.0%)
  • Module Coverage:
    • cli/harbor.py: 20% → 96% (+76pp)
    • cli/benchmark.py: 20% → 80% (+60pp)
  • Test Quality: Uses actual data models (not mock dicts), covers success/failure paths, edge cases

2. Code Documentation (Tier 2) - PASS

  • Comprehensive module docstrings explain test strategy, coverage targets, fixtures
  • Each test class and method has clear, descriptive docstrings
  • Fixtures document their purpose and data structure

3. Code Structure (Tier 2) - PASS

  • Well-organized into logical test classes (TestBenchmarkCommand, TestRunTbench, etc.)
  • Proper separation of CLI command tests vs internal helper function tests
  • Clean fixture organization with reusable components

4. Type Safety (Tier 3) - PASS

  • Uses actual data models (HarborComparison, HarborRunMetrics, HarborTaskResult)
  • Avoids brittle mock dictionaries in favor of proper type instantiation
  • Example: Lines 81-122 in test_cli_benchmark.py use HarborRunMetrics constructor
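
As a generic illustration of that pattern (SampleMetrics below is a hypothetical stand-in, since HarborRunMetrics' actual fields are not shown in this review):

from dataclasses import dataclass

@dataclass
class SampleMetrics:
    run_id: str
    tasks_passed: int
    tasks_failed: int

def summarize(metrics: SampleMetrics) -> str:
    return f"{metrics.run_id}: {metrics.tasks_passed} passed, {metrics.tasks_failed} failed"

def test_summarize_uses_real_model():
    # Constructing the dataclass (instead of a loose dict) fails fast when a field
    # is renamed or removed, which is the type-safety benefit noted above.
    metrics = SampleMetrics(run_id="r1", tasks_passed=3, tasks_failed=1)
    assert summarize(metrics) == "r1: 3 passed, 1 failed"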

5. Error Handling (Tier 2) - PASS

  • Tests exception scenarios: missing API keys, invalid inputs, parse failures
  • Validates proper error messages and exit codes
  • Examples: test_run_tbench_missing_api_key, test_compare_parse_results_failure

⚠️ Areas for Improvement

1. Security: API Key Handling (Tier 1) - MINOR ISSUE

Location: test_cli_benchmark.py:307-333, test_cli_benchmark.py:517-538

Issue: Tests modify os.environ without proper isolation, creating a risk of cross-test pollution

# Current approach (risky)
old_key = os.environ.get("ANTHROPIC_API_KEY")
if old_key:
    del os.environ["ANTHROPIC_API_KEY"]
try:
    # test code
finally:
    if old_key:
        os.environ["ANTHROPIC_API_KEY"] = old_key

Recommendation: Use unittest.mock's @patch.dict to override os.environ for the duration of the test, or pytest's monkeypatch fixture:

# Better approach
from unittest.mock import patch

@patch.dict("os.environ", {"ANTHROPIC_API_KEY": ""}, clear=False)
def test_run_tbench_missing_api_key(self, tmp_path):
    ...  # test body; no manual cleanup needed

Impact: Low severity but important for test reliability in parallel execution
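
A minimal sketch of the monkeypatch alternative mentioned above, with the command invocation itself elided:

def test_run_tbench_missing_api_key(monkeypatch, tmp_path):
    # monkeypatch restores the environment automatically after the test
    monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
    # invoke the benchmark command here and assert the missing-key error is reported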

2. Test Determinism (Tier 3) - GOOD PRACTICE

Strength: Fixed timestamps for test comparisons (e.g., created_at="2024-01-01T12:00:00")

Recommendation: Consider freezing time for all tests that depend on timestamps:

@patch("agentready.cli.harbor.datetime")
def test_compare_basic_execution(self, mock_datetime, ...):
    mock_datetime.now.return_value = datetime(2024, 1, 1, 12, 0, 0)
    mock_datetime.now.return_value.strftime.return_value = "20240101_120000"
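
Another option, if the project adds the freezegun package (an assumption; it is not used in this PR), is to freeze time for the whole test:

from datetime import datetime
from freezegun import freeze_time

@freeze_time("2024-01-01 12:00:00")
def test_timestamp_formatting_is_deterministic():
    # Every datetime.now() call during the test returns the frozen instant,
    # so the run-directory suffix is always "20240101_120000".
    assert datetime.now().strftime("%Y%m%d_%H%M%S") == "20240101_120000"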

3. Hardcoded Magic Numbers (Tier 4) - MINOR

Location: Multiple test methods checking args[3], args[4], etc.

Example: test_cli_benchmark.py:166

args, kwargs = mock_run.call_args
assert args[3] is True  # verbose parameter

Recommendation: Unpack the positional arguments into named variables, or use constants:

# Better
repo_path, subset, model, verbose, timeout, output_dir, skip_preflight = mock_run.call_args[0]
assert verbose is True

Impact: Improves maintainability if function signature changes
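
A related option is to assert the whole call at once with unittest.mock, which removes index-picking entirely; the argument values below are placeholders, not the project's real parameters:

from unittest.mock import MagicMock

mock_run = MagicMock()
mock_run("/repo", "subset-a", "model-x", True)  # stand-in for what the CLI does internally

# One assertion covers every positional argument, so a signature change surfaces clearly.
mock_run.assert_called_once_with("/repo", "subset-a", "model-x", True)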

4. gitignore Updates (Tier 2) - PASS

Changes: Added coverage.json and .claude/settings.local.json

Validation: ✅ Both are appropriate exclusions:

  • coverage.json: Generated artifact
  • .claude/settings.local.json: User-specific configuration

🔍 Code Quality Assessment

Test Architecture

  • Pattern: Click CliRunner with isolated filesystem ✅ (see the sketch after this list)
  • Mocking Strategy: External dependencies mocked, internal logic tested ✅
  • Fixtures: Reusable, well-documented ✅
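
A self-contained illustration of that CliRunner + isolated_filesystem pattern; the greet command below is a toy example rather than one of the project's real commands:

import click
from click.testing import CliRunner

@click.command()
@click.argument("name")
def greet(name):
    """Write a greeting file in the current working directory."""
    with open("greeting.txt", "w") as fh:
        fh.write(f"hello {name}\n")
    click.echo(f"wrote greeting for {name}")

def test_greet_writes_file_in_isolated_fs():
    runner = CliRunner()
    with runner.isolated_filesystem():  # temporary cwd, cleaned up afterwards
        result = runner.invoke(greet, ["world"])
        assert result.exit_code == 0
        assert "wrote greeting" in result.output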

Edge Case Coverage

  • ✅ Missing files/directories
  • ✅ Invalid inputs
  • ✅ Missing environment variables
  • ✅ Exception handling
  • ✅ Symlink creation failures (Windows compatibility)

Best Practices

  • ✅ Uses @patch for external dependencies
  • ✅ Tests both success and failure paths
  • ✅ Validates output messages and exit codes
  • ✅ Proper cleanup in fixtures (context managers, try/finally)

📊 AgentReady Score Impact

Estimated Score Change: 80.0 → 81.5 (+1.5 points)

Contributing Factors:

  • Test Coverage improvement: +2.5pp project coverage
  • Code Quality: Excellent test documentation and structure
  • Maintainability: Clear test organization, reusable fixtures

Certification: Remains Gold (75-89 range)

🎯 Recommendations

Priority: High

  1. Security: Replace manual os.environ manipulation with @patch.dict for API key tests

Priority: Medium

  2. Maintainability: Use named unpacking instead of positional args[N] indexing
  3. Robustness: Add explicit time freezing for timestamp-dependent tests

Priority: Low

  4. Documentation: Consider adding a "Testing Guide" section to CLAUDE.md referencing these tests as examples

✅ Approval Status

Recommended Action: APPROVE WITH MINOR SUGGESTIONS

Justification:

  • Excellent test coverage increase (+2.5pp project-wide)
  • Strong adherence to testing best practices
  • Proper use of data models over mocks
  • Minor suggestions do not block merge (can be addressed in follow-up)

Security: ✅ No security vulnerabilities introduced
Breaking Changes: ✅ None - purely additive tests
Documentation: ✅ Well-documented with clear docstrings


Review Completed: 2026-01-16
AgentReady Version: 2.23.0
Reviewer: Claude (AgentReady Review Agent)

🤖 Generated with AgentReady /review-agentready command

@github-actions
Contributor

github-actions bot commented Jan 16, 2026

📈 Test Coverage Report

Branch Coverage

  • This PR: 65.2%
  • Main: 62.5%
  • Diff: ✅ +2.7%

Coverage calculated from unit tests only

- Replace manual os.environ manipulation with @patch.dict for thread-safety
- Replace hardcoded positional args indices with named variable unpacking
- Improve test maintainability and reduce coupling to function signatures

Addresses feedback on:
- Security: API key handling (use @patch.dict instead of manual cleanup)
- Maintainability: Replace args[N] with descriptive variable names

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Chris Hambridge <chambrid@redhat.com>
@github-actions
Contributor

AgentReady Code Review - PR #266

Overall Assessment

Status: ✅ APPROVED with minor suggestions

This PR adds excellent test coverage for CLI commands with well-structured, comprehensive unit tests. The test strategy is sound, coverage improvements are significant, and the code follows best practices.


📊 Coverage Impact

  • Project Coverage: 64.5% → 67.0% (+2.5pp)
  • cli/harbor.py: 20% → 96% (+76pp) ✨
  • cli/benchmark.py: 20% → 80% (+60pp) ✨
  • Total Tests Added: 50 new tests (27 benchmark + 23 harbor)

AgentReady Attribute Compliance

✅ Passing Attributes

  1. test_coverage (Tier 1) - Excellent improvement from 64.5% to 67.0%
  2. test_framework (Tier 1) - Proper use of pytest with fixtures
  3. code_style (Tier 2) - Consistent naming, clear docstrings
  4. type_annotations (Tier 2) - Good use of type hints in test fixtures
  5. documentation (Tier 1) - Comprehensive module docstrings explaining test strategy
  6. test_organization (Tier 2) - Well-organized test classes by functionality

⚠️ Suggestions for Improvement

  1. Integration Tests (Tier 2)

    • Tests are all unit tests with mocked dependencies
    • Consider adding at least one integration test that calls the real CLI without mocks (see the sketch after this list)
    • Impact: Would validate end-to-end behavior and catch integration issues
  2. Edge Case Coverage (Tier 3)

    • Missing some edge cases: Windows path handling, large task lists, concurrent execution failures
    • Impact: Could miss platform-specific or load-related bugs
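
Referring to suggestion 1 above, one possible no-mock integration test could look like the sketch below; the import path and command object name are assumptions inferred from the modules named in this PR, not confirmed API:

from click.testing import CliRunner
from agentready.cli.benchmark import benchmark  # assumed location/name of the Click command

def test_benchmark_help_runs_without_mocks():
    # Exercises real argument parsing end-to-end; --help is safe and needs no API key.
    result = CliRunner().invoke(benchmark, ["--help"])
    assert result.exit_code == 0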

🔒 Security Analysis

✅ No Security Issues Found

  1. .gitignore additions are appropriate (coverage.json, .claude/settings.local.json)
  2. Test isolation: Uses CliRunner.isolated_filesystem() correctly
  3. No credential exposure: Mock API keys used (test-key)
  4. No command injection: All paths use Path objects

💻 Code Quality

Strengths

  1. Excellent Documentation - Comprehensive module docstrings explaining test strategy
  2. Proper Test Isolation - Uses mocks, isolated filesystems, context managers
  3. Type Safety - Uses actual data models (HarborComparison, HarborRunMetrics)
  4. Comprehensive Coverage - Tests success paths, error handling, edge cases

Minor Issues (All Low Priority)

  1. Inconsistent Mock Setup - test_cli_benchmark.py:62-70 could use spec parameter
  2. Duplicate Test Pattern - API key validation tests could be parametrized (see the sketch after this list)
  3. Magic Numbers - Could use named constants like MOCK_TASK_DURATION_SEC
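
Referring to item 2 above, one way the duplicated API-key checks could be parametrized (the environment variable name comes from this PR; the rest is illustrative):

import pytest

@pytest.mark.parametrize("key_value", [None, ""])  # absent key vs. empty key
def test_missing_api_key_variants(key_value, monkeypatch):
    monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
    if key_value is not None:
        monkeypatch.setenv("ANTHROPIC_API_KEY", key_value)
    # invoke the benchmark command here and assert the missing-key error path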

🧪 Test Quality Assessment

Component Coverage Assessment

  • Success Paths: ✅ Excellent (all major commands tested)
  • Error Handling: ✅ Excellent (API key, invalid inputs, exceptions)
  • Edge Cases: ⚠️ Good (missing some platform-specific cases)
  • Integration: ⚠️ Minimal (only unit tests with mocks)

🎯 AgentReady Attribute Scoring Impact

Estimated Overall Score Impact: +1.5 to +2.0 points (on 100-point scale)

  • test_coverage: ~60/100 → ~70/100 ⬆️
  • code_style: 85/100 → 90/100 ⬆️

✅ Approval Checklist

  • No security vulnerabilities introduced
  • Code follows project style guidelines
  • Tests are comprehensive and well-structured
  • Documentation is clear and complete
  • No breaking changes
  • .gitignore changes are appropriate
  • All tests pass (per PR description)
  • Coverage improved significantly

🎉 Summary

This PR represents high-quality test engineering work. The minor suggestions above are truly optional - the code is production-ready as-is.

Recommendation: Merge after CI passes.

Key Strengths:

  1. 🎯 Achieves stated coverage goals (96% harbor.py, 80% benchmark.py)
  2. 📚 Excellent documentation of test strategy
  3. 🔒 Proper security practices
  4. 🏗️ Well-structured test organization

Optional improvements for future PRs:

  • Consider adding 1-2 integration tests
  • Extract some duplicate test patterns
  • Add more edge case coverage for platform-specific behavior

Reviewed by: AgentReady Code Review Agent
Review Date: 2026-01-16
AgentReady Version: 2.23.0

Great work! 🚀

@chambridge chambridge merged commit c860276 into ambient-code:main Jan 16, 2026
10 checks passed
@chambridge chambridge deleted the test/cli-benchmark-harbor-coverage branch January 16, 2026 19:57