Implement agent maturity level system #152

Merged: frankbria merged 3 commits into main from claude/agent-maturity-system-C8WZ0 on Dec 27, 2025
Conversation

@frankbria
Owner

@frankbria frankbria commented Dec 26, 2025

Add comprehensive maturity assessment for worker agents based on task performance history. The system calculates a weighted score from completion rate (40%), test pass rate (30%), and self-correction rate (30%), mapping to four maturity levels (D1-D4).

Key changes:

  • Add assess_maturity() method to WorkerAgent with metrics calculation
  • Add get_tasks_by_agent() query to TaskRepository
  • Update AgentCard.tsx with maturity badges and performance metrics
  • Add AgentMetricsResponse model for API responses
  • Include metrics JSON in agent API endpoint responses
  • Add 11 unit tests and 8 integration tests for maturity system

Maturity levels:

  • D1 (directive/novice): score < 0.5
  • D2 (coaching/intermediate): 0.5 <= score < 0.7
  • D3 (supporting/advanced): 0.7 <= score < 0.9
  • D4 (delegating/expert): score >= 0.9
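
The scoring rule and thresholds listed above can be sketched roughly as follows. This is an illustration only, not the code in worker_agent.py: the names `maturity_score` and `maturity_level` are hypothetical, and the rounding step is an assumption added so that boundary scores such as 0.9 behave predictably under floating-point arithmetic.

```python
# Hypothetical sketch of the weighted score and the D1-D4 mapping.
# The weights come from the PR description; names and rounding are assumptions.
WEIGHTS = {"completion": 0.4, "test_pass": 0.3, "self_correction": 0.3}


def maturity_score(completion: float, test_pass: float, self_correction: float) -> float:
    """Combine three rates (each expected in [0, 1]) into one weighted score."""
    raw = (
        WEIGHTS["completion"] * completion
        + WEIGHTS["test_pass"] * test_pass
        + WEIGHTS["self_correction"] * self_correction
    )
    # Rounding (an assumption) avoids 0.8999999... style boundary surprises.
    return round(raw, 6)


def maturity_level(score: float) -> str:
    """Map a score to D1-D4 using the thresholds listed above."""
    if score >= 0.9:
        return "D4"
    if score >= 0.7:
        return "D3"
    if score >= 0.5:
        return "D2"
    return "D1"


if __name__ == "__main__":
    score = maturity_score(0.9, 0.8, 1.0)
    print(score, maturity_level(score))
```

An agent with no task history would pass zeros for all three rates and land in D1, which matches the baseline case the unit tests describe.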

This closes issue #112.

Summary by CodeRabbit

  • New Features
    • Agent maturity assessment with four maturity levels, calculated metrics, automated assessment trigger, and UI maturity badges plus a Performance Metrics panel.
  • Bug Fixes
    • Improved validation, error handling, persistence, and audit logging around maturity assessments and edge cases.
  • Tests
    • Extensive unit and integration tests for maturity progression, persistence, auditing, triggers, and sync/async task queries.
  • Chores
    • Added task-by-agent query paths and a DB index to optimize agent task lookups; surfaced metrics in API responses.


@coderabbitai
Contributor

coderabbitai Bot commented Dec 26, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Implements agent maturity assessment: computes weighted maturity score (completion, test pass, self-correction), maps to D1–D4, persists metrics to DB, emits audit logs, adds assessment trigger logic, repository/query support, API/UI model updates, and unit + integration tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Agent Maturity Logic: codeframe/agents/worker_agent.py | Implements assess_maturity() (now returns {maturity_level, maturity_score, metrics, changed}), adds should_assess_maturity(), and helpers _get_current_maturity, _update_agent_maturity, _log_maturity_assessment; enforces DB initialization and updates state/audit logs. |
| Persistence Delegation & Index: codeframe/persistence/database.py, codeframe/persistence/schema_manager.py | Adds get_tasks_by_agent() and get_tasks_by_agent_async() delegations on Database; creates DB index idx_tasks_assigned_to to optimize assigned_to/project/created_at queries. |
| Task & Agent Repositories: codeframe/persistence/repositories/task_repository.py, codeframe/persistence/repositories/agent_repository.py | Adds get_tasks_by_agent() (sync + async) to TaskRepository with optional project_id filtering and ordering; agent_repository SELECT now includes metrics column. |
| API Models & Router: codeframe/ui/models.py, codeframe/ui/routers/agents.py | Adds AgentMetricsResponse model and optional metrics field to AgentAssignmentResponse; router parses JSON metrics safely and falls back to None on invalid data. |
| UI Display: web-ui/src/components/AgentCard.tsx | Introduces AgentMetrics type and optional maturityLevel, renders maturity badge and Performance Metrics section, and updates memoization to include metrics/maturity. |
| Tests: tests/agents/test_worker_agent.py, tests/integration/test_agent_maturity.py | Adds unit and integration tests covering D1–D4 progression/demotion, persistence, audit logs, should_assess_maturity trigger logic, and sync/async task query behavior. |

Sequence Diagram

sequenceDiagram
    actor User
    participant UI as Web UI
    participant API as Agents Router
    participant Agent as WorkerAgent
    participant DB as Database
    participant TaskRepo as TaskRepository
    participant Audit as AuditLog

    User->>UI: View agents list / open agent
    UI->>API: GET /projects/{id}/agents
    API->>DB: get_agents_for_project()
    DB-->>API: agents (raw metrics)

    alt Assessment trigger (should_assess_maturity == true)
        API->>Agent: trigger assess_maturity(agent_id)
        Agent->>DB: get_tasks_by_agent(agent_id)
        DB->>TaskRepo: execute query (assigned_to[, project_id])
        TaskRepo-->>DB: return task rows
        DB-->>Agent: task history
        rect rgba(220,235,250,0.9)
            Note over Agent: compute metrics\ncompletion(40%), test_pass(30%), self_correct(30%)
        end
        Agent->>DB: _update_agent_maturity(metrics, score, level)
        Agent->>Audit: _log_maturity_assessment(event)
        Audit-->>Agent: ack
        Agent-->>API: assessment result (level, score, changed)
    end

    API->>UI: AgentAssignmentResponse (with parsed metrics)
    UI->>User: Render maturity badge & metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hop through logs and metric trees,

D1 to D4 on whispering breeze.
Completion, tests, and fixes combine,
Into a score that gently shines.
Audit seeds in soil of code, my carrot-coded sign.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main objective of the PR: implementing a maturity level system for agents. It is specific enough to understand the primary change from the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

@frankbria frankbria linked an issue Dec 26, 2025 that may be closed by this pull request
6 tasks
@claude

claude Bot commented Dec 26, 2025

Code Review: Agent Maturity Level System

Summary

This PR implements a comprehensive maturity assessment system for worker agents based on task performance. The implementation is solid with excellent test coverage (19 tests total). Below are my findings.


✅ Strengths

1. Excellent Test Coverage

  • 11 unit tests covering all maturity levels (D1-D4) and edge cases
  • 8 integration tests covering maturity progression, demotion, persistence, API integration, and audit logging
  • Tests are well-structured with clear test names and comprehensive scenarios

2. Well-Documented Code

  • Comprehensive docstrings with clear explanations of the scoring formula
  • Inline comments explaining each step of the assessment process
  • Good examples in docstrings (worker_agent.py:686-691)

3. Clean Architecture

  • Separation of concerns with private helper methods (_get_current_maturity, _update_agent_maturity, _log_maturity_assessment)
  • Proper repository pattern usage with get_tasks_by_agent() query
  • Good abstraction in the API layer

4. Robust Error Handling

  • Graceful handling of missing metrics JSON (agents.py:208-209)
  • Try-catch blocks for database operations and audit logging
  • Validation that database is initialized (worker_agent.py:695-696)

5. UI Integration

  • Clean React component with Nova design system compliance
  • Performance optimizations with React.memo and custom comparison function
  • Responsive metrics display with proper formatting

@claude

claude Bot commented Dec 26, 2025

🔍 Issues & Recommendations

HIGH PRIORITY: Inefficient Database Queries (N+1 Problem)

Location: worker_agent.py:737-757

Issue:
You are making N database queries inside a loop (one for each completed task):

for task in completed_tasks:
    test_results = self.db.get_test_results_by_task(task.id)  # N queries
    # ...
    correction_attempts = self.db.get_correction_attempts_by_task(task.id)  # N queries

For an agent with 100 completed tasks, this results in 200+ database queries.

Recommendation:
Add bulk query methods to repositories:

# In test_repository.py
def get_test_results_by_tasks(self, task_ids: List[int]) -> Dict[int, List[dict]]:
    """Get test results for multiple tasks in one query."""
    # Single query with WHERE task_id IN (...)

# In correction_repository.py
def get_correction_attempts_by_tasks(self, task_ids: List[int]) -> Dict[int, List[dict]]:
    """Get correction attempts for multiple tasks in one query."""

This reduces queries from O(N) to O(1) and significantly improves performance for agents with large task histories.

Priority: High (performance)


MEDIUM: Self-Correction Rate Logic May Be Too Lenient

Location: worker_agent.py:757-762

Issue:
The current logic counts a task as "first-attempt success" if:

  1. No correction attempts exist, OR
  2. Only one correction attempt with attempt_number=1

This seems inconsistent. If a correction attempt record exists with attempt_number=1, it suggests there WAS an error that needed correction.

Recommendation:
Clarify the business logic - either remove the "only one attempt" clause or add a comment explaining why a single correction attempt counts as first-try success.

Priority: Medium (business logic clarity)
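
For comparison, here is a sketch of the stricter interpretation, in which any recorded correction attempt means the task did not succeed on the first try. The function name and the input shape (task ID mapped to a list of attempt dicts) are assumptions, not the existing implementation.

```python
from typing import Dict, List


def first_attempt_success_rate(attempts_by_task: Dict[int, List[dict]]) -> float:
    """Share of tasks that needed no correction attempt at all.

    Stricter than the logic described above: any recorded correction attempt,
    including a single one with attempt_number == 1, counts as "needed a fix".
    """
    if not attempts_by_task:
        return 0.0
    clean = sum(1 for attempts in attempts_by_task.values() if not attempts)
    return clean / len(attempts_by_task)
```

Under this reading, a task with one attempt record lowers the rate, which is the behavioral difference the business-logic question above turns on.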

@claude

claude Bot commented Dec 26, 2025

📝 Minor Issues & Suggestions

LOW: Potential Division by Zero (Defensive Programming)

Location: worker_agent.py:747

Line 745 already checks if total_tests > 0:, but consider making the guard more explicit:

if total_tests > 0:
    task_pass_rate = passed / total_tests
    test_pass_rates.append(task_pass_rate)

LOW: Missing Schema Validation in API Response

Location: agents.py:198-211

The JSON parsing is wrapped in try-catch, but there is no validation that the parsed metrics match the expected schema. Consider adding validation for expected keys.
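
One possible shape for that validation, using only the standard library. The key names in EXPECTED_KEYS and the rule that every key must be present with a value in [0, 1] are assumptions for this sketch; the real AgentMetricsResponse may use different fields or allow them to be missing.

```python
import json
from typing import Optional

# Assumed metric keys; the real AgentMetricsResponse fields may differ.
EXPECTED_KEYS = ("completion_rate", "test_pass_rate", "self_correction_rate")


def parse_metrics(raw: Optional[str]) -> Optional[dict]:
    """Return validated metrics, or None when the payload is missing or malformed."""
    if not raw:
        return None
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError subclass.
        return None
    if not isinstance(data, dict):
        return None
    metrics = {}
    for key in EXPECTED_KEYS:
        value = data.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 0.0 <= value <= 1.0:
            return None
        metrics[key] = float(value)
    return metrics
```

Returning None on any failure keeps the router's existing fallback behavior while rejecting payloads that parse as JSON but do not look like metrics.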


Suggestion: Add Maturity Score Thresholds as Constants

# worker_agent.py - top of file
MATURITY_THRESHOLDS = {
    AgentMaturity.D4: 0.9,
    AgentMaturity.D3: 0.7,
    AgentMaturity.D2: 0.5,
    AgentMaturity.D1: 0.0,
}

Benefit: Easier to adjust thresholds later, clearer intent
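
A sketch of how a lookup could consume such a constant. The AgentMaturity enum below is a stand-in with assumed member values, and `level_for_score` is a hypothetical helper; the approach relies on the dict being declared from highest to lowest threshold.

```python
from enum import Enum


class AgentMaturity(Enum):
    """Stand-in for the project's enum; the member values here are assumptions."""

    D1 = "directive"
    D2 = "coaching"
    D3 = "supporting"
    D4 = "delegating"


# Ordered from highest to lowest threshold so the first match wins.
MATURITY_THRESHOLDS = {
    AgentMaturity.D4: 0.9,
    AgentMaturity.D3: 0.7,
    AgentMaturity.D2: 0.5,
    AgentMaturity.D1: 0.0,
}


def level_for_score(score: float) -> AgentMaturity:
    """Return the first level whose threshold the score meets."""
    for level, threshold in MATURITY_THRESHOLDS.items():
        if score >= threshold:
            return level
    return AgentMaturity.D1  # Fallback for negative scores.
```

Because Python dicts preserve insertion order, changing a threshold is a one-line edit as long as the descending order is kept.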


Future Enhancement: Consider Weighted Task Age

Currently, all tasks contribute equally to the score regardless of age. Consider giving more weight to recent tasks in a future PR:

import math
from datetime import datetime, timezone

def _calculate_task_weight(task_created_at: datetime) -> float:
    days_ago = (datetime.now(timezone.utc) - task_created_at).days
    return math.exp(-days_ago / 30)  # Decay constant of 30 days (half-life ≈ 21 days)
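
To illustrate how recency weights might feed into a metric, here is a sketch of a recency-weighted completion rate with an explicit half-life. Both function names, the (created_at, completed) tuple shape, and the 30-day default are assumptions; `now` is passed in so the calculation stays deterministic in tests.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def recency_weight(created_at: datetime, now: datetime, half_life_days: float = 30.0) -> float:
    """Weight that halves every `half_life_days` days; future dates are clamped to 1."""
    age_days = max((now - created_at).total_seconds() / 86400, 0.0)
    return 0.5 ** (age_days / half_life_days)


def weighted_completion_rate(tasks: List[Tuple[datetime, bool]], now: datetime) -> float:
    """Completion rate where each (created_at, completed) task is weighted by recency."""
    total = sum(recency_weight(created, now) for created, _ in tasks)
    if total == 0:
        return 0.0
    done = sum(recency_weight(created, now) for created, completed in tasks if completed)
    return done / total


if __name__ == "__main__":
    now = datetime(2025, 12, 26, tzinfo=timezone.utc)
    history = [(now, True), (now - timedelta(days=30), False)]
    print(weighted_completion_rate(history, now))
```

With this weighting, a recent success counts twice as much as a failure from one half-life ago, so the rate leans toward current behavior.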

🔒 Security Considerations

No Major Security Issues Found

  • Audit logging is properly implemented for maturity changes
  • No SQL injection risks (using parameterized queries)
  • No XSS risks in frontend (React escapes by default)
  • API endpoint has proper authorization checks (user_has_project_access)

🎯 Performance Considerations

  1. Database Query Optimization Needed (see N+1 issue above)
  2. Memory Consideration: For agents with 1000+ tasks, the get_tasks_by_agent() default limit of 100 may need tuning. Consider documenting the impact of the limit on maturity score accuracy.

@claude

claude Bot commented Dec 26, 2025

✅ Conclusion

Overall Assessment: APPROVE with suggestions

This is a well-implemented feature with:

  • ✅ Comprehensive test coverage (11 unit + 8 integration tests)
  • ✅ Clean, maintainable code following repository pattern
  • ✅ Good documentation with clear docstrings
  • ✅ Proper error handling and audit logging
  • ✅ Nice UI integration with Nova design system
  • ⚠️ One performance optimization recommended (database query batching for N+1 issue)
  • ⚠️ One business logic clarification needed (self-correction rate definition)

The code is production-ready. The performance optimization for database queries would be beneficial but is not a blocking issue unless you expect agents to have hundreds of tasks.

Files Changed:

  • codeframe/agents/worker_agent.py: +317 lines (core implementation)
  • tests/agents/test_worker_agent.py: +509 lines (unit tests)
  • tests/integration/test_agent_maturity.py: +535 lines (integration tests)
  • web-ui/src/components/AgentCard.tsx: +72 lines (UI display)
  • Various repository and API files: +80 lines total

Recommended Action:

  • Merge as-is if agents typically have < 50 tasks
  • Or address the N+1 query optimization first if agents will have 100+ tasks

Great work on this comprehensive feature! The maturity system will provide valuable insights into agent performance. 🎉


Test Coverage Summary:

Unit Tests (11):
✓ No tasks → D1
✓ Low scores → D1  
✓ Moderate scores → D2
✓ High scores → D3
✓ Excellent scores → D4
✓ Database updates
✓ Audit logging
✓ should_assess_maturity() triggers
✓ Error handling

Integration Tests (8):
✓ Progression D1→D2→D3→D4
✓ Demotion on degraded performance
✓ Persistence across instances
✓ API metrics response
✓ Audit log entries
✓ Task filtering
✓ Project-level filtering

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
codeframe/ui/routers/agents.py (1)

183-184: Move import json to module level.

The local import works but placing it at the top of the file follows Python conventions and avoids repeated import lookups in hot paths.

🔎 Proposed fix

Add at the top of the file with other imports:

import json

Then remove the local import on line 183.

codeframe/agents/worker_agent.py (4)

756-769: Clarify self-correction rate semantics.

The current logic counts tasks with no correction attempts OR exactly one attempt at attempt_number=1 as "first attempt success." However, if a task has correction_attempts=[] because attempts weren't tracked (rather than succeeding on first try), this inflates the self-correction rate. Consider verifying that the absence of correction attempts genuinely indicates first-attempt success.


890-910: Audit log failure is silently swallowed.

The method catches all exceptions and logs a warning, which is appropriate for a non-critical audit path. However, consider whether the caller should be informed of audit failures for observability purposes (e.g., returning a success flag or including in the result dict).
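
One way to surface audit failures without making them fatal is to return a flag and include it in the result dict. The sketch below uses hypothetical names (`log_maturity_assessment` as a free function, `write_audit`, `audit_logged`) and a toy `assess` function; it does not reproduce the actual method.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def log_maturity_assessment(
    write_audit: Callable[[Dict[str, Any]], None], event: Dict[str, Any]
) -> bool:
    """Try to write an audit event; report failure instead of hiding it."""
    try:
        write_audit(event)
        return True
    except Exception as exc:  # Audit logging is non-critical, so never re-raise.
        logger.warning("Audit log write failed: %s", exc)
        return False


def assess(write_audit: Callable[[Dict[str, Any]], None]) -> Dict[str, Any]:
    """Toy assessment that surfaces the audit outcome in its result dict."""
    result: Dict[str, Any] = {"maturity_level": "D2", "maturity_score": 0.55}
    result["audit_logged"] = log_maturity_assessment(write_audit, dict(result))
    return result
```

Callers that care about observability can check the flag, while the assessment itself still completes when the audit write fails.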


955-964: Timezone parsing is robust but could use a utility function.

The replace("Z", "+00:00") pattern for ISO timestamp parsing is correct but repeated logic. Consider extracting to a helper function if used elsewhere, or use datetime.fromisoformat with Python 3.11+ which handles "Z" suffix natively.

🔎 Python 3.11+ simplification (per coding guidelines)
         try:
-            last_assessed = datetime.fromisoformat(last_assessed_str.replace("Z", "+00:00"))
+            last_assessed = datetime.fromisoformat(last_assessed_str)
             hours_since_assessment = (
                 datetime.now(timezone.utc) - last_assessed
             ).total_seconds() / 3600

Note: Python 3.11+ fromisoformat() supports the "Z" suffix natively. Since the coding guidelines specify Python 3.11+, this simplification should work.
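
If the helper-function route is preferred, a version-independent sketch could look like this. The names `parse_iso_utc` and `hours_since` are hypothetical, and treating naive timestamps as UTC is an assumption that may not match how the project stores them.

```python
from datetime import datetime, timezone


def parse_iso_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing "Z" on Python 3.7+."""
    if timestamp.endswith("Z"):
        timestamp = timestamp[:-1] + "+00:00"
    parsed = datetime.fromisoformat(timestamp)
    # Treat naive timestamps as UTC (an assumption) so subtraction never fails.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def hours_since(timestamp: str, now: datetime) -> float:
    """Hours elapsed between the timestamp and a timezone-aware `now`."""
    return (now - parse_iso_utc(timestamp)).total_seconds() / 3600
```

Centralizing the parsing also guards against mixing naive and aware datetimes, which would raise a TypeError on subtraction.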


930-930: Consolidate repeated imports at module level or once at method entry.

json is imported in both _update_agent_maturity (line 858) and should_assess_maturity (line 930). Similarly, TaskStatus is imported in both assess_maturity (line 693) and should_assess_maturity (line 970). While this works, consider moving these to the top of the file or to a single location for clarity.

Also applies to: 970-970

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1169cb6 and a356208.

📒 Files selected for processing (9)
  • codeframe/agents/worker_agent.py
  • codeframe/persistence/database.py
  • codeframe/persistence/repositories/agent_repository.py
  • codeframe/persistence/repositories/task_repository.py
  • codeframe/ui/models.py
  • codeframe/ui/routers/agents.py
  • tests/agents/test_worker_agent.py
  • tests/integration/test_agent_maturity.py
  • web-ui/src/components/AgentCard.tsx
🧰 Additional context used
📓 Path-based instructions (13)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use TypeScript 5.3+ with strict mode for frontend development

Files:

  • web-ui/src/components/AgentCard.tsx
web-ui/src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

web-ui/src/**/*.{ts,tsx}: Use React 18 with TypeScript and Context + useReducer pattern for state management
Use shadcn/ui components from @/components/ui/ directory
Use Hugeicons (@hugeicons/react) for all icons instead of lucide-react
Implement WebSocket automatic reconnection with exponential backoff (1s → 30s)

Files:

  • web-ui/src/components/AgentCard.tsx
web-ui/**/*.{css,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Tailwind CSS with Nova design system template for styling

Files:

  • web-ui/src/components/AgentCard.tsx
{codeframe/**/*.py,web-ui/src/**/*.{ts,tsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{codeframe/**/*.py,web-ui/src/**/*.{ts,tsx}}: Use WebSockets for real-time updates between frontend and backend
Use last-write-wins strategy with backend timestamps for timestamp conflict resolution in multi-agent scenarios

Files:

  • web-ui/src/components/AgentCard.tsx
  • codeframe/persistence/repositories/task_repository.py
  • codeframe/ui/routers/agents.py
  • codeframe/persistence/database.py
  • codeframe/persistence/repositories/agent_repository.py
  • codeframe/ui/models.py
  • codeframe/agents/worker_agent.py
web-ui/src/**/*.{tsx,css}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Nova color palette variables (bg-card, text-foreground, etc.) instead of hardcoded color values

Files:

  • web-ui/src/components/AgentCard.tsx
web-ui/src/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

web-ui/src/**/*.tsx: Use cn() utility for conditional Tailwind CSS classes
Wrap AgentStateProvider with ErrorBoundary component for graceful error handling
Use useMemo for derived state calculations in React components

Files:

  • web-ui/src/components/AgentCard.tsx
web-ui/src/components/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

Implement React.memo on all Dashboard sub-components for performance optimization

Files:

  • web-ui/src/components/AgentCard.tsx
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use Python 3.11+ with type hints and async/await for backend development

Files:

  • codeframe/persistence/repositories/task_repository.py
  • codeframe/ui/routers/agents.py
  • codeframe/persistence/database.py
  • tests/integration/test_agent_maturity.py
  • codeframe/persistence/repositories/agent_repository.py
  • codeframe/ui/models.py
  • codeframe/agents/worker_agent.py
  • tests/agents/test_worker_agent.py
codeframe/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

codeframe/**/*.py: Use FastAPI with AsyncAnthropic for backend API development
Use SQLite with aiosqlite for async database operations
Use tiktoken for token counting in the backend
Use ruff for Python code linting and formatting
Use tiered memory system (HOT/WARM/COLD) for context management to achieve 30-50% token reduction
Implement session lifecycle management with file-based storage in .codeframe/session_state.json for CLI auto-save/restore

Files:

  • codeframe/persistence/repositories/task_repository.py
  • codeframe/ui/routers/agents.py
  • codeframe/persistence/database.py
  • codeframe/persistence/repositories/agent_repository.py
  • codeframe/ui/models.py
  • codeframe/agents/worker_agent.py
codeframe/persistence/repositories/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement Repository pattern with domain-specific repositories for data access

Files:

  • codeframe/persistence/repositories/task_repository.py
  • codeframe/persistence/repositories/agent_repository.py
codeframe/persistence/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Pre-production application: use flattened v1.0 database schema with direct table creation (no migration system)

Files:

  • codeframe/persistence/repositories/task_repository.py
  • codeframe/persistence/database.py
  • codeframe/persistence/repositories/agent_repository.py
tests/**/*.{py,ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use TestSprite and Playwright for E2E testing of workflows

Files:

  • tests/integration/test_agent_maturity.py
  • tests/agents/test_worker_agent.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run pytest with coverage tracking for Python backend tests

Files:

  • tests/integration/test_agent_maturity.py
  • tests/agents/test_worker_agent.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: docs/CLAUDE.md:0-0
Timestamp: 2025-11-25T19:08:37.203Z
Learning: Implement Lead Agent for orchestration and Worker Agents for specialization (Backend, Frontend, Test, Review) with maturity levels D1-D4
📚 Learning: 2025-12-24T04:24:43.804Z
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-24T04:24:43.804Z
Learning: Applies to web-ui/src/{contexts,reducers,hooks}/**/*.{ts,tsx} : Implement AgentStateContext with useReducer for multi-agent state management

Applied to files:

  • web-ui/src/components/AgentCard.tsx
📚 Learning: 2025-12-24T04:24:43.804Z
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-24T04:24:43.804Z
Learning: Applies to web-ui/src/**/*.tsx : Wrap AgentStateProvider with ErrorBoundary component for graceful error handling

Applied to files:

  • web-ui/src/components/AgentCard.tsx
📚 Learning: 2025-11-25T19:08:37.203Z
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: docs/CLAUDE.md:0-0
Timestamp: 2025-11-25T19:08:37.203Z
Learning: Implement Lead Agent for orchestration and Worker Agents for specialization (Backend, Frontend, Test, Review) with maturity levels D1-D4

Applied to files:

  • tests/integration/test_agent_maturity.py
  • codeframe/agents/worker_agent.py
  • tests/agents/test_worker_agent.py
📚 Learning: 2025-12-17T19:21:40.014Z
Learnt from: frankbria
Repo: frankbria/codeframe PR: 128
File: tests/agents/test_bottleneck_detection.py:486-500
Timestamp: 2025-12-17T19:21:40.014Z
Learning: In tests/agents/test_bottleneck_detection.py, test_skip_agents_below_threshold should be async and mock _get_agent_workload to return workload below AGENT_OVERLOAD_THRESHOLD (5) while providing non-empty tasks list to prevent early return in detect_bottlenecks().

Applied to files:

  • tests/integration/test_agent_maturity.py
📚 Learning: 2025-12-17T19:21:30.131Z
Learnt from: frankbria
Repo: frankbria/codeframe PR: 128
File: tests/agents/test_bottleneck_detection.py:486-500
Timestamp: 2025-12-17T19:21:30.131Z
Learning: In tests/agents/*.py, when testing bottleneck detection logic, ensure that tests exercising detect_bottlenecks are async and mock _get_agent_workload to return a value below AGENT_OVERLOAD_THRESHOLD (5) while providing a non-empty tasks list to prevent early return. This guarantees the code path for low workload is exercised and behavior under threshold is verified.

Applied to files:

  • tests/agents/test_worker_agent.py
🧬 Code graph analysis (3)
codeframe/persistence/repositories/task_repository.py (1)
codeframe/persistence/database.py (2)
  • get_tasks_by_agent (408-410)
  • _row_to_task (380-382)
codeframe/persistence/database.py (1)
codeframe/persistence/repositories/task_repository.py (1)
  • get_tasks_by_agent (543-582)
tests/agents/test_worker_agent.py (2)
tests/integration/test_agent_maturity.py (1)
  • db (18-22)
codeframe/agents/worker_agent.py (3)
  • WorkerAgent (39-1622)
  • assess_maturity (661-824)
  • should_assess_maturity (912-977)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Frontend Unit Tests
  • GitHub Check: Backend Unit Tests
  • GitHub Check: E2E Smoke Tests (Chromium)
  • GitHub Check: claude-review
  • GitHub Check: Greptile Review
🔇 Additional comments (24)
codeframe/persistence/repositories/agent_repository.py (1)

191-191: LGTM!

The addition of a.metrics to the SELECT list is correct and consistent with the existing query structure. This enables the API layer to access agent maturity metrics when retrieving agents for a project.

codeframe/ui/routers/agents.py (1)

198-212: LGTM!

The metrics parsing logic is well-structured with proper error handling. The defensive approach of falling back to None on parse failures ensures API stability while allowing the UI to gracefully handle missing metrics.

tests/agents/test_worker_agent.py (6)

925-952: LGTM!

This test correctly validates the baseline case where an agent with no task history defaults to D1 (directive/novice) maturity level with zero metrics.


953-1006: LGTM!

The test correctly validates D1 assignment for an agent with poor completion rate (20%) and no test data, verifying the weighted score calculation stays below 0.5.


1008-1089: LGTM!

The test comprehensively validates D2 (intermediate) assessment with moderate performance metrics. The score calculation aligns with the weighted formula: ~60% completion × 0.4 + ~41% test pass × 0.3 + 50% first-attempt × 0.3 ≈ 0.51.


1091-1152: LGTM!

The test validates advanced/expert maturity levels. The calculated score of 0.90 (90% completion × 0.4 + 80% test pass × 0.3 + 100% first-attempt × 0.3) correctly lands at the D4 threshold, and the flexible assertion accommodates boundary behavior.


1154-1213: LGTM!

The test correctly validates D4 (expert/delegating) assignment for an agent with perfect performance metrics: 100% completion, 100% test pass rate, and 100% first-attempt success resulting in a maturity score of 1.0.


1215-1431: Comprehensive test coverage for maturity system.

The tests thoroughly cover:

  • Database persistence of maturity updates
  • Audit trail logging
  • Graceful handling of missing test results
  • Self-correction rate calculation with correction attempts
  • Assessment trigger logic
  • Error handling for uninitialized database

Based on learnings, this aligns with the D1-D4 maturity level system for Worker Agents.

codeframe/ui/models.py (2)

165-177: LGTM!

The AgentMetricsResponse model is well-defined with appropriate field types and descriptions. Making all fields optional provides flexibility for agents that haven't been assessed yet.


191-193: LGTM!

The addition of the optional metrics field to AgentAssignmentResponse correctly integrates the new maturity metrics with the existing agent assignment data model.

codeframe/persistence/database.py (1)

408-411: LGTM!

The delegation method follows the established facade pattern, maintaining consistency with the other 50+ delegated methods in this class while exposing the new get_tasks_by_agent capability from TaskRepository.

web-ui/src/components/AgentCard.tsx (5)

3-22: LGTM!

The AgentMetrics interface and extended Agent type correctly mirror the backend AgentMetricsResponse model. The optional maturityLevel field properly restricts values to the four valid maturity states.


56-66: LGTM!

The maturity badge mapping correctly translates the backend D1-D4 levels to user-friendly labels (Novice → Expert) with appropriate Nova palette styling and intuitive icons.


114-130: LGTM!

The maturity badge renders conditionally when available, with a helpful tooltip showing the maturity score. The flex-wrap ensures clean layout when both badges are present.


156-187: LGTM!

The Performance Metrics section renders conditionally with proper undefined checks for each metric. The 2-column grid layout and percentage formatting provide a clean, scannable display. The "First Try" label for self_correction_rate is an intuitive user-facing term.


207-221: LGTM!

The memoization comparison correctly includes the new maturityLevel and metrics fields. Using JSON.stringify for comparing metrics is acceptable given the shallow structure and small size of the metrics object. As per coding guidelines, React.memo is properly applied to the Dashboard sub-component.

tests/integration/test_agent_maturity.py (4)

1-145: LGTM!

The test module is well-structured with appropriate fixtures. The test_maturity_progression_novice_to_expert test comprehensively validates the full D1→D4 maturity journey across four phases with progressively improving performance metrics. Based on learnings, this aligns with the D1-D4 maturity level system for Worker Agents.


146-213: LGTM!

The demotion test validates an important behavior: agent maturity can decrease when performance degrades. The flexible assertion (level < initial_level OR score < initial_score) correctly handles boundary cases where maturity thresholds may not be crossed despite score reduction.


215-322: LGTM!

The persistence tests validate two critical flows:

  1. Maturity level persists across agent instance recreations
  2. Metrics are properly accessible through the API layer (get_agents_for_project)

This ensures end-to-end data integrity for the maturity system.


324-535: LGTM!

The remaining test classes provide thorough coverage:

  • Audit trail: Validates audit log creation with correct event types and metadata
  • Triggers: Tests should_assess_maturity logic with task count thresholds
  • Task queries: Validates get_tasks_by_agent with and without project filtering

This completes the integration test coverage for the maturity system, covering 8 tests across persistence, auditing, triggers, and data access.

codeframe/agents/worker_agent.py (4)

826-847: LGTM!

The _get_current_maturity() method properly handles all edge cases: missing database, missing agent, missing maturity value, and invalid enum values with appropriate logging.


661-692: Well-structured method with clear documentation.

The docstring clearly explains the weighted scoring formula, maturity level thresholds, and return structure. The method signature change from None to Dict[str, Any] aligns with the PR objectives for exposing maturity metrics. Based on learnings, this implements the D1-D4 maturity levels as specified for Worker Agents.


736-751: No issues found. The code correctly accesses test_results as a list of dictionaries with "passed" and "failed" keys. The get_test_results_by_task() method explicitly returns List[Dict[str, Any]] as confirmed by the database tests, which verify dictionary keys like results[0]["passed"] and results[0]["failed"] exist. The defensive use of .get() with default values is appropriate.


864-870: The code is correct as written. Line 867 passes the maturity enum directly to update_agent, which is safe because the update_agent method at lines 111-112 of codeframe/persistence/repositories/agent_repository.py explicitly checks isinstance(value, AgentMaturity) and extracts .value before storing. No changes needed.

Comment on lines +543 to +582
def get_tasks_by_agent(
    self, agent_id: str, project_id: Optional[int] = None, limit: int = 100
) -> List[Task]:
    """Get all tasks assigned to an agent.

    Used for calculating agent maturity metrics based on task history.

    Args:
        agent_id: Agent ID to filter by (matches assigned_to field)
        project_id: Optional project ID to filter by
        limit: Maximum number of tasks to return (default: 100)

    Returns:
        List of Task objects ordered by created_at DESC (most recent first)
    """
    cursor = self.conn.cursor()

    if project_id is not None:
        cursor.execute(
            """
            SELECT * FROM tasks
            WHERE assigned_to = ? AND project_id = ?
            ORDER BY created_at DESC
            LIMIT ?
            """,
            (agent_id, project_id, limit),
        )
    else:
        cursor.execute(
            """
            SELECT * FROM tasks
            WHERE assigned_to = ?
            ORDER BY created_at DESC
            LIMIT ?
            """,
            (agent_id, limit),
        )

    rows = cursor.fetchall()
    return [self._row_to_task(row) for row in rows]

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's check the actual implementation at the specified lines
head -n 582 codeframe/persistence/repositories/task_repository.py | tail -n 40

Repository: frankbria/codeframe

Length of output: 1357


🏁 Script executed:

# Check for database schema definitions to see if indexes exist
find codeframe -name "*.py" -type f | xargs rg -l "CREATE TABLE.*tasks" -A 20

Repository: frankbria/codeframe

Length of output: 102


🏁 Script executed:

# Look for index definitions on assigned_to or tasks table
rg -n "CREATE INDEX" codeframe/persistence/ -A 3

Repository: frankbria/codeframe

Length of output: 20982


🏁 Script executed:

# Check the class definition and other methods to understand async/sync pattern
head -n 100 codeframe/persistence/repositories/task_repository.py

Repository: frankbria/codeframe

Length of output: 2454


🏁 Script executed:

# Check if aiosqlite is being used elsewhere in the file
rg -n "async\|aiosqlite\|await" codeframe/persistence/repositories/task_repository.py | head -20

Repository: frankbria/codeframe

Length of output: 45


🏁 Script executed:

# Search specifically for assigned_to index in schema
rg -n "assigned_to" codeframe/persistence/schema_manager.py

Repository: frankbria/codeframe

Length of output: 100


🏁 Script executed:

# Check if there are any async methods in TaskRepository to see the pattern
rg -n "async def" codeframe/persistence/repositories/task_repository.py | head -20

Repository: frankbria/codeframe

Length of output: 135


🏁 Script executed:

# Check the connection type - is it aiosqlite or sqlite3?
rg -n "self\.conn" codeframe/persistence/repositories/task_repository.py | head -20

Repository: frankbria/codeframe

Length of output: 796


🏁 Script executed:

# Check what BaseRepository looks like
cat codeframe/persistence/repositories/base.py

Repository: frankbria/codeframe

Length of output: 9930


Convert to async implementation and add index on assigned_to column for performance.

This method must be async per coding guidelines requiring aiosqlite for database operations in codeframe/persistence/**/*.py. Additionally, the assigned_to column lacks an index, which will cause performance issues when retrieving task history as the dataset grows.

Recommended changes:

  1. Convert method to async def using await self._execute_async() and await self._fetchall_async()
  2. Add CREATE INDEX IF NOT EXISTS idx_tasks_assigned_to ON tasks(assigned_to) to schema manager (or compound index with project_id for filtered queries)
🤖 Prompt for AI Agents
In codeframe/persistence/repositories/task_repository.py around lines 543 to
582, the synchronous get_tasks_by_agent should be converted to an async
implementation that uses aiosqlite helper methods and an index added to the
schema: change the signature to async def get_tasks_by_agent(...), replace
cursor.execute/fetchall with await self._execute_async(sql, params) and rows =
await self._fetchall_async(), then return [self._row_to_task(row) for row in
rows]; additionally add a schema migration or update in the schema manager to
create the index (e.g., CREATE INDEX IF NOT EXISTS idx_tasks_assigned_to ON
tasks(assigned_to) or a compound index including project_id) to improve query
performance.

@greptile-apps

greptile-apps Bot commented Dec 26, 2025

Greptile Summary

  • Implements comprehensive agent maturity assessment system that calculates weighted scores from task performance metrics (completion rate 40%, test pass rate 30%, self-correction rate 30%) mapping to four maturity levels D1-D4
  • Adds database layer support with new get_tasks_by_agent() query method and metrics JSON parsing in API endpoints
  • Updates frontend AgentCard.tsx with maturity badges and performance metrics display using Nova design system

Important Files Changed

| Filename | Overview |
| --- | --- |
| codeframe/agents/worker_agent.py | Implements full assess_maturity() method with weighted scoring algorithm and database persistence |
| web-ui/src/components/AgentCard.tsx | Adds maturity level badges (D1-D4) and performance metrics display with proper React.memo optimization |

Confidence score: 4/5

  • This PR is safe to merge with minimal risk as it implements a well-designed feature with comprehensive testing
  • Score reflects thorough testing (19 total tests), clear implementation of specified requirements, and proper integration with existing systems; minor deduction for complexity in the scoring algorithm and potential performance impact from frequent database queries
  • Pay close attention to the WorkerAgent.assess_maturity() method's database operations and ensure the assessment frequency logic in should_assess_maturity() performs well in production

Sequence Diagram

sequenceDiagram
    participant User
    participant API as "Agent API"
    participant Agent as "WorkerAgent"
    participant TaskRepo as "TaskRepository"
    participant TestRepo as "TestRepository"
    participant CorrRepo as "CorrectionRepository"
    participant AuditRepo as "AuditRepository"
    participant UI as "AgentCard UI"

    User->>API: "GET /projects/{id}/agents"
    API->>Agent: "assess_maturity()"
    
    Agent->>TaskRepo: "get_tasks_by_agent(agent_id)"
    TaskRepo-->>Agent: "task history list"
    
    Agent->>Agent: "calculate completion_rate"
    Note over Agent: "completed_tasks / total_tasks * 40%"
    
    Agent->>TestRepo: "get_test_results_by_task(task_id)"
    TestRepo-->>Agent: "test results"
    Agent->>Agent: "calculate avg_test_pass_rate"
    Note over Agent: "average pass rate * 30%"
    
    Agent->>CorrRepo: "get_correction_attempts_by_task(task_id)"
    CorrRepo-->>Agent: "correction attempts"
    Agent->>Agent: "calculate self_correction_rate"
    Note over Agent: "first attempt success * 30%"
    
    Agent->>Agent: "compute weighted maturity_score"
    Agent->>Agent: "map score to maturity level (D1-D4)"
    
    Agent->>API: "update_agent_maturity()"
    Agent->>AuditRepo: "create_audit_log()"
    
    API-->>User: "agents with maturity data"
    User->>UI: "render agent cards"
    UI->>UI: "display maturity badges and metrics"

@greptile-apps greptile-apps Bot left a comment

9 files reviewed, 5 comments

    Raises:
        HTTPException: 404 if project not found, 500 on database error
    """
    import json

style: Import placed inside function instead of at module level

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Path: codeframe/ui/routers/agents.py
Line: 183:183

{agent.metrics.avg_test_pass_rate !== undefined && (
  <div>
    <span className="text-muted-foreground">Test Pass:</span>{' '}
    <span className="font-medium">{(agent.metrics.avg_test_pass_rate * 100).toFixed(0)}%</span>

logic: The test pass rate is already a percentage (0-100) but you're multiplying by 100 again - this will show 10000% for a 100% pass rate

Suggested change
<span className="font-medium">{(agent.metrics.avg_test_pass_rate * 100).toFixed(0)}%</span>
<span className="font-medium">{(agent.metrics.avg_test_pass_rate).toFixed(0)}%</span>

Should avg_test_pass_rate be stored as a decimal (0-1) like the other rates, or as a percentage (0-100)?

Path: web-ui/src/components/AgentCard.tsx
Line: 170:170

Comment on lines +1048 to +1056
if status == TaskStatus.COMPLETED:
    passed = 5 if i % 2 == 0 else 3
    failed = 5 if i % 2 == 0 else 7
    db.create_test_result(
        task_id=task_id,
        status="passed" if passed > failed else "failed",
        passed=passed,
        failed=failed,
    )

style: Test data creates inconsistent pass rates - tasks alternate between 50% (5 pass, 5 fail) and 30% (3 pass, 7 fail) pass rates, but comment claims ~50% overall. This makes the expected score calculation less predictable. Should the test use consistent pass rates to make the expected maturity score more deterministic?

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Path: tests/agents/test_worker_agent.py
Line: 1048:1056

Comment on lines +1082 to +1086
# Score calculation:
# - Completion: 60% * 0.4 = 0.24
# - Test pass: ~41% * 0.3 = 0.12
# - Self-correction: 50% * 0.3 = 0.15
# Total: ~0.51, which is D2 (0.5-0.7)

logic: Score calculation comment shows ~41% test pass rate, but the actual calculation with alternating 50%/30% rates would be different. The math should be verified or test data adjusted for consistency.

Suggested change
# Score calculation:
# - Completion: 60% * 0.4 = 0.24
# - Test pass: ~41% * 0.3 = 0.12
# - Self-correction: 50% * 0.3 = 0.15
# Total: ~0.51, which is D2 (0.5-0.7)
# Score calculation:
# - Completion: 60% * 0.4 = 0.24
# - Test pass: 40% * 0.3 = 0.12
# - Self-correction: 50% * 0.3 = 0.15
# Total: 0.51, which is D2 (0.5-0.7)
Path: tests/agents/test_worker_agent.py
Line: 1082:1086

Comment on lines +1148 to +1152
# 90% completion * 0.4 + 80% test pass * 0.3 + 100% first attempt * 0.3
# = 0.36 + 0.24 + 0.30 = 0.90 -> D3 or D4
# Score should be around 0.7-0.9 for advanced
assert result["maturity_level"] in [AgentMaturity.D3, AgentMaturity.D4]
assert result["maturity_score"] >= 0.7

style: Test accepts either D3 or D4 for the same score calculation (0.90), but the assertion should be more specific since 0.90 exactly hits the D4 threshold. Should this test specifically expect D4 since the calculated score (0.90) exactly meets the D4 threshold?

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Path: tests/agents/test_worker_agent.py
Line: 1148:1152

- Add get_tasks_by_agent_async() method to TaskRepository using aiosqlite
- Add async facade method to Database class
- Add idx_tasks_assigned_to compound index (assigned_to, project_id, created_at)
- Add 2 async integration tests for the new method
@claude

claude Bot commented Dec 26, 2025

Pull Request Review: Agent Maturity Assessment System

Thank you for this comprehensive implementation of the agent maturity assessment system! Overall, this is a well-structured feature with excellent test coverage and documentation. Below are my findings organized by category.


Strengths

  1. Excellent Documentation: The docstrings are thorough and include examples, parameter descriptions, and clear explanations of the maturity scoring algorithm.

  2. Comprehensive Test Coverage: 19 tests (11 unit + 8 integration) covering all maturity levels and edge cases is outstanding.

  3. Performance Optimization: The addition of the idx_tasks_assigned_to compound index is a smart optimization for the maturity queries.

  4. Proper Separation of Concerns: The implementation properly delegates to repositories, maintains audit logging, and follows the existing architectural patterns.

  5. UI Integration: The frontend implementation correctly uses Nova design system colors and integrates performance metrics cleanly into the AgentCard component.

  6. Async Support: Providing both sync (get_tasks_by_agent) and async (get_tasks_by_agent_async) versions shows forward-thinking design.


⚠️ Issues & Concerns

1. Database Index Syntax Error (Critical)

Location: codeframe/persistence/schema_manager.py:506

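The index definition specifies a sort order on its last column:

CREATE INDEX IF NOT EXISTS idx_tasks_assigned_to ON tasks(assigned_to, project_id, created_at DESC)

Problem: SQLite accepts DESC in index column definitions, but whether it takes effect depends on the database's schema format: the legacy format (1) silently ignores index sort order, while the descending-index format (4) honors it. The keyword is also unnecessary here, because SQLite can scan an ascending index in reverse.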

Fix: Remove DESC from the index definition:

CREATE INDEX IF NOT EXISTS idx_tasks_assigned_to ON tasks(assigned_to, project_id, created_at)

The query optimizer will still use the index efficiently for ORDER BY created_at DESC.

Verification: Run sqlite3 your.db ".schema tasks" to confirm the index was created correctly.
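
One way to check the claim about ORDER BY created_at DESC, rather than only confirming that the index exists, is to ask SQLite for its query plan. The sketch below uses Python's standard sqlite3 module against an in-memory database; the tasks table is a simplified stand-in for the real schema (column names taken from the query in this PR, types assumed), and query_plan is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in for the real tasks table (column types are assumed).
conn.execute(
    "CREATE TABLE tasks ("
    "id INTEGER PRIMARY KEY, assigned_to TEXT, project_id INTEGER, created_at TEXT)"
)
# The index as proposed in the fix above, without DESC.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_tasks_assigned_to "
    "ON tasks(assigned_to, project_id, created_at)"
)


def query_plan(connection: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return the detail column of EXPLAIN QUERY PLAN joined into one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth column is the detail text


plan = query_plan(
    conn,
    "SELECT * FROM tasks WHERE assigned_to = ? AND project_id = ? "
    "ORDER BY created_at DESC LIMIT ?",
    ("agent-1", 1, 100),
)
print(plan)
```

If the plan names idx_tasks_assigned_to and contains no "USE TEMP B-TREE FOR ORDER BY" step, the ascending index is serving both the filter and the descending sort.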


2. Logic Error in Self-Correction Rate Calculation

Location: codeframe/agents/worker_agent.py:761

The current logic treats a single correction attempt with attempt_number=1 as a success:

elif len(correction_attempts) == 1 and correction_attempts[0].get("attempt_number", 1) == 1:
    # Only one attempt and it was the first one
    first_attempt_success_count += 1

Problem: The existence of a correction attempt record typically indicates something went wrong and needed correction. A task with correction_attempts records should generally NOT count as first-attempt success.

Expected Behavior:

  • First-attempt success: Task completed with NO correction_attempts records
  • Self-corrected: Task completed WITH correction_attempts records (multiple attempts)

Fix: Simplify the logic to only count tasks with no correction attempts:

# Step 4: Calculate self-correction rate
first_attempt_success_count = 0
for task in completed_tasks:
    correction_attempts = self.db.get_correction_attempts_by_task(task.id)
    if not correction_attempts:
        # No correction attempts means first attempt succeeded
        first_attempt_success_count += 1

Impact: This bug artificially inflates the self-correction rate, which could promote agents to higher maturity levels prematurely.
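
To make the intended semantics concrete, here is a small standalone sketch of the rate under the corrected rule. The function name and the dictionary-based input are hypothetical (the real method reads from the database), and returning 0.0 when there are no completed tasks is an assumption of this sketch.

```python
from typing import Dict, List


def first_attempt_success_rate(
    completed_task_ids: List[int],
    attempts_by_task: Dict[int, List[dict]],
) -> float:
    """Fraction of completed tasks that have no correction attempt records."""
    if not completed_task_ids:
        return 0.0  # assumption: no completed tasks means no evidence of success
    first_attempt_successes = sum(
        1 for task_id in completed_task_ids if not attempts_by_task.get(task_id)
    )
    return first_attempt_successes / len(completed_task_ids)


# Tasks 1 and 3 have no correction records; tasks 2 and 4 needed corrections.
attempts = {
    2: [{"attempt_number": 1}],
    4: [{"attempt_number": 1}, {"attempt_number": 2}],
}
rate = first_attempt_success_rate([1, 2, 3, 4], attempts)
```

Under the original logic, task 2 (a single record with attempt_number 1) would also have counted as a success, giving 0.75 instead of 0.5 for this data.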


3. Missing Import for AgentMaturity Enum

Location: codeframe/agents/worker_agent.py:706

The code references AgentMaturity enum but doesn't import it at the module level:

new_maturity = AgentMaturity.D1

Current State: The import is missing from the top of the file.

Fix: Add to imports at the top of worker_agent.py:

from codeframe.core.models import AgentMaturity

Note: The code currently imports TaskStatus inline (line 693), but AgentMaturity should be imported at module level for consistency.


4. Potential N+1 Query Performance Issue

Location: codeframe/agents/worker_agent.py:737-757

The maturity assessment performs multiple database queries in loops:

for task in completed_tasks:
    test_results = self.db.get_test_results_by_task(task.id)  # Query 1
    # ...
    correction_attempts = self.db.get_correction_attempts_by_task(task.id)  # Query 2

Problem: For an agent with 100 completed tasks, this results in 200 additional database queries (2 per task).

Recommendation: Add batch query methods to repositories:

  • get_test_results_by_tasks(task_ids: List[int])
  • get_correction_attempts_by_tasks(task_ids: List[int])

Then fetch all data upfront:

task_ids = [t.id for t in completed_tasks]
all_test_results = self.db.get_test_results_by_tasks(task_ids)
all_correction_attempts = self.db.get_correction_attempts_by_tasks(task_ids)

Priority: Medium - This becomes critical as agent task history grows.
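
A batch method along these lines could look like the sketch below. It is only an illustration: get_test_results_by_tasks does not exist in the repository yet, the test_results columns shown are assumed, and the function takes a plain sqlite3 connection instead of living on a repository class.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List


def get_test_results_by_tasks(
    conn: sqlite3.Connection, task_ids: List[int]
) -> Dict[int, List[dict]]:
    """Fetch test results for many tasks in one query, grouped by task id."""
    if not task_ids:
        return {}
    # One "?" per id keeps the query parameterized; never interpolate the ids.
    placeholders = ",".join("?" for _ in task_ids)
    rows = conn.execute(
        "SELECT task_id, passed, failed FROM test_results "
        f"WHERE task_id IN ({placeholders})",
        list(task_ids),
    ).fetchall()
    grouped: Dict[int, List[dict]] = defaultdict(list)
    for task_id, passed, failed in rows:
        grouped[task_id].append({"passed": passed, "failed": failed})
    return dict(grouped)
```

Tasks with no results are simply absent from the returned dictionary, so callers would use .get(task.id, []). SQLite caps the number of bound parameters per statement (999 by default before version 3.32.0), which the default limit of 100 tasks stays well under.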


5. Inconsistent Metrics Field Handling

Location: codeframe/persistence/repositories/agent_repository.py:191

The get_agents_for_project query was updated to include a.metrics, but other agent queries may not include this field.

Check Needed: Verify these methods also select the metrics column:

  • get_agent()
  • get_all_agents()
  • Any other agent query methods

Recommendation: Ensure consistent column selection across all agent queries to prevent None values when metrics should be populated.


6. Frontend Type Safety Issue

Location: web-ui/src/components/AgentCard.tsx:64-66

The maturity badge lookup doesn't handle undefined maturity levels gracefully:

const maturityBadge = agent.maturityLevel
  ? maturityBadges[agent.maturityLevel]
  : null;

Problem: If agent.maturityLevel is an unexpected value (e.g., due to database corruption or API changes), this will render incorrectly.

Fix: Add fallback handling:

const maturityBadge = agent.maturityLevel && maturityBadges[agent.maturityLevel]
  ? maturityBadges[agent.maturityLevel]
  : null;

Priority: Low - Defensive programming best practice.


🔍 Code Quality Observations

  1. Magic Numbers: The weights (0.4, 0.3, 0.3) and thresholds (0.5, 0.7, 0.9) are hardcoded. Consider extracting as class constants for easier tuning:

    MATURITY_WEIGHTS = {
        'completion_rate': 0.4,
        'test_pass_rate': 0.3,
        'self_correction_rate': 0.3
    }
    MATURITY_THRESHOLDS = {
        'D2': 0.5,  # Intermediate
        'D3': 0.7,  # Advanced  
        'D4': 0.9   # Expert
    }
  2. Error Handling: The _update_agent_maturity and _log_maturity_assessment methods swallow exceptions silently. Consider adding error metrics or raising warnings to a monitoring system.

  3. Duplicate JSON Parsing: The pattern of checking isinstance(x, str) then json.loads() appears multiple times. Consider a helper method:

    def _parse_metrics_json(self, metrics_json: Any) -> Optional[Dict[str, Any]]:
        if not metrics_json:
            return None
        try:
            return json.loads(metrics_json) if isinstance(metrics_json, str) else metrics_json
        except (json.JSONDecodeError, TypeError):
            return None
  4. Frontend Memoization: The custom memo comparison (AgentCard.tsx:209-219) uses JSON.stringify for deep comparison. This works but could be expensive for large metrics objects. Consider using a library like react-fast-compare or accepting shallow comparison.


🔒 Security Review

No security concerns identified. The implementation:

  • ✅ Properly sanitizes inputs via parameterized queries
  • ✅ Doesn't expose sensitive data in audit logs
  • ✅ Uses appropriate permissions (system-initiated, no user context needed)

🧪 Test Coverage Assessment

Excellent coverage! Tests verify:

  • ✅ All 4 maturity levels (D1-D4)
  • ✅ Edge cases (no tasks, no tests, no correction attempts)
  • ✅ Boundary conditions (score thresholds at 0.5, 0.7, 0.9)
  • ✅ Database integration and audit logging
  • ✅ Async operations

Suggestion: Add a test for the N+1 query scenario with a large number of tasks (e.g., 100+) to validate performance.


📝 Recommendations Summary

| Priority | Issue | Action |
| --- | --- | --- |
| 🔴 Critical | Database index syntax | Remove DESC from index definition |
| 🔴 Critical | Self-correction logic bug | Only count tasks with zero correction attempts |
| 🟡 High | Missing import | Add AgentMaturity to module imports |
| 🟡 Medium | N+1 query performance | Add batch query methods |
| 🟢 Low | Inconsistent metrics field | Verify all agent queries include metrics |
| 🟢 Low | Frontend type safety | Add fallback for invalid maturity levels |

Final Verdict

This is a high-quality implementation with strong documentation and testing. The critical issues (index syntax and self-correction logic) should be fixed before merging, but they're straightforward to address.

Once the critical issues are resolved, this PR will be an excellent addition to the codebase!

Recommendation: Request changes to fix the critical issues, then approve.


Great work on this feature! 🎉

- Remove DESC from index definition (SQLite ignores it in column specs)
- Fix self-correction rate: tasks with any correction_attempts records
  should NOT count as first-attempt successes
- Move TaskStatus import to module level for consistency
- Remove inline imports of TaskStatus (was duplicated 3 times)
@claude

claude Bot commented Dec 27, 2025

Code Review: Agent Maturity System

I've reviewed PR #152 implementing the agent maturity assessment system. Overall, this is a well-designed feature with comprehensive testing. Here's my detailed feedback:

✅ Strengths

  1. Excellent Documentation

    • Clear docstrings with usage examples (e.g., assess_maturity())
    • Well-commented code explaining the scoring formula
    • PR description clearly outlines the maturity levels and scoring
  2. Comprehensive Test Coverage

    • 11 unit tests + 8 integration tests covering all maturity levels (D1-D4)
    • Tests cover edge cases: no tasks, progression, demotion, and boundary conditions
    • Integration tests verify the full workflow including DB persistence and API responses
  3. Database Design

    • Added idx_tasks_assigned_to index for optimal query performance (schema_manager.py:504)
    • Both sync and async versions of get_tasks_by_agent() provided
    • Proper delegation pattern through Database facade
  4. Audit Trail

    • _log_maturity_assessment() creates audit logs for all assessments (worker_agent.py:849-870)
    • Metadata includes old/new maturity, score, and metrics
  5. UI Integration

    • AgentCard.tsx displays maturity badges with clear visual hierarchy
    • Performance metrics shown in a clean, readable format
    • Proper React.memo optimization with custom comparison

🐛 Potential Issues

1. Score Calculation Logic Error (worker_agent.py:753-761)

The self-correction rate calculation may be misleading:

for task in completed_tasks:
    correction_attempts = self.db.get_correction_attempts_by_task(task.id)
    if not correction_attempts:
        first_attempt_success_count += 1

Problem: Completed tasks without correction attempt records count as "first attempt success" even when the agent never used the correction system. The loop iterates only completed_tasks, so failed or pending tasks are already excluded, but the rate could still be inflated for agents that don't record correction attempts.

Suggestion: Only count completed tasks that explicitly succeeded on the first attempt:

for task in completed_tasks:
    correction_attempts = self.db.get_correction_attempts_by_task(task.id)
    if len(correction_attempts) == 0:
        # Only count if task is COMPLETED (already filtered)
        first_attempt_success_count += 1

2. Missing Error Handling in API (routers/agents.py:195-212)

The metrics JSON parsing happens after get_agents_for_project() but doesn't handle all failure modes:

for agent in agents:
    metrics_json = agent.get("metrics")
    if metrics_json:
        try:
            agent["metrics"] = (
                json.loads(metrics_json)
                if isinstance(metrics_json, str)
                else metrics_json
            )
        except (json.JSONDecodeError, TypeError):
            agent["metrics"] = None

Issue: If an agent has invalid JSON, it silently sets metrics to None. This could hide data corruption issues.

Suggestion: Log a warning when JSON parsing fails:

except (json.JSONDecodeError, TypeError) as e:
    logger.warning(f"Failed to parse metrics for agent {agent.get('id')}: {e}")
    agent["metrics"] = None

3. Race Condition Risk (worker_agent.py:702)

The maturity assessment reads old maturity, calculates new maturity, then updates:

old_maturity = self._get_current_maturity()  # Read
tasks = self.db.get_tasks_by_agent(self.agent_id)
# ... calculations ...
self._update_agent_maturity(new_maturity, metrics)  # Write

Issue: If two processes assess maturity concurrently, they could overwrite each other's updates.

Suggestion: Add optimistic locking or timestamp-based conflict detection. Alternatively, document that maturity assessment should be single-threaded per agent.
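
As an illustration of the optimistic-locking option, the sketch below makes the write conditional on a version counter read at the start of the assessment. Everything schema-related here is hypothetical: the agents table layout, the version column, and the function name are assumptions for the example, not part of this PR.

```python
import sqlite3


def update_maturity_if_unchanged(
    conn: sqlite3.Connection,
    agent_id: str,
    new_maturity: str,
    expected_version: int,
) -> bool:
    """Write the new maturity only if nobody else updated the row meanwhile.

    Returns True when the update was applied, False when the version moved on
    and the caller should re-read and re-assess.
    """
    cursor = conn.execute(
        "UPDATE agents SET maturity_level = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_maturity, agent_id, expected_version),
    )
    conn.commit()
    # rowcount is 1 when the WHERE clause matched, 0 when the version was stale.
    return cursor.rowcount == 1
```

A caller would read the version together with the old maturity, run the calculations, and retry the whole assessment when the function returns False. Documenting single-threaded assessment per agent, as suggested above, avoids the extra column entirely.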

⚠️ Performance Considerations

1. N+1 Query Problem (worker_agent.py:735-747)

The test pass rate calculation iterates through completed tasks and makes a DB query for each:

for task in completed_tasks:
    test_results = self.db.get_test_results_by_task(task.id)  # One query per task

Impact: For an agent with 100 completed tasks, this makes 100+ database queries.

Suggestion: Add a batch query method:

def get_test_results_by_tasks(self, task_ids: List[int]) -> Dict[int, List[dict]]:
    # Single query: SELECT * FROM test_results WHERE task_id IN (...)

Same issue exists for correction attempts (line 757).

2. Large Task Lists (worker_agent.py:702)

get_tasks_by_agent() has a default limit of 100, but for long-running agents this might not represent their full history.

Suggestion: Consider adding configurable lookback windows:

  • Last 30 days of tasks
  • Last 100 tasks (current)
  • All-time (opt-in)

🔒 Security Concerns

No significant security issues found. The system properly:

  • Validates agent_id before queries
  • Uses parameterized queries (SQL injection protection)
  • Doesn't expose sensitive data in API responses

📝 Best Practices

1. Magic Numbers (worker_agent.py:771-781)

The maturity thresholds are hardcoded:

if maturity_score >= 0.9:
    new_maturity = AgentMaturity.D4
elif maturity_score >= 0.7:
    new_maturity = AgentMaturity.D3

Suggestion: Extract to class constants or config:

MATURITY_THRESHOLDS = {
    AgentMaturity.D4: 0.9,
    AgentMaturity.D3: 0.7,
    AgentMaturity.D2: 0.5,
    AgentMaturity.D1: 0.0,
}

2. Incomplete Type Hints (worker_agent.py:894)

should_assess_maturity() uses Dict and Any but could be more specific:

def should_assess_maturity(self, min_tasks_since_last: int = 5) -> bool:

Already has proper type hints ✅

3. Frontend Type Safety (AgentCard.tsx:3-22)

The AgentMetrics interface is well-defined, but the mapping could be type-safe:

const maturityBadges: Record<string, { bg: string; text: string; icon: string; label: string }> = {

Suggestion: Use a union type instead of string:

type MaturityLevel = 'directive' | 'coaching' | 'supporting' | 'delegating';
const maturityBadges: Record<MaturityLevel, { ... }> = { ... };

📊 Test Coverage Gaps

While test coverage is excellent, consider adding:

  1. Concurrent Assessment Test: Verify behavior when multiple processes assess maturity simultaneously
  2. Large Dataset Test: Test with 1000+ tasks to verify performance
  3. Invalid Data Test: Test with corrupted metrics JSON in database
  4. Time-Based Assessment Test: Verify should_assess_maturity() time-based triggering

✨ Recommendations

Priority 1 (Must Fix)

  1. Fix the self-correction rate logic to avoid counting non-correction tasks incorrectly
  2. Add logging for JSON parsing failures in API

Priority 2 (Should Fix)

  1. Implement batch queries for test results and correction attempts
  2. Add documentation about thread-safety expectations
  3. Extract maturity thresholds to constants

Priority 3 (Nice to Have)

  1. Add configurable lookback windows for task history
  2. Improve TypeScript type safety in frontend
  3. Add performance tests for large datasets

📖 Documentation

Consider adding:

  • A design document explaining the maturity model rationale
  • Performance characteristics and recommended assessment frequency
  • Migration guide if existing agents need initial assessment

Overall Assessment: This is high-quality code that follows the codebase conventions well. The scoring algorithm is sound, testing is thorough, and the UI integration is clean. Address the N+1 query issue and self-correction logic, and this will be production-ready.

Recommendation: ✅ Approve with minor changes requested

Great work on this feature! The maturity system will provide valuable insights into agent performance.

@frankbria frankbria merged commit d76babf into main Dec 27, 2025
6 of 9 checks passed
@frankbria frankbria deleted the claude/agent-maturity-system-C8WZ0 branch December 27, 2025 05:07

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
codeframe/agents/worker_agent.py (1)

663-823: Well-structured maturity assessment implementation.

The weighted scoring formula and maturity level mapping align with the PR objectives (D1-D4 levels). The self-correction rate logic correctly counts only tasks with no correction attempts as first-attempt successes, per the commit fix.

Minor performance consideration: Lines 737-747 and 756-760 execute separate DB queries for each completed task (N+1 pattern). This is acceptable for small task counts, but if agents accumulate many tasks over time, consider batch-fetching test results and correction attempts in single queries.
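
One possible shape for the batch fetch, as a sketch only: the `test_results` table, its columns, and the function name are assumptions for illustration and may not match the real schema. It uses the synchronous stdlib `sqlite3` module for brevity; the same query shape should carry over to `aiosqlite`.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List, Sequence


def fetch_test_results_batch(
    conn: sqlite3.Connection, task_ids: Sequence[int]
) -> Dict[int, List[bool]]:
    """Return {task_id: [passed, ...]} with one query instead of one per task."""
    if not task_ids:
        return {}
    # One "?" per ID keeps the query parameterized.
    placeholders = ",".join("?" for _ in task_ids)
    rows = conn.execute(
        f"SELECT task_id, passed FROM test_results WHERE task_id IN ({placeholders})",
        list(task_ids),
    ).fetchall()
    grouped: Dict[int, List[bool]] = defaultdict(list)
    for task_id, passed in rows:
        grouped[task_id].append(bool(passed))
    return dict(grouped)


# Usage against an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_results (task_id INTEGER, passed INTEGER)")
conn.executemany(
    "INSERT INTO test_results VALUES (?, ?)",
    [(1, 1), (1, 0), (2, 1), (3, 1)],
)
results = fetch_test_results_batch(conn, [1, 2])
```

SQLite caps the number of bound parameters per statement, so a very long ID list may need to be split into chunks.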

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63733fa and 1c14157.

📒 Files selected for processing (2)
  • codeframe/agents/worker_agent.py
  • codeframe/persistence/schema_manager.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use Python 3.11+ with type hints and async/await for backend development

Files:

  • codeframe/persistence/schema_manager.py
  • codeframe/agents/worker_agent.py
codeframe/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

codeframe/**/*.py: Use FastAPI with AsyncAnthropic for backend API development
Use SQLite with aiosqlite for async database operations
Use tiktoken for token counting in the backend
Use ruff for Python code linting and formatting
Use tiered memory system (HOT/WARM/COLD) for context management to achieve 30-50% token reduction
Implement session lifecycle management with file-based storage in .codeframe/session_state.json for CLI auto-save/restore

Files:

  • codeframe/persistence/schema_manager.py
  • codeframe/agents/worker_agent.py
{codeframe/**/*.py,web-ui/src/**/*.{ts,tsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{codeframe/**/*.py,web-ui/src/**/*.{ts,tsx}}: Use WebSockets for real-time updates between frontend and backend
Use last-write-wins strategy with backend timestamps for timestamp conflict resolution in multi-agent scenarios

Files:

  • codeframe/persistence/schema_manager.py
  • codeframe/agents/worker_agent.py
codeframe/persistence/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Pre-production application: use flattened v1.0 database schema with direct table creation (no migration system)

Files:

  • codeframe/persistence/schema_manager.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: docs/CLAUDE.md:0-0
Timestamp: 2025-11-25T19:08:37.203Z
Learning: Implement Lead Agent for orchestration and Worker Agents for specialization (Backend, Frontend, Test, Review) with maturity levels D1-D4
📚 Learning: 2025-11-25T19:08:37.203Z
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: docs/CLAUDE.md:0-0
Timestamp: 2025-11-25T19:08:37.203Z
Learning: Implement Lead Agent for orchestration and Worker Agents for specialization (Backend, Frontend, Test, Review) with maturity levels D1-D4

Applied to files:

  • codeframe/agents/worker_agent.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Frontend Unit Tests
  • GitHub Check: Backend Unit Tests
  • GitHub Check: claude-review
  • GitHub Check: E2E Smoke Tests (Chromium)
🔇 Additional comments (5)
codeframe/persistence/schema_manager.py (1)

504-507: LGTM! Well-designed index for agent maturity queries.

The composite index on (assigned_to, project_id, created_at) correctly supports the get_tasks_by_agent query pattern. The column ordering is optimal for filtering by agent first, then optionally by project, with efficient ordering by creation time.
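
A quick way to confirm the index is picked up, sketched against an in-memory SQLite database. The column names come from the comment above; the `tasks` table definition, the index name, and the query text are assumptions and may differ from `schema_manager.py`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, trimmed-down table: only the columns the index needs.
conn.execute(
    """
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        assigned_to TEXT,
        project_id INTEGER,
        created_at TEXT
    )
    """
)
# Hypothetical index name; the column order matches the review comment.
conn.execute(
    "CREATE INDEX idx_tasks_agent_project_created "
    "ON tasks (assigned_to, project_id, created_at)"
)

# Ask SQLite how it would run an agent + project lookup ordered by creation time.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM tasks "
    "WHERE assigned_to = ? AND project_id = ? "
    "ORDER BY created_at",
    ("agent-1", 1),
).fetchall()

# The last column of each plan row is the human-readable detail string.
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

If the index name appears in the printed plan, the lookup is served by the index rather than a full table scan.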

codeframe/agents/worker_agent.py (4)

23-25: LGTM!

Module-level import of TaskStatus is a clean refactor that eliminates inline imports and follows Python best practices.


873-909: LGTM!

Audit logging is well-implemented with appropriate metadata and graceful error handling that won't block the maturity assessment flow.


911-975: LGTM! Well-designed assessment trigger logic.

The three conditions (24-hour threshold, minimum completed tasks since last assessment, and first-time assessment) provide sensible controls for when to re-evaluate maturity. The timezone handling at line 955 correctly handles both Z and +00:00 ISO format suffixes.

Note: The inline import json at line 929 should be addressed along with the same issue in _update_agent_maturity().
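
The time-based part of the trigger can be sketched as below. This is an illustration, not the code under review: the function names are hypothetical, treating a timestamp without an offset as UTC is an assumption, and the completed-task count condition is left out because its threshold is not stated here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The 24-hour threshold is taken from the review comment above.
ASSESSMENT_INTERVAL = timedelta(hours=24)


def parse_utc_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp ending in either 'Z' or '+00:00'."""
    if value.endswith("Z"):
        # Normalize the 'Z' suffix to an explicit UTC offset.
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps stored without an offset are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_assessment_due(last_assessed: Optional[str], now: datetime) -> bool:
    """True for a first-time assessment or once 24 hours have elapsed."""
    if last_assessed is None:
        return True
    return now - parse_utc_timestamp(last_assessed) >= ASSESSMENT_INTERVAL
```

Passing `now` in as an argument, rather than calling `datetime.now()` inside the function, keeps the time-based test case deterministic.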


848-871: Move json import to module level.

The inline import json at line 857 should be moved to the module-level imports, consistent with best practices. This import is also used in should_assess_maturity() at line 929.

The enum serialization concern is not applicable—update_agent() already handles AgentMaturity enum-to-string conversion automatically (see codeframe/persistence/repositories/agent_repository.py lines 110-114).

+import json
 from codeframe.core.models import (
     Task, TaskStatus, AgentMaturity, ContextItemType, ContextTier, CallType
 )
     def _update_agent_maturity(
         self, maturity: AgentMaturity, metrics: Dict[str, Any]
     ) -> None:
-        import json
-
         if not self.db:
             return

         try:
             self.db.update_agent(
                 self.agent_id,
                 {
                     "maturity_level": maturity,
                     "metrics": json.dumps(metrics),
                 },
             )

Likely an incorrect or invalid review comment.

Development

Successfully merging this pull request may close these issues.

[P2] Implement WorkerAgent.assess_maturity() system

2 participants