Implement agent maturity level system#152
Conversation
Add comprehensive maturity assessment for worker agents based on task performance history. The system calculates a weighted score from completion rate (40%), test pass rate (30%), and self-correction rate (30%), mapping to four maturity levels (D1-D4). Key changes: - Add assess_maturity() method to WorkerAgent with metrics calculation - Add get_tasks_by_agent() query to TaskRepository - Update AgentCard.tsx with maturity badges and performance metrics - Add AgentMetricsResponse model for API responses - Include metrics JSON in agent API endpoint responses - Add 11 unit tests and 8 integration tests for maturity system Maturity levels: - D1 (directive/novice): score < 0.5 - D2 (coaching/intermediate): 0.5 <= score < 0.7 - D3 (supporting/advanced): 0.7 <= score < 0.9 - D4 (delegating/expert): score >= 0.9
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughImplements agent maturity assessment: computes weighted maturity score (completion, test pass, self-correction), maps to D1–D4, persists metrics to DB, emits audit logs, adds assessment trigger logic, repository/query support, API/UI model updates, and unit + integration tests. Changes
Sequence DiagramsequenceDiagram
actor User
participant UI as Web UI
participant API as Agents Router
participant Agent as WorkerAgent
participant DB as Database
participant TaskRepo as TaskRepository
participant Audit as AuditLog
User->>UI: View agents list / open agent
UI->>API: GET /projects/{id}/agents
API->>DB: get_agents_for_project()
DB-->>API: agents (raw metrics)
alt Assessment trigger (should_assess_maturity == true)
API->>Agent: trigger assess_maturity(agent_id)
Agent->>DB: get_tasks_by_agent(agent_id)
DB->>TaskRepo: execute query (assigned_to[, project_id])
TaskRepo-->>DB: return task rows
DB-->>Agent: task history
rect rgba(220,235,250,0.9)
Note over Agent: compute metrics\ncompletion(40%), test_pass(30%), self_correct(30%)
end
Agent->>DB: _update_agent_maturity(metrics, score, level)
Agent->>Audit: _log_maturity_assessment(event)
Audit-->>Agent: ack
Agent-->>API: assessment result (level, score, changed)
end
API->>UI: AgentAssignmentResponse (with parsed metrics)
UI->>User: Render maturity badge & metrics
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related issues
Possibly related PRs
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
Code Review: Agent Maturity Level SystemSummaryThis PR implements a comprehensive maturity assessment system for worker agents based on task performance. The implementation is solid with excellent test coverage (19 tests total). Below are my findings. ✅ Strengths1. Excellent Test Coverage
2. Well-Documented Code
3. Clean Architecture
4. Robust Error Handling
5. UI Integration
|
🔍 Issues & RecommendationsHIGH PRIORITY: Inefficient Database Queries (N+1 Problem)Location: Issue: for task in completed_tasks:
test_results = self.db.get_test_results_by_task(task.id) # N queries
# ...
correction_attempts = self.db.get_correction_attempts_by_task(task.id) # N queriesFor an agent with 100 completed tasks, this results in 200+ database queries. Recommendation: # In test_repository.py
def get_test_results_by_tasks(self, task_ids: List[int]) -> Dict[int, List[dict]]:
"""Get test results for multiple tasks in one query."""
# Single query with WHERE task_id IN (...)
# In correction_repository.py
def get_correction_attempts_by_tasks(self, task_ids: List[int]) -> Dict[int, List[dict]]:
"""Get correction attempts for multiple tasks in one query."""This reduces queries from O(N) to O(1) and significantly improves performance for agents with large task histories. Priority: High (performance) MEDIUM: Self-Correction Rate Logic May Be Too LenientLocation: Issue:
This seems inconsistent. If a correction attempt record exists with attempt_number=1, it suggests there WAS an error that needed correction. Recommendation: Priority: Medium (business logic clarity) |
📝 Minor Issues & SuggestionsLOW: Potential Division by Zero (Defensive Programming)Location: While you check if total_tests > 0:
task_pass_rate = passed / total_tests
test_pass_rates.append(task_pass_rate)LOW: Missing Schema Validation in API ResponseLocation: The JSON parsing is wrapped in try-catch, but there is no validation that the parsed metrics match the expected schema. Consider adding validation for expected keys. Suggestion: Add Maturity Score Thresholds as Constants# worker_agent.py - top of file
MATURITY_THRESHOLDS = {
AgentMaturity.D4: 0.9,
AgentMaturity.D3: 0.7,
AgentMaturity.D2: 0.5,
AgentMaturity.D1: 0.0,
}Benefit: Easier to adjust thresholds later, clearer intent Future Enhancement: Consider Weighted Task AgeCurrently, all tasks contribute equally to the score regardless of age. Consider giving more weight to recent tasks in a future PR: def _calculate_task_weight(task_created_at: datetime) -> float:
days_ago = (datetime.now(timezone.utc) - task_created_at).days
return math.exp(-days_ago / 30) # Half-life of 30 days🔒 Security Considerations✅ No Major Security Issues Found
🎯 Performance Considerations
|
✅ ConclusionOverall Assessment: APPROVE with suggestions This is a well-implemented feature with:
The code is production-ready. The performance optimization for database queries would be beneficial but is not a blocking issue unless you expect agents to have hundreds of tasks. Files Changed:
Recommended Action:
Great work on this comprehensive feature! The maturity system will provide valuable insights into agent performance. 🎉 Test Coverage Summary: |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
codeframe/ui/routers/agents.py (1)
183-184: Moveimport jsonto module level.The local import works but placing it at the top of the file follows Python conventions and avoids repeated import lookups in hot paths.
🔎 Proposed fix
Add at the top of the file with other imports:
import jsonThen remove the local import on line 183.
codeframe/agents/worker_agent.py (4)
756-769: Clarify self-correction rate semantics.The current logic counts tasks with no correction attempts OR exactly one attempt at
attempt_number=1as "first attempt success." However, if a task hascorrection_attempts=[]because attempts weren't tracked (rather than succeeding on first try), this inflates the self-correction rate. Consider verifying that the absence of correction attempts genuinely indicates first-attempt success.
890-910: Audit log failure is silently swallowed.The method catches all exceptions and logs a warning, which is appropriate for a non-critical audit path. However, consider whether the caller should be informed of audit failures for observability purposes (e.g., returning a success flag or including in the result dict).
955-964: Timezone parsing is robust but could use a utility function.The
replace("Z", "+00:00")pattern for ISO timestamp parsing is correct but repeated logic. Consider extracting to a helper function if used elsewhere, or usedatetime.fromisoformatwith Python 3.11+ which handles "Z" suffix natively.🔎 Python 3.11+ simplification (per coding guidelines)
try: - last_assessed = datetime.fromisoformat(last_assessed_str.replace("Z", "+00:00")) + last_assessed = datetime.fromisoformat(last_assessed_str) hours_since_assessment = ( datetime.now(timezone.utc) - last_assessed ).total_seconds() / 3600Note: Python 3.11+
fromisoformat()supports the "Z" suffix natively. Since the coding guidelines specify Python 3.11+, this simplification should work.
930-930: Consolidate repeated imports at module level or once at method entry.
jsonis imported in both_update_agent_maturity(line 858) andshould_assess_maturity(line 930). Similarly,TaskStatusis imported in bothassess_maturity(line 693) andshould_assess_maturity(line 970). While this works, consider moving these to the top of the file or to a single location for clarity.Also applies to: 970-970
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
codeframe/agents/worker_agent.pycodeframe/persistence/database.pycodeframe/persistence/repositories/agent_repository.pycodeframe/persistence/repositories/task_repository.pycodeframe/ui/models.pycodeframe/ui/routers/agents.pytests/agents/test_worker_agent.pytests/integration/test_agent_maturity.pyweb-ui/src/components/AgentCard.tsx
🧰 Additional context used
📓 Path-based instructions (13)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Use TypeScript 5.3+ with strict mode for frontend development
Files:
web-ui/src/components/AgentCard.tsx
web-ui/src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
web-ui/src/**/*.{ts,tsx}: Use React 18 with TypeScript and Context + useReducer pattern for state management
Use shadcn/ui components from @/components/ui/ directory
Use Hugeicons (@hugeicons/react) for all icons instead of lucide-react
Implement WebSocket automatic reconnection with exponential backoff (1s → 30s)
Files:
web-ui/src/components/AgentCard.tsx
web-ui/**/*.{css,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Use Tailwind CSS with Nova design system template for styling
Files:
web-ui/src/components/AgentCard.tsx
{codeframe/**/*.py,web-ui/src/**/*.{ts,tsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{codeframe/**/*.py,web-ui/src/**/*.{ts,tsx}}: Use WebSockets for real-time updates between frontend and backend
Use last-write-wins strategy with backend timestamps for timestamp conflict resolution in multi-agent scenarios
Files:
web-ui/src/components/AgentCard.tsxcodeframe/persistence/repositories/task_repository.pycodeframe/ui/routers/agents.pycodeframe/persistence/database.pycodeframe/persistence/repositories/agent_repository.pycodeframe/ui/models.pycodeframe/agents/worker_agent.py
web-ui/src/**/*.{tsx,css}
📄 CodeRabbit inference engine (CLAUDE.md)
Use Nova color palette variables (bg-card, text-foreground, etc.) instead of hardcoded color values
Files:
web-ui/src/components/AgentCard.tsx
web-ui/src/**/*.tsx
📄 CodeRabbit inference engine (CLAUDE.md)
web-ui/src/**/*.tsx: Use cn() utility for conditional Tailwind CSS classes
Wrap AgentStateProvider with ErrorBoundary component for graceful error handling
Use useMemo for derived state calculations in React components
Files:
web-ui/src/components/AgentCard.tsx
web-ui/src/components/**/*.tsx
📄 CodeRabbit inference engine (CLAUDE.md)
Implement React.memo on all Dashboard sub-components for performance optimization
Files:
web-ui/src/components/AgentCard.tsx
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Python 3.11+ with type hints and async/await for backend development
Files:
codeframe/persistence/repositories/task_repository.pycodeframe/ui/routers/agents.pycodeframe/persistence/database.pytests/integration/test_agent_maturity.pycodeframe/persistence/repositories/agent_repository.pycodeframe/ui/models.pycodeframe/agents/worker_agent.pytests/agents/test_worker_agent.py
codeframe/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
codeframe/**/*.py: Use FastAPI with AsyncAnthropic for backend API development
Use SQLite with aiosqlite for async database operations
Use tiktoken for token counting in the backend
Use ruff for Python code linting and formatting
Use tiered memory system (HOT/WARM/COLD) for context management to achieve 30-50% token reduction
Implement session lifecycle management with file-based storage in .codeframe/session_state.json for CLI auto-save/restore
Files:
codeframe/persistence/repositories/task_repository.pycodeframe/ui/routers/agents.pycodeframe/persistence/database.pycodeframe/persistence/repositories/agent_repository.pycodeframe/ui/models.pycodeframe/agents/worker_agent.py
codeframe/persistence/repositories/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement Repository pattern with domain-specific repositories for data access
Files:
codeframe/persistence/repositories/task_repository.pycodeframe/persistence/repositories/agent_repository.py
codeframe/persistence/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Pre-production application: use flattened v1.0 database schema with direct table creation (no migration system)
Files:
codeframe/persistence/repositories/task_repository.pycodeframe/persistence/database.pycodeframe/persistence/repositories/agent_repository.py
tests/**/*.{py,ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Use TestSprite and Playwright for E2E testing of workflows
Files:
tests/integration/test_agent_maturity.pytests/agents/test_worker_agent.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Run pytest with coverage tracking for Python backend tests
Files:
tests/integration/test_agent_maturity.pytests/agents/test_worker_agent.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: docs/CLAUDE.md:0-0
Timestamp: 2025-11-25T19:08:37.203Z
Learning: Implement Lead Agent for orchestration and Worker Agents for specialization (Backend, Frontend, Test, Review) with maturity levels D1-D4
📚 Learning: 2025-12-24T04:24:43.804Z
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-24T04:24:43.804Z
Learning: Applies to web-ui/src/{contexts,reducers,hooks}/**/*.{ts,tsx} : Implement AgentStateContext with useReducer for multi-agent state management
Applied to files:
web-ui/src/components/AgentCard.tsx
📚 Learning: 2025-12-24T04:24:43.804Z
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-24T04:24:43.804Z
Learning: Applies to web-ui/src/**/*.tsx : Wrap AgentStateProvider with ErrorBoundary component for graceful error handling
Applied to files:
web-ui/src/components/AgentCard.tsx
📚 Learning: 2025-11-25T19:08:37.203Z
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: docs/CLAUDE.md:0-0
Timestamp: 2025-11-25T19:08:37.203Z
Learning: Implement Lead Agent for orchestration and Worker Agents for specialization (Backend, Frontend, Test, Review) with maturity levels D1-D4
Applied to files:
tests/integration/test_agent_maturity.pycodeframe/agents/worker_agent.pytests/agents/test_worker_agent.py
📚 Learning: 2025-12-17T19:21:40.014Z
Learnt from: frankbria
Repo: frankbria/codeframe PR: 128
File: tests/agents/test_bottleneck_detection.py:486-500
Timestamp: 2025-12-17T19:21:40.014Z
Learning: In tests/agents/test_bottleneck_detection.py, test_skip_agents_below_threshold should be async and mock _get_agent_workload to return workload below AGENT_OVERLOAD_THRESHOLD (5) while providing non-empty tasks list to prevent early return in detect_bottlenecks().
Applied to files:
tests/integration/test_agent_maturity.py
📚 Learning: 2025-12-17T19:21:30.131Z
Learnt from: frankbria
Repo: frankbria/codeframe PR: 128
File: tests/agents/test_bottleneck_detection.py:486-500
Timestamp: 2025-12-17T19:21:30.131Z
Learning: In tests/agents/*.py, when testing bottleneck detection logic, ensure that tests exercising detect_bottlenecks are async and mock _get_agent_workload to return a value below AGENT_OVERLOAD_THRESHOLD (5) while providing a non-empty tasks list to prevent early return. This guarantees the code path for low workload is exercised and behavior under threshold is verified.
Applied to files:
tests/agents/test_worker_agent.py
🧬 Code graph analysis (3)
codeframe/persistence/repositories/task_repository.py (1)
codeframe/persistence/database.py (2)
get_tasks_by_agent(408-410)_row_to_task(380-382)
codeframe/persistence/database.py (1)
codeframe/persistence/repositories/task_repository.py (1)
get_tasks_by_agent(543-582)
tests/agents/test_worker_agent.py (2)
tests/integration/test_agent_maturity.py (1)
db(18-22)codeframe/agents/worker_agent.py (3)
WorkerAgent(39-1622)assess_maturity(661-824)should_assess_maturity(912-977)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Frontend Unit Tests
- GitHub Check: Backend Unit Tests
- GitHub Check: E2E Smoke Tests (Chromium)
- GitHub Check: claude-review
- GitHub Check: Greptile Review
🔇 Additional comments (24)
codeframe/persistence/repositories/agent_repository.py (1)
191-191: LGTM!The addition of
a.metricsto the SELECT list is correct and consistent with the existing query structure. This enables the API layer to access agent maturity metrics when retrieving agents for a project.codeframe/ui/routers/agents.py (1)
198-212: LGTM!The metrics parsing logic is well-structured with proper error handling. The defensive approach of falling back to
Noneon parse failures ensures API stability while allowing the UI to gracefully handle missing metrics.tests/agents/test_worker_agent.py (6)
925-952: LGTM!This test correctly validates the baseline case where an agent with no task history defaults to D1 (directive/novice) maturity level with zero metrics.
953-1006: LGTM!The test correctly validates D1 assignment for an agent with poor completion rate (20%) and no test data, verifying the weighted score calculation stays below 0.5.
1008-1089: LGTM!The test comprehensively validates D2 (intermediate) assessment with moderate performance metrics. The score calculation aligns with the weighted formula: ~60% completion × 0.4 + ~41% test pass × 0.3 + 50% first-attempt × 0.3 ≈ 0.51.
1091-1152: LGTM!The test validates advanced/expert maturity levels. The calculated score of 0.90 (90% completion × 0.4 + 80% test pass × 0.3 + 100% first-attempt × 0.3) correctly lands at the D4 threshold, and the flexible assertion accommodates boundary behavior.
1154-1213: LGTM!The test correctly validates D4 (expert/delegating) assignment for an agent with perfect performance metrics: 100% completion, 100% test pass rate, and 100% first-attempt success resulting in a maturity score of 1.0.
1215-1431: Comprehensive test coverage for maturity system.The tests thoroughly cover:
- Database persistence of maturity updates
- Audit trail logging
- Graceful handling of missing test results
- Self-correction rate calculation with correction attempts
- Assessment trigger logic
- Error handling for uninitialized database
Based on learnings, this aligns with the D1-D4 maturity level system for Worker Agents.
codeframe/ui/models.py (2)
165-177: LGTM!The
AgentMetricsResponsemodel is well-defined with appropriate field types and descriptions. Making all fields optional provides flexibility for agents that haven't been assessed yet.
191-193: LGTM!The addition of the optional
metricsfield toAgentAssignmentResponsecorrectly integrates the new maturity metrics with the existing agent assignment data model.codeframe/persistence/database.py (1)
408-411: LGTM!The delegation method follows the established facade pattern, maintaining consistency with the other 50+ delegated methods in this class while exposing the new
get_tasks_by_agentcapability fromTaskRepository.web-ui/src/components/AgentCard.tsx (5)
3-22: LGTM!The
AgentMetricsinterface and extendedAgenttype correctly mirror the backendAgentMetricsResponsemodel. The optionalmaturityLevelfield properly restricts values to the four valid maturity states.
56-66: LGTM!The maturity badge mapping correctly translates the backend D1-D4 levels to user-friendly labels (Novice → Expert) with appropriate Nova palette styling and intuitive icons.
114-130: LGTM!The maturity badge renders conditionally when available, with a helpful tooltip showing the maturity score. The flex-wrap ensures clean layout when both badges are present.
156-187: LGTM!The Performance Metrics section renders conditionally with proper undefined checks for each metric. The 2-column grid layout and percentage formatting provide a clean, scannable display. The "First Try" label for
self_correction_rateis an intuitive user-facing term.
207-221: LGTM!The memoization comparison correctly includes the new
maturityLevelandmetricsfields. UsingJSON.stringifyfor comparingmetricsis acceptable given the shallow structure and small size of the metrics object. As per coding guidelines,React.memois properly applied to the Dashboard sub-component.tests/integration/test_agent_maturity.py (4)
1-145: LGTM!The test module is well-structured with appropriate fixtures. The
test_maturity_progression_novice_to_experttest comprehensively validates the full D1→D4 maturity journey across four phases with progressively improving performance metrics. Based on learnings, this aligns with the D1-D4 maturity level system for Worker Agents.
146-213: LGTM!The demotion test validates an important behavior: agent maturity can decrease when performance degrades. The flexible assertion (
level < initial_level OR score < initial_score) correctly handles boundary cases where maturity thresholds may not be crossed despite score reduction.
215-322: LGTM!The persistence tests validate two critical flows:
- Maturity level persists across agent instance recreations
- Metrics are properly accessible through the API layer (
get_agents_for_project)This ensures end-to-end data integrity for the maturity system.
324-535: LGTM!The remaining test classes provide thorough coverage:
- Audit trail: Validates audit log creation with correct event types and metadata
- Triggers: Tests
should_assess_maturitylogic with task count thresholds- Task queries: Validates
get_tasks_by_agentwith and without project filteringThis completes the integration test coverage for the maturity system, covering 8 tests across persistence, auditing, triggers, and data access.
codeframe/agents/worker_agent.py (4)
826-847: LGTM!The
_get_current_maturity()method properly handles all edge cases: missing database, missing agent, missing maturity value, and invalid enum values with appropriate logging.
661-692: Well-structured method with clear documentation.The docstring clearly explains the weighted scoring formula, maturity level thresholds, and return structure. The method signature change from
NonetoDict[str, Any]aligns with the PR objectives for exposing maturity metrics. Based on learnings, this implements the D1-D4 maturity levels as specified for Worker Agents.
736-751: No issues found. The code correctly accessestest_resultsas a list of dictionaries with "passed" and "failed" keys. Theget_test_results_by_task()method explicitly returnsList[Dict[str, Any]]as confirmed by the database tests, which verify dictionary keys likeresults[0]["passed"]andresults[0]["failed"]exist. The defensive use of.get()with default values is appropriate.
864-870: The code is correct as written. Line 867 passes thematurityenum directly toupdate_agent, which is safe because theupdate_agentmethod at lines 111-112 ofcodeframe/persistence/repositories/agent_repository.pyexplicitly checksisinstance(value, AgentMaturity)and extracts.valuebefore storing. No changes needed.
| def get_tasks_by_agent( | ||
| self, agent_id: str, project_id: Optional[int] = None, limit: int = 100 | ||
| ) -> List[Task]: | ||
| """Get all tasks assigned to an agent. | ||
|
|
||
| Used for calculating agent maturity metrics based on task history. | ||
|
|
||
| Args: | ||
| agent_id: Agent ID to filter by (matches assigned_to field) | ||
| project_id: Optional project ID to filter by | ||
| limit: Maximum number of tasks to return (default: 100) | ||
|
|
||
| Returns: | ||
| List of Task objects ordered by created_at DESC (most recent first) | ||
| """ | ||
| cursor = self.conn.cursor() | ||
|
|
||
| if project_id is not None: | ||
| cursor.execute( | ||
| """ | ||
| SELECT * FROM tasks | ||
| WHERE assigned_to = ? AND project_id = ? | ||
| ORDER BY created_at DESC | ||
| LIMIT ? | ||
| """, | ||
| (agent_id, project_id, limit), | ||
| ) | ||
| else: | ||
| cursor.execute( | ||
| """ | ||
| SELECT * FROM tasks | ||
| WHERE assigned_to = ? | ||
| ORDER BY created_at DESC | ||
| LIMIT ? | ||
| """, | ||
| (agent_id, limit), | ||
| ) | ||
|
|
||
| rows = cursor.fetchall() | ||
| return [self._row_to_task(row) for row in rows] |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the actual implementation at the specified lines
head -n 582 codeframe/persistence/repositories/task_repository.py | tail -n 40Repository: frankbria/codeframe
Length of output: 1357
🏁 Script executed:
# Check for database schema definitions to see if indexes exist
find codeframe -name "*.py" -type f | xargs rg -l "CREATE TABLE.*tasks" -A 20Repository: frankbria/codeframe
Length of output: 102
🏁 Script executed:
# Look for index definitions on assigned_to or tasks table
rg -n "CREATE INDEX" codeframe/persistence/ -A 3Repository: frankbria/codeframe
Length of output: 20982
🏁 Script executed:
# Check the class definition and other methods to understand async/sync pattern
head -n 100 codeframe/persistence/repositories/task_repository.pyRepository: frankbria/codeframe
Length of output: 2454
🏁 Script executed:
# Check if aiosqlite is being used elsewhere in the file
rg -n "async\|aiosqlite\|await" codeframe/persistence/repositories/task_repository.py | head -20Repository: frankbria/codeframe
Length of output: 45
🏁 Script executed:
# Search specifically for assigned_to index in schema
rg -n "assigned_to" codeframe/persistence/schema_manager.pyRepository: frankbria/codeframe
Length of output: 100
🏁 Script executed:
# Check if there are any async methods in TaskRepository to see the pattern
rg -n "async def" codeframe/persistence/repositories/task_repository.py | head -20Repository: frankbria/codeframe
Length of output: 135
🏁 Script executed:
# Check the connection type - is it aiosqlite or sqlite3?
rg -n "self\.conn" codeframe/persistence/repositories/task_repository.py | head -20Repository: frankbria/codeframe
Length of output: 796
🏁 Script executed:
# Check what BaseRepository looks like
cat codeframe/persistence/repositories/base.pyRepository: frankbria/codeframe
Length of output: 9930
Convert to async implementation and add index on assigned_to column for performance.
This method must be async per coding guidelines requiring aiosqlite for database operations in codeframe/persistence/**/*.py. Additionally, the assigned_to column lacks an index, which will cause performance issues when retrieving task history as the dataset grows.
Recommended changes:
- Convert method to
async defusingawait self._execute_async()andawait self._fetchall_async() - Add
CREATE INDEX IF NOT EXISTS idx_tasks_assigned_to ON tasks(assigned_to)to schema manager (or compound index with project_id for filtered queries)
🤖 Prompt for AI Agents
In codeframe/persistence/repositories/task_repository.py around lines 543 to
582, the synchronous get_tasks_by_agent should be converted to an async
implementation that uses aiosqlite helper methods and an index added to the
schema: change the signature to async def get_tasks_by_agent(...), replace
cursor.execute/fetchall with await self._execute_async(sql, params) and rows =
await self._fetchall_async(), then return [self._row_to_task(row) for row in
rows]; additionally add a schema migration or update in the schema manager to
create the index (e.g., CREATE INDEX IF NOT EXISTS idx_tasks_assigned_to ON
tasks(assigned_to) or a compound index including project_id) to improve query
performance.
Greptile Summary
Important Files Changed
Confidence score: 4/5
Sequence DiagramsequenceDiagram
participant User
participant API as "Agent API"
participant Agent as "WorkerAgent"
participant TaskRepo as "TaskRepository"
participant TestRepo as "TestRepository"
participant CorrRepo as "CorrectionRepository"
participant AuditRepo as "AuditRepository"
participant UI as "AgentCard UI"
User->>API: "GET /projects/{id}/agents"
API->>Agent: "assess_maturity()"
Agent->>TaskRepo: "get_tasks_by_agent(agent_id)"
TaskRepo-->>Agent: "task history list"
Agent->>Agent: "calculate completion_rate"
Note over Agent: "completed_tasks / total_tasks * 40%"
Agent->>TestRepo: "get_test_results_by_task(task_id)"
TestRepo-->>Agent: "test results"
Agent->>Agent: "calculate avg_test_pass_rate"
Note over Agent: "average pass rate * 30%"
Agent->>CorrRepo: "get_correction_attempts_by_task(task_id)"
CorrRepo-->>Agent: "correction attempts"
Agent->>Agent: "calculate self_correction_rate"
Note over Agent: "first attempt success * 30%"
Agent->>Agent: "compute weighted maturity_score"
Agent->>Agent: "map score to maturity level (D1-D4)"
Agent->>API: "update_agent_maturity()"
Agent->>AuditRepo: "create_audit_log()"
API-->>User: "agents with maturity data"
User->>UI: "render agent cards"
UI->>UI: "display maturity badges and metrics"
|
| Raises: | ||
| HTTPException: 404 if project not found, 500 on database error | ||
| """ | ||
| import json |
There was a problem hiding this comment.
style: Import placed inside function instead of at module level
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: codeframe/ui/routers/agents.py
Line: 183:183
Comment:
**style:** Import placed inside function instead of at module level
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| {agent.metrics.avg_test_pass_rate !== undefined && ( | ||
| <div> | ||
| <span className="text-muted-foreground">Test Pass:</span>{' '} | ||
| <span className="font-medium">{(agent.metrics.avg_test_pass_rate * 100).toFixed(0)}%</span> |
There was a problem hiding this comment.
logic: The test pass rate is already a percentage (0-100) but you're multiplying by 100 again - this will show 10000% for a 100% pass rate
| <span className="font-medium">{(agent.metrics.avg_test_pass_rate * 100).toFixed(0)}%</span> | |
| <span className="font-medium">{(agent.metrics.avg_test_pass_rate).toFixed(0)}%</span> |
Should avg_test_pass_rate be stored as a decimal (0-1) like the other rates, or as a percentage (0-100)?
Prompt To Fix With AI
This is a comment left during a code review.
Path: web-ui/src/components/AgentCard.tsx
Line: 170:170
Comment:
**logic:** The test pass rate is already a percentage (0-100) but you're multiplying by 100 again - this will show 10000% for a 100% pass rate
```suggestion
<span className="font-medium">{(agent.metrics.avg_test_pass_rate).toFixed(0)}%</span>
```
Should `avg_test_pass_rate` be stored as a decimal (0-1) like the other rates, or as a percentage (0-100)?
How can I resolve this? If you propose a fix, please make it concise.| if status == TaskStatus.COMPLETED: | ||
| passed = 5 if i % 2 == 0 else 3 | ||
| failed = 5 if i % 2 == 0 else 7 | ||
| db.create_test_result( | ||
| task_id=task_id, | ||
| status="passed" if passed > failed else "failed", | ||
| passed=passed, | ||
| failed=failed, | ||
| ) |
There was a problem hiding this comment.
style: Test data creates inconsistent pass rates - tasks alternate between 50% (5 pass, 5 fail) and 30% (3 pass, 7 fail) pass rates, but comment claims ~50% overall. This makes the expected score calculation less predictable. Should the test use consistent pass rates to make the expected maturity score more deterministic?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/agents/test_worker_agent.py
Line: 1048:1056
Comment:
**style:** Test data creates inconsistent pass rates - tasks alternate between 50% (5 pass, 5 fail) and 30% (3 pass, 7 fail) pass rates, but comment claims ~50% overall. This makes the expected score calculation less predictable. Should the test use consistent pass rates to make the expected maturity score more deterministic?
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # Score calculation: | ||
| # - Completion: 60% * 0.4 = 0.24 | ||
| # - Test pass: ~41% * 0.3 = 0.12 | ||
| # - Self-correction: 50% * 0.3 = 0.15 | ||
| # Total: ~0.51, which is D2 (0.5-0.7) |
There was a problem hiding this comment.
logic: Score calculation comment shows ~41% test pass rate, but the actual calculation with alternating 50%/30% rates would be different. The math should be verified or test data adjusted for consistency.
| # Score calculation: | |
| # - Completion: 60% * 0.4 = 0.24 | |
| # - Test pass: ~41% * 0.3 = 0.12 | |
| # - Self-correction: 50% * 0.3 = 0.15 | |
| # Total: ~0.51, which is D2 (0.5-0.7) | |
| # Score calculation: | |
| # - Completion: 60% * 0.4 = 0.24 | |
| # - Test pass: 40% * 0.3 = 0.12 | |
| # - Self-correction: 50% * 0.3 = 0.15 | |
| # Total: 0.51, which is D2 (0.5-0.7) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/agents/test_worker_agent.py
Line: 1082:1086
Comment:
**logic:** Score calculation comment shows ~41% test pass rate, but the actual calculation with alternating 50%/30% rates would be different. The math should be verified or test data adjusted for consistency.
```suggestion
# Score calculation:
# - Completion: 60% * 0.4 = 0.24
# - Test pass: 40% * 0.3 = 0.12
# - Self-correction: 50% * 0.3 = 0.15
# Total: 0.51, which is D2 (0.5-0.7)
```
How can I resolve this? If you propose a fix, please make it concise.| # 90% completion * 0.4 + 80% test pass * 0.3 + 100% first attempt * 0.3 | ||
| # = 0.36 + 0.24 + 0.30 = 0.90 -> D3 or D4 | ||
| # Score should be around 0.7-0.9 for advanced | ||
| assert result["maturity_level"] in [AgentMaturity.D3, AgentMaturity.D4] | ||
| assert result["maturity_score"] >= 0.7 |
There was a problem hiding this comment.
style: Test accepts either D3 or D4 for the same score calculation (0.90), but the assertion should be more specific since 0.90 exactly hits the D4 threshold. Should this test specifically expect D4 since the calculated score (0.90) exactly meets the D4 threshold?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/agents/test_worker_agent.py
Line: 1148:1152
Comment:
**style:** Test accepts either D3 or D4 for the same score calculation (0.90), but the assertion should be more specific since 0.90 exactly hits the D4 threshold. Should this test specifically expect D4 since the calculated score (0.90) exactly meets the D4 threshold?
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.- Add get_tasks_by_agent_async() method to TaskRepository using aiosqlite - Add async facade method to Database class - Add idx_tasks_assigned_to compound index (assigned_to, project_id, created_at) - Add 2 async integration tests for the new method
Pull Request Review: Agent Maturity Assessment SystemThank you for this comprehensive implementation of the agent maturity assessment system! Overall, this is a well-structured feature with excellent test coverage and documentation. Below are my findings organized by category. ✅ Strengths
|
| Priority | Issue | Action |
|---|---|---|
| 🔴 Critical | Database index syntax | Remove DESC from index definition |
| 🔴 Critical | Self-correction logic bug | Only count tasks with zero correction attempts |
| 🟡 High | Missing import | Add AgentMaturity to module imports |
| 🟡 Medium | N+1 query performance | Add batch query methods |
| 🟢 Low | Inconsistent metrics field | Verify all agent queries include metrics |
| 🟢 Low | Frontend type safety | Add fallback for invalid maturity levels |
✨ Final Verdict
This is a high-quality implementation with strong documentation and testing. The critical issues (index syntax and self-correction logic) should be fixed before merging, but they're straightforward to address.
Once the critical issues are resolved, this PR will be an excellent addition to the codebase!
Recommendation: Request changes to fix the critical issues, then approve.
Great work on this feature! 🎉
- Remove DESC from index definition (SQLite ignores it in column specs) - Fix self-correction rate: tasks with any correction_attempts records should NOT count as first-attempt successes - Move TaskStatus import to module level for consistency - Remove inline imports of TaskStatus (was duplicated 3 times)
Code Review: Agent Maturity SystemI've reviewed PR #152 implementing the agent maturity assessment system. Overall, this is a well-designed feature with comprehensive testing. Here's my detailed feedback: ✅ Strengths
🐛 Potential Issues1. Score Calculation Logic Error (worker_agent.py:753-761)The self-correction rate calculation may be misleading: for task in completed_tasks:
correction_attempts = self.db.get_correction_attempts_by_task(task.id)
if not correction_attempts:
first_attempt_success_count += 1Problem: Tasks without correction attempt records count as "first attempt success" even if they failed or are still pending. This could inflate the self-correction rate for agents that don't use the correction system. Suggestion: Only count completed tasks that explicitly succeeded on the first attempt: for task in completed_tasks:
correction_attempts = self.db.get_correction_attempts_by_task(task.id)
if len(correction_attempts) == 0:
# Only count if task is COMPLETED (already filtered)
first_attempt_success_count += 12. Missing Error Handling in API (routers/agents.py:195-212)The metrics JSON parsing happens after for agent in agents:
metrics_json = agent.get("metrics")
if metrics_json:
try:
agent["metrics"] = (
json.loads(metrics_json)
if isinstance(metrics_json, str)
else metrics_json
)
except (json.JSONDecodeError, TypeError):
agent["metrics"] = NoneIssue: If an agent has invalid JSON, it silently sets metrics to None. This could hide data corruption issues. Suggestion: Log a warning when JSON parsing fails: except (json.JSONDecodeError, TypeError) as e:
logger.warning(f"Failed to parse metrics for agent {agent.get('id')}: {e}")
agent["metrics"] = None3. Race Condition Risk (worker_agent.py:702)The maturity assessment reads old maturity, calculates new maturity, then updates: old_maturity = self._get_current_maturity() # Read
tasks = self.db.get_tasks_by_agent(self.agent_id)
# ... calculations ...
self._update_agent_maturity(new_maturity, metrics) # WriteIssue: If two processes assess maturity concurrently, they could overwrite each other's updates. Suggestion: Add optimistic locking or timestamp-based conflict detection. Alternatively, document that maturity assessment should be single-threaded per agent.
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
codeframe/agents/worker_agent.py (1)
663-823: Well-structured maturity assessment implementation.The weighted scoring formula and maturity level mapping align with the PR objectives (D1-D4 levels). The self-correction rate logic correctly counts only tasks with no correction attempts as first-attempt successes, per the commit fix.
Minor performance consideration: Lines 737-747 and 756-760 execute separate DB queries for each completed task (N+1 pattern). This is acceptable for small task counts, but if agents accumulate many tasks over time, consider batch-fetching test results and correction attempts in single queries.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
codeframe/agents/worker_agent.pycodeframe/persistence/schema_manager.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Python 3.11+ with type hints and async/await for backend development
Files:
codeframe/persistence/schema_manager.pycodeframe/agents/worker_agent.py
codeframe/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
codeframe/**/*.py: Use FastAPI with AsyncAnthropic for backend API development
Use SQLite with aiosqlite for async database operations
Use tiktoken for token counting in the backend
Use ruff for Python code linting and formatting
Use tiered memory system (HOT/WARM/COLD) for context management to achieve 30-50% token reduction
Implement session lifecycle management with file-based storage in .codeframe/session_state.json for CLI auto-save/restore
Files:
codeframe/persistence/schema_manager.pycodeframe/agents/worker_agent.py
{codeframe/**/*.py,web-ui/src/**/*.{ts,tsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{codeframe/**/*.py,web-ui/src/**/*.{ts,tsx}}: Use WebSockets for real-time updates between frontend and backend
Use last-write-wins strategy with backend timestamps for timestamp conflict resolution in multi-agent scenarios
Files:
codeframe/persistence/schema_manager.pycodeframe/agents/worker_agent.py
codeframe/persistence/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Pre-production application: use flattened v1.0 database schema with direct table creation (no migration system)
Files:
codeframe/persistence/schema_manager.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: docs/CLAUDE.md:0-0
Timestamp: 2025-11-25T19:08:37.203Z
Learning: Implement Lead Agent for orchestration and Worker Agents for specialization (Backend, Frontend, Test, Review) with maturity levels D1-D4
📚 Learning: 2025-11-25T19:08:37.203Z
Learnt from: CR
Repo: frankbria/codeframe PR: 0
File: docs/CLAUDE.md:0-0
Timestamp: 2025-11-25T19:08:37.203Z
Learning: Implement Lead Agent for orchestration and Worker Agents for specialization (Backend, Frontend, Test, Review) with maturity levels D1-D4
Applied to files:
codeframe/agents/worker_agent.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Frontend Unit Tests
- GitHub Check: Backend Unit Tests
- GitHub Check: claude-review
- GitHub Check: E2E Smoke Tests (Chromium)
🔇 Additional comments (5)
codeframe/persistence/schema_manager.py (1)
504-507: LGTM! Well-designed index for agent maturity queries.The composite index on
(assigned_to, project_id, created_at)correctly supports theget_tasks_by_agentquery pattern. The column ordering is optimal for filtering by agent first, then optionally by project, with efficient ordering by creation time.codeframe/agents/worker_agent.py (4)
23-25: LGTM!Module-level import of
TaskStatusis a clean refactor that eliminates inline imports and follows Python best practices.
873-909: LGTM!Audit logging is well-implemented with appropriate metadata and graceful error handling that won't block the maturity assessment flow.
911-975: LGTM! Well-designed assessment trigger logic.The three conditions (24-hour threshold, minimum completed tasks since last assessment, and first-time assessment) provide sensible controls for when to re-evaluate maturity. The timezone handling at line 955 correctly handles both
Zand+00:00ISO format suffixes.Note: The inline
import jsonat line 929 should be addressed along with the same issue in_update_agent_maturity().
848-871: Movejsonimport to module level.The inline
import jsonat line 857 should be moved to the module-level imports, consistent with best practices. This import is also used inshould_assess_maturity()at line 929.The enum serialization concern is not applicable—
update_agent()already handlesAgentMaturityenum-to-string conversion automatically (seecodeframe/persistence/repositories/agent_repository.pylines 110-114).+import json from codeframe.core.models import ( Task, TaskStatus, AgentMaturity, ContextItemType, ContextTier, CallType )def _update_agent_maturity( self, maturity: AgentMaturity, metrics: Dict[str, Any] ) -> None: - import json - if not self.db: return try: self.db.update_agent( self.agent_id, { "maturity_level": maturity, "metrics": json.dumps(metrics), }, )Likely an incorrect or invalid review comment.
Add comprehensive maturity assessment for worker agents based on task performance history. The system calculates a weighted score from completion rate (40%), test pass rate (30%), and self-correction rate (30%), mapping to four maturity levels (D1-D4).
Key changes:
Maturity levels:
This closes issue #112
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.