Rewrite factory evals with meaningful metrics #10

Merged: akashgit merged 1 commit into main from experiment/2 on Apr 11, 2026

@akashgit
Owner

Summary

  • Replace 4 binary exit-code eval dimensions with 6 parsed, metric-aware dimensions
  • tests (0.30): parses pass/fail counts from pytest output via regex
  • lint (0.10): full credit on a clean run; on failure, partial credit based on the error count parsed from the "Found X error" line
  • type_check (0.10): full credit on a clean run; on failure, partial credit based on the parsed mypy error count
  • coverage (0.25): parses the actual percentage from the TOTAL line; fails if it is below 80%
  • guard_patterns (0.15): tests _glob_match against 4 real scope patterns from factory.md
  • config_parser (0.10): verifies ExperimentStore.reparse_config() extracts goal, scope, eval_command, eval_threshold correctly

Coverage now correctly reports 75% and scores 0.75 (previously scored 1.0 with binary check).
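
For illustration, the coverage behavior described above could look roughly like this. The helper names, the regex, and the returned dict shape are assumptions, not the merged implementation; the 80% threshold and the 75% → 0.75 mapping are taken from this PR.

```python
import re
from typing import Optional

THRESHOLD = 80.0  # percent, per the PR description


def parse_coverage_total(report: str) -> Optional[float]:
    """Return the percentage on the TOTAL line of a coverage report, if any."""
    for line in report.splitlines():
        match = re.match(r"^TOTAL\s+.*?(\d+(?:\.\d+)?)%\s*$", line)
        if match:
            return float(match.group(1))
    return None


def coverage_dimension(report: str) -> dict:
    """Score is the parsed fraction; 'passed' applies the 80% threshold.

    The dict keys are hypothetical and only serve this sketch.
    """
    pct = parse_coverage_total(report)
    if pct is None:
        # No TOTAL line found: treat as a failed dimension with zero score.
        return {"score": 0.0, "passed": False}
    return {"score": pct / 100.0, "passed": pct >= THRESHOLD}
```

With a report whose TOTAL line ends in 75%, this sketch yields a score of 0.75 and a failed threshold check, where a binary exit-code check would have scored 1.0.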

Factory experiment #2. Closes #5.

Test plan

  • uv run python eval/score.py produces valid JSON with all 6 dimensions
  • Coverage dimension parses actual percentage (75%) and fails threshold (< 80%)
  • Guard patterns dimension tests _glob_match with real patterns (4/4 pass)
  • Config parser dimension verifies reparse_config correctness (4/4 fields)
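
The first item in the test plan could be checked mechanically along these lines. The JSON shape is an assumption: this sketch supposes a top-level object with a "dimensions" mapping keyed by dimension name, which may not match what eval/score.py actually emits.

```python
import json

EXPECTED_DIMENSIONS = {
    "tests",
    "lint",
    "type_check",
    "coverage",
    "guard_patterns",
    "config_parser",
}


def missing_dimensions(raw: str) -> set[str]:
    """Parse score output and return the expected dimensions that are absent.

    json.loads raises json.JSONDecodeError if the output is not valid JSON.
    The "dimensions" key is a hypothetical layout chosen for this sketch.
    """
    payload = json.loads(raw)
    dims = payload.get("dimensions", {})
    return EXPECTED_DIMENSIONS - set(dims)
```

An empty result means all 6 dimensions are present; the raw string would come from capturing the output of `uv run python eval/score.py`.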

🤖 Generated with Claude Code

Replace binary exit-code checks with 6 parsed dimensions: tests
(pass/fail ratio), lint (error count), type_check (error count),
coverage (actual percentage with 80% threshold), guard_patterns
(_glob_match correctness), and config_parser (reparse_config
field validation). Closes #5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@akashgit akashgit merged commit 8089d37 into main Apr 11, 2026
@akashgit akashgit deleted the experiment/2 branch April 11, 2026 22:35
akashgit added a commit that referenced this pull request Apr 24, 2026
akashgit added a commit that referenced this pull request Apr 25, 2026
@RohanAwhad RohanAwhad mentioned this pull request May 7, 2026
3 tasks
Development

Successfully merging this pull request may close these issues.

Rewrite factory's own evals with meaningful metrics