A focused evaluation harness built to expose the real failure modes of LLM code reasoning. This isn’t a pass/fail scoreboard; it’s a diagnostic layer for catching models that only appear to understand requirements.
Benchmarks like HumanEval, MBPP, and SWE-Bench measure surface accuracy. xFail is designed to classify failure behavior and tie it to concrete model breakdowns.
Target audience:
- Human Data teams refining training sets
- Evaluation engineers exploring model weaknesses
- Researchers building more resilient code assistance
Install and configure:

```bash
pip install -e .
cp .env.example .env
# Add API keys to .env
```

Run the proof-of-concept task set:

```bash
xfail run --task-set poc
```

Run a specific model:

```bash
xfail run --models grok --task-set poc
```

Run a single task:

```bash
xfail run --task xfail/tasks/poc/deceptive_001.yaml
```

Generate an HTML report:

```bash
xfail report --run-id <TIMESTAMP> --format html
```

Markdown output:

```bash
xfail report --run-id <TIMESTAMP> --format markdown
```

Both:

```bash
xfail report --run-id <TIMESTAMP> --format both
```

Compare two models on the same run:

```bash
xfail diff --model-a grok --model-b gemini --run <TIMESTAMP>
```

Every result is tagged with one or more failure codes.
| Code | Name | Description |
|---|---|---|
| SPEC-MIS | Specification misread | Model implements a plausible but incorrect interpretation |
| INV-LOSS | Invariant loss | Model violates interacting constraints or invariants |
| EDGE-OBO | Edge case / off-by-one | Model handles the happy path but fails boundaries |
| HALL-CON | Hallucinated constraint | Model invents requirements that are not present |
| ABS-FAIL | Poor abstraction | Model overfits a specific case instead of generalizing |
| BIZ-FRAME | Business framing gap | Model fails when intent is expressed as an outcome rather than a spec |
| CTX-DROP | Context drop | Model loses constraints during multi-turn interactions |
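For illustration, a tagged result might look like the record below; the field names are hypothetical, not the harness's actual result schema.

```python
# Hypothetical example of a classified result record; field names are
# illustrative, not the harness's actual schema.
example_result = {
    "task_id": "deceptive_001",
    "model": "grok",
    "passed": False,
    "failure_codes": ["SPEC-MIS", "EDGE-OBO"],
    "classifier_confidence": 0.72,
}
```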
Repository layout:

```
xfail/
├── models/          # API clients for Grok and Gemini
├── harness/         # task execution, scoring, and classification
├── reports/         # report generation logic
├── tasks/           # task definitions
│   └── poc/         # proof-of-concept tasks
├── cli.py           # command-line entrypoint
└── __init__.py
results/             # generated execution logs
reports/             # generated report artifacts
```
Tasks are defined in YAML:

```yaml
task_id: my_task
category: deceptive  # one of: swe, deceptive, sysdesign, algo, multiturn
description: Brief description
difficulty: medium  # easy, medium, hard
prompt: |
  The task description goes here.
  Can be multi-line.
reference_solution: |
  def solution():
      pass
test_cases:
  - input: "[1, 2, 3]"
    expected_output: "[3, 2, 1]"
    name: "basic_test"
scoring:
  auto_tests: 60
  contradiction_flag: 25
  reasoning_quality: 15
```
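As a rough illustration of how these weights might combine, assuming they are percentage contributions to a composite score out of 100 (a sketch only, not the harness's actual scoring code):

```python
# Sketch of a weighted composite score, assuming the three weights are
# percentage contributions summing to 100. Not the harness's actual
# scoring logic.
def composite_score(test_pass_rate, contradiction_score, reasoning_score, weights):
    # Each component score is expected in [0, 1]; the result lands in [0, 100].
    return (weights["auto_tests"] * test_pass_rate
            + weights["contradiction_flag"] * contradiction_score
            + weights["reasoning_quality"] * reasoning_score)

weights = {"auto_tests": 60, "contradiction_flag": 25, "reasoning_quality": 15}
composite_score(0.8, 1.0, 0.5, weights)  # 48 + 25 + 7.5 = 80.5
```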
Use the adversarial generator to create tricky variants from a base task:

```python
from xfail.harness.adversary import AdversaryGenerator
from xfail.harness.task import load_task
generator = AdversaryGenerator()
base_task = load_task("tasks/poc/deceptive_001.yaml")
variant = generator.generate_deceptive_variant(base_task, variant_num=1)
```

Reports include:
- Executive summary with core findings
- Model comparison with pass rates and failure codes
- Failure mode analysis showing code frequency per model
- Task-level details with classifier output
- Taxonomy definitions for all failure categories
Supported output formats: HTML and Markdown.
For development and testing:

```bash
pip install -e .
# Tests currently use real API calls.
# Mocking is not yet implemented.
```

Required environment variables:

- XAI_API_KEY: xAI Grok API key
- GEMINI_API_KEY: Google Gemini API key
Copy .env.example to .env and fill in the keys.
Failure mode classification is performed by an LLM that analyzes the task prompt, the model's output, and the expected behavior, then assigns taxonomy codes. Low-confidence predictions should be validated manually.
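A minimal sketch of what such a classifier call could look like; the prompt wording, the llm_client interface, and the JSON return format are illustrative assumptions, not the harness's actual implementation.

```python
# Illustrative sketch of an LLM-based failure classifier. The prompt wording,
# client interface, and return format are assumptions, not the harness's code.
import json

FAILURE_CODES = ["SPEC-MIS", "INV-LOSS", "EDGE-OBO", "HALL-CON",
                 "ABS-FAIL", "BIZ-FRAME", "CTX-DROP"]

def classify_failure(llm_client, task_prompt, model_output, expected_behavior):
    """Ask an LLM to assign taxonomy codes plus a confidence score."""
    prompt = (
        "Given a task prompt, a model's output, and the expected behavior, "
        f"assign one or more failure codes from: {', '.join(FAILURE_CODES)}.\n\n"
        f"Task prompt:\n{task_prompt}\n\n"
        f"Model output:\n{model_output}\n\n"
        f"Expected behavior:\n{expected_behavior}\n\n"
        'Respond as JSON: {"codes": [...], "confidence": 0.0, "rationale": ""}'
    )
    raw = llm_client.complete(prompt)  # hypothetical client call
    return json.loads(raw)
```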
The runner executes submitted code against test cases using exec(), captures output, errors, and metadata, and records the execution trace. There is no sandboxing yet; use only trusted inputs in production.
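In rough outline, an exec()-based runner of this kind might look like the sketch below; the function and field names are illustrative, not the harness's actual code.

```python
# Rough sketch of an exec()-based test runner; names and record structure are
# illustrative, not the harness's implementation. Note there is no sandboxing:
# the submitted code runs with full interpreter privileges.
import io
import traceback
from contextlib import redirect_stdout

def run_test_case(submitted_code, call_expr, expected_output):
    namespace = {}
    stdout = io.StringIO()
    record = {"call": call_expr, "expected": expected_output}
    try:
        with redirect_stdout(stdout):
            exec(submitted_code, namespace)      # define the submitted solution
            result = eval(call_expr, namespace)  # invoke it for this test case
        record.update(result=repr(result), passed=repr(result) == expected_output)
    except Exception:
        record.update(error=traceback.format_exc(), passed=False)
    record["stdout"] = stdout.getvalue()
    return record
```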
For multi-turn tasks, each turn includes the full conversation state and prior outputs. This makes it possible to test whether the model retains constraints across interactions.
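One way to picture that loop is the sketch below, which replays the accumulated message history on every turn; the message format and client call are assumptions, not the harness's actual multi-turn code.

```python
# Sketch of a multi-turn run that carries the full conversation forward.
# The message format and llm_client.chat interface are assumptions.
def run_multiturn(llm_client, turns):
    """turns: list of user prompts; returns per-turn outputs with history."""
    messages, outputs = [], []
    for user_prompt in turns:
        messages.append({"role": "user", "content": user_prompt})
        reply = llm_client.chat(messages)  # hypothetical client call
        messages.append({"role": "assistant", "content": reply})
        # Each recorded turn keeps the whole conversation so far, so later
        # analysis can check whether earlier constraints were dropped.
        outputs.append({"turn": len(outputs) + 1,
                        "history": list(messages),
                        "output": reply})
    return outputs
```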
- No code sandboxing yet
- Execution is sequential, not asynchronous
- Currently focused on Python tasks
- Future work: dedicated runners for additional languages
- Future work: richer scoring functions per task
- Future work: finer-grained prompt control per model
Follow the repo’s code style and commit habits:
- small, focused commits
- readable code
- minimal comments
- no generated documentation
This project is licensed under the MIT License. See LICENSE for details.
