Skip to content

test(docs): add code example extraction and syntax validation#1250

Merged
kovtcharov-amd merged 3 commits into
mainfrom
feat/doc-code-validation-244
May 29, 2026
Merged

test(docs): add code example extraction and syntax validation#1250
kovtcharov-amd merged 3 commits into
mainfrom
feat/doc-code-validation-244

Conversation

@kovtcharov-amd
Copy link
Copy Markdown
Collaborator

@kovtcharov-amd kovtcharov-amd commented May 29, 2026

Doc code examples had no automated syntax checking — broken Python snippets could ship to users without anyone noticing until they copy-pasted from the docs. Now util/check_doc_code.py extracts fenced code blocks from all MDX/MD files and runs compile() on Python blocks, reporting file and line number for each syntax error. Handles real-world MDX patterns: indented blocks inside <Tabs>/<Steps>/<Accordion>, top-level await (async doc examples), return/yield (method body excerpts), and ... placeholders. Pseudo-code in design docs (docs/spec/, docs/plans/) and non-runnable patterns (function signatures, flow diagrams, mixed-language blocks) are detected and skipped. CI runs it automatically on doc changes alongside the existing link checker.

Test plan

  • python -m pytest tests/unit/test_doc_code_validation.py -xvs — 47 tests pass
  • python util/check_doc_code.py --lang python — 464 blocks validated, 0 errors, exits 0
  • python util/check_doc_code.py (no filter) — shows correct OK/skipped counts even without --verbose

Closes #244

Doc code examples had no automated syntax checking — broken Python snippets
could ship to users without anyone noticing until they copy-pasted. Now
`util/check_doc_code.py` extracts fenced code blocks from all MDX/MD docs
and runs `compile()` on Python blocks, reporting file/line for each error.

Handles MDX patterns: indented blocks inside <Tabs>/<Steps>/<Accordion>,
top-level `await` (async doc examples), `return`/`yield` (method body
excerpts), and `...` placeholders.

Closes #244
@github-actions github-actions Bot added devops DevOps/infrastructure changes tests Test changes labels May 29, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Good infrastructure addition — compile()-based Python syntax checking is lightweight and the MDX normalization handles the real patterns (indented component nesting, ... placeholders, top-level await). Three issues need addressing before this is safe to merge: the CI job will fail immediately on the existing docs, the statistics output is wrong without --verbose, and one error handler silently eats failures.


Issues

🟡 CI job will fail on every doc-touching PR after merge

The new check-code-examples step runs python util/check_doc_code.py --lang python which exits 1 because 104 existing blocks in docs/spec/ and docs/plans/ are pseudo-code with intentional syntax violations. Merging this PR as-is means every subsequent PR that touches docs/** will have a failing required check until all 104 are resolved.

Three viable fixes — pick one:

Option A — Exclude spec/plans directories (scope the check to user-facing docs only):

          echo "Validating Python code examples in docs..."
          python util/check_doc_code.py --lang python --exclude docs/spec docs/plans

(Requires adding --exclude support to main() and find_doc_files().)

Option B — Mark the step continue-on-error with a documented reason (CLAUDE.md explicitly permits this for non-fatal CI steps):

      - name: Check documentation code examples
        continue-on-error: true  # 104 pseudo-code blocks in spec/plans fail; tracked in #244
        run: |
          echo "Validating Python code examples in docs..."
          python util/check_doc_code.py --lang python

Option C — Fix the 104 existing issues before or alongside this PR (preferred long-term, but likely a separate PR given the scope).


🟡 "Code blocks checked" count is wrong without --verbose (util/check_doc_code.py:861)

ok results are only appended to results when verbose=True (check_code_blocks, line 812–814). Without --verbose, ok_count = 0, so the summary prints:

Code blocks checked: 104      ← looks like only 104 blocks exist
  OK:       0                  ← misleads author into thinking nothing passed
  Errors:   104

The actual numbers (confirmed locally) are 1157 blocks checked, 1053 OK, 104 errors. The check counter needs to be maintained independently of the results list:

def check_code_blocks(
    repo_root: str,
    check_imports_flag: bool = False,
    verbose: bool = False,
    lang_filter: Optional[str] = None,
) -> List[CodeResult]:
    """Check all code blocks in documentation files."""
    root = Path(repo_root)
    results: List[CodeResult] = []
    _ok_total = 0  # track totals independent of verbosity

    for filepath in find_doc_files(repo_root):
        for block in extract_code_blocks(filepath, root):
            if lang_filter and block.lang != lang_filter.lower():
                continue

            if block.lang in PYTHON_LANGS:
                err = check_python_syntax(block.source, f"{block.file}:{block.line}")
                if err:
                    results.append(CodeResult(
                        block.file, block.line, block.lang,
                        "error", err, block.title,
                    ))
                else:
                    _ok_total += 1
                    if check_imports_flag:
                        for w in check_imports(block.source):
                            results.append(CodeResult(
                                block.file, block.line, block.lang,
                                "warning", w, block.title,
                            ))
                    if verbose:
                        results.append(CodeResult(
                            block.file, block.line, block.lang,
                            "ok", "syntax ok", block.title,
                        ))
            ...
    # attach totals so format_results can use them
    # (simplest approach: return a second value, or attach as an attribute)

Simplest fix without changing the return type: carry the verbose-independent ok count through a wrapper or a dataclass. Alternatively, always append ok results and filter them in format_results/the summary counter — that changes memory behaviour for large doc trees but is simpler to reason about.


🟡 Silent exception in extract_code_blocks (util/check_doc_code.py:687)

except Exception: return blocks silently swallows file-read failures — CLAUDE.md explicitly prohibits this pattern. A corrupted or permission-denied file produces no diagnostic.

    try:
        content = filepath.read_text(encoding="utf-8", errors="replace")
    except OSError as e:
        raise RuntimeError(f"Could not read {filepath}: {e}") from e

If soft failure is intentional (e.g. for network-mounted docs dirs), at minimum:

    try:
        content = filepath.read_text(encoding="utf-8", errors="replace")
    except OSError as e:
        print(f"WARNING: skipping {filepath}: {e}", file=sys.stderr)
        return blocks

🟢 format_results verbose parameter is dead code (util/check_doc_code.py:833)

The parameter is accepted but never read inside the function — the filtering already happened at the check_code_blocks level.

def format_results(results: List[CodeResult]) -> str:

(And update the two call sites accordingly.)


🟢 _setup_docs duplicated in TestCheckCodeBlocks and TestMain

Identical helper appears in both test classes. Extract once at module level or into a shared fixture in conftest.py.


Strengths

  • The MDX normalization in _normalize_python_source is well-thought-out: ...pass, top-level await wrapping, and textwrap.dedent for indented component blocks are exactly the patterns that would cause false positives without them.
  • Test coverage is solid — extraction, normalization, syntax checking, integration, formatting, and CLI exit codes all have dedicated test classes. The test_mixed_valid_and_invalid and test_indented_code_in_mdx_component cases cover the tricky real-world patterns.
  • _KNOWN_MODULES allowlist prevents the --check-imports mode from producing false positives in a bare CI environment where GAIA's dependencies aren't installed — good defensive design.

Verdict

Request changes. The CI job will fail immediately after merge (104 existing issues in spec/plans), and the statistics bug means the default output is misleading. Both are quick fixes. Address the CI strategy decision (exclude dirs, continue-on-error, or a companion cleanup PR) and the ok_count counting before merging.

…ntation

The validator was flagging 104 blocks that are function signatures, flow
diagrams, mixed-language blocks, or spec pseudo-code — not runnable Python.
Now it detects and skips: bare signatures without bodies, arrow notation,
TOML/shell mixed blocks, trailing-ellipsis placeholders, and indented
fragments. Design docs (docs/spec/, docs/plans/) are skipped entirely.

Also fixes a real indentation bug in the chat-agent playbook where the
"# Use it" section was at column 0 instead of the code block's indent
level.

Result: 464 Python blocks validated, 0 errors — CI passes.
@github-actions github-actions Bot added the documentation Documentation changes label May 29, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Good addition to the CI toolchain — doc-code validation closes a real gap (broken snippets shipping to users undetected). The pseudo-code heuristics cover the tricky MDX patterns well, and the test suite is thorough. Two issues worth fixing before merge.


Issues

🟡 Silent exception swallow in extract_code_blocks (util/check_doc_code.py:770–771)

    except Exception:
        return blocks

This violates CLAUDE.md's "no silent fallbacks" rule. A file that can't be read (permission error, OS-level race after directory scan) is silently skipped — the CI scanner continues and exits 0, giving a false "all clear". Should print to stderr and re-raise or skip with a logged warning:

    except OSError as e:
        print(f"warning: could not read {filepath}: {e}", file=sys.stderr)
        return blocks

The errors="replace" on read_text already handles encoding faults, so OSError is the only realistic subclass left.


🟢 global anti-pattern in autouse fixture (tests/unit/test_doc_code_validation.py:23)

    global check_doc_code
    check_doc_code = importlib.import_module("check_doc_code")

Module-level global in a fixture makes test isolation fragile — if two test files ever share a session, the reference can bleed. The sys.path manipulation + importlib.import_module dance is also unusual when a simpler approach works. Since util/ only needs to be on the path once, this is cleaner at module scope:

import importlib
import sys
from pathlib import Path

_util_dir = str(Path(__file__).resolve().parents[2] / "util")
if _util_dir not in sys.path:
    sys.path.insert(0, _util_dir)

check_doc_code = importlib.import_module("check_doc_code")

Then remove the autouse=True fixture entirely — it's no longer needed.


🟢 Duplicated _setup_docs helper (tests/unit/test_doc_code_validation.py:411–419 and 633–640)

TestCheckCodeBlocks._setup_docs and TestMain._setup_docs are identical. Extract to a module-level fixture:

@pytest.fixture()
def setup_docs(tmp_path):
    def _inner(files: dict) -> str:
        docs_dir = tmp_path / "docs"
        docs_dir.mkdir()
        for name, content in files.items():
            p = docs_dir / name
            p.parent.mkdir(parents=True, exist_ok=True)
            p.write_text(textwrap.dedent(content), encoding="utf-8")
        return str(tmp_path)
    return _inner

Affected: all test methods in TestCheckCodeBlocks and TestMain that call self._setup_docs(tmp_path, ...).


🟢 warnings variable name in check_imports (util/check_doc_code.py:969)

warnings: List[str] = [] shadows the stdlib warnings module name. Not a runtime error here (it isn't imported), but it confuses readers and linters.

    issues: List[str] = []

Update all references in the function accordingly.


🟢 Test plan description vs. actual tool behavior

The PR description says --lang python "surfaces ~104 genuine syntax issues in specs/plans (pseudo-code), exits non-zero". But PSEUDO_CODE_DIRS skips docs/spec and docs/plans entirely — those 104 issues would come from other directories (guides, SDK docs), or would exit 0 if all non-spec pseudo-code is caught by _is_pseudo_code(). The test plan checkbox wording creates confusion about what "verified" means. Worth updating the PR description to clarify, even though it doesn't affect the code.


Strengths

  • Heuristic design is solid. The layered _is_pseudo_code checks — arrows, TOML headers, signature stubs, all-indented fragments, mixed-shell blocks — cover the real MDX patterns without false-positiving on normal code. The priority order (pseudo-code check → normalize → compile) is the right pipeline.
  • Self-contained stdlib-only tool. No pip install step needed in CI, and the existing actions/checkout@v6 + actions/setup-python@v6 versions used throughout the repo are correctly reused.
  • Playbook fix is a genuine improvement. Replacing the triple-quoted f-string (indentation-sensitive in MDX) with explicit string concatenation removes a real fragility, and the "Use it" block indentation fix corrects broken indentation the validator itself would have caught.
  • Test coverage. 46 tests cover extraction, normalization, pseudo-code detection, integration, output formatting, and the CLI exit codes. Edge cases like top-level await, return outside a function, and ... placeholders all have dedicated tests.

Verdict

Approve with suggestions. The silent exception swallow is the only issue worth blocking on; everything else is style/maintainability. Fix the OSError handler before merge; the global fixture and duplicated helper can follow in a cleanup commit.

- Replace silent `except Exception` with `OSError` + stderr warning
  (CLAUDE.md prohibits swallowing errors)
- Fix statistics bug: ok_count and skipped_count now tracked independently
  of --verbose, so default output shows correct totals (was showing 0/0)
- Return CheckSummary NamedTuple instead of bare list to carry counts
- Remove dead `verbose` param from format_results()
- Rename `warnings` variable to `issues` in check_imports() to avoid
  shadowing the stdlib module
- Replace `global` anti-pattern in test fixture with module-level import
- Extract duplicated `_setup_docs` helper into shared pytest fixture
- Add test_ok_count_without_verbose to verify the statistics fix
@github-actions
Copy link
Copy Markdown
Contributor

This PR closes a real gap: broken Python snippets in MDX docs could ship to users with no automated catch. The implementation is solid across the board — clean data modelling, correct handling of real-world MDX quirks, and a thorough test suite. Two minor nits worth fixing before this becomes the pattern other utilities follow.


Issues

🟢 Minor — _is_pseudo_code angle-bracket check produces false negatives (util/check_doc_code.py:817)

if re.search(r"<\w+[^>]*>", stripped) and "f'" not in stripped and 'f"' not in stripped:
    return True

Any code block containing a non-f-string literal with an XML/HTML-like pattern — e.g. text.replace("<thinking>", "") or response = "<answer>" — will be silently classified as pseudo-code and skipped. This is a false-negative (valid syntax goes unchecked), not a false-positive (CI won't break). It's an acceptable pragmatic trade-off for now, but worth a comment so future readers know why the check is there and what its limits are:

    # Heuristic: XML/HTML-like tags (<Tag>, <response>) signal template pseudo-code.
    # False-negative: string literals like text.replace("<tag>", "") are also skipped.
    if re.search(r"<\w+[^>]*>", stripped) and "f'" not in stripped and 'f"' not in stripped:
        return True

🟢 Minor — Windows backslash paths in PSEUDO_CODE_DIRS are dead code in CI (util/check_doc_code.py:717)

PSEUDO_CODE_DIRS = {"docs/spec", "docs/plans", "docs\\spec", "docs\\plans"}

CI always runs on Ubuntu where pathlib produces /-separated relative paths, so docs\\spec and docs\\plans never match. If the intent is local Windows support, add a comment; if not, drop the dead entries.

# Use both separators so the check works on Windows local dev too
PSEUDO_CODE_DIRS = {"docs/spec", "docs/plans", "docs\\spec", "docs\\plans"}

Or, normalize at the point of comparison so the set stays clean:

PSEUDO_CODE_DIRS = {"docs/spec", "docs/plans"}
...
# in check_code_blocks:
normalized_file = block.file.replace("\\", "/")
if any(normalized_file.startswith(d) for d in PSEUDO_CODE_DIRS):

Strengths

  • Test suite is genuinely thorough. The 47 tests exercise the edge cases that matter in practice — top-level await, method-body excerpts with return/yield, MDX component indentation, mixed-language blocks, empty blocks, ... placeholders. These are exactly the cases where naive compile() would produce spurious failures.
  • Pseudo-code heuristics avoid the CI-breaks-on-docs trap. _is_pseudo_code + _is_signature_only_block handle the common doc patterns (function signatures, flow diagrams, TOML-mixed blocks) that look like Python but aren't runnable. Getting this right is the hardest part of this kind of tool, and the coverage here is solid.
  • Clean architecture. NamedTuple data classes (CodeBlock, CodeResult, CheckSummary) make each stage composable and independently testable. The playbook fix (aligning the "Use it" block to 4-space indent so textwrap.dedent() can normalise the whole snippet) is correct and necessary.

Verdict

Approve. No blocking issues. The two nits above are cosmetic/documentation-level — they don't affect correctness or CI behaviour.

@kovtcharov-amd kovtcharov-amd self-assigned this May 29, 2026
@kovtcharov-amd kovtcharov-amd added the p2 low priority label May 29, 2026
Copy link
Copy Markdown
Collaborator

@itomek itomek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three blockers from the first review are resolved in head — extract_code_blocks now catches OSError and warns to stderr (no silent swallow), check_code_blocks returns a verbose-independent ok_count so the default summary is accurate, and the spec/plans pseudo-code blocks are skipped so CI won't fail the repo on merge. Approving. Minor, non-blocking: the description's "~104 genuine syntax issues" line is stale relative to the current skip behavior — worth a one-line edit so the test-plan wording matches what the tool now does.


Generated by Claude Code

@kovtcharov-amd kovtcharov-amd added this pull request to the merge queue May 29, 2026
Merged via the queue into main with commit 6d4711d May 29, 2026
30 checks passed
@kovtcharov-amd kovtcharov-amd deleted the feat/doc-code-validation-244 branch May 29, 2026 16:29
@kovtcharov-amd kovtcharov-amd mentioned this pull request Jun 1, 2026
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

devops DevOps/infrastructure changes documentation Documentation changes p2 low priority tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Testing] Documentation Code Validation

2 participants