Skip to content

fix(command-parser)!: drop the char-level tokenizer in favor of an AST-only path#330

Merged
fohte merged 6 commits into
mainfrom
fohte/refactor-tree-sitter-bash
May 4, 2026
Merged

fix(command-parser)!: drop the char-level tokenizer in favor of an AST-only path#330
fohte merged 6 commits into
mainfrom
fohte/refactor-tree-sitter-bash

Conversation

@fohte
Copy link
Copy Markdown
Owner

@fohte fohte commented May 4, 2026

Purpose

  • Commit commands that contain a HEREDOC inside a command substitution (the kind Claude Code's /commit skill produces) were rejected with command parse error: unclosed quote

Reproduction steps

Piping the following command into runok check --input-format claude-code-hook is rejected with command parse error: unclosed quote, even though the existing git [-C *] commit -m * rule should allow it.

git add path && git commit -m "$(cat <<'EOF'
subject

body line 1 with 'apostrophes' inside
EOF
)"

Approach

  • Drop the char-level fallback tokenizer and route all bash parsing through tree-sitter-bash
  • Resolve quoting and escapes per AST node so a HEREDOC body never re-enters a tokenizer
  • Treat HEREDOCs whose delimiter is quoted (<<'EOF', <<"EOF", <<\EOF) as literal — skip the $(cmd) extraction inside their bodies, matching bash semantics
Design decisions

Fallback for inputs tree-sitter cannot parse

Decision Design Pros Cons
Chosen Fall back to shlex::split only when tree-sitter returns has_error Real-world inputs that bash itself rejects (e.g. graphql query bodies as flag values) keep working; HEREDOCs stay on the AST path Need to distinguish has_error from "not a single command" so pipelines and && lists are not silently flattened by the fallback
Rejected Drop the fallback entirely and return SyntaxError Simplest possible spec Breaks practical inputs already covered by tests, e.g. gh api graphql -f query=mutation{...}
Rejected Keep the char-level tokenizer and just add HEREDOC detection on top Preserves existing behaviour Leaves the dual implementation in place, so the original bug (tokenizer and AST disagreeing) reappears for other inputs

Breaking changes

  • runok no longer extracts $(cmd) from inside <<'EOF' / <<"EOF" / <<\EOF HEREDOC bodies. Bash treats these bodies as literal, so any rule that relied on matching a substitution inside one needs to be rewritten to target the command outside the HEREDOC.
  • Library consumers only:
    • CommandParseError::UnclosedQuote is removed and folded into SyntaxError.
    • A bare assignment (FOO=bar alone) and a trailing backslash (echo \) now return SyntaxError.

…REDOC + command substitution parsing

intent(command-parser): make `git commit -m "$(cat <<'EOF' ... EOF)"` evaluate
  correctly — it was rejected with `unclosed quote` because the char-level
  fallback tokenizer scanned the literal HEREDOC body as live shell, hit a
  stray quote in the prose, and bailed
decision(command-parser): replace the char-level `tokenize` with an AST-only
  `tokenize_command` and resolve quoting per tree-sitter node (`string`,
  `raw_string`, `concatenation`, `command_name` recurses into its named
  child) so a HEREDOC body never re-enters the shell scanner; fall back to
  `shlex::split` only when tree-sitter cannot parse the input at all
rejected(command-parser): patching the existing fallback tokenizer to
  understand HEREDOCs — it would re-implement bash's quoting rules a third
  time on top of the AST walk it already does, and the two paths drifting
  is what caused this bug in the first place
decision(command-parser): skip recursing into HEREDOC bodies whose delimiter
  is quoted (`<<'EOF'`, `<<"EOF"`, `<<\EOF`); bash treats those bodies as
  literal so a `$(cmd)` inside them is prose, not a real command — extracting
  it caused false ask/deny on commit messages
constraint(command-parser): tree-sitter-bash refuses some real-world inputs
  like `gh api graphql -f query=mutation{...}` whose flag value contains
  unbalanced shell metacharacters; AstTokenizeOutcome distinguishes "parse
  error" (use shlex) from "not a single command" (caller should have split
  it first) so pipelines never silently flatten through the fallback
learned(command-parser): tree-sitter-bash exposes `string_content` separately
  from interpolation children, so escape resolution can run on
  `string_content` only and `$(...)` / `$X` stay as source text — wrapper
  patterns like `bash -c <cmd>` still see the substitution intact
fohte added 2 commits May 4, 2026 11:18
…ated section

intent(release-notes): make the release notes scannable for the common
  reader (CLI / runok.yml authors) by surfacing the breaking change that
  affects them first and pushing the library-API-only ones to the bottom
intent(release-notes): drop the `</content></invoke>` artefact left by a
  bad copy-paste so the page renders without trailing literal tags
@codecov
Copy link
Copy Markdown

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 88.53211% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.04%. Comparing base (4d0467b) to head (223d798).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/rules/command_parser.rs 88.26% 25 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #330      +/-   ##
==========================================
- Coverage   89.12%   89.04%   -0.08%     
==========================================
  Files          53       53              
  Lines       11771    11871     +100     
==========================================
+ Hits        10491    10571      +80     
- Misses       1280     1300      +20     
Flag Coverage Δ
Linux 88.85% <88.53%> (-0.09%) ⬇️
macOS 90.26% <88.53%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request migrates the command parser to a tree-sitter-bash AST-based implementation, improving tokenization accuracy and shell-quoting resolution. It introduces breaking changes by removing the CommandParseError::UnclosedQuote variant and changing how bare assignments and trailing backslashes are handled, while also fixing bugs related to quoted command names and HEREDOC processing. Review feedback recommends enhancing the HEREDOC delimiter check to detect quotes at any position within the string and simplifying the node retrieval logic for command names.

Comment thread src/rules/command_parser.rs Outdated
Comment thread src/rules/command_parser.rs Outdated
fohte added 3 commits May 4, 2026 11:31
…t of the delimiter is quoted

intent(command-parser): match bash semantics — `<<EO'F'`, `<<E\OF`, and
  similar partially-quoted delimiters disable body expansion the same
  way `<<'EOF'` does, so runok must not extract `$(cmd)` from inside
  them
learned(command-parser): tree-sitter-bash sets has_error on `<<EO'F'`
  but parses `<<E\OF` cleanly (heredoc_start = `E\OF`); previously the
  quote check only looked at the first byte, so the `\OF` case slipped
  through and runok extracted the body's `$(cmd)` as a real command
…ed_child

intent(command-parser): use the tree-sitter helper that already exists
  for "first named child" instead of hand-rolling a `child_count` loop —
  same behaviour, easier to read
…itter-bash

# Conflicts:
#	docs/src/content/docs/releases/next.md
@fohte fohte merged commit 5dc8483 into main May 4, 2026
10 checks passed
@fohte fohte deleted the fohte/refactor-tree-sitter-bash branch May 4, 2026 03:49
@fohte-bot fohte-bot Bot mentioned this pull request May 4, 2026
fohte added a commit that referenced this pull request May 4, 2026
intent(releases): the regression originated in #330 (AST-only
  tokenizer) which is itself unreleased, so users will never see
  the broken behaviour. A "fixed" entry in next.md would describe
  a state that does not exist from their perspective.
fohte added a commit that referenced this pull request May 5, 2026
intent(releases): avoid documenting a fix for a regression whose source PR has not shipped yet — #330 itself is still in next.md, so listing the regression separately would describe a bug that no released version ever exhibited
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant