
Bugfix: Ensure valid regex patterns in tool schemas#169

Merged
rhennigan merged 11 commits into main from bugfix/ensure-valid-regex-in-tool-schemas
Apr 23, 2026
Conversation

@rhennigan
Member

@rhennigan rhennigan commented Apr 23, 2026

Summary

  • Tool schema "pattern" fields were emitted in ICU/PCRE forms (POSIX character classes, (?flags)pattern, \A/\z/\Z, \x{HEX}, etc.) that JSON Schema / ECMA-262 validators reject. Most visibly, every plain-string parameter shipped (?ms).*, which was wrapped as /.*/ms; validators read those delimiters as literal forward slashes rather than as regex syntax.
  • Add a new toJSRegex utility in Kernel/Utilities.wl that pipelines the output of StringPattern`PatternConvert into JS-compatible regex (strips leading (?flags), drops scope-less (?-m-s) modifiers, maps all 12 POSIX classes including nested [[:a:][:b:]], converts \A/\z/\Z, normalizes \x{HEX} to \xNN/\uNNNN, and rewrites unescaped . under dotall), and a toolSchema helper in StartMCPServer.wl that routes pattern fields through it and drops the redundant (?ms).* entirely.
  • Both helpers are hardened with Enclose/Confirm so unexpected inputs surface as internal failures instead of silently producing invalid schemas. Added unit tests in Tests/Utilities.wlt and Tests/StartMCPServer.wlt, and referenced the new module in AGENTS.md and docs/tools.md.
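
As a rough illustration of two of the steps listed above (stripping a leading (?flags) group and rewriting unescaped dots when dotall was set), here is a minimal Python sketch. The PR's actual implementation is Wolfram Language in Kernel/Utilities.wl; the function names below are hypothetical, and the sketch assumes POSIX classes have already been mapped so that bracket expressions are not nested.

```python
import re
from typing import Tuple

# Hypothetical names for illustration; not the PR's Wolfram Language code.
_LEADING_FLAGS = re.compile(r"^\(\?([a-zA-Z]+)\)")


def strip_leading_flags(pattern: str) -> Tuple[str, str]:
    """Split a leading (?flags) group off the pattern; return (flags, rest)."""
    m = _LEADING_FLAGS.match(pattern)
    if m is None:
        return "", pattern
    return m.group(1), pattern[m.end():]


def rewrite_dots(pattern: str) -> str:
    """Replace unescaped '.' outside character classes with [\\s\\S].

    Assumes simple, non-nested bracket expressions.
    """
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # keep the escaped pair verbatim
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        if ch == "." and not in_class:
            out.append(r"[\s\S]")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def to_js_like(pattern: str) -> str:
    """Drop leading flags; emulate dotall by rewriting dots if 's' was set."""
    flags, rest = strip_leading_flags(pattern)
    return rewrite_dots(rest) if "s" in flags else rest
```

Rewriting the dot to [\s\S] is one common way to get dotall behavior without relying on a regex flag, which JSON Schema "pattern" strings cannot carry.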

Test plan

  • TestReport on Tests/Utilities.wlt passes (covers toJSRegex across POSIX classes, flag stripping, anchor conversion, hex escapes, dotall rewriting)
  • TestReport on Tests/StartMCPServer.wlt passes (covers toolSchema for plain String, Number/Integer/Boolean, optional params, Help-as-description, enumerations, PUA escaping, and the (?ms).* regression)
  • CodeInspector clean on Kernel/Utilities.wl, Kernel/StartMCPServer.wl, and the test files
  • Smoke-test an MCP client (e.g. a JSON Schema validator) accepts the generated tool schemas end-to-end

🤖 Generated with Claude Code

rhennigan and others added 8 commits April 23, 2026 10:11
…ibility

Wrap schema retrieval in a new toolSchema helper that converts POSIX
character classes and (?flags)pattern syntax into JavaScript-compatible
forms, and ensures all strings are valid UTF-8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite toJSRegex as a pipeline over the patterns StringPattern`PatternConvert
produces: strip leading (?flags), drop scope-less inner (?-m-s) modifiers,
map all 12 POSIX character classes (including the nested [[:a:][:b:]] form),
convert \A/\z/\Z anchors, normalize \x{HEX} escapes to \xNN/\uNNNN, and
rewrite unescaped dots to [\s\S] when dotall was set.

Fixes the previous behavior where the common "(?ms).*" pattern emitted for
every plain-string tool parameter was wrapped in literal-regex delimiters
("/.*/ms"), which JSON Schema validators treat as literal forward slashes
and reject.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
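
The POSIX class step described in this commit can be sketched in Python as a token substitution inside bracket expressions. The ASCII-range bodies in the table below are assumptions chosen for illustration; the PR does not state which replacements toJSRegex uses, and the function name is hypothetical. Negated forms such as [:^alpha:] are not handled.

```python
import re

# Assumed ASCII-only equivalents for the 12 POSIX classes. Each value is a
# class *body*, meant to sit inside an existing [...] bracket expression.
POSIX_CLASS_BODIES = {
    "alnum": r"a-zA-Z0-9",
    "alpha": r"a-zA-Z",
    "blank": r" \t",
    "cntrl": r"\x00-\x1F\x7F",
    "digit": r"0-9",
    "graph": r"\x21-\x7E",
    "lower": r"a-z",
    "print": r"\x20-\x7E",
    "punct": r"!-\/:-@\[-`{-~",
    "space": r"\s",
    "upper": r"A-Z",
    "xdigit": r"0-9A-Fa-f",
}

_POSIX_TOKEN = re.compile(r"\[:([a-z]+):\]")


def map_posix_classes(pattern: str) -> str:
    """Replace every [:name:] token with its class body.

    Because the token is replaced in place, the nested form
    [[:a:][:b:]] collapses into a single bracket expression.
    """
    def replace(match):
        name = match.group(1)
        if name not in POSIX_CLASS_BODIES:
            # Fail loudly rather than emit a pattern a validator would reject.
            raise ValueError("unknown POSIX class: " + name)
        return POSIX_CLASS_BODIES[name]

    return _POSIX_TOKEN.sub(replace, pattern)
```

A function replacement is used with `sub` so that backslashes in the class bodies are inserted literally instead of being interpreted as replacement-template escapes.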
Move toJSRegex and its helper functions into Kernel/Utilities.wl and
expose it through the shared Common context, since it's a general-purpose
regex utility rather than MCP-server-specific. Tests are relocated to a
new Tests/Utilities.wlt file accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrap the conversion pipeline in Enclose/Confirm to surface any unexpected
step failures, expand the header comment with the accepted non-goals
(multiline anchors, PCRE-only constructs, missing u-flag), and record a
follow-up to warn on unhandled patterns during server creation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `(?ms).*` pattern matches any string, so it adds no validation
constraint. Drop it entirely instead of routing through toJSRegex, which
can't meaningfully convert inline-flag syntax to JavaScript.

Also corrects the comment describing `safeString` — it filters
private-use characters, not UTF-8 validity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrap the body in Enclose/throwInternalFailure and ConfirmBy the toJSRegex
result, so a non-string return surfaces as an internal failure rather than
silently producing an invalid schema. Add inline comments explaining why
the `(?ms).*` pattern is safe to drop and what the remaining branch handles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
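
The two commits above (dropping the match-anything pattern, then confirming the converter's result) can be pictured with the following Python sketch. It is not the PR's toolSchema, which is written in Wolfram Language with Enclose/ConfirmBy; `sanitize_schema` and the `convert` callback are hypothetical stand-ins, and the schema shape is an assumed JSON Schema object with a "properties" map.

```python
from typing import Any, Callable, Dict

MATCH_ANY = "(?ms).*"  # matches every string, so it constrains nothing


def sanitize_schema(schema: Dict[str, Any],
                    convert: Callable[[str], Any]) -> Dict[str, Any]:
    """Return a copy of schema with each property's "pattern" sanitized."""
    if "properties" not in schema:
        return dict(schema)
    properties = {}
    for name, spec in schema["properties"].items():
        spec = dict(spec)  # copy so the caller's schema is not mutated
        pattern = spec.get("pattern")
        if pattern == MATCH_ANY:
            # Redundant constraint: drop it instead of converting it.
            del spec["pattern"]
        elif pattern is not None:
            converted = convert(pattern)
            if not isinstance(converted, str):
                # Surface a bad conversion instead of emitting a bad schema.
                raise TypeError("pattern conversion did not return a string")
            spec["pattern"] = converted
        properties[name] = spec
    return {**schema, "properties": properties}
```

Checking the converter's return type at this boundary is the Python analogue of the ConfirmBy step: a failure stops schema generation rather than shipping an invalid "pattern".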
Cover the regression from the recent regex-sanitization bugfix plus
behavior for plain String, Number/Integer/Boolean, optional parameters,
Help-as-description, enumerations, and PUA-character escaping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `Kernel/Utilities.wl` to the project structure listing in AGENTS.md
and to the Related Files section of docs/tools.md, noting that
`toolSchema` routes tool schema "pattern" fields through `toJSRegex`
for JSON Schema / JavaScript compatibility.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 23, 2026 16:04
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes invalid JSON Schema "pattern" strings emitted for MCP tool input schemas by adding a conversion pipeline that rewrites ICU/PCRE-style regex output into a JS/ECMA-262-compatible form, and by routing tool schema generation through that sanitizer.

Changes:

  • Added toJSRegex (Common-exported, implemented in Kernel/Utilities.wl) to convert common ICU/PCRE constructs into JS-compatible regex strings.
  • Added toolSchema in Kernel/StartMCPServer.wl and updated tool listing to sanitize "pattern" fields (and drop the redundant (?ms).* match-any pattern).
  • Added unit tests for both utilities and server behavior, plus documentation and spelling dictionary updates.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| TODO.md | Added a follow-up TODO about warning on unhandled regex constructs. |
| Tests/Utilities.wlt | New test coverage for toJSRegex conversions (POSIX classes, anchors, escapes, dotall rewriting). |
| Tests/StartMCPServer.wlt | New regression/behavior tests for toolSchema output patterns and schema structure. |
| Kernel/Utilities.wl | Implements toJSRegex and helper conversion routines. |
| Kernel/StartMCPServer.wl | Uses toolSchema for MCP tool inputSchema, sanitizing "pattern" values. |
| Kernel/CommonSymbols.wl | Exports toJSRegex from the Common context. |
| docs/tools.md | Documents the new sanitizer path (toolSchema) and toJSRegex. |
| AGENTS.md | Adds Utilities module mention including toJSRegex. |
| .cspell.json | Adds regex-related words used in code/tests/docs. |


Comment thread Kernel/Utilities.wl
Comment thread Kernel/Utilities.wl Outdated
rhennigan and others added 2 commits April 23, 2026 12:24
Match only the "(?:(?-...)" wrapper form emitted by PatternConvert around
RegularExpression[] contents. The previous pattern would strip any
"(?-m-s)"-style token anywhere in the regex, which could silently change
semantics of user-supplied patterns that intentionally use inline
modifiers mid-pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
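
The difference between the two matching strategies in this commit can be shown with a small Python sketch. It assumes the wrapper looks like `(?:(?-m-s)BODY)`, inferred from the "(?:(?-...)" form named in the commit message; the exact text PatternConvert emits is not shown in this PR, and both function names are hypothetical.

```python
import re

# Matches a modifier group only when it immediately follows "(?:".
_WRAPPED_MODIFIER = re.compile(r"\(\?:\(\?-[a-z-]+\)")
# The earlier, broader behavior: any "(?-...)" token anywhere in the pattern.
_ANY_MODIFIER = re.compile(r"\(\?-[a-z-]+\)")


def strip_wrapper_modifiers(pattern: str) -> str:
    """Remove the modifier only in the "(?:(?-...)" wrapper position."""
    return _WRAPPED_MODIFIER.sub("(?:", pattern)


def strip_any_modifiers(pattern: str) -> str:
    """Remove every "(?-...)" token (shown for contrast only)."""
    return _ANY_MODIFIER.sub("", pattern)
```

With the narrower match, a mid-pattern modifier such as the one in `a(?-s).b` is left in place, so a pattern that depends on it is not silently changed.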
Replace the "\u{NNNNN}" escape (which is only valid under the JS regex "u"
flag) with a UTF-16 surrogate pair of "\uXXXX" escapes. JSON Schema
validators that do not set the "u" flag still match the composed
supplementary character via the surrogate pair, so converted patterns
stay valid without requiring any flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
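
The surrogate-pair replacement follows the standard UTF-16 encoding of supplementary code points. A minimal Python sketch of that arithmetic is below; the function name echoes supplementaryToSurrogatePair from a later commit in this PR, but the body and the uppercase hex formatting are assumptions, not the PR's code.

```python
def supplementary_to_surrogate_pair(code_point: int) -> str:
    """Encode a code point in U+10000..U+10FFFF as two \\uXXXX escapes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %X" % code_point)
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return "\\u{:04X}\\u{:04X}".format(high, low)
```

For example, U+1F600 becomes `\uD83D\uDE00`, which a JavaScript regex without the "u" flag reads as the same two UTF-16 code units the string itself contains.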
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.



Comment thread Kernel/Utilities.wl Outdated
convertHexEscape routed zero-padded escapes like "\x{0000A0}" into the
surrogate-pair path because it branched on the hex string's length. The
SupplementaryRange assert in supplementaryToSurrogatePair then failed.
Parse the code point once and branch on its numeric value so any
length/zero-padding form of \x{...} is handled correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
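
The fix described here, parsing the code point once and branching on its numeric value, might look like the following in Python. This is a sketch under assumptions: the names mirror convertHexEscape loosely, the output uses uppercase hex, and the regex does not account for an escaped backslash directly before `\x{`.

```python
import re

_HEX_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")


def convert_hex_escape(hex_digits: str) -> str:
    """Convert the digits of \\x{HEX} to \\xNN, \\uNNNN, or a surrogate pair."""
    code_point = int(hex_digits, 16)  # leading zeros vanish here
    if code_point <= 0xFF:
        return "\\x{:02X}".format(code_point)
    if code_point <= 0xFFFF:
        return "\\u{:04X}".format(code_point)
    if code_point <= 0x10FFFF:
        offset = code_point - 0x10000
        high = 0xD800 + (offset >> 10)
        low = 0xDC00 + (offset & 0x3FF)
        return "\\u{:04X}\\u{:04X}".format(high, low)
    raise ValueError("code point out of range: " + hex_digits)


def normalize_hex_escapes(pattern: str) -> str:
    """Rewrite every \\x{HEX} escape in the pattern."""
    return _HEX_ESCAPE.sub(lambda m: convert_hex_escape(m.group(1)), pattern)
```

Branching on the parsed integer means `\x{A0}`, `\x{00A0}`, and `\x{0000A0}` all take the same path, which is the property the length-based branch lacked.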
@rhennigan rhennigan merged commit 74a3fe7 into main Apr 23, 2026
1 of 2 checks passed
@rhennigan rhennigan deleted the bugfix/ensure-valid-regex-in-tool-schemas branch April 23, 2026 17:22

2 participants