Skip to content

feat(string): add String::Split, EscapedStringPy, and rename EscapeString#550

Merged
junrushao merged 2 commits intoapache:mainfrom
junrushao:junrushao/2026-04-14/string-helper
Apr 14, 2026
Merged

feat(string): add String::Split, EscapedStringPy, and rename EscapeString#550
junrushao merged 2 commits intoapache:mainfrom
junrushao:junrushao/2026-04-14/string-helper

Conversation

@junrushao
Copy link
Copy Markdown
Member

@junrushao junrushao commented Apr 14, 2026

Summary

  • Rename EscapeString to EscapeStringJSON to clarify its JSON-specific escaping semantics (RFC 8259). A deprecated EscapeString alias is retained for backward compatibility.
  • Add EscapedStringPy for Python-style string escaping that handles ANSI escape sequences, UTF-8 multibyte characters, and standard C escape sequences (\n, \t, \r, \\, \").
  • Add String::Split(char delim) utility method that returns std::vector<std::string_view> segments.
  • Update all internal call sites (function.h, registry.h, dataclass.cc, json_writer.cc) to use the new EscapeStringJSON name.
  • ReprPrinter now uses EscapedStringPy instead of EscapeStringJSON for proper Python-style __repr__ output.

Motivation

The existing EscapeString function was JSON-specific but its name did not convey this. This rename makes intent explicit. The new EscapedStringPy function supports Python-style repr output needed for error messages and debugging. String::Split is a common utility needed across the codebase.

Changes

File Change
include/tvm/ffi/string.h Rename EscapeString -> EscapeStringJSON, add deprecated alias, add EscapedStringPy, add String::Split
include/tvm/ffi/string.h Cast to unsigned char before std::isdigit to avoid UB; use \x1b for ANSI escapes; validate UTF-8 continuation bytes
include/tvm/ffi/function.h Update call site to EscapeStringJSON
include/tvm/ffi/reflection/registry.h Update call site to EscapeStringJSON
src/ffi/extra/dataclass.cc ReprPrinter uses EscapedStringPy for Python-style repr output
src/ffi/extra/json_writer.cc Update call site to EscapeStringJSON
tests/cpp/test_string.cc Add 7 test cases for Split, EscapeStringJSON, EscapedStringPy (basic, control chars, ANSI, UTF-8, malformed UTF-8)

Test plan

  • All 47 C++ string tests pass
  • String::Split tested with edge cases (empty, boundaries, consecutive delimiters)
  • EscapeStringJSON tested with special chars, backslash, quotes, control chars
  • EscapedStringPy tested: basic ASCII, control chars, ANSI sequences, valid UTF-8 (2/3/4-byte), malformed UTF-8
  • Existing Python tests pass (deprecated alias preserves compatibility)

…ring to EscapeStringJSON

## Architecture
- Rename `EscapeString` -> `EscapeStringJSON` to clarify its JSON-specific
  escaping behavior (RFC 8259 compliant).
- Add deprecated `EscapeString` alias for backward compatibility.
- Add `EscapedStringPy` for Python-style string escaping that handles ANSI
  escape sequences, UTF-8 multibyte characters, and standard C escapes.
- Add `String::Split(char delim)` utility returning `vector<string_view>`.

## Public Interfaces
- `EscapeStringJSON(const String&)` -- new canonical name for JSON escaping.
- `EscapeString(const String&)` -- deprecated, forwards to `EscapeStringJSON`.
- `EscapedStringPy(const String&)` -- new Python-style string escaping.
- `String::Split(char)` -- new method returning split segments as string_views.

## Behavioral Changes
- All internal call sites (`function.h`, `registry.h`, `dataclass.cc`,
  `json_writer.cc`) updated to use `EscapeStringJSON` directly.
- No runtime behavior change: `EscapeStringJSON` is the same implementation
  as the previous `EscapeString`.

## Docs
- Doxygen comments added for all new/renamed functions.

## Tests
- No new tests in this commit.

## Untested Edge Cases
- `EscapedStringPy` with malformed UTF-8 input (partial multibyte sequences).
- `String::Split` with empty strings or strings consisting only of delimiters.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors string escaping by renaming EscapeString to EscapeStringJSON and introducing EscapedStringPy for Python-style representation. It also adds a Split method to the String class. Feedback focuses on improving the robustness and consistency of the new escaping logic, specifically regarding std::isdigit safety, ANSI escape sequence formatting, and UTF-8 validation. Additionally, it is suggested to use EscapedStringPy within ReprPrinter to maintain Python-compatible output.

Comment thread include/tvm/ffi/string.h Outdated
Comment thread include/tvm/ffi/string.h Outdated
Comment thread include/tvm/ffi/string.h Outdated
Comment thread src/ffi/extra/dataclass.cc Outdated
Comment thread src/ffi/extra/dataclass.cc Outdated
Copy link
Copy Markdown
Member

@yzh119 yzh119 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have unittests for this feature?

## Architecture
- `EscapedStringPy` in `string.h` is the Python-style string escaping
  utility used by `ReprPrinter` for `__repr__` output of TVM FFI objects.
- `ReprPrinter` in `dataclass.cc` traverses object graphs and formats
  string fields via the escaper.

## Public Interfaces
No changes to public API signatures.

## Behavioral Changes
- Cast `data[j]` to `unsigned char` before `std::isdigit` to avoid
  undefined behavior when char values are negative (string.h:908).
- ANSI escape sequences now render as `\x1b[...` instead of
  `\u001b[...` for Python repr consistency (string.h:912).
- UTF-8 multi-byte decoding now validates continuation bytes
  `(byte & 0xC0) == 0x80`; malformed sequences fall back to
  per-byte `\xNN` escaping instead of silently decoding garbage
  (string.h:957-971).
- `ReprPrinter` now calls `EscapedStringPy` instead of
  `EscapeStringJSON` so that `__repr__` output uses Python-style
  escaping (dataclass.cc:782,815).

## Tests
- Added 7 new GoogleTest cases in `test_string.cc` covering
  `String::Split`, `EscapeStringJSON`, and `EscapedStringPy`
  (basic, control chars, ANSI, UTF-8, malformed UTF-8).
- All 47 string tests pass.

## Untested Edge Cases
- 5-byte and 6-byte overlong UTF-8 sequences (invalid per RFC 3629
  but could appear in arbitrary binary data).
@junrushao
Copy link
Copy Markdown
Member Author

@yzh119 Unittests added

@junrushao junrushao merged commit 2f1a16a into apache:main Apr 14, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants