feat(token): add end_offset field, remove parser source rescanning#403
Conversation
) Move exact token end offsets into the lexer so the parser no longer rescans source text. Each push site in the lexer now records the precise byte offset at which the token ends: - Operator tokens (push-before-advance): `loc.offset + literal_length` - All other tokens (regex, template, escaped identifiers, numbers, strings): `offset` (already past the token at push time) Token struct gains `end_offset : Int` with a constructor that accepts `end_offset?` (default -1 derives `loc.offset + raw.length()`, which is correct for the verbatim-raw case and is only hit by synthetic tokens in tests). Parser.advance now does `self.last_end = tok.end_offset` in one line, replacing ~130 lines of scan helpers (scan_identifier_end, scan_regex_end, scan_template_segment_end, scan_number_end, scan_source_token_end, token_end_offset). Tests: 6 lexer unit tests verify raw-token spans; 2 new Function.prototype.toString integration tests exercise the full advance→last_end→consume_source_text path for TemplateTail and numeric separator literals, the two cases where raw≠source that were not yet covered end-to-end. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 36 minutes and 53 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughAdds an ChangesToken end_offset: lexer-owned source spans
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Benchmark ResultsRun: https://github.com/dowdiness/js_engine/actions/runs/27815620496
Stage summary
Focused bytecode base-vs-head comparison
Base-vs-head comparison
Mean-time chart (log scale)
Closure-conversion comparison
|
…ructor
All 47 tokens.push({ ... }) calls converted to @token.Token() constructor
calls, following the MoonBit convention of using custom constructors over
record literals.
Key simplifications:
- Drops lex_form: LexNormal from 46 sites (it is the default)
- Drops end_offset: loc.offset + N from 33 operator push sites (the
constructor sentinel -1 derives the same value as loc.offset + raw.length())
- 3 hex/binary/octal number sites also drop end_offset (raw = raw_slice so
the default is correct)
- 5 template + 1 regex sites keep explicit end_offset=offset (raw="" so
default gives loc.offset ≠ offset)
- 1 escaped-identifier site keeps end_offset=offset (raw=decoded may be
shorter than source span for unicode-escape sequences)
- 1 sep-number site keeps end_offset=offset (raw=num_str has underscores
stripped)
- 2 decimal-number/string sites pass lex_form~ (variable, not always LexNormal)
and drop end_offset (raw=raw_slice, default is correct)
Net: -329 lines / +69 lines in lexer.mbt. No semantic change.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and substring check - Plain identifier test now verifies unsafe_substring recovers source text, matching the structural depth of the unicode-escape sibling test. - New test for numeric separator (1_000): raw="1000" (stripped) but end_offset must span the full source "1_000"; this is the one push site where the default sentinel loc.offset+raw.length() would be wrong, making the explicit end_offset=offset argument load-bearing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 27a656921f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| resolve_keyword(decoded) | ||
| } | ||
| tokens.push({ kind, loc, raw: decoded, lex_form: LexNormal }) | ||
| tokens.push(@token.Token(kind, loc, decoded, end_offset=offset)) |
There was a problem hiding this comment.
Use UTF-16 end offsets for direct astral identifiers
For identifiers that contain a direct supplementary-plane character, e.g. var 𝒜 = 1; console.log((() => 𝒜).toString()), offset has only advanced by one because it is counted over source.to_array() Unicode characters, while consume_source_text later passes end_offset to String::unsafe_substring, whose offsets are UTF-16 code units. Before this change, the parser fell back to loc.offset + tok.raw.length() for direct identifiers and kept both UTF-16 units; now end_offset=offset truncates the source text for such identifiers, so Function.prototype.toString can return a malformed or shortened body.
Useful? React with 👍 / 👎.
…tifiers
For identifiers that contain direct astral-plane characters (U+10000–U+10FFFF,
e.g. 𝒜), the lexer's `offset` variable counts code points (one per Unicode
scalar), but `String::unsafe_substring` uses UTF-16 code-unit indices.
A surrogate-pair char like 𝒜 occupies 1 code point but 2 UTF-16 units,
so `end_offset=offset` was short by 1, causing `consume_source_text` to
return a half-surrogate or truncated body for Function.prototype.toString.
Fix: when there are no escape sequences (`has_escape=false`), compute
end_offset as `loc.offset + decoded.length()` where `decoded.length()` gives
UTF-16 code units — matching the index space of `unsafe_substring`. When
`has_escape=true`, all escape sequences are ASCII (\uXXXX / \u{XXXXX}) so
code-point count == UTF-16 count and `offset` remains correct.
Adds a lexer unit test (direct astral char, verifies end_offset=2 and full
substring recovery) and an integration test for Function.prototype.toString.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 94fdd219cd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| Regex(pattern_buf.to_string(), flags_buf.to_string()), | ||
| rloc, | ||
| "", | ||
| end_offset=offset, |
There was a problem hiding this comment.
Convert regex end offsets to UTF-16 units
When a regex literal's pattern contains a direct supplementary-plane character, offset has advanced by one source character while consume_source_text later uses String::unsafe_substring with UTF-16 offsets. For example, an arrow body ending in /𝒜/u records an end_offset one code unit short, so Function.prototype.toString() drops the final flag or closing slash instead of preserving the original source.
Useful? React with 👍 / 👎.
| lex_form: LexNormal, | ||
| }) | ||
| tokens.push( | ||
| @token.Token(NoSubTemplate(text), tloc, "", end_offset=offset), |
There was a problem hiding this comment.
Convert template end offsets to UTF-16 units
When a template literal segment contains a direct supplementary-plane character, scan_template_string advances offset by one code point, but end_offset is later consumed as a UTF-16 offset. For example, (() => 𝒜).toString() records the NoSubTemplate end one code unit too early and loses the closing backtick; the same end_offset=offset pattern is used for the other template token branches too.
Useful? React with 👍 / 👎.
Closes #376
Summary
end_offset : InttoTokenstruct intoken/token.mbt. The lexer now records the precise exclusive byte offset at which each token ends in the source string.tokens.pushsite in the lexer assignsend_offsetusing one of two safe patterns:offsetis advanced):loc.offset + literal_lengthoffset(already past the token at push time)Parser::advancenow setsself.last_end = tok.end_offsetin one line, replacing ~130 lines of scan-helper code that re-derived end positions by rescanning source characters.Removed entirely:
scan_identifier_end,scan_regex_end,scan_template_segment_end,scan_number_end,scan_source_token_end,token_end_offset.Tests
lexer/lexer_test.mbtverify thatsrc.unsafe_substring(start=tok.loc.offset, end=tok.end_offset)recovers the exact source slice for regex, NoSubTemplate, TemplateHead, TemplateTail, unicode-escaped identifiers, and plain identifiers.interpreter/interpreter_test.mbtexercise the fulladvance → last_end → consume_source_text → Function.prototype.toStringpath for:`x${y}`—rawis empty, so oldloc.offset + raw.length()would be wrong)1_000—rawhas underscores stripped, so same issue)Test plan
moon check— 0 warnings, 0 errorsmoon test— 2114/2114 pass (2112 pre-existing + 2 new)Summary by CodeRabbit
Bug Fixes
Function.prototype.toString()to correctly handle arrow functions with template literal bodies and numeric separators.Tests