feat(token): add end_offset field, remove parser source rescanning #403

Merged: dowdiness merged 4 commits into main from feat/token-end-offset on Jun 19, 2026
dowdiness (Owner) commented Jun 19, 2026

Closes #376

Summary

  • Adds end_offset : Int to Token struct in token/token.mbt. The lexer now records the precise exclusive byte offset at which each token ends in the source string.
  • Each tokens.push site in the lexer assigns end_offset using one of two safe patterns:
    • Operator tokens (push happens before offset is advanced): loc.offset + literal_length
    • All other tokens (regex, template, escaped identifiers, numbers, strings): offset (already past the token at push time)
  • Parser::advance now sets self.last_end = tok.end_offset in one line, replacing ~130 lines of scan-helper code that re-derived end positions by rescanning source characters.

Removed entirely: scan_identifier_end, scan_regex_end, scan_template_segment_end, scan_number_end, scan_source_token_end, token_end_offset.
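
The two push patterns and the one-line advance can be sketched as follows. This is a minimal TypeScript illustration rather than the engine's MoonBit code: `LexToken`, `tokenize`, and `ToyParser` are hypothetical simplifications, and the toy lexer only handles spaces, the `=>` operator, and decimal numbers with `_` separators.

```typescript
// Hypothetical, simplified token shape; the real struct lives in token/token.mbt.
interface LexToken {
  kind: string;
  offset: number; // start offset (loc.offset in the PR)
  raw: string;
  endOffset: number; // exclusive end offset (end_offset in the PR)
}

// Toy lexer: handles spaces, "=>", and decimal numbers with "_" separators.
function tokenize(source: string): LexToken[] {
  const tokens: LexToken[] = [];
  let offset = 0;
  while (offset < source.length) {
    const ch = source.charAt(offset);
    if (ch === " ") {
      offset += 1;
    } else if (source.slice(offset, offset + 2) === "=>") {
      // Operator pattern: the push happens before `offset` advances,
      // so the end is the start plus the literal's length.
      tokens.push({ kind: "Arrow", offset, raw: "=>", endOffset: offset + 2 });
      offset += 2;
    } else if (ch >= "0" && ch <= "9") {
      const start = offset;
      let raw = "";
      while (offset < source.length && /[0-9_]/.test(source.charAt(offset))) {
        if (source.charAt(offset) !== "_") raw += source.charAt(offset);
        offset += 1;
      }
      // Scanned-token pattern: `offset` is already past the token at push time.
      tokens.push({ kind: "Number", offset: start, raw, endOffset: offset });
    } else {
      throw new Error("unexpected character at offset " + offset);
    }
  }
  return tokens;
}

// Parser side: advancing copies the lexer-owned end offset, with no rescanning.
class ToyParser {
  private pos = 0;
  lastEnd = 0;
  private readonly tokens: LexToken[];

  constructor(tokens: LexToken[]) {
    this.tokens = tokens;
  }

  advance(): LexToken | undefined {
    const tok = this.tokens[this.pos];
    if (tok !== undefined) {
      this.pos += 1;
      this.lastEnd = tok.endOffset;
    }
    return tok;
  }
}

const demoSource = "=> 1_000";
const demoParser = new ToyParser(tokenize(demoSource));
demoParser.advance(); // the "=>" token
const demoNumber = demoParser.advance(); // the number token, whose raw is "1000"
// Slicing from the token start to lastEnd recovers the source text "1_000".
console.log(demoNumber?.raw, demoSource.slice(demoNumber?.offset ?? 0, demoParser.lastEnd));
```

In this sketch the number token's `raw` is shorter than its source span, which is the situation where a parser-side `offset + raw.length` estimate falls short and a lexer-recorded end offset does not.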

Tests

  • 6 new lexer unit tests in lexer/lexer_test.mbt verify that src.unsafe_substring(start=tok.loc.offset, end=tok.end_offset) recovers the exact source slice for regex, NoSubTemplate, TemplateHead, TemplateTail, unicode-escaped identifiers, and plain identifiers.
  • 2 new integration tests in interpreter/interpreter_test.mbt exercise the full advance → last_end → consume_source_text → Function.prototype.toString path for:
    • Arrow whose body ends with a TemplateTail (`x${y}`: raw is empty, so the old loc.offset + raw.length() would be wrong)
    • Arrow whose body ends with a numeric separator literal (1_000: raw has underscores stripped, so the same issue applies)

Test plan

  • moon check — 0 warnings, 0 errors
  • moon test — 2114/2114 pass (2112 pre-existing + 2 new)
  • CI for full test262 conformance

Summary by CodeRabbit

  • Bug Fixes

    • Fixed Function.prototype.toString() to correctly handle arrow functions with template literal bodies and numeric separators.
  • Tests

    • Added comprehensive test coverage for token position tracking and lexer accuracy.

Move exact token end offsets into the lexer so the parser no longer
rescans source text. Each push site in the lexer now records the precise
byte offset at which the token ends:

- Operator tokens (push-before-advance): `loc.offset + literal_length`
- All other tokens (regex, template, escaped identifiers, numbers,
  strings): `offset` (already past the token at push time)

Token struct gains `end_offset : Int` with a constructor that accepts
`end_offset?` (default -1 derives `loc.offset + raw.length()`, which is
correct for the verbatim-raw case and is only hit by synthetic tokens in
tests).
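
One way to read the constructor default described above, as a hedged TypeScript sketch (the engine itself is MoonBit). `makeToken`, the field names, and the offsets are illustrative assumptions, including the assumption that a TemplateTail token starts at the `}` that closes the substitution.

```typescript
// Hypothetical token shape for illustration only.
interface LexToken {
  kind: string;
  offset: number;
  raw: string;
  endOffset: number;
}

// endOffset defaults to the sentinel -1, which derives offset + raw.length.
function makeToken(kind: string, offset: number, raw: string, endOffset: number = -1): LexToken {
  return {
    kind,
    offset,
    raw,
    endOffset: endOffset === -1 ? offset + raw.length : endOffset,
  };
}

function sliceOf(source: string, tok: LexToken): string {
  return source.slice(tok.offset, tok.endOffset);
}

const numberSrc = "() => 1_000";
// Verbatim raw: the derived default is correct.
const arrowTok = makeToken("Arrow", 3, "=>");
// Stripped raw: the derived default stops one character early.
const numDerived = makeToken("Number", 6, "1000");
const numExact = makeToken("Number", 6, "1000", 11);

const templateSrc = "() => `x${y}`";
// Empty raw: the derived default yields an empty span.
const tailDerived = makeToken("TemplateTail", 11, "");
const tailExact = makeToken("TemplateTail", 11, "", 13);

console.log(sliceOf(numberSrc, arrowTok)); // "=>"
console.log(sliceOf(numberSrc, numDerived), sliceOf(numberSrc, numExact)); // "1_00" vs "1_000"
console.log(sliceOf(templateSrc, tailDerived), sliceOf(templateSrc, tailExact)); // "" vs "}`"
```

The sketch shows why the default is safe only when `raw` is the verbatim source text, and why sites with stripped or empty `raw` need an explicit end offset.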

Parser.advance now does `self.last_end = tok.end_offset` in one line,
replacing ~130 lines of scan helpers (scan_identifier_end,
scan_regex_end, scan_template_segment_end, scan_number_end,
scan_source_token_end, token_end_offset).

Tests: 6 lexer unit tests verify raw-token spans; 2 new
Function.prototype.toString integration tests exercise the full
advance→last_end→consume_source_text path for TemplateTail and numeric
separator literals, the two cases where raw≠source that were not yet
covered end-to-end.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
coderabbitai Bot commented Jun 19, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1a491372-7b2f-45af-b577-2fa23255f559

📥 Commits

Reviewing files that changed from the base of the PR and between 27a6569 and 94fdd21.

📒 Files selected for processing (3)
  • interpreter/interpreter_test.mbt
  • lexer/lexer.mbt
  • lexer/lexer_test.mbt
📝 Walkthrough

Adds an end_offset : Int field to the Token struct, populated by the lexer at every token emission site using the new @token.Token(...) constructor (with explicit offsets for regex, templates, and identifiers). The parser's advance method is simplified to read tok.end_offset directly, removing its rescan helper. Tests cover lexer offset correctness and Function.prototype.toString regressions.

Changes

Token end_offset: lexer-owned source spans

| Layer / File(s) | Summary |
|---|---|
| Token struct: end_offset field and constructors (token/token.mbt, token/pkg.generated.mbti) | Token gains end_offset : Int; Token::Token and Token::new accept optional end_offset? (derived from loc.offset + raw.length() when absent); Token::eof sets end_offset = loc.offset. Generated interface updated to match. |
| Lexer emission: switch to Token constructor with end_offset (lexer/lexer.mbt) | All token emission sites in tokenize are converted from inline object literals to @token.Token(...) calls. end_offset is explicitly passed for regex, template head/middle/tail, identifiers, and dot-leading numerics; operators and most numerics derive it automatically from raw.length(). |
| Parser::advance uses tok.end_offset directly (parser/parser.mbt) | Parser::advance sets last_end from tok.end_offset instead of calling the token_end_offset rescan helper, removing ~141 lines of parser-side offset recomputation. |
| Tests: lexer offsets and interpreter toString (lexer/lexer_test.mbt, interpreter/interpreter_test.mbt) | Lexer tests validate tok.end_offset by slicing source across regex, template, identifier, and numeric separator cases. Interpreter regression tests confirm Function.prototype.toString captures arrow bodies ending in a template tail (closing backtick included) and preserves numeric separators. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main change: adding an end_offset field to Token and removing parser source rescanning, which aligns with the primary objective. |
| Linked Issues check | ✅ Passed | The PR fully addresses all acceptance criteria from #376: Token exposes end_offset, lexer populates it for all required token types, parser uses lexer-owned offsets, existing callers remain safe, and regression tests cover template literals and numeric separators. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the core objective of adding end_offset tracking and removing parser rescanning. Test additions for Function.prototype.toString cases and lexer end_offset verification are appropriately scoped to the feature. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

github-actions Bot (Contributor) commented Jun 19, 2026

Benchmark Results

Run: https://github.com/dowdiness/js_engine/actions/runs/27815620496

startup/tiny_program is the PR #153 / issue #141 guardrail for built-in realm-stamping startup cost.

Stage summary

| stage | benchmarks | total mean | slowest benchmark | slowest mean | noisy rows |
|---|---|---|---|---|---|
| startup | 3 | 2.540 ms | startup/phase/new_interpreter | 1.286 ms | 0 |
| frontend | 7 | 0.873 ms | pipeline/parse_heavy | 0.493 ms | 2 |
| execution | 26 | 14716.718 ms | exec/fibonacci_30 | 13164.870 ms | 2 |

Focused bytecode base-vs-head comparison

Base-vs-head deltas are reporting-only. Negative delta and PR/base < 1.00x mean the PR is faster; interpret high-CV or noisy rows cautiously.

| benchmark | stage | base mean | PR mean | delta | PR/base | base CV | PR CV | noisy |
|---|---|---|---|---|---|---|---|---|
| baseline/bytecode/closure_factory | execution | 13.374 ms | 15.265 ms | +14.1% | 1.14x | 5.9% | 6.9% | no |
| pipeline/bytecode/evaluate | execution | 8.692 ms | 9.071 ms | +4.4% | 1.04x | 1.6% | 4.6% | no |
| isolate/bytecode/call_frame | execution | 7.414 ms | 7.598 ms | +2.5% | 1.02x | 1.0% | 1.0% | no |
| isolate/bytecode/runtime_helpers | execution | 10.640 ms | 10.875 ms | +2.2% | 1.02x | 0.6% | 0.5% | no |
| isolate/bytecode/local_access | execution | 37.137 ms | 39.227 ms | +5.6% | 1.06x | 2.0% | 4.9% | no |
| isolate/bytecode/env_access | execution | 37.733 ms | 38.993 ms | +3.3% | 1.03x | 1.9% | 2.9% | no |
| isolate/bytecode/captured_access | execution | 34.930 ms | 37.577 ms | +7.6% | 1.08x | 1.2% | 1.5% | no |
| isolate/bytecode/dispatch_stack | execution | 22.413 ms | 23.721 ms | +5.8% | 1.06x | 0.4% | 1.7% | no |
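
The delta and PR/base columns are consistent with a relative difference against the base mean. The TypeScript sketch below reproduces two reported rows under that assumption; the formula is inferred from the numbers, not taken from the benchmark script, and `compareMeans` and `formatRow` are hypothetical names.

```typescript
interface BenchComparison {
  deltaPercent: number; // negative means the PR is faster
  ratio: number; // PR mean divided by base mean; below 1.00 means the PR is faster
}

function compareMeans(baseMs: number, prMs: number): BenchComparison {
  return {
    deltaPercent: ((prMs - baseMs) / baseMs) * 100,
    ratio: prMs / baseMs,
  };
}

// Format the way the report prints the two columns, e.g. "+14.1% 1.14x".
function formatRow(baseMs: number, prMs: number): string {
  const c = compareMeans(baseMs, prMs);
  const sign = c.deltaPercent >= 0 ? "+" : "";
  return sign + c.deltaPercent.toFixed(1) + "% " + c.ratio.toFixed(2) + "x";
}

// baseline/bytecode/closure_factory: base 13.374 ms, PR 15.265 ms
console.log(formatRow(13.374, 15.265)); // "+14.1% 1.14x"
// isolate/bytecode/property_set (from the full comparison): base 41.968 ms, PR 40.518 ms
console.log(formatRow(41.968, 40.518)); // "-3.5% 0.97x"
```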

Base-vs-head comparison

| benchmark | stage | base mean | PR mean | delta | PR/base | base CV | PR CV | noisy |
|---|---|---|---|---|---|---|---|---|
| startup/tiny_program | startup | 1.130 ms | 1.254 ms | +10.9% | 1.11x | 7.2% | 3.6% | no |
| lexer/small | frontend | 0.031 ms | 0.030 ms | -2.1% | 0.98x | 43.2% | 25.1% | base, PR |
| lexer/large | frontend | 0.264 ms | 0.269 ms | +1.8% | 1.02x | 0.8% | 0.7% | no |
| exec/fibonacci_30 | execution | 13188.677 ms | 13164.870 ms | -0.2% | 1.00x | 0.8% | 1.1% | no |
| exec/property_chain | execution | 13.612 ms | 16.542 ms | +21.5% | 1.22x | 9.3% | 12.1% | no |
| startup/phase/parse_tiny | frontend | 0.002 ms | 0.002 ms | -0.4% | 1.00x | 2.1% | 0.9% | no |
| startup/phase/new_interpreter | startup | 1.114 ms | 1.286 ms | +15.4% | 1.15x | 13.0% | 8.3% | no |
| startup/phase/execute_preparsed_tiny | execution | 0.000 ms | 0.000 ms | +1.3% | 1.01x | 0.8% | 0.9% | no |
| startup/phase/event_loop_drain_empty | startup | 0.000 ms | 0.000 ms | +6.0% | 1.06x | 0.9% | 1.0% | no |
| startup/phase/result_stringify_output | execution | 0.000 ms | 0.000 ms | +2.0% | 1.02x | 0.5% | 1.8% | no |
| exec/array_map_filter | execution | 18.287 ms | 20.936 ms | +14.5% | 1.14x | 17.5% | 20.5% | base, PR |
| exec/closure_factory | execution | 27.776 ms | 31.245 ms | +12.5% | 1.12x | 6.0% | 6.9% | no |
| baseline/closure_legacy/closure_factory | execution | 27.347 ms | 29.885 ms | +9.3% | 1.09x | 9.8% | 8.8% | no |
| baseline/bytecode/closure_factory | execution | 13.374 ms | 15.265 ms | +14.1% | 1.14x | 5.9% | 6.9% | no |
| isolate/bytecode/dispatch_stack | execution | 22.413 ms | 23.721 ms | +5.8% | 1.06x | 0.4% | 1.7% | no |
| isolate/bytecode/local_access | execution | 37.137 ms | 39.227 ms | +5.6% | 1.06x | 2.0% | 4.9% | no |
| isolate/bytecode/env_access | execution | 37.733 ms | 38.993 ms | +3.3% | 1.03x | 1.9% | 2.9% | no |
| isolate/bytecode/captured_access | execution | 34.930 ms | 37.577 ms | +7.6% | 1.08x | 1.2% | 1.5% | no |
| isolate/bytecode/call_frame | execution | 7.414 ms | 7.598 ms | +2.5% | 1.02x | 1.0% | 1.0% | no |
| isolate/bytecode/runtime_helpers | execution | 10.640 ms | 10.875 ms | +2.2% | 1.02x | 0.6% | 0.5% | no |
| isolate/bytecode/property_get | execution | 43.922 ms | 43.782 ms | -0.3% | 1.00x | 2.1% | 1.9% | no |
| isolate/bytecode/property_set | execution | 41.968 ms | 40.518 ms | -3.5% | 0.97x | 0.9% | 0.7% | no |
| isolate/bytecode/method_call | execution | 8.234 ms | 8.660 ms | +5.2% | 1.05x | 0.4% | 3.1% | no |
| isolate/bytecode/object_literal | execution | 13.435 ms | 13.318 ms | -0.9% | 0.99x | 1.7% | 3.1% | no |
| isolate/bytecode/array_literal | execution | 14.680 ms | 14.550 ms | -0.9% | 0.99x | 2.7% | 1.0% | no |
| exec/for_of | execution | 5.365 ms | 5.548 ms | +3.4% | 1.03x | 7.8% | 3.7% | no |
| exec/arithmetic_loop | execution | 855.128 ms | 1083.022 ms | +26.7% | 1.27x | 0.2% | 0.5% | no |
| exec/object_construction | execution | 6.971 ms | 7.525 ms | +8.0% | 1.08x | 4.2% | 6.2% | no |
| exec/string_ops | execution | 1.898 ms | 2.110 ms | +11.2% | 1.11x | 27.0% | 24.6% | base, PR |
| pipeline/exec/lex | frontend | 0.028 ms | 0.028 ms | +1.9% | 1.02x | 2.6% | 0.9% | no |
| pipeline/exec/parse | frontend | 0.028 ms | 0.028 ms | -1.5% | 0.99x | 6.4% | 2.8% | no |
| pipeline/exec/evaluate | execution | 24.865 ms | 26.406 ms | +6.2% | 1.06x | 4.9% | 12.9% | no |
| pipeline/closure_legacy/evaluate | execution | 24.433 ms | 25.475 ms | +4.3% | 1.04x | 4.4% | 4.3% | no |
| pipeline/bytecode/compile | frontend | 0.022 ms | 0.023 ms | +4.1% | 1.04x | 33.7% | 34.2% | base, PR |
| pipeline/bytecode/evaluate | execution | 8.692 ms | 9.071 ms | +4.4% | 1.04x | 1.6% | 4.6% | no |
| pipeline/parse_heavy | frontend | 0.505 ms | 0.493 ms | -2.5% | 0.98x | 6.5% | 4.7% | no |

Mean-time chart (log scale)

benchmark stage mean chart
startup/tiny_program startup 1.254 ms ##
lexer/small frontend 0.030 ms ⚠ #
lexer/large frontend 0.269 ms #
exec/fibonacci_30 execution 13164.870 ms ##############################
exec/property_chain execution 16.542 ms #########
startup/phase/parse_tiny frontend 0.002 ms #
startup/phase/new_interpreter startup 1.286 ms ##
startup/phase/execute_preparsed_tiny execution 0.000 ms #
startup/phase/event_loop_drain_empty startup 0.000 ms #
startup/phase/result_stringify_output execution 0.000 ms #
exec/array_map_filter execution 20.936 ms ⚠ #########
exec/closure_factory execution 31.245 ms ##########
baseline/closure_legacy/closure_factory execution 29.885 ms ##########
baseline/bytecode/closure_factory execution 15.265 ms ########
isolate/bytecode/dispatch_stack execution 23.721 ms ##########
isolate/bytecode/local_access execution 39.227 ms ###########
isolate/bytecode/env_access execution 38.993 ms ###########
isolate/bytecode/captured_access execution 37.577 ms ###########
isolate/bytecode/call_frame execution 7.598 ms ######
isolate/bytecode/runtime_helpers execution 10.875 ms #######
isolate/bytecode/property_get execution 43.782 ms ############
isolate/bytecode/property_set execution 40.518 ms ###########
isolate/bytecode/method_call execution 8.660 ms #######
isolate/bytecode/object_literal execution 13.318 ms ########
isolate/bytecode/array_literal execution 14.550 ms ########
exec/for_of execution 5.548 ms #####
exec/arithmetic_loop execution 1083.022 ms ######################
exec/object_construction execution 7.525 ms ######
exec/string_ops execution 2.110 ms ⚠ ###
pipeline/exec/lex frontend 0.028 ms #
pipeline/exec/parse frontend 0.028 ms #
pipeline/exec/evaluate execution 26.406 ms ##########
pipeline/closure_legacy/evaluate execution 25.475 ms ##########
pipeline/bytecode/compile frontend 0.023 ms ⚠ #
pipeline/bytecode/evaluate execution 9.071 ms #######
pipeline/parse_heavy frontend 0.493 ms #

Closure-conversion comparison

  • unavailable

dowdiness and others added 2 commits June 19, 2026 16:15
…ructor

All 47 tokens.push({ ... }) calls converted to @token.Token() constructor
calls, following the MoonBit convention of using custom constructors over
record literals.

Key simplifications:
- Drops lex_form: LexNormal from 46 sites (it is the default)
- Drops end_offset: loc.offset + N from 33 operator push sites (the
  constructor sentinel -1 derives the same value as loc.offset + raw.length())
- 3 hex/binary/octal number sites also drop end_offset (raw = raw_slice so
  the default is correct)
- 5 template + 1 regex sites keep explicit end_offset=offset (raw="" so
  default gives loc.offset ≠ offset)
- 1 escaped-identifier site keeps end_offset=offset (raw=decoded may be
  shorter than source span for unicode-escape sequences)
- 1 sep-number site keeps end_offset=offset (raw=num_str has underscores
  stripped)
- 2 decimal-number/string sites pass lex_form~ (variable, not always LexNormal)
  and drop end_offset (raw=raw_slice, default is correct)

Net: -329 lines / +69 lines in lexer.mbt. No semantic change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and substring check

- Plain identifier test now verifies unsafe_substring recovers source text,
  matching the structural depth of the unicode-escape sibling test.
- New test for numeric separator (1_000): raw="1000" (stripped) but
  end_offset must span the full source "1_000"; this is the one push site
  where the default sentinel loc.offset+raw.length() would be wrong, making
  the explicit end_offset=offset argument load-bearing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27a656921f

Comment thread lexer/lexer.mbt Outdated
  resolve_keyword(decoded)
  }
- tokens.push({ kind, loc, raw: decoded, lex_form: LexNormal })
+ tokens.push(@token.Token(kind, loc, decoded, end_offset=offset))

[P2] Use UTF-16 end offsets for direct astral identifiers

For identifiers that contain a direct supplementary-plane character, e.g. var 𝒜 = 1; console.log((() => 𝒜).toString()), offset has only advanced by one because it is counted over source.to_array() Unicode characters, while consume_source_text later passes end_offset to String::unsafe_substring, whose offsets are UTF-16 code units. Before this change, the parser fell back to loc.offset + tok.raw.length() for direct identifiers and kept both UTF-16 units; now end_offset=offset truncates the source text for such identifiers, so Function.prototype.toString can return a malformed or shortened body.
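
The mismatch between the two index spaces can be reproduced in a few lines of TypeScript, whose strings are also indexed in UTF-16 code units. This is an illustrative sketch, not the engine's code; `codePointCount` is a hypothetical helper standing in for an offset that advances once per Unicode character.

```typescript
// Count Unicode code points, treating each valid surrogate pair as one character.
function codePointCount(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i += 1) {
    const unit = s.charCodeAt(i);
    const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    const isPair = unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isPair) i += 1; // skip the low surrogate
    count += 1;
  }
  return count;
}

const astralChar = "\uD835\uDC9C"; // 𝒜 (U+1D49C), one code point, two UTF-16 units
const astralSource = "() => " + astralChar;
const identStart = 6; // the prefix is ASCII, so both index spaces agree here

const endInCodePoints = codePointCount(astralSource); // 7
const endInUtf16 = astralSource.length; // 8

// Slicing with the code-point end keeps only the high surrogate.
const truncated = astralSource.slice(identStart, endInCodePoints);
// Slicing with the UTF-16 end recovers the whole identifier.
const full = astralSource.slice(identStart, endInUtf16);

console.log(endInCodePoints, endInUtf16, truncated.length, full === astralChar);
```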

…tifiers

For identifiers that contain direct astral-plane characters (U+10000–U+10FFFF,
e.g. 𝒜), the lexer's `offset` variable counts code points (one per Unicode
scalar), but `String::unsafe_substring` uses UTF-16 code-unit indices.
A surrogate-pair char like 𝒜 occupies 1 code point but 2 UTF-16 units,
so `end_offset=offset` was short by 1, causing `consume_source_text` to
return a half-surrogate or truncated body for Function.prototype.toString.

Fix: when there are no escape sequences (`has_escape=false`), compute
end_offset as `loc.offset + decoded.length()` where `decoded.length()` gives
UTF-16 code units — matching the index space of `unsafe_substring`. When
`has_escape=true`, all escape sequences are ASCII (\uXXXX / \u{XXXXX}) so
code-point count == UTF-16 count and `offset` remains correct.

Adds a lexer unit test (direct astral char, verifies end_offset=2 and full
substring recovery) and an integration test for Function.prototype.toString.
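
The branch described in this commit message can be sketched in TypeScript as below. The function name and parameters are hypothetical, and the sketch assumes the token's start offset is already a valid UTF-16 index, which holds in these examples because each identifier starts at offset 0.

```typescript
// Pick an identifier's exclusive end offset in UTF-16 code units.
function identifierEndOffset(
  startOffset: number,
  decoded: string,
  scanOffset: number, // lexer position after the identifier, counted in code points
  hasEscape: boolean,
): number {
  // With escapes, the commit message treats the source span as ASCII, so the
  // code-point position equals the UTF-16 position.
  // Without escapes, decoded is the source text itself, so its UTF-16 length
  // is the length of the span.
  return hasEscape ? scanOffset : startOffset + decoded.length;
}

const astralName = "\uD835\uDC9C"; // 𝒜 written directly in the source
const directEnd = identifierEndOffset(0, astralName, 1, false); // 2

const escapedSource = "\\u0061b"; // the source text \u0061b, which decodes to "ab"
const escapedEnd = identifierEndOffset(0, "ab", escapedSource.length, true); // 7

console.log(astralName.slice(0, directEnd) === astralName); // true
console.log(escapedSource.slice(0, escapedEnd) === escapedSource); // true
```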

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94fdd219cd

Comment thread lexer/lexer.mbt
Regex(pattern_buf.to_string(), flags_buf.to_string()),
rloc,
"",
end_offset=offset,

[P2] Convert regex end offsets to UTF-16 units

When a regex literal's pattern contains a direct supplementary-plane character, offset has advanced by one source character while consume_source_text later uses String::unsafe_substring with UTF-16 offsets. For example, an arrow body ending in /𝒜/u records an end_offset one code unit short, so Function.prototype.toString() drops the final flag or closing slash instead of preserving the original source.

Comment thread lexer/lexer.mbt
lex_form: LexNormal,
})
tokens.push(
@token.Token(NoSubTemplate(text), tloc, "", end_offset=offset),

[P2] Convert template end offsets to UTF-16 units

When a template literal segment contains a direct supplementary-plane character, scan_template_string advances offset by one code point, but end_offset is later consumed as a UTF-16 offset. For example, (() => `𝒜`).toString() records the NoSubTemplate end one code unit too early and loses the closing backtick; the same end_offset=offset pattern is used for the other template token branches too.
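
A generic conversion of the kind these two review comments ask for could look like the TypeScript sketch below. `codePointOffsetToUtf16` is a hypothetical helper, not something the PR adds, and the example offsets are worked out by hand for the regex case.

```typescript
// Map an offset counted in code points to an offset in UTF-16 code units.
function codePointOffsetToUtf16(source: string, codePointOffset: number): number {
  let units = 0;
  let seen = 0;
  while (seen < codePointOffset && units < source.length) {
    const unit = source.charCodeAt(units);
    const next = units + 1 < source.length ? source.charCodeAt(units + 1) : 0;
    const isPair = unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    units += isPair ? 2 : 1; // a surrogate pair is one code point but two units
    seen += 1;
  }
  return units;
}

const regexSource = "() => /\uD835\uDC9C/u"; // the arrow function () => /𝒜/u
const regexStart = 6; // ASCII prefix, so the start is the same in both index spaces
const regexEndInCodePoints = 10; // where a per-character offset ends up

// Using the code-point end as a UTF-16 index drops the trailing flag.
const truncatedRegex = regexSource.slice(regexStart, regexEndInCodePoints);
// Converting first recovers the whole literal.
const regexEnd = codePointOffsetToUtf16(regexSource, regexEndInCodePoints); // 11
const exactRegex = regexSource.slice(regexStart, regexEnd);

console.log(truncatedRegex.length, exactRegex.length); // 4 5
```

The helper walks the source from the beginning on every call, so it is linear in the offset; a lexer that needs this for every token would more plausibly keep a running UTF-16 position alongside its character position.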

dowdiness merged commit b06d3d6 into main on Jun 19, 2026 (14 checks passed)
dowdiness deleted the feat/token-end-offset branch on June 19, 2026 09:10
Development

Successfully merging this pull request may close these issues.

Lexer should expose exact token end offsets for source-text capture
