Add ES2023 tokenization #9

arhadthedev · 2024-03-24T16:34:23Z

For starters, it can be a Tokenizer struct right inside the crate.

The struct should recognize the lexical grammar described in https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar.

The struct should provide methods to switch between goal symbols on the fly. The spec section linked above gives the following reasoning for such a feature:

There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar. The InputElementRegExpOrTemplateTail goal is used in syntactic grammar contexts where a RegularExpressionLiteral, a TemplateMiddle, or a TemplateTail is permitted.

Syntactic grammar contexts are defined by the parser state, so it is the parser that needs to switch the current lexical goal symbol to adjust tokenization.

The user entry point is a Tokenizer object with its get_next_symbol method:

/// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
///
/// > There are several situations where the identification of lexical input
/// > elements is sensitive to the syntactic grammar context that is consuming
/// > the input elements. This requires multiple goal symbols for the lexical
/// > grammar.
#[derive(Debug)] // required because `Tokenizer` below derives `Debug`
pub enum GoalSymbols {
    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used at the start of a Script or Module
    InputElementHashbangOrRegExp,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used in syntactic grammar contexts where a `RegularExpressionLiteral`,
    /// > a `TemplateMiddle`, or a `TemplateTail` is permitted.
    InputElementRegExpOrTemplateTail,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used in all syntactic grammar contexts where a
    /// > `RegularExpressionLiteral` is permitted but neither a
    /// > `TemplateMiddle`, nor a `TemplateTail` is permitted.
    InputElementRegExp,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > Used in all syntactic grammar contexts where a `TemplateMiddle` or a
    /// > `TemplateTail` is permitted but a `RegularExpressionLiteral` is
    /// > not permitted
    InputElementTemplateTail,

    /// From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:
    ///
    /// > In all other contexts, [...] used as the lexical goal symbol.
    InputElementDiv
}

#[derive(Debug)]
pub struct Tokenizer {
    /// The lexical grammar goal symbol currently in effect
    current_goal: GoalSymbols,
}
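
To illustrate why the parser has to switch goals, here is a minimal sketch of a tokenizer that treats `/` differently depending on the current goal. Everything in it is an assumption made for illustration: the names `SketchTokenizer`, `Goal`, `Token`, and `set_goal` are hypothetical, only two goals are modeled, and the regular expression scan is deliberately naive (no escapes, classes, or flags).

```rust
/// Two of the lexical goals, enough to show the `/` ambiguity (hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Goal {
    InputElementDiv,
    InputElementRegExp,
}

/// Hypothetical token kinds; a real token set is much larger.
#[derive(Debug, PartialEq, Eq)]
pub enum Token<'a> {
    DivPunctuator,
    RegExpLiteral(&'a str),
    Other(&'a str),
}

pub struct SketchTokenizer<'a> {
    rest: &'a str,
    goal: Goal,
}

impl<'a> SketchTokenizer<'a> {
    pub fn new(source: &'a str) -> Self {
        Self { rest: source, goal: Goal::InputElementDiv }
    }

    /// Called by the parser whenever its state changes what may come next.
    pub fn set_goal(&mut self, goal: Goal) {
        self.goal = goal;
    }

    pub fn get_next_symbol(&mut self) -> Option<Token<'a>> {
        let src = self.rest.trim_start();
        if src.is_empty() {
            self.rest = src;
            return None;
        }
        if src.starts_with('/') {
            if self.goal == Goal::InputElementRegExp {
                // Naive scan for the closing slash of a regular expression literal.
                if let Some(end) = src[1..].find('/') {
                    let (literal, tail) = src.split_at(end + 2);
                    self.rest = tail;
                    return Some(Token::RegExpLiteral(literal));
                }
            }
            // Under InputElementDiv (or with no closing slash) `/` is a punctuator.
            self.rest = &src[1..];
            return Some(Token::DivPunctuator);
        }
        // Anything else runs up to the next whitespace or slash.
        let len = src
            .find(|c: char| c.is_whitespace() || c == '/')
            .unwrap_or(src.len());
        let (text, tail) = src.split_at(len);
        self.rest = tail;
        Some(Token::Other(text))
    }
}
```

With the same input `/ab/`, the sketch yields one `RegExpLiteral` after `set_goal(Goal::InputElementRegExp)` and the sequence `DivPunctuator`, `Other("ab")`, `DivPunctuator` under the default goal.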

Grammar parameter implementation

For notation like Production[Param1, Param2], add the following comment to the code:

Notes on how specification features are mapped into YACC features:

grammatical parameters (like Nonterminal[Param1, Param2]) are implemented as a special FecerBrowser_...KeywordUsage static semantics returning a boolean, one for each pair of a left side parameter and its right side usage

lookahead restriction

[No Line Terminator Here]

automatic semicolon insertion

Cover grammar

Also add a note on what cover grammar is.

Todo list:

`get_next_token` tries to get the unparsed tail by chopping off a recognized token no matter whether we got the token or a parse error. As a result, a parse error leads to a panic. This PR moves the tail extraction into the processing of a successful parse. It also adds a regression test that uses the `claims` crate.
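
The shape of that fix can be sketched as follows. This is only an illustration under assumptions: `recognize_number` and `ParseError` are hypothetical stand-ins for the real recognizer and error type, and plain `assert!`-style checks would stand in for the `claims` crate mentioned above.

```rust
/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct ParseError {
    pub offset: usize,
}

/// Hypothetical recognizer: accepts a leading run of ASCII digits and
/// returns the matched length in bytes.
fn recognize_number(src: &str) -> Result<usize, ParseError> {
    let len = src.bytes().take_while(u8::is_ascii_digit).count();
    if len == 0 {
        Err(ParseError { offset: 0 })
    } else {
        Ok(len)
    }
}

/// Returns the recognized token together with the unparsed tail.
pub fn get_next_token(src: &str) -> Result<(&str, &str), ParseError> {
    // The tail is computed inside `map`, so it is only touched on success;
    // an error is passed through unchanged instead of causing a panic.
    recognize_number(src).map(|len| src.split_at(len))
}
```

For example, `get_next_token("12+3")` returns the pair `("12", "+3")`, while `get_next_token("+3")` returns the error without attempting to slice the input.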

Prepare the parser for exposing the extra goal symbols `InputElementRegExp`, `InputElementRegExpOrTemplateTail`, and `InputElementHashbangOrRegExp` by implementing unified logic for processing them all. Since the goal symbols mostly share the non-terminals on their right-hand sides, we can create an enum listing all possible right-side non-terminals. This lets us write per-goal logic that copies into the unified enum, and slightly modify the existing `extract_token` to make it a shared processor of the unified enum. For more details and the ultimate purpose, see the description of `GoalSymbols` in the parent issue.
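
A rough sketch of that idea, not the actual implementation: the per-goal enums, `UnifiedElement`, and the signature of `extract_token` below are all assumptions, only two goals are shown, and whitespace, line terminators, and comments are omitted.

```rust
/// Union of the right-side non-terminals used by the goals in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum UnifiedElement {
    CommonToken(String),
    DivPunctuator(String),
    RightBracePunctuator,
    RegularExpressionLiteral(String),
}

/// What the InputElementDiv goal may produce (subset).
pub enum InputElementDiv {
    CommonToken(String),
    DivPunctuator(String),
    RightBracePunctuator,
}

/// What the InputElementRegExp goal may produce (subset).
pub enum InputElementRegExp {
    CommonToken(String),
    RightBracePunctuator,
    RegularExpressionLiteral(String),
}

// Per-goal logic is reduced to copying into the unified enum.
impl From<InputElementDiv> for UnifiedElement {
    fn from(element: InputElementDiv) -> Self {
        match element {
            InputElementDiv::CommonToken(text) => Self::CommonToken(text),
            InputElementDiv::DivPunctuator(text) => Self::DivPunctuator(text),
            InputElementDiv::RightBracePunctuator => Self::RightBracePunctuator,
        }
    }
}

impl From<InputElementRegExp> for UnifiedElement {
    fn from(element: InputElementRegExp) -> Self {
        match element {
            InputElementRegExp::CommonToken(text) => Self::CommonToken(text),
            InputElementRegExp::RightBracePunctuator => Self::RightBracePunctuator,
            InputElementRegExp::RegularExpressionLiteral(text) => {
                Self::RegularExpressionLiteral(text)
            }
        }
    }
}

/// Shared processor: works for any goal that converts into the unified enum.
pub fn extract_token(element: impl Into<UnifiedElement>) -> String {
    match element.into() {
        UnifiedElement::CommonToken(text)
        | UnifiedElement::DivPunctuator(text)
        | UnifiedElement::RegularExpressionLiteral(text) => text,
        UnifiedElement::RightBracePunctuator => "}".to_string(),
    }
}
```

Adding another goal then only requires one more small enum and one more `From` implementation; the shared processor stays untouched.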

According to the ECMAScript specification, ReservedWord always follows IdentifierName (included in CommonToken) in the grammar definition (<https://262.ecma-international.org/14.0/#sec-names-and-keywords>):

> The syntactic grammar defines Identifier as an IdentifierName that is not a ReservedWord.

So every goal symbol that supports CommonToken now supports ReservedWord as well.
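
As a small illustration of the quoted rule, the sketch below classifies an already recognized IdentifierName. The names `NameToken` and `classify` are hypothetical, and the sketch ignores the contexts in which some reserved words (such as `await` and `yield`) are conditionally allowed as identifiers.

```rust
/// ReservedWord list from ECMA-262, 14th edition.
const RESERVED_WORDS: &[&str] = &[
    "await", "break", "case", "catch", "class", "const", "continue",
    "debugger", "default", "delete", "do", "else", "enum", "export",
    "extends", "false", "finally", "for", "function", "if", "import", "in",
    "instanceof", "new", "null", "return", "super", "switch", "this",
    "throw", "true", "try", "typeof", "var", "void", "while", "with",
    "yield",
];

/// Hypothetical result of classifying an IdentifierName.
#[derive(Debug, PartialEq, Eq)]
pub enum NameToken<'a> {
    ReservedWord(&'a str),
    Identifier(&'a str),
}

/// An Identifier is an IdentifierName that is not a ReservedWord.
pub fn classify(identifier_name: &str) -> NameToken<'_> {
    if RESERVED_WORDS.contains(&identifier_name) {
        NameToken::ReservedWord(identifier_name)
    } else {
        NameToken::Identifier(identifier_name)
    }
}
```

Here `classify("while")` gives `ReservedWord("while")`, whereas `classify("whilst")` gives `Identifier("whilst")`.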

From <https://262.ecma-international.org/14.0/#sec-ecmascript-language-lexical-grammar>:

> There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar.

arhadthedev added labels feature (Addition of something new), spec: es2023 (Conformance to https://262.ecma-international.org/14.0/), part: parser (File-to-tree conversion) Mar 24, 2024

arhadthedev changed the title ~~Add tokenization~~ Add ES2023 tokenization Mar 24, 2024

arhadthedev mentioned this issue Mar 26, 2024

gh-9: Add SourceCodeError struct #10

Merged

arhadthedev added a commit that referenced this issue Mar 26, 2024

gh-9: Add SourceCodeError struct (#10)

68e7662

We start with the way to report syntax errors.

arhadthedev self-assigned this Mar 27, 2024

arhadthedev added this to the Lexer and parser modules only milestone Mar 27, 2024

arhadthedev mentioned this issue Apr 6, 2024

gh-9: Move with_term test utility to _tokenizer #47

Merged

arhadthedev added a commit that referenced this issue Apr 6, 2024

gh-9: Move with_term test utility to _tokenizer (#47)

8721992

arhadthedev mentioned this issue Apr 7, 2024

gh-9: Merge all TerminalCase into a single, global one #49

Merged

arhadthedev added a commit that referenced this issue Apr 7, 2024

gh-9: Merge all TerminalCase into a single, global one (#49)

af8bfd4

Now we have a global sample table of the tokens the whole lexer (tokenizer) can process, and all tests refer to it as their parameter type for substitution.

arhadthedev mentioned this issue Apr 7, 2024

gh-9: Replace push-based with_term with pull-based generate_cases #50

Merged

arhadthedev added a commit that referenced this issue Apr 7, 2024

gh-9: Replace push-based with_term with pull-based generate_cases (#50)

6b6d186

arhadthedev mentioned this issue Apr 11, 2024

gh-9: Simplify tuple parameter usage #57

Merged

arhadthedev added a commit that referenced this issue Apr 11, 2024

gh-9: Simplify tuple parameter usage (#57)

aef4909

Replace `|foo| foo.1` with `|(_, bar)| bar`. Numeric fields no longer hide their meaning; we name the fields at the point of use.
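
A tiny sketch of the two closure forms side by side; the function names and the tuple layout are made up for illustration.

```rust
/// Before: the numeric field hides what is being selected.
pub fn texts_by_index(cases: Vec<(u32, &str)>) -> Vec<&str> {
    cases.into_iter().map(|case| case.1).collect()
}

/// After: the tuple is destructured in the closure parameter,
/// so the selected field gets a name right where it is used.
pub fn texts_by_name(cases: Vec<(u32, &str)>) -> Vec<&str> {
    cases.into_iter().map(|(_, text)| text).collect()
}
```

Both functions return the same values; only the readability of the closure differs.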

arhadthedev mentioned this issue Apr 14, 2024

gh-9: Add explanation of ECMA-262 grammar terminology #66

Merged

arhadthedev added a commit that referenced this issue Apr 14, 2024

gh-9: Add explanation of ECMA-262 grammar terminology (#66)

b362787

This was referenced Jun 5, 2024

Convert manual lexer bits to pest #76

Closed

gh-9: Fix tail extraction attempt on parse error in get_next_token #86

Merged

arhadthedev added a commit that referenced this issue Jun 18, 2024

gh-9: Add more goal symbols as an internal detail

d72f912

arhadthedev mentioned this issue Jun 18, 2024

gh-9: Add more goal symbols as an internal detail #106

Merged

arhadthedev mentioned this issue Jun 21, 2024

gh-9: Add ReservedWord everywhere after CommonToken #107

Merged

arhadthedev added a commit that referenced this issue Jun 21, 2024

gh-9: Add forgotten InputElementTemplateTail

79aba70

This is a follow-up for gh-106 (7690f85).

arhadthedev mentioned this issue Jun 21, 2024

gh-9: Add forgotten InputElementTemplateTail #108

Merged

arhadthedev added a commit that referenced this issue Jun 21, 2024

gh-9: Add forgotten InputElementTemplateTail (#108)

7cd88b8

This is a follow-up for gh-106 (7690f85).

arhadthedev mentioned this issue Jun 21, 2024

gh-9: Add support for all lexical goals #109

Merged

arhadthedev added a commit that referenced this issue Jun 22, 2024

gh-9: Add failure tests for single-goal rules

6170157

Some rules match for certain goal symbols and fail for others. Check this.

arhadthedev mentioned this issue Jun 22, 2024

gh-9: Add failure tests for single-goal rules #110

Merged

arhadthedev added a commit that referenced this issue Jun 22, 2024

gh-9: Add failure tests for single-goal rules (#110)

cb5d35c

Some rules match for certain goal symbols and fail for others. Now we check this.
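
One such single-goal rule is HashbangComment, which the specification permits only under InputElementHashbangOrRegExp. The sketch below shows the kind of check such a failure test could make; `match_hashbang` and the reduced `Goal` enum are hypothetical, and the project's actual rules and tests may look quite different.

```rust
/// Two goals are enough to show a rule that matches under one and fails under another.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Goal {
    InputElementHashbangOrRegExp,
    InputElementDiv,
}

/// Hypothetical single-goal rule: recognizes `#!` up to the end of the line.
pub fn match_hashbang(goal: Goal, src: &str) -> Result<&str, String> {
    if goal != Goal::InputElementHashbangOrRegExp {
        return Err(format!("HashbangComment is not permitted under {:?}", goal));
    }
    if !src.starts_with("#!") {
        return Err("expected `#!`".to_string());
    }
    // The comment ends at the first ECMAScript line terminator, if any.
    let end = src
        .find(|c: char| matches!(c, '\n' | '\r' | '\u{2028}' | '\u{2029}'))
        .unwrap_or(src.len());
    Ok(&src[..end])
}
```

A success test would assert that the rule matches under `Goal::InputElementHashbangOrRegExp`, and the corresponding failure test would assert that the same input is rejected under `Goal::InputElementDiv`.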

arhadthedev mentioned this issue Jun 25, 2024

Move pest_ast classes for lexical_grammar.pest into lexical_grammar.rs #113

Closed

arhadthedev added a commit that referenced this issue Jun 26, 2024

gh-9: Make Ecma262Parser invocation functional

c25fd68

For some reason, such a format is easier to read.

arhadthedev mentioned this issue Jun 26, 2024

gh-9: Make Ecma262Parser invocation functional #127

Merged

arhadthedev added a commit that referenced this issue Jun 26, 2024

gh-9: Make Ecma262Parser invocation functional (#127)

1aebb5b

For some reason, such a format is easier to read.

arhadthedev added a commit that referenced this issue Jun 26, 2024

gh-9: Change get_unprocessed_tail argument ownership

fce2861

It allows us to get rid of a mut-argument after caller-side cloning.

arhadthedev mentioned this issue Jun 26, 2024

gh-9: Change get_unprocessed_tail argument ownership #128

Merged

arhadthedev added a commit that referenced this issue Jun 26, 2024

gh-9: Change get_unprocessed_tail argument ownership (#128)

8192d9c

It allows us to get rid of a mut-argument after caller-side cloning.

arhadthedev added a commit that referenced this issue Jun 26, 2024

gh-9: Clarify tree object variable name in get_next_token

a6002ba

arhadthedev mentioned this issue Jun 26, 2024

gh-9: Clarify tree object variable name in get_next_token #129

Merged

arhadthedev added a commit that referenced this issue Jun 26, 2024

gh-9: Clarify tree object variable name in get_next_token (#129)

5fd2fd1

This was referenced Jun 27, 2024

Separate lexing/parsing and execution with serialized tree hints #130

Open

Merge lexical and syntax grammar into a single PEG tree #131

Closed
