Custom Token Patterns to Support Lexer Context. #360

Closed
bd82 opened this issue Feb 1, 2017 · 3 comments

bd82 commented Feb 1, 2017

Information about the previously matched tokens should be provided to the custom token matcher.

This would allow implementing lexing patterns that require context from the previously successfully matched tokens, for example:
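One hypothetical example of such a context-dependent pattern (a sketch only; the argument names and order here are assumptions, not the merged chevrotain API) is the classic JavaScript ambiguity where "/" may start either a regex literal or a division operator, depending on the previous token:

```javascript
// Sketch only: a custom matcher that receives the tokens matched so far
// and uses the last one to decide whether "/" starts a regex literal.
// The "tokenType" string property on tokens is an assumption for
// illustration purposes.
function matchRegexLiteral(text, matchedTokens) {
  const prev = matchedTokens[matchedTokens.length - 1]
  // After an identifier or number, "/" must be division, not a regex.
  const prevIsValue =
    prev !== undefined &&
    (prev.tokenType === "Identifier" || prev.tokenType === "NumberLiteral")
  if (prevIsValue) {
    return null // no match; let the Division token's pattern try instead
  }
  // Naive regex-literal matching, for illustration only.
  return /^\/[^/]*\//.exec(text)
}
```

Without access to `matchedTokens`, a matcher like this cannot be written at all, which is the motivation for passing the lexing context into custom patterns.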

bd82 commented Feb 1, 2017

An initial implementation is available here:
#361

Performance on V8

However, invoking .exec with the additional context arguments
seems to cause a 5-6% performance degradation, even if those arguments are not used.

Dynamically creating wrapper functions which invoke .exec either with or without these arguments causes an even worse 20-30% performance degradation.

Currently the only solution I can think of involves additional code duplication,
and there is already quite a bit of code duplication in the lexer code.
Perhaps the maintenance overhead of the duplication can be mitigated using some automatic validations (as part of the tests).

Another option is to create the wrapper matcher functions dynamically but outside the scope
of the tokenize method. This means they will only be created ONCE, and thus may be better optimized.
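The "wrap once, outside tokenize" idea could look roughly like this (my illustration, not the actual lexer code; `buildMatcher` and the pattern shapes are assumptions): each pattern is wrapped exactly once at lexer construction time, so every tokenize call reuses the same monomorphic call sites.

```javascript
// Sketch: normalize every pattern into a uniform matcher signature,
// built once rather than on each tokenize() call.
function buildMatcher(pattern) {
  if (typeof pattern === "function") {
    // Custom matcher: forward the lexing context arguments.
    return (text, matchedTokens, groups) => pattern(text, matchedTokens, groups)
  }
  // Plain RegExp: ignore the context arguments entirely.
  return (text) => pattern.exec(text)
}

// Built once, reused by every tokenize() call.
const matchers = [
  /^\d+/,
  (text) => (text.startsWith("+") ? ["+"] : null)
].map(buildMatcher)
```

Because the wrappers are created a single time, the JIT has a stable set of function identities to optimize, instead of fresh closures allocated per tokenize invocation.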

bd82 pushed a commit that referenced this issue Feb 1, 2017
bd82 commented Feb 1, 2017

The performance issues can be mitigated by using a runtime if-else.

                if (hasCustomTokens) {
                    match = currModePatterns[i].exec(text, matchedTokens, groups)
                } else {
                    match = currModePatterns[i].exec(text)
                }

This still has some overhead (~1% on the JSON benchmark), but it is small.
The worse problem is that it is quite ugly and is repeated multiple times in the lexer's code.

Perhaps some of the lexer source code should be generated dynamically from templates
to avoid manually managing the duplication. This approach could even improve performance
in some use cases compared to the existing one, because there would be no limit on generating
the most optimal source code for each situation.
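A minimal sketch of that template idea (my illustration, not chevrotain's actual code generation): build a specialized tokenize loop per feature combination, so the hot loop contains no runtime `hasCustomTokens` branch at all.

```javascript
// Sketch: generate the source of a tokenize loop, baking the
// custom-tokens decision into the generated code instead of branching
// on every match attempt.
function buildTokenizeSource(hasCustomTokens) {
  const execCall = hasCustomTokens
    ? "patterns[i].exec(text, matchedTokens, groups)"
    : "patterns[i].exec(text)"
  return `
    const matchedTokens = []
    const groups = {}
    while (text.length > 0) {
      let matched = false
      for (let i = 0; i < patterns.length; i++) {
        const match = ${execCall}
        if (match !== null) {
          matchedTokens.push(match[0])
          text = text.slice(match[0].length)
          matched = true
          break
        }
      }
      if (!matched) { text = text.slice(1) } // skip an unrecognized char
    }
    return matchedTokens
  `
}

// One specialized tokenizer per feature combination, compiled once.
const tokenize = new Function("text", "patterns", buildTokenizeSource(false))
```

Each feature combination would get its own compiled function this way, which is exactly why the number of combinations makes hand-maintained duplication unattractive.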

bd82 pushed a commit that referenced this issue Feb 2, 2017
bd82 pushed a commit that referenced this issue Feb 2, 2017
bd82 pushed a commit that referenced this issue Feb 3, 2017
@bd82 bd82 changed the title Custom Token Patterns with previous context. Custom Token Patterns to Support Lexer Context. Feb 3, 2017
@bd82 bd82 closed this as completed in c492d03 Feb 3, 2017
bd82 commented Feb 3, 2017

The merged version, using the ugly duplicated "if-else", reduces the performance penalty to 1-2%, which is acceptable for now.

These 1-2% can be gained back (and possibly more) if and when the lexer runtime code is refactored to be auto-generated.

There are simply too many different combinations of lexing features, combined with the fact that for maximum performance each combination of these features must have its own tokenizeInternal method.
