Custom Token Patterns to Support Lexer Context. #360

Closed
bd82 opened this issue Feb 1, 2017 · 3 comments

bd82 commented Feb 1, 2017

Information about the previously matched tokens should be provided to the custom token matcher.

This would allow implementing lexing patterns that require context from the previously successfully matched tokens, for example:
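One hypothetical example of such a context-dependent pattern (a sketch only; the argument names and order here are assumptions, not the merged chevrotain API) is the classic JavaScript ambiguity where "/" may start either a regex literal or a division operator, depending on the previous token:

```javascript
// Sketch only: a custom matcher that receives the tokens matched so far
// and uses the last one to decide whether "/" starts a regex literal.
// The "tokenType" string property on tokens is an assumption for
// illustration purposes.
function matchRegexLiteral(text, matchedTokens) {
  const prev = matchedTokens[matchedTokens.length - 1]
  // After an identifier or number, "/" must be division, not a regex.
  const prevIsValue =
    prev !== undefined &&
    (prev.tokenType === "Identifier" || prev.tokenType === "NumberLiteral")
  if (prevIsValue) {
    return null // no match; let the Division token's pattern try instead
  }
  // Naive regex-literal matching, for illustration only.
  return /^\/[^/]*\//.exec(text)
}
```

Without access to `matchedTokens`, a matcher like this cannot be written at all, which is the motivation for passing the lexing context into custom patterns.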

bd82 commented Feb 1, 2017

An initial implementation is available here:
#361

Performance on V8

However, invoking .exec with the additional context arguments
seems to cause a 5-6% performance degradation, even if those arguments are not used.

Dynamically creating wrapper functions which invoke .exec either with or without these arguments causes an even worse 20-30% performance degradation.

Currently the only solution I can think of involves additional code duplication,
and there is already quite a bit of code duplication in the lexer code.
Perhaps the maintenance overhead of the duplication can be mitigated using some automatic validations (as part of the tests).

Another option is to create the wrapper matcher functions dynamically but outside the scope
of the tokenize method. This means they will only be created ONCE, and thus may be better optimized.
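The "wrap once, outside tokenize" idea could look roughly like this (my illustration, not the actual lexer code; `buildMatcher` and the pattern shapes are assumptions): each pattern is wrapped exactly once at lexer construction time, so every tokenize call reuses the same monomorphic call sites.

```javascript
// Sketch: normalize every pattern into a uniform matcher signature,
// built once rather than on each tokenize() call.
function buildMatcher(pattern) {
  if (typeof pattern === "function") {
    // Custom matcher: forward the lexing context arguments.
    return (text, matchedTokens, groups) => pattern(text, matchedTokens, groups)
  }
  // Plain RegExp: ignore the context arguments entirely.
  return (text) => pattern.exec(text)
}

// Built once, reused by every tokenize() call.
const matchers = [
  /^\d+/,
  (text) => (text.startsWith("+") ? ["+"] : null)
].map(buildMatcher)
```

Because the wrappers are created a single time, the JIT has a stable set of function identities to optimize, instead of fresh closures allocated per tokenize invocation.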

bd82 pushed a commit that referenced this issue Feb 1, 2017
bd82 commented Feb 1, 2017

The performance issues can be mitigated by using a runtime if-else.

                if (hasCustomTokens) {
                    match = currModePatterns[i].exec(text, matchedTokens, groups)
                } else {
                    match = currModePatterns[i].exec(text)
                }

This still has some overhead (~1% on the JSON benchmark), but it is small.
The worse problem is that it is quite ugly and is repeated multiple times in the lexer's code.

Perhaps some of the lexer source code should be generated dynamically from templates
to avoid manually managing the duplication. This approach could even improve performance
in some use cases compared to the existing one, because there would be no limit on generating
the most optimal source code for each situation.
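A minimal sketch of that template idea (my illustration, not chevrotain's actual code generation): build a specialized tokenize loop per feature combination, so the hot loop contains no runtime `hasCustomTokens` branch at all.

```javascript
// Sketch: generate the source of a tokenize loop, baking the
// custom-tokens decision into the generated code instead of branching
// on every match attempt.
function buildTokenizeSource(hasCustomTokens) {
  const execCall = hasCustomTokens
    ? "patterns[i].exec(text, matchedTokens, groups)"
    : "patterns[i].exec(text)"
  return `
    const matchedTokens = []
    const groups = {}
    while (text.length > 0) {
      let matched = false
      for (let i = 0; i < patterns.length; i++) {
        const match = ${execCall}
        if (match !== null) {
          matchedTokens.push(match[0])
          text = text.slice(match[0].length)
          matched = true
          break
        }
      }
      if (!matched) { text = text.slice(1) } // skip an unrecognized char
    }
    return matchedTokens
  `
}

// One specialized tokenizer per feature combination, compiled once.
const tokenize = new Function("text", "patterns", buildTokenizeSource(false))
```

Each feature combination would get its own compiled function this way, which is exactly why the number of combinations makes hand-maintained duplication unattractive.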

bd82 pushed a commit that referenced this issue Feb 2, 2017
bd82 pushed a commit that referenced this issue Feb 2, 2017
bd82 pushed a commit that referenced this issue Feb 3, 2017
@bd82 bd82 changed the title Custom Token Patterns with previous context. Custom Token Patterns to Support Lexer Context. Feb 3, 2017
@bd82 bd82 closed this as completed in c492d03 Feb 3, 2017
bd82 commented Feb 3, 2017

The merged version, using the ugly duplicated "if-else", reduces the performance penalty to 1-2%, which is acceptable for now.

These 1-2% can be gained back (and possibly more) if and when the lexer runtime code is refactored to be auto-generated.

There are simply too many different combinations of lexing features, combined with the fact that for maximum performance each combination of these features must have its own tokenizeInternal method.
