Use RegExp sticky flag for lexer performance boost. #380

Closed
11 tasks done
bd82 opened this issue Mar 11, 2017 · 9 comments · Fixed by #405

Comments


bd82 commented Mar 11, 2017

Modify the RegExp's lastIndex instead of repeatedly chopping off parts of the input string (see the sketch after the task list).

  • POC.
  • Basic flows.
  • Legacy mode (IE11 & friends).
  • Fix existing tests.
  • Tests for legacy mode.
  • Validation that the start-of-input anchor ("^") is not used.
  • 100% code coverage.
  • Modify Indentation example.
  • Update Custom Token Docs.
  • Document breaking changes.
  • Update the Safari-hack custom pattern in the performance benchmark.
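
A minimal sketch of the core idea (illustrative only, not Chevrotain's actual implementation): with the "y" (sticky) flag a RegExp only matches starting exactly at its lastIndex, so a lexer can advance lastIndex through the input instead of allocating a new substring per token.

    // Sticky ("y") regexps only match starting exactly at lastIndex.
    const intPattern = /\d+/y;

    function matchIntAt(text, offset) {
        intPattern.lastIndex = offset; // start matching exactly here
        const result = intPattern.exec(text);
        // with "y", exec returns null unless the match begins at lastIndex
        return result === null ? null : result[0];
    }

    matchIntAt("abc123", 3); // "123"
    matchIntAt("abc123", 0); // null -- no digit at offset 0
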
@bd82 bd82 changed the title Investigate Lexer performance optimization Investigate Lexer performance optimization. Mar 11, 2017

bd82 commented Mar 12, 2017

A small POC showed an 8-10% improvement in an E2E (lex + parse) JSON benchmark.
So that may be as much as a 20% boost for the lexer itself.
Quite amazing that 1/5 of the lexer time may have been spent chopping up strings with substr...

Getting this to production quality requires handling a few issues:

  • The Custom Token patterns API may have to be modified.
  • How to avoid an explosion of Lexer variants.
  • How to optionally support this capability only on newer JS engines where the sticky flag is available (a possible detection approach is sketched below).
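
One possible way to handle the last point (an assumption on my part, not necessarily what Chevrotain ends up doing) is a runtime feature check: try to construct a sticky RegExp and fall back to the legacy mode if it throws.

    let stickySupported = true;
    try {
        // throws a SyntaxError on engines without sticky support
        new RegExp("", "y");
    } catch (e) {
        stickySupported = false; // e.g. IE11 or node.js 4
    }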


leeoniya commented Mar 12, 2017

avoiding string allocation in hot code via substr, substring, slice is a good perf optimization. you're also better off checking string prefixes (if needed) via array access to individual chars and plain for loops [1].

btw, native Array.forEach loops are slow.

[1] http://stackoverflow.com/a/4579228
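
A sketch of the prefix check described above (startsWithAt is a hypothetical helper, not from any library): comparing char codes directly avoids allocating a substring for the prefix.

    function startsWithAt(text, prefix, offset) {
        // compare char codes one by one -- no intermediate string is created
        for (let i = 0; i < prefix.length; i++) {
            if (text.charCodeAt(offset + i) !== prefix.charCodeAt(i)) {
                return false;
            }
        }
        return true;
    }

    startsWithAt("function foo()", "function", 0); // true
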


bd82 commented Mar 12, 2017

Thanks for the feedback @leeoniya

also better off checking string prefixes (if needed) via array access to individual chars

The Chevrotain Lexer is based on Regular expressions, so there is no access to individual chars.
The lexer just tries matching against the provided patterns in a loop until it finds a match.

avoiding string allocation in hot code via substr, substring, slice is a good perf optimization.

I'm already using substring. The optimization I'm talking about here is to avoid using substring entirely: by using the new RegExp sticky flag combined with setting the RegExp's lastIndex property, it is possible to match from any point in the string.

Currently Chevrotain is chopping off string prefixes using substring.
So if there are 10,000 tokens, substring would have been called 10,000 times during lexing...
That is quite expensive...
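
Roughly the before/after shape of the hot loop (a simplified sketch assuming a single token pattern and fully tokenizable input, not the real Chevrotain code):

    // Before: chop off the matched prefix after every token --
    // one string allocation per token.
    function lexLegacy(text, pattern) { // pattern anchored with "^"
        const tokens = [];
        let remaining = text;
        while (remaining.length > 0) {
            const match = pattern.exec(remaining);
            tokens.push(match[0]);
            remaining = remaining.substring(match[0].length); // allocates!
        }
        return tokens;
    }

    // After: move lastIndex through the original string -- no allocations.
    function lexSticky(text, pattern) { // pattern uses the "y" flag
        const tokens = [];
        let offset = 0;
        while (offset < text.length) {
            pattern.lastIndex = offset;
            const match = pattern.exec(text);
            tokens.push(match[0]);
            offset += match[0].length;
        }
        return tokens;
    }

    lexLegacy("ab 12", /^\w+|^\s+/); // ["ab", " ", "12"]
    lexSticky("ab 12", /\w+|\s+/y);  // ["ab", " ", "12"]
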

btw, native Array.forEach loops are slow.

Aye, I use good old-fashioned for loops in hot-spot code.
For less performance-critical code I use a small set of utility functions
in the lodash/underscore style.

@leeoniya

I'm already using substring, The optimization I'm talking about here is to completely avoid using substring.

right, i agree that avoiding it is the right way to go. i did a similar optimization in one of my libs recently and was also surprised by how big the boost was.

@bd82 bd82 changed the title Investigate Lexer performance optimization. Use RegExp sticky flag for performance optimizations. Mar 18, 2017
@bd82 bd82 changed the title Use RegExp sticky flag for performance optimizations. Use RegExp sticky flag for performance boost. Mar 18, 2017
@bd82 bd82 changed the title Use RegExp sticky flag for performance boost. Use RegExp sticky flag for lexer performance boost. Mar 18, 2017

bd82 commented Mar 19, 2017

Breaking Changes

  • Custom Token patterns used to be called with three arguments.

         function matchInteger(text, matchedTokens, groups) {}

    There are now four arguments.

         function matchInteger(text, offset, matchedTokens, groups) {}

The custom match is expected to be performed beginning from the offset argument in the text.
Previously the custom match always assumed the match must be performed from the start of the input text.
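
For example, a custom pattern under the new signature could itself use a sticky RegExp so the match begins at the given offset (a sketch; returning an exec-style result or null is assumed here):

    const integerPattern = /\d+/y;

    function matchInteger(text, offset, matchedTokens, groups) {
        integerPattern.lastIndex = offset; // match from offset, not from index 0
        return integerPattern.exec(text);  // exec result on success, null otherwise
    }
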


bd82 commented Mar 19, 2017

Re-opening as a reminder to update the Safari custom pattern in the performance benchmark.

@bd82 bd82 reopened this Mar 19, 2017

bd82 commented Mar 19, 2017

Final performance results showed an almost 30% improvement for a simple JSON lexer.
More complex lexers should show smaller gains, as a lower percentage of their runtime
is spent performing substrings on the input.


bd82 commented Mar 19, 2017

This is only applicable to modern JS engines.
On node.js version 4 the Lexer will run in the old legacy mode without any performance benefit.
Same on IE11.

But anyone using IE11 is doomed anyways 😄


bd82 commented Mar 19, 2017

The Safari-related performance workaround was no longer needed and has been removed.
