Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace regex-based lexer with character-at-a-time lexer #406

Merged
merged 14 commits into from
Apr 16, 2019
Merged

Conversation

casey
Copy link
Owner

@casey casey commented Apr 16, 2019

Fixes #241, which I've been threatening to do for a long time.

The current lexer uses regexes and is god awful.

The new lexer processes text mostly one character at a time, making decisions about which tokens to emit along the way. It's more verbose than the old lexer, but the new code is much easer to read, understand, and modify.

Also, the new lexer is 4x faster than the old lexer, when tested against a corpus of justfiles collected from github. In release mode, the change is more dramatic, with the new lexer being 15x faster.

I suspect that the speed increase is partially due to the old lexer trying to lex tokens by matching regexes in a sequence, which led to a lot of wasted work, whereas the new lexer is usually able to make a decision about which token to emit next by looking at the next character.

Since this is such a massive change, I'm testing it using a new tool called Janus, which downloads all justfiles on github and feeds them to multiple versions of just, looking for differences in behavior. Janus is of course inspired by Rust's crater, and once I finally release it we can close #251.

So far the results from Janus are encouraging. The just+new lexer produces slightly better error messages in a few cases, as well as being able to parse a previously unparsable justfile.

The only change that I need to investigate before landing the rewrite is a change in the handling of windows newlines at the end of recipe lines. For example, it looks like a text that the old lexer would extract from the line echo foo\r\n would be echo foo\r, whereas the new lexer correctly recognizes \r\n as a unit, and extracts echo foo as the line text.

Although the new lexer's behavior is correct, I'm slightly concerned that there might be cases where the new behavior might cause a shebang recipe to fail when previously it would have succeeded.

@casey casey marked this pull request as ready for review April 16, 2019 04:37
@casey
Copy link
Owner Author

casey commented Apr 16, 2019

An additional change in behavior is that the new lexer includes trailing space as part of a recipe line. I.E if a recipe line contains echo foo , those two spaces will be extracted and included. I think this is an improvement, since we definitely want to be able to execute whitespace shebang recipes, and the old lexer would break those.

@casey
Copy link
Owner Author

casey commented Apr 16, 2019

Since there were so few justfiles that saw changes (2 out of 498), and those that did won't change behavior, I'm going to merge this.

@casey casey merged commit 596ea34 into master Apr 16, 2019
@casey casey mentioned this pull request Apr 16, 2019
3 tasks
@casey casey deleted the lexer-rewrite branch April 16, 2019 05:41
@casey casey mentioned this pull request May 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crater for Just Rewrite Lexer
1 participant