Replace regex-based lexer with character-at-a-time lexer #406

casey · 2019-04-16T00:31:54Z

Fixes #241, which I've been threatening to do for a long time.

The current lexer uses regexes and is god-awful.

The new lexer processes text mostly one character at a time, making decisions about which tokens to emit along the way. It's more verbose than the old lexer, but the new code is much easier to read, understand, and modify.

Also, the new lexer is 4x faster than the old lexer when tested against a corpus of justfiles collected from GitHub. In release mode, the change is more dramatic, with the new lexer being 15x faster.

I suspect that the speed increase is partially due to the old lexer trying to lex tokens by matching regexes in a sequence, which led to a lot of wasted work, whereas the new lexer is usually able to make a decision about which token to emit next by looking at the next character.
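To illustrate the "decide by looking at the next character" idea, here is a minimal sketch. It is not the code in this PR: the `Token` variants and the `lex` function are hypothetical, and the token set is far smaller than what a justfile needs.

```rust
// Illustrative sketch only: `Token` and `lex` are hypothetical names and
// are not taken from just's source.
#[derive(Debug, PartialEq)]
enum Token {
    Colon,
    Equals,
    Comment(String),
    Name(String),
    Whitespace,
    Eol,
}

fn lex(src: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();

    // Peek at the next character and pick the token kind from it alone,
    // instead of trying a sequence of regexes until one matches.
    while let Some(&c) = chars.peek() {
        match c {
            ':' => {
                chars.next();
                tokens.push(Token::Colon);
            }
            '=' => {
                chars.next();
                tokens.push(Token::Equals);
            }
            '\n' => {
                chars.next();
                tokens.push(Token::Eol);
            }
            '\r' => {
                // `\r` is only accepted as the first half of `\r\n`.
                chars.next();
                if chars.peek() == Some(&'\n') {
                    chars.next();
                    tokens.push(Token::Eol);
                } else {
                    return Err("expected `\\n` after `\\r`".to_string());
                }
            }
            ' ' | '\t' => {
                // Collapse a run of spaces and tabs into one token.
                while let Some(&w) = chars.peek() {
                    if w == ' ' || w == '\t' {
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(Token::Whitespace);
            }
            '#' => {
                // A comment runs to the end of the line, terminator excluded.
                let mut text = String::new();
                while let Some(&d) = chars.peek() {
                    if d == '\n' || d == '\r' {
                        break;
                    }
                    text.push(d);
                    chars.next();
                }
                tokens.push(Token::Comment(text));
            }
            c if c.is_ascii_alphabetic() || c == '_' => {
                let mut name = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_ascii_alphanumeric() || d == '_' || d == '-' {
                        name.push(d);
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(Token::Name(name));
            }
            other => return Err(format!("unexpected character `{}`", other)),
        }
    }

    Ok(tokens)
}
```

Each arm is chosen after inspecting a single character, so no work is spent attempting token kinds that cannot match at the current position.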

Since this is such a massive change, I'm testing it using a new tool called Janus, which downloads all justfiles on GitHub and feeds them to multiple versions of just, looking for differences in behavior. Janus is of course inspired by Rust's crater, and once I finally release it we can close #251.

So far the results from Janus are encouraging. With the new lexer, just produces slightly better error messages in a few cases, and it can parse a previously unparsable justfile.
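The general idea behind this kind of differential testing can be sketched as follows. This is not how Janus is implemented (Janus runs whole versions of just against real justfiles); `find_differences` is a hypothetical helper that compares two in-process implementations over a corpus and reports the inputs on which they disagree.

```rust
// Hypothetical helper for illustration; not part of Janus or just.
// Runs `old` and `new` on every input and returns the inputs whose
// results differ.
fn find_differences<'a, T: PartialEq>(
    corpus: &[&'a str],
    old: impl Fn(&str) -> T,
    new: impl Fn(&str) -> T,
) -> Vec<&'a str> {
    corpus
        .iter()
        .copied()
        .filter(|input| old(*input) != new(*input))
        .collect()
}
```

Any input in the returned list then needs a manual look to decide whether the difference is an improvement, a regression, or harmless.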

The only change that I need to investigate before landing the rewrite is a change in the handling of Windows newlines at the end of recipe lines. For example, it looks like the text that the old lexer would extract from the line `echo foo\r\n` is `echo foo\r`, whereas the new lexer correctly recognizes `\r\n` as a unit and extracts `echo foo` as the line text.

Although the new lexer's behavior is correct, I'm slightly concerned that there may be cases where the new behavior causes a shebang recipe to fail when previously it would have succeeded.
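The difference in line-terminator handling can be shown with a small sketch. Both functions are hypothetical and only mimic the behavior described above; neither is taken from the old or new lexer.

```rust
// Hypothetical helpers for illustration; neither is taken from just's source.

/// Strips only a trailing `\n`, which leaves a stray `\r` behind on
/// Windows-style lines (the old behavior as described above).
fn line_text_lf_only(line: &str) -> &str {
    line.strip_suffix('\n').unwrap_or(line)
}

/// Treats `\r\n` as a single terminator, falling back to a bare `\n`
/// (the new behavior as described above).
fn line_text_crlf_aware(line: &str) -> &str {
    line.strip_suffix("\r\n")
        .or_else(|| line.strip_suffix('\n'))
        .unwrap_or(line)
}
```

For the input `echo foo\r\n`, the first function keeps the `\r` in the line text and the second does not, which is exactly the difference a shebang recipe's interpreter could observe.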

casey · 2019-04-16T04:39:59Z

An additional change in behavior is that the new lexer includes trailing spaces as part of a recipe line. I.e., if a recipe line contains `echo foo  ` with two trailing spaces, those two spaces will be extracted and included. I think this is an improvement, since we definitely want to be able to execute whitespace shebang recipes, and the old lexer would break those.

casey · 2019-04-16T05:27:54Z

Since there were so few justfiles that saw changes (2 out of 498), and those that did won't change behavior, I'm going to merge this.

casey added 12 commits April 15, 2019 16:45

Refaaaaactor

f7aebf8

index -> offset

46dfce2

Working on new lexer

db94673

Pretty much done, now moving all tests over

f56a302

Working... emoji hell still doesn't work

7e53d35

Everything works OwO

0ff8d70

Fix indented block test

d789ad7

Move tests into new_lexer

3952fc7

Remove old lexer!

420e2eb

First step to changing to usize

8daf9c2

Remove width Option

87f0cfb

Add more tests, make unclosed interpolation errors nicer

944e6f2

casey marked this pull request as ready for review April 16, 2019 04:37

casey added 2 commits April 15, 2019 22:09

Mention Janus in readme

30591e6

Interrupt tests are too flaky

ca73a3e

casey merged commit 596ea34 into master Apr 16, 2019

casey mentioned this pull request Apr 16, 2019

Rewrite Lexer #241

Closed

3 tasks

casey deleted the lexer-rewrite branch April 16, 2019 05:41

casey mentioned this pull request May 3, 2019

Reform the parser #426

Closed
