Since commit 2f2d8cf, work has begun on a complete redesign of the existing lexer and parser, with the new code currently residing in the `nix-parser2` directory.
## Existing pain points

### Lexer
- The lexer is not fully lazy when it comes to string interpolations: it eagerly collects all of the interpolated tokens into a nested `Vec`. This is a big design kludge and a performance footgun. Ideally, the lexer should be a dumb iterator yielding a flat, lazy sequence of tokens which map directly to spans in the source text.
- Each variant of the `Token` enum carries its own duplicate `Span` information. This should be factored out into a `Token` struct containing both a `Span` and a `TokenKind`, which eliminates the need to implement `ToSpan` on `Token`.
- The lexer does work the parser should be doing, such as unescaping strings, trimming the leading whitespace from `''` strings, converting paths into `PathBuf` values, converting booleans into `bool` values, etc. There should be a stronger separation of concerns: the produced tokens should carry span information and little else, and the parser is free to slice and retrieve the source text based on those spans, if it so desires.
- The lexer should be 100% lossless. Currently, if a path token contains a trailing slash or a string literal is left unterminated, the lexer yields a `Token::Error`. This confounds the parser at later stages, causing it to emit perplexing errors such as ``expected path, found `/foo/bar/` `` and ``unexpected token `'` ``, respectively. Instead, we should still emit `Token::Path` and `Token::String` so parsing can continue, but mark them as invalid. This matches what `rustc_lexer` does.
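The points above could combine into something like the following sketch. All names here (`Span`, `TokenKind`, the `terminated`/`trailing_slash` flags) are hypothetical illustrations of the proposed design, not the actual `nix-parser2` API: tokens are a flat `(kind, span)` pair carrying no owned data, and invalid input still yields a real token kind flagged as invalid rather than a `Token::Error`.

```rust
/// A half-open byte range into the source text.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// What kind of token was lexed. Invalid input is still lexed into a
/// real kind, with a flag recording what went wrong, so the parser can
/// keep going and report a sensible diagnostic.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TokenKind {
    Ident,
    Path { trailing_slash: bool },
    String { terminated: bool },
    Whitespace,
    // ...
}

/// One token: a kind plus a span, nothing else. The duplicate per-variant
/// span data (and the `ToSpan` impl on the token enum) disappears.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

impl Token {
    /// The parser slices the source itself when it needs the text,
    /// so the lexer never unescapes, trims, or converts anything.
    pub fn text<'a>(&self, source: &'a str) -> &'a str {
        &source[self.span.start..self.span.end]
    }
}

fn main() {
    // An unterminated string literal still produces a String token;
    // deciding how to diagnose it is the parser's job.
    let source = "\"unterminated";
    let token = Token {
        kind: TokenKind::String { terminated: false },
        span: Span { start: 0, end: source.len() },
    };
    assert_eq!(token.text(source), source);
}
```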
### Parser
- Error management with `nom` is very hard to get right. It leads to very convoluted code as well as performance footguns if you're not careful. Producing high-quality diagnostics from `nom` combinators is tedious as well.
- Writing an incremental parser with `nom` is very difficult compared to a hand-written parser. `nom` is hard-wired to return `Incomplete(n)` when receiving incomplete input, which is decidedly not what we want when processing a programming language.
- Threading state through a `nom` parser, e.g. an arena allocator or string interner, is very cumbersome due to the restrictive `Fn` bounds on combinators.
- The parser should be 100% lossless. It should not produce an AST, but rather a CST (concrete syntax tree) which preserves all whitespace, comments, and invalid tokens. This would be usable in a code formatter or autocompletion engine, for example. The CST can be simplified into an AST at a later stage. See the Roslyn architecture for more details.
- The parser currently can't correctly handle spaces inside string interpolations, nor escaped `'` characters inside indented `''` strings.
- Implementing overflow arguments, trailing commas, and ellipses in formal function arguments is tricky. Implementing a strategy which yields high-quality diagnostics is even trickier.
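To make the "lossless CST" point concrete, here is a minimal hand-written sketch (deliberately simpler than `rowan`, and not the actual `nix-parser2` design; `SyntaxKind`, `Element`, and `Node` are made-up names): every token, including whitespace, comments, and error tokens, is kept as a child in the tree, so the original source can be reproduced byte-for-byte, which is exactly what a formatter needs.

```rust
/// What a node or token in the tree represents. `Error` nodes keep
/// invalid input in the tree instead of dropping it.
#[derive(Debug, PartialEq)]
pub enum SyntaxKind {
    Root,
    BinaryExpr,
    Literal,
    Op,
    Whitespace,
    Comment,
    Error,
}

/// A CST element: an interior node or a leaf token with its exact text.
#[derive(Debug)]
pub enum Element {
    Node(Node),
    Token { kind: SyntaxKind, text: String },
}

#[derive(Debug)]
pub struct Node {
    pub kind: SyntaxKind,
    pub children: Vec<Element>,
}

impl Node {
    /// Reassemble the exact original source, trivia included.
    /// Losslessness means this round-trips for any input.
    pub fn to_text(&self) -> String {
        let mut out = String::new();
        for child in &self.children {
            match child {
                Element::Node(n) => out.push_str(&n.to_text()),
                Element::Token { text, .. } => out.push_str(text),
            }
        }
        out
    }
}

fn main() {
    // "1 + 2" with its whitespace preserved as trivia tokens.
    let expr = Node {
        kind: SyntaxKind::BinaryExpr,
        children: vec![
            Element::Token { kind: SyntaxKind::Literal, text: "1".into() },
            Element::Token { kind: SyntaxKind::Whitespace, text: " ".into() },
            Element::Token { kind: SyntaxKind::Op, text: "+".into() },
            Element::Token { kind: SyntaxKind::Whitespace, text: " ".into() },
            Element::Token { kind: SyntaxKind::Literal, text: "2".into() },
        ],
    };
    assert_eq!(expr.to_text(), "1 + 2");
}
```

A production design like `rowan` adds interning and structural sharing on top of this idea, but the invariant is the same: concatenating the leaves of the tree reproduces the input exactly.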
## Current state
The new lexer present in the `master` branch still relies on `nom`, which works very well for building small, efficient lexers. Since the new lexer is lossless and never produces an error, there's no need to struggle with custom `nom` error management, and the code looks much simpler as a result. The redesigned lexer is currently in an MVP state and is nearly feature-complete, though a few pieces are still missing, most notably a function for checking delimiter balancing.
The new parser will ditch `nom` and will most likely be hand-written for the sake of better error recovery and higher-quality diagnostics. The precise design hasn't been fully fleshed out yet. Instead of producing an AST, the new parser will produce a CST (possibly based on `rowan`, but it could also be hand-written) whose nodes cannot be constructed by hand. However, the CST can be simplified down into an AST, whose nodes can be constructed by hand.
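The CST-to-AST simplification step might look something like the sketch below. This is an assumed design for illustration only (the `CstElement`, `Ast`, and `simplify` names are hypothetical, and a real CST would be a tree, not a flat list): trivia and error tokens live only in the CST, and the AST that falls out is plain data that tests can construct by hand.

```rust
/// A (deliberately flattened) CST for `a + b + ...`: meaningful
/// tokens interleaved with trivia that the AST will not keep.
#[derive(Debug, PartialEq)]
enum CstElement {
    Int(i64),
    Plus,
    Whitespace(String),
    Comment(String),
}

/// The AST keeps only what later stages need; its nodes are ordinary
/// public data and can be built by hand (e.g. in unit tests).
#[derive(Debug, PartialEq)]
enum Ast {
    Int(i64),
    Add(Box<Ast>, Box<Ast>),
}

/// Fold a flat `a + b + c` CST into a left-associated AST,
/// discarding whitespace and comments along the way. Returns
/// `None` if the non-trivia tokens don't form a valid sum.
fn simplify(cst: &[CstElement]) -> Option<Ast> {
    let mut iter = cst.iter().filter(|e| {
        !matches!(e, CstElement::Whitespace(_) | CstElement::Comment(_))
    });
    let mut acc = match iter.next()? {
        CstElement::Int(n) => Ast::Int(*n),
        _ => return None,
    };
    while let Some(e) = iter.next() {
        match (e, iter.next()?) {
            (CstElement::Plus, CstElement::Int(n)) => {
                acc = Ast::Add(Box::new(acc), Box::new(Ast::Int(*n)));
            }
            _ => return None,
        }
    }
    Some(acc)
}

fn main() {
    // `1 + 2`, with the whitespace still present in the CST.
    let cst = vec![
        CstElement::Int(1),
        CstElement::Whitespace(" ".into()),
        CstElement::Plus,
        CstElement::Whitespace(" ".into()),
        CstElement::Int(2),
    ];
    let ast = simplify(&cst);
    assert_eq!(
        ast,
        Some(Ast::Add(Box::new(Ast::Int(1)), Box::new(Ast::Int(2))))
    );
}
```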
There is some potential for `nom` to be usable for the parser after all, given the practices outlined in "Syntax error recovery in parsing expression grammars" (Medeiros and Mascarenhas, 2018). I've created a write-up on my blog here.
CC @dingxiangfei2009