Replace trifecta with megaparsec #268
Conversation
instance TokenParsing Parser where
    someSpace =
        Text.Parser.Token.Style.buildSomeSpaceParser
            (Parser someSpace)
            (Parser (Text.Megaparsec.skipSome (Text.Megaparsec.Char.satisfy Data.Char.isSpace)))
Do you think replacing satisfy isSpace with something à la sc = L.space space1 lineComment blockComment might make a difference performance-wise?
No idea! You're welcome to grab my branch and see :) I won't be able to check that for a while, probably not free to hack now until Sunday.
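For readers following along, the alternative suggested above would look roughly like this. This is a minimal sketch only: the sc, lineComment, and blockComment names and the comment delimiters are illustrative, not taken from the PR.

{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec)
import qualified Text.Megaparsec.Char as Char
import qualified Text.Megaparsec.Char.Lexer as L

-- Space consumer built from megaparsec's lexer helpers: skips whitespace
-- plus Haskell-style line and block comments (Dhall uses the same comment
-- syntax, cf. haskellCommentStyle later in this diff).
sc :: Parsec Void Text ()
sc = L.space Char.space1 lineComment blockComment
  where
    lineComment  = L.skipLineComment "--"
    blockComment = L.skipBlockComment "{-" "-}"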
src/Dhall/Parser.hs
Outdated
    notFollowedBy = Text.Megaparsec.notFollowedBy

instance Text.Parser.Char.CharParsing Parser where
    satisfy = Parser . Text.Megaparsec.Char.satisfy
You will probably get a speed improvement if you implement the text method of the CharParsing class with a high-performance primitive (and probably the rest of the class, too). One of the reasons for using megaparsec is that it provides bulk parsing primitives. The other reason I suggest this is because that is where the parsing bottleneck is on master.
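To make that concrete, here is a hedged sketch of what overriding text with a bulk primitive could look like, assuming the Parser newtype in this PR wraps a megaparsec parser over Text (the exact definition lives in the diff and is not repeated here):

instance Text.Parser.Char.CharParsing Parser where
    satisfy = Parser . Text.Megaparsec.Char.satisfy

    -- Bulk primitive: match the whole Text chunk in one step instead of
    -- inheriting the default char-by-char implementation of text.
    text t = Parser (Text.Megaparsec.Char.string t)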
@ocharles: Also, if you can push your branch to this repository then I can open a pull request against your branch to get the rest of the library to build. However, I think we should get at least similar performance before merging this. A 60% slowdown is still a lot.
Ok, this is now up to date against master. We still have work to do. I have at least implemented all of the CharParsing methods. The branch has been pushed to this repository.
Yeah, in my own branch based on yours I found that implementing the
I have experienced the same ... I am a bit puzzled by the
I have a branch with an example use of a separate lexer. It's very fast, so if lexing is the bottleneck that should improve performance a lot. There's still a lot of work to do, though (like properly handling string interpolation, which I think is still possible).
What's the status of this? I may be able to write an
So my inclination at this point is to accept the switch to megaparsec. I do think we should do a separate lexing phase at some point, but the first step should be to complete the transition to megaparsec.
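For what it's worth, a rough sketch of what a separate lexing phase could look like with megaparsec; the Tok type and the token set below are hypothetical and nothing like Dhall's real grammar:

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (many, (<|>))
import Data.Char (isAlpha)
import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec, eof, takeWhile1P)
import qualified Text.Megaparsec.Char as Char

-- Hypothetical token type; a real lexer would mirror Dhall's grammar.
data Tok = TLambda | TArrow | TOpenParen | TCloseParen | TLabel Text
  deriving (Eq, Show)

-- One pass over the input producing a flat token list; the parser proper
-- would then consume [Tok] instead of re-inspecting characters.
lexTokens :: Parsec Void Text [Tok]
lexTokens = Char.space *> many (tok <* Char.space) <* eof
  where
    tok = TLambda     <$  Char.char 'λ'
      <|> TArrow      <$  Char.string "->"
      <|> TOpenParen  <$  Char.char '('
      <|> TCloseParen <$  Char.char ')'
      <|> TLabel      <$> takeWhile1P (Just "label character") isAlpha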
I've updated my branch to have
Note that CI is only failing due to -Werror and a redundant import.
        Text.Parser.Token.Style.haskellCommentStyle

    nesting (Parser m) = Parser (nesting m)
    highlight _ = id
Note that this is the default implementation of highlight, so you can omit it.
src/Dhall/Parser.hs
Outdated
instance Show ParseError where
    show (ParseError doc) =
        "\n\ESC[1;31mError\ESC[0m: Invalid input\n\n" <> show doc
    show (err) =
Minor nitpick: You don't need parentheses here any longer
This gives nicer error messages, which show the error in the context of the original source code.
I also created a pull request against your branch that includes these changes and a change to use the parseErrorPretty' function.
Use `parseErrorPretty'` instead of `parseErrorPretty`
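For anyone following along, a sketch of the difference (megaparsec 6.x API; renderError is just an illustrative wrapper, not something in this PR):

import Data.Text (Text)
import Data.Void (Void)
import qualified Text.Megaparsec

-- parseErrorPretty' additionally takes the original input, so the rendered
-- message can include the offending source line and a caret, whereas
-- parseErrorPretty only reports the position and expected/unexpected items.
renderError :: Text -> Text.Megaparsec.ParseError Char Void -> String
renderError = Text.Megaparsec.parseErrorPretty'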
Awesome! Thanks for doing this :)
Just to add some encouragement: it's really easy to write a slow parser, so even translating to megaparsec is just step one. Avoiding inefficient combinators and backtracking etc. should help.

I have a similar issue for a language parser I'm writing, where I have splicing (like string substitution) which involves parsing two languages at once. A tokenizer in attoparsec can handle this, i.e. parsing into chunks of LanguageOne and LanguageTwo. I've implemented one (it parses e.g. 100KB of multi-level nesting in ~1.9ms; just benchmarked it), but the error messages for a misbalanced delimiter in attoparsec are still useless. So I'm considering either implementing the tokenizer with a manual bytestring parser, which can show useful error messages, or implementing the tokenizer in megaparsec. It's only a page of code, so I'll probably try out megaparsec and see how it shapes up speed-wise against attoparsec.

Ref. https://gist.github.com/chrisdone/b2966766a07bc1a8eb38d8ff693ff674
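As a concrete, hedged illustration of the backtracking point, here is the kind of left-factoring that removes a try. The keyword choice is made up, and note that in the megaparsec 6.x series used here, string does not backtrack automatically on failure, which is why the first version needs try at all:

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Semigroup ((<>))
import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec, option, try)
import Text.Megaparsec.Char (string)

type P = Parsec Void Text

-- Backtracking version: when "forall" fails part-way through, try rewinds
-- and the second branch scans the same "for" prefix again.
quantifierSlow :: P Text
quantifierSlow = try (string "forall") <|> string "for"

-- Left-factored version: the shared prefix is consumed exactly once, and
-- only the short suffix is ever retried.
quantifierFast :: P Text
quantifierFast = do
  prefix <- string "for"
  suffix <- option "" (try (string "all"))
  pure (prefix <> suffix)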
@chrisdone: One solution which we're considering as an intermediate approach is to use the parsers type classes. Then after that the next step is what you suggested: using a separate tokenizer.
Right, I saw your comment on using a parsers type class. That approach sounds interesting. In the case of Dhall I imagine most code you ever parse is correct, so the successful path will happen more often, and only in dev will you actually see a parse error. So that's an interesting idea! I'd be interested to see updates on that.
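A tiny sketch of what writing against the parsers type classes buys (natural here is an illustrative combinator, not Dhall's): the same definition can be run with trifecta, with attoparsec via the instances the parsers package provides, or with the megaparsec-backed Parser from this branch, so the underlying library stays swappable.

import Control.Applicative (some)
import Text.Parser.Char (CharParsing, digit)

-- Defined once against the type class; any CharParsing instance can run it.
natural :: CharParsing m => m Integer
natural = read <$> some digit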
Indeed, I have the same interpolation problem, trying to do a lex on syntax like this:

demo = { # quoted
quoted $var;
echo "quoted ${} \"foo\" string";
${let i = [1,2,3] # unquoted
p = {ls -al; this is; # quoted
$i ${i} code}
cmd = {stack}
in unquoted text
"unquoted {} string"
p
unquoted text2};
quoted;
text2;
}

This requires some balancing and language switching between the "shell mode" and the "haskell-like mode". One nice thing about this is that I could use the library of my lang parser to provide trivial editor support for syntax highlighting by exposing the tokenizer as e.g. JSON output. I'm hesitant to go all in on a single char-by-char parser straight into the AST. I'll have a go at trying to "drop in" megaparsec in place of attoparsec here. Apparently it's capable of fast parsing. The tokenizer output for the example above looks like:

Right
[ UnquotedByteString "demo = "
, UnquotedSplice
[ QuotedByteString " "
, QuotedComment " quoted comment"
, QuotedByteString "\n quoted "
, QuotedSpliceWord "var"
, QuotedByteString "; "
, QuotedComment " another"
, QuotedByteString "\n echo "
, QuotedString "quoted ${}\n \"foo\" string"
, QuotedByteString ";\n "
, QuotedSplice
[ UnquotedByteString "let i = [1,2,3] "
, UnquotedComment " unquoted"
, UnquotedByteString "\n p = "
, UnquotedSplice
[ QuotedByteString "ls -al; this is; "
, QuotedComment " quoted"
, QuotedByteString "\n "
, QuotedSpliceWord "i"
, QuotedByteString " "
, QuotedSplice [UnquotedByteString "i"]
, QuotedByteString " code"
]
, UnquotedByteString "\n cmd = "
, UnquotedSplice [QuotedByteString "stack"]
, UnquotedByteString "\n in unquoted text\n "
, UnquotedString "unquoted {} string"
, UnquotedByteString "\n p\n unquoted text2"
]
, QuotedByteString ";\n quoted;\n text2;\n"
]
, UnquotedByteString "\n"
] |
So for a word-for-word drop-in replacement, megaparsec is about 1.5x the time of attoparsec on a 100KB file: https://gist.github.com/chrisdone/349c07bf80f89fc5482331cce20199e4
It seems like a reasonable trade-off; it's not like it's 5x or 10x slower. I don't know whether there's a lot of room for performance improvement in megaparsec.
The difference in error messages if I omit a closing delimiter:

> lexFile "bad.hell" -- attoparsec
endOfInput

> lexFileM "bad.hell" -- megaparsec
27:1:
   |
27 | <empty line>
   | ^
unexpected end of input
expecting "${", '"', '#', '$', or '}'
For me tests go from 1.08s to 1.6s 😞

This isn't a finished PR (I only did enough to build lib:dhall and test:test), stopping here for feedback. I think it's probably still worth doing, as megaparsec has more eyes on it, and more scope for performance tuning. If you're happy with this I'll merge in master and complete the work.

Fixes #240.