
Replace trifecta with megaparsec #268

Merged · 12 commits into dhall-lang:master · Mar 27, 2018

Conversation

@ocharles (Member) commented Feb 16, 2018

For me, tests go from 1.08s to 1.6s 😞. This isn't a finished PR (I only did enough to build lib:dhall and test:test); I'm stopping here for feedback. I think it's probably still worth doing, as megaparsec has more eyes on it and more scope for performance tuning. If you're happy with this, I'll merge in master and complete the work.

Fixes #240.


instance TokenParsing Parser where
    someSpace =
        Text.Parser.Token.Style.buildSomeSpaceParser
            (Parser someSpace)
            (Parser (Text.Megaparsec.skipSome (Text.Megaparsec.Char.satisfy Data.Char.isSpace)))
A Contributor commented on the diff:

Do you think replacing satisfy isSpace with something à la sc = L.space space1 lineComment blockComment might make a difference performance-wise?
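For reference, a minimal sketch of the suggested space consumer, assuming megaparsec 6's Text.Megaparsec.Char.Lexer module; the comment delimiters below are placeholders, not Dhall's actual comment syntax:

{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec)
import Text.Megaparsec.Char (space1)
import qualified Text.Megaparsec.Char.Lexer as L

type P = Parsec Void Text

-- Consumes whitespace and comments in bulk rather than one
-- character at a time; the delimiters are illustrative only.
sc :: P ()
sc = L.space space1
             (L.skipLineComment "--")
             (L.skipBlockCommentNested "{-" "-}")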

@ocharles (Member, Author) replied:

No idea! You're welcome to grab my branch and see :) I won't be able to check that for a while; I'm probably not free to hack on this until Sunday.

    notFollowedBy = Text.Megaparsec.notFollowedBy

instance Text.Parser.Char.CharParsing Parser where
    satisfy = Parser . Text.Megaparsec.Char.satisfy
A Collaborator commented on the diff:

You will probably get a speed improvement if you implement the text method of the CharParsing class with a high-performance primitive (and probably the rest of the class, too). One of the reasons for using megaparsec is that it provides bulk parsing primitives.

The other reason I suggest this is that this is where the parsing bottleneck is on master.
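A hedged sketch of that idea, assuming Parser is the newtype over megaparsec's Parsec defined in this PR; Text.Megaparsec.Char.string is megaparsec 6's bulk primitive, matching a whole Text chunk in one step instead of character by character:

instance Text.Parser.Char.CharParsing Parser where
    satisfy = Parser . Text.Megaparsec.Char.satisfy

    -- Bulk primitive: match the entire Text in a single step.
    text t = Parser (Text.Megaparsec.Char.string t)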

@Gabriella439 (Collaborator) commented:

@ocharles: Also, if you can push your branch to this repository, then I can open a pull request against your branch to get the rest of the library to build.

However, I think we should get at least similar performance before merging this. A 60% slowdown is still a lot.

@ocharles (Member, Author) commented:

Ok, this is now up to date against master, and all components build. On my laptop:

  • master tests run in ~1s
  • ollie/megaparsec tests run in ~2.2s.

We still have work to do. I have at least implemented all of the parsers type classes, but note that text and string do a lot of packing and unpacking. That might not be helping things; I'm going to look at that next.
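To illustrate the packing and unpacking: in the parsers classes, string works on String while text works on Text, so defining either in terms of the other pays for a conversion on every match. A hypothetical method definition of that shape:

    -- Every call pays for a pack of the argument and an unpack
    -- of the matched input.
    string s = Data.Text.unpack <$> text (Data.Text.pack s)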

The branch has been pushed to this repository.

@Gabriella439 (Collaborator) commented:

Yeah, in my own branch based on yours I found that implementing the text method in terms of string actually slowed down performance. The next thing I'm trying is Chris Done's suggestion of using attoparsec to tokenize the input but using megaparsec to parse based on the tokens.

@PierreR (Contributor) commented Feb 18, 2018

Yeah, in my own branch based on yours I found that implementing the text method in terms of string actually slowed down performance.

I have experienced the same... I am a bit puzzled by the IsString instance in the Parser module and the Chunk datatype. Isn't this what Stream is about in megaparsec?

@Gabriella439 (Collaborator) commented:

I have a branch with an example use of attoparsec to tokenize the input and a benchmark: https://github.com/dhall-lang/dhall-haskell/tree/337d363151a3be60be01cc8a5636b4c19c9da1e2

It's very fast, so if lexing is the bottleneck, that should improve performance a lot. There's still a lot of work to do, though (like properly handling string interpolation, which I think is still possible).
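As a rough shape of that approach (all names below are illustrative, not the code in that branch): attoparsec lexes the input into a flat token list quickly, and a later phase parses the tokens.

import Control.Applicative (many, (<|>))
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)

-- A hypothetical token type for the lexing phase.
data Tok
    = TOpenBrace
    | TCloseBrace
    | TIdent Text
    deriving (Eq, Ord, Show)

lexToken :: A.Parser Tok
lexToken =
        (TOpenBrace  <$ A.char '{')
    <|> (TCloseBrace <$ A.char '}')
    <|> (TIdent <$> A.takeWhile1 (\c -> c `notElem` delimiters))
  where
    delimiters = "{} \n" :: String

lexAll :: A.Parser [Tok]
lexAll = many (A.skipSpace *> lexToken) <* A.skipSpace <* A.endOfInput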

@vmchale (Collaborator) commented Mar 24, 2018

What's the status of this? I may be able to write an Alex-based lexer, but I do not wish to tread on any toes.

@Gabriella439 (Collaborator) commented:

So my inclination at this point is to accept the switch to megaparsec even if it takes a performance hit. We just need to clean this up and complete the transition.

I do think we should do a separate lexing phase at some point, but the first step should be to complete the transition to megaparsec, since trifecta/parsers does not permit customizing the token type like megaparsec does. Once that is complete, we can move forward with changing the token type from characters to lexed tokens.
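Concretely, megaparsec parameterizes the parser over its input stream, so once a lexer exists the parser could consume tokens instead of characters. A sketch of the shape this would take (the Tok type is hypothetical, and it would additionally need a Text.Megaparsec.Stream instance for [Tok], not shown here):

import Data.Void (Void)
import Text.Megaparsec (Parsec)

-- Hypothetical lexed-token type; a Stream instance for [Tok]
-- would have to be written before this compiles.
data Tok = TLambda | TArrow | TIdent String
    deriving (Eq, Ord, Show)

type TokenParser = Parsec Void [Tok]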

@ocharles (Member, Author) commented:

I've updated my branch to have master merged in, along with slightly more idiomatic Nix changes.

@Gabriella439 (Collaborator) left a review:

Note that CI is only failing due to -Werror and a redundant import.

        Text.Parser.Token.Style.haskellCommentStyle

    nesting (Parser m) = Parser (nesting m)
    highlight _ = id
A Collaborator commented on the diff:

Note that this is the default implementation of highlight, so you can omit it.


instance Show ParseError where
    show (ParseError doc) =
        "\n\ESC[1;31mError\ESC[0m: Invalid input\n\n" <> show doc
    show (err) =
A Collaborator commented on the diff:

Minor nitpick: you don't need parentheses here any longer.

This gives nicer error messages, which contextualize the error within the original source code.
@Gabriella439 (Collaborator) commented:

I also created a pull request against your branch that includes these changes and a change to use the parseErrorPretty' function: ocharles#2

Use `parseErrorPretty'` instead of `parseErrorPretty`
@Gabriella439 (Collaborator) commented:

Awesome! Thanks for doing this :)

@Gabriella439 merged commit 922e20e into dhall-lang:master on Mar 27, 2018
@chrisdone commented Jun 21, 2018

Just to add some encouragement: it's really easy to write a slow parser, so even translating to megaparsec is just step one. Avoiding inefficient combinators, backtracking, etc. should help.

I have a similar issue with a language parser I'm writing, where splicing (like string substitution) involves parsing two languages at once. A tokenizer in attoparsec can handle this, i.e. parse into chunks of LanguageOne and LanguageTwo; I've implemented one (it parses e.g. 100KB of multi-level nesting in ~1.9ms, just benchmarked), but the error messages for a misbalanced delimiter in attoparsec are still useless. So I'm considering either implementing the tokenizer as a manual bytestring parser, which can show useful error messages, or implementing the tokenizer in megaparsec. It's only a page of code, so I'll probably try out megaparsec and see how it shapes up speed-wise against attoparsec.

Ref. https://gist.github.com/chrisdone/b2966766a07bc1a8eb38d8ff693ff674

@Gabriella439 (Collaborator) commented:

@chrisdone: One solution we're considering as an intermediate approach is to use the parsers type classes to take advantage of both libraries: we first try to parse using attoparsec for speed on the happy path, and then, if parsing fails, we fall back on reparsing with megaparsec for better error messages.
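A sketch of that fallback, assuming the grammar is written once against the parsers type classes (the parsers package ships attoparsec instances, and this PR defines megaparsec ones); parseExpr and Expr are placeholders for the real Dhall definitions:

-- Fast path: attoparsec. Slow path, taken only on failure:
-- megaparsec, purely for its better error messages.
-- parseExpr is assumed to be polymorphic over the parsers classes,
-- e.g. parseExpr :: TokenParsing m => m Expr, and unParser (name
-- assumed) unwraps this PR's megaparsec newtype.
parseWithFallback :: Text -> Either String Expr
parseWithFallback input =
    case Data.Attoparsec.Text.parseOnly parseExpr input of
        Right expr -> Right expr
        Left _ ->
            case Text.Megaparsec.parse (unParser parseExpr) "(input)" input of
                Right expr -> Right expr
                Left err -> Left (Text.Megaparsec.parseErrorPretty err)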

Then the next step is what you suggested: using attoparsec for lexing and megaparsec for parsing. I've already done a very rough first draft of lexing with attoparsec and was able to reproduce the much faster throughput you describe. However, switching to a lexer requires more effort than I've invested so far, because the language has some constructs that are not lexing-friendly, like nested block comments and string interpolation. I don't think it's impossible; it just requires more work.

@chrisdone commented Jun 21, 2018

One solution we're considering as an intermediate approach is to use the parsers type classes to take advantage of both libraries: we first try to parse using attoparsec for speed on the happy path, and then, if parsing fails, we fall back on reparsing with megaparsec for better error messages.

Right, I saw your comment on using the parsers type classes. That approach sounds interesting. In the case of Dhall, I imagine most code you ever parse is correct, so the successful path is the common case; only during development will you actually see a parse error. So that's an interesting idea! I'd be interested to see updates on that.

I've already done a very rough first draft of lexing with attoparsec and was able to reproduce the much faster throughput you describe. However, switching to a lexer requires more effort than I've invested so far, because the language has some constructs that are not lexing-friendly, like nested block comments and string interpolation. I don't think it's impossible; it just requires more work.

Indeed, I have the same interpolation problem, trying to lex syntax like this:

demo = { # quoted 
  quoted $var; 
  echo "quoted ${} \"foo\" string";
  ${let i = [1,2,3] # unquoted 
        p = {ls -al; this is; # quoted 
           $i ${i} code}
        cmd = {stack}
    in unquoted text
       "unquoted {} string"
       p
       unquoted text2};
  quoted;
  text2;
}

This requires some balancing and language switching between the "shell mode" and the "Haskell-like mode" (foo ${bar {mu ${foo}}}, etc.). Lexing this is fast if you settle for simply producing boundaries (see the output below). Obviously, those QuotedByteString and UnquotedByteString values are just the untokenized strings, which could be fleshed out further, either immediately or in a later consumer.

One nice thing about this is that I could use my language parser's library to provide trivial editor support for syntax highlighting, by exposing the tokenizer as e.g. JSON output. I'm hesitant to go all-in on a single char-by-char parser straight into the AST.

I'll have a go at trying a drop-in replacement with megaparsec here. Apparently it's capable of fast takeWhile-type consumers (see the sketch after the output below), so I'll give it a go and report my numbers vs. attoparsec. 👍

Right
  [ UnquotedByteString "demo = "
  , UnquotedSplice
      [ QuotedByteString " "
      , QuotedComment " quoted comment"
      , QuotedByteString "\n  quoted "
      , QuotedSpliceWord "var"
      , QuotedByteString "; "
      , QuotedComment " another"
      , QuotedByteString "\n  echo "
      , QuotedString "quoted ${}\n  \"foo\" string"
      , QuotedByteString ";\n  "
      , QuotedSplice
          [ UnquotedByteString "let i = [1,2,3] "
          , UnquotedComment " unquoted"
          , UnquotedByteString "\n        p = "
          , UnquotedSplice
              [ QuotedByteString "ls -al; this is; "
              , QuotedComment " quoted"
              , QuotedByteString "\n           "
              , QuotedSpliceWord "i"
              , QuotedByteString " "
              , QuotedSplice [UnquotedByteString "i"]
              , QuotedByteString " code"
              ]
          , UnquotedByteString "\n        cmd = "
          , UnquotedSplice [QuotedByteString "stack"]
          , UnquotedByteString "\n    in unquoted text\n       "
          , UnquotedString "unquoted {} string"
          , UnquotedByteString "\n       p\n       unquoted text2"
          ]
      , QuotedByteString ";\n  quoted;\n  text2;\n"
      ]
  , UnquotedByteString "\n"
  ]
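On the fast takeWhile-style consumers mentioned above: megaparsec 6 exposes takeWhileP, which consumes a maximal run of matching input as a single chunk rather than character by character. A minimal example, with the predicate purely illustrative:

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec, takeWhileP)

-- Grabs a run of non-delimiter characters as one Text chunk; the
-- label feeds the "expecting ..." part of error messages.
unquotedRun :: Parsec Void Text Text
unquotedRun = takeWhileP (Just "unquoted character")
                         (\c -> c /= '$' && c /= '}')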

@chrisdone commented Jun 22, 2018

So for a word-for-word drop-in replacement, megaparsec takes about 1.5x the time of attoparsec on a 100KB file: https://gist.github.com/chrisdone/349c07bf80f89fc5482331cce20199e4

benchmarking attoparsec
time                 1.686 ms   (1.679 ms .. 1.694 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.678 ms   (1.672 ms .. 1.686 ms)
std dev              25.18 μs   (19.12 μs .. 32.91 μs)

benchmarking megaparsec
time                 2.646 ms   (2.627 ms .. 2.667 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 2.623 ms   (2.615 ms .. 2.634 ms)
std dev              30.27 μs   (22.03 μs .. 41.35 μs)

It seems like a reasonable trade-off; it's not as if it's 5x or 10x slower.

I don't know whether there's a lot of room for performance improvement in megaparsec.

@chrisdone commented:

The difference in error messages if I omit a closing }:

>  lexFile "bad.hell" -- attoparsec
endOfInput
> lexFileM "bad.hell" -- megaparsec
27:1:
   |
27 | <empty line>
   | ^
unexpected end of input
expecting "${", '"', '#', '$', or '}'
