
Replace trifecta with megaparsec #268

Merged · 12 commits into dhall-lang:master · Mar 27, 2018

Conversation

@ocharles (Member) commented Feb 16, 2018

For me, tests go from 1.08s to 1.6s 😞. This isn't a finished PR (I only did enough to build lib:dhall and test:test); I'm stopping here for feedback. I think it's probably still worth doing, as megaparsec has more eyes on it and more scope for performance tuning. If you're happy with this, I'll merge in master and complete the work.

Fixes #240.


instance TokenParsing Parser where
    someSpace =
        Text.Parser.Token.Style.buildSomeSpaceParser
            (Parser someSpace)
            (Parser (Text.Megaparsec.skipSome (Text.Megaparsec.Char.satisfy Data.Char.isSpace)))
A Contributor commented on the diff:

Do you think replacing satisfy isSpace with something à la sc = L.space space1 lineComment blockComment might make a difference performance-wise?
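For reference, a minimal sketch of the suggested space consumer, assuming megaparsec 6's Text.Megaparsec.Char.Lexer module; the comment delimiters below are placeholders, not Dhall's actual comment syntax:

{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec)
import Text.Megaparsec.Char (space1)
import qualified Text.Megaparsec.Char.Lexer as L

type P = Parsec Void Text

-- Consumes whitespace and comments in bulk rather than one
-- character at a time; the delimiters are illustrative only.
sc :: P ()
sc = L.space space1
             (L.skipLineComment "--")
             (L.skipBlockCommentNested "{-" "-}")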

@ocharles (Member, Author) replied:

No idea! You're welcome to grab my branch and see :) I won't be able to check that for a while; I'm probably not free to hack on this until Sunday.

    notFollowedBy = Text.Megaparsec.notFollowedBy

instance Text.Parser.Char.CharParsing Parser where
    satisfy = Parser . Text.Megaparsec.Char.satisfy
A Collaborator commented on the diff:

You will probably get a speed improvement if you implement the text method of the CharParsing class with a high-performance primitive (and probably the rest of the class, too). One of the reasons for using megaparsec is that it provides bulk parsing primitives.

The other reason I suggest this is that this is where the parsing bottleneck is on master.
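A hedged sketch of that idea, assuming Parser is the newtype over megaparsec's Parsec defined in this PR; Text.Megaparsec.Char.string is megaparsec 6's bulk primitive, matching a whole Text chunk in one step instead of character by character:

instance Text.Parser.Char.CharParsing Parser where
    satisfy = Parser . Text.Megaparsec.Char.satisfy

    -- Bulk primitive: match the entire Text in a single step.
    text t = Parser (Text.Megaparsec.Char.string t)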

@Gabriella439 (Collaborator) commented:

@ocharles: Also, if you can push your branch to this repository, then I can open a pull request against your branch to get the rest of the library to build.

However, I think we should get at least similar performance before merging this. A 60% slowdown is still a lot.

@ocharles (Member, Author) commented:

Ok, this is now up to date against master, and all components build. On my laptop:

  • master tests run in ~1s
  • ollie/megaparsec tests run in ~2.2s.

We still have work to do. I have at least implemented all of the parsers type classes, but note that text and string do a lot of packing and unpacking. That might not be helping things; I'm going to look at that next.
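To illustrate the packing and unpacking: in the parsers classes, string works on String while text works on Text, so defining either in terms of the other pays for a conversion on every match. A hypothetical method definition of that shape:

    -- Every call pays for a pack of the argument and an unpack
    -- of the matched input.
    string s = Data.Text.unpack <$> text (Data.Text.pack s)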

The branch has been pushed to this repository.

@Gabriella439 (Collaborator) commented:

Yeah, in my own branch based on yours I found that implementing the text method in terms of string actually slowed down performance. The next thing I'm trying is Chris Done's suggestion of using attoparsec to tokenize the input but using megaparsec to parse based on the tokens.

@PierreR (Contributor) commented Feb 18, 2018

Yeah, in my own branch based on yours I found that implementing the text method in terms of string actually slowed down performance.

I have experienced the same... I am a bit puzzled by the IsString instance in the Parser module and the Chunk datatype. Isn't this what Stream is about in megaparsec?

@Gabriella439 (Collaborator) commented:

I have a branch with an example use of attoparsec to tokenize the input and a benchmark: https://github.com/dhall-lang/dhall-haskell/tree/337d363151a3be60be01cc8a5636b4c19c9da1e2

It's very fast, so if lexing is the bottleneck, that should improve performance a lot. There's still a lot of work to do, though (like properly handling string interpolation, which I think is still possible).
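As a rough shape of that approach (all names below are illustrative, not the code in that branch): attoparsec lexes the input into a flat token list quickly, and a later phase parses the tokens.

import Control.Applicative (many, (<|>))
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)

-- A hypothetical token type for the lexing phase.
data Tok
    = TOpenBrace
    | TCloseBrace
    | TIdent Text
    deriving (Eq, Ord, Show)

lexToken :: A.Parser Tok
lexToken =
        (TOpenBrace  <$ A.char '{')
    <|> (TCloseBrace <$ A.char '}')
    <|> (TIdent <$> A.takeWhile1 (\c -> c `notElem` delimiters))
  where
    delimiters = "{} \n" :: String

lexAll :: A.Parser [Tok]
lexAll = many (A.skipSpace *> lexToken) <* A.skipSpace <* A.endOfInput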

@vmchale (Collaborator) commented Mar 24, 2018

What's the status of this? I may be able to write an Alex-based lexer, but I do not wish to tread on any toes.

@Gabriella439 (Collaborator) commented:

So my inclination at this point is to accept the switch to megaparsec even if it takes a performance hit. We just need to clean this up and complete the transition.

I do think we should do a separate lexing phase at some point, but the first step should be to complete the transition to megaparsec, since trifecta/parsers does not permit customizing the token type like megaparsec does. Once that is complete, we can move forward with changing the token type from characters to lexed tokens.
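Concretely, megaparsec parameterizes the parser over its input stream, so once a lexer exists the parser could consume tokens instead of characters. A sketch of the shape this would take (the Tok type is hypothetical, and it would additionally need a Text.Megaparsec.Stream instance for [Tok], not shown here):

import Data.Void (Void)
import Text.Megaparsec (Parsec)

-- Hypothetical lexed-token type; a Stream instance for [Tok]
-- would have to be written before this compiles.
data Tok = TLambda | TArrow | TIdent String
    deriving (Eq, Ord, Show)

type TokenParser = Parsec Void [Tok]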

@ocharles (Member, Author) commented:

I've updated my branch to have master merged in, along with slightly more idiomatic Nix changes.

@Gabriella439 (Collaborator) left a review:

Note that CI is only failing due to -Werror and a redundant import.

        Text.Parser.Token.Style.haskellCommentStyle

    nesting (Parser m) = Parser (nesting m)
    highlight _ = id
A Collaborator commented on the diff:

Note that this is the default implementation of highlight, so you can omit it.


instance Show ParseError where
    show (ParseError doc) =
        "\n\ESC[1;31mError\ESC[0m: Invalid input\n\n" <> show doc
    show (err) =
A Collaborator commented on the diff:

Minor nitpick: you don't need parentheses here any longer.

This gives nicer error messages, which contextualize the error within the original source code.
@Gabriella439 (Collaborator) commented:

I also created a pull request against your branch that includes these changes and a change to use the parseErrorPretty' function: ocharles#2

Use `parseErrorPretty'` instead of `parseErrorPretty`
@Gabriella439 (Collaborator) commented:

Awesome! Thanks for doing this :)

@Gabriella439 merged commit 922e20e into dhall-lang:master on Mar 27, 2018
@chrisdone commented Jun 21, 2018

Just to add some encouragement: it's really easy to write a slow parser, so even translating to megaparsec is just step one. Avoiding inefficient combinators, backtracking, etc. should help.

I have a similar issue with a language parser I'm writing, where splicing (like string substitution) involves parsing two languages at once. A tokenizer in attoparsec can handle this, i.e. parse into chunks of LanguageOne and LanguageTwo; I've implemented one (it parses e.g. 100KB of multi-level nesting in ~1.9ms, just benchmarked), but the error messages for a misbalanced delimiter in attoparsec are still useless. So I'm considering either implementing the tokenizer as a manual bytestring parser, which can show useful error messages, or implementing the tokenizer in megaparsec. It's only a page of code, so I'll probably try out megaparsec and see how it shapes up speed-wise against attoparsec.

Ref. https://gist.github.com/chrisdone/b2966766a07bc1a8eb38d8ff693ff674

@Gabriella439 (Collaborator) commented:

@chrisdone: One solution we're considering as an intermediate approach is to use the parsers type classes to take advantage of both libraries: we first try to parse using attoparsec for speed on the happy path, and then, if parsing fails, we fall back on reparsing with megaparsec for better error messages.
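A sketch of that fallback, assuming the grammar is written once against the parsers type classes (the parsers package ships attoparsec instances, and this PR defines megaparsec ones); parseExpr and Expr are placeholders for the real Dhall definitions:

-- Fast path: attoparsec. Slow path, taken only on failure:
-- megaparsec, purely for its better error messages.
-- parseExpr is assumed to be polymorphic over the parsers classes,
-- e.g. parseExpr :: TokenParsing m => m Expr, and unParser (name
-- assumed) unwraps this PR's megaparsec newtype.
parseWithFallback :: Text -> Either String Expr
parseWithFallback input =
    case Data.Attoparsec.Text.parseOnly parseExpr input of
        Right expr -> Right expr
        Left _ ->
            case Text.Megaparsec.parse (unParser parseExpr) "(input)" input of
                Right expr -> Right expr
                Left err -> Left (Text.Megaparsec.parseErrorPretty err)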

Then the next step is what you suggested: using attoparsec for lexing and megaparsec for parsing. I've already done a very rough first draft of lexing with attoparsec and was able to reproduce the much faster throughput you describe. However, switching to a lexer requires more effort than I've invested so far, because the language has some constructs that are not lexing-friendly, like nested block comments and string interpolation. I don't think it's impossible; it just requires more work.

@chrisdone commented Jun 21, 2018

One solution we're considering as an intermediate approach is to use the parsers type classes to take advantage of both libraries: we first try to parse using attoparsec for speed on the happy path, and then, if parsing fails, we fall back on reparsing with megaparsec for better error messages.

Right, I saw your comment on using the parsers type classes. That approach sounds interesting. In the case of Dhall, I imagine most code you ever parse is correct, so the successful path is the common case; only during development will you actually see a parse error. So that's an interesting idea! I'd be interested to see updates on that.

I've already done a very rough first draft of lexing with attoparsec and was able to reproduce the much faster throughput you describe. However, switching to a lexer requires more effort than I've invested so far, because the language has some constructs that are not lexing-friendly, like nested block comments and string interpolation. I don't think it's impossible; it just requires more work.

Indeed, I have the same interpolation problem, trying to lex syntax like this:

demo = { # quoted 
  quoted $var; 
  echo "quoted ${} \"foo\" string";
  ${let i = [1,2,3] # unquoted 
        p = {ls -al; this is; # quoted 
           $i ${i} code}
        cmd = {stack}
    in unquoted text
       "unquoted {} string"
       p
       unquoted text2};
  quoted;
  text2;
}

This requires some balancing and language switching between the "shell mode" and the "Haskell-like mode" (foo ${bar {mu ${foo}}}, etc.). Lexing this is fast if you settle for simply producing boundaries (see the output below). Obviously, those QuotedByteString and UnquotedByteString values are just the untokenized strings, which could be fleshed out further, either immediately or in a later consumer.

One nice thing about this is that I could use my language parser's library to provide trivial editor support for syntax highlighting, by exposing the tokenizer as e.g. JSON output. I'm hesitant to go all-in on a single char-by-char parser straight into the AST.

I'll have a go at trying a drop-in replacement with megaparsec here. Apparently it's capable of fast takeWhile-type consumers (see the sketch after the output below), so I'll give it a go and report my numbers vs. attoparsec. 👍

Right
  [ UnquotedByteString "demo = "
  , UnquotedSplice
      [ QuotedByteString " "
      , QuotedComment " quoted comment"
      , QuotedByteString "\n  quoted "
      , QuotedSpliceWord "var"
      , QuotedByteString "; "
      , QuotedComment " another"
      , QuotedByteString "\n  echo "
      , QuotedString "quoted ${}\n  \"foo\" string"
      , QuotedByteString ";\n  "
      , QuotedSplice
          [ UnquotedByteString "let i = [1,2,3] "
          , UnquotedComment " unquoted"
          , UnquotedByteString "\n        p = "
          , UnquotedSplice
              [ QuotedByteString "ls -al; this is; "
              , QuotedComment " quoted"
              , QuotedByteString "\n           "
              , QuotedSpliceWord "i"
              , QuotedByteString " "
              , QuotedSplice [UnquotedByteString "i"]
              , QuotedByteString " code"
              ]
          , UnquotedByteString "\n        cmd = "
          , UnquotedSplice [QuotedByteString "stack"]
          , UnquotedByteString "\n    in unquoted text\n       "
          , UnquotedString "unquoted {} string"
          , UnquotedByteString "\n       p\n       unquoted text2"
          ]
      , QuotedByteString ";\n  quoted;\n  text2;\n"
      ]
  , UnquotedByteString "\n"
  ]
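On the fast takeWhile-style consumers mentioned above: megaparsec 6 exposes takeWhileP, which consumes a maximal run of matching input as a single chunk rather than character by character. A minimal example, with the predicate purely illustrative:

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec (Parsec, takeWhileP)

-- Grabs a run of non-delimiter characters as one Text chunk; the
-- label feeds the "expecting ..." part of error messages.
unquotedRun :: Parsec Void Text Text
unquotedRun = takeWhileP (Just "unquoted character")
                         (\c -> c /= '$' && c /= '}')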

@chrisdone commented Jun 22, 2018

So for a word-for-word drop-in replacement, megaparsec takes about 1.5x the time of attoparsec on a 100KB file: https://gist.github.com/chrisdone/349c07bf80f89fc5482331cce20199e4

benchmarking attoparsec
time                 1.686 ms   (1.679 ms .. 1.694 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.678 ms   (1.672 ms .. 1.686 ms)
std dev              25.18 μs   (19.12 μs .. 32.91 μs)

benchmarking megaparsec
time                 2.646 ms   (2.627 ms .. 2.667 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 2.623 ms   (2.615 ms .. 2.634 ms)
std dev              30.27 μs   (22.03 μs .. 41.35 μs)

It seems like a reasonable trade-off; it's not as if it's 5x or 10x slower.

I don't know whether there's a lot of room for performance improvement in megaparsec.

@chrisdone commented:

The difference in error messages if I omit a closing }:

>  lexFile "bad.hell" -- attoparsec
endOfInput
> lexFileM "bad.hell" -- megaparsec
27:1:
   |
27 | <empty line>
   | ^
unexpected end of input
expecting "${", '"', '#', '$', or '}'
