
Large space usage increase between 0.10.3.0 and 0.10.4.0 #42

Closed
tibbe opened this Issue · 9 comments

6 participants

@tibbe

There's a large increase in space usage between 0.10.3.0 and 0.10.4.0. The increase is most easily seen using tibbe/cassava@76451b4 or later and this test program. Earlier versions of cassava show the same problem, but it's easier to see using the current master. You'll need the n32_results.txt input file.

To compile and run the test do:

ghc -O2 Test.hs -rtsopts
./Test n32_results.txt +RTS -h -i0.01

(Make sure cassava is compiled against either 0.10.3.0 or 0.10.4.0.)

Using 0.10.3.0 I get this heap profile:

(heap profile image: attoparsec-0.10.3.0)

And with 0.10.4.0 I get this:

(heap profile image: attoparsec-0.10.4.0)

I suspect c707514 might be to blame.

@batterseapower

I wouldn't be surprised if fixing <|> to have the correct behaviour increased space usage: after all, the old version dropped the input it had gathered upon detecting an error. That's the price of correctness, I guess.

@tibbe

If the cost is this high though, perhaps we should consider a breaking change and move to the parsec behavior where backtracking doesn't happen by default?

@tibbe

An alternative would be to have a non-backtracking <|> as well as a backtracking one, instead of changing the backtracking-by-default behaviour. attoparsec is used to write parsers for many data formats that require no or very little backtracking (often only 1 byte worth of look-ahead), so this is quite a hefty price to pay for something that's rarely needed.

@bos
Owner

I'd be happy to explore our options around when to drop the input we save in case we need to backtrack. I'm certainly open to having a "non-backtracking <|> combinator", though there may be other possibilities worth investigating.

The general experience with Parsec not backtracking by default suggests that people have a very hard time figuring out when to add uses of try, so I'd strongly prefer to maintain the current backtracking-by-default behaviour of attoparsec.

@nh2

For reference, there is a haskell-cafe thread that seems to suggest that a non-backtracking <|> would be useful - in that case, to give correct error messages.

In case you make such a combinator, <||> would be loved by most as the short-circuit OR ;)

@bos I would totally agree that the backtracking <|> should be the default behaviour - in my opinion, this is what makes attoparsec intuitive, unlike parsec. Parsec's behaviour makes my parsers less composable: it forces you, for each little thing in each parser, to globally know what can come next (somewhat like FIRST/FOLLOW sets) in order not to accidentally overcommit and eat too much input.

(To be honest, I would not be surprised if parsec's <|> behaviour violated some law. Anybody knows?)

For example, take an input like ( 4 2 12 31 ) - numbers in brackets separated by whitespace. Intuitively, I would parse this like this:

ws = whitespace
parse = string "(" *> ws *> numbers <* ws <* string ")"
numbers = decimal `sepBy1` ws

In attoparsec, this immediately works, and numbers has the nice property of spanning only 4 2 12 31 without eating the following whitespace - parsers that only consume what they really need are easier to re-use.

In parsec, this does not even work: after parsing 31, sepBy1 will try to consume ws again; this will succeed, but when it then fails to parse a decimal, it is already too late, and the whole parse fails.

This is because sepBy1 is defined like this:

sepBy1 p sep = do x <- p
                  xs <- many (sep >> p)
                  return (x:xs)

and as a user, I have no way to express that I want to have try semantics inside there.
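The consumed-input point can be made concrete with a toy model. The following is a self-contained sketch, not real Parsec: the consumed-flag Reply type and the names orElse, try', sepBy1', and sepBy1Try are invented for illustration. It models Parsec's consumed/empty semantics - choice only tries its right branch when the left one failed without consuming input - which is exactly why the library-defined sepBy1 above cannot be rescued from outside:

```haskell
import Data.Char (isDigit)

-- A failure or success records whether any input was consumed.
data Reply a = Ok a String Bool | Err Bool
  deriving Show

newtype P a = P { runP :: String -> Reply a }

instance Functor P where
  fmap f (P p) = P $ \s -> case p s of
    Ok a s' c -> Ok (f a) s' c
    Err c     -> Err c

instance Applicative P where
  pure a    = P $ \s -> Ok a s False
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad P where
  P p >>= f = P $ \s -> case p s of
    Err c     -> Err c
    Ok a s' c -> case runP (f a) s' of
      Ok b s'' c' -> Ok b s'' (c || c')
      Err c'      -> Err (c || c')

-- Parsec-style choice: the right branch runs only if the left
-- branch failed WITHOUT consuming input.
orElse :: P a -> P a -> P a
orElse (P p) (P q) = P $ \s -> case p s of
  Err False -> q s
  r         -> r

-- try: on failure, pretend nothing was consumed.
try' :: P a -> P a
try' (P p) = P $ \s -> case p s of
  Err _ -> Err False
  r     -> r

satisfy' :: (Char -> Bool) -> P Char
satisfy' f = P $ \s -> case s of
  (c:cs) | f c -> Ok c cs True
  _            -> Err False

many0, many1 :: P a -> P [a]
many0 p = orElse (many1 p) (pure [])
many1 p = (:) <$> p <*> many0 p

decimal' :: P Int
decimal' = read <$> many1 (satisfy' isDigit)

ws :: P ()
ws = () <$ many1 (satisfy' (== ' '))

-- sepBy1 exactly as quoted above: no try around the separator.
sepBy1' :: P a -> P sep -> P [a]
sepBy1' p sep = (:) <$> p <*> many0 (sep >> p)

-- The fix a user cannot apply from outside: try inside the loop.
sepBy1Try :: P a -> P sep -> P [a]
sepBy1Try p sep = (:) <$> p <*> many0 (try' (sep >> p))

parseMaybe :: P a -> String -> Maybe a
parseMaybe p s = case runP p s of
  Ok a _ _ -> Just a
  Err _    -> Nothing
```

With these definitions, parseMaybe (sepBy1' decimal' ws) "4 2 12 31 " fails, because the separator consumes the trailing space before the missing decimal is noticed, while sepBy1Try succeeds with Just [4,2,12,31].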

So my points are:

  • Offering exclusively backtracking or exclusively non-backtracking behaviour is not enough. We seem to need both, to write simpler and faster parsers while at the same time allowing better error messages.
  • Backtracking-by-default is the intuitive choice.
@t0yv0

Have you considered parallel evaluation of alternatives? Basic idea: when evaluating feed s (a <|> b), you can split s into chunks of a certain size and feed one chunk at a time to both parsers, until either of them fails. Same idea as in iterative deepening depth-first search (IDDFS), if you will. It should give the same answers as backtracking, but reduce memory use in cases where some of the alternatives fail quickly. I am not sure at all how this would play with real-world grammars - is the benchmark folder the one to look at?

@mvv
mvv commented

It is actually much, much worse. The following simple program (notice that I don't even aggregate the parsed values) parses an 8640x3432 table of Doubles (a ~115MB file):

{-# LANGUAGE UnicodeSyntax #-}
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE DoAndIfThenElse #-}

module Main where

import Control.Applicative
import Control.Monad
import Data.Attoparsec.ByteString.Char8
import qualified Data.Attoparsec.ByteString.Lazy as AttL
import qualified Data.ByteString.Lazy as LBS
import System.Environment (getArgs)

skipWs ∷ Parser ()
skipWs = skipMany1 (satisfy $ (||) <$> (== ' ') <*> (== '\t'))

skipNl ∷ Parser ()
skipNl =  skipMany (satisfy $ (||) <$> (== ' ') <*> (== '\t'))
       *> void (char '\n')

parseTable ∷ Int → Int → Parser ()
parseTable nRows nCols = do
    parseRows nRows
  where
    parseRows !left = do
      parseCols nCols
      if left == 1
      then do
        skipMany skipNl
        return ()
      else do
        skipNl
        parseRows (left - 1)
    parseCols !left = do
      void $ double
      if left == 1
      then return ()
      else do
        skipWs
        parseCols (left - 1)

main ∷ IO ()
main = do
  [path] ← getArgs
  AttL.parse (parseTable 3432 8640) <$> LBS.readFile path >>= \case
    AttL.Done _ () → do
      putStrLn $ "OK"
    AttL.Fail _ ctx e → do
      putStrLn $ "Failed: " ++ show ctx ++ ": " ++ show e

attoparsec-0.10.3 does the job in ~4min, using a constant amount of memory (~13M according to top). Here is the profiler report:
(heap profile image: test1)

I had to stop the attoparsec-0.10.4 version after 1 hour (!) because it was hogging all my RAM (~7.2G according to top). Here is the partial profiler report:
(heap profile image: test2)

This is obscene.

@mvv mvv referenced this issue in ekmett/parsers
Merged

Added instances for Attoparsec. #21

@bos
Owner

I have good news and bad. The good news is that the upcoming attoparsec 0.12 improves space usage on @tibbe's benchmark by 50%.

The bad news is that the internal machinery has fundamentally changed: attoparsec now uses a cursor to index its internal buffer, and no longer shrinks that buffer as a side effect of parsing. The consequence is that if I were to introduce a non-backtracking <|>, it would no longer have any effect on space usage, unless we were to switch to yet another internal model.

To be honest, the original buggy version of <|> saved space by accident. If you'd like to design an internal data structure that can stream (by which I mean that input that we've parsed can be GCed), please go for it, but remember that streaming is just one consideration.

Said data structure must also efficiently support all of the following: bytestring-levels-of-cheap sequential access, backtracking, no-backtracking, streaming, and incremental input (including good behaviour in the face of input maliciously trickled one byte at a time). What we have today satisfies all of these needs except streaming.

@bos bos closed this
@tibbe

Sounds fair to me. I worked around the <|> thunk build-up issue quite a long time ago by using the peek* functions instead.
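The peek-based workaround can be sketched without attoparsec, using only base: ReadP's look plays the role of attoparsec's peekChar/peekWord8. The grammar (digits separated by single spaces) and the names number/numbers are simplified stand-ins, not tibbe's actual code; the point is that the parser inspects the upcoming input and commits to one branch, so no alternative branch, and no saved input for backtracking, needs to be retained:

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

number :: ReadP Int
number = read <$> munch1 isDigit

numbers :: ReadP [Int]
numbers = do
  n <- number
  ahead <- look                       -- peek at remaining input, consuming nothing
  case ahead of
    (' ' : c : _) | isDigit c -> do   -- commit: a separator and another number follow
      _ <- char ' '
      ns <- numbers
      return (n : ns)
    _ -> return [n]                   -- commit: stop before trailing whitespace
```

Running readP_to_S numbers on "4 2 12 31 " yields the single parse ([4,2,12,31], " "): the decision between "continue" and "stop" is made by peeking, never by trying a branch and backtracking out of it.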

Finding a design that fulfills all your requirements, including support for streaming, looks like a research paper worthy challenge to me. :)
