Suspicious amount of time and allocation on decode path #4

Closed
istathar opened this issue Jan 23, 2014 · 26 comments

@istathar

We've finally reached the point of doing some profiling on our application, and there's one standout: getVarintPrefixedBS is doing the bulk of the allocation (fine) but taking an awfully long time to do it.

                                                                                             individual     inherited
COST CENTRE                 MODULE                                 no.     entries  %time %alloc   %time %alloc  ticks     bytes 
getVarintPrefixedBS         Data.ProtocolBuffers.Wire              264           0   77.7   57.8    77.7   57.8   1974 2900040432

Not sure if that rings any bells, but I thought I'd ask in case there's anything obvious there that we should have a look at.

AfC

@NathanHowell
Member

I've seen this a bit in the past... my recollection is that it's from code inlined from cereal, and getVarintPrefixedBS doesn't do much else itself. I think the alternatives would be to fix cereal, switch to binary, or use kmett's package that abstracts over the two.

@NathanHowell
Member

The suspect function is here if you want to check it out; I'm also open to other suggestions. Are you decoding a lot of strings?

@istathar
Author

Yeah, I am. The protobufs are (pretty . show . contrived) something like:

datacenter:lhr1,
hostname:secure.example.org,
metric:eth0-tx-bytes
1390810279171907000
15301

The numbers are just a few bytes each, varint- and/or fixed64-encoded, but the key-value pairs are about a hundred bytes and, yeah, string heavy.
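
For context, a protobuf varint is base-128 with a continuation bit in each byte. A minimal decoder sketch over a strict ByteString (illustrative only, not the library's actual getVarintPrefixedBS):

import Data.Bits (shiftL, testBit, (.&.), (.|.))
import Data.Word (Word64)
import qualified Data.ByteString as B

-- Sketch only: decode one little-endian base-128 varint from the front
-- of a ByteString, returning the value and the remaining input.
decodeVarint :: B.ByteString -> Maybe (Word64, B.ByteString)
decodeVarint = go 0 0
  where
    go shift acc bs = case B.uncons bs of
      Nothing -> Nothing
      Just (b, rest)
        | testBit b 7 -> go (shift + 7) acc' rest  -- high bit set: more bytes follow
        | otherwise   -> Just (acc', rest)
        where acc' = acc .|. (fromIntegral (b .&. 0x7f) `shiftL` shift)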

AfC

@NathanHowell
Member

How much of an impact is this having on your app? I'd like to get this sorted out but I'm short of time at the moment.

@istathar
Author

  1,000 data points 0.23 seconds;
10,000 data points 9.6 seconds.

Ouch. Profiling shows it to be about 75% of time and 92% of the allocation burden.

      fmap                    Data.Serialize.Get                     301       10000    0.0    0.0    75.4   92.5
       fmap.\                 Data.Serialize.Get                     302       40002    0.0    0.0    75.4   92.5
        getWord8              Data.Serialize.Get                     312           0    0.0    0.0    75.4   92.4
         >>=                  Data.Serialize.Get                     313       50001    0.0    0.0    75.4   92.4
          >>=.\               Data.Serialize.Get                     314       70002    0.1    0.0    75.4   92.4
           demandInput.\      Data.Serialize.Get                     322           1    0.0    0.0     0.0    0.0
            mplus             Data.Serialize.Get                     323           0    0.0    0.0     0.0    0.0
             mplus.\          Data.Serialize.Get                     324           0    0.0    0.0     0.0    0.0
              mplus.\.kf'     Data.Serialize.Get                     325           1    0.0    0.0     0.0    0.0
               unGet          Data.Serialize.Get                     326           1    0.0    0.0     0.0    0.0
           >>=.\.ks'          Data.Serialize.Get                     316      130000   75.2   92.3    75.3   92.4
            unGet             Data.Serialize.Get                     318      130000    0.0    0.0     0.0    0.0
            return            Data.Serialize.Get                     317       70001    0.0    0.0     0.1    0.1
             return.\         Data.Serialize.Get                     319       70000    0.0    0.0     0.1    0.1
              fmap.\.ks'      Data.Serialize.Get                     320       40000    0.1    0.1     0.1    0.1
           unGet              Data.Serialize.Get                     315       70003    0.0    0.0     0.0    0.0

So it's on the Get path. cereal has a Builder which it employs for building ByteStrings on the encode path, but it seems to be doing something horrible on the decode one. We're looking into it further now. Offhand I can't see anything stupid that protobuf is doing, but overall the heap profile shows we're getting a steady leak, especially in PINNED. That's a bit of a surprise.

AfC

@christian-marie
Contributor

I've taken a look at this; it appears that cereal is simply slow due to its continuation-passing nature and the fact that each bind creates a new continuation.
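
Schematically, the shape is something like this (an illustrative CPS parser, not cereal's actual Get; note the fresh closure built on every bind):

{-# LANGUAGE RankNTypes #-}

import Data.ByteString (ByteString)

-- Illustrative CPS parser in the style of cereal's Get.
newtype P a = P { runP :: forall r. ByteString -> (ByteString -> a -> r) -> r }

-- Each bind wraps the success continuation in a fresh closure, so a
-- long chain of binds allocates one closure per step.
bindP :: P a -> (a -> P b) -> P b
bindP m f = P $ \s k ->
  runP m s (\s' a -> runP (f a) s' k)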

This problem is easy to replicate with any repeated field.

Here is an example that shows the issue:

{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.ProtocolBuffers
import GHC.Generics(Generic)
import Data.ByteString(ByteString)
import qualified Data.ByteString as B
import Data.TypeLevel(D1)
import Data.Serialize

data Burst = Burst {
    frames :: Repeated D1 (Value ByteString)
} deriving (Generic, Eq, Show)

instance Encode Burst
instance Decode Burst

encodeBurst :: [ByteString] -> ByteString
encodeBurst = runPut . encodeMessage . Burst . putField

decodeBurst :: ByteString -> Either String Burst
decodeBurst = runGet decodeMessage

main :: IO ()
main = do
    -- Encode 30,000 copies of a three-byte string...
    let encoded = encodeBurst $ replicate 30000 "hai"
    print $ B.length encoded

    -- ...then decode them all back.
    let decoded = decodeBurst encoded
    either error (const $ return ()) decoded

@christian-marie
Contributor

Disregard that hand-wavy statement; I've found the real issue. I will provide free performance in a pull request shortly. Feel free to actually implement the fix however you like. The issue is list traversal on insertion into the hashmap.
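
The general shape of the problem and the fix, with illustrative names (this is the pattern, not protobuf's exact code): appending to the end of the per-tag list on every insert makes n inserts cost O(n^2), whereas consing and reversing once at the end is linear.

import Data.Hashable (Hashable)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap

-- O(n) per insert: (++) walks the whole accumulated list each time.
slowInsert :: (Eq k, Hashable k) => k -> v -> HashMap k [v] -> HashMap k [v]
slowInsert k v = HashMap.insertWith (\new old -> old ++ new) k [v]

-- O(1) per insert: cons onto the front...
fastInsert :: (Eq k, Hashable k) => k -> v -> HashMap k [v] -> HashMap k [v]
fastInsert k v = HashMap.insertWith (\_new old -> v : old) k [v]

-- ...then restore field order with a single reverse per key.
finish :: HashMap k [v] -> HashMap k [v]
finish = HashMap.map reverse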

@NathanHowell
Member

Let me know if this fixes your perf issue. I'll cut a new release if it does.

@istathar
Author

@NathanHowell definite improvement. We're down to 1.6 seconds from 9.6! @christian-marie is definitely on the right track. That same part of cereal is still costing about 24% of time, though. Makes me wonder if there's another thunk leak happening somewhere.

@NathanHowell
Member

Definitely possible; the issue I've seen was from Value Text fields IIRC, not Repeated x. If it's not from cereal, it's probably excess laziness. I'm going to leave this issue open for now, but I did cut a new release (v0.1.3, 8b2333f) with @christian-marie's patches.

@christian-marie
Contributor

I'm going to give hs-packer a go as soon as we can get the missing features into it:

vincenthz/hs-packer#2
vincenthz/hs-packer#3

@NathanHowell
Member

I've heard there is a big rewrite coming for binary... and it also has enough features (as of 0.6.x, IIRC) to work for protobuf. I think it is usually faster than cereal. If hs-packer works and is fast, I'm certainly open to taking it on.

@christian-marie
Contributor

I'm putting hs-packer on the back-burner until someone can work this out:

vincenthz/hs-packer#5

I did implement the decode path, though that wasn't an order of magnitude faster. You can see that here:

be0a7a2

I believe that the allocation of thousands of Haskell objects for each element of a repeated field might be the bottleneck here, not the serialization library.

@NathanHowell
Member

Because of HashMap Tag [WireField] in the intermediate decoding step?

@christian-marie
Contributor

I'm not entirely sure. I suspect so though.

Running a large decode with +RTS -hy shows a lot of WireField, HashMap and ByteString allocation.

My bet is that if you can think of a faster way to do that, you'll probably nail the performance issues.

@NathanHowell
Member

The other obvious way is to use a static (per-type) HashMap of setters instead of a HashMap of WireFields. I've thought about doing this before; it may still be necessary to batch repeated-field updates to minimize record copies (and GC time). I was originally thinking it would be faster to use a HashMap because of this, but I haven't tried it. I can hack up an implementation later in the week unless you want to take a stab at it.

@christian-marie
Contributor

A simple test, decoding 2.9MB of repeated four-byte strings: http://ponies.io/raw/protobuf-hy.pdf

{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.ProtocolBuffers
import GHC.Generics(Generic)
import Data.ByteString(ByteString)
import qualified Data.ByteString as B
import Data.TypeLevel(D1)
import Data.Serialize

data Burst = Burst {
    frames :: Repeated D1 (Value ByteString)
} deriving (Generic, Eq, Show)

instance Encode Burst
instance Decode Burst

encodeBurst :: [ByteString] -> ByteString
encodeBurst = runPut . encodeMessage . Burst . putField

decodeBurst :: ByteString -> Either String Burst
decodeBurst = runGet decodeMessage

main :: IO ()
main = do
    -- Decode the pre-encoded test file and force the message by printing it.
    Right m <- decodeBurst `fmap` B.readFile "encoded"
    print m

@christian-marie
Contributor

I'll note that the HashMap allocations are missing in that graph, probably because the only thing growing in this test case is the list under a single key.

@christian-marie
Contributor

Also interesting: ARR_WORDS, which I'm pretty sure is the underlying ByteString data buffer (2.9MB worth), stays pretty constant. So that's good. I don't think there's much we can do about the ByteString container allocations unless we're doing more of them than we need to.

Which only leaves WireField and the [] containers to be optimized.

As for taking a stab at the alternative implementation you mention: I'd be happy to, but I'm not going to be able to prioritize this until next week either. Maybe we can poke each other again then. Otherwise, if you can describe the implementation in a little more detail, I'll probably get around to trying it in my own time out of curiosity.

@christian-marie
Contributor

Hi Nathan, poke.

This is bugging us again; we'd like it to be faster. Any suggestions on where our efforts should be directed to cut down on these allocations next?

@NathanHowell
Member

Is the concern allocation or runtime? Reading 500k ByteStrings into a list has a similar heap profile, with the exception of WireField.DelimitedField.

import Control.Applicative
import Data.ByteString (ByteString)
import Data.Serialize.Get

-- Read a fixed count of 4-byte strings, accumulating in reverse.
quickDecode :: Get [ByteString]
quickDecode = go [] (512 * 1024 :: Int) where
  go xs 0 = return (reverse xs)
  go xs i = do
    x <- getBytes 4
    go (x : xs) (i - 1)

[heap profile: quickDecode]

-- Read 4-byte strings until input runs out, using Alternative to
-- detect end of input.
alternativeDecode :: Get [ByteString]
alternativeDecode = reverse <$> go [] where
  go xs = do
    mx <- Just <$> getBytes 4 <|> pure Nothing
    case mx of
      Just x  -> go (x : xs)
      Nothing -> pure xs

[heap profile: alternativeDecode]

decodeBurst :: ByteString -> Either String Burst
decodeBurst = runGet decodeMessage

[heap profile: decodeBurst]

@christian-marie
Contributor

Our concern is specifically the speed of decoding many delimited fields, and decoding many small messages in series.

We are seeing this kind of thing currently:

COST CENTRE MODULE             %time %alloc

>>=.\.ks'   Data.Serialize.Get  50.8   58.3
>>=.\       Data.Serialize.Get  15.8   18.4
fmap.\.ks'  Data.Serialize.Get   7.7   15.1

Where ks' is, I think, the success continuation of the bind? I.e., mfield <- getWireField.

I'm thinking that reducing allocations by boxing things as little as possible will decrease the runtime. Our bottom line is reducing CPU usage and runtime; memory is cheap as far as we're concerned, but allocations are expensive.

The motivation behind this is that we are currently bottlenecked almost completely on decoding protobufs.

@NathanHowell
Member

Yeah, the expensive >>=.\.ks' is this one.

The implementation in protobuf decodes a message into a HashMap Tag [WireField], and then runs a second step to finish parsing the WireField values and set them in the message record. This is faster for messages with a relatively large number of fields, because HashMap has better insert performance than copying a record. But it's also going to keep the WireField values around for a while, so they're likely to get aged out of gen0 and be more expensive to collect.
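
In types, the current shape is roughly this (illustrative signatures, not protobuf's actual internals):

-- Phase 1: parse the wire stream into an intermediate map.
decodeWire :: Get (HashMap Tag [WireField])

-- Phase 2: finish parsing the accumulated WireFields and build the record.
fromWire :: HashMap Tag [WireField] -> Either String msg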

An alternative implementation with much better cache locality and improved GC friendliness (most/all intermediate parses should be collected early in gen0) is to do the field conversion and setting during decode... encoding is done in a similar fashion already. The decoding implementation would start with something like this:

messageFields
  :: (Generic msg, GDecode (Rep msg))
  => proxy msg
  -> HashMap Tag (Setter' msg WireField)

Then just fold a default message over a stream of (Tag, WireField) and let the setters fill in the record. There would need to be some additional work for required fields, but otherwise it should be pretty straightforward.
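
A minimal sketch of that fold, using plain update functions in place of lens setters (a rank-2 Setter' can't be stored directly in a HashMap without a wrapper such as lens's ReifiedSetter) and hypothetical Tag/WireField stand-ins:

import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap
import Data.List (foldl')

type Tag = Int             -- stand-in for protobuf's Tag
data WireField = WireField -- stand-in for the real wire type

-- One update function per tag, computed once per message type.
type Setters msg = HashMap Tag (WireField -> msg -> msg)

-- Fold the default message over the decoded stream, letting each
-- setter fill in its field; unknown tags are skipped.
foldFields :: Setters msg -> msg -> [(Tag, WireField)] -> msg
foldFields setters = foldl' step
  where
    step msg (tag, wf) =
      maybe msg (\set' -> set' wf msg) (HashMap.lookup tag setters)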

@NathanHowell
Member

And what I alluded to a while back... to avoid allocating a WireField during parsing, its constructors need to be lifted up a level:

data Decoders msg = Decoders {
  varintFieldDecoder    :: HashMap Tag (Setter' msg Word64),
  fixed64FieldDecoder   :: HashMap Tag (Setter' msg Word64),
  delimitedFieldDecoder :: HashMap Tag (Setter' msg ByteString),
  -- startFieldDecoder
  -- endFieldDecoder
  fixed32FieldDecoder   :: HashMap Tag (Setter' msg Word32)
}
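
(The point of splitting the maps by wire type is that the parser can branch on the wire-type bits of the key it just read and hand the raw Word64, Word32, or ByteString straight to a monomorphic setter, so no WireField constructor is ever built.)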

@NathanHowell
Member

This should be fixed by #19; many thanks to @joshbohde.

@christian-marie
Contributor

Nice, thanks @joshbohde. I might try to re-run my test here and see how much it's improved:
#4 (comment)
