Suspicious amount of time and allocation on decode path #4
Comments
I've seen this a bit in the past... my recollection is that it's from code inlined from
The suspect function is here if you want to check it out; I'm also open to other suggestions. Are you decoding a lot of strings?
Yeah, I am. The protobufs are (pretty . show . contrived) something like
The numbers are just a few bytes, varint- and/or fixed64-encoded, but the key/value pairs are about a hundred bytes each and, yeah, string heavy. AfC
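As a purely illustrative guess at that shape (not the actual schema, and every field name and tag below is invented), using the same style of field definitions as the examples later in the thread:

```haskell
{-# LANGUAGE DeriveGeneric #-}
module Sample where

import Data.ByteString (ByteString)
import Data.ProtocolBuffers
import Data.TypeLevel (D1, D2, D3)
import Data.Word (Word64)
import GHC.Generics (Generic)

-- A guess at the described message shape: small varint/fixed64 numbers
-- plus string-heavy ~100-byte key/value pairs.
data Sample = Sample
  { counter :: Required D1 (Value Word64)          -- a few bytes, varint-encoded
  , stamp   :: Required D2 (Value (Fixed Word64))  -- eight bytes, fixed64-encoded
  , pairs   :: Repeated D3 (Value ByteString)      -- ~100-byte key/value strings
  } deriving (Generic, Show)

instance Encode Sample
instance Decode Sample
```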
How much of an impact is this having on your app? I'd like to get this sorted out, but I'm short of time at the moment.
Ouch. Profiling shows it to be about 75% of the time and 92% of the allocation burden.
So it's on the Get path. cereal has a Builder which it employs for the work of building ByteStrings on the encode path, but it seems to be doing something horrible on the decode one. We're looking at it further now. Offhand I can't see anything stupid that protobuf is doing, but overall the heap profile shows we're getting a steady leak, especially in PINNED. That's a bit of a surprise. AfC
I've taken a look at this; it appears that cereal is simply slow due to its continuation-passing nature and the fact that each bind creates a new continuation. This problem is easy to replicate with any repeated field. Here is an example that shows the issue:

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.ProtocolBuffers
import GHC.Generics (Generic)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.TypeLevel (D1)
import Data.Serialize

data Burst = Burst
  { frames :: Repeated D1 (Value ByteString)
  } deriving (Generic, Eq, Show)

instance Encode Burst
instance Decode Burst

encodeBurst :: [ByteString] -> ByteString
encodeBurst = runPut . encodeMessage . Burst . putField

decodeBurst :: ByteString -> Either String Burst
decodeBurst = runGet decodeMessage

main :: IO ()
main = do
  let encoded = encodeBurst $ replicate 30000 "hai"
  print $ B.length encoded
  let decoded = decodeBurst encoded
  either error (const $ return ()) decoded
```
Disregard that handwavy statement. I've found the real issue. I will provide free performance in a pull request shortly; feel free to actually implement the fix however you like. The issue is list traversal on insertion into the HashMap.
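A minimal sketch of that accidentally-quadratic pattern (an assumption about the general shape, not the library's actual code): appending each new field to the end of the list stored under a tag walks the whole accumulated list on every insert, while prepending is O(1) with a single reverse at the end of the message.

```haskell
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap

type Tag = Int

slowInsert, fastInsert :: Tag -> v -> HashMap Tag [v] -> HashMap Tag [v]
-- O(length old) per insert, O(n^2) over a repeated field with n elements:
slowInsert tag v = HashMap.insertWith (\new old -> old ++ new) tag [v]
-- O(1) per insert; reverse each list once after the whole message is decoded:
fastInsert tag v = HashMap.insertWith (\new old -> new ++ old) tag [v]

main :: IO ()
main = print . HashMap.map reverse $
  foldr (fastInsert 1) HashMap.empty ([1 .. 10] :: [Int])
```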
Let me know if this fixes your perf issue. I'll cut a new release if it does.
@NathanHowell definite improvement. We're down to 1.6 seconds from 9.6! @christian-marie is definitely on the right track. That same part of cereal is still costing about 24% of the time, though. Makes me wonder if there's another thunk leak happening somewhere.
Definitely possible; the issue I've seen was from
I'm going to give hs-packer a go as soon as we can get the missing features into it:
I've heard there is a big rewrite coming for binary... and it also has enough features (as of 0.6.x, IIRC) to work for protobuf too. I think it is usually faster than cereal. If hs-packer works and is fast, I'm certainly open to taking it on.
I'm putting hs-packer on the back burner until someone can work this out: I did implement the decode path, though it wasn't an order of magnitude faster. You can see that here: I believe that the allocation of thousands of Haskell objects for each element of a repeated field might be the bottleneck here, not the serialization library.
Because of
I'm not entirely sure. I suspect so, though. Running a large decode with +RTS -hy shows a lot of WireField, HashMap, and ByteString allocation. My bet is that if you can think of a faster way to do that, you'll probably nail the performance issues.
The other obvious way is to use a static (per-type) HashMap of setters, roughly as sketched below.
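A rough sketch of that idea; every name here is an assumption, not the library's API. The map is built once per message type, so handling a decoded field is one lookup plus a record update, instead of accumulating an intermediate HashMap of WireFields:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HashMap

type Tag = Int
data WireField = DelimitedField ByteString | VarintField Integer

data Burst = Burst { frames :: [ByteString] } deriving Show

type Setter msg = WireField -> msg -> msg

-- Built once and shared by every decode of a Burst.
burstSetters :: HashMap Tag (Setter Burst)
burstSetters = HashMap.fromList
  [ (1, \field msg -> case field of
          DelimitedField bs -> msg { frames = bs : frames msg }
          _                 -> msg) ]

-- One lookup per decoded field; unknown tags are ignored.
applyField :: HashMap Tag (Setter msg) -> msg -> (Tag, WireField) -> msg
applyField setters msg (tag, field) =
  maybe msg (\set -> set field msg) (HashMap.lookup tag setters)

main :: IO ()
main = print $ foldl (applyField burstSetters) (Burst [])
  [(1, DelimitedField "hai"), (1, DelimitedField "bye")]
```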
A simple test, decoding 2.9 MB of repeated four-byte strings: http://ponies.io/raw/protobuf-hy.pdf

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.ProtocolBuffers
import GHC.Generics (Generic)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.TypeLevel (D1)
import Data.Serialize

data Burst = Burst
  { frames :: Repeated D1 (Value ByteString)
  } deriving (Generic, Eq, Show)

instance Encode Burst
instance Decode Burst

encodeBurst :: [ByteString] -> ByteString
encodeBurst = runPut . encodeMessage . Burst . putField

decodeBurst :: ByteString -> Either String Burst
decodeBurst = runGet decodeMessage

main :: IO ()
main = do
  Right m <- decodeBurst `fmap` B.readFile "encoded"
  print m
```
I'll note that the HashMap allocations are missing in that graph, probably because the only thing growing in this test case is one key (a list).
Also interesting: ARR_WORDS, which I'm pretty sure is the underlying ByteString data buffer (2.9 MB worth), stays pretty constant. So that's good. I don't think there's much we can do about the ByteString container allocations unless we're doing more than we need to, which leaves only the WireField and [] containers to be optimized. As for taking a stab at this alternative implementation you mention: I'd be happy to, but I'm not going to be able to prioritize it until next week either. Maybe we can poke each other again then. Otherwise, if you can describe the implementation in a little more detail, I'll probably get around to trying it in my own time out of curiosity.
Hi Nathan, poke. This is bugging us again; we'd like it to be faster. Any suggestions on where our efforts should be directed to cut down on these allocations next?
Is the concern allocation or runtime? Reading 500k
Our concern is specifically the speed of decoding many delimited fields, and of decoding many small messages in series. We are currently seeing this kind of thing in the profiles, where I think ks is the success continuation of the bind (i.e. mfield <- getWireField). I'm thinking that reducing allocations by boxing things as little as possible will decrease the runtime. Our bottom line is reducing CPU usage and runtime; memory is cheap as far as we're concerned, but allocations are expensive. The motivation behind this is that we are currently bottlenecked almost completely on decoding protobufs.
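For reference, a toy parser with the same continuation-passing shape (a simplification, not cereal's actual internals) shows where a fresh ks closure gets allocated at every bind:

```haskell
{-# LANGUAGE RankNTypes #-}
import Control.Monad (ap, liftM)
import Data.ByteString (ByteString)

-- A toy CPS decoder: every >>= wraps the success continuation `ks` in a
-- new closure, which is where those allocations come from.
newtype Get a = Get
  { unGet :: forall r.
       ByteString              -- remaining input
    -> (String -> r)           -- kf: failure continuation
    -> (ByteString -> a -> r)  -- ks: success continuation
    -> r
  }

instance Functor Get where
  fmap = liftM

instance Applicative Get where
  pure a = Get $ \bs _kf ks -> ks bs a
  (<*>) = ap

instance Monad Get where
  Get g >>= f = Get $ \bs kf ks ->
    -- a new closure per bind, capturing f, kf, and ks:
    g bs kf (\bs' a -> unGet (f a) bs' kf ks)
```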
Yeah, that's the expensive part. The implementation in protobuf decodes a message into a HashMap of WireFields. An alternative implementation with much better cache locality and improved GC friendliness (most/all intermediate parses should be collected early in gen0) is to do the field conversion and setting during decode... encoding is done in a similar fashion already. The decoding implementation would start with something like this:
Then just fold a default message over a stream of WireFields.
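A minimal sketch of that fold (every name here is an assumption, not the library's actual implementation): convert and apply each field as it arrives, folding updates over a default message, so the intermediate WireFields can die young in gen0.

```haskell
import Data.ByteString (ByteString)
import Data.List (foldl')

type Tag = Int
data WireField = DelimitedField ByteString | VarintField Integer

data Burst = Burst { frames :: [ByteString] } deriving Show

-- Apply one decoded field directly to the message under construction.
step :: Burst -> (Tag, WireField) -> Burst
step msg (1, DelimitedField bs) = msg { frames = bs : frames msg }
step msg _                      = msg  -- unknown fields are skipped

-- Fold a default message over the stream of decoded fields.
decodeFrom :: [(Tag, WireField)] -> Burst
decodeFrom = foldl' step defaultBurst
  where defaultBurst = Burst { frames = [] }
```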
And what I alluded to a while back... to avoid allocating a HashMap per message.
This should be fixed by #19; many thanks to @joshbohde.
Nice, thanks @joshbohde. I might try to re-run my test here and see how much it's improved:
We've finally reached the point of doing some profiling on our application, and there's one standout: getVarintPrefixedBS is doing the bulk of the allocation (fine) but taking an awfully long time to do it. Not sure if that rings any bells, but I thought I'd ask and see if there's anything obvious you know about there that we should have a look at.
AfC
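Presumably a function with that name reads a base-128 varint length and then that many payload bytes. A hedged sketch against cereal's public Get API (not the package's actual code, just the shape of the work being measured):

```haskell
import Data.Bits ((.&.), (.|.), shiftL, testBit)
import Data.ByteString (ByteString)
import Data.Serialize.Get (Get, getBytes, getWord8)
import Data.Word (Word64)

-- Decode a base-128 varint, least-significant group first.
getVarint :: Get Word64
getVarint = go 0 0
  where
    go shift acc = do
      b <- getWord8
      let acc' = acc .|. (fromIntegral (b .&. 0x7f) `shiftL` shift)
      if testBit b 7          -- high bit set: more bytes follow
        then go (shift + 7) acc'
        else return acc'

-- Read a varint length, then take that many bytes as the payload.
getVarintPrefixedBS :: Get ByteString
getVarintPrefixedBS = do
  len <- getVarint
  getBytes (fromIntegral len)
```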