Decompressing a specific amount of zlib data "eats" following data #20

Closed

davean opened this Issue Apr 9, 2015 · 5 comments

davean commented Apr 9, 2015

Some data formats (e.g. git pack files) store not the length of the compressed data, but the size of the data once uncompressed. One would expect the streaming-commons toolkit to handle this case easily, but it does not seem to.

While it does decompress the correct amount, it does not return the unused input to be read again, as demonstrated by the following example program:

{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Bits
import qualified Codec.Compression.Zlib as Z
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Builder as BSB
import qualified Data.Text.Encoding as TE
import Data.Conduit (($$), (=$))
import qualified Data.Conduit as C
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Zlib as CZ

main :: IO ()
main = do
  let c = TE.encodeUtf8 "This data is stored compressed."
  let u = TE.encodeUtf8 "This data isn't."
  let encoded = writeExample c u
  (c', u') <- CB.sourceLbs encoded $$ readExample
  print (c, u)
  print (c', u')
  putStrLn $ "Input and output matched: " ++ show (c==c' && u==u')

readExample :: C.Sink BS.ByteString IO (BS.ByteString, BS.ByteString)
readExample = do
  sbs <- CB.take 4
  let size = case map fromIntegral . BSL.unpack $ sbs of
        [s0, s1, s2, s3] -> (s3 `shiftL` 24) .|. (s2 `shiftL` 16) .|. (s1 `shiftL` 8) .|. s0
        _ -> error "We really had to read 4 octets there."
  -- We know how large it should decompress to, but not how large it is compressed.
  -- We proceed to decompress until we have decompressed enough data.
  c <- (CZ.decompress CZ.defaultWindowBits) =$ (CB.take size)
  -- Immediately following the compressed stream is more data we need.
  u <- CB.sinkLbs
  return (BSL.toStrict c, BSL.toStrict u)

writeExample :: BS.ByteString -> BS.ByteString -> BSL.ByteString
writeExample cdata udata =
  let c = Z.compress . BSL.fromStrict $ cdata
  in BSB.toLazyByteString . mconcat $
   [ BSB.int32LE . fromIntegral . BS.length $ cdata -- We record the size of the uncompressed data
   , BSB.lazyByteString c -- but store it compressed.
   , BSB.byteString udata -- Then we store other important data with no delimiter.
   ]

Example output:

("This data is stored compressed.","This data isn't.")
("This data is stored compressed.","")
Input and output matched: False

davean commented Apr 9, 2015

To be clear, this seems to be an issue with the streaming-commons API; I merely used conduit to produce a clear demonstration of it.

This problem is demonstrated and documented at http://stefan.saasen.me/articles/git-clone-in-haskell-from-the-bottom-up/#pack_file_objects, where the author explains that http://hackage.haskell.org/package/iteratee-compress does pass this test.

snoyberg added a commit that referenced this issue Apr 9, 2015

getUnusedInflated #20
Return uncompressed data following compressed data

Member

snoyberg commented Apr 9, 2015

I've just added a new function (getUnusedInflate) which I believe solves this problem. Can you test it out? I'm going to add the necessary call to that function in conduit-extra and test your snippet.
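
At the streaming-commons level, recovering the trailing data would look roughly like this. This is a minimal sketch against the PopperRes interface of Data.Streaming.Zlib (older releases expose a slightly different Popper type), and inflateWithUnused is a hypothetical helper, not part of the library:

import Control.Exception (throwIO)
import qualified Data.ByteString as BS
import Data.Streaming.Zlib

-- Inflate a single buffer and recover the trailing bytes that follow
-- the end of the compressed stream.
inflateWithUnused :: BS.ByteString -> IO ([BS.ByteString], BS.ByteString)
inflateWithUnused input = do
    inf <- initInflate defaultWindowBits
    popper <- feedInflate inf input
    -- Drain all the output zlib can produce from the fed input.
    let drain acc = do
            res <- popper
            case res of
                PRDone       -> return (reverse acc)
                PRNext chunk -> drain (chunk : acc)
                PRError e    -> throwIO e
    chunks <- drain []
    final  <- finishInflate inf     -- flush whatever zlib still buffers
    unused <- getUnusedInflate inf  -- input fed past the stream's end
    return (chunks ++ [final], unused)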

snoyberg added a commit to snoyberg/conduit that referenced this issue Apr 9, 2015

Member

snoyberg commented Apr 9, 2015

@davean Please have a look at snoyberg/conduit@1aa59ee, which fixes the problem in conduit-extra. Your original code had problems with leftover handling, so I had to tweak it a bit in the test case.
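
The essence of the fix: once the inflater reaches the end of the zlib stream, ask it for the input it was fed but never consumed, and hand those bytes back to the pipeline as leftovers so the next consumer sees them. A conceptual sketch of that pattern (not the actual patch; returnUnused is a hypothetical helper):

import Control.Monad (unless)
import Control.Monad.IO.Class (liftIO)
import qualified Data.ByteString as BS
import qualified Data.Conduit as C
import Data.Streaming.Zlib (Inflate, getUnusedInflate)

-- After the compressed stream ends, push any unconsumed input back
-- onto the pipeline with 'leftover'.
returnUnused :: Inflate -> C.ConduitM BS.ByteString o IO ()
returnUnused inf = do
    unused <- liftIO $ getUnusedInflate inf
    unless (BS.null unused) $ C.leftover unused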

davean commented Apr 10, 2015

Thanks for being so prompt. Things look good to me. I believe it should work in practice (and in fact it does against real data).

As for the change to the test: yes. I wrote for the intent, but what I wrote doesn't seem to be something conduit can express directly, and I agree your rewrite expresses what conduit should actually implement. If one truly needed what I originally wrote, one could implement a zlib decompressor in conduit that took the desired output size as a parameter and set the buffer sizes to align exactly with the final output size.

It's a mildly interesting hole in what conduit can directly represent, in an academic sense I guess.

To be clear though, this fully closes the issue for me.

davean closed this Apr 10, 2015

Member

snoyberg commented Apr 12, 2015

That form of fusion is supported; it just needs a separate fusion function (e.g. fuseLeftovers). Have a look at slide 29 of https://docs.google.com/presentation/d/1RBefOCZ7AKOo4f1yiF4mtKPAT3l5vY9ky2SR02O4Vvg/edit#slide=id.g3c22e35a9_0144 for more details.
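
For the example above, that shape would look roughly as follows, assuming the fuseLeftovers exported by Data.Conduit in conduit 1.2. Both sides operate on ByteString, so the leftover conversion function is simply id; takeInflated is a hypothetical name:

import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Data.Conduit as C
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Zlib as CZ

-- Fuse the decompressor with a bounded sink while routing the sink's
-- leftovers back upstream instead of discarding them.
takeInflated :: Int -> C.Sink BS.ByteString IO BSL.ByteString
takeInflated size =
    C.fuseLeftovers id (CZ.decompress CZ.defaultWindowBits) (CB.take size)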
