Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deflat Compressor Length #69

Closed
Bit00009 opened this issue Jul 20, 2020 · 7 comments
Closed

Deflat Compressor Length #69

Bit00009 opened this issue Jul 20, 2020 · 7 comments
Labels

Comments

@Bit00009
Copy link

Hi there dear @ebiggers
First thank you for this amazing library, I just came from zlib to your library and I have some difficulties.
I want to do a very simple deflate compression on a buffer and I can't figure it out how to deal with length.

I know there's two way to deal with length in gzip and lz4 , I worked with them before and I was adding original data size as a little header in my compressed data but In this case every single byte is important to me.

In C# there's a internal compressor known as DeflatStream we just pass a byte array and get a byte array back :

public static byte[] Compress(byte[] data)
{
	MemoryStream output = new MemoryStream();
	using (DeflateStream dstream = new DeflateStream(output, CompressionLevel.Optimal))
	{
		dstream.Write(data, 0, data.Length);
	}
	return output.ToArray();
}
public static byte[] Decompress(byte[] data)
{
	MemoryStream input = new MemoryStream(data);
	MemoryStream output = new MemoryStream();
	using (DeflateStream dstream = new DeflateStream(input, CompressionMode.Decompress))
	{
		dstream.CopyTo(output);
	}
	return output.ToArray();
}

Result data doesn't contain any thing related to original size data.
So I tried to do the same using libdeflate and here's my current code :

#include <iostream>
#include <vector>
#include <fstream>

#include "libdeflate.h"
#pragma comment( lib , "libdeflatestatic.lib" )

using namespace std;

#define LOG(x) printf("%s\n",x)
#define LOGVAL(valname,val) printf("Value `%s` = %d\n",valname,val)

int main() 
{
	struct libdeflate_compressor* compressor;
	struct libdeflate_decompressor* decompressor;
	ofstream fout;

	std::ifstream input("inputdata.dta", std::ios::binary);

	std::vector<unsigned char> buffer(std::istreambuf_iterator<char>(input), {});

	const int originalDataSize = buffer.size();

	// Compress
	compressor = libdeflate_alloc_compressor(9);
	std::vector<unsigned char> outdata;
	outdata.resize(originalDataSize);

	LOG("Compressing...");
	int cmpsize = libdeflate_deflate_compress(compressor, buffer.data(), originalDataSize, outdata.data(), originalDataSize);

	LOGVAL("cmpsize",cmpsize);

	fout.open("compressed.bin", ios::binary | ios::out);
	fout.write((char*)outdata.data(), cmpsize);
	fout.close();

	libdeflate_free_compressor(compressor);

	// Decompress
	size_t dec_size;
	std::vector<unsigned char> dec_outdata;
	dec_outdata.resize(originalDataSize);

	decompressor = libdeflate_alloc_decompressor();
	libdeflate_deflate_decompress(decompressor, outdata.data(), cmpsize, dec_outdata.data(), originalDataSize, &dec_size);

	fout.open("decompressed.bin", ios::binary | ios::out);
	fout.write((char*)dec_outdata.data(), dec_size);
	fout.close();

	libdeflate_free_decompressor(decompressor);

	LOG("App finished.");
}

As you can see I always need originalDataSize to make compression/decompression work well, I don't know if I'm doing something wrong or not but I need to do it like how C# DeflateStream works, I need to pass a new vector or byte array without knowing original size and do the decompression.

I want to have a pure deflate result in compression with out adding any metadata or extra information I will be very greatful if you tell me how is this possible in c++ within your library.

Regards,
Ultran

@ebiggers
Copy link
Owner

You'd need to guess some uncompressed size, allocate a buffer of that size, then try decompressing into it. If it fails with LIBDEFLATE_INSUFFICIENT_SPACE, enlarge the buffer and try again. And so on.

It's not really a good solution. It's much better to just store the uncompressed size along with the compressed data.

It only takes a few bytes to store the uncompressed size -- or less if you store an approximate size only.

Note that your code snippet only uses compression level 9, while libdeflate goes up to level 12. You probably could save much more than a few bytes by using a higher compression level.

@Bit00009
Copy link
Author

Bit00009 commented Jul 21, 2020

Hello @ebiggers thanks for your fast reply,
Yes I'm currently use guessing but it's extremely slow when my data becomes more than 5 MB
I'm really ceriouse how C# DeflateStream manage to handle same situation and It's fast enough , the source is available here.

I noticed there's a buffer and windows size value in .net DeflateStream :

        internal const int DefaultBufferSize = 8192;
        private const int WindowSizeUpperBound = 47;

What do you think about it? How it's handling unknown size?

It's much better to just store the uncompressed size along with the compressed data.

I want to prevent data splitting as much as I can but if it's the last thing I can do, I'll do but I'm dying from curiosity how .net managed deflate compression/decompression.

Also i tried a very small test data (24 byte) DeflateStream compressed file but libdeflate returns 0 byte and compressed file is empty. I attached files.
InputData.txt

This is deflatestream output
compressed-deflatestream.gz

Regards,
Ultran

@ebiggers
Copy link
Owner

What do you think about it? How it's handling unknown size?

Your C# code snippet streams the data to a MemoryStream, which implements an automatically-resizing array. So it could end up copying the data many times as it incrementally reallocates the array, and end up with a buffer up to 2x larger than is required.

You can improve the performance of both the C# version and the libdeflate version, and also make it much easier to use libdeflate, by storing the uncompressed size along with the compressed data.

Also i tried a very small test data (24 byte) DeflateStream compressed file but libdeflate returns 0 byte and compressed file is empty.

Check libdeflate.h for how to use the API.

@dmitry-azaraev
Copy link

@Bit00009 internally any deflate decoder rely on BFINAL bit in stream so them generally has idea when they should stop. Also there is exist some kind of natural chunking in deflate, so it may adapted to work with input/output buffers which may essentially be represented under Stream-like facade.

Technically, at least for decompressing, libdeflate can be extended to support to work with chunks (i guess this needs injection points at decompress_template.h about next_block / block_done), but... while it is might looks as easy, it is not. (And i has no idea about compression at all.)

"Stream"-like facade - is not so simple in fact, as it may look. Just imagine what you have file with 80,000 small compressed blocks (say 500-1.5Kb) which are written in stream one-by-one. Each instance of DeflateStream will buffer input (from file stream) and will decode (decompress) stream. As result after first decoded block (let's say it was 500 bytes) it actually consumes 8Kb (internal buffer), and original file stream essentially no more point to next uncompressed block (did you know this?), like it should be. DeflateStream even doesn't provide information how many bytes it really consumed. (In my case I has compressed data size, so i have options - create another stream which will limit how many bytes DeflateStream may read (just return EOF after N bytes). Or just use codecs like libdeflate which work with raw buffers, as result skipping of this complex machinery saves from lot of allocations, resulting in at least twice better throughput.) Buffered streams which provide access to internal buffer for consumer (consumer is something like DeflateStream) - will have more flexibility, but in .NET is not a option and generally it is not a popular option at all.

I'm generally trying to show what blind reproducing of generic Stream-like interface - is a way to making library with stupid and hard to use interface. Which one interface will be easy to use and satisfy all requirements? I don't know.

@jibsen
Copy link

jibsen commented Jul 25, 2020

I guess there could be a function that simulates the decompression without actually writing any output. It might be useful for checking if a given block of data contains a valid deflate stream, and could report the number of input bytes consumed and output bytes produced, which would be useful here. This might be outside the scope of libdeflate though.

@ebiggers
Copy link
Owner

I guess there could be a function that simulates the decompression without actually writing any output. It might be useful for checking if a given block of data contains a valid deflate stream, and could report the number of input bytes consumed and output bytes produced, which would be useful here. This might be outside the scope of libdeflate though.

Sure, but that would encourage people to do the wrong thing (decompress the data twice) instead of the right thing (store the uncompressed size along with the compressed data). And I'd like to keep the API simple. So I think it's better to not add it, and instead just encourage people to do the right thing.

@wegylexy
Copy link

wegylexy commented Nov 6, 2021

When the deflated stream is like several MB, partial read results in LIBDEFLATE_INSUFFICIENT_SPACE or DestinationTooSmall. It is unreasonable to require 2 continuous blocks of memory that big.
For example, a 4MB gitpack may be inflated to 13MB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants