
blosc2_decompress_ctx can overrun src #40

Closed
asomers opened this issue Apr 20, 2018 · 8 comments


asomers commented Apr 20, 2018

blosc2_decompress_ctx does not take the size of the source buffer as an argument. Instead, it assumes that the buffer is complete and correctly formed. That can result in array overruns for buffers that are truncated, maliciously crafted, or otherwise corrupt. It should take the source buffer length as an argument and never read beyond it.
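
To make the hazard concrete, here is a minimal sketch of the failure mode, assuming the signature as it stands in this issue (no srcsize parameter); the wrapper name is illustrative:

#include <blosc2.h>

/* Sketch only: blosc2_decompress_ctx() learns the compressed size
 * (cbytes) from the header stored inside src itself.  If that header
 * claims more bytes than the caller actually holds -- say, a chunk
 * truncated in transit -- the decoder walks past the end of the
 * allocation, and nothing in this call can stop it. */
int unpack_untrusted(blosc2_context *ctx, const void *src,
                     void *dest, size_t destsize)
{
    /* There is no way to tell the library how large src really is. */
    return blosc2_decompress_ctx(ctx, src, dest, destsize);
}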

FrancescAlted (Member) commented:

Yes, for Blosc1 compressed chunks, I think this makes total sense. The proposal would be to add a new function for safe decompression, something like:

int blosc2_decompress_safe_ctx(blosc2_context* context, const void* src, size_t srcsize, void* dest, size_t destsize);
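
A rough usage sketch, assuming exactly the signature above (the return convention mirrors the existing functions and is otherwise an assumption):

#include <blosc2.h>

/* Hypothetical caller: srcsize lets the library refuse to read past
 * src + srcsize, while destsize already bounds the writes. */
int unpack_checked(blosc2_context *ctx,
                   const void *src, size_t srcsize,
                   void *dest, size_t destsize)
{
    /* Assumed: returns the decompressed byte count, negative on error. */
    return blosc2_decompress_safe_ctx(ctx, src, srcsize, dest, destsize);
}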

Also, I was thinking of optionally adding a (64-bit) hash at the end of frames (a new container recently introduced in Blosc2). This could bring much higher safety for them. If a hash exists, functions like blosc2_frame_decompress_chunk() could double-check that the contents of the frame are OK and safely proceed if that is the case. What do you think?


asomers commented Aug 10, 2018

If blosc2 is already making backwards-incompatible changes to the API, why bother introducing a new decompression function? It would be simpler to make the normal decompression function safe by default. An alternate version would, however, be useful for blosc1, which doesn't want to break backwards compatibility.

I think that a builtin checksum might be useful for some applications, but it doesn't help with the safety situation. A malicious chunk, after all, could still carry a valid checksum: whoever tampers with the data can simply recompute the hash stored alongside it. Checksums, even cryptographic ones, only protect against accidental damage.

Regarding bundled checksums, there is one benefit that really intrigues me: cache efficiency. Using a separate checksum layer atop the compression layer requires an application to traverse a block of data twice, and later parts of the first pass can evict useful data from the CPU cache. A combined hash-and-compress algorithm could do the whole thing in a single pass. I don't know how much of a performance impact that would have, but it sounds like it might be significant.


FrancescAlted commented Aug 13, 2019

I have put some serious thought into this, and I think that in scenarios where safety is important, the user can still call blosc_cbuffer_sizes() first to check whether cbytes (i.e. the number of bytes that will be read from the src buffer) is acceptable. I have added a note about this in commit 2c9d598.
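
In code, the suggested pre-flight check might look like this (a sketch against the blosc1-style API; the helper name and error convention are illustrative):

#include <blosc.h>

/* Read the sizes the header claims, then refuse to decompress unless
 * they agree with the buffer we actually received. */
int check_then_unpack(const void *src, size_t srclen,
                      void *dest, size_t destlen)
{
    size_t nbytes, cbytes, blocksize;
    if (srclen < BLOSC_MIN_HEADER_LENGTH)
        return -1;  /* too short to even hold a header */
    blosc_cbuffer_sizes(src, &nbytes, &cbytes, &blocksize);
    if (cbytes != srclen || nbytes > destlen)
        return -1;  /* header disagrees with what we hold */
    return blosc_decompress(src, dest, destlen);
}

The comparison of cbytes against the caller's own srclen is the step that matters here.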

One might argue that the destsize argument could be removed for the same reason, but IMO limiting the destination size is generally more useful, so I think it is fine to keep it.

FrancescAlted (Member) commented:

For reference, I have just filed a new ticket for implementing checksums.


asomers commented Aug 13, 2019

However, blosc_cbuffer_sizes works by interrogating the provided buffer itself, so it still doesn't protect against a corrupted buffer. Blosc's compression and decompression routines should be safe even if provided with a corrupted or garbage buffer; even a maliciously crafted buffer shouldn't be able to cause an array overrun.

FrancescAlted (Member) commented:

Uh, I don't completely understand the difference between passing the size of the source buffer and making a decision based on the blosc_cbuffer_sizes() feedback. Normally you have an educated guess for the size of the source buffer, and this can be compared with the cbytes returned by blosc_cbuffer_sizes(); if there is too large a difference, you can distrust the buffer, right?

On the other hand, one cannot limit reads to the size of the source buffer as you suggest, because blosc compressed buffers are not decompressed strictly sequentially: the decoder needs the index of the different blocks inside the buffer, which is stored at the end of the buffer, so we are forced to read at least cbytes bytes.

For the record, the reason the compressed buffers need the index at the end is that parallel compression creates blocks of different sizes, and the decoder needs this information in order to hand the starting point of every block to each decompressing thread.


asomers commented Aug 13, 2019

What I have in mind is an application that receives a blosc buffer from an untrusted source. The pseudocode would look like this:

let buffer = read_from_(net|disk|other);
let destlen = length_that_I_expect_based_on_metadata_stored_somewhere_else();
let dest = make_buffer(destlen);
let err = blosc2_decompress_safe_ctx(ctx, buffer.base_address, buffer.size, dest.base_address, destlen);

Even if the received buffer is malicious, blosc2_decompress_safe_ctx won't overrun the source buffer, because it knows the source length. You seem to suggest that I should do the following instead:

let buffer = read_from_(net|disk|other);
let destlen = length_that_I_expect_based_on_metadata_stored_somewhere_else();
let dest = make_buffer(destlen);
let (nbytes, cbytes, blocksize) = blosc_cbuffer_sizes(buffer.base_address);
if (cbytes != buffer.len)
    return EINVAL;
if (nbytes != destlen)
    return EINVAL;
let err = blosc2_decompress_ctx(ctx, buffer.base_address, dest.base_address, destlen);

Not only does that strategy require additional error-prone code in every blosc consumer, but the output of blosc_cbuffer_sizes also cannot be trusted when the buffer comes from an untrusted source.

Why can't blosc2_decompress_safe_ctx enforce the length of the source buffer? It shouldn't be a problem that the buffer is read non-sequentially; all that is required is that it never reads beyond buffer.base_address + buffer.size.
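
For illustration, such a bound could be enforced with a small helper inside the decoder; all names here are hypothetical:

#include <stdint.h>
#include <stddef.h>

/* Return a pointer to the range [offset, offset + len) within src, or
 * NULL if that range would fall outside the srcsize bytes the caller
 * owns.  Written to avoid overflow in offset + len. */
static const uint8_t *checked_region(const uint8_t *src, size_t srcsize,
                                     size_t offset, size_t len)
{
    if (offset > srcsize || len > srcsize - offset)
        return NULL;
    return src + offset;
}

Random access into the block index at the end of the buffer remains possible; every lookup simply passes through a check like this before the pointer is dereferenced.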

FrancescAlted (Member) commented:

Yes, I am suggesting exactly that. I agree that you cannot trust the output of blosc_cbuffer_sizes(), but you can trust your buffer.len, and if the two do not match, then you should not proceed with decompressing the buffer. I concede that this is a bit more work for users (not that much, just one more call), but it should be a fair price to pay for people who are worried about the trustworthiness of the source.
