-
-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
content_encoding: restart zlib in raw mode only if still possible. #2068
Conversation
If the initial data is not available anymore or if output has already started, restarting in raw mode is not possible. This commit also rephrase the inflate_stream() procedure and checks for deflate/gzip unparsable trailing data bytes. New test 232 checks the deflate algorithm in raw mode.
lib/content_encoding.c
Outdated
break; | ||
default: | ||
result = exit_zlib(conn, z, &zp->zlib_init, CURLE_WRITE_ERROR); | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be easier to understand:
if(zp->zlib_init != ZLIB_INIT &&
zp->zlib_init != ZLIB_INFLATING &&
zp->zlib_init != ZLIB_INIT_GZIP &&
zp->zlib_init != ZLIB_GZIP_INFLATING) {
result = exit_zlib(conn, z, &zp->zlib_init, CURLE_WRITE_ERROR);
break;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks for the review.
Maybe you're right. At least we have the same effect.
lib/content_encoding.c
Outdated
result = exit_zlib(conn, z, &zp->zlib_init, result); | ||
} | ||
z->avail_in = 0; | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check necessary and did you account in these changes for concatenated gzip streams, I've read that is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is not "necessary": it just allows detecting trailing bytes (after the stream), while the previous code silently ignored the trailing bytes in buffer and restartied a possibly desynced stream at the next call if some. See
Line 128 in 462f3ca
/* Done? clean up, return */ |
Returning an error for unparsable trailing bytes is probably cleaner than silently discarding them.
did you account in these changes for concatenated gzip streams, I've read that is possible.
No, and I think this was never supported in curl: see the above comment. In any case, use of zlib < 1.2.0.4 cannot allow it if we don't process the trailer explicitly.
The link to Mark Adler's comment is about gzip files and this is primarily a mean to support gzip archives with several files. Strictly speaking, you're right about concatenation (see RFC 2616, 3.5). In practice, I don't know if it is also intended for HTTP streams, if some server can generate it or if some client supports it and how (cannot find something about it on internet). Functionally speaking, concatenation only makes sense if curl does not inflate (thus outputting a real gzip file).
Supporting concatenation could be a new feature. TBD since it implies a loss of structure data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning an error for unparsable trailing bytes is probably cleaner than silently discarding them.
Probably but there's a lot of wackiness out there.
lib/content_encoding.c
Outdated
if(status == Z_OK || status == Z_STREAM_END) { | ||
allow_restart = 0; | ||
/* Flush output data if some. */ | ||
if(z->avail_out != DSIZ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find in the doc that it's safe to handle this without checking that status == Z_OK || status == Z_STREAM_END
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, I can't find that it is not safe. inflate() always maintains the in/out pointers/counters properly. If some data is inflated before an error is detected in the same call, it returns the error status but there are some data to flush in the output buffer. curl flushes as much as possible before returning the error. See https://github.com/madler/zlib//blob/master/inflate.c#L1248
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know I'm taking it upstream I'll cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've received the question to upstream, thanks for cc. Did you have an upstream answer about it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've received the question to upstream, thanks for cc. Did you have an upstream answer about it ?
No nobody replied
/cc @madler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I wrote a test program for it. See below.
This shows that, with inflate(Z_SYNC_FLUSH), although the pointers and counters are always kept in sync, there might be some trailing garbage generated in the output buffer before the error is detected. So i'll reintroduce the status test before writing.
Please note that this alone will not avoid garbage output in all cases: if garbage data is split across 2 output buffers, the error is not detected before the 2nd call to inflate(). The solution is to use Z_BLOCK
instead of Z_SYNC_FLUSH
: this will process one deflate block at a time and will reduce the "chances" to have output garbage data mixed with good data, maybe at the expense of a little more overhead. End of processing will be signalled by status Z_BUF_ERROR
.
By playing with the size of decbuf (i.e.: setting it to 1024), you'll see a single inflate(Z_SYNC_FLUSH) call does generate correct data followed by garbage before detecting the error.
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char testdata[] = "Zlib compressed data";
int
main(int argc, char * * argv)
{
z_stream z;
size_t enclen;
char encbuf[1024];
char decbuf[24];
int zlibrc;
/* Build compressed corrupt data. */
memset((char *) &z, 0, sizeof z);
deflateInit2(&z, 9, Z_DEFLATED, -MAX_WBITS, 8, Z_DEFAULT_STRATEGY);
z.next_out = (Bytef *) encbuf;
z.avail_out = sizeof encbuf;
z.next_in = (Bytef *) testdata;
z.avail_in = strlen(testdata);
deflate(&z, Z_SYNC_FLUSH);
strncpy((char *) z.next_out, "zlib corrupt data", z.avail_out);
enclen = sizeof encbuf - z.avail_out + strlen((char *) z.next_out);
deflateEnd(&z);
/* Reinflate. */
inflateInit2(&z, -MAX_WBITS);
z.next_in = (Bytef *) encbuf;
z.avail_in = enclen;
do {
z.next_out = (Bytef *) decbuf;
z.avail_out = sizeof decbuf;
zlibrc = inflate(&z, Z_SYNC_FLUSH);
printf("consumed: %u bytes, inflated: %u bytes, " \
"inflate return code = %d (%s)\n",
enclen - z.avail_in, sizeof decbuf - z.avail_out,
zlibrc, z.msg);
} while (zlibrc == Z_OK);
exit(0);
}
I'll push these changes as soon as my local "make check" terminates successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not getting what the question is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not getting what the question is.
We were trying to figure out whether it was safe to use the data written by inflate if we just checked avail_out to see if it had changed regardless of checking inflate return value. In other words is what is written in next_out always valid. Let's say next_out is buf[100] and avail_out before is 100 but after is 30. Does that mean the 70 bytes written by inflate is always valid?
@monnerat has since wrote a test in the previous post that appears to show inflate z_ok even with corrupt input.
if garbage data is split across 2 output buffers, the error is not detected before the 2nd call to inflate()
I added the libcurl dump function to the test to visualize it:
--- ztest.c.orig 2017-11-25 14:49:54.732905692 -0500
+++ ztest.c 2017-11-25 14:54:04.076899105 -0500
@@ -3,6 +3,37 @@
#include <stdlib.h>
#include <string.h>
+void dump(const char *text, FILE *stream,
+ const unsigned char *ptr, unsigned long long size)
+{
+ unsigned long long i;
+ unsigned int c, width = 0x10;
+
+ fprintf(stream, "%s length is %llu bytes (0x%llx)\n",
+ text, size, size);
+
+ for(i = 0; i < size; i += width) {
+
+ fprintf(stream, "%8.8llx: ", i);
+
+ /* show hex to the left */
+ for(c = 0; c < width; c++) {
+ if(i+c < size)
+ fprintf(stream, "%02x ", ptr[i+c]);
+ else
+ fputs(" ", stream);
+ }
+
+ /* show data on the right */
+ for(c = 0; (c < width) && (i+c < size); c++) {
+ char x = (ptr[i+c] >= 0x20 && ptr[i+c] < 0x80) ? ptr[i+c] : '.';
+ fputc(x, stream);
+ }
+
+ fputc('\n', stream); /* newline */
+ }
+}
+
char testdata[] = "Zlib compressed data";
int
@@ -34,10 +65,11 @@
z.next_out = (Bytef *) decbuf;
z.avail_out = sizeof decbuf;
zlibrc = inflate(&z, Z_SYNC_FLUSH);
- printf("consumed: %u bytes, inflated: %u bytes, " \
+ printf("consumed: %zu bytes, inflated: %zu bytes, " \
"inflate return code = %d (%s)\n",
enclen - z.avail_in, sizeof decbuf - z.avail_out,
zlibrc, z.msg);
+ dump("decbuf", stdout, (unsigned char *)decbuf, sizeof decbuf - z.avail_out);
} while (zlibrc == Z_OK);
exit(0);
}
owner@ubuntu1604-x64-vm:~$ ./ztest
consumed: 32 bytes, inflated: 24 bytes, inflate return code = 0 ((null))
decbuf length is 24 bytes (0x18)
00000000: 5a 6c 69 62 20 63 6f 6d 70 72 65 73 73 65 64 20 Zlib compressed
00000010: 64 61 74 61 e3 39 34 30 data.940
consumed: 42 bytes, inflated: 8 bytes, inflate return code = -3 (invalid distance too far back)
decbuf length is 8 bytes (0x8)
00000000: 1c 3f 34 c9 ab 53 5b 51 .?4..S[Q
.940 and after is the garbage. it seems to me if we're dealing with bad input we may not be able to detect all bad output on inflate?
There's a failing Travis job, but i'm pretty sure it is not related. |
The zlib documentation states that when the output buffer has been fully filled by inflate(), there still may be some output data bytes latched in state: even if all input data have been consumed, inflate() must be called again to obtain these bytes.
This reduces the "chances" to have corrupt data silently appended to properly inflated data. Also flush data only if status is Z_OK or Z_STREAM_END.
@jay : Review stalled ?
Of course, there will still be cases where corrupt data remains undetected, i.e.: if non-zlib-control bytes are altered. The overall changes in this PR improve zlib inflating, both for restarting in raw mode and at the data flushing/error detection level. |
@monnerat That all sounds good. I have not done a final review yet but I will. Also I have a question about the different flags, from the zlib manual:
If we don't flush the data then could we end up in a situation where some huge block can't be decompressed (inflated)? |
I made another test program, creating 100000 |
} | ||
} | ||
/* UNREACHED */ | ||
free(decomp); | ||
if(nread && zp->zlib_init == ZLIB_INIT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're about to leave this call thus the nread-bytes won't be seen again. If we are in a state that would wrongly allow restart in raw mode at the next call, change state.
if(inflateInit2(z, -MAX_WBITS) != Z_OK) { | ||
return process_zlib_error(conn, z); | ||
} | ||
zp->trailerlen = 8; /* Trailer length. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where did you get this magic number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http://www.zlib.org/rfc-gzip.html#file-format
A trailer is made of a CRC32 and a 32-bit input size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I had actually looked at that before I asked and couldn't figure it out. I realize now the (more -->)
is describing the structure and so trailer is after ...compressed blocks...
.
if(inflateInit2(z, -MAX_WBITS) != Z_OK) { | ||
return process_zlib_error(conn, z); | ||
} | ||
zp->trailerlen = 8; /* Trailer length. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I had actually looked at that before I asked and couldn't figure it out. I realize now the (more -->)
is describing the structure and so trailer is after ...compressed blocks...
.
re commits I think some of them could be squashed but i'll leave it to you. also please add |
@jay : thanks for review. |
If the initial data is not available anymore or if output has
already started, restarting in raw mode is not possible.
This commit also rephrase the inflate_stream() procedure and checks
for deflate/gzip unparsable trailing data bytes.
New test 232 checks the deflate algorithm in raw mode.