content_encoding: restart zlib in raw mode only if still possible. #2068
Conversation
If the initial data is not available anymore or if output has already started, restarting in raw mode is not possible. This commit also rephrase the inflate_stream() procedure and checks for deflate/gzip unparsable trailing data bytes. New test 232 checks the deflate algorithm in raw mode.
break; | ||
default: | ||
result = exit_zlib(conn, z, &zp->zlib_init, CURLE_WRITE_ERROR); | ||
continue; |
jay
Nov 11, 2017
Member
I think this would be easier to understand:
if(zp->zlib_init != ZLIB_INIT &&
zp->zlib_init != ZLIB_INFLATING &&
zp->zlib_init != ZLIB_INIT_GZIP &&
zp->zlib_init != ZLIB_GZIP_INFLATING) {
result = exit_zlib(conn, z, &zp->zlib_init, CURLE_WRITE_ERROR);
break;
}
I think this would be easier to understand:
if(zp->zlib_init != ZLIB_INIT &&
zp->zlib_init != ZLIB_INFLATING &&
zp->zlib_init != ZLIB_INIT_GZIP &&
zp->zlib_init != ZLIB_GZIP_INFLATING) {
result = exit_zlib(conn, z, &zp->zlib_init, CURLE_WRITE_ERROR);
break;
}
monnerat
Nov 12, 2017
Author
Contributor
Many thanks for the review.
Maybe you're right. At least we have the same effect.
Many thanks for the review.
Maybe you're right. At least we have the same effect.
result = exit_zlib(conn, z, &zp->zlib_init, result); | ||
} | ||
z->avail_in = 0; | ||
break; |
jay
Nov 11, 2017
Member
Is this check necessary and did you account in these changes for concatenated gzip streams, I've read that is possible.
Is this check necessary and did you account in these changes for concatenated gzip streams, I've read that is possible.
monnerat
Nov 12, 2017
Author
Contributor
This check is not "necessary": it just allows detecting trailing bytes (after the stream), while the previous code silently ignored the trailing bytes in buffer and restartied a possibly desynced stream at the next call if some. See
Line 128
in
462f3ca
Returning an error for unparsable trailing bytes is probably cleaner than silently discarding them.
did you account in these changes for concatenated gzip streams, I've read that is possible.
No, and I think this was never supported in curl: see the above comment. In any case, use of zlib < 1.2.0.4 cannot allow it if we don't process the trailer explicitly.
The link to Mark Adler's comment is about gzip files and this is primarily a mean to support gzip archives with several files. Strictly speaking, you're right about concatenation (see RFC 2616, 3.5). In practice, I don't know if it is also intended for HTTP streams, if some server can generate it or if some client supports it and how (cannot find something about it on internet). Functionally speaking, concatenation only makes sense if curl does not inflate (thus outputting a real gzip file).
Supporting concatenation could be a new feature. TBD since it implies a loss of structure data.
This check is not "necessary": it just allows detecting trailing bytes (after the stream), while the previous code silently ignored the trailing bytes in buffer and restartied a possibly desynced stream at the next call if some. See
Line 128 in 462f3ca
Returning an error for unparsable trailing bytes is probably cleaner than silently discarding them.
did you account in these changes for concatenated gzip streams, I've read that is possible.
No, and I think this was never supported in curl: see the above comment. In any case, use of zlib < 1.2.0.4 cannot allow it if we don't process the trailer explicitly.
The link to Mark Adler's comment is about gzip files and this is primarily a mean to support gzip archives with several files. Strictly speaking, you're right about concatenation (see RFC 2616, 3.5). In practice, I don't know if it is also intended for HTTP streams, if some server can generate it or if some client supports it and how (cannot find something about it on internet). Functionally speaking, concatenation only makes sense if curl does not inflate (thus outputting a real gzip file).
Supporting concatenation could be a new feature. TBD since it implies a loss of structure data.
jay
Nov 12, 2017
Member
Returning an error for unparsable trailing bytes is probably cleaner than silently discarding them.
Probably but there's a lot of wackiness out there.
Returning an error for unparsable trailing bytes is probably cleaner than silently discarding them.
Probably but there's a lot of wackiness out there.
if(status == Z_OK || status == Z_STREAM_END) { | ||
allow_restart = 0; | ||
/* Flush output data if some. */ | ||
if(z->avail_out != DSIZ) { |
jay
Nov 12, 2017
Member
I can't find in the doc that it's safe to handle this without checking that status == Z_OK || status == Z_STREAM_END
I can't find in the doc that it's safe to handle this without checking that status == Z_OK || status == Z_STREAM_END
monnerat
Nov 12, 2017
Author
Contributor
Alternatively, I can't find that it is not safe. inflate() always maintains the in/out pointers/counters properly. If some data is inflated before an error is detected in the same call, it returns the error status but there are some data to flush in the output buffer. curl flushes as much as possible before returning the error. See https://github.com/madler/zlib//blob/master/inflate.c#L1248
Alternatively, I can't find that it is not safe. inflate() always maintains the in/out pointers/counters properly. If some data is inflated before an error is detected in the same call, it returns the error status but there are some data to flush in the output buffer. curl flushes as much as possible before returning the error. See https://github.com/madler/zlib//blob/master/inflate.c#L1248
jay
Nov 12, 2017
Member
I don't know I'm taking it upstream I'll cc
I don't know I'm taking it upstream I'll cc
monnerat
Nov 22, 2017
Author
Contributor
I've received the question to upstream, thanks for cc. Did you have an upstream answer about it ?
I've received the question to upstream, thanks for cc. Did you have an upstream answer about it ?
jay
Nov 22, 2017
Member
I've received the question to upstream, thanks for cc. Did you have an upstream answer about it ?
No nobody replied
/cc @madler
I've received the question to upstream, thanks for cc. Did you have an upstream answer about it ?
No nobody replied
/cc @madler
monnerat
Nov 23, 2017
Author
Contributor
Ok, I wrote a test program for it. See below.
This shows that, with inflate(Z_SYNC_FLUSH), although the pointers and counters are always kept in sync, there might be some trailing garbage generated in the output buffer before the error is detected. So i'll reintroduce the status test before writing.
Please note that this alone will not avoid garbage output in all cases: if garbage data is split across 2 output buffers, the error is not detected before the 2nd call to inflate(). The solution is to use Z_BLOCK
instead of Z_SYNC_FLUSH
: this will process one deflate block at a time and will reduce the "chances" to have output garbage data mixed with good data, maybe at the expense of a little more overhead. End of processing will be signalled by status Z_BUF_ERROR
.
By playing with the size of decbuf (i.e.: setting it to 1024), you'll see a single inflate(Z_SYNC_FLUSH) call does generate correct data followed by garbage before detecting the error.
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char testdata[] = "Zlib compressed data";
int
main(int argc, char * * argv)
{
z_stream z;
size_t enclen;
char encbuf[1024];
char decbuf[24];
int zlibrc;
/* Build compressed corrupt data. */
memset((char *) &z, 0, sizeof z);
deflateInit2(&z, 9, Z_DEFLATED, -MAX_WBITS, 8, Z_DEFAULT_STRATEGY);
z.next_out = (Bytef *) encbuf;
z.avail_out = sizeof encbuf;
z.next_in = (Bytef *) testdata;
z.avail_in = strlen(testdata);
deflate(&z, Z_SYNC_FLUSH);
strncpy((char *) z.next_out, "zlib corrupt data", z.avail_out);
enclen = sizeof encbuf - z.avail_out + strlen((char *) z.next_out);
deflateEnd(&z);
/* Reinflate. */
inflateInit2(&z, -MAX_WBITS);
z.next_in = (Bytef *) encbuf;
z.avail_in = enclen;
do {
z.next_out = (Bytef *) decbuf;
z.avail_out = sizeof decbuf;
zlibrc = inflate(&z, Z_SYNC_FLUSH);
printf("consumed: %u bytes, inflated: %u bytes, " \
"inflate return code = %d (%s)\n",
enclen - z.avail_in, sizeof decbuf - z.avail_out,
zlibrc, z.msg);
} while (zlibrc == Z_OK);
exit(0);
}
I'll push these changes as soon as my local "make check" terminates successfully.
Ok, I wrote a test program for it. See below.
This shows that, with inflate(Z_SYNC_FLUSH), although the pointers and counters are always kept in sync, there might be some trailing garbage generated in the output buffer before the error is detected. So i'll reintroduce the status test before writing.
Please note that this alone will not avoid garbage output in all cases: if garbage data is split across 2 output buffers, the error is not detected before the 2nd call to inflate(). The solution is to use Z_BLOCK
instead of Z_SYNC_FLUSH
: this will process one deflate block at a time and will reduce the "chances" to have output garbage data mixed with good data, maybe at the expense of a little more overhead. End of processing will be signalled by status Z_BUF_ERROR
.
By playing with the size of decbuf (i.e.: setting it to 1024), you'll see a single inflate(Z_SYNC_FLUSH) call does generate correct data followed by garbage before detecting the error.
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char testdata[] = "Zlib compressed data";
int
main(int argc, char * * argv)
{
z_stream z;
size_t enclen;
char encbuf[1024];
char decbuf[24];
int zlibrc;
/* Build compressed corrupt data. */
memset((char *) &z, 0, sizeof z);
deflateInit2(&z, 9, Z_DEFLATED, -MAX_WBITS, 8, Z_DEFAULT_STRATEGY);
z.next_out = (Bytef *) encbuf;
z.avail_out = sizeof encbuf;
z.next_in = (Bytef *) testdata;
z.avail_in = strlen(testdata);
deflate(&z, Z_SYNC_FLUSH);
strncpy((char *) z.next_out, "zlib corrupt data", z.avail_out);
enclen = sizeof encbuf - z.avail_out + strlen((char *) z.next_out);
deflateEnd(&z);
/* Reinflate. */
inflateInit2(&z, -MAX_WBITS);
z.next_in = (Bytef *) encbuf;
z.avail_in = enclen;
do {
z.next_out = (Bytef *) decbuf;
z.avail_out = sizeof decbuf;
zlibrc = inflate(&z, Z_SYNC_FLUSH);
printf("consumed: %u bytes, inflated: %u bytes, " \
"inflate return code = %d (%s)\n",
enclen - z.avail_in, sizeof decbuf - z.avail_out,
zlibrc, z.msg);
} while (zlibrc == Z_OK);
exit(0);
}
I'll push these changes as soon as my local "make check" terminates successfully.
madler
Nov 23, 2017
Sorry, I'm not getting what the question is.
Sorry, I'm not getting what the question is.
jay
Nov 25, 2017
Member
Sorry, I'm not getting what the question is.
We were trying to figure out whether it was safe to use the data written by inflate if we just checked avail_out to see if it had changed regardless of checking inflate return value. In other words is what is written in next_out always valid. Let's say next_out is buf[100] and avail_out before is 100 but after is 30. Does that mean the 70 bytes written by inflate is always valid?
@monnerat has since wrote a test in the previous post that appears to show inflate z_ok even with corrupt input.
if garbage data is split across 2 output buffers, the error is not detected before the 2nd call to inflate()
I added the libcurl dump function to the test to visualize it:
--- ztest.c.orig 2017-11-25 14:49:54.732905692 -0500
+++ ztest.c 2017-11-25 14:54:04.076899105 -0500
@@ -3,6 +3,37 @@
#include <stdlib.h>
#include <string.h>
+void dump(const char *text, FILE *stream,
+ const unsigned char *ptr, unsigned long long size)
+{
+ unsigned long long i;
+ unsigned int c, width = 0x10;
+
+ fprintf(stream, "%s length is %llu bytes (0x%llx)\n",
+ text, size, size);
+
+ for(i = 0; i < size; i += width) {
+
+ fprintf(stream, "%8.8llx: ", i);
+
+ /* show hex to the left */
+ for(c = 0; c < width; c++) {
+ if(i+c < size)
+ fprintf(stream, "%02x ", ptr[i+c]);
+ else
+ fputs(" ", stream);
+ }
+
+ /* show data on the right */
+ for(c = 0; (c < width) && (i+c < size); c++) {
+ char x = (ptr[i+c] >= 0x20 && ptr[i+c] < 0x80) ? ptr[i+c] : '.';
+ fputc(x, stream);
+ }
+
+ fputc('\n', stream); /* newline */
+ }
+}
+
char testdata[] = "Zlib compressed data";
int
@@ -34,10 +65,11 @@
z.next_out = (Bytef *) decbuf;
z.avail_out = sizeof decbuf;
zlibrc = inflate(&z, Z_SYNC_FLUSH);
- printf("consumed: %u bytes, inflated: %u bytes, " \
+ printf("consumed: %zu bytes, inflated: %zu bytes, " \
"inflate return code = %d (%s)\n",
enclen - z.avail_in, sizeof decbuf - z.avail_out,
zlibrc, z.msg);
+ dump("decbuf", stdout, (unsigned char *)decbuf, sizeof decbuf - z.avail_out);
} while (zlibrc == Z_OK);
exit(0);
}
owner@ubuntu1604-x64-vm:~$ ./ztest
consumed: 32 bytes, inflated: 24 bytes, inflate return code = 0 ((null))
decbuf length is 24 bytes (0x18)
00000000: 5a 6c 69 62 20 63 6f 6d 70 72 65 73 73 65 64 20 Zlib compressed
00000010: 64 61 74 61 e3 39 34 30 data.940
consumed: 42 bytes, inflated: 8 bytes, inflate return code = -3 (invalid distance too far back)
decbuf length is 8 bytes (0x8)
00000000: 1c 3f 34 c9 ab 53 5b 51 .?4..S[Q
.940 and after is the garbage. it seems to me if we're dealing with bad input we may not be able to detect all bad output on inflate?
Sorry, I'm not getting what the question is.
We were trying to figure out whether it was safe to use the data written by inflate if we just checked avail_out to see if it had changed regardless of checking inflate return value. In other words is what is written in next_out always valid. Let's say next_out is buf[100] and avail_out before is 100 but after is 30. Does that mean the 70 bytes written by inflate is always valid?
@monnerat has since wrote a test in the previous post that appears to show inflate z_ok even with corrupt input.
if garbage data is split across 2 output buffers, the error is not detected before the 2nd call to inflate()
I added the libcurl dump function to the test to visualize it:
--- ztest.c.orig 2017-11-25 14:49:54.732905692 -0500
+++ ztest.c 2017-11-25 14:54:04.076899105 -0500
@@ -3,6 +3,37 @@
#include <stdlib.h>
#include <string.h>
+void dump(const char *text, FILE *stream,
+ const unsigned char *ptr, unsigned long long size)
+{
+ unsigned long long i;
+ unsigned int c, width = 0x10;
+
+ fprintf(stream, "%s length is %llu bytes (0x%llx)\n",
+ text, size, size);
+
+ for(i = 0; i < size; i += width) {
+
+ fprintf(stream, "%8.8llx: ", i);
+
+ /* show hex to the left */
+ for(c = 0; c < width; c++) {
+ if(i+c < size)
+ fprintf(stream, "%02x ", ptr[i+c]);
+ else
+ fputs(" ", stream);
+ }
+
+ /* show data on the right */
+ for(c = 0; (c < width) && (i+c < size); c++) {
+ char x = (ptr[i+c] >= 0x20 && ptr[i+c] < 0x80) ? ptr[i+c] : '.';
+ fputc(x, stream);
+ }
+
+ fputc('\n', stream); /* newline */
+ }
+}
+
char testdata[] = "Zlib compressed data";
int
@@ -34,10 +65,11 @@
z.next_out = (Bytef *) decbuf;
z.avail_out = sizeof decbuf;
zlibrc = inflate(&z, Z_SYNC_FLUSH);
- printf("consumed: %u bytes, inflated: %u bytes, " \
+ printf("consumed: %zu bytes, inflated: %zu bytes, " \
"inflate return code = %d (%s)\n",
enclen - z.avail_in, sizeof decbuf - z.avail_out,
zlibrc, z.msg);
+ dump("decbuf", stdout, (unsigned char *)decbuf, sizeof decbuf - z.avail_out);
} while (zlibrc == Z_OK);
exit(0);
}
owner@ubuntu1604-x64-vm:~$ ./ztest
consumed: 32 bytes, inflated: 24 bytes, inflate return code = 0 ((null))
decbuf length is 24 bytes (0x18)
00000000: 5a 6c 69 62 20 63 6f 6d 70 72 65 73 73 65 64 20 Zlib compressed
00000010: 64 61 74 61 e3 39 34 30 data.940
consumed: 42 bytes, inflated: 8 bytes, inflate return code = -3 (invalid distance too far back)
decbuf length is 8 bytes (0x8)
00000000: 1c 3f 34 c9 ab 53 5b 51 .?4..S[Q
.940 and after is the garbage. it seems to me if we're dealing with bad input we may not be able to detect all bad output on inflate?
There's a failing Travis job, but i'm pretty sure it is not related. |
The zlib documentation states that when the output buffer has been fully filled by inflate(), there still may be some output data bytes latched in state: even if all input data have been consumed, inflate() must be called again to obtain these bytes.
This reduces the "chances" to have corrupt data silently appended to properly inflated data. Also flush data only if status is Z_OK or Z_STREAM_END.
@jay : Review stalled ?
Of course, there will still be cases where corrupt data remains undetected, i.e.: if non-zlib-control bytes are altered. The overall changes in this PR improve zlib inflating, both for restarting in raw mode and at the data flushing/error detection level. |
@monnerat That all sounds good. I have not done a final review yet but I will. Also I have a question about the different flags, from the zlib manual:
If we don't flush the data then could we end up in a situation where some huge block can't be decompressed (inflated)? |
I made another test program, creating 100000 |
} | ||
} | ||
/* UNREACHED */ | ||
free(decomp); | ||
if(nread && zp->zlib_init == ZLIB_INIT) |
jay
Dec 13, 2017
Member
what is this for
what is this for
monnerat
Dec 13, 2017
Author
Contributor
We're about to leave this call thus the nread-bytes won't be seen again. If we are in a state that would wrongly allow restart in raw mode at the next call, change state.
We're about to leave this call thus the nread-bytes won't be seen again. If we are in a state that would wrongly allow restart in raw mode at the next call, change state.
if(inflateInit2(z, -MAX_WBITS) != Z_OK) { | ||
return process_zlib_error(conn, z); | ||
} | ||
zp->trailerlen = 8; /* Trailer length. */ |
jay
Dec 13, 2017
Member
where did you get this magic number
where did you get this magic number
monnerat
Dec 13, 2017
Author
Contributor
http://www.zlib.org/rfc-gzip.html#file-format
A trailer is made of a CRC32 and a 32-bit input size.
http://www.zlib.org/rfc-gzip.html#file-format
A trailer is made of a CRC32 and a 32-bit input size.
jay
Dec 14, 2017
Member
I see. I had actually looked at that before I asked and couldn't figure it out. I realize now the (more -->)
is describing the structure and so trailer is after ...compressed blocks...
.
I see. I had actually looked at that before I asked and couldn't figure it out. I realize now the (more -->)
is describing the structure and so trailer is after ...compressed blocks...
.
if(inflateInit2(z, -MAX_WBITS) != Z_OK) { | ||
return process_zlib_error(conn, z); | ||
} | ||
zp->trailerlen = 8; /* Trailer length. */ |
jay
Dec 14, 2017
Member
I see. I had actually looked at that before I asked and couldn't figure it out. I realize now the (more -->)
is describing the structure and so trailer is after ...compressed blocks...
.
I see. I had actually looked at that before I asked and couldn't figure it out. I realize now the (more -->)
is describing the structure and so trailer is after ...compressed blocks...
.
re commits I think some of them could be squashed but i'll leave it to you. also please add |
@jay : thanks for review. |
If the initial data is not available anymore or if output has
already started, restarting in raw mode is not possible.
This commit also rephrase the inflate_stream() procedure and checks
for deflate/gzip unparsable trailing data bytes.
New test 232 checks the deflate algorithm in raw mode.