New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deflate error after all content was received #2719

Open
clbr opened this Issue Jul 7, 2018 · 22 comments

Comments

Projects
None yet
4 participants
@clbr
Contributor

clbr commented Jul 7, 2018

I did this

curl -o test -k https://stooq.com/q/?s=cbl_e.us -v --compressed

Ever since dbcced8, this download fails with "curl: (23) Failed writing data".

I expected the following

Successful download. It may be curl's or Apache's bug, but in either case curl should handle it gracefully.

curl/libcurl version

curl 7.61.0-DEV (x86_64-unknown-linux-gnu) libcurl/7.61.0-DEV OpenSSL/1.0.2o zlib/1.2.3 librtmp/2.3
Release-Date: [unreleased]
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtmp rtsp smb smbs smtp smtps telnet tftp
Features: Largefile NTLM NTLM_WB SSL libz TLS-SRP UnixSockets HTTPS-proxy

operating system

Linux x86_64

@bagder

This comment has been minimized.

Show comment
Hide comment
@bagder

bagder Jul 7, 2018

Member

in either case curl should handle it gracefully

If this is a broken stream then curl should report error and not silently ignore it. (I'm not saying it is, as I haven't researched it deeper.)

When I try that command line I get a test file in my directory that is 146222 bytes big which looks like it could be the whole file so the error could be something in the final bits of the stream...

Member

bagder commented Jul 7, 2018

in either case curl should handle it gracefully

If this is a broken stream then curl should report error and not silently ignore it. (I'm not saying it is, as I haven't researched it deeper.)

When I try that command line I get a test file in my directory that is 146222 bytes big which looks like it could be the whole file so the error could be something in the final bits of the stream...

@bagder bagder added the HTTP label Jul 7, 2018

@clbr

This comment has been minimized.

Show comment
Hide comment
@clbr

clbr Jul 8, 2018

Contributor

The error is annoying in a browser, since going to that page then throws up an error page instead of the page contents. Ignoring that error would be bad in contexts where it's more serious.

I added some printfs, and looks like it's because that commit removed the restarting support when the state was ZLIB_UNINIT. Is it legal to have multiple deflate streams?

Contributor

clbr commented Jul 8, 2018

The error is annoying in a browser, since going to that page then throws up an error page instead of the page contents. Ignoring that error would be bad in contexts where it's more serious.

I added some printfs, and looks like it's because that commit removed the restarting support when the state was ZLIB_UNINIT. Is it legal to have multiple deflate streams?

@bagder

This comment has been minimized.

Show comment
Hide comment
@bagder

bagder Jul 8, 2018

Member

@monnerat, can you figure out if there's anything we can improve here?

Member

bagder commented Jul 8, 2018

@monnerat, can you figure out if there's anything we can improve here?

@monnerat

This comment has been minimized.

Show comment
Hide comment
@monnerat

monnerat Jul 9, 2018

Collaborator

This particular case uses the "deflate" algorithm in its deviant form (raw data), with 2 extra bytes appended (0x03, 0x00). The semantics of these bytes are unknown to me, but I can see they are constants even if the page content changes.

This was not detected before the commit mentioned above, since all trailing data were silently ignored, even if consisting of thousands of bytes.

Curl deals with 3 kinds of zlib-supported data:

  • Gzip data with header and trailer.
  • Normal deflate data (probably containing header and trailer).
  • Raw deflate data, as a fallback when initial normal deflate data parsing fails. This has no header/trailer. I've read long ago this was implemented to support lame M$ (IIS?) encoding, without further explanations.

What is strange here is this seems to come from an Apache server: I can suspect it is a reverse proxy to a M$ software, or the encoding is processed by the application itself.

Is it another deviant form of data ? Do we have to support it ? Since I can't understand what these additional bytes are, we need details about what to support. And write down these in our own documentation since these forms are not "official".

In any case, unexpected trailing data should not be ignored without error, since this may result in considering the received data as complete while it is not.

If you have details about expected data in deviant forms we should support, I can investigate for an implementation. Else I have no serious solution.

There is no support for multiple streams in curl, since it is hard to figure how to process them. In our current case, the trailing data is obviously not a second stream.

Collaborator

monnerat commented Jul 9, 2018

This particular case uses the "deflate" algorithm in its deviant form (raw data), with 2 extra bytes appended (0x03, 0x00). The semantics of these bytes are unknown to me, but I can see they are constants even if the page content changes.

This was not detected before the commit mentioned above, since all trailing data were silently ignored, even if consisting of thousands of bytes.

Curl deals with 3 kinds of zlib-supported data:

  • Gzip data with header and trailer.
  • Normal deflate data (probably containing header and trailer).
  • Raw deflate data, as a fallback when initial normal deflate data parsing fails. This has no header/trailer. I've read long ago this was implemented to support lame M$ (IIS?) encoding, without further explanations.

What is strange here is this seems to come from an Apache server: I can suspect it is a reverse proxy to a M$ software, or the encoding is processed by the application itself.

Is it another deviant form of data ? Do we have to support it ? Since I can't understand what these additional bytes are, we need details about what to support. And write down these in our own documentation since these forms are not "official".

In any case, unexpected trailing data should not be ignored without error, since this may result in considering the received data as complete while it is not.

If you have details about expected data in deviant forms we should support, I can investigate for an implementation. Else I have no serious solution.

There is no support for multiple streams in curl, since it is hard to figure how to process them. In our current case, the trailing data is obviously not a second stream.

@clbr

This comment has been minimized.

Show comment
Hide comment
@clbr

clbr Jul 9, 2018

Contributor
Contributor

clbr commented Jul 9, 2018

@bagder

This comment has been minimized.

Show comment
Hide comment
@bagder

bagder Jul 10, 2018

Member

It's sometimes hard to mimic browsers since they are often very lenient with errors are never very strict on enforcing what's right or wrong (if it would make users annoyed).

In this case where browsers show the full content without errors and even older curls do that I agree that it feels highly annoying that it now instead shows an error.

This specific content is delivered with chunked-encoding. Can we take the fact that it has delivered the final terminating chunk as an additional clue that we have indeed seen the end of the "document" and then be lenient with gzip errors?

Member

bagder commented Jul 10, 2018

It's sometimes hard to mimic browsers since they are often very lenient with errors are never very strict on enforcing what's right or wrong (if it would make users annoyed).

In this case where browsers show the full content without errors and even older curls do that I agree that it feels highly annoying that it now instead shows an error.

This specific content is delivered with chunked-encoding. Can we take the fact that it has delivered the final terminating chunk as an additional clue that we have indeed seen the end of the "document" and then be lenient with gzip errors?

@monnerat

This comment has been minimized.

Show comment
Hide comment
@monnerat

monnerat Jul 10, 2018

Collaborator

This specific content is delivered with chunked-encoding. Can we take the fact that it has delivered the final terminating chunk as an additional clue that we have indeed seen the end of the "document" and then be lenient with gzip errors?

I dont' like it at all. First, because chunked encoding processing is done at the input level, while other encodings (gtip, deflate, brotli) are handled at the output level (they really should not...). Next, because there can be a lot of unprocessed data (up to buffer size) at the write time. In addition, nothing tells us we cannot have it without chunked encoding... and maybe without data size indication too.

I would prefer a solution being tolerant about the trailer length (i.e.: <= 4 bytes) and only in the case of deviant deflate data.

The best solution remains to identify the meaning of these 2 additional bytes, in order to keep control of what curl does: without it, we are navigating in the fog, without GPS :-)

Maybe browser code can tell us details of their implementation, but that goes out of my skills.

Collaborator

monnerat commented Jul 10, 2018

This specific content is delivered with chunked-encoding. Can we take the fact that it has delivered the final terminating chunk as an additional clue that we have indeed seen the end of the "document" and then be lenient with gzip errors?

I dont' like it at all. First, because chunked encoding processing is done at the input level, while other encodings (gtip, deflate, brotli) are handled at the output level (they really should not...). Next, because there can be a lot of unprocessed data (up to buffer size) at the write time. In addition, nothing tells us we cannot have it without chunked encoding... and maybe without data size indication too.

I would prefer a solution being tolerant about the trailer length (i.e.: <= 4 bytes) and only in the case of deviant deflate data.

The best solution remains to identify the meaning of these 2 additional bytes, in order to keep control of what curl does: without it, we are navigating in the fog, without GPS :-)

Maybe browser code can tell us details of their implementation, but that goes out of my skills.

@bagder

This comment has been minimized.

Show comment
Hide comment
@bagder

bagder Jul 10, 2018

Member

Maybe browser code can tell us details of their implementation

As I said, browsers are very lenient when it comes to detecting the end of a document and they will refrain from triggering an error or alert and rather just hope all data got in - and if in doubt they rather just mark the connection as not suitable for reuse.

A while ago I worked on trying to make Firefox's treatment of content-length and "framing" stricter, but I ended up having to surrender because the web is already broken in so many regards in that area so in the end we basically had to let it keep ignoring chunked-encoded streams that are cut off before the last chunk, weirdo gzip stream problems, gzip streams without the trailer etc.

If we were to act like the browsers we would probably: ignore all gzip problems, stop reading at that point of the error and mark the connection for not-reuse.

Member

bagder commented Jul 10, 2018

Maybe browser code can tell us details of their implementation

As I said, browsers are very lenient when it comes to detecting the end of a document and they will refrain from triggering an error or alert and rather just hope all data got in - and if in doubt they rather just mark the connection as not suitable for reuse.

A while ago I worked on trying to make Firefox's treatment of content-length and "framing" stricter, but I ended up having to surrender because the web is already broken in so many regards in that area so in the end we basically had to let it keep ignoring chunked-encoded streams that are cut off before the last chunk, weirdo gzip stream problems, gzip streams without the trailer etc.

If we were to act like the browsers we would probably: ignore all gzip problems, stop reading at that point of the error and mark the connection for not-reuse.

@monnerat monnerat closed this in f8be737 Jul 12, 2018

@clbr

This comment has been minimized.

Show comment
Hide comment
@clbr

clbr Jul 17, 2018

Contributor

stooq.com works, but I hit another site that fails with master: ebay.

curl -o test -k 'https://www.ebay.com/sch/i.html?LH_CAds=&_ex_kw=&_fpos=&_fspt=1&_mPrRngCbx=1&_nkw=150+in+1+gba&_sacat=&_sadis=&_sop=12&_udhi=&_udlo=&rmvSB=true&_fosrp=1' --compressed

Contributor

clbr commented Jul 17, 2018

stooq.com works, but I hit another site that fails with master: ebay.

curl -o test -k 'https://www.ebay.com/sch/i.html?LH_CAds=&_ex_kw=&_fpos=&_fspt=1&_mPrRngCbx=1&_nkw=150+in+1+gba&_sacat=&_sadis=&_sop=12&_udhi=&_udlo=&rmvSB=true&_fosrp=1' --compressed

@clbr

This comment has been minimized.

Show comment
Hide comment
@clbr

clbr Jul 17, 2018

Contributor

Oh, even the front page reproduces, no need for the long url:
curl -o test -k 'https://www.ebay.com' --compressed

Contributor

clbr commented Jul 17, 2018

Oh, even the front page reproduces, no need for the long url:
curl -o test -k 'https://www.ebay.com' --compressed

@monnerat

This comment has been minimized.

Show comment
Hide comment
@monnerat

monnerat Jul 18, 2018

Collaborator

I cannot reproduce with the ebay home page: this works as expected here.
The long ebay url fails and already failed with 7.55.1.
This is not the same problem as the stooq.com one: the encoding is not a deviant deflate, but supposed to be a conformant gzip. By dumping the server data, I see some clear text appended to the gzip stream by the server:

<!-- RcmdId srpnweb,RlogId t6pwvit%60d%3D9vjdpwvit%60d*2%3D40212-164aaeaf815-0x604 --><!-- SiteId: 0, Environment: production, AppName: srpnweb, PageId: 2334524 -->

This is clearly a server bug.

Collaborator

monnerat commented Jul 18, 2018

I cannot reproduce with the ebay home page: this works as expected here.
The long ebay url fails and already failed with 7.55.1.
This is not the same problem as the stooq.com one: the encoding is not a deviant deflate, but supposed to be a conformant gzip. By dumping the server data, I see some clear text appended to the gzip stream by the server:

<!-- RcmdId srpnweb,RlogId t6pwvit%60d%3D9vjdpwvit%60d*2%3D40212-164aaeaf815-0x604 --><!-- SiteId: 0, Environment: production, AppName: srpnweb, PageId: 2334524 -->

This is clearly a server bug.

@clbr

This comment has been minimized.

Show comment
Hide comment
@clbr

clbr Jul 18, 2018

Contributor
Contributor

clbr commented Jul 18, 2018

@monnerat

This comment has been minimized.

Show comment
Hide comment
@monnerat

monnerat Jul 18, 2018

Collaborator

Would it be possible to add a curl option to ignore any such buggy trailing data?

I'm not convinced this is the good and right thing to do. Therefore I request other developers opinion.

Not being able to access ebay is bad.

You're still able to access it uncompressed and I think this is the best thing to do when facing a server compression bug.

Collaborator

monnerat commented Jul 18, 2018

Would it be possible to add a curl option to ignore any such buggy trailing data?

I'm not convinced this is the good and right thing to do. Therefore I request other developers opinion.

Not being able to access ebay is bad.

You're still able to access it uncompressed and I think this is the best thing to do when facing a server compression bug.

@clbr

This comment has been minimized.

Show comment
Hide comment
@clbr

clbr Jul 18, 2018

Contributor

I cannot tell it apart from legit (23) write errors.

Contributor

clbr commented Jul 18, 2018

I cannot tell it apart from legit (23) write errors.

@monnerat

This comment has been minimized.

Show comment
Hide comment
@monnerat

monnerat Jul 18, 2018

Collaborator

Would it be satisfactory for you to get CURLE_BAD_CONTENT_ENCODING (61: Unrecognized or bad HTTP Content or Transfer-Encoding) in our cases ?

Collaborator

monnerat commented Jul 18, 2018

Would it be satisfactory for you to get CURLE_BAD_CONTENT_ENCODING (61: Unrecognized or bad HTTP Content or Transfer-Encoding) in our cases ?

@clbr

This comment has been minimized.

Show comment
Hide comment
@clbr

clbr Jul 18, 2018

Contributor

Not sure if that would also have legit errors. A new "retry without compression" error would work.

However, forcing a page reload without compression would give a worse experience vs other browsers, so ideally I'd want these trailers to be ignored. It might cause a visible delay up to one second, plus the time taken by an uncompressed download, and then having to keep track which sites need to disable compression.

Contributor

clbr commented Jul 18, 2018

Not sure if that would also have legit errors. A new "retry without compression" error would work.

However, forcing a page reload without compression would give a worse experience vs other browsers, so ideally I'd want these trailers to be ignored. It might cause a visible delay up to one second, plus the time taken by an uncompressed download, and then having to keep track which sites need to disable compression.

@clbr

This comment has been minimized.

Show comment
Hide comment
@clbr

clbr Jul 18, 2018

Contributor

(well, a CURLE_RETRY_WITHOUT_COMPRESSION would also achieve the ideal goal, since I could then ignore that error without fear of false positives)

Contributor

clbr commented Jul 18, 2018

(well, a CURLE_RETRY_WITHOUT_COMPRESSION would also achieve the ideal goal, since I could then ignore that error without fear of false positives)

@monnerat

This comment has been minimized.

Show comment
Hide comment
@monnerat

monnerat Jul 18, 2018

Collaborator

The following situations may currently return CURLE_BAD_CONTENT_ENCODING:

  • Unrecognized encoding token in header Content-Encoding or Transfer-Encoding
  • Base64 decoding error (wrong input data)
  • Real deflate, gzip or brotli decode error returned by the corresponding libraries.
  • WWW-Authenticate or Proxy-Authenticateshould be digest and is not.
Collaborator

monnerat commented Jul 18, 2018

The following situations may currently return CURLE_BAD_CONTENT_ENCODING:

  • Unrecognized encoding token in header Content-Encoding or Transfer-Encoding
  • Base64 decoding error (wrong input data)
  • Real deflate, gzip or brotli decode error returned by the corresponding libraries.
  • WWW-Authenticate or Proxy-Authenticateshould be digest and is not.
@clbr

This comment has been minimized.

Show comment
Hide comment
@clbr

clbr Jul 19, 2018

Contributor

Okay, all those are errors I want to show to the user, so they would be false positives from the ignoring POV.

Contributor

clbr commented Jul 19, 2018

Okay, all those are errors I want to show to the user, so they would be false positives from the ignoring POV.

@bagder

This comment has been minimized.

Show comment
Hide comment
@bagder

bagder Jul 20, 2018

Member

CURLE_RETRY_WITHOUT_COMPRESSION

That would not be an error code name I would approve of. We identify the problem with the error codes. The correct next action is then up to the application, I don't think the error code itself should suggest that as we can't possibly know what the application actually wants to do.

I think we should rather increase our efforts to figure out when we get these errors after all "relevant" data has already been returned so that we can ignore the error (but possibly mark the connection for close since it's in a dubious state).

Member

bagder commented Jul 20, 2018

CURLE_RETRY_WITHOUT_COMPRESSION

That would not be an error code name I would approve of. We identify the problem with the error codes. The correct next action is then up to the application, I don't think the error code itself should suggest that as we can't possibly know what the application actually wants to do.

I think we should rather increase our efforts to figure out when we get these errors after all "relevant" data has already been returned so that we can ignore the error (but possibly mark the connection for close since it's in a dubious state).

@clbr

This comment has been minimized.

Show comment
Hide comment
@clbr

clbr Jul 25, 2018

Contributor

Should I open a new issue so this doesn't get forgotten, or can this be reopened?

Contributor

clbr commented Jul 25, 2018

Should I open a new issue so this doesn't get forgotten, or can this be reopened?

@bagder bagder reopened this Jul 25, 2018

@bagder bagder changed the title from Deflate regression to Deflate error after all content was received Jul 25, 2018

@AGenchev

This comment has been minimized.

Show comment
Hide comment
@AGenchev

AGenchev Aug 20, 2018

I was trying to find what's that broken server software and while ebay hid their server,
I found that this site: https://dir.bg reports nginx (without version)
curl -o test.htm -k 'https://dir.bg' --compressed
fails and the decoded page is not complete.

AGenchev commented Aug 20, 2018

I was trying to find what's that broken server software and while ebay hid their server,
I found that this site: https://dir.bg reports nginx (without version)
curl -o test.htm -k 'https://dir.bg' --compressed
fails and the decoded page is not complete.

falconindy added a commit to falconindy/curl that referenced this issue Sep 10, 2018

content_encoding: accept up to 4 unknown trailer bytes after raw defl…
…ate data

Some servers issue raw deflate data that may be followed by an undocumented
trailer. This commit makes curl tolerate such a trailer of up to 4 bytes
before considering the data is in error.

Reported-by: clbr on github
Fixes #2719
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment