Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WARC writer incorrectly adds extra line in response records between HTTP headers and payload content #5

Closed
sebastian-nagel opened this issue Sep 14, 2018 · 5 comments

Comments

@sebastian-nagel
Copy link

(reported by @wumpus, thanks!)
The WARC writer adds a redundant line between the HTTP headers and the payload content of WARC response records. This contradicts the HTTP standard which requires exactly one single empty line between the response header fields and the payload content.

This bug affects all WARC files (resp. the response records) in the Common Crawl August 2018 crawl archives. This may cause the following problems when the WARC files are processed:

  • a WARC reader/parser assumes only a single empty line, the extracted payload content starts with "\r\n". Leading new lines are usually ignored by HTML parsers, but will fail other parsers for PDF, office documents, etc.
  • the payload length indicated in the HTTP Content-Length header (optional) is off by 2: the payload includes the leading "\r\n" while the header value does not include it.
sebastian-nagel added a commit that referenced this issue Sep 14, 2018
between HTTP headers and payload content (#5)
- HTTP headers are now supposed to always contain a trailing
  empty line
- and are fixed in case they do not
- also removes invalid header lines
  (empty lines or those not in "key: value" format)
sebastian-nagel added a commit that referenced this issue Sep 14, 2018
between HTTP headers and payload content (#5)
- HTTP headers are now supposed to always contain a trailing
  empty line
- and are fixed in case they do not
- also removes invalid header lines
  (empty lines or those not in "key: value" format)
@sebastian-nagel
Copy link
Author

sebastian-nagel commented Sep 14, 2018

The HTTP headers are "recorded" by protocol plugins. A check of the plugins shows that they do not agree regarding

  • the number of trailing line breaks (2 => one empty line)
  • the type of the line break (\r\n or n)
  • even the same plugin may record the headers differently for request and response headers
Response Request Plugin Nutch version
2 \r\n 2 \r\n protocol-okhttp 1.15 (impl)
1 \n 2 \r\n protocol-http 1.15 (impl)
(not supported) protocol-httpclient 1.15
1 \r\n 2 \r\n protocol-http 1.6 CC (impl)

This bug is caused by an upgrade to Nutch 1.15 and a change from protocol-http (Common Crawl's fork) to protocol-okhttp. The latter records both request and response headers with a trailing empty line. The WARC writer should not add an extra empty line then. Ideally, the protocol-http should be fixed to be consistent regarding line breaks.

@anjackson
Copy link

anjackson commented Oct 3, 2018

While resolving a separate issue I noticed that the WARC-Payload-Digest value for response records in the CC-MAIN-2018-34 collection appear to be broken. The WARC-Block-Digest matches the whole response (including headers and extra newline), but the WARC-Payload-Digest does not match. I believe this is because of the extra newline noted in this issue.

EDIT: Apologies for the multiple edits. I had misdiagnosed the original issue.

UPDATE: Verified - if I skip the first two bytes of the payload I then get the expected digest.

@sebastian-nagel
Copy link
Author

Hi @anjackson, thanks for remembering me about the digests. The WARC files of the September crawl are correct now. If time I'll want to unify the header format for all Nutch protocol plugins. Eventually, we'll fix the WARC files of the August crawl but that's cumbersome as the URL index will be wrong for a couple of days until all WARC files are rewritten and reindexed.

sebastian-nagel added a commit that referenced this issue Oct 10, 2018
between HTTP headers and payload content (#5)
- HTTP headers are now supposed to always contain a trailing empty line
- fix unit to reflect this
@sebastian-nagel
Copy link
Author

Format unification of stored HTTP headers tracked in NUTCH-2657.

@sebastian-nagel
Copy link
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants