WARC writer incorrectly adds extra line in response records between HTTP headers and payload content #5
Comments
between HTTP headers and payload content (#5) - HTTP headers are now supposed to always contain a trailing empty line - and are fixed in case they do not - also removes invalid header lines (empty lines or those not in "key: value" format)
The HTTP headers are "recorded" by protocol plugins. A check of the plugins shows that they do not agree regarding
This bug is caused by the upgrade to Nutch 1.15 and the switch from protocol-http (Common Crawl's fork) to protocol-okhttp. The latter records both request and response headers with a trailing empty line, so the WARC writer should not add an extra empty line in that case. Ideally, protocol-http should be fixed so that all plugins are consistent regarding line breaks.
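To illustrate the header handling described in the fix for #5 (headers always end with a trailing empty line, invalid lines are dropped), here is a minimal Python sketch. It is not the actual Nutch code, which is Java; the function names `normalize_http_headers` and `build_http_block` are hypothetical, and treating the first line as the status line is an assumption.

```python
def normalize_http_headers(raw: str) -> str:
    """Drop invalid header lines and end with exactly one empty line.

    Assumption: the first line is the status line (e.g. "HTTP/1.1 200 OK")
    and is kept as-is; every later line must look like "key: value".
    """
    lines = raw.replace("\r\n", "\n").split("\n")
    kept = []
    for i, line in enumerate(lines):
        if i == 0:
            kept.append(line)  # status line
            continue
        name, sep, _ = line.partition(":")
        if sep and name.strip():
            kept.append(line)  # valid "key: value" line
        # empty lines and lines without a colon are dropped
    # exactly one CRLF after the last header, then the empty line
    return "\r\n".join(kept) + "\r\n\r\n"


def build_http_block(raw_headers: str, payload: bytes) -> bytes:
    """Concatenate headers and payload without adding another line break."""
    return normalize_http_headers(raw_headers).encode("iso-8859-1") + payload
```

With this approach the writer never needs to know whether a protocol plugin recorded the trailing empty line or not; both inputs produce the same output.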
While resolving a separate issue I noticed that the EDIT: Apologies for the multiple edits. I had misdiagnosed the original issue. UPDATE: Verified - if I skip the first two bytes of the payload I then get the expected digest.
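The check described above (skipping the first two bytes of the payload) can be sketched as follows. This assumes the recorded payload digest has the form `sha1:` followed by the Base32-encoded SHA-1 hash; the function names `payload_digest` and `diagnose_payload` are hypothetical and only meant as an illustration.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """SHA-1 digest in the assumed "sha1:<Base32>" form."""
    sha1 = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(sha1).decode("ascii")


def diagnose_payload(payload: bytes, recorded_digest: str) -> str:
    """Classify a payload against the digest recorded for it."""
    if payload_digest(payload) == recorded_digest:
        return "ok"
    # Affected records carry an extra leading "\r\n" in the payload.
    if payload.startswith(b"\r\n") and payload_digest(payload[2:]) == recorded_digest:
        return "extra-crlf"
    return "mismatch"
```

A record reported as `extra-crlf` would match the symptom in this issue, while `mismatch` points to some unrelated problem.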
Hi @anjackson, thanks for reminding me about the digests. The WARC files of the September crawl are correct now. If time permits, I'd like to unify the header format for all Nutch protocol plugins. Eventually, we'll fix the WARC files of the August crawl, but that's cumbersome as the URL index will be wrong for a couple of days until all WARC files are rewritten and reindexed.
between HTTP headers and payload content (#5) - HTTP headers are now supposed to always contain a trailing empty line - fix unit to reflect this
Format unification of stored HTTP headers is tracked in NUTCH-2657.
(reported by @wumpus, thanks!)
The WARC writer adds a redundant empty line between the HTTP headers and the payload content of WARC response records. This contradicts the HTTP standard, which requires exactly one empty line between the response header fields and the payload content.
This bug affects all WARC files (more precisely, their response records) in the Common Crawl August 2018 crawl archives. This may cause the following problems when the WARC files are processed:
- The Content-Length header (optional) is off by 2: the payload includes the leading "\r\n" while the header value does not include it.
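As a rough way to spot affected records, the offset between the declared and the actual payload length can be computed. This is a minimal sketch that assumes the Content-Length in question is the one in the HTTP response headers and that the input is the raw HTTP block of a response record (headers plus payload); `content_length_offset` is a hypothetical helper name.

```python
from typing import Optional


def content_length_offset(http_block: bytes) -> Optional[int]:
    """Return len(payload) minus the declared Content-Length.

    Returns None if there is no Content-Length header. The payload is
    taken to be everything after the first empty line.
    """
    head, sep, payload = http_block.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no empty line between headers and payload")
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, colon, value = line.partition(b":")
        if colon and name.strip().lower() == b"content-length":
            return len(payload) - int(value.strip().decode("ascii"))
    return None
```

An offset of 2 together with a payload starting with "\r\n" matches the problem described here. Other offsets can have unrelated causes, so the value alone should not be treated as proof.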