[WARC writer / protocol-okhttp] WARC-Truncated header issues and improvements #10

Open
sebastian-nagel opened this issue Jul 18, 2019 · 2 comments

Comments

@sebastian-nagel
commented Jul 18, 2019

There are some oddities in how truncated captures are recorded in WARC files. See also Henry Thompson's report and the discussion in the Common Crawl user group.

  • (protocol-okhttp) reliably annotate content truncated by the length limit:
    • this happens mostly (100 out of 110) for pages with Content-Encoding: gzip
    • no truncation flag is added if the loop that reads the content chunk by chunk exits upon reaching the content limit exactly: verify and open an issue to fix this in upstream Nutch (NUTCH-2729)
  • analyze truncations flagged as "disconnect" due to an IOException
    • in 3 analyzed WARC files, all records flagged as "disconnect" have either "gzip" content encoding or "chunked" transfer encoding (or both), so the cause could also be a broken encoding rather than a network disconnect. Note: we would also need to clarify how to annotate truncations due to protocol-level errors.
    • see #13 for more details
  • always add a Content-Length header to the HTTP headers in the WARC file, even if there wasn't one in the original HTTP response (e.g. for chunked transfer encoding). Implemented in 3663f35.
  • (along the way) upgrade to the latest okhttp library (NUTCH-2728)
  • port fixes to StormCrawler
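
To illustrate the "limit reached exactly" case from the list above, here is a minimal sketch of a chunk-by-chunk read that still decides reliably whether content was cut off. It is not the actual protocol-okhttp code and not necessarily how NUTCH-2729 fixes it: the class `TruncationSketch`, the method `readLimited` and the parameter `maxContent` are hypothetical names, and the sketch assumes a plain `InputStream`. The idea is that hitting the limit alone says nothing, so one extra byte is probed after the loop to see whether the stream had more data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration only; not part of Nutch or okhttp. */
final class TruncationSketch {

    /** Bytes kept from the response body and whether more bytes were available. */
    static final class Capture {
        final byte[] content;
        final boolean truncated;

        Capture(byte[] content, boolean truncated) {
            this.content = content;
            this.truncated = truncated;
        }
    }

    /** Reads at most maxContent bytes and reports whether the stream held more. */
    static Capture readLimited(InputStream in, int maxContent) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (out.size() < maxContent) {
            // Never request more than the remaining budget.
            int want = Math.min(buf.length, maxContent - out.size());
            int n = in.read(buf, 0, want);
            if (n == -1) {
                // Stream ended before the limit: nothing was cut off.
                return new Capture(out.toByteArray(), false);
            }
            out.write(buf, 0, n);
        }
        // The loop exited with exactly maxContent bytes stored. Probe one more
        // byte: only if it exists was the content really truncated.
        boolean truncated = in.read() != -1;
        return new Capture(out.toByteArray(), truncated);
    }

    public static void main(String[] args) throws IOException {
        byte[] body = new byte[100];
        // Body fits the limit exactly: not truncated.
        System.out.println(readLimited(new ByteArrayInputStream(body), 100).truncated);
        // Limit is one byte short: truncated.
        System.out.println(readLimited(new ByteArrayInputStream(body), 99).truncated);
    }
}
```

With a 100-byte body, a limit of 100 reports no truncation and a limit of 99 reports truncation, which is the distinction the WARC writer needs before it sets a WARC-Truncated value of "length". The probe consumes one byte from the stream, which is acceptable here because nothing is read after the limit is hit.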
@sebastian-nagel

Author

commented Jul 19, 2019

For analysis and verification see

@sebastian-nagel

Author

commented Aug 30, 2019

Implemented and fixed for the August 2019 crawl (CC-MAIN-2019-35). Solution verified on 100 randomly selected WARC files.
