WARC writer (CDX writer): new optional CDX JSON fields "redirect" and "truncated" #15

Merged
merged 1 commit into cc from warc-cdx-mark-truncation-and-redirects, Nov 12, 2019

sebastian-nagel commented Nov 7, 2019

  • add key "truncated" if the record payload is truncated, indicating the reason for the truncation, cf.
    WARC-Truncated in the WARC 1.1 spec
  • add key "redirect" containing the redirect target
    • from HTTP header field "Location" if the HTTP status code indicates an HTTP redirect
    • relative paths converted to absolute URLs using the page URL as base/context
    • absent if the "Location" value is missing or is not a valid URL or a valid relative URL path
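
For illustration only, the rules for the "redirect" key could be sketched in Python as below. This is not the code in this PR: the function name `redirect_target`, the set of status codes treated as redirects, and the restriction to http/https targets are all assumptions made for the sketch.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: which status codes count as "HTTP redirect" is not
# spelled out in this PR; this set is a guess for the sketch.
REDIRECT_STATUS = {301, 302, 303, 307, 308}


def redirect_target(page_url, status, location):
    """Return the absolute redirect target, or None if the key should be absent."""
    location = (location or "").strip()
    if status not in REDIRECT_STATUS or not location:
        return None
    try:
        # Relative paths are resolved using the page URL as base/context.
        target = urljoin(page_url, location)
        parts = urlsplit(target)
    except ValueError:
        # "Location" is not a valid URL or relative URL path.
        return None
    # Assumption: only http(s) targets with a host are considered valid.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return target
```

With the page URL from the first example below as base, a relative "Location" value such as `/big-picture/frequently-asked-questions/` resolves against `https://commoncrawl.org/faq/`, while a missing "Location" or a non-redirect status yields no key at all.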

Example CDX snippets (multi-line JSON):

  • redirect target/location
org,commoncrawl)/faq 20191107134157 {
  "url": "https://commoncrawl.org/faq/",
  ...,
  "status": "301",
  ...,
  "redirect": "http://commoncrawl.org/big-picture/frequently-asked-questions/"
}
  • truncation because of overlong content
es,remax,inmomas)/robots.txt 20191107134158 {
  "url": "http://www.inmomas.remax.es/robots.txt",
  ...,
  "status": "200",
  ...,
  "truncated": "length"
}
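
A consumer of the index might read the new keys roughly as follows. This is a minimal sketch, assuming each CDX line has the form `<SURT key> <timestamp> <JSON>` on a single line (the snippets above are wrapped only for readability); `parse_cdx_line` is a hypothetical helper, not part of the WARC writer. Both keys are optional, so they are read with `.get()`.

```python
import json


def parse_cdx_line(line):
    """Split one CDX line into (SURT key, timestamp, JSON fields)."""
    # Split on the first two spaces only; the JSON part may contain spaces.
    surt_key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return surt_key, timestamp, json.loads(payload)


# Sample line built only from the fields shown above (elided fields left out).
sample = (
    'es,remax,inmomas)/robots.txt 20191107134158 '
    '{"url": "http://www.inmomas.remax.es/robots.txt", '
    '"status": "200", "truncated": "length"}'
)

surt_key, timestamp, fields = parse_cdx_line(sample)
truncated = fields.get("truncated")  # reason for truncation, or None
redirect = fields.get("redirect")    # redirect target, or None
```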
- add key "truncated" if the record payload is truncated,
  indicating the reason for the truncation, cf.
  http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-truncated
- add key "redirect" containing the redirect target
  - from HTTP header field "Location"
  - relative paths converted to absolute URLs
    using the page URL as base/context
  - absent if the "Location" string is not a valid URL
    or relative URL path
@sebastian-nagel sebastian-nagel changed the title WARC writer (CDX writer): new CDX fields/keys in JSON data WARC writer (CDX writer): new optional CDX JSON fields "redirect" and "truncated" Nov 8, 2019
@sebastian-nagel sebastian-nagel merged commit adfcc45 into cc Nov 12, 2019
@sebastian-nagel sebastian-nagel deleted the warc-cdx-mark-truncation-and-redirects branch Nov 12, 2019