WARC-Date in robots.txt subset not to rely on HTTP Date #14

Closed
sebastian-nagel opened this issue Nov 7, 2019 · 0 comments

sebastian-nagel commented Nov 7, 2019

In the robots.txt subset, the WARC-Date field of the WARC request/response records is filled from the Date field of the HTTP header, because for robots.txt responses there is no underlying CrawlDatum, which normally holds the fetch time. In a few cases the value in the HTTP header is wrong, e.g. a date pointing to Dec 2018 in the October 2019 crawl:

WARC/1.0
WARC-Type: response
WARC-Date: 2018-12-25T00:00:57Z
WARC-Record-ID: <urn:uuid:39df27c4-1ab9-4be6-b997-f474f8402e06>
Content-Length: 211
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:6f71e2c2-9fd4-4ed6-9c21-a8f3cba70614>
WARC-Concurrent-To: <urn:uuid:c87d0843-ede8-4707-ae98-0c293bd25890>
WARC-IP-Address: 198.200.48.174
WARC-Target-URI: http://www.jzyjtf.com/robots.txt
WARC-Payload-Digest: sha1:GLWMDG4MKBODCSNNKNKDDN24HUOAMEZH
WARC-Block-Digest: sha1:L434WCUBLIKYJKTXD63XB2GBGZSS6GJH
WARC-Identified-Payload-Type: message/rfc822

HTTP/1.1 200 OK
X-Crawler-Transfer-Encoding: chunked
Content-Type: text/plain; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Date: Tue, 25 Dec 2018 00:00:57 GMT
Content-Length: 25

User-Agent: *
Disallow: 

Instead, the current fetch time should be tracked and passed forward to the WARC writer.
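
As a rough illustration of the idea (not the actual Nutch change, which is in Java), the sketch below records the fetch time on the crawler side and formats it as a WARC-Date value, so the server-supplied HTTP Date header is never consulted. The function name `warc_date` and the example timestamp are hypothetical.

```python
from datetime import datetime, timezone


def warc_date(fetch_time: datetime) -> str:
    """Format a crawler-side fetch timestamp as a WARC-Date value.

    The value comes from the crawler's own clock, not from the
    HTTP Date header sent by the server (hypothetical helper).
    """
    return fetch_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical fetch time, recorded when the robots.txt request was made.
fetch_time = datetime(2019, 10, 14, 12, 0, 0, tzinfo=timezone.utc)

# The server may claim any date; it is kept in the HTTP header block only.
server_date_header = "Tue, 25 Dec 2018 00:00:57 GMT"

print("WARC-Date:", warc_date(fetch_time))
```

With this approach a wrong server clock only affects the archived HTTP header, not the WARC record metadata.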

sebastian-nagel added a commit that referenced this issue Nov 12, 2019
- fixes #14 (WARC-Date in robots.txt subset not to rely on HTTP Date header)
sebastian-nagel added a commit to commoncrawl/webarchive-indexing that referenced this issue Nov 25, 2019
- the robots.txt subset used the HTTP date sent by the server
  see commoncrawl/nutch#14
    WARC/1.0
    WARC-Type: response
    WARC-Date: 48784-07-15T07:13:33Z
    ...
    HTTP/1.1 404 Not Found
    ...
    Date: Sun, 15 Jul 48784 07:13:33 GMT
  causing a "ValueError: year is out of range" in warcio.timeutils
- catch the ValueError and use the date when capturing started
  given in the WARC file name
- increase number of WARC files processed per map task
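
The fallback described in the commit message above could look roughly like the following sketch. It is not the code from commoncrawl/webarchive-indexing: it uses `datetime.strptime` from the standard library as a stand-in for the warcio date parser, and it assumes the WARC file name contains the capture start time as the first run of 14 digits (`YYYYMMDDhhmmss`). The function names and the example file name are hypothetical.

```python
import re
from datetime import datetime, timezone

WARC_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
# Assumption: the first 14-digit run in the file name is the capture start time.
FILENAME_TIMESTAMP = re.compile(r"(\d{14})")


def capture_start_from_filename(warc_filename: str) -> datetime:
    """Extract the capture start time from a WARC file name (assumed layout)."""
    match = FILENAME_TIMESTAMP.search(warc_filename)
    if match is None:
        raise ValueError("no timestamp found in file name: " + warc_filename)
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def record_datetime(warc_date: str, warc_filename: str) -> datetime:
    """Parse WARC-Date; on ValueError fall back to the file name timestamp."""
    try:
        parsed = datetime.strptime(warc_date, WARC_DATE_FORMAT)
        return parsed.replace(tzinfo=timezone.utc)
    except ValueError:
        # e.g. "48784-07-15T07:13:33Z" does not parse as a valid date
        return capture_start_from_filename(warc_filename)


# Hypothetical file name, used only for illustration.
name = "CC-MAIN-20191014000000-20191014030000-00000.warc.gz"
print(record_datetime("48784-07-15T07:13:33Z", name))
```

A valid WARC-Date is returned unchanged; only unparseable values trigger the fallback.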