In the robots.txt subset, the WARC request/response header field WARC-Date is set from the HTTP Date header sent by the server.
Instead, the current fetch time should be tracked and passed forward to the WARC writer.
- the robots.txt subset used the HTTP date sent by the server,
  see commoncrawl/nutch#14

    WARC/1.0
    WARC-Type: response
    WARC-Date: 48784-07-15T07:13:33Z
    ...
    HTTP/1.1 404 Not Found
    ...
    Date: Sun, 15 Jul 48784 07:13:33 GMT

  causing a "ValueError: year is out of range" in warcio.timeutils
- catch the ValueError and use the date when capturing started
  given in the WARC file name
- increase number of WARC files processed per map task
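
The fallback described above could look roughly like the sketch below. It is not the code from this change: it uses only the standard library instead of warcio.timeutils, and the helper names (`parse_warc_date`, `fallback_date_from_filename`) as well as the assumed file name layout `CC-MAIN-<capture start>-<capture end>-<serial>.warc.gz` are illustrative assumptions.

```python
import re
from datetime import datetime, timezone

# Assumption: the WARC file name carries two 14-digit timestamps,
# the first one being the time when capturing started.
WARC_NAME_TS = re.compile(r"CC-MAIN-(\d{14})-\d{14}-\d+\.warc")


def fallback_date_from_filename(warc_path):
    """Return the capture start time encoded in the WARC file name (hypothetical helper)."""
    match = WARC_NAME_TS.search(warc_path)
    if match is None:
        raise ValueError("no capture timestamp in WARC file name: %r" % warc_path)
    ts = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return ts.replace(tzinfo=timezone.utc)


def parse_warc_date(warc_date, warc_path):
    """Parse a WARC-Date value; fall back to the file name date if it is invalid."""
    try:
        parsed = datetime.strptime(warc_date, "%Y-%m-%dT%H:%M:%SZ")
        return parsed.replace(tzinfo=timezone.utc)
    except ValueError:
        # e.g. a five-digit year such as 48784 copied from a broken HTTP Date header
        return fallback_date_from_filename(warc_path)
```

With a made-up file name such as `CC-MAIN-20190715175205-20190715200159-00000.warc.gz`, the invalid date `48784-07-15T07:13:33Z` would resolve to the capture start time 2019-07-15 17:52:05 UTC, while valid WARC-Date values are returned unchanged.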