Handling of DateTimeParseException in WARCSpout #1140

Merged
merged 3 commits into from
Jan 8, 2024


michaeldinzinger
Contributor

Hello all,

while loading and parsing external WARC files with the WARCSpout, StormCrawler crashed repeatedly, and the topology on the Storm cluster was restarted each time. The cause was an unhandled DateTimeParseException in the WARCSpout, which is thrown when the WARC-Date of a WARC record is invalid (e.g. some random number instead of a proper year). DateTimeParseException extends RuntimeException, so I assume this is why StormCrawler shuts down as soon as the exception is thrown and not caught.

I haven't encountered it until now, so this error seems to be rather rare. But for example in this WARC file from Common Crawl's robots.txt dumps of 2016, there is indeed an unparsable WARC-Date.

By surrounding the parsing of WARC-Date with a try-catch block, only the record with the invalid date is skipped, and the crawler continues with the next record without a restart.
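To illustrate the idea outside of StormCrawler, here is a minimal standalone sketch. It is not the actual WARCSpout code: the class `WarcDateSketch` and the method `parseWarcDate` are hypothetical names, and using `java.time.Instant.parse` for the WARC-Date is an assumption about how the date gets parsed. It only shows the pattern of catching the unchecked DateTimeParseException so that one bad record is skipped instead of stopping the whole loop.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical illustration only; not part of StormCrawler.
class WarcDateSketch {

    /**
     * Parses a WARC-Date header value (assumed to be an ISO-8601 instant).
     * Returns an empty Optional instead of throwing when the value is invalid.
     */
    static Optional<Instant> parseWarcDate(String value) {
        if (value == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Instant.parse(value.trim()));
        } catch (DateTimeParseException e) {
            // DateTimeParseException is unchecked, so without this catch
            // it would propagate to the caller and end the processing loop.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] warcDates = {
            "2016-02-06T12:31:23Z",    // well-formed date
            "5088968-11-06T12:31:23Z"  // invalid date from the example record
        };
        for (String warcDate : warcDates) {
            Optional<Instant> parsed = parseWarcDate(warcDate);
            if (parsed.isPresent()) {
                System.out.println("emit record dated " + parsed.get());
            } else {
                // Skip only this record and continue with the next one.
                System.out.println("skip record, invalid WARC-Date: " + warcDate);
            }
        }
    }
}
```

Returning an Optional is just one way to signal "skip this record"; logging the exception and continuing inside the catch block would work equally well.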

Example for invalid WARC-Date:

WARC/1.0
WARC-Type: request
WARC-Date: 5088968-11-06T12:31:23Z
WARC-Record-ID: <urn:uuid:57f00bf3-8ff4-406f-afac-81fa42085070>
Content-Length: 218
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:4aea73f4-08ea-4439-bc17-cce6f6ec3226>
WARC-IP-Address: 216.107.198.209
WARC-Target-URI: http://www.cultural.com/robots.txt

GET /robots.txt HTTP/1.0
Host: www.cultural.com
Accept-Encoding: x-gzip, gzip, deflate
User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8



WARC/1.0
WARC-Type: response
WARC-Date: 5088968-11-06T12:31:23Z
WARC-Record-ID: <urn:uuid:78e4cfbc-5a60-4247-8f45-3d5c13ce58e5>
Content-Length: 188
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:4aea73f4-08ea-4439-bc17-cce6f6ec3226>
WARC-Concurrent-To: <urn:uuid:57f00bf3-8ff4-406f-afac-81fa42085070>
WARC-IP-Address: 216.107.198.209
WARC-Target-URI: http://www.cultural.com/robots.txt
WARC-Payload-Digest: sha1:PDT67EMCHALOEGV4266IV6O72I2E5I5X
WARC-Block-Digest: sha1:LCZ46ZOLZUWBBRNWCDKRX2SGL2X55FN2

HTTP/1.0 404 file does not exist
Server: publicfile
Date: Sun, 06 Nov 5088968 12:31:23 GMT
Content-Length: 47
Content-Type: text/html

<html><body>file does not exist</body></html>

michaeldinzinger and others added 3 commits January 5, 2024 20:37
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
@jnioche jnioche added this to the 2.12 milestone Jan 8, 2024
@jnioche jnioche merged commit 93747cd into apache:master Jan 8, 2024
5 checks passed
@jnioche
Contributor

jnioche commented Jan 8, 2024

Great detective work! thanks @michaeldinzinger

@jnioche jnioche modified the milestones: 2.12, 3.0 Mar 28, 2024