If the crawl topology dies or is killed, the WARC file is not properly closed. Decompressing the WARC file then fails with an error: gzip: CC-NEWS-20160926233041-00001.warc.gz: unexpected end of file
It looks like the cleanup() method in WARCHdfsBolt/GzipHdfsBolt is not called when a topology is killed. The last record in the WARC file is truncated and, compared with worker.log, more than 200 records are missing from the WARC file.
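Until the file is closed properly, the records that did reach the file can probably still be salvaged. The Python sketch below is not part of the crawler; it assumes each WARC record was written as its own gzip member (the usual record-at-a-time layout for .warc.gz) and that the whole file fits in memory. The function name `read_complete_members` is made up for this illustration.

```python
import zlib


def read_complete_members(data: bytes) -> list:
    """Return the payload of every complete gzip member in `data`.

    A truncated or corrupt trailing member is dropped instead of
    raising, so the intact records in front of it stay readable.
    """
    members = []
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        try:
            payload = decomp.decompress(data)
        except zlib.error:
            break  # corrupt bytes: keep what was recovered so far
        if not decomp.eof:
            break  # member ends before its trailer: truncated record
        members.append(payload)
        # Bytes after the finished member belong to the next member.
        data = decomp.unused_data
    return members
```

Calling it on the bytes of the broken file returns only the complete records; the truncated last record, and anything that was never flushed, is still lost.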
Exactly. There could be a way of improving things, though, by making better use of the sync policy (i.e. when we mark/fail the tuples)? A lower value would mean that we flush more often.
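To make the trade-off concrete, here is a toy Python model of a count-based sync policy. It is only a sketch: `CountSyncWriter` and `records_lost_on_kill` are hypothetical names, not the bolt's API, and the model ignores compression buffers and HDFS behaviour. It just shows that whatever was written since the last sync is what a kill would lose.

```python
class CountSyncWriter:
    """Toy writer that makes records durable every `sync_every` writes."""

    def __init__(self, sync_every: int):
        self.sync_every = sync_every
        self.durable = []   # records that survived a sync
        self.pending = []   # records buffered since the last sync

    def write(self, record) -> None:
        self.pending.append(record)
        if len(self.pending) >= self.sync_every:
            self.sync()

    def sync(self) -> None:
        # Move everything buffered so far into the durable list.
        self.durable.extend(self.pending)
        self.pending.clear()


def records_lost_on_kill(total_written: int, sync_every: int) -> int:
    """Count records still pending if the process is killed right now."""
    writer = CountSyncWriter(sync_every)
    for i in range(total_written):
        writer.write(i)
    return len(writer.pending)
```

In this model the loss on a kill is always smaller than the sync count, so lowering it bounds the damage at the cost of more frequent flushes.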