Fastwarc: CLI may index gzipped WARC records with erroneous length 0 #13

Closed
sebastian-nagel opened this issue Oct 4, 2021 · 3 comments
Labels: bug, fastwarc

Comments

@sebastian-nagel
Contributor

The fastwarc command-line tool "index" indexes some records of a gzipped WARC file with an erroneous record length of zero:

$> wget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/09/CC-NEWS-20210930113548-00741.warc.gz

$> fastwarc index -fwarc-type,warc-target-uri,offset,length CC-NEWS-20210930113548-00741.warc.gz \
    | grep -F '"length": "0"'
{"warc-type": "response", "warc-target-uri": "https://www.themarketsdaily.com/2021/09/30/ishares-sp-500-etf-nysearcaivv-sees-strong-trading-volume.html", "offset": "232757027", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.timeturk.com/yasam/baskan-buyukkilic-10-milyon-tl-yatirim-yapilan-yeralti-carsisi-nda-incelemede-bulundu-esnafi-ziyaret-etti/haber-1703634", "offset": "278528237", "length": "0"}
{"warc-type": "response", "warc-target-uri": "https://www.sondakika.com/haber/haber-yayinci-tevfik-rauf-baysal-vefat-etti-14429565/", "offset": "1044381471", "length": "0"}
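The same check can be done without grep by parsing the index output as JSON lines. The following is only a minimal sketch: it assumes the output format shown above (one JSON object per line, with "length" stored as a string), and the helper name zero_length_records is hypothetical, not part of FastWARC.

```python
import json


def zero_length_records(lines):
    """Return the index entries whose "length" field parses to 0.

    `lines` is any iterable of strings, e.g. an open file holding the
    output of `fastwarc index` (assumed: one JSON object per line).
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # The index prints numbers as strings, so convert before comparing.
        if int(record.get("length", -1)) == 0:
            hits.append(record)
    return hits
```

Saving the index output to a file and passing the open file object to this function would list the affected records together with their offsets.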

See also the discussion in #11; however, fewer records are affected here. With the uncompressed WARC file, the error is not reproducible.

@phoerious
Member

phoerious commented Oct 4, 2021

Thanks, I'll check what may be the reason for the remaining records.

> With the uncompressed WARC file, the error is not reproducible.

Yes, that would be very unexpected, since without compression everything is very straightforward. With compressed records, offset calculation is more difficult.
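To illustrate why compressed offsets are harder: in a record-wise gzipped WARC file, each record is its own GZip member, and a record's offset and length refer to that member's position in the compressed stream, which can only be found by decompressing up to the member boundary. The sketch below uses only Python's standard zlib module and is not how FastWARC does it internally; member_offsets is a hypothetical helper name.

```python
import zlib


def member_offsets(data: bytes):
    """Return (offset, compressed_length) for each GZip member in `data`."""
    members = []
    pos = 0
    while pos < len(data):
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data[pos:])
        # Bytes after the end of this member are left in unused_data.
        consumed = len(data) - pos - len(decomp.unused_data)
        members.append((pos, consumed))
        pos += consumed
    return members
```

A streaming reader with an internal buffer has to do the same bookkeeping across buffer refills, which is where a boundary can be skipped unnoticed.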

@phoerious
Member

phoerious commented Oct 4, 2021

Turns out, consume() can already skip over to the next GZip member in some cases. I don't really know when this happens, but it probably has to do with buffer refills at member boundaries. The easiest way to fix this was to simply use the next/previous record for calculating the length, similar to what you did in your first draft.
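The idea of deriving the length from neighbouring records can be sketched as follows. This is an illustration of the arithmetic only, not the actual FastWARC patch: it assumes the record offsets are known, sorted, and contiguous, and the function name lengths_from_offsets is hypothetical.

```python
def lengths_from_offsets(offsets, file_size):
    """Compute each record's length from the offset of the record after it.

    `offsets` are the sorted start offsets of all records in the file;
    the last record is assumed to extend to `file_size`.
    """
    lengths = []
    for i, start in enumerate(offsets):
        # The next record's start marks the end of the current record.
        end = offsets[i + 1] if i + 1 < len(offsets) else file_size
        lengths.append(end - start)
    return lengths
```

Because the length depends only on the two start offsets, it stays correct even if the reader has already moved past the member boundary by the time the length is computed.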

phoerious added the bug and fastwarc labels on Oct 4, 2021
@phoerious
Member

phoerious commented Oct 4, 2021

v0.6.1 is underway: https://github.com/chatnoir-eu/chatnoir-resiliparse/actions/runs/1304376163

I also fixed an LZ4 buffer skip error that could result in incomplete WARC reads, so that warrants another patch release.
