Thanks, I'll check what may be the reason for the remaining records.
With the uncompressed WARC file, the error is not reproducible.
Yes, that would be very unexpected, since without compression everything is very straightforward. With compressed records, offset calculation is more difficult.
Turns out, consume() can already skip over to the next GZip member in some cases. I don't really know when this happens, but it probably has to do with buffer refills at member boundaries. The easiest way to fix this was to simply use the offsets of the next/previous record for calculating the length, similar to what you did in your first draft.
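To illustrate the idea of deriving a record's compressed length from neighbouring offsets, here is a minimal sketch in plain Python using only the standard `gzip` and `zlib` modules. It is not FastWARC's implementation; the helper names `member_offsets` and `member_lengths` are hypothetical, and the byte strings merely stand in for WARC records.

```python
import gzip
import zlib


def member_offsets(data):
    """Return the start offset of every gzip member in `data` (bytes)."""
    offsets = []
    pos = 0
    while pos < len(data):
        offsets.append(pos)
        # wbits=31 makes zlib expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data[pos:])
        # unused_data holds everything after the end of this member,
        # so the next member starts where the unused bytes begin.
        pos = len(data) - len(decomp.unused_data)
    return offsets


def member_lengths(offsets, total_size):
    """Length of each member = next member's offset - its own offset.

    The last member is bounded by the total size of the file.
    """
    bounds = list(offsets) + [total_size]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Stand-in payloads; each one is compressed as its own gzip member.
records = [b"a" * 10, b"b" * 200, b"c"]
members = [gzip.compress(r) for r in records]
data = b"".join(members)

offsets = member_offsets(data)
lengths = member_lengths(offsets, len(data))
```

Because every length is the difference between two distinct member start offsets, it cannot come out as zero, regardless of how far a reader has already advanced into the next member when the length is computed.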
The fastwarc command-line tool "index" indexes some records of a gzipped WARC file with an erroneous zero record length:
See also the discussion in #11; however, fewer records are affected here. With the uncompressed WARC file, the error is not reproducible.