In some of the mega WARCs produced by Archive Team, extracting all the WARCs to save just a few is infeasible as it can take at least 2 days to extract them all using warcat.
One might have already checked the CDX files (to find which mega WARC to download) and so know the index (byte offset) and length of the record. If you know these, it's possible to seek directly into the mega WARC and extract the sequence of bytes which make up a particular WARC. For example, given the offset and length from a CDX line, I can handwrite the extraction using dd:
$ dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=2.warc.gz bs=1 && gunzip 2.warc.gz
1326824+0 records in
1326824+0 records out
1326824 bytes (1.3 MB) copied, 14.6218 s, 90.7 kB/s
That is >11,200x faster than extracting everything with warcat and then looking for the file I need.
The downside is needing to mess with dd, being totally inaccessible to non-programmers, being inconvenient in terms of scripting, etc.
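For scripting, the same seek-and-read could be done in a few lines of Python instead of dd. This is only a minimal sketch of the idea, not anything warcat provides: the function names `extract_record` and `extract_record_to_file` are made up for illustration, and it assumes the CDX offset and length describe one complete gzip member inside the mega WARC.

```python
import gzip


def extract_record(megawarc_path, offset, length):
    """Return the raw (still gzip-compressed) bytes of one record.

    Hypothetical helper: offset and length are assumed to come from a CDX line.
    """
    with open(megawarc_path, "rb") as f:
        f.seek(offset)          # jump straight to the record, like dd skip=
        data = f.read(length)   # read exactly the record, like dd count=
    if len(data) != length:
        raise ValueError(
            f"expected {length} bytes at offset {offset}, got {len(data)}"
        )
    return data


def extract_record_to_file(megawarc_path, offset, length, out_path):
    """Write the decompressed record to out_path (the gunzip step)."""
    compressed = extract_record(megawarc_path, offset, length)
    with open(out_path, "wb") as out:
        out.write(gzip.decompress(compressed))
```

With the numbers from the dd example, the call would look like `extract_record_to_file("greader_20130604001315.megawarc.warc.gz", 19810951910, 1326824, "2.warc")`. Because this seeks once and reads the record in a single call rather than copying one byte at a time as `bs=1` does, it should also avoid most of the per-byte overhead seen in the dd run.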
It'd be great if warcat could include some additional arguments to the extract functionality like a pair of --length=n and --index=i flags to provide a nicer interface to pulling out a few warcs.
This would also go very well with HTTP Range support; then you could look up the index/length in a CDX file, seek right to the specific binary sequence on Archive.org, and download only the few MB you need instead of, say, a giant 52 GB megawarc. (You could imagine building an on-demand extraction service using this: store only the master index on your server, and when a user requests a particular file, look up the WARC index/length in the master index, call warcat to extract the specific WARC from the IA-hosted megawarc, and return that to the user. So you don't need to store all 9 TB or whatever.)
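To make the HTTP Range idea concrete, here is a rough sketch using only Python's standard library. The helper names are hypothetical and this is not an existing warcat feature; it also assumes the server hosting the megawarc honours `Range` requests and answers with `206 Partial Content`. The one detail worth noting is that HTTP byte ranges are inclusive at both ends, so the last byte is `offset + length - 1`.

```python
import urllib.request


def byte_range_header(offset, length):
    """Build a Range header value for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP ranges are inclusive, hence the -1 on the end position.
    return f"bytes={offset}-{offset + length - 1}"


def build_range_request(url, offset, length):
    """Prepare (but do not send) a GET request for one record's bytes."""
    return urllib.request.Request(
        url, headers={"Range": byte_range_header(offset, length)}
    )


def fetch_record(url, offset, length):
    """Download only the requested bytes; fail if the server ignores Range."""
    with urllib.request.urlopen(build_range_request(url, offset, length)) as resp:
        if resp.status != 206:
            # A 200 here would mean the server is sending the whole file.
            raise RuntimeError(f"server did not honour Range (status {resp.status})")
        return resp.read()
```

The bytes returned by `fetch_record` would be the same compressed record that the dd command cuts out locally, so they could be gunzipped in the same way. The URL to pass in is whatever location the megawarc is served from; none is assumed here.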