This repository extracts archived content from the Wayback Machine and converts collected metadata and downloaded snapshot files into compressed WARC files. The project currently supports three primary modes of operation: downloading snapshots from the Internet Archive, combining/cleaning CSV metadata produced by Wayback backup tools, and converting that metadata + downloaded files into a WARC-GZ archive.
- Download mode: reads a CSV of Internet Archive (Wayback) URLs, determines snapshot ranges, and uses `pywaybackup` to download snapshots (the downloads are stored in the local `waybackup_snapshots/` folder by default).
- Convert mode: combines CSV files (from a directory) into a single CSV and then converts that CSV into a compressed WARC (`.warc.gz`) using `warcio`.
- Full mode: runs download, then combine and convert, to produce a WARC in one run.
Install the Python dependencies from the repository requirements.txt:
pip install -r requirements.txt
Notable packages used:
- pywaybackup — downloads Wayback snapshots
- pandas — CSV handling and merging when combining multiple CSVs
- warcio — writing WARC records
See requirements.txt for the exact pinned versions used in this repository.
- `src/main.py`: command-line entry point that exposes the `download`, `convert`, and `full` modes.
- `src/internet_archive_downloader.py`: logic that reads an input CSV of Internet Archive URLs and runs `pywaybackup` to download snapshots.
- `src/waybackup_to_warc.py`: functions to combine CSV files, clean URLs (remove `:80`), and produce a `.warc.gz` from a CSV of records.
- `resources/`: example CSVs (e.g. `curated_urls.csv`) useful for quick testing.
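The URL cleaning step (removing `:80`) can be illustrated with a small standard-library sketch. The function name `strip_default_port` is hypothetical, and the real implementation in `src/waybackup_to_warc.py` may work differently (for example with a plain string replacement); this only shows the idea.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_default_port(url: str) -> str:
    """Return the URL with an explicit ':80' port removed from the host part.

    Hypothetical helper for illustration; not taken from the repository.
    """
    parts = urlsplit(url)
    netloc = parts.netloc
    # Only touch the network location, so ':80' inside a path or query is kept.
    if netloc.endswith(":80"):
        netloc = netloc[: -len(":80")]
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(strip_default_port("http://example.com:80/page?x=1"))
```

Restricting the change to the host part avoids rewriting URLs whose path or query happens to contain the text `:80`.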
Usage pattern for the main runner (src/main.py):
python src/main.py <mode> <input> [--output OUTPUT] [--column_name COLUMN] [--period DAY|WEEK] [--reset]
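As a rough illustration of the usage pattern above, the arguments could be declared with `argparse` along these lines. This is a sketch, not the actual contents of `src/main.py`; the function name `build_parser` and the help strings are assumptions, while the argument names and defaults follow the descriptions in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Download Wayback snapshots and convert them to WARC.",
    )
    parser.add_argument("mode", choices=["download", "convert", "full"])
    parser.add_argument("input", help="CSV file (download/full) or CSV directory (convert)")
    parser.add_argument("--output", default=None, help="WARC name without extension")
    parser.add_argument("--column_name", default="Internet_Archive_URL")
    parser.add_argument("--period", choices=["DAY", "WEEK"], default="DAY")
    parser.add_argument("--reset", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```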
Modes and example usage:
- Download mode: download snapshots listed in a CSV
  - Description: Reads a CSV containing full Wayback URLs such as `https://web.archive.org/web/20251002062751/https://example.com/page` and downloads snapshots for a small period around the archived date.
  - Required `input`: path to the CSV file to read (e.g. `resources/curated_urls.csv`). The default column name expected is `Internet_Archive_URL`.
  - Example: `python src/main.py download resources/curated_urls.csv --column_name Internet_Archive_URL --period DAY`
  - Flags:
    - `--period`: `DAY` (default) or `WEEK`. Controls whether the downloader fetches snapshots ±1 day or ±1 week around the archived date.
    - `--reset`: if present, passes `reset=True` to `pywaybackup` (useful to force re-download).
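To make the "period around the archived date" concrete, here is a minimal sketch of how a Wayback URL might be split into its timestamp and original URL, and how a ±1 day or ±1 week window could be derived. The helper names `parse_wayback_url` and `snapshot_window` are hypothetical, and the actual range logic in `src/internet_archive_downloader.py` may differ.

```python
import re
from datetime import datetime, timedelta

# Matches https://web.archive.org/web/<14-digit timestamp>[modifier]/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def parse_wayback_url(url: str) -> tuple[datetime, str]:
    """Split a full Wayback URL into (archived datetime, original URL)."""
    match = WAYBACK_RE.match(url.strip())
    if match is None:
        raise ValueError(f"Not a full Wayback URL: {url!r}")
    return datetime.strptime(match.group(1), TIMESTAMP_FORMAT), match.group(2)


def snapshot_window(archived: datetime, period: str = "DAY") -> tuple[str, str]:
    """Return (start, end) timestamps one day or one week either side of the date."""
    delta = {"DAY": timedelta(days=1), "WEEK": timedelta(weeks=1)}[period]
    return (
        (archived - delta).strftime(TIMESTAMP_FORMAT),
        (archived + delta).strftime(TIMESTAMP_FORMAT),
    )


if __name__ == "__main__":
    ts, origin = parse_wayback_url(
        "https://web.archive.org/web/20251002062751/https://example.com/page"
    )
    print(origin, snapshot_window(ts, "DAY"))
```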
- Convert mode: combine CSVs and produce a WARC
  - Description: Combines all `.csv` files from the specified directory into a single CSV (written to `combined_output.csv` by default) and converts that CSV to a WARC-GZ.
  - Required `input`: path to a directory that contains CSV files to combine (e.g. `waybackup_snapshots/` or any folder with CSV exports).
  - `--output` should be provided to name the resulting WARC file (the code will append `.warc.gz`).
  - Example: `python src/main.py convert waybackup_snapshots --output mysite_archive`
  - Notes: The script combines CSV files using `pandas.concat` and writes the combined CSV to `combined_output.csv` (value of `COMBINED_CSV_PATH`). The combined CSV is then read and converted into `output/<output>.warc.gz`.
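The combining step can be sketched with `pandas.concat` as follows. This is an approximation under stated assumptions: the function name `combine_csvs` is hypothetical, only top-level `*.csv` files are read, and all columns are loaded as strings so that 14-digit timestamps are not reinterpreted as numbers. The repository's own function may make different choices.

```python
from pathlib import Path

import pandas as pd


def combine_csvs(directory, combined_path="combined_output.csv"):
    """Concatenate every CSV in `directory` and write the result to `combined_path`."""
    csv_files = sorted(Path(directory).glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {directory}")
    # dtype=str keeps timestamps and response codes exactly as written.
    frames = [pd.read_csv(path, dtype=str) for path in csv_files]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(combined_path, index=False)
    return combined
```

If the combined file is written into the same directory it reads from, a second run would pick it up as an input, so writing it elsewhere (or excluding it from the glob) is worth considering.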
- Full mode: download then convert
  - Description: Downloads snapshots from the input CSV, then combines CSVs (from `waybackup_snapshots`) and converts them into a WARC.
  - Example: `python src/main.py full resources/curated_urls.csv --output combined_site_archive`
- Default combined CSV file path: `combined_output.csv` (the module-level `COMBINED_CSV_PATH` in `src/waybackup_to_warc.py`).
- The CSVs read by the converter are expected to contain columns like `url_origin`, `url_archive`, `file`, `timestamp`, and `response` (see `src/waybackup_to_warc.py` for the required field names used when creating WARC records).
- The converter skips entries whose `file` path does not exist and prints a warning. It also emits simple 404/500 WARC entries when those response codes are encountered.
- Create or obtain a CSV of Wayback URLs (column name `Internet_Archive_URL`), e.g. `resources/curated_urls.csv`.
- Download snapshots for those URLs: `python src/main.py download resources/curated_urls.csv --column_name Internet_Archive_URL`
  This writes per-site CSVs and downloaded files into `waybackup_snapshots/` (and related subfolders) using `pywaybackup`.
- Combine CSVs and convert to WARC: `python src/main.py convert waybackup_snapshots --output archived_site`
- The resulting WARC will be written to `output/archived_site.warc.gz`.
- If `--output` is not provided for `convert`/`full`, the conversion step may attempt to use a `None` filename. Always provide `--output` when converting.
- If the script can't find expected CSV columns, inspect the CSV(s) created by `pywaybackup` and ensure the required column names (`file`, `timestamp`, `response`, `url_origin`) are present.
- If downloads fail, try rerunning with `--reset` to force re-downloads.
- Add argument validation to require `--output` for `convert` and `full` modes.
- Add unit tests for CSV combining and WARC creation edge cases (missing files, bad timestamps).
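One possible shape for the `--output` validation is a post-parse check that calls `parser.error`, which prints a usage message and exits with status 2. This is only a sketch of the proposed improvement; the function name `parse_args` and the error text are assumptions, and only the arguments relevant to the check are declared.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and reject convert/full runs that lack --output."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("mode", choices=["download", "convert", "full"])
    parser.add_argument("input")
    parser.add_argument("--output", default=None)
    args = parser.parse_args(argv)
    # download does not write a WARC, so --output stays optional there.
    if args.mode in ("convert", "full") and not args.output:
        parser.error("--output is required for convert and full modes")
    return args
```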