WikiHist.html: English Wikipedia's Full Revision History Parsed to HTML

This repository contains all code related to the WikiHist.html data release.

The dataset itself is not described here. For a description of the dataset, please refer to the dataset website and the accompanying paper.

Quick start: Data download

As described on the dataset website, the HTML revision history is hosted at the Internet Archive. To facilitate downloading, we provide two ways to get the data:

  1. a torrent-based solution (recommended ✳️)
  2. a Python script in the downloading_scripts directory of this repo. Note: Using the script, you can download either all data or only revisions for specific Wikipedia articles.

Option 1: Torrent-based solution

Pros: fast, automatic retry and restore

Cons: intended only for full download

This method is the recommended way to download the full dataset. If you are interested in a partial download (i.e., only some articles), please consider Option 2 instead. This solution requires the command-line utility Aria2, available at https://aria2.github.io/.

Once the repository is cloned, the download requires two steps:

  1. Download the utility aria2c from its GitHub repository.
  2. Run the script download.sh in the folder TorrentDownload.

This script starts the download of the torrent files listed in files_list.txt. The parameters in download.sh can be adapted to your connection; please refer to the Aria2 documentation (aria2c -h and the online manual).

By default, the script uses 16 parallel connections and saves the downloaded dataset in the folder WikiHist_html.
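
For illustration only, the sketch below (in Python, via subprocess) shows the kind of aria2c invocation that download.sh performs, assuming files_list.txt contains one torrent path or URL per line; the actual flags and output folder are defined in download.sh itself.

```python
import subprocess

# Illustrative sketch only: the real parameters (connection count, output folder,
# retry behaviour) live in TorrentDownload/download.sh and should be tuned there.
with open("files_list.txt") as f:
    torrents = [line.strip() for line in f if line.strip()]

for torrent in torrents:
    subprocess.run(
        [
            "aria2c",
            "--dir=WikiHist_html",  # default output folder used by the script
            "--seed-time=0",        # stop seeding as soon as the download finishes
            torrent,
        ],
        check=True,
    )
```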

Option 2: Custom script

Pros: allows partial download, e.g., to get revisions for certain articles only

Cons: slower; the number of retries is bounded

This solution supports both a full download and a partial download of the dataset, with articles selected by title or ID.

Install dependencies

The scripts require the internetarchive and wget Python packages, which you first need to install:

pip install internetarchive
pip install wget

Download the full dataset of all historical HTML revisions (7TB)

To download the full dataset, go to the downloading_scripts directory and run

python download_whole_dataset.py

Caveat emptor: the dataset is 7TB in size, so make sure you have enough disk space before starting the download. Given the size, downloading the data will take a while.
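
Under the hood, the download scripts rely on the internetarchive package. Purely as an illustrative sketch (the item identifier below is a placeholder, not a real WikiHist.html item; download_whole_dataset.py resolves the actual items), a single Internet Archive item can be fetched like this:

```python
from internetarchive import download

# Placeholder identifier for illustration only; the real item identifiers are
# determined by download_whole_dataset.py.
download(
    "enwiki-html-example-item",
    destdir="downloaded_data",  # local directory to store the files
    verbose=True,
)
```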

Download historical HTML revisions for specific Wikipedia articles only

If, rather than downloading the full dataset, you want to download revisions for specific Wikipedia articles only, proceed as follows:

  1. Go to the downloading_scripts directory.
  2. In the file titles_to_download.txt, list the titles of the articles whose HTML revision history you would like to download (see the example after this list).
  3. Run python download_subset.py.
  4. The script requires some metadata. If you don't have it yet, you will be asked if the script should download it. Type Yes.
  5. When asked about the search mode, type page_title or page_id. (If you choose page_id then titles_to_download.txt should contain page ids, rather than page titles.)
  6. When prompted, provide the path to titles_to_download.txt.
  7. The data will be saved in the downloaded_data directory.
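
For example, a titles_to_download.txt for step 2 might look like the following (the article titles are arbitrary examples, and one title per line is assumed):

```
Albert Einstein
Douglas Adams
Data compression
```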

Data extraction pipeline

Most users will need only the above scripts for downloading the ready-made WikiHist.html dataset. The remainder of this README refers to the code for producing the dataset from scratch.

Code structure

The scripts are divided into 7 directories:

  • 1_downloading_wiki_dump
  • 2_extracting_templates
  • 3_create_mysql_db
  • 4_docker_containers
  • 5_server_scripts
  • 6_dealing_with_failed
  • 7_uploading_to_IA

Each directory represents one step in the process of converting wikitext to HTML, from downloading the raw wikitext dump and extracting the templates all the way to uploading the data to the Internet Archive. Each step's directory contains a README with details about that step.

Dependencies

The following libraries are required:

Debugging the pipeline

To run the pipeline on a small sample (mostly for debugging), run

bash quick_run.sh

The above script processes the small input file data/sample.xml. Note that, even in this setting, the script needs 111GB of free disk space, as it downloads an 11GB MySQL database that decompresses to 100GB.

If the processing completes successfully, a directory data/results/sample.xml/_SUCCESS will be created, and the resulting JSON files will be placed in a directory data/results/sample.xml.
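
As a quick sanity check (a sketch only; it assumes the results are JSON Lines files with one revision per line, which may need adjusting to the actual output layout), you can verify the success marker and peek at the keys of the first record:

```python
import glob
import json
import os

result_dir = "data/results/sample.xml"

# The pipeline creates a _SUCCESS marker in the results directory on completion.
if not os.path.exists(os.path.join(result_dir, "_SUCCESS")):
    raise SystemExit("Pipeline did not complete successfully")

# Assumption: results are JSON Lines files (one JSON object per line); adjust the
# glob pattern if the output files are compressed or named differently.
first_file = sorted(glob.glob(os.path.join(result_dir, "*.json")))[0]
with open(first_file) as f:
    record = json.loads(f.readline())
print(sorted(record.keys()))
```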

License

Attribution 3.0 Unported (CC BY 3.0)
