We archive wikis, from Wikipedia to the tiniest wikis
WikiTeam software is a set of tools for archiving wikis. They work on MediaWiki wikis, but we want to expand to other wiki engines. As of December 2014, WikiTeam has preserved more than 25,000 stand-alone wikis, several wikifarms, regular Wikipedia dumps and 34 TB of Wikimedia Commons images.
There are thousands of wikis on the Internet. Every day some of them stop being publicly available and, for lack of backups, are lost forever. Millions of people download tons of media files (movies, music, books, etc.) from the Internet, serving as a kind of distributed backup. Wikis, most of them under free licenses, disappear from time to time because nobody grabbed a copy of them. That is a shame, and it is the problem we would like to solve.
WikiTeam is the Archive Team subcommittee on wikis. It was founded and originally developed by Emilio J. Rodríguez-Posada, a veteran Wikipedia editor and amateur archivist. Many people have helped by sending suggestions, reporting bugs, writing documentation, providing help on the mailing list and making wiki backups. Thanks to all, especially to: Federico Leva, Alex Buie, Scott Boyd, Hydriz, Platonides, Ian McEwen, Mike Dupont, balr0g and PiRSquared17.
Confirm you satisfy the requirements:
pip install --upgrade -r requirements.txt
or, if you don't have enough permissions for the above,
pip install --user --upgrade -r requirements.txt
Download any wiki
To download any wiki, use one of the following options:
python dumpgenerator.py http://wiki.domain.org --xml --images (complete XML histories and images)
If the script can't find the API and/or index.php paths by itself, you can provide them:
python dumpgenerator.py --api=http://wiki.domain.org/w/api.php --xml --images
python dumpgenerator.py --api=http://wiki.domain.org/w/api.php --index=http://wiki.domain.org/w/index.php --xml --images
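When the paths have to be supplied by hand, it can help to list a few likely locations and try them in turn. The sketch below is only an illustration: candidate_endpoints and COMMON_PREFIXES are hypothetical names, not part of dumpgenerator.py, and the prefixes are an assumption about common MediaWiki layouts rather than the script's own detection logic.

```python
# Hypothetical helper, NOT part of dumpgenerator.py.
# It only builds URLs to try by hand; it does not contact the wiki.

# Script-path prefixes often seen on MediaWiki installs.
# This list is an assumption and is not exhaustive.
COMMON_PREFIXES = ["", "/w", "/wiki"]


def candidate_endpoints(base_url):
    """Return (api_url, index_url) pairs to try, in order."""
    root = base_url.rstrip("/")
    return [
        (root + prefix + "/api.php", root + prefix + "/index.php")
        for prefix in COMMON_PREFIXES
    ]


if __name__ == "__main__":
    # Print one ready-to-paste command per candidate pair.
    for api, index in candidate_endpoints("http://wiki.domain.org/"):
        print("python dumpgenerator.py --api=%s --index=%s --xml --images"
              % (api, index))
```

Opening each printed api.php URL in a browser is a quick way to see which candidate actually exists before starting a long download.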
If you only want the XML histories, just use --xml. For only the images, just --images. For only the current version of every page, see the options listed by python dumpgenerator.py --help.
You can resume an aborted download:
python dumpgenerator.py --api=http://wiki.domain.org/w/api.php --xml --images --resume --path=/path/to/incomplete-dump
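If you archive several wikis, the commands above can be assembled from a small wrapper. This is a minimal sketch, assuming dumpgenerator.py sits in the current directory and that python on your PATH is the interpreter it needs; build_dump_command and run_dump are hypothetical names, and only the flags shown above are used.

```python
import subprocess


def build_dump_command(api, index=None, resume_path=None):
    """Build the argument list for one dumpgenerator.py run (XML + images)."""
    cmd = ["python", "dumpgenerator.py", "--api=" + api]
    if index:
        cmd.append("--index=" + index)
    cmd += ["--xml", "--images"]
    if resume_path:
        # As in the resume command above, --resume is paired with --path,
        # which points at the directory of the incomplete dump.
        cmd += ["--resume", "--path=" + resume_path]
    return cmd


def run_dump(api, index=None, resume_path=None):
    """Run dumpgenerator.py and return its exit code."""
    return subprocess.call(build_dump_command(api, index, resume_path))
```

For example, run_dump("http://wiki.domain.org/w/api.php", resume_path="/path/to/incomplete-dump") runs the same command as the resume example above.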
See more options:
python dumpgenerator.py --help
Download Wikimedia dumps
To download Wikimedia XML dumps (Wikipedia, Wikibooks, Wikinews, etc) you can run:
python wikipediadownloader.py (download all projects)
See more options:
python wikipediadownloader.py --help
Download Wikimedia Commons images
There is a script for this, but we have uploaded the tarballs to Internet Archive, so it's more useful to reseed their torrents than to re-generate old ones with the script.
You can run the tests easily with the tox command. It is probably already available on your operating system; you need version 1.6. If it is not, you can install it from PyPI with:
pip install tox
$ tox
py27 runtests: commands | nosetests --nocapture --nologcapture
Checking http://wiki.annotation.jp/api.php
Trying to parse かずさアノテーション - ソーシャル・ゲノム・アノテーション.jpg from API
Retrieving image filenames
. Found 266 images
.
-------------------------------------------
Ran 1 test in 2.253s
OK
_________________ summary _________________
py27: commands succeeded
congratulations :)
$