Perpetual Wiki is a program used to download the most current Wikipedia database dumps from http://wikipedia.c3sl.ufpr.br, generate metadata files, and ingest the database dumps and metadata into http://archive.org.
Perpetual Wiki downloads the most recent database dumps for every wiki listed
on http://wikipedia.c3sl.ufpr.br. It creates a
new directory for each item, as well as two metadata files. Below is an example
of what the contents of a single directory for the aawiki might look like (in a
directory named aawiki-20111010):
aawiki-20111010-pages-meta-history.xml.7z
aawiki-20111010-pages-meta-history.xml.bz2
aawiki-20111010-pages-logging.xml.gz
aawiki-20111010-pages-meta-current.xml.bz2
aawiki-20111010-pages-articles.xml.bz2
aawiki-20111010-stub-meta-history.xml.gz
aawiki-20111010-stub-meta-current.xml.gz
aawiki-20111010-stub-articles.xml.gz
aawiki-20111010-abstract.xml
aawiki-20111010-all-titles-in-ns0.gz
aawiki-20111010-iwlinks.sql.gz
aawiki-20111010-redirect.sql.gz
aawiki-20111010-protected_titles.sql.gz
aawiki-20111010-page_props.sql.gz
aawiki-20111010-page_restrictions.sql.gz
aawiki-20111010-page.sql.gz
aawiki-20111010-category.sql.gz
aawiki-20111010-user_groups.sql.gz
aawiki-20111010-interwiki.sql.gz
aawiki-20111010-langlinks.sql.gz
aawiki-20111010-externallinks.sql.gz
aawiki-20111010-templatelinks.sql.gz
aawiki-20111010-imagelinks.sql.gz
aawiki-20111010-categorylinks.sql.gz
aawiki-20111010-pagelinks.sql.gz
aawiki-20111010-oldimage.sql.gz
aawiki-20111010-image.sql.gz
aawiki-20111010-site_stats.sql.gz
aawiki-20111010_meta.xml
aawiki-20111010_files.xml
The last two files, aawiki-20111010_meta.xml and aawiki-20111010_files.xml, are
generated by wiki-d.py. The _meta.xml file is a metadata file for the entire
dump. The _files.xml file is a stub file used for archive.org purposes.
Below is an example of what the _meta.xml file aawiki-20111010_meta.xml might
look like for the aawiki:
<?xml version="1.0" encoding="utf-8"?>
<metadata>
  <description>Retrieved from wikipedia.org on 2011-12-01</description>
  <creator>Wikipedia</creator>
  <mediatype>web</mediatype>
  <collection>wikipediadumps</collection>
  <licenseurl>http://creativecommons.org/licenses/by-nc-sa/3.0/us/</licenseurl>
  <date>20110901</date>
  <identifier>aawiki-20110901</identifier>
  <uploader>jake@archive.org</uploader>
  <collection>web</collection>
  <title>aawiki-20110901</title>
</metadata>
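As a rough illustration of how a file like this could be produced, the sketch below builds the same set of fields with Python's standard xml.etree.ElementTree module. The function name build_meta_xml and its parameters are hypothetical and are not taken from wiki-d.py; the actual script may assemble the file differently.

```python
import xml.etree.ElementTree as ET


def build_meta_xml(identifier, date, retrieved_on, uploader):
    """Return the text of a _meta.xml document for one dump.

    Hypothetical helper for illustration; not part of wiki-d.py.
    """
    root = ET.Element("metadata")
    # Field order and values mirror the example _meta.xml above.
    fields = [
        ("description", "Retrieved from wikipedia.org on %s" % retrieved_on),
        ("creator", "Wikipedia"),
        ("mediatype", "web"),
        ("collection", "wikipediadumps"),
        ("licenseurl", "http://creativecommons.org/licenses/by-nc-sa/3.0/us/"),
        ("date", date),
        ("identifier", identifier),
        ("uploader", uploader),
        ("collection", "web"),
        ("title", identifier),
    ]
    for tag, text in fields:
        ET.SubElement(root, tag).text = text
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="utf-8"?>\n' + body


if __name__ == "__main__":
    print(build_meta_xml("aawiki-20110901", "20110901",
                         "2011-12-01", "jake@archive.org"))
```

The output is written on a single line rather than one element per line; the content is otherwise the same as the example.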
The program can easily be modified to suit general, non-archive.org needs.
Simply ignore the perpetual-wiki.php file and use wiki-d.py. A few unnecessary
archive.org files will still be generated, but the script will function properly.
To prevent wiki-d.py from generating the unnecessary files, simply remove
the following code from the beginning of the main function in wiki-d.py:
''' <Perpetual Loop Auto-submit business> '''
list_home = os.getcwd()
readyListFileName = "ready_list.txt"
lockFileName = readyListFileName + ".lck"
### Exit if last list still pending, wait for it to be renamed/removed.
if os.access( readyListFileName, os.F_OK ) is True:
    print ( 'ABORT: %s exists (Not picked up yet? Should be renamed'
            'when retrieved by auto_submit loop!)' % readyListFileName )
    if os.access( lockFileName, os.F_OK ) is True:
        os.remove(lockFileName)
    exit(0)
### If lock file exists, another process is already generating the list
if os.access( lockFileName, os.F_OK ) is True:
    print ( 'ABORT: %s lockfile exists (Another process generating list'
            'already? Should be deleted when complete!)' % lockFileName )
    exit(0)
### Touch a lock and list file.
touchLi = open(readyListFileName,'wb')
touchLi.write('')
touchLi.close()
touchLo = open(lockFileName, 'wb')
touchLo.write('')
touchLo.close()
''' <Peprpetual Loop Auto-submit business /> '''
Also remove the following lines towards the beginning of the makeMeta function:
f = open("%s_files.xml" % identifier, 'wb')
f.write('<files/>')
f.close()
And the following line towards the end of the main function:
os.remove(lockFileName)
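For readers who would rather keep the archive.org workflow, the handshake performed by the code listed above can be summarized as two small functions. This is only a sketch of the same logic; the names acquire_list_lock and release_list_lock are hypothetical and do not appear in wiki-d.py, and the sketch returns None where the original calls exit(0).

```python
import os


def acquire_list_lock(ready_list="ready_list.txt"):
    """Return the lock file path if this run may proceed, otherwise None.

    Hypothetical helper mirroring the checks at the start of main.
    """
    lock = ready_list + ".lck"
    if os.path.exists(ready_list):
        # The previous list has not been picked up yet: clear any
        # leftover lock file and refuse to start a new run.
        if os.path.exists(lock):
            os.remove(lock)
        return None
    if os.path.exists(lock):
        # Another process is already generating the list.
        return None
    # Touch (create empty) both the list file and the lock file.
    for path in (ready_list, lock):
        open(path, "wb").close()
    return lock


def release_list_lock(lock):
    """Remove the lock file once the run is complete."""
    os.remove(lock)
```

A run would call acquire_list_lock() first, stop if it returns None, and call release_list_lock() on the returned path when finished, which corresponds to the os.remove(lockFileName) line at the end of main.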