Perpetually Archive Wikipedia

Perpetual Wiki

Perpetual Wiki is a program used to download the most current Wikipedia database dumps from http://wikipedia.c3sl.ufpr.br, generate metadata files, and ingest the database dumps and metadata into http://archive.org.

Item Structure

Perpetual Wiki downloads the most recent database dumps for every wiki listed on http://wikipedia.c3sl.ufpr.br. It creates a new directory for each item and adds two metadata files to it. Below is an example of what the contents of a single directory for the aawiki might look like (in a directory named aawiki-20111010):

 aawiki-20111010-pages-meta-history.xml.7z
 aawiki-20111010-pages-meta-history.xml.bz2
 aawiki-20111010-pages-logging.xml.gz
 aawiki-20111010-pages-meta-current.xml.bz2
 aawiki-20111010-pages-articles.xml.bz2
 aawiki-20111010-stub-meta-history.xml.gz
 aawiki-20111010-stub-meta-current.xml.gz
 aawiki-20111010-stub-articles.xml.gz
 aawiki-20111010-abstract.xml
 aawiki-20111010-all-titles-in-ns0.gz
 aawiki-20111010-iwlinks.sql.gz
 aawiki-20111010-redirect.sql.gz
 aawiki-20111010-protected_titles.sql.gz
 aawiki-20111010-page_props.sql.gz
 aawiki-20111010-page_restrictions.sql.gz
 aawiki-20111010-page.sql.gz
 aawiki-20111010-category.sql.gz
 aawiki-20111010-user_groups.sql.gz
 aawiki-20111010-interwiki.sql.gz
 aawiki-20111010-langlinks.sql.gz
 aawiki-20111010-externallinks.sql.gz
 aawiki-20111010-templatelinks.sql.gz
 aawiki-20111010-imagelinks.sql.gz
 aawiki-20111010-categorylinks.sql.gz
 aawiki-20111010-pagelinks.sql.gz
 aawiki-20111010-oldimage.sql.gz
 aawiki-20111010-image.sql.gz
 aawiki-20111010-site_stats.sql.gz
 aawiki-20111010_meta.xml
 aawiki-20111010_files.xml

The last two files, aawiki-20111010_meta.xml and aawiki-20111010_files.xml, are generated by wiki-d.py. The _meta.xml file holds the metadata for the entire dump. The _files.xml file is an empty stub (its only content is <files/>) used by archive.org.
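The naming scheme above can be sketched in a few lines of Python. This is only an illustration, not code from wiki-d.py: the helper name item_paths and the regular expression are assumptions based on the aawiki-20111010 example, where the identifier is the wiki name, a hyphen, and an eight-digit dump date.

```python
import re

# Hypothetical helper, not part of wiki-d.py.
# Assumes identifiers look like "<wiki>-<YYYYMMDD>", as in aawiki-20111010.
IDENTIFIER_RE = re.compile(r"^(?P<wiki>[a-z0-9_]+)-(?P<date>\d{8})$")


def item_paths(identifier):
    """Split an item identifier and derive its two metadata file names."""
    match = IDENTIFIER_RE.match(identifier)
    if match is None:
        raise ValueError("unexpected identifier: %r" % identifier)
    return {
        "wiki": match.group("wiki"),
        "date": match.group("date"),
        # The two files generated alongside the downloaded dump files.
        "meta": "%s_meta.xml" % identifier,
        "files": "%s_files.xml" % identifier,
    }
```

Calling item_paths("aawiki-20111010") yields the wiki name, the dump date, and the two metadata file names listed at the end of the directory above.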

Metadata files

Below is an example of what the _meta.xml file aawiki-20111010_meta.xml might look like for the aawiki:

 <?xml version="1.0" encoding="utf-8"?>
 <metadata>
   <description>Retrieved from wikipedia.org on 2011-12-01</description>
   <creator>Wikipedia</creator>
   <mediatype>web</mediatype>
   <collection>wikipediadumps</collection>
   <licenseurl>http://creativecommons.org/licenses/by-nc-sa/3.0/us/</licenseurl>
   <date>20111010</date>
   <identifier>aawiki-20111010</identifier>
   <uploader>jake@archive.org</uploader>
   <collection>web</collection>
   <title>aawiki-20111010</title>
 </metadata>
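A file like this could be produced with Python's standard xml.etree.ElementTree module. The sketch below is an assumption about one way to do it, not the actual makeMeta function from wiki-d.py; the function name build_meta_xml and its parameters are hypothetical, and the fixed values (creator, mediatype, license URL) are copied from the example above.

```python
import xml.etree.ElementTree as ET

# Copied from the example _meta.xml above.
LICENSE_URL = "http://creativecommons.org/licenses/by-nc-sa/3.0/us/"


def build_meta_xml(identifier, retrieved_on, uploader, collections):
    """Return the bytes of a _meta.xml document for one dump item.

    Hypothetical helper: assumes the identifier ends in "-<YYYYMMDD>".
    """
    dump_date = identifier.rpartition("-")[2]
    root = ET.Element("metadata")

    def add(tag, text):
        ET.SubElement(root, tag).text = text

    add("description", "Retrieved from wikipedia.org on %s" % retrieved_on)
    add("creator", "Wikipedia")
    add("mediatype", "web")
    add("licenseurl", LICENSE_URL)
    add("date", dump_date)
    add("identifier", identifier)
    add("uploader", uploader)
    add("title", identifier)
    # An item may belong to more than one collection.
    for name in collections:
        add("collection", name)

    # tostring() does not emit a declaration for utf-8, so prepend one.
    declaration = b'<?xml version="1.0" encoding="utf-8"?>\n'
    return declaration + ET.tostring(root, encoding="utf-8")
```

The returned bytes can be written to "%s_meta.xml" % identifier in binary mode. Element order here differs slightly from the example, which should not matter to an XML parser.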

Non Internet Archive Use

The program can easily be modified to suit general, non-archive.org needs. Simply ignore the perpetual-wiki.php file and use wiki-d.py on its own. A few unnecessary archive.org files will still be generated, but the script will function properly.

To prevent wiki-d.py from generating the unnecessary files, remove the following code from the beginning of the main function in wiki-d.py:

 ''' <Perpetual Loop Auto-submit business> '''
 list_home = os.getcwd()
 readyListFileName = "ready_list.txt"
 lockFileName = readyListFileName + ".lck"
 ### Exit if last list still pending, wait for it to be renamed/removed.
 if os.access( readyListFileName, os.F_OK ) is True:
     print ( 'ABORT: %s exists (Not picked up yet? Should be renamed'
             'when retrieved by auto_submit loop!)' % readyListFileName )
     if os.access( lockFileName, os.F_OK ) is True:
         os.remove(lockFileName)
     exit(0)
 ### If lock file exists, another process is already generating the list
 if os.access( lockFileName, os.F_OK ) is True:
     print ( 'ABORT: %s lockfile exists (Another process generating list'
             'already? Should be deleted when complete!)' % lockFileName )
     exit(0)
 ### Touch a lock and list file.
 touchLi = open(readyListFileName,'wb')
 touchLi.write('')
 touchLi.close()
 touchLo = open(lockFileName, 'wb')
 touchLo.write('')
 touchLo.close()
 ''' <Peprpetual Loop Auto-submit business /> '''

Then remove the following lines towards the beginning of the makeMeta function:

 f = open("%s_files.xml" % identifier, 'wb')
 f.write('<files/>')
 f.close()

Finally, remove the following line towards the end of the main function:

 os.remove(lockFileName)
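As an alternative to deleting code, the archive.org-specific steps could be guarded by a flag so that one script serves both uses. The sketch below is not part of wiki-d.py: the functions begin_run and end_run and the archive_mode flag are hypothetical, and the logic is a simplified reading of the ready-list and lock-file handling quoted above (it returns a boolean instead of calling exit).

```python
import os

# File names taken from the wiki-d.py snippet above.
READY_LIST = "ready_list.txt"
LOCK_FILE = READY_LIST + ".lck"


def touch(path):
    """Create an empty file, truncating it if it already exists."""
    with open(path, "wb"):
        pass


def begin_run(archive_mode):
    """Return True if the run may proceed, False if it should abort.

    Hypothetical wrapper: with archive_mode off, no ready list or lock
    file is created and the run always proceeds.
    """
    if not archive_mode:
        return True
    if os.path.exists(READY_LIST):
        # The previous list has not been picked up yet.
        if os.path.exists(LOCK_FILE):
            os.remove(LOCK_FILE)
        return False
    if os.path.exists(LOCK_FILE):
        # Another process is already generating the list.
        return False
    touch(READY_LIST)
    touch(LOCK_FILE)
    return True


def end_run(archive_mode):
    """Remove the lock file at the end of an archive.org run."""
    if archive_mode and os.path.exists(LOCK_FILE):
        os.remove(LOCK_FILE)
```

With this approach, the main function would call begin_run(False) and end_run(False) for general use, and the _files.xml stub in makeMeta could be skipped under the same flag.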