Parse MediaWiki dumps for the Get to Philosophy game

README
Parses Wikipedia dumps for the Get to Philosophy game, as seen at http://en.wikipedia.org/wiki/Wikipedia:Get_to_Philosophy

Ultimately this will save data in a format suitable for loading into a Rails website. Until then, run bin/page_xml_parser_interface.rb on an XML file such as simplewiki-20081029-pages-articles.xml

To get such an XML file, go to http://download.wikimedia.org/backup-index.html, choose your language edition, and download a dump that contains only the latest revision (as opposed to all revisions) of every article.

To do:

Speed up analysis, preferably using less memory. Ensure iteration over pages doesn't load all of them at once.
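
One possible way to avoid loading every page at once is to stream the dump rather than build a document tree. The sketch below is not the project's parser. It uses REXML's stream listener, which ships with standard Ruby installs, and the names PageStreamer and each_page_in are hypothetical. It assumes the usual dump layout of <page> elements that contain a <title> and a <text>.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: hands over one [title, text] pair at a time,
# so only the page currently being read is held in memory.
class PageStreamer
  include REXML::StreamListener

  def initialize(&block)
    @on_page = block
    @current = nil   # name of the element being captured, if any
    @page = nil      # fields of the page being read
  end

  def tag_start(name, _attrs)
    case name
    when 'page' then @page = { 'title' => +'', 'text' => +'' }
    when 'title', 'text' then @current = name if @page
    end
  end

  # Character data may arrive in several chunks, so append rather than assign.
  def text(chunk)
    @page[@current] << chunk if @page && @current
  end

  def tag_end(name)
    @current = nil if name == @current
    return unless name == 'page'

    @on_page.call(@page['title'], @page['text'])
    @page = nil
  end
end

# Accepts an IO (for example an open File) or a String of XML.
def each_page_in(source, &block)
  REXML::Document.parse_stream(source, PageStreamer.new(&block))
end
```

Usage might look like File.open(path) { |f| each_page_in(f) { |title, text| ... } }, where path points at the downloaded dump.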

Handle mainspace articles that have colons in their names.
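
A colon only puts a title outside the main namespace when the text before it names a known namespace. The sketch below illustrates that check. The prefix list is an assumed, abbreviated sample, and mainspace? is a hypothetical helper, not part of this project.

```ruby
require 'set'

# Assumed, abbreviated list. A real run would read the names from the
# <namespaces> block in the dump's <siteinfo> header.
NAMESPACE_PREFIXES = Set.new(
  %w[talk user wikipedia file mediawiki template help category]
).freeze

# Returns true unless the text before the first colon names a known namespace.
def mainspace?(title)
  prefix, rest = title.split(':', 2)
  return true if rest.nil? # no colon at all

  !NAMESPACE_PREFIXES.include?(prefix.strip.downcase.tr('_', ' '))
end
```

With this check, "Star Trek: The Next Generation" stays in the main namespace while "Template:Day" does not.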

Consider how to handle footnotes.
Should the Philosophy article count as linking to A.C. Grayling, the author of a cited publication?
What about an explanatory footnote in the United Kingdom article?

Find the longest chain from any article to its loop.

Find out how long the terminating loops are.
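
Both of the previous two items can be answered by walking first links and noting where a title repeats. The sketch below assumes a hypothetical first_link hash that maps each article title to the title of its first link (or has no entry when the article has none). The method names are illustrative only.

```ruby
# Follows first links from +start+ until a title repeats or the chain dead-ends.
# Returns [steps_before_loop, loop_length]. For a dead end, loop_length is 0
# and the first value is the number of articles visited.
def chain_and_loop(first_link, start)
  position = {} # title => index at which it was first visited
  title = start
  while title && !position.key?(title)
    position[title] = position.size
    title = first_link[title]
  end
  return [position.size, 0] if title.nil?

  tail = position[title]
  [tail, position.size - tail]
end

# Returns [title, steps_before_loop, loop_length] for the article that takes
# the most steps to reach its loop, or nil if no chain ends in a loop.
def longest_chain(first_link)
  first_link.keys
            .map { |title| [title, *chain_and_loop(first_link, title)] }
            .select { |_, _, loop_length| loop_length > 0 }
            .max_by { |_, steps, _| steps }
end
```

This re-walks the chain for every article, so it is quadratic in the worst case. Caching the result for each visited title would be the obvious next step on a full dump.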

Add the option to ignore certain links such as dates or Greek Language.
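
One way such an option could look is a list of rules that is checked before a link is accepted as the first link. This is a sketch under assumptions: first_countable_link and IGNORE_RULES are hypothetical names, the article title 'Greek language' is assumed, and the date patterns only cover bare years and English "Month day" titles.

```ruby
MONTHS = %w[January February March April May June July
            August September October November December].freeze

# Each rule may be a String (exact title), a Regexp, or a Proc,
# because all three respond to === in the way used below.
IGNORE_RULES = [
  'Greek language',
  /\A\d{1,4}\z/,                             # bare years such as "1066"
  /\A(?:#{MONTHS.join('|')}) \d{1,2}\z/      # dates such as "October 29"
].freeze

# Returns the first link target that no rule rejects, or nil if none is left.
def first_countable_link(link_targets, ignore: [])
  link_targets.find { |target| ignore.none? { |rule| rule === target } }
end
```

Passing an empty ignore list keeps the current behavior of taking the very first link.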

Handle templates containing text content, such as Template:Day.

Distinguish between articles and redirects?
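
If redirects are to be told apart from articles, one option is to look for the redirect marker at the start of the page's wikitext. The sketch below assumes the English keyword #REDIRECT (other language editions may use translated keywords), and redirect_target is a hypothetical helper.

```ruby
# Matches "#REDIRECT [[Target]]", ignoring case, and stops the captured
# title at a section anchor (#) or a pipe (|).
REDIRECT_PATTERN = /\A\s*#redirect\s*:?\s*\[\[([^\]|#]+)/i

# Returns the redirect target title, or nil for an ordinary article.
def redirect_target(wikitext)
  match = REDIRECT_PATTERN.match(wikitext)
  match && match[1].strip
end
```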