Parse MediaWiki dumps for the Get to Philosophy game
Fetching latest commit…
Cannot retrieve the latest commit at this time
Parses Wikipedia dumps for the Get to Philosophy game, as seen at http://en.wikipedia.org/wiki/Wikipedia:Get_to_Philosophy Ultimately this will save data in a format suitable for loading into a Rails website, but until then, run bin/page_xml_parser_interface.rb on an XML file like simplewiki-20081029-pages-articles.xml To get such an XML file, go to http://download.wikimedia.org/backup-index.html and then choose your language edition, and download a dump that contains the latest revision (as opposed to all revisions) of all articles. To do: Speed up analysis, preferably using less memory. Ensure iteration over pages doesn't load all of them at once. Handle mainspace articles that have colons in their names. Consider how to handle footnotes. Should Philosophy count as linking to A.C. Grayling, the author of a cited publication? What about an explanatory footnote in the United Kingdom? Find the longest chain from any article to its loop. Find out how long the terminating loops are. Add the option to ignore certain links such as dates or Greek Language. Handle templates containing text content, such as Template:Day. Distinguish between articles and redirects?