Scripts and other tools to perform MediaWiki dump parsing

Wikipedia Tools
---------------

WikiXMLSplit
  Wiki XML Split splits a MediaWiki XML dump file, either bz2-compressed or
  uncompressed, into a set of per-page XML files organized into subdirectories
  by page ID number.
  /preprocessing/wikixmlsplit/
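
The splitting step described above can be sketched in Python as follows. This is
a minimal illustration, not the code in /preprocessing/wikixmlsplit/: the
function names, the `bucket_size` parameter, and the `<bucket>/<page id>.xml`
layout are all assumptions made for the example.

```python
import bz2
import os
import xml.etree.ElementTree as ET


def open_dump(path):
    """Open a dump for binary reading, decompressing if the name ends in .bz2."""
    return bz2.open(path, "rb") if path.endswith(".bz2") else open(path, "rb")


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_dump(dump_path, out_dir, bucket_size=1000):
    """Write each <page> element to out_dir/<bucket>/<page id>.xml.

    bucket_size is a hypothetical layout parameter chosen for this sketch.
    Returns the list of files written, in dump order.
    """
    written = []
    with open_dump(dump_path) as stream:
        # iterparse streams the dump, so the whole file is never loaded at once.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            # The page's own <id> is a direct child; revision ids are nested deeper.
            page_id = next(
                int(child.text) for child in elem if local_name(child.tag) == "id"
            )
            bucket = os.path.join(out_dir, str(page_id // bucket_size))
            os.makedirs(bucket, exist_ok=True)
            target = os.path.join(bucket, f"{page_id}.xml")
            ET.ElementTree(elem).write(target, encoding="utf-8", xml_declaration=True)
            written.append(target)
            elem.clear()  # drop the finished page's children to limit memory use
    return written
```

Under these assumptions, `split_dump("dump.xml.bz2", "pages")` would place page
2345 at `pages/2/2345.xml`.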

WikiXML2HTML
  Wiki XML to HTML converts the MediaWiki-formatted text inside the split XML
  files produced by WikiXMLSplit, along with the title information, into a set
  of HTML files (XHTML 1.0) organized into directories by page ID number.
  /preprocessing/wikixml2html/
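
To give a rough idea of what this conversion involves, here is a toy Python
sketch that turns a title plus a very small subset of MediaWiki markup
(section headings, bold, italics, internal links) into an XHTML 1.0 document.
It is not the converter in /preprocessing/wikixml2html/; real wikitext
(templates, tables, references, nested markup) needs far more than this, and
every name below is an assumption made for the example.

```python
import html
import re

XHTML_TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    "<head><title>{title}</title></head>\n"
    "<body>\n<h1>{title}</h1>\n{body}\n</body>\n</html>\n"
)


def inline_markup(text):
    """Escape text, then convert bold, italics, and internal links."""
    # quote=False keeps apostrophes intact so the ''...'' patterns still match.
    text = html.escape(text, quote=False)
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    # [[Target|label]] or [[Target]] is reduced to its visible text only.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]", r"\1", text)
    return text


def wikitext_to_xhtml(title, wikitext):
    """Convert a toy subset of wikitext into an XHTML 1.0 document string."""
    blocks, paragraph = [], []

    def flush():
        # Emit the accumulated lines as one paragraph.
        if paragraph:
            blocks.append("<p>" + inline_markup(" ".join(paragraph)) + "</p>")
            paragraph.clear()

    for line in wikitext.splitlines():
        line = line.strip()
        heading = re.fullmatch(r"(={2,6})\s*(.+?)\s*\1", line)
        if heading:
            flush()
            level = len(heading.group(1))  # "== x ==" becomes <h2>
            blocks.append(f"<h{level}>{inline_markup(heading.group(2))}</h{level}>")
        elif not line:
            flush()  # a blank line ends the current paragraph
        else:
            paragraph.append(line)
    flush()
    return XHTML_TEMPLATE.format(
        title=html.escape(title, quote=False), body="\n".join(blocks)
    )
```

Because the output is XHTML, all text is escaped before any tags are added, so
the result stays parseable as XML.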

WikiHTML2Text
  Wiki HTML to Text converts the HTML-formatted text (XHTML 1.0) inside the
  files produced by WikiXML2HTML, along with the title tag information, into a
  set of plain-text files (UTF-8) organized into directories by page ID number.
  /preprocessing/wikihtml2text/
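
As an illustration of this last step, the sketch below extracts the title and
the body text from an HTML document using only Python's standard
`html.parser` module. It is not the code in /preprocessing/wikihtml2text/; the
class name, the function name, and the choice of which tags end a line are
assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the <title> text and the remaining document text separately."""

    # Closing one of these tags ends the current output line (an assumed set).
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.body_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.BLOCK_TAGS:
            self.body_parts.append("\n")

    def handle_data(self, data):
        # Newlines in the source markup are not meaningful, so fold them to spaces.
        data = re.sub(r"\s+", " ", data)
        (self.title_parts if self._in_title else self.body_parts).append(data)


def html_to_text(markup):
    """Return (title, body_text) with one output line per block element."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    title = "".join(parser.title_parts).strip()
    lines = (" ".join(line.split()) for line in "".join(parser.body_parts).splitlines())
    body = "\n".join(line for line in lines if line)
    return title, body
```

The returned strings can then be written out with
`open(path, "w", encoding="utf-8")` to obtain UTF-8 plain-text files.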