Wikiprep (Zemanta fork)
=======================

MediaWiki syntax preprocessor and information extractor

NOTE: Code in this repository has been unmaintained for several years.
It is known to have problems with newer Wikipedia dumps and Perl
versions. I no longer have at hand the datasets and processing power
necessary to run Wikiprep, hence I am not able to help with any
problems.

-- Tomaz Solc


Content
=======

1. Introduction
   1.1 Relation to the original Wikiprep
   1.2 Current status
2. Requirements
   2.1 Software
   2.2 Hardware
3. Usage
   3.1 Installation
   3.2 Running
   3.3 Parallel processing
   3.4 Parsing MediaWiki dumps other than English Wikipedia
4. Tools
5. License
6. Hacking


1. Introduction
===============

Wikiprep is a Perl script that parses MediaWiki data dumps in XML
format and extracts useful information from them (MediaWiki is the
software best known for running Wikipedia and other Wikimedia
Foundation projects).

MediaWiki uses a markup language (wiki syntax) that is optimized for
easy editing by human beings. It contains a lot of quirks, special
cases and odd corners that help MediaWiki correctly display wiki
pages, even when they contain typing errors. This makes parsing the
syntax with other software highly non-trivial.

One goal of Wikiprep is to implement a parser that:

o is as compatible with MediaWiki as possible, implementing as much
  functionality as is needed to achieve the other goals, and

o is as fast as possible, allowing it to track the English Wikipedia
  dataset as closely as possible (MediaWiki's PHP code is slow).

The other goal is to use that parser to extract from the dump various
pieces of information that are suitable for further processing and to
store them in files with a simple syntax.

1.1 Relation to the original Wikiprep

Wikiprep was initially developed by Evgeniy Gabrilovich to aid his
research. Tomaz Solc adapted the script for use in semantic
information extraction from Wikipedia as part of the Zemanta web
service. This version of Wikiprep has undergone extensive
modifications to be able to extract the information needed by
Zemanta's engine.

1.2 Current status

Currently implemented MediaWiki functionality:

o Templates:
  - named and positional parameters,
  - parameter defaults,
  - recursive inclusion to some degree (infinite recursion breaking is
    currently implemented in a way incompatible with MediaWiki), and
  - support for <noinclude>, <includeonly> and similar syntax.
o Parser functions (currently #if, #ifeq, #language, #switch)
o Magic words (currently urlencode, PAGENAME)
o Redirects
o Internal, external and interwiki links
o Proper handling of <nowiki> and other pseudo-HTML tags
o Disambiguation page recognition and special parsing
o MediaWiki-compatible date handling
o Stub page recognition
o Table and math syntax blocks are recognized and removed from the
  final output
o Related article identification


2. Requirements
===============

2.1 Software

You need a recent version of Perl 5 compiled for a 64-bit architecture
with the following modules installed (names in parentheses are the
names of the respective Debian packages):

  Parse::MediaWikiDump  (libparse-mediawikidump-perl)
  Regexp::Common        (libregexp-common-perl)
  Inline::C             (libinline-perl)
  XML::Writer           (libxml-writer-perl)
  BerkeleyDB            (libberkeleydb-perl)
  Log::Handler          (liblog-handler-perl)

If you can't use Inline::C for some reason, run wikiprep with the
"-pureperl" flag and it will use the (roughly) equivalent pure Perl
implementations.

If you want to process gzip or bzip2 compressed dumps you will need
gzip and bzip2 installed. Gzip is also required for the -compress
option.

To run the unit tests you will also need the xmllint utility (shipped
with libxml2, libxml2-utils).
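On a Debian or Ubuntu system the dependencies can usually be installed
directly from the package repositories. The command below is only a
sketch that assumes a Debian-based system and uses the package names
listed above; names and availability may differ between releases:

  # assumes a Debian/Ubuntu system; package names as listed above
  $ sudo apt-get install libparse-mediawikidump-perl \
        libregexp-common-perl libinline-perl libxml-writer-perl \
        libberkeleydb-perl liblog-handler-perl libxml2-utils \
        gzip bzip2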
2.2 Hardware

English Wikipedia is big and is getting bigger, so the requirements
below slowly grow over time. As of January 2011, Wikiprep output takes
approximately 22 GB of hard disk space (with -compress, -format
composite and default logging). The debug log takes 20 GB or more.

The prescan phase requires a little over 4 GB of memory in a single
Perl process (hence the requirement for a 64-bit version of Perl). In
the transform phase Wikiprep requires 100 to 200 MB per Perl process
(but 4 GB or more is recommended for decent performance, due to the
OS caching BDB tables).

On a dual 6-core 2.6 GHz AMD Opteron with 32 GB of RAM and 12 parallel
processes it takes approximately 16 hours to process the English
Wikipedia dump from 15 January 2011.


3. Usage
========

3.1 Installation

Run the following commands from the top of the Wikiprep distribution:

  $ perl Makefile.PL
  $ make
  $ make test    (optional)

And as root:

  $ make install

3.2 Running

The most common command to start Wikiprep on an XML dump you
downloaded from Wikimedia:

  $ wikiprep -format composite -compress -f enwiki-20090306-pages-articles.xml.bz2

This will produce a number of files in the same directory as the dump.

Alternatively, run the following to get a list of other available
options:

  $ wikiprep

To run the regression tests included in the distribution, run:

  $ make test

3.3 Parallel processing

Wikiprep is capable of using multiple CPUs; however, in order to do
this you must provide it with a split XML dump.

First split the XML dump using the splitwiki utility in the tools
subdirectory:

  $ bzcat enwiki-20090306-pages-articles.xml.bz2 | \
        splitwiki 4 enwiki-20090306-pages-articles.xml

Then run Wikiprep on the split dump using the -parallel option:

  $ perl wikiprep.pl -format composite -compress \
        -f enwiki-20090306-pages-articles.xml.0000

You only need to run Wikiprep once on a machine and it will
automatically split into as many parallel processes as there are parts
of the dump (identified by the .NNNN suffix).

Wikiprep processing is split into two sequential parts: "prescan",
which is done in one process, and "transform", which can be done in
parallel. So you will need to wait for the prescan to finish before
you see multiple wikiprep processes running on the system.

With some scripting it is possible to distribute dump parts to
multiple machines and start Wikiprep separately on each machine. It is
also possible to run the prescan only once on a single machine and
then distribute the .db files it creates to other machines, where only
the transform part is started.

3.4 Parsing MediaWiki dumps other than English Wikipedia

All language and site-specific settings are located in Perl modules
under /lib/Wikiprep/Config. The configuration used can be chosen via
the -config command line parameter (the default is Enwiki.pm).

The Wikiprep distribution includes configurations for Wikipedia in
various languages. These were contributed by Wikiprep users and can be
outdated or incomplete. At the moment the only configuration that is
thoroughly tested before each release is the English Wikipedia config.
Patches are always welcome.

To add support for a new language or MediaWiki installation, copy
Enwiki.pm to a new file (the convention is to use the same name as is
used in naming Wikimedia dumps) and adjust the values within, as in
the sketch below.
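For example, adding a configuration for the German Wikipedia could
look roughly like this. The file name Dewiki.pm, the exact form of the
-config argument and the dump file name are assumptions made for
illustration; if they do not match, run wikiprep without arguments to
see how the option is expected to be used:

  $ cp lib/Wikiprep/Config/Enwiki.pm lib/Wikiprep/Config/Dewiki.pm
  # edit Dewiki.pm and adjust the language and site-specific values;
  # the config name and dump file below are illustrative assumptions
  $ wikiprep -config Dewiki.pm -format composite -compress \
        -f dewiki-20110115-pages-articles.xml.bz2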
4. Tools
========

There are a couple of tools in the tools/ directory that aim to make
your life a bit easier. Their use should be pretty much obvious from
the help message they print if you run them without any command line
arguments.

o findtemplate.sh

  Finds templates that support a named parameter. Good for searching
  for all templates that support, for example, the "isbn" parameter.

o getpage.py

  Uses the Special:Export feature of Wikipedia to download a single
  article and all templates it depends on. It then constructs a file
  resembling a MediaWiki XML dump, containing only these pages. Good
  for making test datasets or researching why a specific page failed
  to parse properly.

o samplewiki

  Takes a complete MediaWiki XML dump, takes a random sample of pages
  from it and creates a new (smaller) dump.

o splitwiki

  Splits a MediaWiki XML dump into N equal parts. For example:

    $ bzcat enwiki-20090306-pages-articles.xml.bz2 | \
          splitwiki 4 enwiki-20090306-pages-articles.xml

  will produce 4 files of roughly equal length in the current
  directory:

    $ ls
    enwiki-20090306-pages-articles.xml.0000
    enwiki-20090306-pages-articles.xml.0001
    enwiki-20090306-pages-articles.xml.0002
    enwiki-20090306-pages-articles.xml.0003

o riffle

  Takes a complete MediaWiki XML dump and some articles downloaded
  using the Special:Export feature, and inserts these articles into
  the dump. Good for keeping an XML dump up to date without having to
  re-download the whole thing.


5. License
==========

Copyright (C) 2007 Evgeniy Gabrilovich (gabr@cs.technion.ac.il)
Copyright (C) 2009 Tomaz Solc (tomaz.solc@tablix.org)

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA, or see <http://www.gnu.org/licenses/> and
<http://www.fsf.org/licensing/licenses/info/GPLv2.html>.

Some of the example files are copied from the English Wikipedia and
are copyright (C) by their respective authors. The text is available
under the terms of the GNU Free Documentation License.


6. Hacking
==========

The code is pretty well documented. Have a look inside first, then ask
on the mailing list.

If you need to customize Wikiprep for a specific language, copy
Wikiprep/Config/Enwiki.pm and change the fields there. You can then
use your new config with the -language option.

To run a test on just a particular test case, run for example:

  $ perl t/cases.t t/cases/infobox.xml