Metafacture modules for processing MediaWiki pages

Data extraction from MediaWiki pages made easy.

About Metafacture-Mediawiki

Metafacture-Mediawiki is a plugin for Metafacture. It provides modules for extracting information from MediaWiki pages such as Wikipedia articles. Currently, modules for extracting links and templates exist. Adding new extraction modules is easy.

The plugin relies on the excellent Sweble wikitext parser for parsing wikitext into abstract syntax trees.

Key Features

  • Extracts basic metadata about pages from MediaWiki XML documents
  • Extracts simple information from wikitext using regular expressions (fast but not suitable for complex tasks)
  • Wraps the Sweble wikitext parser for conveniently parsing wikitext into an abstract syntax tree within a Flux flow
  • Extracts links and templates from abstract syntax trees created by Sweble and turns them into a Metafacture event stream
  • Makes writing additional extraction modules easy
  • Supports running multiple extraction modules hassle-free
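
To illustrate the regular-expression approach mentioned in the feature list, here is a minimal sketch in plain Java. It is not the plugin's implementation and does not use any Metafacture or Sweble API; the class name `WikiLinkSketch`, the method `extractLinkTargets`, and the pattern itself are assumptions made for this example. It only handles simple internal links of the form `[[Target]]`, `[[Target|label]]` and `[[Target#Section|label]]`, which is the kind of task where a regular expression is fast but quickly reaches its limits (nested templates, for instance, need a real parser).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Metafacture-Mediawiki.
class WikiLinkSketch {

    // Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
    // Group 1 captures the link target without section anchor or label.
    private static final Pattern LINK = Pattern.compile(
            "\\[\\[([^\\]|#]+)(?:#[^\\]|]*)?(?:\\|[^\\]]*)?\\]\\]");

    // Returns the targets of all simple internal links in document order.
    static List<String> extractLinkTargets(String wikitext) {
        List<String> targets = new ArrayList<>();
        Matcher matcher = LINK.matcher(wikitext);
        while (matcher.find()) {
            targets.add(matcher.group(1).trim());
        }
        return targets;
    }

    public static void main(String[] args) {
        String wikitext =
                "See [[Berlin]] and [[Germany|the country]] or [[Paris#History|history]].";
        System.out.println(extractLinkTargets(wikitext));
    }
}
```

For anything beyond such flat patterns, parsing the wikitext into an abstract syntax tree, as the Sweble-based modules do, is the more robust choice.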

Download and Install

Metafacture-Mediawiki can be used as a plugin in the Metafacture distribution or as a Java library in your own programs.

Plugin Usage

The plugin can be downloaded from the releases page. To use it, drop the plugin jar into the /plugins folder of the metafacture-runner.

Java Library Usage

Metafacture-Mediawiki is available on Maven Central. To use it, add the following dependency declaration to your pom.xml:


Additionally, you need to add the metafacture-core package as a dependency:


Our integration server automatically publishes successful builds of the master branch as snapshot versions to the Sonatype OSS Repository.


The documentation of Metafacture-Mediawiki can be found in the Wiki.


Copyright 2013, 2015 Deutsche Nationalbibliothek.

Metafacture-Mediawiki is distributed under the Apache 2.0 License.