Xoff edited this page Mar 20, 2013 · 4 revisions

Metafacture-Mediawiki is a plugin for Metafacture.


The modules in Metafacture-Mediawiki can be divided into three groups.

Base modules

These modules provide MediaWiki XML and wikitext parsing. They create and augment WikiPage objects.

  • WikiXmlHandler parses a MediaWiki XML document and emits a WikiPage object for every page found
  • WikiTextParser uses Sweble to parse the wikitext in a WikiPage object and attaches the resulting abstract syntax tree (AST) to the object
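
The first step, emitting one page object per page in a MediaWiki XML document, can be illustrated with the JDK's SAX parser. This is only a minimal sketch and not the plugin's code: `PageSketch` and `PageEmittingHandler` are hypothetical names, the stand-in page object holds just a title and the raw wikitext, and the Sweble parsing step is left out entirely.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a WikiPage: only the title and the raw wikitext.
final class PageSketch {
    final String title;
    final String wikitext;

    PageSketch(String title, String wikitext) {
        this.title = title;
        this.wikitext = wikitext;
    }
}

// Hypothetical SAX handler that passes one PageSketch to the receiver
// for every <page> element it encounters.
final class PageEmittingHandler extends DefaultHandler {
    private final Consumer<PageSketch> receiver;
    private final StringBuilder chars = new StringBuilder();
    private String title = "";
    private String text = "";

    PageEmittingHandler(Consumer<PageSketch> receiver) {
        this.receiver = receiver;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        chars.setLength(0); // start collecting character data for this element
        if ("page".equals(qName)) {
            title = "";
            text = "";
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        chars.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (qName) {
            case "title": title = chars.toString(); break;
            case "text": text = chars.toString(); break;
            case "page": receiver.accept(new PageSketch(title, text)); break;
            default: break;
        }
    }

    // Convenience method: parses an XML string and collects all emitted pages.
    static List<PageSketch> parse(String xml) throws Exception {
        List<PageSketch> pages = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), new PageEmittingHandler(pages::add));
        return pages;
    }
}
```

Passing a `Consumer` to the handler keeps it independent of what happens to each page afterwards, which mirrors the idea of emitting objects to a downstream module.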


Extractors

Please note: Extractors are called analyzers in the code. The code will be updated with the next major revision (see issue #2), but until then the documentation is ahead of the code.

The extractors extract information from the different representations of a wiki page in a WikiPage object and turn this information into a Metafacture event stream.

  • AuthorityLinkExtractor extracts authority file links (GND, LOC, IMDB, VIAF) from Wikipedia articles
  • LinkExtractor extracts all internal links in a wiki page from an AST
  • SimpleLinkExtractor extracts links from a wiki page using regular expressions
  • TemplateExtractor extracts all templates from a wiki page whose names match a pattern
  • MultiExtractor runs a list of extractors and merges the results into a single record. Additionally, it makes sure that each extractor receives a WikiPage containing the representations of the wikitext it requires.
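
As a rough illustration of regular-expression-based extraction and of merging several extractors into one record, consider the sketch below. Everything in it is an assumption made for illustration: `ExtractorSketch`, its methods, and the two patterns are hypothetical, the record is a plain map rather than a Metafacture event stream, and the step in which MultiExtractor provides each extractor with the representation it requires is not modelled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the plugin's API.
final class ExtractorSketch {
    // Matches [[Target]] and [[Target|label]]; group 1 is the link target.
    private static final Pattern LINK =
        Pattern.compile("\\[\\[([^\\]|#]+)[^\\]]*\\]\\]");
    // Matches the name of a template call such as {{Infobox city|...}}.
    private static final Pattern TEMPLATE =
        Pattern.compile("\\{\\{\\s*([^|}]+?)\\s*[|}]");

    static List<String> links(String wikitext) {
        return allMatches(LINK, wikitext);
    }

    static List<String> templates(String wikitext) {
        return allMatches(TEMPLATE, wikitext);
    }

    // Collects group 1 of every match, in document order.
    private static List<String> allMatches(Pattern pattern, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    // Runs every extractor on the same wikitext and merges the results
    // into a single record, keyed by extractor name.
    static Map<String, List<String>> runAll(
            String wikitext, Map<String, Function<String, List<String>>> extractors) {
        Map<String, List<String>> record = new LinkedHashMap<>();
        extractors.forEach((name, extractor) -> record.put(name, extractor.apply(wikitext)));
        return record;
    }

    // Example wiring: a link extractor and a template extractor side by side.
    static Map<String, List<String>> extractAll(String wikitext) {
        Map<String, Function<String, List<String>>> extractors = new LinkedHashMap<>();
        extractors.put("links", ExtractorSketch::links);
        extractors.put("templates", ExtractorSketch::templates);
        return runAll(wikitext, extractors);
    }
}
```

Regular expressions like these only approximate wikitext syntax (nested templates and links inside comments are not handled), which is one reason an AST-based extractor can be the more robust choice.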

Utility modules

These modules help when working with WikiPage objects.

  • AstToJson adds a serialised representation of an AST to a WikiPage object
  • JsonToAst reconstructs an AST from a serialised representation and adds it to a WikiPage object


Be the first to write a tutorial!
