
This is an archive of the research done by Erik Rose and Peter Potrowl in order to find, and ultimately build, a good Python parser for MediaWiki's syntax.

Goals
  • Possible (to represent MediaWiki syntax). There is no such thing as invalid wiki markup, so we have to make some sense out of anything.
  • Extensible per project (for {for} syntax, DekiWiki conditionals, includes, templates, etc.)
  • Easier to comprehend than MW's existing formatter
  • Must work with Unicode input (Japanese, for example)

Non-goals
  • Implement those MediaWiki language features which make no sense outside the MediaWiki product: for example, category links.
  • Be bug-for-bug compatible with MW. We can't constrain ourselves to give the exact same output for egregiously bad wikisyntax; we'll make a mess, which kills our goal of being easy to understand. We should catch common errors like bad quote nesting but not go out of our minds to be bug-for-bug compatible. After all, MW itself changes the language in every release. Let them chase us.

MW language properties

Parser libs


In the following lists, (+) signifies a pro, (-) a con, and (.) a neutral point.


PLY
  • (.) LALR or, optionally, SLR (Can SLR look ahead farther? No: actually, it has no lookahead.)
  • (-) LALR(1), which is grossly insufficient for MW. I think it's about a lookahead of 4, which we'd have to take care of ourselves, probably making the code hard to comprehend in the process.
  • (+) Modal lexer and parser. (Is this necessary? Understand what can't be BNF'd about apostrophe jungles and lists.)
  • (+) Easy to translate into C later
  • (.) Can turn off magic autodiscovery
  • (-) Potential for yacc to guess wrong about how to assemble symbols: reduce/reduce and shift/reduce errors
  • (+) Much faster than PyParsing? One talk (at 23:58) suggests it's 5.5x faster; more benchmarks exist (ANTLR and more)
  • (-) A bit more verbose (but very clear)
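To make the lookahead problem concrete, here is a sketch in plain Python (not PLY; the function and token names are made up for illustration) of why MediaWiki's apostrophe runs defeat a fixed lookahead of 1: the meaning of the first `'` depends on how many more follow it, so a tokenizer must scan an arbitrary distance ahead before it can emit anything.

```python
# Illustrative only: a run of apostrophes means italics (''), bold ('''),
# or both (''''', and longer runs), so classifying the FIRST apostrophe
# requires counting the whole run -- more than one token of lookahead.

def classify_apostrophes(text, i):
    """Return (token_name, run_length) for the apostrophe run at text[i]."""
    run = 0
    while i + run < len(text) and text[i + run] == "'":
        run += 1
    if run >= 5:
        return ("BOLD_ITALIC", run)   # ''''' opens bold+italic
    if run >= 3:
        return ("BOLD", run)          # ''' opens bold
    if run == 2:
        return ("ITALIC", run)        # '' opens italic
    return ("TEXT", run)              # a lone ' is plain text

print(classify_apostrophes("''italic''", 0))      # ('ITALIC', 2)
print(classify_apostrophes("'''''both'''''", 0))  # ('BOLD_ITALIC', 5)
```

(MediaWiki's real rules for runs of 4 or more than 5 apostrophes are subtler; this only shows the lookahead issue.)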

PyParsing
  • (.) Recursive descent (LL) of PEGs
  • (+) Packrat, so O(n)
  • (+) Easy to write
  • (-) "[An LL(k) parser] may defer error detection to a different branch of the grammar due to backtracking, often making errors harder to localize across disjunctions with long common prefixes."—Wikipedia. I had that problem when writing a simple italics/bold parser: you have to keep the recursion stack in your head to make any sense of the debug info. I eventually gave up trying to fix it.


Not researched in depth.

  • (+) Claims to be nearly as fast as C
  • (-) Requires a C build step

ANTLR
  • (-) Separate code generation step
  • (-) Slow because it generates a lot of function calls
  • (+) Can parse LL(k) grammars (arbitrary lookahead)

SPARK
  • (+) Has an implementation of an Earley parser, which can do arbitrary lookahead in n^3 worst case.

NLTK
  • (+) Another Earley parser
  • (+) Long-lived. Under active development by multiple authors. Last released 4/2011.
  • (.) There's a good, free book about the project. Not sure how good the documentation of the code itself is, though.
  • (-) An enormous dependency
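For a feel of how an Earley parser gets its unbounded lookahead, here is a minimal recognizer sketch. It is a toy, not Spark's or NLTK's actual API, and it omits the classic nullable-rule refinement (which doesn't matter for grammars without empty productions).

```python
# Toy Earley recognizer: handles arbitrary (even ambiguous) context-free
# grammars in O(n^3) worst case. A chart item is (lhs, rhs, dot, origin).

def earley_recognize(grammar, start, tokens):
    """grammar: dict mapping nonterminal -> list of right-hand-side tuples."""
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        added = True
        while added:                      # closure: predict + complete to fixpoint
            added = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in grammar:        # predict
                    for prod in grammar[rhs[dot]]:
                        new = (rhs[dot], prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            added = True
                elif dot == len(rhs):                             # complete
                    for plhs, prhs, pdot, porig in list(chart[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            new = (plhs, prhs, pdot + 1, porig)
                            if new not in chart[i]:
                                chart[i].add(new)
                                added = True
        if i < len(tokens):                                       # scan
            for lhs, rhs, dot, origin in chart[i]:
                if dot < len(rhs) and rhs[dot] == tokens[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[-1])

# An ambiguous grammar (E -> E + E | n) that LALR(1) would choke on:
g = {"E": [("E", "+", "E"), ("n",)]}
print(earley_recognize(g, "E", ["n", "+", "n", "+", "n"]))  # True
```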

PyGgy

  • (.) Untested
  • (.) GLR parser
  • (+) Public domain
  • (-) Might be dead (the home page has disappeared)
  • (-) "PyGgy was written and tested with Python 2.2.3." (in 2003)

Pijnu

  • (+) PEG. Easy, easy grammar definition.
  • (.) Looks promising but not mature. Author has given no thought to speed but much to clarity.
  • (-) Build step
  • (-) Currently no Unicode support
  • (+) Great docs
  • (+) Great error feedback
  • (+) The generated code looks like what you have to hand-write for PyParsing (see the user guide).
  • (+) Can handle having Unicode chars in the input.
  • (.) Can it handle having Unicode chars as part of parse rules? We might need guillemets.
  • (-) Eek, no tests! Throws DeprecationWarnings on import. Very idiosyncratic coding style.

PyMeta

  • (.) PEG. Grammar defined in a DSL.
  • (+) No build step; converts grammar from a DSL at runtime.
  • (+) Good docs in the code
  • (-) Nobody's touched it for a year.

PyMeta2

  • (.) Is a port of PyMeta to "the simplified OMeta 2 syntax" (new DSL syntax).

Ppeg

  • (-) Not really in Python: the Python code (21 kB) is just an API for a C parser (172 kB)

pyPEG

  • (.) Only 340 lines of Python
  • (-) Similar to Pijnu but much less easy to use

Previous implementations


Py-wikimarkup

  • (+) Probably works (untested)
  • (-) Direct transformation from wikitext to HTML (generates no AST)
  • (-) As a direct port of MW's PHP code, it is very difficult to understand or extend.
  • (-) Because it is based on a sequence of perilously combined regexes which interact in surprising ways, it, like MW proper, sometimes yields surprising output.

mwlib

  • (+) Works well, lots of unittests already defined and successfully passed
  • (+) Generates an AST
  • (.) Implements its own lexer/parser (see mwlib/refine/ and mwlib/refine/_core.pyx, a compiled token walker)
  • (.) Seems to tokenize the text and then apply ~20 different parsers one by one (see mwlib/refine/ and #635)
  • (-) Structure of the code is somewhat hard to understand
  • (-) Lots of code unrelated to parsing (fetching articles, (un)zipping files, API stuff, output for ODF, LaTeX, etc.) that should be better isolated from the parsing part
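A toy sketch of that reported architecture (the pass names are hypothetical, not mwlib's real API): tokenize once, then run a sequence of small refinement passes, each rewriting the token stream.

```python
# Toy "tokenize, then refine repeatedly" pipeline in the style attributed
# to mwlib above. Real wikitext needs far subtler passes than these.

def tokenize(text):
    return text.split()            # toy tokenizer: whitespace split

def parse_links(tokens):           # one refinement pass: [[...]] -> LINK nodes
    return [("LINK", t[2:-2]) if isinstance(t, str)
            and t.startswith("[[") and t.endswith("]]") else t
            for t in tokens]

def parse_bold(tokens):            # another pass: '''...''' -> BOLD nodes
    return [("BOLD", t[3:-3]) if isinstance(t, str)
            and t.startswith("'''") and t.endswith("'''") and len(t) > 6 else t
            for t in tokens]

PASSES = [parse_links, parse_bold]   # mwlib reportedly chains ~20 such passes

def parse(text):
    tokens = tokenize(text)
    for refine in PASSES:
        tokens = refine(tokens)
    return tokens

print(parse("see [[Main_Page]] for '''details'''"))
# ['see', ('LINK', 'Main_Page'), 'for', ('BOLD', 'details')]
```

Each pass stays small and testable, which is the appeal; the cost is that passes can interact in order-dependent ways.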

mediawiki_parser (this one)

  • (+) Good start (parser + lexer, unittests)
  • (.) Currently uses PLY, but that will be abandoned because of its lack of lookahead
  • (-) Syntax coverage is currently incomplete
  • (-) Currently generates no AST

Sweble

Parsing strategies
Lexer + parser (e.g. PLY)

  • (+) Easy to use and debug
  • (+) Stateful (specific simple rules for each context)
  • (-) Not enough lookahead in the case of an LR(1) parser

Recursive descent of CFGs

  • (+) No separate lexer and parser
  • (+) Memoization ("packrat") makes it run in O(n)
  • (.) Recursive (so deeply nested input is limited by Python's recursion limit)
  • (-) May require large amounts of memory
  • (-) Quite hard to read and debug

Recursive descent of PEGs (e.g. Rats, PyParsing)

  • (+) No separate lexer and parser
  • (+) O(n) with packrat
  • (+) Resolves ambiguity by having precedence orders for productions. As a result, it is easy to extend a PEG with productions for use in special situations without wrecking the wider grammar. This could be a very big deal for our extensibility story.
  • (+) We can rip off Sweble's grammar.
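A minimal sketch of why ordered choice helps extensibility (all names here are illustrative, not any particular library's API): in a PEG, a rule's alternatives are tried in order and the first match wins, so a new, more specific production can be added in front without disturbing the rest of the grammar, while packrat memoization keeps the backtracking linear.

```python
# Packrat-flavoured PEG sketch for wiki emphasis markup.
from functools import lru_cache

def make_parser(text):
    @lru_cache(maxsize=None)       # memo table: (pos, s) -> new pos or None
    def literal(pos, s):
        return pos + len(s) if text.startswith(s, pos) else None

    @lru_cache(maxsize=None)       # each position parsed at most once -> O(n)
    def emphasis(pos):
        # Ordered choice: the more specific production (''') is tried
        # first; prepending productions can't break the ones after them.
        for marker, kind in (("'''", "bold"), ("''", "italic")):
            after = literal(pos, marker)
            if after is not None:
                close = text.find(marker, after)
                if close != -1:
                    return (kind, text[after:close]), close + len(marker)
        return None                # no emphasis at this position

    return emphasis

parse = make_parser("'''bold''' rest")
print(parse(0))   # (('bold', 'bold'), 10)
```

Each successful parse returns the node plus the position after it, which is the usual PEG convention.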

Earley parser (e.g. Spark, NLTK)

GLR parser (e.g. Pyggy)

  • (.) Supports ambiguous grammars (which MW's grammar isn't)
  • (+) O(n) on deterministic grammars

Previous work



If we build the parse tree in custom lexer callbacks, we can make it an ElementTree or whatever we want, meaning we can use XPath on it later if we like.
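A sketch of that idea (the callback names are hypothetical, not any real lexer's hooks): the callbacks maintain a stack of open elements, build an ElementTree as they fire, and the result answers ElementTree's XPath subset.

```python
# Hypothetical lexer callbacks that grow an ElementTree as tokens arrive.
import xml.etree.ElementTree as ET

root = ET.Element("article")
stack = [root]                 # chain of currently open elements

def on_open(tag):              # called when the lexer sees an opening token
    stack.append(ET.SubElement(stack[-1], tag))

def on_text(text):             # called for a run of plain text
    stack[-1].text = (stack[-1].text or "") + text

def on_close(tag):             # called when the lexer sees a closing token
    stack.pop()

# Simulate lexing "''italic'' plain":
on_open("i")
on_text("italic")
on_close("i")
on_text(" plain")

print([el.text for el in root.findall(".//i")])   # ['italic']
```

ElementTree's `findall` supports a limited XPath dialect (paths, `.//`, simple predicates), which is enough for most post-parse queries.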

Quasi Gantt chart

In the bars below, "|----" is one week, "|--" about three days, "|" one day.

Main track (one person, sequential):
  Re-examine parsing algorithm & implement links
    & other long-lookahead productions (3 weeks)          |----|----|----
  Bold/Italics/Apostrophe Jungles (3 weeks)               |----|----|----
  HTML formatter                                          |----
  Showfor support                                         |--

Simple productions (in parallel):
  Paragraphs (3 days)                                     |--
  HRs (1 day)                                             |
  Magic words (3 days)                                    |--
  Tables (long lookahead?) (1 week)                       |----

  One person should do these:
    Includes (long lookahead?) (2 weeks)                  |----|----
    Templates w/params (long lookahead?) (2 weeks)        |----|----

  Redirects (3 days)                                      |--
  Naked URLs (long lookahead but doable in lexer?) (1 day)|
  Headers (long lookahead but doable in lexer) (done for now)
  Entities (done for now)
  Behavior switches (optional) (4 days; will require
    some architecture thinking)                           |---
  HTML tags: probably just tokenize and preserve them
    through the parser, then have a separate post-parse
    step to balance and validate them and, for example,
    escape any invalid ones (3 weeks)                     |----|----|----