Add basic documentation in README.rst

Also: move the current content of README.rst to parsers.rst to keep a
trace of the research work we have done.
commit 6de10a822bab61af62a5b3955b62d62c2ffbc18b 1 parent a6f9e78
@peter17 peter17 authored
Showing with 342 additions and 231 deletions.
  1. +78 −231 README.rst
  2. +264 −0 parsers.rst
309 README.rst
@@ -1,262 +1,109 @@
-Goals
-=====
-* Possible (to represent MediaWiki syntax). There is no such thing as invalid wiki markup, so we have to make some sense out of anything.
-* Extensible per project (for {for} syntax, DekiWiki conditionals, includes, templates, etc.)
-* Easier to comprehend than MW's existing formatter
-* Must work with Unicode input (Japanese, for example)
+Presentation
+============
+This is a parser for MediaWiki's (MW) syntax. Its goal is to transform wikitext into an abstract syntax tree (AST) and then render that AST into various formats such as plain text and HTML.
-Ungoals
-=======
-* Implement those MediaWiki language features which make no sense outside the MediaWiki product: for example, category links.
-* Be bug-for-bug compatible with MW. We can't constrain ourselves to give the exact same output for egregiously bad wikisyntax; we'll make a mess, which kills our goal of being easy to understand. We should catch common errors like bad quote nesting but not go out of our minds to be bug-for-bug compatible. After all, MW itself changes the language in every release. Let them chase us.
+How it works
+============
-MW language properties
-======================
-* Unambiguous. Rats-based Sweble parses it, and Rats is a PEG-based lib, and PEGs can't represent unambiguous grammars, according to http://en.wikipedia.org/wiki/Parsing_expression_grammar.
+Two files, preprocessor.pijnu and mediawiki.pijnu, describe the MW syntax using patterns that form a grammar. A separate Python tool called Pijnu interprets those grammars and uses them to match the wikitext content and build the AST.
+Specific Python functions then render the leaves of the AST into the desired format.
-Parser libs
+The reason we use two grammars is that the templates in the wikitext are first substituted by a preprocessor before the content of the page is actually parsed.
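+
+Concretely, assuming the wikitext of a page is stored in a variable `source`, the whole pipeline boils down to two parse passes. The following is a condensed sketch of the configurable examples given in the "How to use in a program" section below, with all configuration values left empty:
+
+::
+
+ from preprocessor import make_parser as make_preprocessor
+ from html import make_parser as make_html_parser
+
+ preprocessor = make_preprocessor({})  # no templates in this sketch
+ parser = make_html_parser([], [], [], {}, {})  # no allowed tags, attributes, interwikis or namespaces
+
+ preprocessed_text = preprocessor.parse(source)  # template substitution
+ output = parser.parse(preprocessed_text.leaves())  # AST rendered as HTML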
+
+
+How to test
===========
-See:
-
-* http://wiki.python.org/moin/LanguageParsing
-* http://en.wikipedia.org/wiki/Comparison_of_parser_generators
-
-In the following lists, (+) signifies a pro, (-) a con, and (.) a neutral point.
-
-LEPL
-----
-* (.) Supports ambiguous grammars (doesn't matter: MW is unambiguous)
-* (.) Idiosyncratic syntax with lots of operator overloading (even slices!)
-* (.) Slow (http://www.quora.com/What-is-the-best-parser-generator-for-Python/answer/Matthew-Lloyd)
-* (+) Excellent docs
-
-PLY
----
-* (.) LALR or, optionally, SLR (Can SLR look ahead farther? No: actually, it has no lookahead.)
-* (-) LALR(1), which is grossly insufficient for MW. I think it's about a lookahead of 4, which we'd have to take care of ourselves, probably making the code hard to comprehend in the process.
-* (+) Modal lexer and parser. (Is this necessary? Understand what can't be BNF'd about apostrophe jungles and lists.)
-* (+) Easy to translate into C later
-* (.) Can turn off magic autodiscovery
-* (-) Potential for yacc to guess wrong about how to assemble symbols: reduce/reduce and shift/reduce errors
-* (+) Much faster than PyParsing? http://www.mefeedia.com/watch/29412150 at 23:58 suggests it's 5.5x faster. More benchmarks (ANTLR and more): http://www.dalkescientific.com/writings/diary/archive/2007/11/03/antlr_java.html
-* (-) A bit more verbose (but very clear)
-
-PyParsing
----------
-* (.) Recursive descent (LL) of PEGs
-* (+) Packrat, so O(n)
-* (+) Easy to write
-* (-) "[An LL(k) parser] may defer error detection to a different branch of the grammar due to backtracking, often making errors harder to localize across disjunctions with long common prefixes."—Wikipedia. I had that problem when writing a simple italics/bold parser: you have to keep the recursion stack in your head to make any sense of the debug info. I eventually gave up trying to fix it.
-
-PyBison
--------
-Not researched in depth.
-
-* (+) Claims to be nearly as fast as C
-* (-) Requires a C build step
-
-ANTLR
------
-* (-) Separate code generation step
-* (-) Slow because it generates a lot of function calls
-* (+) Can parse LL(k) grammars (arbitrary lookahead)
-
-SPARK
------
-* (+) Has an implementation of an Earley parser, which can do arbitrary lookahead in n^3 worst case.
-
-NLTK
-----
-* (+) Another Earley parser
-* (+) Long-lived. Under active development by multiple authors. Last released 4/2011.
-* (.) There's a good, free book about the project: http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html. Not sure how good the documentation about the code itself is, though.
-* (-) An enormous dependency
-
-PyGgy (http://pypi.python.org/pypi/pyggy/0.3)
----------------------------------------------
-* (.) Untested
-* (.) GLR parser
-* (+) Public domain
-* (-) Might be dead (the home page has disappeared: http://www.lava.net/~newsham/pyggy/)
-* (-) "PyGgy was written and tested with Python 2.2.3." (in 2003)
-
-Pijnu (http://spir.wikidot.com/pijnu)
--------------------------------------
-* (+) PEG. Easy, easy grammar definition.
-* (.) Looks promising but not mature. Author has given no thought to speed but much to clarity.
-* (-) Build step
-* (-) Currently no Unicode support
-* (+) Great docs: http://spir.wikidot.com/pijnu-user-guide
-* (+) Great error feedback
-* (+) The generated code looks like what you have to hand-write for PyParsing (see the user guide).
-* (+) Can handle having Unicode chars in the input.
-* (.) Can it handle having Unicode chars as part of parse rules? We might need guillemets.
-* (-) Eek, no tests! Throws DeprecationWarnings on import. Very unique coding style.
-
-PyMeta (https://launchpad.net/pymeta)
--------------------------------------
-* (.) PEG. Grammar defined in a DSL.
-* (+) No build step; converts grammar from a DSL at runtime.
-* (+) Good docs in the code
-* (-) Nobody's touched it for a year.
-
-PyMeta2 (http://www.allbuttonspressed.com/projects/pymeta)
-----------------------------------------------------------
-* (.) Is a port of PyMeta to "the simplified OMeta 2 syntax" (new DSL syntax).
-
-Ppeg (https://bitbucket.org/pmoore/ppeg/)
------------------------------------------
-* (-) Not in Python: Python code (21 kB) code is just an API for a C parser (172 kB)
-
-pyPEG (http://fdik.org/pyPEG/)
-------------------------------
-* (.) Only 340 lines of Python
-* (-) Similar to Pijnu but much less easy to use
-
-
-Previous implementations
-========================
-See: http://www.mediawiki.org/wiki/Alternative_parsers
-
-Py-wikimarkup (https://github.com/dcramer/py-wikimarkup)
---------------------------------------------------------
-* (+) Probably works (untested)
-* (-) Direct transformation from wikitext to HTML (generates no AST)
-* (-) As a direct port of the MW PHP, it is very difficult to understand or extend.
-* (-) Because it is based on a sequence of perilously combined regexes which interact in surprising ways, it, like MW proper, sometimes yields surprising output.
-
-mwlib (http://code.pediapress.com/wiki/wiki/mwlib)
---------------------------------------------------
-* (+) Works well, lots of unittests already defined and successfully passed
-* (+) Generates an AST
-* (.) Implements its own lexer/parser (see mwlib/refine/core.py and mwlib/refine/_core.pyx: compiled token walker)
-* (.) Seems to: tokenize the text and then apply ~20 different parsers one by one (see mwlib/refine/core.py#928 and #635)
-* (-) Structure of the code somewhat hard to understand (uparser.py vs old_uparser.py, etc.)
-* (-) Lot of code not related to parsing (fetching articles, (un)zip files, API stuff, output for ODF, Latex, etc. that should be more isolated from the parsing part)
-
-mediawiki_parser (this one)
----------------------------
-* (+) Good start (parser + lexer, unittests)
-* (.) Currently using PLY but will be abandoned due to the lack of lookahead
-* (-) Currently incomplete syntax
-* (-) Currently generates no AST
-
-Sweble (http://sweble.org/gitweb/)
-----------------------------------
-* (+) Works well: demo here: http://sweble.org/crystalball/
-* (.) Interesting description of the parser philosophy: http://sweble.org/gitweb/?p=sweble-wikitext.git;f=swc-parser-lazy/src/main/autogen/org/sweble/wikitext/lazy/parser/Content.rats;h=e6f0e250b01c3c76ce85a38ba75eb0fcbe636d7a;hb=899a68c087fb6439b4d60c3e6d3c7c025ac0d663
-* (.) Same for preprocessor: http://sweble.org/gitweb/?p=sweble-wikitext.git;a=blob;f=swc-parser-lazy/src/main/autogen/org/sweble/wikitext/lazy/preprocessor/Grammar.rats;h=c13e8a662178516f730d4c63115ba59210aa2481;hb=899a68c087fb6439b4d60c3e6d3c7c025ac0d663
-* (.) Uses the packrat xtc parser: http://www.cs.nyu.edu/rgrimm/xtc/rats.html
-* (-) Not simple...
+Currently, the simplest way to test the tool is to put some wikitext in the wikitext.txt file. Then run:
-Algorithms
-==========
+::
-Lexer + parser (e.g. PLY)
--------------------------
-* (+) Easy to use and debug
-* (+) Stateful (specific simple rules for each context)
-* (-) Not enough lookahead in the case of LR(1) parser
+ python parser.py
+
+and the wikitext will be rendered as HTML in the article.htm file.
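+
+Any MW markup can be used in wikitext.txt; for instance:
+
+::
+
+ == A title ==
+ Some '''bold''' and ''italic'' text with a [[link]].
+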
-Recursive descent of CFGs
-------------------------------------------
-* (+) No separate lexer and parser
-* (+) Memoization ("packrat") makes it run in O(n)
-* (.) Recursive
-* (-) May require large amounts of memory
-* (-) Quite hard to read and debug
+Other ways might be implemented in the future.
-Recursive descent of PEGs (e.g. Rats, PyParsing)
--------------------------------------
-* (+) No separate lexer and parser
-* (+) O(n) with packrat
-* (+) Resolves ambiguity by having precedence orders for productions. As a result, it is easy to extend a PEG with productions for use in special situations without wrecking the wider grammar. This could be a very big deal for our extensibility story.
-* (+) We can rip off Sweble's grammar.
-Earley parser (e.g. Spark, NLTK)
---------------------------------
-* (.) O(n³) in the general case, O(n²) for unambiguous grammars and O(n) for almost all LR(k) grammars
-* (.) Meant for context-free grammars, but may also work in context-free subsections of context-sensitive grammars according to this publication: http://danielmattosroberts.com/earley/context-sensitive-earley.pdf
-GLR parser (e.g. Pyggy)
------------------------
-* (.) Supports ambiguous grammars (which MW isn't)
-* (+) O(n) on deterministic grammars
+How to use in a program
+=======================
+Example for HTML
+----------------
+To render wikitext into HTML from a Python program, you can use the following lines:
-Previous work
-=============
-* (+) OCaml lexer implementation: http://www.mediawiki.org/wiki/MediaWiki_lexer
-* (+) Markup spec: http://www.mediawiki.org/wiki/Markup_spec
-* (+) BNF grammar: http://www.mediawiki.org/wiki/Markup_spec/BNF
+::
- * (+) Corresponds closely to yacc input format
- * (+) Pretty comprehensive: lots of English describing corner cases and error recovery
- * (.) Also discusses render phase
+ templates = {}
+ allowed_tags = []
+ allowed_self_closing_tags = []
+ allowed_attributes = []
+ interwiki = {}
+ namespaces = {}
+
+ from preprocessor import make_parser
+ preprocessor = make_parser(templates)
+
+ from html import make_parser
+ parser = make_parser(allowed_tags, allowed_self_closing_tags, allowed_attributes, interwiki, namespaces)
+
+ preprocessed_text = preprocessor.parse(source)
+ output = parser.parse(preprocessed_text.leaves())
+
+`output` will contain the rendered HTML. Describe the behavior you expect by filling in the variables defined on the first lines (a complete example follows the list):
+ * if the wikitext calls foreign templates, put their names and content in the `templates` dict (e.g.: `{'my template': 'my template content'}`)
+ * if some HTML tags are allowed on your wiki, list them in the `allowed_tags` list (e.g.: `['center', 'big', 'small', 'span']`; avoid `'script'` and some others, for security reasons)
+ * if some self-closing HTML tags are allowed on your wiki, list them in the `allowed_self_closing_tags` list (e.g.: `['br', 'hr']`; avoid `'script'` and some others, for security reasons)
+ * if some HTML tags are allowed on your wiki, list the attributes they may use in the `allowed_attributes` list (e.g.: `['style', 'class']`; avoid `'onclick'` and some others, for security reasons)
+ * if you want to be able to use interwiki links, list the foreign wikis in the `interwiki` dict (e.g.: `{'fr': 'http://fr.wikipedia.org/wiki/'}`)
+ * if you want to be able to distinguish between standard links, file inclusions or categories, list the namespaces of your wiki in the `namespaces` dict (e.g.: `{'Template': 10, 'Category': 14, 'File': 6}` where the numbers are the namespace codes used in MW)
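+
+For example, a wiki that allows a few common tags and attributes, the `fr` interwiki prefix and the standard namespaces could be configured as follows (illustrative values taken from the list above; adapt them to your own wiki):
+
+::
+
+ templates = {'my template': 'my template content'}
+ allowed_tags = ['center', 'big', 'small', 'span']
+ allowed_self_closing_tags = ['br', 'hr']
+ allowed_attributes = ['style', 'class']
+ interwiki = {'fr': 'http://fr.wikipedia.org/wiki/'}
+ namespaces = {'Template': 10, 'Category': 14, 'File': 6}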
+
+Example for text
+----------------
+To render wikitext into plain text from a Python program, you can use the following lines:
-* (+) EBNF grammar: http://www.mediawiki.org/wiki/Markup_spec/EBNF
+::
- * (+) Well-organized and concise
- * (-) Nothing about error recovery
- * (-) Wrong in some places (like the header rules that chew up whitespace)
+ templates = {}
+
+ from preprocessor import make_parser
+ preprocessor = make_parser(templates)
-* (+) flex implementation: http://www.mediawiki.org/wiki/Markup_spec/flex
+ from text import make_parser
+ parser = make_parser()
- * (-) Prints HTML directly; doesn't seem to have a consume/parse/render flow
- * (-) Doesn't seem very comprehensive. I converted it quickly to a PLY lex implementation (fixed the \135 codes and such), and it didn't seem to do a particularly good job recognizing things. There are some heuristics we can glean from it, however, like stripping any trailing comma or period off a scanned URL. Another example is that it doesn't look like it handles the "== H2 ===" case correctly.
+ preprocessed_text = preprocessor.parse(source)
+ output = parser.parse(preprocessed_text.leaves())
+
+`output` will contain the rendered text.
+If the wikitext calls foreign templates, put their names and content in the `templates` dict (e.g.: `{'my template': 'my template content'}`).
-Milestones
-==========
-* Understand what's so hard about apostrophes and lists (http://www.mediawiki.org/wiki/Markup_spec/BNF/Inline_text).
+Example for template substitution
+----------------------------------
+If you only want to substitute the templates in a given wikitext, you can call the preprocessor alone, with no rendering postprocessor:
+
+::
- * This claims MW isn't context-free and has C code on how to hack through the apostrophe jungle: http://web.archiveorange.com/archive/v/e7MXfq0OoW0nCOGyX0oa
- * This claims that MW is probably context-free: http://www.mediawiki.org/wiki/User_talk:Kanor#Response_to_article_in_Meatball
- * Useful background discussion by the folks who wrote the BNF attempt: http://www.mediawiki.org/wiki/Talk:Markup_spec
- * The flex markup looks to have naive apostrophe jungle state rules: http://www.mediawiki.org/wiki/Markup_spec/flex
- * mwlib has a pretty clean, decoupled Python impl. See styleanalyzer.py.
- * When rebalancing '''hi''' <b>''mo</b>m'', the algorithm seems to be something like this: read left to right, building a tag stack as we go. If we hit a closer that doesn't match what's on the top of the stack (1), close what's on the top (2), and let the closer through. HOWEVER, also put (1) onto another stack (or single var?) and, after doing step (2), push that stack onto the tag stack.
+ templates = {}
+
+ from preprocessor import make_parser
+ preprocessor = make_parser(templates)
-* (Done.) Get a parse tree out of a lib.
-* Think about extensibility
-* Get apostrophes working.
-* Implement productions, tag by tag
+ output = preprocessor.parse(source)
+
+`output` will contain the preprocessed wikitext, with the templates substituted.
+Put the template names and content in the `templates` dict (e.g.: `{'my template': 'my template content'}`).
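+
+As a minimal illustration of the expected behaviour (a sketch; exactly how `output` is turned back into a string depends on the parser's node objects):
+
+::
+
+ templates = {'my template': 'my template content'}
+
+ from preprocessor import make_parser
+ preprocessor = make_parser(templates)
+
+ output = preprocessor.parse('Hello {{my template}}!')
+ # `output` should correspond to the wikitext 'Hello my template content!'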
-Notes
-=====
-If we build the parse tree in custom lexer callbacks, we can make it an ElementTree or whatever we want--meaning we can use XPath on it later if we want.
+Known bugs
+==========
-Quasi Gantt chart
-=================
+This tool should be able to render any wikitext page into text or HTML.
-::
+However, it does not intend to be bug-for-bug compatible with MW. For instance, using HTML entities in template calls (e.g.: `{{temp&copy;late}}`) is currently not supported.
- Re-examing parsing algorithm,
- & implement links |----|----|---- Bold/Italics/Apostrophe Jungles (3 weeks) |----|----|---- HTML formatter |---- Showfor support |--
- & other long-lookahead productions
- (3 weeks) Simple productions:
- Paragraphs (3 days) |--
- HRs (1 day) |
- magic words (3 days) |--
-
- Tables (long lookahead?) (1 week) |----
-
- One person should do these:
- Includes (long lookahead?) (2 weeks) |----|----
- Templates w/params (long lookahead?) (2 weeks) |----|----
-
- Redirects (3 days) |--
- Naked URLs (long lookahead but doable in lexer?) (1 day) |
- Headers (long lookahead but doable in lexer) (done for now)
- Entities (done for now)
- Behavior switches (optional) (4 days--will require some architecture thinking) |---
-
- HTML tags: probably just tokenize and preserve them through the parser and |----|----|----
- then have a separate post-parse step to balance and validate them and, for
- example, escape any invalid ones (3 weeks)
+Please don't hesitate to report bugs that you may find when using this tool.
264 parsers.rst
@@ -0,0 +1,264 @@
+This is an archive of the research done by Erik Rose and Peter Potrowl in order to find, and finally build, a good Python parser for MediaWiki's syntax.
+
+Goals
+=====
+* Possible (to represent MediaWiki syntax). There is no such thing as invalid wiki markup, so we have to make some sense out of anything.
+* Extensible per project (for {for} syntax, DekiWiki conditionals, includes, templates, etc.)
+* Easier to comprehend than MW's existing formatter
+* Must work with Unicode input (Japanese, for example)
+
+
+Ungoals
+=======
+* Implement those MediaWiki language features which make no sense outside the MediaWiki product: for example, category links.
+* Be bug-for-bug compatible with MW. We can't constrain ourselves to give the exact same output for egregiously bad wikisyntax; we'll make a mess, which kills our goal of being easy to understand. We should catch common errors like bad quote nesting but not go out of our minds to be bug-for-bug compatible. After all, MW itself changes the language in every release. Let them chase us.
+
+
+MW language properties
+======================
+* Unambiguous. Rats-based Sweble parses it, and Rats is a PEG-based lib, and PEGs can't represent ambiguous grammars, according to http://en.wikipedia.org/wiki/Parsing_expression_grammar.
+
+
+Parser libs
+===========
+See:
+
+* http://wiki.python.org/moin/LanguageParsing
+* http://en.wikipedia.org/wiki/Comparison_of_parser_generators
+
+In the following lists, (+) signifies a pro, (-) a con, and (.) a neutral point.
+
+LEPL
+----
+* (.) Supports ambiguous grammars (doesn't matter: MW is unambiguous)
+* (.) Idiosyncratic syntax with lots of operator overloading (even slices!)
+* (.) Slow (http://www.quora.com/What-is-the-best-parser-generator-for-Python/answer/Matthew-Lloyd)
+* (+) Excellent docs
+
+PLY
+---
+* (.) LALR or, optionally, SLR (Can SLR look ahead farther? No: actually, it has no lookahead.)
+* (-) LALR(1), which is grossly insufficient for MW. I think it's about a lookahead of 4, which we'd have to take care of ourselves, probably making the code hard to comprehend in the process.
+* (+) Modal lexer and parser. (Is this necessary? Understand what can't be BNF'd about apostrophe jungles and lists.)
+* (+) Easy to translate into C later
+* (.) Can turn off magic autodiscovery
+* (-) Potential for yacc to guess wrong about how to assemble symbols: reduce/reduce and shift/reduce errors
+* (+) Much faster than PyParsing? http://www.mefeedia.com/watch/29412150 at 23:58 suggests it's 5.5x faster. More benchmarks (ANTLR and more): http://www.dalkescientific.com/writings/diary/archive/2007/11/03/antlr_java.html
+* (-) A bit more verbose (but very clear)
+
+PyParsing
+---------
+* (.) Recursive descent (LL) of PEGs
+* (+) Packrat, so O(n)
+* (+) Easy to write
+* (-) "[An LL(k) parser] may defer error detection to a different branch of the grammar due to backtracking, often making errors harder to localize across disjunctions with long common prefixes."—Wikipedia. I had that problem when writing a simple italics/bold parser: you have to keep the recursion stack in your head to make any sense of the debug info. I eventually gave up trying to fix it.
+
+PyBison
+-------
+Not researched in depth.
+
+* (+) Claims to be nearly as fast as C
+* (-) Requires a C build step
+
+ANTLR
+-----
+* (-) Separate code generation step
+* (-) Slow because it generates a lot of function calls
+* (+) Can parse LL(k) grammars (arbitrary lookahead)
+
+SPARK
+-----
+* (+) Has an implementation of an Earley parser, which can do arbitrary lookahead in O(n³) worst case.
+
+NLTK
+----
+* (+) Another Earley parser
+* (+) Long-lived. Under active development by multiple authors. Last released 4/2011.
+* (.) There's a good, free book about the project: http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html. Not sure how good the documentation about the code itself is, though.
+* (-) An enormous dependency
+
+PyGgy (http://pypi.python.org/pypi/pyggy/0.3)
+---------------------------------------------
+* (.) Untested
+* (.) GLR parser
+* (+) Public domain
+* (-) Might be dead (the home page has disappeared: http://www.lava.net/~newsham/pyggy/)
+* (-) "PyGgy was written and tested with Python 2.2.3." (in 2003)
+
+Pijnu (http://spir.wikidot.com/pijnu)
+-------------------------------------
+* (+) PEG. Easy, easy grammar definition.
+* (.) Looks promising but not mature. Author has given no thought to speed but much to clarity.
+* (-) Build step
+* (-) Currently no Unicode support
+* (+) Great docs: http://spir.wikidot.com/pijnu-user-guide
+* (+) Great error feedback
+* (+) The generated code looks like what you have to hand-write for PyParsing (see the user guide).
+* (+) Can handle having Unicode chars in the input.
+* (.) Can it handle having Unicode chars as part of parse rules? We might need guillemets.
+* (-) Eek, no tests! Throws DeprecationWarnings on import. Very unique coding style.
+
+PyMeta (https://launchpad.net/pymeta)
+-------------------------------------
+* (.) PEG. Grammar defined in a DSL.
+* (+) No build step; converts grammar from a DSL at runtime.
+* (+) Good docs in the code
+* (-) Nobody's touched it for a year.
+
+PyMeta2 (http://www.allbuttonspressed.com/projects/pymeta)
+----------------------------------------------------------
+* (.) Is a port of PyMeta to "the simplified OMeta 2 syntax" (new DSL syntax).
+
+Ppeg (https://bitbucket.org/pmoore/ppeg/)
+-----------------------------------------
+* (-) Not in Python: the Python code (21 kB) is just an API for a C parser (172 kB)
+
+pyPEG (http://fdik.org/pyPEG/)
+------------------------------
+* (.) Only 340 lines of Python
+* (-) Similar to Pijnu but much less easy to use
+
+
+Previous implementations
+========================
+See: http://www.mediawiki.org/wiki/Alternative_parsers
+
+Py-wikimarkup (https://github.com/dcramer/py-wikimarkup)
+--------------------------------------------------------
+* (+) Probably works (untested)
+* (-) Direct transformation from wikitext to HTML (generates no AST)
+* (-) As a direct port of the MW PHP, it is very difficult to understand or extend.
+* (-) Because it is based on a sequence of perilously combined regexes which interact in surprising ways, it, like MW proper, sometimes yields surprising output.
+
+mwlib (http://code.pediapress.com/wiki/wiki/mwlib)
+--------------------------------------------------
+* (+) Works well, lots of unittests already defined and successfully passed
+* (+) Generates an AST
+* (.) Implements its own lexer/parser (see mwlib/refine/core.py and mwlib/refine/_core.pyx: compiled token walker)
+* (.) Seems to: tokenize the text and then apply ~20 different parsers one by one (see mwlib/refine/core.py#928 and #635)
+* (-) Structure of the code somewhat hard to understand (uparser.py vs old_uparser.py, etc.)
+* (-) Lot of code not related to parsing (fetching articles, (un)zip files, API stuff, output for ODF, Latex, etc. that should be more isolated from the parsing part)
+
+mediawiki_parser (this one)
+---------------------------
+* (+) Good start (parser + lexer, unittests)
+* (.) Currently using PLY but will be abandoned due to the lack of lookahead
+* (-) Currently incomplete syntax
+* (-) Currently generates no AST
+
+Sweble (http://sweble.org/gitweb/)
+----------------------------------
+* (+) Works well: demo here: http://sweble.org/crystalball/
+* (.) Interesting description of the parser philosophy: http://sweble.org/gitweb/?p=sweble-wikitext.git;f=swc-parser-lazy/src/main/autogen/org/sweble/wikitext/lazy/parser/Content.rats;h=e6f0e250b01c3c76ce85a38ba75eb0fcbe636d7a;hb=899a68c087fb6439b4d60c3e6d3c7c025ac0d663
+* (.) Same for preprocessor: http://sweble.org/gitweb/?p=sweble-wikitext.git;a=blob;f=swc-parser-lazy/src/main/autogen/org/sweble/wikitext/lazy/preprocessor/Grammar.rats;h=c13e8a662178516f730d4c63115ba59210aa2481;hb=899a68c087fb6439b4d60c3e6d3c7c025ac0d663
+* (.) Uses the packrat xtc parser: http://www.cs.nyu.edu/rgrimm/xtc/rats.html
+* (-) Not simple...
+
+
+Algorithms
+==========
+
+Lexer + parser (e.g. PLY)
+-------------------------
+* (+) Easy to use and debug
+* (+) Stateful (specific simple rules for each context)
+* (-) Not enough lookahead in the case of LR(1) parser
+
+Recursive descent of CFGs
+------------------------------------------
+* (+) No separate lexer and parser
+* (+) Memoization ("packrat") makes it run in O(n)
+* (.) Recursive
+* (-) May require large amounts of memory
+* (-) Quite hard to read and debug
+
+Recursive descent of PEGs (e.g. Rats, PyParsing)
+-------------------------------------
+* (+) No separate lexer and parser
+* (+) O(n) with packrat
+* (+) Resolves ambiguity by having precedence orders for productions. As a result, it is easy to extend a PEG with productions for use in special situations without wrecking the wider grammar. This could be a very big deal for our extensibility story.
+* (+) We can rip off Sweble's grammar.
+
+Earley parser (e.g. Spark, NLTK)
+--------------------------------
+* (.) O(n³) in the general case, O(n²) for unambiguous grammars and O(n) for almost all LR(k) grammars
+* (.) Meant for context-free grammars, but may also work in context-free subsections of context-sensitive grammars according to this publication: http://danielmattosroberts.com/earley/context-sensitive-earley.pdf
+
+GLR parser (e.g. Pyggy)
+-----------------------
+* (.) Supports ambiguous grammars (which MW isn't)
+* (+) O(n) on deterministic grammars
+
+
+Previous work
+=============
+* (+) OCaml lexer implementation: http://www.mediawiki.org/wiki/MediaWiki_lexer
+* (+) Markup spec: http://www.mediawiki.org/wiki/Markup_spec
+* (+) BNF grammar: http://www.mediawiki.org/wiki/Markup_spec/BNF
+
+ * (+) Corresponds closely to yacc input format
+ * (+) Pretty comprehensive: lots of English describing corner cases and error recovery
+ * (.) Also discusses render phase
+
+* (+) EBNF grammar: http://www.mediawiki.org/wiki/Markup_spec/EBNF
+
+ * (+) Well-organized and concise
+ * (-) Nothing about error recovery
+ * (-) Wrong in some places (like the header rules that chew up whitespace)
+
+* (+) flex implementation: http://www.mediawiki.org/wiki/Markup_spec/flex
+
+ * (-) Prints HTML directly; doesn't seem to have a consume/parse/render flow
+ * (-) Doesn't seem very comprehensive. I converted it quickly to a PLY lex implementation (fixed the \135 codes and such), and it didn't seem to do a particularly good job recognizing things. There are some heuristics we can glean from it, however, like stripping any trailing comma or period off a scanned URL. Another example is that it doesn't look like it handles the "== H2 ===" case correctly.
+
+
+Milestones
+==========
+* Understand what's so hard about apostrophes and lists (http://www.mediawiki.org/wiki/Markup_spec/BNF/Inline_text).
+
+ * This claims MW isn't context-free and has C code on how to hack through the apostrophe jungle: http://web.archiveorange.com/archive/v/e7MXfq0OoW0nCOGyX0oa
+ * This claims that MW is probably context-free: http://www.mediawiki.org/wiki/User_talk:Kanor#Response_to_article_in_Meatball
+ * Useful background discussion by the folks who wrote the BNF attempt: http://www.mediawiki.org/wiki/Talk:Markup_spec
+ * The flex markup looks to have naive apostrophe jungle state rules: http://www.mediawiki.org/wiki/Markup_spec/flex
+ * mwlib has a pretty clean, decoupled Python impl. See styleanalyzer.py.
+ * When rebalancing '''hi''' <b>''mo</b>m'', the algorithm seems to be something like this: read left to right, building a tag stack as we go. If we hit a closer that doesn't match what's on the top of the stack (1), close what's on the top (2), and let the closer through. HOWEVER, also put (1) onto another stack (or single var?) and, after doing step (2), push that stack onto the tag stack. (A rough Python sketch of this idea follows this list.)
+
+* (Done.) Get a parse tree out of a lib.
+* Think about extensibility
+* Get apostrophes working.
+* Implement productions, tag by tag
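+
+A rough Python sketch of the rebalancing idea described in the apostrophe-jungle item above (an interpretation of that note, not code taken from mwlib):
+
+::
+
+ def rebalance(tokens):
+     """tokens: a list of ('open', tag), ('close', tag) or ('text', s) tuples."""
+     out, stack = [], []
+     for kind, value in tokens:
+         if kind == 'text':
+             out.append(value)
+         elif kind == 'open':
+             stack.append(value)
+             out.append('<%s>' % value)
+         else:  # a closing tag
+             reopen = []
+             # Close whatever sits on top of the stack without matching...
+             while stack and stack[-1] != value:
+                 top = stack.pop()
+                 out.append('</%s>' % top)
+                 reopen.append(top)
+             if stack:
+                 stack.pop()  # the matching opener, if any
+             out.append('</%s>' % value)  # let the closer through
+             # ...then reopen the closed tags right after the closer.
+             while reopen:
+                 tag = reopen.pop()
+                 stack.append(tag)
+                 out.append('<%s>' % tag)
+     while stack:  # close anything left open at the end of the input
+         out.append('</%s>' % stack.pop())
+     return ''.join(out)
+
+ tokens = [('open', 'b'), ('text', 'hi'), ('close', 'b'), ('text', ' '),
+           ('open', 'b'), ('open', 'i'), ('text', 'mo'), ('close', 'b'),
+           ('text', 'm'), ('close', 'i')]
+ print(rebalance(tokens))  # <b>hi</b> <b><i>mo</i></b><i>m</i>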
+
+
+Notes
+=====
+If we build the parse tree in custom lexer callbacks, we can make it an ElementTree or whatever we want--meaning we can use XPath on it later if we want.
+
+
+Quasi Gantt chart
+=================
+
+::
+
+ Re-examining parsing algorithm,
+ & implement links |----|----|---- Bold/Italics/Apostrophe Jungles (3 weeks) |----|----|---- HTML formatter |---- Showfor support |--
+ & other long-lookahead productions
+ (3 weeks) Simple productions:
+ Paragraphs (3 days) |--
+ HRs (1 day) |
+ magic words (3 days) |--
+
+ Tables (long lookahead?) (1 week) |----
+
+ One person should do these:
+ Includes (long lookahead?) (2 weeks) |----|----
+ Templates w/params (long lookahead?) (2 weeks) |----|----
+
+ Redirects (3 days) |--
+ Naked URLs (long lookahead but doable in lexer?) (1 day) |
+ Headers (long lookahead but doable in lexer) (done for now)
+ Entities (done for now)
+ Behavior switches (optional) (4 days--will require some architecture thinking) |---
+
+ HTML tags: probably just tokenize and preserve them through the parser and |----|----|----
+ then have a separate post-parse step to balance and validate them and, for
+ example, escape any invalid ones (3 weeks)