
Add error reporting. Fix #28.

* Now, rather than returning None, parse() and match() raise ParseError if they don't succeed. ParseError gives you a nice, human-readable unicode representation as well as some attributes that let you construct your own presentation.
* Grammar construction now raises ParseError rather than BadGrammar if it can't parse your rules.
* Make UndefinedLabel subclass StrAndRepr so __str__ returns the right type.
* Remove some things from the future-directions sections of the readme that are done or just stupid.
* Add Grammar to top-level package namespace. I'm tired of typing ".grammar" at the REPL.
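
To make the new contract concrete, here is a minimal standalone sketch of the error-tracking idea the diff below introduces: one shared ParseError instance is threaded through the private _match() calls, each failing expression records itself on it if it failed at least as far into the text as anything before, and the public parse() raises instead of returning None. This is not the code from this commit. It is written for Python 3, leaves out the packrat cache and Node trees (match results are just end offsets), and its Literal and Sequence classes are simplified stand-ins whose bodies are assumptions made for illustration.

```python
class ParseError(Exception):
    """Stand-in error carrying the text, failure position, and expression."""

    def __init__(self, text, pos=-1, expr=None):
        super().__init__(text, pos, expr)
        self.text = text
        self.pos = pos
        self.expr = expr

    def __str__(self):
        return "Rule '%s' didn't match at '%s'." % (
            self.expr.name, self.text[self.pos:self.pos + 20])


class IncompleteParseError(ParseError):
    """The expression matched but did not consume the entire text."""


class Expression:
    def __init__(self, name=''):
        self.name = name

    def parse(self, text):
        # One error object is shared by every sub-expression during a parse.
        error = ParseError(text)
        end = self._match(text, 0, error)
        if end is None:
            raise error
        if end < len(text):
            raise IncompleteParseError(text, end, self)
        return end

    def _match(self, text, pos, error):
        end = self._uncached_match(text, pos, error)
        # Record the furthest failure, preferring named expressions.
        if end is None and pos >= error.pos and (
                self.name or getattr(error.expr, 'name', None) is None):
            error.expr = self
            error.pos = pos
        return end


class Literal(Expression):
    def __init__(self, literal, name=''):
        super().__init__(name)
        self.literal = literal

    def _uncached_match(self, text, pos, error):
        if text.startswith(self.literal, pos):
            return pos + len(self.literal)
        return None


class Sequence(Expression):
    def __init__(self, members, name=''):
        super().__init__(name)
        self.members = members

    def _uncached_match(self, text, pos, error):
        for member in self.members:
            pos = member._match(text, pos, error)
            if pos is None:
                return None
        return pos


bold = Sequence([Literal('((', 'open'), Literal('hi', 'text'),
                 Literal('))', 'close')], name='bold')
try:
    bold.parse('((hi]]')
except ParseError as exc:
    message = str(exc)  # names the rule that failed furthest into the text
```

Because the enclosing Sequence fails at position 0, which is behind the position its last member reached, it does not overwrite the more specific failure recorded by that member.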
1 parent 82b8c55 commit f498d0600322aceaf609e9f84d7adc27dec5e563 @erikrose committed Apr 27, 2013
@@ -64,10 +64,9 @@ Status
in the future, so you can write grammars without fear.
* It may be slow and use a lot of RAM; I haven't measured either yet. However,
I have yet to begin optimizing in earnest.
-* Error reporting is fairly uninformative, and debugging is nonexistent.
- However, ``repr`` methods of expressions, grammars, and nodes are very clear
- and helpful. Ones of ``Grammar`` objects are even round-trippable! Huge
- things are planned for grammar debugging in the future.
+* Error reporting is now in place. ``repr`` methods of expressions, grammars,
+ and nodes are clear and helpful as well. Ones of ``Grammar`` objects are
+ even round-trippable!
* The grammar extensibility story is underdeveloped at the moment. You should
be able to extend a grammar by simply concatenating more rules onto the
existing ones; later rules of the same name should override previous ones.
@@ -81,6 +80,7 @@ Coming Soon
* Optimizations to make Parsimonious worthy of its name
* Tighter RAM use
* Better-thought-out grammar extensibility story
+* Amazing grammar debugging
A Little About PEG Parsers
@@ -109,15 +109,18 @@ Thus, ambiguity is resolved by always yielding the first successful recognition.
Writing Grammars
================
-Grammars are defined by a series of rules, one per line. The syntax should be
-familiar to anyone who uses regexes or reads programming language manuals. An
-example will serve best::
+Grammars are defined by a series of rules. The syntax should be familiar to
+anyone who uses regexes or reads programming language manuals. An example will
+serve best::
styled_text = bold_text / italic_text
bold_text = "((" text "))"
italic_text = "''" text "''"
text = ~"[A-Z 0-9]*"i
+You can wrap a rule across multiple lines if you like; the syntax is very
+forgiving.
+
Syntax Reference
----------------
@@ -289,10 +292,6 @@ Optimizations
* We could possibly compile the grammar into VM instructions, like in "A
parsing machine for PEGs" by Medeiros.
* If the recursion gets too deep in practice, use trampolining to dodge it.
-* It looks like we could make an architecture-independent ``.o`` file and use LLVM
- to JIT it to whatever arch we're on: https://github.com/dabeaz/bitey/. Of
- course, then everybody has to have LLVM, which is even harder to set up than
- a vanilla C toolchain.
Niceties
--------
@@ -309,6 +308,19 @@ Version History
===============
0.5
+ .. warning::
+
+ This release makes some backward-incompatible changes. See below.
+
+ * Add error reporting. Now, rather than returning ``None``, ``parse()`` and
+ ``match()`` raise ``ParseError`` if they don't succeed. This makes more
+ sense, since you'd rarely attempt to parse something and not care if it
+ succeeds. It was too easy before to forget to check for a ``None`` result.
+ ``ParseError`` gives you a human-readable unicode representation as well as
+ some attributes that let you construct your own custom presentation.
+ * Grammar construction now raises ``ParseError`` rather than ``BadGrammar``
+ if it can't parse your rules.
+ * Make the ``__str__()`` method of ``UndefinedLabel`` return the right type.
* Support splitting rules across multiple lines, interleaving comments,
putting multiple rules on one line (but don't do that) and all sorts of
other horrific behavior.
@@ -1 +1 @@
-
+from parsimonious.grammar import Grammar
@@ -1,5 +1,34 @@
-class BadGrammar(Exception):
- """The rule definitions passed to Grammar contain syntax errors."""
+from parsimonious.utils import StrAndRepr
+
+
+class ParseError(StrAndRepr, Exception):
+ """A call to ``Expression.parse()`` or ``match()`` didn't match."""
+
+ def __init__(self, text, pos=-1, expr=None):
+ # TODO: It would be nice to use self.args, but I don't want to pay a
+ # penalty to call descriptors or have the confusion of numerical
+ # indices in Expression._match().
+ self.text = text
+ self.pos = pos
+ self.expr = expr
+
+ def __unicode__(self):
+ if self.expr.name:
+ rule_name = u"'%s'" % self.expr.name
+ else:
+ rule_name = unicode(self.expr)
+ return u"Rule %s didn't match at '%s'." % (rule_name, self.text[self.pos:self.pos + 20])
+
+ # TODO: Add line, col, and separated-out error message so callers can build
+ # their own presentation.
+
+
+class IncompleteParseError(ParseError):
+ """A call to ``parse()`` matched a whole Expression but did not consume the
+ entire text."""
+
+ def __unicode__(self):
+ return u"Top-level rule '%s' completed, but it didn't consume the entire text. The non-matching portion of the text begins with '%s'." % (self.expr.name, self.text[self.pos:self.pos + 20])
class VisitationError(Exception):
@@ -31,7 +60,7 @@ def __init__(self, exc, exc_class, node):
node.prettily(error=node)))
-class UndefinedLabel(VisitationError):
+class UndefinedLabel(StrAndRepr, VisitationError):
"""A rule referenced in a grammar was never defined.
Circular references and forward references are okay, but you have to define
@@ -43,5 +72,3 @@ def __init__(self, label):
def __unicode__(self):
return u'The label "%s" was never defined.' % self.label
-
- __str__ = __unicode__
@@ -8,6 +8,7 @@
import re
+from parsimonious.exceptions import ParseError, IncompleteParseError
from parsimonious.nodes import Node, RegexNode
from parsimonious.utils import StrAndRepr
@@ -28,25 +29,51 @@ class Expression(StrAndRepr):
def __init__(self, name=''):
self.name = name
- def parse(self, text):
+ def parse(self, text, pos=0):
"""Return a parse tree of ``text``.
- Return ``None`` if the expression doesn't match the full string.
+ Raise ``ParseError`` if the expression wasn't satisfied. Raise
+ ``IncompleteParseError`` if the expression was satisfied but didn't
+ consume the full string.
"""
- node = self.match(text)
- if node is None or node.end - node.start != len(text): # TODO: Why not test just end here? Are we going to add a pos kwarg or something?
- # If it was not a complete parse, return None:
- return None
+ error = ParseError(text)
+ node = self._match(text, pos, {}, error)
+ if node is None:
+ # The expression did not match at all.
+ raise error
+ if node.end < len(text):
+ raise IncompleteParseError(text, node.end, self)
return node
- def match(self, text, pos=0, cache=None):
- """Return the ``Node`` matching this expression at the given position.
+ def match(self, text, pos=0):
+ """Return the parse tree matching this expression at the given
+ position, not necessarily extending all the way to the end of ``text``.
- Return ``None`` if it doesn't match there. Check the cache first.
+ Raise ``ParseError`` if there is no match there.
:arg pos: The index at which to start matching
+ """
+ error = ParseError(text)
+ node = self._match(text, pos, {}, error)
+ if node is None:
+ raise error
+ return node
+
+ def _match(self, text, pos, cache, error):
+ """Internal-only guts of ``match()``
+
+ :arg cache: The packrat cache::
+
+ {(oid, pos): Node tree matched by object `oid` at index `pos` ...}
+
+ :arg error: A ParseError instance with ``text`` already filled in but
+ otherwise blank. We update the error reporting info on this object
+ as we go. (Sticking references on an existing instance is faster
+ than allocating a new one for each expression that fails.) We
+ return None rather than raising and catching ParseErrors because
+ catching is slow.
"""
# TODO: Optimize. Probably a hot spot.
#
@@ -62,18 +89,25 @@ def match(self, text, pos=0, cache=None):
# only the results of entire rules, not subexpressions (probably a
# horrible idea for rules that need to backtrack internally a lot). (2)
# Age stuff out of the cache somehow. LRU? (3) Cuts.
- if cache is None:
- # The packrat cache. {(oid, pos): Node tree matched by object `oid`
- # at index `pos`
- # ...}
- cache = {}
expr_id = id(self)
- cached = cache.get((expr_id, pos), ())
- if cached is not ():
- return cached
- uncached = self._uncached_match(text, pos, cache)
- cache[(expr_id, pos)] = uncached
- return uncached
+ node = cache.get((expr_id, pos), ())
+ if node is ():
+ node = cache[(expr_id, pos)] = self._uncached_match(text,
+ pos,
+ cache,
+ error)
+
+ # Record progress for error reporting:
+ if node is None and pos >= error.pos and (
+ self.name or getattr(error.expr, 'name', None) is None):
+ # Don't bother reporting on unnamed expressions (unless that's all
+ # we've seen so far), as they're hard to track down for a human.
+ # Perhaps we could include the unnamed subexpressions later as
+ # auxiliary info.
+ error.expr = self
+ error.pos = pos
+
+ return node
def __unicode__(self):
return u'<%s %s at 0x%s>' % (
@@ -117,7 +151,7 @@ def __init__(self, literal, name=''):
super(Literal, self).__init__(name)
self.literal = literal
- def _uncached_match(self, text, pos, cache):
+ def _uncached_match(self, text, pos, cache, error):
if text.startswith(self.literal, pos):
return Node(self.name, text, pos, pos + len(self.literal))
@@ -145,7 +179,7 @@ def __init__(self, pattern, name='', ignore_case=False, locale=False,
(unicode and re.U) |
(verbose and re.X))
- def _uncached_match(self, text, pos, cache):
+ def _uncached_match(self, text, pos, cache, error):
"""Return length of match, ``None`` if no match."""
m = self.re.match(text, pos)
if m is not None:
@@ -184,12 +218,12 @@ class Sequence(_Compound):
after another.
"""
- def _uncached_match(self, text, pos, cache):
+ def _uncached_match(self, text, pos, cache, error):
new_pos = pos
length_of_sequence = 0
children = []
for m in self.members:
- node = m.match(text, new_pos, cache)
+ node = m._match(text, new_pos, cache, error)
if node is None:
return None
children.append(node)
@@ -209,9 +243,9 @@ class OneOf(_Compound):
wins.
"""
- def _uncached_match(self, text, pos, cache):
+ def _uncached_match(self, text, pos, cache, error):
for m in self.members:
- node = m.match(text, pos, cache)
+ node = m._match(text, pos, cache, error)
if node is not None:
# Wrap the succeeding child in a node representing the OneOf:
return Node(self.name, text, pos, node.end, children=[node])
@@ -228,8 +262,8 @@ class Lookahead(_Compound):
# Downside: pretty-printed grammars might be spelled differently than what
# went in. That doesn't bother me.
- def _uncached_match(self, text, pos, cache):
- node = self.members[0].match(text, pos, cache)
+ def _uncached_match(self, text, pos, cache, error):
+ node = self.members[0]._match(text, pos, cache, error)
if node is not None:
return Node(self.name, text, pos, pos)
@@ -243,10 +277,10 @@ class Not(_Compound):
In any case, it never consumes any characters; it's a negative lookahead.
"""
- def _uncached_match(self, text, pos, cache):
+ def _uncached_match(self, text, pos, cache, error):
# FWIW, the implementation in Parsing Techniques in Figure 15.29 does
# not bother to cache NOTs directly.
- node = self.members[0].match(text, pos, cache)
+ node = self.members[0]._match(text, pos, cache, error)
if node is None:
return Node(self.name, text, pos, pos)
@@ -265,8 +299,8 @@ class Optional(_Compound):
consumes. Otherwise, it consumes nothing.
"""
- def _uncached_match(self, text, pos, cache):
- node = self.members[0].match(text, pos, cache)
+ def _uncached_match(self, text, pos, cache, error):
+ node = self.members[0]._match(text, pos, cache, error)
return (Node(self.name, text, pos, pos) if node is None else
Node(self.name, text, pos, node.end, children=[node]))
@@ -277,11 +311,11 @@ def _as_rhs(self):
# TODO: Merge with OneOrMore.
class ZeroOrMore(_Compound):
"""An expression wrapper like the * quantifier in regexes."""
- def _uncached_match(self, text, pos, cache):
+ def _uncached_match(self, text, pos, cache, error):
new_pos = pos
children = []
while True:
- node = self.members[0].match(text, new_pos, cache)
+ node = self.members[0]._match(text, new_pos, cache, error)
if node is None or not (node.end - node.start):
# Node was None or 0 length. 0 would otherwise loop infinitely.
return Node(self.name, text, pos, new_pos, children)
@@ -308,11 +342,11 @@ def __init__(self, member, name='', min=1):
super(OneOrMore, self).__init__(member, name=name)
self.min = min
- def _uncached_match(self, text, pos, cache):
+ def _uncached_match(self, text, pos, cache, error):
new_pos = pos
children = []
while True:
- node = self.members[0].match(text, new_pos, cache)
+ node = self.members[0]._match(text, new_pos, cache, error)
if node is None:
break
children.append(node)
@@ -7,7 +7,7 @@
"""
import ast
-from parsimonious.exceptions import BadGrammar, UndefinedLabel
+from parsimonious.exceptions import UndefinedLabel
from parsimonious.expressions import (Literal, Regex, Sequence, OneOf,
Lookahead, Optional, ZeroOrMore, OneOrMore, Not)
from parsimonious.nodes import NodeVisitor
@@ -76,11 +76,6 @@ def _expressions_from_rules(self, rules):
"""
tree = rule_grammar.parse(rules)
- if tree is None:
- raise BadGrammar('There is an error in your grammar definition. '
- 'Sorry for the vague error reporting at the '
- 'moment.')
-
return RuleVisitor().visit(tree)
def parse(self, text):