Generalizing and augmenting PEGs #47

keleshev · 2014-07-10T09:49:50Z

I wish to parse (1) binary data using PEGs, or (2) token streams (like #40). Also, I wish to parse (3) more grammars than PEGs allow by somehow augmenting the grammar with some Python code and maybe some state, because I don't want to throw away my grammar and rewrite the parser manually as soon as I want to add a feature that is not parseable by PEGs, like, say, whitespace-sensitivity.

What do you think?

erikrose · 2014-07-11T06:08:48Z

Wow, that's 3 wishes, 3 tickets. :-)

I had originally designed Parsimonious to work against either binary or Unicode data. (I'm not sure I went all the way and actually tested it, but that was my intent or, rather, my failure to commit to a decision one way or the other.) I was just about to declare that it is for Unicode and nothing else, and now you come up with this. :-) What's your use case? Sell me!
I need some time or some more examples to wrap my brain around the use case for that. It might be a neat feature.
One of my core goals has been to avoid cluttering up the grammar with inline Python and such; I'd like them to stay PEG-like. Do you have any suggestions on how we can accomplish extension without an explosion in the grammar definition language?
Out of curiosity (and as motivation), are you still playing around with that docopt branch that uses Parsimonious?

keleshev · 2014-07-11T11:52:48Z

At one of my previous jobs I needed to parse quite a lot of binary formats. I would rather want to parse it with a parser tool, than with code that looks like this.
Maybe something like this:

grammar = Grammar("""
    foo = "bar" / "baz" / quux
    quux = "(" spam* ")"""",
    spam = Rule(lambda source: ...)
)

And in case of a visitor:

class MyLanguage(Language):

    @rule('"bar" / "baz" / quux')
    def foo(self, node, children):
        ...

    @rule(lambda source: ...)
    def spam(self, node, children)
        ...

I'm probably over-engineering, but my (almost impossible) wish was to define most of docopt in terms of PEG, then make sure there are good, interoperable PEG implementations for most popular programming languages, then port all docopt ports to use PEG. That's why I made peg.rb, and was hoping to make some adaptor for parsimonious to parse Ford's notation (or to convince you to move to Ford's notation).

erikrose · 2014-07-11T15:35:45Z

I like the spam = a lot. So cleanly did the extra kwargs match up with the string-ensconced rules above that I didn't even notice the end of the string at first. The beauty could be worth making it harder to add dedicated constructor kwargs in the future.

Now that I think about it, Parsimonious already has a logical extension point for custom expression code: just make something that implements the Expression object's interface. We could even promote simple lambdas into Expressions transparently, though the lambdas would need a few more params than in your sketch:

grammar = Grammar("""
    foo = "bar" / "baz" / quux
    quux = "(" spam* ")"
    """,
    spam=lambda text, pos, cache, error: return Node(...))

We'd probably also want to pass the Grammar itself into the lambda so it could call out to other rules if it liked. A natural way to do this might be to make the lambda appear to be an instance method, receiving the Grammar as an initial self arg.

keleshev · 2014-07-11T16:31:42Z

👍

erikrose · 2014-07-12T04:34:07Z

Here's a little teaser of tonight's work:

def expression(callable, rule_name, grammar):
    """Turn a plain callable into an Expression

    The callable can be of this simple form::

        def foo(text, pos):
            # end_pos = the offset in `text` where I stop matching
            # children = child nodes, if any
            if the expression matched:
                return end_pos, children

    If there are no children, it can return just an int::

        return end_pos

    If the expression doesn't match at the given ``pos`` at all... ::

        return None

    If your callable needs to make sub-calls to other rules in the grammar, it
    can take this form, adding additional arguments::

        def foo(text, pos, cache, error, grammar):
            # Call out to other rules:
            matched = grammar['another_rule']._match(text, pos, cache, error)
            ...
            # Return values as above.

    The return value of the callable, if an int or a tuple, will be
    automatically transmuted into a Node. If it returns a Node-like class
    directly, it will be passed through unchanged.

    """
    num_args = len(getargspec(callable).args)
    if num_args == 2:
        is_simple = True
    elif num_args == 5:
        is_simple = False
    else:
        raise RuntimeError("Custom rule functions must take either 2 or 5 "
                           "arguments, not %s." % num_args)

    class AdHocExpression(Expression):
        def _uncached_match(self, text, pos, cache, error):
            if is_simple:
                result = callable(text, pos)
            else:
                result = callable(text, pos, cache, error, grammar)

            if isinstance((long, int), result):
                end, children = result, None
            elif isinstance(tuple, result):
                end, children = result
            else:
                # Node or None
                return result
            return Node(self.name, text, pos, end, children=children)

        def _as_rhs(self):
            return '{custom function "%s"}' % callable.__name__

    return AdHocExpression(name=rule_name)

Want to know something disgusting? You could actually use this to add state to your parser—not just code. Since the 5-arg version of a custom rule gets passed the cache—the canonical repository of state during a parse—you could abuse it to squirrel away your own stuff, as long as you take care not to collide with the packrat items. And those shouldn't be hard to dodge, since they're all 2-tuples of (pos, expression ID). Evil! :-D

erikrose · 2014-07-12T04:35:21Z

Example:

grammar = Grammar("""
    bracketed_digit = start digit end
    start = '['
    end = ']'""",
    digit = lambda text, pos:
                (pos + 1) if text[pos].isdigit() else None)

erikrose · 2014-07-13T01:09:21Z

@halst, take a look at the new stuff, and let me know what you think! The tests provide pretty good examples: 029cb2f#diff-23015ea821765daa0cb85fb8f8bd0665R301. I added some inspect magic so simple things can be short and complex things can be possible.

erikrose · 2014-07-13T01:11:26Z

Still planning the @rule stuff, but I'm considering that not part of this ticket: rather, a refinement of the docstring stuff I already landed. Also, feel free to open another ticket about binary parsing. I'm not ready to commit to supporting it yet, but we should at least have a place to talk about it.

keleshev · 2014-07-14T18:58:51Z

The two-argument version is nice and readable. The 5-argument version is a bit awkward. But, maybe, it's just part of the essential complexity of the problem.

erikrose · 2014-07-14T20:55:19Z

I think it's unavoidable.

erikrose · 2014-07-14T21:06:46Z

Btw, the unicode vs. binary ticket is #31.

erikrose closed this as completed in 029cb2f Jul 13, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generalizing and augmenting PEGs #47

Generalizing and augmenting PEGs #47

keleshev commented Jul 10, 2014

erikrose commented Jul 11, 2014

keleshev commented Jul 11, 2014

erikrose commented Jul 11, 2014

keleshev commented Jul 11, 2014

erikrose commented Jul 12, 2014

erikrose commented Jul 12, 2014

erikrose commented Jul 13, 2014

erikrose commented Jul 13, 2014

keleshev commented Jul 14, 2014

erikrose commented Jul 14, 2014

erikrose commented Jul 14, 2014

Generalizing and augmenting PEGs #47

Generalizing and augmenting PEGs #47

Comments

keleshev commented Jul 10, 2014

erikrose commented Jul 11, 2014

keleshev commented Jul 11, 2014

erikrose commented Jul 11, 2014

keleshev commented Jul 11, 2014

erikrose commented Jul 12, 2014

erikrose commented Jul 12, 2014

erikrose commented Jul 13, 2014

erikrose commented Jul 13, 2014

keleshev commented Jul 14, 2014

erikrose commented Jul 14, 2014

erikrose commented Jul 14, 2014