Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalizing and augmenting PEGs #47

Closed
keleshev opened this issue Jul 10, 2014 · 11 comments
Closed

Generalizing and augmenting PEGs #47

keleshev opened this issue Jul 10, 2014 · 11 comments

Comments

@keleshev
Copy link
Contributor

I wish to parse (1) binary data using PEGs, or (2) token streams (like #40). Also, I wish to parse (3) more grammars than PEGs allow by somehow augmenting the grammar with some Python code and maybe some state, because I don't want to throw away my grammar and rewrite the parser manually as soon as I want to add a feature that is not parseable by PEGs, like, say, whitespace-sensitivity.

What do you think?

@erikrose
Copy link
Owner

Wow, that's 3 wishes, 3 tickets. :-)

  1. I had originally designed Parsimonious to work against either binary or Unicode data. (I'm not sure I went all the way and actually tested it, but that was my intent or, rather, my failure to commit to a decision one way or the other.) I was just about to declare that it is for Unicode and nothing else, and now you come up with this. :-) What's your use case? Sell me!
  2. I need some time or some more examples to wrap my brain around the use case for that. It might be a neat feature.
  3. One of my core goals has been to avoid cluttering up the grammar with inline Python and such; I'd like them to stay PEG-like. Do you have any suggestions on how we can accomplish extension without an explosion in the grammar definition language?
  4. Out of curiosity (and as motivation), are you still playing around with that docopt branch that uses Parsimonious?

@keleshev
Copy link
Contributor Author

  1. At one of my previous jobs I needed to parse quite a lot of binary formats. I would rather want to parse it with a parser tool, than with code that looks like this.

  2. Maybe something like this:

grammar = Grammar("""
    foo = "bar" / "baz" / quux
    quux = "(" spam* ")"""",
    spam = Rule(lambda source: ...)
)

And in case of a visitor:

class MyLanguage(Language):

    @rule('"bar" / "baz" / quux')
    def foo(self, node, children):
        ...

    @rule(lambda source: ...)
    def spam(self, node, children)
        ...
  1. I'm probably over-engineering, but my (almost impossible) wish was to define most of docopt in terms of PEG, then make sure there are good, interoperable PEG implementations for most popular programming languages, then port all docopt ports to use PEG. That's why I made peg.rb, and was hoping to make some adaptor for parsimonious to parse Ford's notation (or to convince you to move to Ford's notation).

@erikrose
Copy link
Owner

I like the spam = a lot. So cleanly did the extra kwargs match up with the string-ensconced rules above that I didn't even notice the end of the string at first. The beauty could be worth making it harder to add dedicated constructor kwargs in the future.

Now that I think about it, Parsimonious already has a logical extension point for custom expression code: just make something that implements the Expression object's interface. We could even promote simple lambdas into Expressions transparently, though the lambdas would need a few more params than in your sketch:

grammar = Grammar("""
    foo = "bar" / "baz" / quux
    quux = "(" spam* ")"
    """,
    spam=lambda text, pos, cache, error: return Node(...))

We'd probably also want to pass the Grammar itself into the lambda so it could call out to other rules if it liked. A natural way to do this might be to make the lambda appear to be an instance method, receiving the Grammar as an initial self arg.

@keleshev
Copy link
Contributor Author

👍

@erikrose
Copy link
Owner

Here's a little teaser of tonight's work:

def expression(callable, rule_name, grammar):
    """Turn a plain callable into an Expression

    The callable can be of this simple form::

        def foo(text, pos):
            # end_pos = the offset in `text` where I stop matching
            # children = child nodes, if any
            if the expression matched:
                return end_pos, children

    If there are no children, it can return just an int::

        return end_pos

    If the expression doesn't match at the given ``pos`` at all... ::

        return None

    If your callable needs to make sub-calls to other rules in the grammar, it
    can take this form, adding additional arguments::

        def foo(text, pos, cache, error, grammar):
            # Call out to other rules:
            matched = grammar['another_rule']._match(text, pos, cache, error)
            ...
            # Return values as above.

    The return value of the callable, if an int or a tuple, will be
    automatically transmuted into a Node. If it returns a Node-like class
    directly, it will be passed through unchanged.

    """
    num_args = len(getargspec(callable).args)
    if num_args == 2:
        is_simple = True
    elif num_args == 5:
        is_simple = False
    else:
        raise RuntimeError("Custom rule functions must take either 2 or 5 "
                           "arguments, not %s." % num_args)

    class AdHocExpression(Expression):
        def _uncached_match(self, text, pos, cache, error):
            if is_simple:
                result = callable(text, pos)
            else:
                result = callable(text, pos, cache, error, grammar)

            if isinstance((long, int), result):
                end, children = result, None
            elif isinstance(tuple, result):
                end, children = result
            else:
                # Node or None
                return result
            return Node(self.name, text, pos, end, children=children)

        def _as_rhs(self):
            return '{custom function "%s"}' % callable.__name__

    return AdHocExpression(name=rule_name)

Want to know something disgusting? You could actually use this to add state to your parser—not just code. Since the 5-arg version of a custom rule gets passed the cache—the canonical repository of state during a parse—you could abuse it to squirrel away your own stuff, as long as you take care not to collide with the packrat items. And those shouldn't be hard to dodge, since they're all 2-tuples of (pos, expression ID). Evil! :-D

@erikrose
Copy link
Owner

Example:

grammar = Grammar("""
    bracketed_digit = start digit end
    start = '['
    end = ']'""",
    digit = lambda text, pos:
                (pos + 1) if text[pos].isdigit() else None)

@erikrose
Copy link
Owner

@halst, take a look at the new stuff, and let me know what you think! The tests provide pretty good examples: 029cb2f#diff-23015ea821765daa0cb85fb8f8bd0665R301. I added some inspect magic so simple things can be short and complex things can be possible.

@erikrose
Copy link
Owner

Still planning the @rule stuff, but I'm considering that not part of this ticket: rather, a refinement of the docstring stuff I already landed. Also, feel free to open another ticket about binary parsing. I'm not ready to commit to supporting it yet, but we should at least have a place to talk about it.

@keleshev
Copy link
Contributor Author

The two-argument version is nice and readable. The 5-argument version is a bit awkward. But, maybe, it's just part of the essential complexity of the problem.

@erikrose
Copy link
Owner

I think it's unavoidable.

@erikrose
Copy link
Owner

Btw, the unicode vs. binary ticket is #31.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants