Added the ability of parsimonious to parse lists of tokens. #40
Conversation
A token is an instance with a member called "type". To match a token of a given type, use the syntax %TOKEN_TYPE% in your grammar. Example usage:

grammar = Grammar("""foo = %TOKEN1% %TOKEN2%""")
assert(grammar.parse([Token("TOKEN1"), Token("TOKEN2")]) is not None)

Example of ad-hoc Python "def" syntax:

defun = %DEF% %SYMBOL% %LPAR% (%SYMBOL% (%COMMA% %SYMBOL%)*) %RPAR% %COLON%
        %NEWLINE% %INDENT% block %DEDENT%
block = (statement %NEWLINE%)+
statement = ...
Having a look at this, in case you're still around. Sorry for the ridiculous delay. Your work looks clean and promising!
@keleshev, you also mentioned a desire for this a while back, so I'd value your input.

Am I correct that there would never be a need to mix token-based and text-based matching in the same Grammar? If so, we could just have a subclass called ListGrammar that would emit a slightly different salad of Expressions: TokenExpressions in the place of Literals, for example. That would save the dispatch between basestrings and lists, improving performance and decoupling the new functionality from the existing code.

I'm toying with even reusing the Literal syntax for tokens, since there would be no need for it in a token-oriented grammar. (That has the additional bonus of not having to wait for the grammar composition stuff from #30.)

What do you guys think of this sketch?

grammar = ListGrammar(r"""
defun = "DEF" "SYMBOL" "LPAR" ("SYMBOL" ("COMMA" "SYMBOL")*) "RPAR" "COLON"
"NEWLINE" "INDENT" block "DEDENT"
block = (statement "NEWLINE")+
statement = ...
""")
grammar.parse([Token('DEF'), ...])
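To make the sketch concrete, here is a minimal illustration of the ListGrammar idea: reusing Literal syntax ("DEF") to match tokens by type rather than by characters. The Token and TokenExpression classes below are illustrative stand-ins, not parsimonious's actual API.

```python
# A hedged sketch of the proposal above: in a token-oriented grammar,
# a quoted literal like "DEF" would compile to a TokenExpression that
# compares token types, instead of a Literal that compares characters.

class Token:
    """A lexer token; the grammar matches on its ``type``."""
    def __init__(self, type):
        self.type = type


class TokenExpression:
    """Stands in for Literal when the input is a list of tokens."""
    def __init__(self, type):
        self.type = type

    def match(self, token_list, pos):
        # Return the next position on success, None on failure,
        # mirroring how Literal advances past matched text.
        if pos < len(token_list) and token_list[pos].type == self.type:
            return pos + 1
        return None


# "DEF" in a ListGrammar would become TokenExpression('DEF'):
expr = TokenExpression('DEF')
assert expr.match([Token('DEF')], 0) == 1
assert expr.match([Token('SYMBOL')], 0) is None
```

Since the input is a list rather than a string, there is no dispatch cost at match time: the subclass decides once, at grammar-compile time, which Expression type to emit.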
@@ -151,10 +161,32 @@ def _uncached_match(self, text, pos, cache, error):
        if text.startswith(self.literal, pos):
            return Node(self.name, text, pos, pos + len(self.literal))

    def _uncached_match_list(self, token_list, pos, cache, error):
I'm particularly curious about this method added to Literal, which compares a .text attr of Token that isn't otherwise mentioned or covered in tests.
That is silly, I don't remember why I wrote it like that.
As you can see, what it does is assume that a "Token" instance also has a "text" attribute (containing, for instance, the text the lexer put together to form the token, though it could be anything, really), and compare the literal against that. But it has been a while; I don't remember why I needed it, or thought I needed it.
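For readers following along, the behavior being described can be sketched roughly as follows. This is a hedged reconstruction, not the code from the PR; the Token class and function name are illustrative.

```python
# Rough sketch of what the questioned _uncached_match_list apparently
# did: succeed when the token at ``pos`` carries the literal as its
# .text (the source characters the lexer consumed), rather than
# matching on the token's .type.

class Token:
    def __init__(self, type, text=""):
        self.type = type
        self.text = text  # e.g. the characters the lexer consumed


def match_literal_against_tokens(literal, token_list, pos):
    """Succeed (returning the next position) if the token at ``pos``
    has ``literal`` as its text; otherwise return None."""
    if pos < len(token_list) and token_list[pos].text == literal:
        return pos + 1
    return None


tokens = [Token('SYMBOL', text='foo')]
assert match_literal_against_tokens('foo', tokens, 0) == 1
assert match_literal_against_tokens('bar', tokens, 0) is None
```

This explains the reviewer's concern: the .text attribute is an extra, undocumented requirement on Token that nothing else in the patch exercises.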
I am actually not using parsimonious for my project anymore. In the end my use case was so unique that I had to come up with a custom parsing method: I had all sorts of requirements, such as the language being parsed changing dynamically by instructions of the language itself, and this didn't easily conform to grammar-based parsing.
Thanks for taking the time to reply, @bloff! It sounds like I'm on the right track. To be clear, you needed to parse lists of tokens—[TOK1, TOK2, TOK3]—not hybrid lists with tokens and text—[TOK1, "some string", TOK2, TOK3]—correct?
Likewise, I suspect things like Regex wouldn't make sense in a token-based stream.
Pushed my cleanups as #69, so let's continue this over there.
Not sure you wanted this in parsimonious, but I needed it. I implemented a language with Python-like indentation; that kind of indentation is usually handled in the lexer. I am using PLY.

However, LALR(1) grammars suck, so I wanted something with more lookahead. But all the PEG parsers I found insisted on doing the lexing themselves (bah!). Parsimonious was so simply written that it was trivial to change it to do what I needed.
Now if you call grammar.parse on a list, it will assume that the list is a list of "tokens": a token is an instance with a member called "type".
To match a token of a given type, you may use the syntax %TOKEN_TYPE% in the grammar.
Example usage:
grammar = Grammar("""foo = %TOKEN1% %TOKEN2%""")
assert(grammar.parse([Token("TOKEN1"), Token("TOKEN2")]) is not None)
I've already tested it on my project, it works just fine :-)
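For context on where INDENT/DEDENT tokens like those in the defun example come from, here is a minimal sketch of an indentation-tracking lexer of the kind described above. It is independent of PLY and parsimonious; the Token class and token names are illustrative assumptions, not code from this PR.

```python
# Minimal indentation-tracking lexer sketch: turn changes in leading
# whitespace into synthetic INDENT/DEDENT tokens, the way Python-like
# languages are usually lexed before parsing.

class Token:
    def __init__(self, type, text=""):
        self.type = type
        self.text = text

    def __repr__(self):
        return "Token(%r)" % self.type


def lex_indentation(source):
    tokens = []
    indents = [0]  # stack of currently open indentation widths
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines don't affect indentation
        width = len(line) - len(line.lstrip(' '))
        if width > indents[-1]:
            indents.append(width)
            tokens.append(Token('INDENT'))
        while width < indents[-1]:
            indents.pop()
            tokens.append(Token('DEDENT'))
        tokens.append(Token('LINE', line.strip()))
        tokens.append(Token('NEWLINE'))
    while len(indents) > 1:  # close any blocks still open at EOF
        indents.pop()
        tokens.append(Token('DEDENT'))
    return tokens


toks = lex_indentation("a:\n  b\n  c\nd\n")
assert [t.type for t in toks] == [
    'LINE', 'NEWLINE', 'INDENT', 'LINE', 'NEWLINE',
    'LINE', 'NEWLINE', 'DEDENT', 'LINE', 'NEWLINE']
```

A grammar over the resulting token list can then express block structure directly (%INDENT% block %DEDENT%) with no character-level matching at all, which is exactly what a token-only parse mode enables.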