
Added the ability of parsimonious to parse lists of tokens. #40

Closed
wants to merge 1 commit

Conversation

@bloff commented May 9, 2014

Not sure you wanted this in parsimonious, but I needed it. I implemented a language with Python-like indentation; such indentation is usually handled in the lexer, and I am using PLY for that.

However, LALR(1) grammars suck, so I wanted something with more lookahead. But every PEG library I found insisted on doing the lexing itself (bah!).

Parsimonious, though, was so simply written that it was trivial to change it to do what I needed.

Now, if you call grammar.parse on a list, it assumes the list is a list of "tokens": a token is any instance with an attribute called "type".

To match a token of a given type, you may use the syntax %TOKEN_TYPE% in the grammar.

Example usage:

grammar = Grammar("""foo = %TOKEN1% %TOKEN2%""")
assert grammar.parse([Token("TOKEN1"), Token("TOKEN2")]) is not None
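The only contract a token has to satisfy is that "type" attribute, so a minimal Token class is enough (a sketch; any class with a "type" attribute would do):

class Token(object):
    """Minimal token: the parser only looks at the `type` attribute."""
    def __init__(self, type):
        self.type = type

    def __repr__(self):
        return 'Token(%r)' % self.type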

I've already tested it on my project; it works just fine :-)

@bloff commented May 9, 2014

Example of an ad-hoc Python "def" syntax:

defun = %DEF% %SYMBOL% %LPAR% (%SYMBOL% (%COMMA% %SYMBOL%)*) %RPAR% %COLON% %NEWLINE% %INDENT% block %DEDENT%
block = (statement %NEWLINE%)+
statement = ...
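For example, an indentation-tracking lexer might hand the parser a stream like this (hypothetical; the statement's own tokens are elided, as in the grammar above):

# Tokens for:
#     def f(x, y):
#         <statement>
tokens = [Token("DEF"), Token("SYMBOL"), Token("LPAR"),
          Token("SYMBOL"), Token("COMMA"), Token("SYMBOL"),
          Token("RPAR"), Token("COLON"), Token("NEWLINE"),
          Token("INDENT"),
          # ...statement tokens..., then Token("NEWLINE"),
          Token("DEDENT")]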

@erikrose commented Mar 3, 2015

Having a look at this, in case you're still around. Sorry for the ridiculous delay. Your work looks clean and promising!

@erikrose commented Mar 3, 2015

@keleshev, you also mentioned a desire for this a while back, so I'd value your input.

Am I correct that there would never be a need to mix token-based and text-based matching in the same Grammar? If so, we could just have a subclass called ListGrammar that would emit a slightly different salad of Expressions: TokenExpressions in the place of Literals, for example. That would save the dispatch between basestrings and lists, improving performance and decoupling the new functionality from the existing code. I'm toying with even reusing the Literal syntax for tokens, since there would be no need for it in a token-oriented grammar. (That has the additional bonus of not having to wait for the grammar composition stuff from #30.) What do you guys think of this sketch?

grammar = ListGrammar(r"""
    defun = "DEF" "SYMBOL" "LPAR" ("SYMBOL" ("COMMA" "SYMBOL")*) "RPAR" "COLON"
            "NEWLINE" "INDENT" block "DEDENT"
    block = (statement "NEWLINE")+
    statement = ...
    """)
grammar.parse([Token('DEF'), ...])
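A TokenExpression in that salad might look something like this (just a sketch of the idea, not a finished implementation; it mirrors Literal's Node call and assumes tokens carry a "type" attribute as in this PR):

from parsimonious.expressions import Expression
from parsimonious.nodes import Node

class TokenExpression(Expression):
    """Matches a single token whose type equals a given string.

    Stands in for Literal when the "text" being parsed is a token list.
    """
    def __init__(self, token_type, name=''):
        super(TokenExpression, self).__init__(name)
        self.token_type = token_type

    def _uncached_match(self, token_list, pos, cache, error):
        # Consume exactly one token on a type match; fail (None) otherwise.
        if pos < len(token_list) and token_list[pos].type == self.token_type:
            return Node(self.name, token_list, pos, pos + 1)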

@@ -151,10 +161,32 @@ def _uncached_match(self, text, pos, cache, error):
         if text.startswith(self.literal, pos):
             return Node(self.name, text, pos, pos + len(self.literal))
 
+    def _uncached_match_list(self, token_list, pos, cache, error):
@erikrose commented on the diff:
I'm particularly curious about this method added to Literal, which compares a .text attr of Token that isn't otherwise mentioned or covered in tests.
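For concreteness, it seems to boil down to roughly this (my reconstruction; the hunk above doesn't show the full body, and the Token attributes are assumed):

def _uncached_match_list(self, token_list, pos, cache, error):
    # Compares the token's .text attribute (not its .type) to the literal.
    if pos < len(token_list) and token_list[pos].text == self.literal:
        return Node(self.name, token_list, pos, pos + 1)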

@bloff replied:

That is silly; I don't remember why I wrote it like that.

As you can see, it assumes that a "Token" instance also has a "text" attribute (containing, say, the text the lexer assembled into the token, though it could be anything, really) and compares the literal against that. But it has been a while; I don't remember why I needed it, or thought I needed it.

I am actually not using parsimonious for my project anymore. In the end my use case was so unusual that I had to come up with a custom parsing method: I had all sorts of requirements, such as the language being parsed changing dynamically under instructions from the language itself, and this didn't easily fit grammar-based parsing.

@erikrose replied:
Thanks for taking the time to reply, @bloff! It sounds like I'm on the right track. To be clear, you needed to parse lists of tokens—[TOK1, TOK2, TOK3]—not hybrid lists with tokens and text—[TOK1, "some string", TOK2, TOK3]—correct?

@erikrose commented Mar 3, 2015

Likewise, I suspect things like Regex wouldn't make sense in a token-based stream.

@erikrose mentioned this pull request Mar 3, 2015

@erikrose commented Mar 3, 2015

Pushed my cleanups as #69, so let's continue this over there.

@erikrose closed this Mar 3, 2015