Added the ability of parsimonious to parse lists of tokens. #40
Conversation
A token is an instance with a member called "type". To match a token of a given type, use the syntax %TOKEN_TYPE% in your grammar. Example usage:

grammar = Grammar("""foo = %TOKEN1% %TOKEN2%""")
assert(grammar.parse([Token("TOKEN1"), Token("TOKEN2")]) is not None)

Example of ad-hoc Python "def" syntax:

defun = %DEF% %SYMBOL% %LPAR% (%SYMBOL% (%COMMA% %SYMBOL%)*) %RPAR% %COLON%
        %NEWLINE% %INDENT% block %DEDENT%
block = (statement %NEWLINE%)+
statement = ...
Having a look at this, in case you're still around. Sorry for the ridiculous delay. Your work looks clean and promising!
@keleshev, you also mentioned a desire for this a while back, so I'd value your input.

Am I correct that there would never be a need to mix token-based and text-based matching in the same Grammar? If so, we could just have a subclass called ListGrammar that would emit a slightly different salad of Expressions: TokenExpressions in the place of Literals, for example. That would save the dispatch between basestrings and lists, improving performance and decoupling the new functionality from the existing code.

I'm toying with even reusing the Literal syntax for tokens, since there would be no need for it in a token-oriented grammar. (That has the additional bonus of not having to wait for the grammar composition stuff from #30.)

What do you guys think of this sketch?

grammar = ListGrammar(r"""
defun = "DEF" "SYMBOL" "LPAR" ("SYMBOL" ("COMMA" "SYMBOL")*) "RPAR" "COLON"
"NEWLINE" "INDENT" block "DEDENT"
block = (statement "NEWLINE")+
statement = ...
""")
grammar.parse([Token('DEF'), ...])
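To make the sketch concrete, here is a minimal illustration of the ListGrammar idea: reusing Literal syntax ("DEF") to match tokens by type rather than by characters. The Token and TokenExpression classes below are illustrative stand-ins, not parsimonious's actual API.

```python
# A hedged sketch of the proposal above: in a token-oriented grammar,
# a quoted literal like "DEF" would compile to a TokenExpression that
# compares token types, instead of a Literal that compares characters.

class Token:
    """A lexer token; the grammar matches on its ``type``."""
    def __init__(self, type):
        self.type = type


class TokenExpression:
    """Stands in for Literal when the input is a list of tokens."""
    def __init__(self, type):
        self.type = type

    def match(self, token_list, pos):
        # Return the next position on success, None on failure,
        # mirroring how Literal advances past matched text.
        if pos < len(token_list) and token_list[pos].type == self.type:
            return pos + 1
        return None


# "DEF" in a ListGrammar would become TokenExpression('DEF'):
expr = TokenExpression('DEF')
assert expr.match([Token('DEF')], 0) == 1
assert expr.match([Token('SYMBOL')], 0) is None
```

Since the input is a list rather than a string, there is no dispatch cost at match time: the subclass decides once, at grammar-compile time, which Expression type to emit.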
@@ -151,10 +161,32 @@ def _uncached_match(self, text, pos, cache, error):
        if text.startswith(self.literal, pos):
            return Node(self.name, text, pos, pos + len(self.literal))

    def _uncached_match_list(self, token_list, pos, cache, error):
I'm particularly curious about this method added to Literal, which compares a .text attr of Token that isn't otherwise mentioned or covered in tests.
That is silly, I don't remember why I wrote it like that.
As you can see, what it does is assume that a "Token" instance also has a "text" attribute (containing, for instance, the text the lexer put together to form the token, though it could be anything, really), and compare the literal against that. But it has been a while; I don't remember why I needed it, or thought I needed it.
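For readers following along, the behavior being described can be sketched roughly as follows. This is a hedged reconstruction, not the code from the PR; the Token class and function name are illustrative.

```python
# Rough sketch of what the questioned _uncached_match_list apparently
# did: succeed when the token at ``pos`` carries the literal as its
# .text (the source characters the lexer consumed), rather than
# matching on the token's .type.

class Token:
    def __init__(self, type, text=""):
        self.type = type
        self.text = text  # e.g. the characters the lexer consumed


def match_literal_against_tokens(literal, token_list, pos):
    """Succeed (returning the next position) if the token at ``pos``
    has ``literal`` as its text; otherwise return None."""
    if pos < len(token_list) and token_list[pos].text == literal:
        return pos + 1
    return None


tokens = [Token('SYMBOL', text='foo')]
assert match_literal_against_tokens('foo', tokens, 0) == 1
assert match_literal_against_tokens('bar', tokens, 0) is None
```

This explains the reviewer's concern: the .text attribute is an extra, undocumented requirement on Token that nothing else in the patch exercises.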
I am actually not using parsimonious for my project anymore. In the end my use case was so unique that I had to come up with a custom parsing method: I had all sorts of requirements, such as the language being parsed changing dynamically by instructions of the language itself, and this didn't easily conform to grammar-based parsing.
Thanks for taking the time to reply, @bloff! It sounds like I'm on the right track. To be clear, you needed to parse lists of tokens—[TOK1, TOK2, TOK3]—not hybrid lists with tokens and text—[TOK1, "some string", TOK2, TOK3]—correct?
Likewise, I suspect things like Regex wouldn't make sense in a token-based stream.
Pushed my cleanups as #69, so let's continue this over there.
Not sure you wanted this in parsimonious, but I needed it. I implemented a language with Python-like indentation; that kind of indentation is usually handled in the lexer. I am using PLY.

However, LALR(1) grammars suck, so I wanted something with more lookahead. But all the PEG parsers I found insisted on doing the lexing themselves (bah!). Parsimonious was so simply written that it was trivial to change it to do what I needed.
Now if you call grammar.parse on a list, it will assume that the list is a list of "tokens": a token is an instance with a member called "type".
To match a token of a given type, you may use the syntax %TOKEN_TYPE% in the grammar.
Example usage:
grammar = Grammar("""foo = %TOKEN1% %TOKEN2%""")
assert(grammar.parse([Token("TOKEN1"), Token("TOKEN2")]) is not None)
I've already tested it on my project, it works just fine :-)
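For context on where INDENT/DEDENT tokens like those in the defun example come from, here is a minimal sketch of an indentation-tracking lexer of the kind described above. It is independent of PLY and parsimonious; the Token class and token names are illustrative assumptions, not code from this PR.

```python
# Minimal indentation-tracking lexer sketch: turn changes in leading
# whitespace into synthetic INDENT/DEDENT tokens, the way Python-like
# languages are usually lexed before parsing.

class Token:
    def __init__(self, type, text=""):
        self.type = type
        self.text = text

    def __repr__(self):
        return "Token(%r)" % self.type


def lex_indentation(source):
    tokens = []
    indents = [0]  # stack of currently open indentation widths
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines don't affect indentation
        width = len(line) - len(line.lstrip(' '))
        if width > indents[-1]:
            indents.append(width)
            tokens.append(Token('INDENT'))
        while width < indents[-1]:
            indents.pop()
            tokens.append(Token('DEDENT'))
        tokens.append(Token('LINE', line.strip()))
        tokens.append(Token('NEWLINE'))
    while len(indents) > 1:  # close any blocks still open at EOF
        indents.pop()
        tokens.append(Token('DEDENT'))
    return tokens


toks = lex_indentation("a:\n  b\n  c\nd\n")
assert [t.type for t in toks] == [
    'LINE', 'NEWLINE', 'INDENT', 'LINE', 'NEWLINE',
    'LINE', 'NEWLINE', 'DEDENT', 'LINE', 'NEWLINE']
```

A grammar over the resulting token list can then express block structure directly (%INDENT% block %DEDENT%) with no character-level matching at all, which is exactly what a token-only parse mode enables.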