Implement parametrizable rules #45
Comments
Do you have specific use case where this would save you significant amount of work or make something currently impossible possible? |
It makes parsing indentation levels much easier by calling rules that have the level passed as an argument. Also, in a pure DRY logic, when doing things like "stuff delimited by this character with escape sequence such", it is nicer to call something like |
I should have been more clear. By "specific" I was looking for something like "I was working on a grammar of language X and there are 5 rules in there that could have been combined in one, here they are:" That is, I wanted to see real-world use case and real-world code. From that I can judge better in what cases this feature would be useful and for how many people. Please don't take this as I am opposed to this feature per se. I just generally don't want to implement features useful only for a tiny fraction of languages or developers because of complexity and implementation cost. And in this case the cost is relatively high. |
Just writing a parser for javascript, I could have Lately, I've been writing a parser for a language of my own. I have something like this in a PEG framework I wrote for python :
And I can then write:
Since I have about as many levels of priority as in C++ (all its operators plus a few more), I'll let you imagine how useful it can be. I'm not done with the parsing expressions yet, and I already use it 12 times. |
This would be great if combined with an 'import' feature
|
(This is a bit more complicated than OP's request, but it seemed too close to justify its own thread.) I'm building an R5RS Scheme parser with the help of PEG.js. Everything's rosy except for quasiquotations, which require context-aware parsing. It would be useful to be able to parameterize rules for the sake of on-the-fly rule generation from templates, avoiding a large amount of awkward post-processing. For example, a simplified quasiquotation grammar might look like:
quasiquotation = qq[1]
qq[n] = "`" qq_template[n]
qq_template[0] = expression
qq_template[n] = simple_datum / list_qq_template[n] / unquotation[n]
list_qq_template[n] = "(" qq_template[n]* ")" / qq[n+1]
unquotation[n] = "," qq_template[n-1]
I am interested in contributing to the development of this feature if there's any interest in adding it to the tool. |
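For readers who want to see the level parameter in action, here is a minimal hand-written sketch in plain JavaScript (not PEG.js, and not the commenter's code). Each function mirrors one rule of the simplified grammar above and takes the nesting level `n` as an ordinary argument. The function names, the space-separated list items, and the stand-in of `expression` and `simple_datum` by a run of letters or digits are all assumptions made for illustration.

```javascript
// Every rule takes (input, pos, n) and returns { node, pos } or null on failure.

function skipWs(input, pos) {
  while (pos < input.length && input[pos] === " ") pos++;
  return pos;
}

// simple_datum: simplified stand-in, a run of letters or digits.
function simpleDatum(input, pos) {
  const m = /^[A-Za-z0-9]+/.exec(input.slice(pos));
  return m ? { node: m[0], pos: pos + m[0].length } : null;
}

// qq[n] = "`" qq_template[n]
function qq(input, pos, n) {
  if (input[pos] !== "`") return null;
  const r = qqTemplate(input, pos + 1, n);
  return r && { node: { quasi: r.node, depth: n }, pos: r.pos };
}

// qq_template[0] = expression   (stand-in: a simple datum)
// qq_template[n] = simple_datum / list_qq_template[n] / unquotation[n]
function qqTemplate(input, pos, n) {
  if (n === 0) return simpleDatum(input, pos);
  return (
    simpleDatum(input, pos) ||
    listQqTemplate(input, pos, n) ||
    unquotation(input, pos, n)
  );
}

// list_qq_template[n] = "(" qq_template[n]* ")" / qq[n+1]
function listQqTemplate(input, pos, n) {
  if (input[pos] === "(") {
    const items = [];
    let p = skipWs(input, pos + 1);
    let r;
    while ((r = qqTemplate(input, p, n)) !== null) {
      items.push(r.node);
      p = skipWs(input, r.pos);
    }
    return input[p] === ")" ? { node: items, pos: p + 1 } : null;
  }
  return qq(input, pos, n + 1);
}

// unquotation[n] = "," qq_template[n-1]
function unquotation(input, pos, n) {
  if (input[pos] !== ",") return null;
  const r = qqTemplate(input, pos + 1, n - 1);
  return r && { node: { unquote: r.node, depth: n - 1 }, pos: r.pos };
}

// quasiquotation = qq[1]; the whole input must be consumed.
function parseQuasiquotation(input) {
  const r = qq(input, 0, 1);
  return r && r.pos === input.length ? r.node : null;
}
```

In this sketch an input with more unquotes than enclosing quasiquotes, such as "`,,a", is rejected, because the second comma is reached at level 0, where only the `expression` stand-in is allowed.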
The main reason to do this would be to support grammars sensitive to context, which, if I'm not mistaken, most popular languages are (I know for sure that C and Python have context-specific stuff). According to Trevor Jim, Haskell is also not context free, and he asserts that most languages aren't: http://trevorjim.com/haskell-is-not-context-free/ Using external state in a parser that can backtrack (like PEG can) is dangerous, and can produce issues such as can be seen in this parser:
The above returns 2 instead of the correct answer of 1. Issues like this can be hard to reason about, can create insidious hard-to-find bugs, and when found can be very hard to work around at all, much less elegantly. It's unclear to me how to even do this without post-processing the data returned by PEG. If your parser itself somehow needs the count, it's simply out of luck. Currently, (dangerously) using external state is the only way to parse grammars that are sensitive to context. With parameterized rules, a parser could parse this without risking invalid state:
David, you asked for real situations, and Python's whitespace indentation syntax is clearly an example here. I want to do similar whitespace indentation syntax in Lima (the programming language I'm making with PEG). But I would not want to implement anything like that when I could inadvertently create invalid state that blows everything to hell. I could name any parsing construct that requires context, like C's x* y (is it x times y or y being defined as a pointer to an x-typed value?). Note that for grammars sensitive to context to be parsable, one would necessarily need to pass information returned from subexpressions already matched into a parameterized rule - otherwise the parser can't actually use any of the context. Here's a real example of a string type I'm considering for Lima that only works if parameterized parsing is available and can access (as variables) labels of previously matched expressions:
This would be able to parse a string like string[10:abcdefghij] . You can't do that with nice pure PEG.js as it stands. You have to do something awful like:
Many, many protocols have this kind of parsing need - for example, IPv4 packets have a field describing their total length. You need that context to properly parse the rest of the packet. The same is true for IPv6, UDP, and probably any other packet-based protocol. Most protocols using TCP are also going to need something like this, since one needs to be able to transmit multiple completely separate objects using the same conceptual character stream. Anyway, I hope I've given some good examples and reasons why I think this is not only a nice feature, not only a powerful feature, but really an essential feature that many parsers are missing (including, for the moment, PEG.js). |
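To make the length-prefixed case concrete, here is a small plain-JavaScript sketch (not PEG.js syntax, and not the Lima grammar from this comment, which is not reproduced here). It shows the one thing a pure PEG rule cannot express: the number matched in the header is passed as an argument to the rule that consumes the body. The function names `parseCountedString` and `body` are hypothetical.

```javascript
// body(input, pos, n): consume exactly n characters, whatever they are.
// Returns { value, pos } or null if fewer than n characters remain.
function body(input, pos, n) {
  if (pos + n > input.length) return null;
  return { value: input.slice(pos, pos + n), pos: pos + n };
}

// Parses input of the form  string[<length>:<exactly length characters>]
function parseCountedString(input) {
  const header = /^string\[(\d+):/.exec(input);
  if (header === null) return null;

  // The value matched by an earlier subexpression becomes a rule argument.
  const length = parseInt(header[1], 10);
  const content = body(input, header[0].length, length);
  if (content === null) return null;

  // The closing bracket must follow immediately and end the input.
  if (input[content.pos] !== "]" || content.pos + 1 !== input.length) return null;
  return content.value;
}
```

Because the body is consumed by count and not by scanning for a terminator, it may itself contain "]": under these assumptions `parseCountedString("string[3:a]b]")` yields "a]b".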
Pegasus (a project that shares most of its syntax with peg.js) solves this by having a Also, by backtracking the state along with the parsing cursor, memoization can be accomplished for stateful rules as well. Peg.js could easily do the same, I reckon. |
How does Pegasus manage backtracking state when rules backtrack? I can imagine that you could keep a snapshot of the whole program state that changed, and revert it back, but that would be expensive. I could imagine keeping a snapshot of only the variables that changed, but that would either require the user to specify it, which would add complexity to creating parsers, or would require the parser to somehow figure out all the state changed in some bit of code. None of these sound ideal, so how does Pegasus do it? Theoretically, the parser could avoid invalidly executed actions if A. actions are queued up in closures and only executed once the parser has fully completed, and B. because they execute after the parser has completed, they couldn't cancel a rule match. Perhaps that scheme would be better than the state backtracking done in Pegasus? Also, fixing the problem of invalid state is very nice indeed, but it doesn't solve the problem of expressibility I brought up related to a string literal like string[10:abcdefghij]. I'm definitely interested in how it works, though. |
It doesn't backtrack the whole program's state. It maintains an immutable dictionary of state. It saves the current state dictionary along with the cursor and whenever the cursor is backtracked, the state dictionary gets backtracked with it. The dictionary is immutable anywhere outside of There is a small performance penalty for setting an extra variable every time you advance the cursor, but this is far outweighed by the ability to memoize stateful rules. Also, this doesn't lead to tons of memory allocation, because the immutable nature of the state dictionary allows it to be shared until it is mutated. For example, if you didn't have state in your parser, there would be only one allocation: a single (empty) state dictionary. JavaScript doesn't (to my knowledge) have the ability to make an object immutable, but that was mostly a safety feature. Peg.js would just need to copy a state dictionary before processing each |
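The idea of backtracking state together with the cursor can be sketched in a few lines of plain JavaScript. This is an illustration of the general technique only, not Pegasus's implementation or any PEG.js API; the combinator names (`cursor`, `literal`, `update`, `seq`, `choice`) are made up for the example. `Object.freeze` (available since ES5) is used here for shallow immutability.

```javascript
// A cursor bundles the input position with a frozen state object.
function cursor(pos, state) {
  return Object.freeze({ pos, state: Object.freeze(state) });
}

// literal(text): match text at the cursor; state passes through untouched.
const literal = (text) => (input, cur) =>
  input.startsWith(text, cur.pos) ? cursor(cur.pos + text.length, cur.state) : null;

// update(fn): consume nothing; "change" state by building a new object.
const update = (fn) => (input, cur) => cursor(cur.pos, fn(cur.state));

// seq: run parsers in order, failing as soon as one fails.
const seq = (...parsers) => (input, cur) => {
  for (const p of parsers) {
    cur = p(input, cur);
    if (cur === null) return null;
  }
  return cur;
};

// choice: ordered choice; every alternative restarts from the same saved
// cursor, so position and state are rolled back together.
const choice = (...parsers) => (input, cur) => {
  for (const p of parsers) {
    const r = p(input, cur);
    if (r !== null) return r;
  }
  return null;
};

// State carried in the cursor: the failed first alternative leaves no trace.
const bump = update((s) => ({ ...s, count: s.count + 1 }));
const grammar = choice(
  seq(literal("a"), bump, literal("b")),
  seq(literal("a"), bump, literal("c"))
);
const result = grammar("ac", cursor(0, { count: 0 }));

// Contrast: the same grammar with an external counter counts the "a" twice.
let externalCount = 0;
const bumpExternal = (input, cur) => {
  externalCount++;
  return cur;
};
const leaky = choice(
  seq(literal("a"), bumpExternal, literal("b")),
  seq(literal("a"), bumpExternal, literal("c"))
);
leaky("ac", cursor(0, {}));
```

On the input "ac", `result.state.count` is 1, while `externalCount` ends up at 2 because the increment from the abandoned first alternative is never undone.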
Oh OK, so the user basically does have to specify what state they're changing. That's pretty cool. But I still don't think it really covers the same benefits that parameterization does. It sounds like it's probably useful in its own right for other things. |
I just wrote a fork that supplies an environment, accessible using the variable Here is an example grammar that uses it to parse whitespace-defined blocks a la Python:
Here is an example input for the resulting parser:
|
…ate defining: List<expr, delim> 'Generic list' = h:expr t:(delim e:expr {return e;})* { return [h].concat(t); }; Syntax for template instantiation: CommaSeparatedIntList = List<n, ','>; n = n:$[0-9]+ {return parseInt(n, 10);}
…ting AST and default handling for new AST nodes.
…template rules and correct testcases.
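The `List<expr, delim>` template quoted in the commit message above takes rules as arguments. Outside of a grammar, the same shape is an ordinary higher-order function. The following plain-JavaScript sketch is only an analogy for what such a template would expand to, not the code from the referenced commits; the names `literal`, `integer`, `list`, and `commaSeparatedIntList` are assumptions.

```javascript
// Parsers take (input, pos) and return { value, pos } or null on failure.
const literal = (text) => (input, pos) =>
  input.startsWith(text, pos) ? { value: text, pos: pos + text.length } : null;

// n = n:$[0-9]+ { return parseInt(n, 10); }
const integer = (input, pos) => {
  const m = /^[0-9]+/.exec(input.slice(pos));
  return m ? { value: parseInt(m[0], 10), pos: pos + m[0].length } : null;
};

// List<expr, delim> = h:expr t:(delim e:expr { return e; })* { return [h].concat(t); }
const list = (expr, delim) => (input, pos) => {
  const head = expr(input, pos);
  if (head === null) return null;

  const values = [head.value];
  let end = head.pos;
  for (;;) {
    const d = delim(input, end);
    if (d === null) break;
    const e = expr(input, d.pos);
    if (e === null) break; // a trailing delimiter is left unconsumed
    values.push(e.value);
    end = e.pos;
  }
  return { value: values, pos: end };
};

// CommaSeparatedIntList = List<n, ','>
const commaSeparatedIntList = list(integer, literal(","));
```

Instantiating the template is then just a function call, and the same `list` can be reused with any other element and delimiter parsers.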
I have the impression that PEG.js doesn't support parameters of any kind on rules - which is surprising. This feature is very important to me. What I need is simpler than the OP's request - the OP wants to modify the grammar itself depending on the parameter, but at a minimum I just need to pass an integer into a rule. Basically I want to translate an LLLPG rule that looks like this (where
My language has 25 precedence levels, and with these rules I have collapsed almost all of them to be processed by a single rule (you can think of Also you can see here that the inner rule So ... is there any way to pass parameters to a PEG.js rule? |
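One well-known way to collapse many precedence levels into a single rule that takes an integer is precedence climbing. The plain-JavaScript sketch below illustrates that technique; it is not a reconstruction of the LLLPG rule mentioned above, which is not shown in this thread. The operator table, the token set, and all names are assumptions chosen to keep the example small.

```javascript
// Hypothetical operator table: a higher number binds tighter.
// All operators here are left-associative.
const PRECEDENCE = { "+": 1, "-": 1, "*": 2, "/": 2 };

// Splits the source into numbers, operators, and parentheses.
// Anything else (including whitespace) is silently skipped.
function tokenize(src) {
  return src.match(/\d+|[-+*/()]/g) || [];
}

function parseExpression(src) {
  const tokens = tokenize(src);
  let i = 0;

  // primary = number / "(" expr(1) ")"
  function primary() {
    const tok = tokens[i++];
    if (tok === "(") {
      const inner = expr(1);
      if (tokens[i++] !== ")") throw new SyntaxError("expected )");
      return inner;
    }
    if (!/^\d+$/.test(tok)) throw new SyntaxError("expected a number");
    return Number(tok);
  }

  // expr(minPrec): the single parameterized rule. It only consumes
  // operators whose precedence is at least minPrec.
  function expr(minPrec) {
    let left = primary();
    while (i < tokens.length && PRECEDENCE[tokens[i]] >= minPrec) {
      const op = tokens[i++];
      // Left-associative: the right operand must bind strictly tighter.
      const right = expr(PRECEDENCE[op] + 1);
      left = [op, left, right];
    }
    return left;
  }

  const tree = expr(1);
  if (i !== tokens.length) throw new SyntaxError("unexpected token " + tokens[i]);
  return tree;
}
```

Adding another precedence level in this sketch means adding entries to the table, not writing another rule, which is the saving the comment describes.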
+1! In my case, I simply want to generate a parser for a syntax where some delimiters are globally configurable. I can achieve this by replacing the delimiter literals with match-anything expressions combined with a predicate, but it would be much more elegant (and also more efficient) if the match-anything expression could simply be replaced by a variable. |
There are two types of arguments that could be passed to parsers:
There should be the way to somehow mark whether the argument is a parser or a value. When I tried to figure out the correct approach, I ended up using global variables for context, and that's obviously a dead end. Does anyone have any ideas on that? |
How about macros? Given:
When:
Then:
This allows us to build rules from the bottom up 😏 |
I agree, I recommend that developers add this feature :) |
I really need this feature for an updated JavaScript grammar I'm writing, so this is high on my wish list. Will give it a go and see how it works out. |
@samvv I've come across this from a very different route, and haven't read the whole thread yet. That is, in essence: return functions as intermediate parse results. That "trick" is by no means my invention, and probably rather clunky for your purpose I guess. But it might be a work-around for you. I mean until "post v1.0"... :) |
@meisl Cool, thanks for the tip! Will try it out when I find some time. |
@samvv Ooh, ah... I'm afraid I have overlooked something rather important: It makes quite a difference whether you want the parameterized rule to
What I was proposing only helps with the former - while the latter is the actual problem of the OP... However, there is a workaround even for the latter, albeit an even MORE clunky one. I'm appending an example for you to try out in https://pegjs.org/online The basic idea is: use global state to remember the current "terminator". That's quite a hack, admittedly, and repetitive.
which differs from the others only in that one very character
... on inputs
p.s.: that is really NOT something one would want to write by hand, I know. But hey, imagine it'd be generated for you... I think in principle that is how. |
Wow it took me a while to remember. 10 years ! In this PR, rules take arguments and can be parametrized. This is meant to be used by the grammar itself to avoid repeating similar-yet-different rules. In #36, the rules are specified outside of the grammar itself. The grammar is thus itself parametrized. I think the scope is different, though one could argue that a grammar is itself a rule and thus this is the same issue. I think it is not though, as #36 would probably mean some slight API changes, while this PR would not. |
So, to abuse C++ ish terminology in a deeply incorrect way, the former are template statics, whereas the latter are constructor calls? |
I guess this analogy somewhat works, yes. |
Thank you for taking the time to explain. I'll probably have another question for you in ten years. Have a good 2020s. |
It would be really useful in removing redundancy of my parser definition. I have a custom grammar that's on purpose very relaxed, and some rules need to be applied in slightly different contexts. |
It would be great to be able to parametrize rules with variables ;