
Implement parametrizable rules #45

Open
ceymard opened this issue Aug 25, 2011 · 29 comments

@ceymard

ceymard commented Aug 25, 2011

It would be great to be able to parametrize rules with variables:

string
   = '\"' parse_contents '\"' ->
   / '\'' parse_contents('\'') '\'' ->
   / '+' parse_contents('+') '+' -> /* sure why not :) */

parse_contents(terminator='\"')
    = ('\\' terminator / !terminator .)+ -> return stuff
@dmajda
Contributor

dmajda commented Jan 10, 2012

Do you have a specific use case where this would save you a significant amount of work or make something currently impossible possible?

@ceymard
Author

ceymard commented Jan 10, 2012

It makes parsing indentation levels much easier by calling rules that have the level passed as an argument.

Also, in a pure DRY logic, when doing things like "stuff delimited by this character, with such-and-such escape sequence", it is nicer to call something like delimited('\'', '\\') than to write out the rule (and its actions!) three times.
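As a hedged sketch of what such a parameterized `delimited` rule would buy, here is the same idea modeled as a plain JavaScript function that builds a matcher. All names (`delimited`, `singleQuoted`, etc.) are invented for illustration; this is not PEG.js API.

```javascript
// A parameterized "rule": quote and escape characters are arguments,
// so one definition replaces three near-identical rules.
function delimited(quote, escape) {
  return function (input, pos = 0) {
    if (input[pos] !== quote) return null;       // opening delimiter
    let i = pos + 1;
    let out = '';
    while (i < input.length) {
      if (input[i] === escape && i + 1 < input.length) {
        out += input[i + 1];                      // escaped character
        i += 2;
      } else if (input[i] === quote) {
        return { value: out, end: i + 1 };        // closing delimiter
      } else {
        out += input[i];
        i += 1;
      }
    }
    return null;                                  // unterminated string
  };
}

// One definition covers all three alternatives from the sketch above:
const singleQuoted = delimited("'", "\\");
const doubleQuoted = delimited('"', "\\");
const plusQuoted = delimited('+', "\\");
```

The point is the call sites: each string flavor is a one-liner instead of a copy of the rule and its actions.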

@dmajda
Contributor

dmajda commented Jan 11, 2012

I should have been clearer. By "specific" I was looking for something like: "I was working on a grammar for language X, and there are 5 rules in it that could have been combined into one; here they are." That is, I wanted to see a real-world use case and real-world code. From that I can better judge in which cases this feature would be useful and for how many people.

Please don't take this as me being opposed to this feature per se. I just generally don't want to implement features useful to only a tiny fraction of languages or developers, because of the complexity and implementation cost. And in this case the cost is relatively high.

@ceymard
Author

ceymard commented Jan 11, 2012

While writing a parser for JavaScript just now, I could have had string = delimited_by('\'') / delimited_by('\"') and then later regexp = delimited_by('/').

Lately, I've been writing a parser for a language of my own. I have something like this in a PEG framework I wrote for Python:

LeftAssociative(op, subrule): l:subrule rest:(op subrule)* -> a_recursive_function_that_reverses_rest_and_makes_bintrees(l, op, subrule)

And I can then write:

...
exp_mult = LeftAssociative(/[\*\/%]/, exp_paren)
exp_add = LeftAssociative(/[\+-]/, exp_mult)

Since I have about as many levels of priority as C++ (all its operators plus a few more), I'll let you imagine how useful it can be. I'm not done with the parsing expressions yet, but I already use it 12 times.
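The action behind such a LeftAssociative rule can be sketched in plain JavaScript, assuming the match result has the shape of `l:subrule rest:(op subrule)*`, i.e. a head plus a list of [operator, operand] pairs. Names are illustrative:

```javascript
// Fold a head and a list of [operator, operand] pairs into a left-leaning
// binary tree: "1 - 2 - 3" becomes ((1 - 2) - 3), not (1 - (2 - 3)).
function leftAssociative(head, rest) {
  return rest.reduce(
    (left, [op, right]) => ({ type: 'binop', op, left, right }),
    head
  );
}

// head 1 and rest [['-', 2], ['-', 3]], as matched from "1 - 2 - 3"
const tree = leftAssociative(1, [['-', 2], ['-', 3]]);
// tree.left is the subtree for (1 - 2); tree.right is 3
```

With a parameterized rule, this one fold serves every precedence level instead of being duplicated a dozen times.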

@mcasimir

This would be great if combined with an 'import' feature:

require(CommaSeparatedList.pegjs)
require(StringLiteral.pegjs)
require(IntegerLiteral.pegjs)

...

Function
 = name:FuncName "(" args:CommaSeparatedList(Literal)  ")" 

Hash
= "{"   props:CommaSeparatedList(Prop)   "}"

Prop
= Key ":" Literal

Literal =
  StringLiteral / IntegerLiteral

@krockot

krockot commented Jun 12, 2012

(This is a bit more complicated than OP's request, but it seemed too close to justify its own thread.)

I'm building an R5RS Scheme parser with the help of PEG.js. Everything's rosy except for quasiquotations, which require context-aware parsing. It would be useful to be able to parameterize rules for the sake of on-the-fly rule generation from templates, avoiding a large amount of awkward post-processing. For example, a simplified quasiquotation grammar might look like:

    quasiquotation = qq[1]
    qq[n] = "`" qq_template[n]
    qq_template[0] = expression
    qq_template[n] = simple_datum / list_qq_template[n] / unquotation[n]
    list_qq_template[n] = "(" qq_template[n]* ")" / qq[n+1]
    unquotation[n] = "," qq_template[n-1]

I am interested in contributing to the development of this feature if there's any interest in adding it to the tool.
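As a hedged illustration of the semantics of those depth-indexed rules (toy syntax and all names invented here), the nesting level is just an ordinary argument in a hand-rolled recursive-descent parser:

```javascript
// Recursive-descent sketch of the qq[n] idea: depth n is a plain function
// parameter. Toy syntax: ` starts a quasiquote, , unquotes, ( ) delimit
// lists, any other character is an atom.
function parseQQ(input) {
  let pos = 0;
  const peek = () => input[pos];
  function qq(n) {                         // qq[n] = "`" qq_template[n]
    if (peek() !== '`') throw new Error('expected `');
    pos++;
    return { quasi: template(n) };
  }
  function template(n) {
    if (n > 0 && peek() === ',') {         // unquotation[n] = "," qq_template[n-1]
      pos++;
      return { unquote: template(n - 1) };
    }
    if (peek() === '`') return qq(n + 1);  // nested quasiquote raises the depth
    if (peek() === '(') {                  // list_qq_template[n]
      pos++;
      const items = [];
      while (peek() !== undefined && peek() !== ')') items.push(template(n));
      pos++;
      return items;
    }
    const atom = peek();                   // qq_template[0] bottoms out here
    pos++;
    return atom;
  }
  return qq(1);                            // quasiquotation = qq[1]
}
```

The awkward part, and the motivation for the feature, is that PEG.js as it stands offers no way to thread that `n` through its generated rules.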

@fresheneesz

The main reason to do this would be to support grammars sensitive to context, which, if I'm not mistaken, most popular languages are (I know for sure that C and Python have context-specific constructs). According to Trevor Jim, Haskell is also not context-free, and he asserts that most languages aren't:

http://trevorjim.com/haskell-is-not-context-free/
http://trevorjim.com/how-to-prove-that-a-programming-language-is-context-free/

Using external state in a parser that can backtrack (as PEG parsers can) is dangerous, and can produce issues such as the one in this parser:

{   var countCs = 0;
}

start = ((x/y) ws*)* { return countCs }

x = "ab" c "d"
y = "a" bc "e"

c = "c" { countCs++; }
bc = "bc" { countCs++; }

ws = " " / "\n"

The above returns 2 instead of the correct answer of 1. Issues like this can be hard to reason about, can create insidious hard-to-find bugs, and, when found, can be very hard to work around at all, much less elegantly. It's unclear to me how to even do this without post-processing the data returned by PEG. If your parser itself somehow needs the count, it's simply out of luck.
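The failure mode can be reproduced outside PEG.js with a tiny hand-rolled matcher (illustrative names, not PEG.js internals): ordered choice restores the input position on backtracking, but not the side effects that actions already performed.

```javascript
// Minimal reproduction of the bug described above: an action fires inside
// an alternative that later fails, and backtracking does not undo it.
let countCs = 0;

const lit = (s, action) => (input, pos) => {
  if (!input.startsWith(s, pos)) return -1;
  if (action) action();         // side effect runs even if we later backtrack
  return pos + s.length;
};
const seq = (...parts) => (input, pos) => {
  for (const p of parts) {
    pos = p(input, pos);
    if (pos < 0) return -1;     // failure discards the position, not effects
  }
  return pos;
};
const choice = (...alts) => (input, pos) => {
  for (const a of alts) {
    const end = a(input, pos);
    if (end >= 0) return end;
  }
  return -1;
};

const x = seq(lit('ab'), lit('c', () => countCs++), lit('d'));
const y = seq(lit('a'), lit('bc', () => countCs++), lit('e'));

choice(x, y)('abce', 0);  // x gets as far as "abc" before failing on "d"
// countCs is now 2, although the final parse consumed only one "c"
```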

Currently, (dangerously) using external state is the only way to parse grammars that are sensitive to context. With parameterized rules, a parser could handle this without risking invalid state:

start = countCs:((x<0>/y<0>) ws*)* { return countCs.reduce(function(a,b){return a+b[0];}, 0); }

x<count> = "ab" theCount:c<count> "d" { return theCount; }
y<count> = "a" theCount:bc<count> "e" { return theCount; }

c<count> = "c" { return count++; }
bc<count> = "bc" { return count++; }

ws = " " / "\n"

David, you asked for real situations, and Python's whitespace indentation syntax is clearly an example here. I want similar whitespace-indentation syntax in Lima (the programming language I'm building with PEG). But I would not want to implement anything like that when I could inadvertently create invalid state that blows everything to hell. I could name any parsing construct that requires context, like C's x* y (is it x times y, or y being defined as a pointer to an x-typed value?).

Note that for context-sensitive grammars to be parsable, one would necessarily need to pass information returned from already-matched subexpressions into a parameterized rule; otherwise the parser can't actually use any of the context. Here's a real example of a string type I'm considering for Lima that only works if parameterized parsing is available and can access (as variables) labels of previously matched expressions:

literalStringWithExplicitLength = "string[" n:number ":" characters<n> "]"
number = n:[0-9]* {return parseInt(n.join(''));}
characters<n> = c:. { // base case
  if(n>0) return null; // don't match base case unless n is 0
  else return c;
}
/ c:. cs:characters<n-1> {
  return c + cs;
}

This would be able to parse a string like string[10:abcdefghij]. You can't do that with nice pure PEG.js as it stands. You have to do something awful like:

{ var literalStringLengthLeft=undefined;
}
literalStringWithExplicitLength = "string[" n:specialNumber ":" characters "]"
specialNumber = n:number {
  literalStringLengthLeft = n;
  return n;
}
number = n:[0-9]* {return parseInt(n.join(''));}
characters = c:character cs:characters? {
  return c + cs
}
character = c:. {
  if(literalStringLengthLeft > 0) {
    literalStringLengthLeft--;
    return c;
  } else {
    literalStringLengthLeft = undefined;
    return null; // doesn't match
  }
}

Many, many protocols have this kind of parsing need. For example, an IPv4 packet has a field describing its total length; you need that context to properly parse the rest of the packet. The same is true for IPv6, UDP, and probably any other packet-based protocol. Most protocols built on TCP will also need something like this, since one needs to be able to transmit multiple completely separate objects over the same conceptual character stream.
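As a hedged sketch of that length-prefixed pattern (function and names invented here, not part of any library), the value matched earlier decides how much input a later step consumes:

```javascript
// Parse "string[N:...]" where exactly N characters follow the colon.
// Because the length is known before the payload, even a ']' inside
// the payload parses fine.
function parseLengthPrefixedString(input) {
  const m = /^string\[(\d+):/.exec(input);
  if (!m) return null;
  const n = parseInt(m[1], 10);           // context from an earlier match...
  const start = m[0].length;
  const chars = input.slice(start, start + n);
  if (chars.length !== n || input[start + n] !== ']') return null;
  return chars;                            // ...drives how much we consume here
}
```

This is exactly the information flow (a matched number feeding a later rule) that the `characters<n>` sketch above expresses, and that a pure PEG.js grammar cannot.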

Anyway, I hope I've given some good examples of and reasons why I think this is not only a nice feature and a powerful feature, but really an essential feature that many parsers are missing (including, for the moment, PEG.js).

@otac0n

otac0n commented Jan 6, 2013

Pegasus (a project that shares most of its syntax with peg.js) solves this by having a #STATE{} expression that is given the ability to mutate a dictionary of state. This dictionary of state is backtracked when rules are backtracked. This allows it to support significant whitespace parsing (see the wiki entry about Significant Whitespace for the details).

Also, by backtracking the state along with the parsing cursor, memoization can be accomplished for stateful rules as well.

Peg.js could easily do the same, I reckon.

@fresheneesz

How does Pegasus manage state when rules backtrack? I could imagine keeping a snapshot of all the program state that changed and reverting it, but that would be expensive. I could imagine keeping a snapshot of only the variables that changed, but that would either require the user to specify them, which would add complexity to creating parsers, or require the parser to somehow figure out all the state changed by some bit of code. None of these sound ideal, so how does Pegasus do it?

Theoretically, the parser could avoid invalidly executed actions if (A) actions are queued up in closures and only executed once the parser has fully completed, and (B) because they execute after the parser has completed, they can't cancel a rule match. Perhaps that scheme would be more efficient than the state backtracking done in Pegasus?

Also, fixing the problem of invalid state is very nice indeed, but it doesn't solve the problem of expressibility I brought up with the string literal like string[10:abcdefghij]. I'm definitely interested in how it works, though.

@otac0n

otac0n commented Jan 6, 2013

It doesn't backtrack the whole program's state. It maintains an immutable dictionary of state. It saves the current state dictionary along with the cursor and whenever the cursor is backtracked, the state dictionary gets backtracked with it. The dictionary is immutable anywhere outside of #STATE{} actions, and is COPIED just before each state change.

There is a small performance penalty for setting an extra variable every time you advance the cursor, but this is far outweighed by the ability to memoize stateful rules. Also, this doesn't lead to tons of memory allocation, because the immutable nature of the state dictionary allows it to be shared until it is mutated. For example, if you didn't have state in your parser, there would be only one allocation: a single (empty) state dictionary.

JavaScript doesn't (to my knowledge) have the ability to make an object immutable, but that was mostly a safety feature. Peg.js would just need to copy a state dictionary before processing each #STATE{} code block and backtrack that whenever the cursor is backtracked.
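A hedged sketch of that copy-before-change discipline in JavaScript (illustrative only, not Pegasus's or PEG.js's actual code): thread the position and the state dictionary together, never mutate the dictionary in place, and backtracking becomes "keep using the old snapshot".

```javascript
// Snapshot = { pos, state }. State is copied only when it changes, so
// unchanged snapshots share one dictionary and backtracking is free.
const snapshot = (pos, state) => ({ pos, state });

function updateState(snap, changes) {
  // copy-before-change: older snapshots keep seeing the old dictionary
  return { pos: snap.pos, state: { ...snap.state, ...changes } };
}

const before = snapshot(0, { indentLevel: 0 });
const attempt = updateState(before, { indentLevel: 4 }); // speculative branch

// the branch fails: "backtracking" is just resuming from `before`,
// whose state was never touched by the attempt
```

Note the sharing: a snapshot taken without any state change reuses the same dictionary object, which matches the "only one allocation if you have no state" claim above.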

@fresheneesz

Oh OK, so the user basically does have to specify what state they're changing. That's pretty cool. But I still don't think it covers the same benefits that parameterization does. It sounds like it's probably useful in its own right for other things, though.

@dmajda dmajda changed the title Parametrizable rules Implement parametrizable rules Aug 14, 2015
@tebbi

tebbi commented Nov 2, 2015

I just wrote a fork that supplies an environment, accessible using the variable env: https://github.com/tebbi/pegjs
This is the same as the #STATE{} object suggested above.
It is a quick hack, using a (package-)global variable that is restored whenever a parsing function is left. The copying of env is done with Object.create().

Here is an example grammar that uses it to parse whitespace-defined blocks a la Python:

{
  env.indLevel = -1
}

block =
  empty
  ind:ws* &{
    if (ind.length <= env.indLevel) return false;
    env.indLevel = ind.length;
    return true;
  }
  first:statement? rest:indStatement*
  {
    if (first) rest.unshift(first);
    return rest;
  }

indStatement =
  "\n" empty ind:ws* &{ return env.indLevel === ind.length; }
  stm:statement
  {return stm; }

statement =
    id:identifier ws* ":" ws* "\n"
    bl:block { return [id, bl]; }
  / identifier

identifier = s:[a-z]* { return s.join(""); }

empty = (ws* "\n")*

ws = [ \t\r]

Here is an example input for the resulting parser:

a
b:
   c
   d:
       e
   f
g

Mingun added a commit to Mingun/pegjs that referenced this issue Jan 3, 2016
…ate defining:

List<expr, delim> 'Generic list'
  = h:expr t:(delim e:expr {return e;})* { return [h].concat(t); };

Syntax for template instantiation:

CommaSeparatedIntList = List<n, ','>;
n = n:$[0-9]+ {return parseInt(n, 10);}
Mingun added a commit to Mingun/pegjs that referenced this issue Jan 3, 2016
Mingun added a commit to Mingun/pegjs that referenced this issue Jan 3, 2016
…ting AST and default handling for new AST nodes.
Mingun added several more commits to Mingun/pegjs that referenced this issue Jan 3, 2016
@qwertie

qwertie commented May 29, 2016

I have the impression that PEG.js doesn't support parameters of any kind on rules - which is surprising. This feature is very important to me.

What I need is simpler than the OP's request - the OP wants to modify the grammar itself depending on the parameter, whereas at a minimum I just need to pass an integer into a rule. Basically, I want to translate an LLLPG rule that looks like this (where PrefixExpr is a high-precedence expression, such as a prefix expression like -x, or an identifier...):

@[LL(1)]
rule Expr(context::Precedence)::LNode @{
    {prec::Precedence;}
    e:PrefixExpr(context)
    greedy
    (   // Infix operator
        &{context.CanParse(prec=InfixPrecedenceOf(LT($LI)))}
        t:(NormalOp|BQString|Dot|Assignment)
        rhs:Expr(prec)
        { ... }
    |   // Method_calls(with arguments), indexers[with indexes], generic!arguments
        &{context.CanParse(P.Primary)}
        e=FinishPrimaryExpr(e)
    |   // Suffix operator
        ...
    )*
    {return e;}
};
// Helper rule that parses one of the syntactically special primary expressions
@[private] rule FinishPrimaryExpr(e::LNode)::LNode @{
(   // call(function)
    "(" list:ExprList(ref endMarker) ")"
    { ... }
    |   // ! operator (generic #of)
        "!" ...
    |   // Indexer / square brackets
        {var args = (new VList!LNode { e });}
        "[" args=ExprList(args) "]"
        { ... }
    )
    {return e;}
};

My language has 25 precedence levels, and with these rules I have collapsed almost all of them into a single rule (you can think of Precedence as a wrapper around a couple of integers that describe the precedence of an operator). Moreover, my language has an infinite number of operators (basically any sequence of punctuation), and the precedence of an operator is decided after it is parsed. While it would be technically possible to parse the language in the usual way, with a separate rule for each of the 25 precedence levels, that would be a horrible way to do it.

Also you can see here that the inner rule FinishPrimaryExpr constructs a syntax tree that incorporates a parameter passed from the enclosing rule.

So ... is there any way to pass parameters to a PEG.js rule?
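The single-rule-for-all-precedences idea described above is essentially precedence climbing. A hedged JavaScript sketch, with the token list, operator table, and all names invented for illustration:

```javascript
// One rule covers every binary precedence level because the minimum
// binding power is a parameter, like Expr(context) in the LLLPG rule above.
const PREC = { '+': 1, '-': 1, '*': 2, '/': 2 };

function parseExpr(tokens, minPrec = 0, ctx = { i: 0 }) {
  let left = tokens[ctx.i++];                 // operand (a number token)
  while (ctx.i < tokens.length) {
    const op = tokens[ctx.i];
    const prec = PREC[op];
    if (prec === undefined || prec < minPrec) break;
    ctx.i++;
    // left associativity: the recursive call demands strictly higher precedence
    const right = parseExpr(tokens, prec + 1, ctx);
    left = [op, left, right];
  }
  return left;
}

const ast = parseExpr([1, '+', 2, '*', 3]);  // ['+', 1, ['*', 2, 3]]
```

An operator table can be extended at parse time, which is how a language with user-defined operators can still get away with one parameterized expression rule.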

@burnpanck

+1! In my case, I simply want to generate a parser for a syntax where some delimiters are globally configurable. I can achieve this by replacing the delimiter literals with match-anything expressions combined with a predicate, but it would be much more elegant (and also more efficient) if the match-anything expression could simply be replaced by a variable.

Mingun added commits to Mingun/pegjs that referenced this issue Jun 12, 2017
Mingun added commits to Mingun/pegjs that referenced this issue Oct 14, 2017
@reverofevil

reverofevil commented Jan 21, 2018

There are two types of arguments that could be passed to parsers:

  1. Values. Grammars for languages like Python, Nim and Haskell (and also Scheme, in a different way) depend on the "depth" of the expression inside the tree. Writing a correct grammar requires somehow passing this context.
  2. Parsers. leftAssociative(element, separator), escapedString(quote) and withPosition(parser) are good examples of that.

There should be a way to somehow mark whether an argument is a parser or a value. When I tried to figure out the correct approach, I ended up using global variables for context, and that's obviously a dead end. Does anyone have any ideas on that?
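In plain combinator style the two kinds fall out naturally, which may be one way to think about the design question: a parser-valued argument is a function, a value-valued argument is plain data. All names here are illustrative:

```javascript
// a tiny parser: match one specific character
const char = c => (input, pos) =>
  input[pos] === c ? { value: c, pos: pos + 1 } : null;

// parser-valued parameters: element and separator are themselves parsers
function sepBy1(element, separator) {
  return (input, pos) => {
    let r = element(input, pos);
    if (!r) return null;
    const values = [r.value];
    for (;;) {
      const s = separator(input, r.pos);
      if (!s) return { value: values, pos: r.pos };
      const next = element(input, s.pos);
      if (!next) return { value: values, pos: r.pos };
      r = next;
      values.push(r.value);
    }
  };
}

// value-valued parameter: n is plain data controlling the parse
function exactly(n, element) {
  return (input, pos) => {
    const values = [];
    for (let i = 0; i < n; i++) {
      const r = element(input, pos);
      if (!r) return null;
      values.push(r.value);
      pos = r.pos;
    }
    return { value: values, pos };
  };
}
```

In a grammar notation the distinction would need explicit marking, as noted above; in host-language code the types already carry it.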

@audinue

audinue commented Sep 9, 2018

How about macros?

Given:

Add <Expression, Add>
  = left:Expression _ '+' _ right:Add
    { return { type: 'add', left, right } }
  / Expression

When:

MyAdd
  = Add <MyExpression, MyAdd>

MyExpression
  = [0-9]+

Then:

MyAdd
  = left:MyExpression _ '+' _ right:MyAdd
    { return { type: 'add', left, right } }
  / MyExpression

MyExpression
  = [0-9]+

This allows us to build rules from the bottom up 😏

@billdong9

I agree, I recommend that developers add this feature :)

@ghost

ghost commented Aug 24, 2019

I really need this feature for an updated JavaScript grammar I'm writing, so this is high on my wish list. Will give it a go and see how it works out.

@meisl

meisl commented Aug 29, 2019

@samvv I've come across this from a very different route, and haven't read the whole thread yet.
However, in #572, which I referenced from here, I'm showing a technique with which you can simulate parameterized rules.

That is, in essence: return functions as intermediate parse results.

That "trick" is by no means my invention, and it's probably rather clunky for your purpose, I guess. But it might be a workaround for you, I mean until "post v1.0"... :)
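Reduced to plain JavaScript, the technique is ordinary currying: a rule's action returns a function of the not-yet-known parameter, and a later action applies it. A hedged sketch with invented names (note it parameterizes the produced value, not the parsing decisions):

```javascript
// An action that cannot see the count yet returns a function awaiting it:
const greetingAction = name => count =>
  Array(count).fill(`hi ${name}`).join(', ');

// the "intermediate parse result" is a function...
const pending = greetingAction('Bob');

// ...which a later rule's action applies once the count has been matched:
const result = pending(3);  // "hi Bob, hi Bob, hi Bob"
```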

@ghost

ghost commented Aug 29, 2019

@meisl Cool, thanks for the tip! Will try it out when I find some time.

@meisl

meisl commented Aug 29, 2019

@samvv Ooh, ah... I'm afraid I have overlooked something rather important:

It makes quite a difference whether you want the parameterized rule to

  • only be able to produce values that depend on the parameter
  • or (also) to have its parsing decisions depend on the parameter

What I was proposing only helps with the former, while the latter is the actual problem of the OP...
Sorry about that.

However, there is a workaround even for the latter, albeit an even clunkier one.
And, the "dependent decisions" part doesn't have anything to do with returning functions...

I'm appending an example for you to try out in https://pegjs.org/online

The basic idea is: use global state to remember the current "terminator". That's quite a hack, admittedly, and repetitive.
But adding yet another delimiter, say |, would just mean adding one more alternative to str:

  / (t:'|' {term = t}) c:conts t:.&{ return isT(t) }  { return c }

which differs from the others only in that one character, |


{
  var term;
  function isT(ch) { return ch === term }
  function isE(ch) { return ch === '\\' }
}
start = str*

str
  = (t:'\"' {term = t}) c:conts t:.&{ return isT(t) }  { return c }
  / (t:'\'' {term = t}) c:conts t:.&{ return isT(t) }  { return c }

conts
  = c:(
        '\\' t:.&{ return isT(t) || isE(t) } { return t }
      /      t:.!{ return isT(t)           } { return t }
    )*
    { return c.join('') }

... on inputs

  • "abc" -> ["abc"]
  • "a\"bc" -> ["a\"bc"]
  • "a\\bc" -> ["a\bc"]
  • "a\\b\"c"'a\\b\'' -> ["a\b\"c", "a\b'c"]

P.S.: that is really NOT something one would want to write by hand, I know. But hey, imagine it were generated for you... I think that is, in principle, how it would work.

@StoneCypher

@ceymard - I realize it's ten years later, but I'm curious how this is different from #36

@ceymard
Author

ceymard commented Feb 3, 2020

Wow, it took me a while to remember. 10 years!

In this issue, rules take arguments and can be parametrized. This is meant to be used by the grammar itself, to avoid repeating similar-yet-different rules.

In #36, the rules are specified outside of the grammar itself; the grammar is thus itself parametrized.

I think the scope is different, though one could argue that a grammar is itself a rule, and thus that this is the same issue. I think it is not, though, as #36 would probably mean some slight API changes, while this proposal would not.

@StoneCypher

So, to abuse C++-ish terminology in a deeply incorrect way, the former are template statics, whereas the latter are constructor calls?

@ceymard
Author

ceymard commented Feb 3, 2020

I guess this analogy somewhat works, yes.

@StoneCypher

Thank you for your time in explanation

I'll probably have another question for you in ten years. Have a good 2020s.

@Rush

Rush commented Aug 21, 2020

It would be really useful for removing redundancy in my parser definition. I have a custom grammar that is deliberately very relaxed, and some rules need to be applied in slightly different contexts.
