Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Grammar in Architecture #41

Open
ilslv opened this issue Nov 8, 2021 · 61 comments
Open

Update Grammar in Architecture #41

ilslv opened this issue Nov 8, 2021 · 61 comments

Comments

@ilslv
Copy link
Contributor

ilslv commented Nov 8, 2021

I'm implementing Cucumber Expressions parser in Rust: WIP. And grammar described in the ARCHITECTURE.md really bothers me for couple of reasons.

  1. Lookbehind and lookahead in alternation definition
    EBNF describes context-free grammars and (correct me if I'm wrong) shouldn't have lookaheads and lookbehinds. This also means that described grammar may have non-linear complexity.
  2. Provided grammar describes superset of Cucumber Expressions and I don't really see reasons to do it.

So implementing Cucumber Expression parser according to the docs leads to unnecessary performance overhead for no obvious reasons.

Can't we just provide exact grammar for Cucumber Expressions without Note: section?


If I didn't miss anything provided grammar should be exact Cucumber Expressions

cucumber-expression     = (alternation
                           | optional
                           | parameter
                           | text)*
text                    = (- text-to-escape) | ('\', text-to-escape)
text-to-escape          = '(' | '{' | '/' | '\' 

alternation             = single-alternation, ('/', single_alternation)+
single-alternation      = ((text-in-alternative+, optional*) 
                            | (optional+, text-in-alternative+))+
text-in-alternative     = (- alternative-to-escape) | ('\', alternative-to-escape)
alternative-to-escape   = ' ' | '(' | '{' | '/' | '\'

optional                = '(', text-in-optional+, ')'
text-in-optional        = (- optional-to-escape) | ('\', optional-to-escape)
optional-to-escape      = '(' | ')' | '{' | '/' | '\'

parameter               = '{', name*, '}'
name                    = (- name_to_escape) | ('\', name-to-escape)
name-to-escape          = '{' | '}' | '(' | '/' | '\'
@aslakhellesoy
Copy link
Contributor

How exciting to have a Rust port! ❤️

This looks good to me. Any feedback @mpkorstanje?

@ilslv would you be interested in donating your parser implementation to this repo (under a rust directory)? That way we can more easily share the same test suite, align the release cycle and keep everything in one place. We'd also set up a GitHub Action to release it as a crate whenever we make a release.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 8, 2021

@aslakhellesoy my implementation may be incompatible with test suit in this repository. As I could see tests under parser directory parse nested optionals just find and error only on expansion to the regex in matching directory. My implementation is 1-step that matches only correct Cucumber Expressions.

@aslakhellesoy
Copy link
Contributor

Do you think it would be possible to make your implementation compatible with the official test suite? If not, do you think there is something we could change in the official implementation's test suite (or implementation) in order to make it easier for your implementation to be compatible?

@mpkorstanje
Copy link
Contributor

@aslakhellesoy you added the EBNF phrase at 0875a40#diff-8f6366fd8e7a5fa30f7f05879999195b7cf5fa1e3d157822eb37f0e91d226cfdR7. Note that the code highlighting should still use EBNF because that makes the pseudo-grammar somewhat readable but the grammar itself is not EBNF.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 8, 2021

@aslakhellesoy

it would be possible to make your implementation compatible with the official test suite

Yes, but it would introduce some performance overhead, which we would like to avoid.

do you think there is something we could change in the official implementation's test suite (or implementation) in order to make it easier for your implementation to be compatible

It would be possible, but I don't think it costs the effort. As I mentioned only subset of the official test suite is incompatible. The most important tests for final matching and regex transformation are ok. I even think that only this test case would be incompatible. This happens because other implementations are using current ARCHITECTURE.md approach of generating AST which is wider than Cucumber Expression and error only later. My implementation generates only right Cucumber Expression from the start.
Mainly because of those difference I would hold on merging Rust implementation here. But you can still reference it using submodules or just mention it in the README.md.

@mpkorstanje
Copy link
Contributor

mpkorstanje commented Nov 8, 2021

@ilslv the grammar as written is context-sensitive and not in EBNF form (EBNF also can't express context sensitive grammars).

Provided grammar describes superset of Cucumber Expressions and I don't really see reasons to do it.

Users of Cucumbers are not typically familiar with the concepts involved in parsers or grammars. As such the typical error messages from a parser are incomprehensible. By accepting a super set of all cucumber expressions we can more clearly provide error messages for typical mistakes such as using optional parameter types, empty optionals, etc, etc by pointing the entire AST node that is in the wrong place rather then a single character.

text                    = (- text-to-escape) | ('\', text-to-escape)

This puts each character into a single AST node. This isn't great for languages with garbage collection and makes debugging much harder.

alternation             = single-alternation, ('/', single_alternation)+

It is now longer obvious that alternation is bounded by whitespace, parameter types or start/end of expression which means that we are losing some of the intention.

parameter               = '{', name*, '}'
name                    = (- name_to_escape) | ('\', name-to-escape)
name-to-escape          = '{' | '}' | '(' | '/' | '\'

The *-to-escape is pretty neat. This is definitively worth including.

Now I've only given this a cursory review but other then the issues mentioned above I don't see any immediate technical problem with the provided EBNF. Might be worth updating the documentation somewhat.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 8, 2021

@mpkorstanje

typical error messages from a parser are incomprehensible

Yes, but only if you are using some sort of parser-generator. Rust implementation has exactly the same errors, which can point at troublesome characters or range of characters. All this while preserving minimal overhead (1 dynamic allocation on every nested expression and even it can be avoided, I just don't think it costs the effort for now).

This puts each character into a single AST node. This isn't great for languages with garbage collection and makes debugging much harder.

There is no need to translate provided grammar 1-to-1. I propose adding it only to provide some formal specification for the Cucumber Expressions language.

My proposal is following: let's add formal EBNF specification for the language while preserving the old one and add documentation that it only exists only for implementation suggestion. This would allow Rust implementation (or any other one) to strictly follow it, because for now there is no way for me to reason that Rust implementation is correct, as there is no formal specification for Cucumber Expressions.
If you are on board with that idea, I will happily prepare PR documenting all this.

@mpkorstanje
Copy link
Contributor

No. If we only have a formal specification without a closely matching implementation and test set in this repository we are only going to see uncontrolled drift.

Though I don't believe a formal specification has to be in EBNF. As long as the context sensitive form can be procedurally be written to EBNF it can serve as a formal specification for your purposes.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 9, 2021

@mpkorstanje

No. If we only have a formal specification without a closely matching implementation and test set in this repository we are only going to see uncontrolled drift.

It looks like there is a misunderstanding, as I never proposed to remove anything, only add a formal specification. Let me sum up everything discussed in this issue.

What are the proposed changes?

I would like to see formal specification with some context-free grammar. Original issue shows the EBNF example.

It is now longer obvious that alternation is bounded by whitespace, parameter types or start/end of expression which means that we are losing some of the intention.

It's not obvious, but alternation is still bounded by whitespace, because of following rules

text-in-alternative     = (- alternative-to-escape) | ('\', alternative-to-escape)
alternative-to-escape   = ' ' | '(' | '{' | '/' | '\'

You have to escape whitespace for it to match.

I agree that it's hard to spot. That's why I'm proposing only to add formal specification, leaving existing one as one of implementation possibilities.

Why do we need this change?

For now there as there is no formal specification, only some non-existing grammar, I can't reason why Rust implementation is correct.

Why not to use existing guidance from ARCHITECTURE.md?

  1. ARCHITECTURE.md doesn't give full picture.
    There is no way to know how escaping works in different AST elements only from it. I had to dig throughout all the test cases to be sure, that I didn't miss anything. We should provide formal specification to reason about implementation correctness.
  2. ARCHITECTURE.md forces to use suboptimal implementation.
    It encourages to generate middleware AST that is wider, than Cucumber Expression and that is using context-sensitive grammar with lookbehinds and lookaheads. As shown by the Rust implementation there is a way to avoid both of those overheads by generating valid Cucumber Expression from the start, basing on context-free grammar. All that while still preserving friendly error messages.

@aslakhellesoy
Copy link
Contributor

@ilslv could you please send a pull request with your proposed changes?

@ilslv
Copy link
Contributor Author

ilslv commented Nov 9, 2021

@aslakhellesoy #44

@mpkorstanje
Copy link
Contributor

There is no misunderstanding.

From the discussion in cucumber-rs/cucumber-expressions#1 I see that you have asked to implement the parser in a way that closely matches the specification. You are now looking to update the specification to match your implementation.

However the proposed grammar is does not accept the superset of all Cucumber valid expressions. As such we do not have a reference implementation or test set that accurately describes it. Additionally if the EBNF can not be derived from the context-sensitive grammar it can not be said to be correct. If it can be derived from the context sensitive grammar it need not be explicitly included.

Currently the EBNF can't derived from the context sensitive grammar because it is not well specified. So I'm specifically requesting a fix to the context-sensitive grammar rather then an addition of an otherwise untested, unverified (in this repo) piece of specification.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 9, 2021

@mpkorstanje

update the specification to match your implementation

No, I'm trying to introduce formal specification as there is none yet.

does not accept the superset of all Cucumber valid expressions

Yes, thats the point to describe Cucumber Expressions through it's own formal specification, rather then relying on superset specification in non-existent grammar with number of notes on post-validation.

EBNF can not be derived from the context-sensitive

Yes, because Cucumber Expressions specification doesn't need to be context-sensitive.

it can not be said to be correct

There is no notion of correctness right now, as no formal specification exists.

fix to the context-sensitive grammar

My grammar is context-free. Can you please provide concrete examples, where my attempt at formalisation fails?

@tyranron
Copy link

tyranron commented Nov 9, 2021

@ilslv @mpkorstanje it seems to me that the discussion becomes a little bit overheated. Let's sum it up a little with facts and coclusions, to push it into a constructive direction.

The current grammar described in ARCHITECTURE.md:

  1. Somewhat implementation-driven as describes some superset of Cucumber Expressions followed by post-validation rules for the sake of better spanning info and error messages in the current parser implementations.
  2. Not context-free, while doesn't have to.
  3. Covered with a tests suite to ensure the current implementations match it.

This imposes problems for alternative implementations:

  • either to follow closely the original implementation (which is too restrictive regarding implementation approaches and performance reasoning);
  • or to implement something new only pretending to call itself Cucumber Expressions, but in fact, having no ability to reason whether it's the right implementation or not (which is bad for future proofing and compatibility).

We'd really like to have implementation freedom, while preserve the ability to formally check correctness. For this purpose @ilslv proposed to add a "formal spec" grammar, which is context-free and implementation-free, but describes the actual Cucumber Expressions precise enough, so any implementation can refer to its as the source of truth.

But @mpkorstanje is against having a formal spec without an ability to verify whether implementations do follow it, as it will introduce an obvious drift between the actual spec and its implementations.

Please, correct me if I'm wrong.

So, from my point of view:
It worths, of course, to provide a precise "formal spec" grammar proposed by @ilslv. But we should also provide a generic set of tests for it which would be suitable for any implementation.
With this, we can satisfy @mpkorstanje point about "actual correctness", while also provide third-party alternative implementations (like ours) with an obvious way to verify its correctness. Having this tests suite will allow to strip out major part of existing tests, leaving only the implementation-specific ones (like spanning/errors handling, or language-dependent stuff).

Is there any arguments against this way? Or is there any issue that I do miss?

@mattwynne
Copy link
Member

@tyranron thanks for pulling this discussion together and @ilslv thanks for all your contributions so far. I share Aslak's excitement at seeing a Cucumber in Rust!

If people would like to talk about this in a more real-time context, we can be found on the https://cucumber.io/community#slack in the #committers channel.

We also have a regular Zoom call on Thursdays at 7am PST (4pm CET) that you'd be more than welcome to join us for. If you hop into the Slack we can share the Zoom link in there or add you to the invite.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 10, 2021

@mattwynne thanks for the invitation! But I'd like to have discussion here manly for 2 reasons:

  1. In my experience Slack conversations have tendency to branch off into some other topic and are hard to follow.
  2. History of Slack conversations gets lost eventually, while it's good to have full discussion of technical decision (especially controversial ones like this). That allows future contributors to better understand what options where available at the time and why exact decision was made.

We also have a regular Zoom call on Thursdays at 7am PST (4pm CET) that you'd be more than welcome to join us for.

If issue won't be resolved by Thursday I thinks I will be able to join you on the call. They have tendency to be more constructive, as time is limited 😅

@mpkorstanje
Copy link
Contributor

@ilslv no, my request is very simple:

To fix the context sensitive grammar in such a way that a the EBNF can be logically derived from it.

I'm not opposed to replacing the context sensitive grammar with an EBNF if the test suite, implementations in this repository follow suit. However that is a significant commitment.

As such fixing the context sensitive grammar in such a way that a the EBNF can be logically derived from it should be a reasonable compromise.

Third party implementations will be able to use a "formal" grammar while this project retains the canonical grammar with closely matching implementation and tests.

In the long term we can then look at improving the test suite, grammar and implementations.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 10, 2021

@mpkorstanje

To fix the context sensitive grammar in such a way that a the EBNF can be logically derived from it.

I don't understand what EBNF can be logically derived from it actually means.

We have 2 grammars:

  1. ARCHITECRURE.md
  • Context-sensitive
  • Describes superset of Cucucumber Expressions
  • Written in non-existent grammar with number of notes on post-validation
  1. Proposed grammar
  • Context-free
  • Describes what I believe is exact Cucucumber Expressions spec
  • Written in EBNF

What do you mean by fix the context sensitive grammar?
Do you want final grammar to be context-sensitive? With EBNF it couldn't be achieved and it leads to unnecessary performance overhead for implementation to be compliant.
Do you still want resulting grammar to represent superset of Cucumber Expression? I'm strongly against it, as it won't let us say that particular implementation is correct. And I don't think this will be abled to achieved with EBNF too. And this also leads to performance overhead I described earlier.
The only other complaint I saw was regarding alternation not been obviously bounded by whitespace and I already responded that actually it is bounded. And there is no other way to make it more clear to my knowledge.

I still don't understand what do you actually want to fix.

@mpkorstanje
Copy link
Contributor

I'm looking to retain a close similarity between the grammar, tests and implementation in this repo while also make it possible for you to derive a grammar in EBNF form without asking you to also change the implementations and tests in this repo.

Now for our purposes we can say that grammars are equivalent if they accept the same language. So if the language accepted by cucumber expressions is context free you should be able to derive the context free grammar from the context sensitive grammar in a structured manner (or be able to proof equivalency in another way).

Currently I don't believe this is possible because the context sensitive grammar contains several errors and ambiguities. This inability to derive a context free grammar from the context sensitive one is what I'm asking you to fix. Additionally the the fixed grammar should still be context sensitive and accept the superset if Cucumber expressions with optional parameter types, empty optionals, ect.

So once fixed we will have one context sensitive grammar that matches our implementation and tests from which you can logically derive a context free one.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 10, 2021

@mpkorstanje

  1. I want to guarantee that Rust implementation is correct. For that I need to formalize Cucumber Expressions language. While you are asking me to fix language which is superset of Cucumber Expressions. Even if I do it, there would be no way to say that Rust implementation is correct.
  2. There is no way to express context-sensitive language in EBNF. If you want to fix existing language while preserving context-sensitivity we need to choose another way of expressing it.
  3. It is mathematically impossible to derive context-free language form context-sensitive in general case.

All in all we have different intentions. You just want to fix existing superset grammar and by doing it to force implementations to be suboptimal to be somehow correct (while still this word can't be applied to language without spec).
While I want to introduce formal specification for Cucumber Expressions language to be able guarantee correctness not only for Rust implementations, but also to every other implementation.

@mpkorstanje
Copy link
Contributor

I don't think it is impossible. This isn't the general case of rewriting any arbitrary grammar but rather this the case of rewriting this specific grammar. Assuming your EBNF is correct we know context free form exists.

But if you do think it is impossible then the only mutually acceptable alternative is to rewrite the grammar, tests, and existing implementations.

@tyranron
Copy link

@mpkorstanje @ilslv

So the least resourceful, but good enough, solution would be to have a programmable way for mapping "formal spec" AST into the current one, or vice versa, or having even another reduced representation, so both "formal spec" AST and the current AST may be translated into for correct comparing. This will allow us to fully reuse exisitng implementation and tests.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 11, 2021

@tyranron

programmable way for mapping "formal spec" AST into the current one, or vice versa

Simply mapping from formal spec into superset grammar won't give us much info. It's almost the same as having grammar that accepts only empty string and will always map in superset.
Mapping from superset grammar into formal spec isn't possible, as first one is superset of another.
Post-validation is key problem here.

The only thing, that will bring us to being somewhat correct is to feed all implementations some input from manual tests and fuzzers and compare resulting Cucumber Expressions AST or errors produced while parsing.
Another problem is that tests inside this repo aren't suitable. parser folder contains expansion of superset AST as seen by this testcase. I don't also know if current implementations can produce exact Cucumber Expressions AST.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 11, 2021

@mattwynne

I also noticed one interesting implementation detail. Whitespaces are treated as a some sort of special case

---
expression: three hungry/blind mice
expected_ast:
  type: EXPRESSION_NODE
  start: 0
  end: 23
  nodes:
  - type: TEXT_NODE
    start: 0
    end: 5
    token: three
  - type: TEXT_NODE
    start: 5
    end: 6
    token: " "
  - type: ALTERNATION_NODE
    start: 6
    end: 18
    nodes:
    - type: ALTERNATIVE_NODE
      start: 6
      end: 12
      nodes:
      - type: TEXT_NODE
        start: 6
        end: 12
        token: hungry
    - type: ALTERNATIVE_NODE
      start: 13
      end: 18
      nodes:
      - type: TEXT_NODE
        start: 13
        end: 18
        token: blind
  - type: TEXT_NODE
    start: 18
    end: 19
    token: " "
  - type: TEXT_NODE
    start: 19
    end: 23
    token: mice

Rust implementation does completely the same thing. So It looks like existing implementation already work around context-sensitive grammar by being context-free. This can mean that we no longer can call them correct. In that case bringing context-free grammar would only make it closer to existing implementations.

@mpkorstanje
Copy link
Contributor

mpkorstanje commented Nov 11, 2021

Rust implementation does completely the same thing. So It looks like existing implementation already work around context-sensitive grammar by being context-free.

I'm not sure about the point you are making.

As we've already established, we can assume the language accepted by the grammar to be context-free. The grammar however is written in context sensitive form for clarity. I'm open to updating the grammar provided the implementations match the grammar.

E.g this would have to change:

private static final Parser alternationParser = (expression, tokens, current) -> {
int previous = current - 1;
if (!lookingAtAny(tokens, previous, START_OF_LINE, WHITE_SPACE, END_PARAMETER)) {
return new Result(0);
}
Result result = parseTokensUntil(expression, alternativeParsers, tokens, current, WHITE_SPACE, END_OF_LINE, BEGIN_PARAMETER);
int subCurrent = current + result.consumed;
if (result.ast.stream().noneMatch(astNode -> astNode.type() == ALTERNATIVE_NODE)) {
return new Result(0);
}
int start = tokens.get(current).start();
int end = tokens.get(subCurrent).start();
// Does not consume right hand boundary token
return new Result(result.consumed,
new Node(ALTERNATION_NODE, start, end, splitAlternatives(start, end, result.ast)));
};

@ilslv
Copy link
Contributor Author

ilslv commented Nov 11, 2021

@mpkorstanje

I'm not sure about the point you are making.

Your previous argument was about not being able to change the grammar into being context-free, as current implementations are based. I'm, on the other hand, saying that argument doesn't make sense to me, if current implementations already don't follow provided grammar.


I've finally looked into the implementation and from I can see ether grammar is more broken, than I thought, or implementation is wrong.

private static final Parser alternationParser = (expression, tokens, current) -> {
int previous = current - 1;
if (!lookingAtAny(tokens, previous, START_OF_LINE, WHITE_SPACE, END_PARAMETER)) {
return new Result(0);
}
Result result = parseTokensUntil(expression, alternativeParsers, tokens, current, WHITE_SPACE, END_OF_LINE, BEGIN_PARAMETER);
int subCurrent = current + result.consumed;
if (result.ast.stream().noneMatch(astNode -> astNode.type() == ALTERNATIVE_NODE)) {
return new Result(0);
}
int start = tokens.get(current).start();
int end = tokens.get(subCurrent).start();
// Does not consume right hand boundary token
return new Result(result.consumed,
new Node(ALTERNATION_NODE, start, end, splitAlternatives(start, end, result.ast)));
};

Here we can see call into parseTokensUntil which should be doing alternative* + ( '/' + alternative* )+ + (?=right-boundary) parsing part. Let's peek inside

private static Result parseTokensUntil(
String expression,
List<Parser> parsers,
List<Token> tokens,
int startAt,
Type... endTokens) {
int current = startAt;
int size = tokens.size();
List<Node> ast = new ArrayList<>();
while (current < size) {
if (lookingAtAny(tokens, current, endTokens)) {
break;
}
Result result = parseToken(expression, parsers, tokens, current);
if (result.consumed == 0) {
// If configured correctly this will never happen
// Keep to avoid infinite loops
throw new IllegalStateException("No eligible parsers for " + tokens);
}
current += result.consumed;
ast.addAll(result.ast);
}
return new Result(current - startAt, ast);
}

As I understand algoritm is following:

  1. Check lookingAtAny(tokens, current, endTokens)
  2. Parse more with Result result = parseToken(expression, parsers, tokens, current);
  3. If parsed something, go to 1, or throw Exception otherwise

If so, this is not how lookahead defined at (?=right-boundary) should work. Lookahead checks after each parsed character. This is what allows us to something like this in regex (stopping greedy parser using lookahead after the first character).

So, if I understand algorithm correctly, we have 3 possible cases:

  1. Implementation is wrong and should implement real lookahead.
  2. Implementation is right and grammar is wrong.
  3. Implementation and grammar are correct and (?=right-boundary) doesn't mean lookahead and means something like apply previous parser and see if next element is right-boundary while not consuming it

Which is it?

@mpkorstanje
Copy link
Contributor

I'd say #3 is the most accurate, but again, I don't see how this changes any of the conclusions made so far. If we change the grammar, we change the implementations and tests to match.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 11, 2021

@mpkorstanje this is significant, because real lookahead requires context-sensitive grammar, while implemented one doesn't. I believe that it's possible to formalise pretty easy-to-understand context-free grammar which fully compliant with implementation.

@mpkorstanje
Copy link
Contributor

Can you update the implementation and grammar to reflect that?

@ilslv
Copy link
Contributor Author

ilslv commented Nov 12, 2021

@mpkorstanje I'm not too comfortable with any languages other implementations are using. But more important thing is that we don't have to update implementation right the way, as provided grammar is equivalent to implemented one.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 12, 2021

@mpkorstanje

Even more

cucumber-expression = ( alternation | optional | parameter | text )*
alternation         = (?<=left-boundary), alternative*, ( "/" + alternative* )+
left-boundary       = whitespace | "}" | ^
alternative         = optional | text
optional            = "(", option*, ")"
option              = optional | parameter | text
parameter           = "{", name*, "}"
name                = whitespace | .
text                = whitespace | ")" | "}" | .

is equivalent to

cucumber-expression = ( alternation | optional | parameter | text )*
alternation         = alternative*, ( "/" + alternative* )+
alternative         = optional | text
optional            = "(", option*, ")"
option              = optional | parameter | text
parameter           = "{", name*, "}"
name                = whitespace | .
text                = whitespace | ")" | "}" | .

This happens because alternation is greedy and comes first in cucumber-expression. So previous characters in the start of alternation parsing are always ^ or something previously parsed by optional | parameter | text which wasn't accepted in alternation. This happens to be only exactly whitespace | "}" | ^ so (?<=left-boundary) is redundant too, as always returns true.

Final grammar which is equivalent to the existing one is just relaxed version of my original proposal. By relaxed I mean that it simply accepts invalid Cucumber Expressions too, while mine doesn't.

@mpkorstanje
Copy link
Contributor

Cheers! We can use this info to update the grammar and implementations sometime in the future then.

@mpkorstanje
Copy link
Contributor

Fixing the grammar / parser to use the (- X-to-escape) | ('\', X-to-escape) doesn't look so hard either.

@mpkorstanje
Copy link
Contributor

How do you "span" the invalid subtrees in error messages without a fully parsed AST?

Example:

Hello (optional {parameter}) world
                ^--------^

The parser would not fail at the first {, how do you know to span the pointers to the next element and not also include other syntax errors like Hello (optional {parameter {} ) world?

@ilslv
Copy link
Contributor Author

ilslv commented Nov 12, 2021

@mpkorstanje

Hello (optional {parameter}) world
                ^---------^

This one is pretty easy to tackle: If I'm started to parse optional there can't be unescaped {. Once I encounter one, I try to parse parameter. If successful, that means I have an entire parameter inside of optional. Otherwise this is an error too, just another one: unescaped reserved character.


So in case of Hello (optional {parameter {} ) world this will error about parameter containing {, because we've started to parse parameter inside an optional, which errored.


Also helpful idea of splitting errors into recoverable once, and irreconcilable. As the name suggests, recoverable allows us to continue parsing with another parser. For example in abc we first try to parse alternative, recoverably fail, then with option and so on, until encounter text. While in case of abc/ we'll try to parse alternative, which will encounter /, after that we know, that all subsequent errors aren't recoverable, as only alternatives can have unescaped / inside. So having empty alternation is immediate irrecoverable error.

@mpkorstanje
Copy link
Contributor

Can you show an example of that?

@ilslv
Copy link
Contributor Author

ilslv commented Nov 12, 2021

@mpkorstanje Rust syntax is quite alien-looking, but here you go

@mpkorstanje
Copy link
Contributor

I also don't see where you would construct the ^---------^ span.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 12, 2021

@mpkorstanje

Mmh. This error message doesn't look quite right

I've united all IlligalIn.. into UnescapedReservedCharacter as final suggestion of escaping this character is the same

I also don't see where you would construct the ^---------^ span.

Actually I don't do it. I allow users to supply string wrapped in a LocatedSpan which will track where the captured slice is coming from. In this case exact error message can be constructed, but there is possibility to skip this tracking by supplying raw string.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 16, 2021

@mpkorstanje

Cheers! We can use this info to update the grammar

So, do you want update grammar to exact Cucumber Expressions or still use superset AST?

@mpkorstanje
Copy link
Contributor

So far I don't see how I can generate the error messages with the ^---------^ span without a superset.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 16, 2021

@mpkorstanje I've already described solution with LocatedSpan.

Codesample:

let err = Expression::regex("I have {int cucumbers in my belly").expect_err();
match err {
    Error::UnescapedReservedCharacter(e) => {
        assert_eq!(*e, "{");
        assert_eq!(
            repeat(' ')
                .take(e.offset())
                .chain(once('^'))
                .chain(repeat('-').take(e.len()))
                .chain(once('^')),
            "       ^-^".chars(),
        )
    }
    e => panic!("wrong err: {}", e),
}

let err = Expression::regex("I have ({int}) cucumbers in my belly").expect_err();
match err {
    Error::ParameterInOptional(e) => {
        assert_eq!(*e, "int");
        assert_eq!(
            repeat(' ')
                .take(e.offset())
                .chain(once('^'))
                .chain(repeat('-').take(e.len()))
                .chain(once('^')),
            "        ^---^".chars(),
        )
    }
    e => panic!("wrong err: {}", e),
}

More code examples are available here

@mpkorstanje
Copy link
Contributor

Unfortunately that doesn't show how to compute the bounds that go into the span.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 16, 2021

@mpkorstanje it does exactly that.

e.offset() returns offset of the span in original input.
e.len() returns length of the span.
repeat(' ').take(e.offset()).chain(once('^')).chain(repeat('-').take(e.len())).chain(once('^')) does exactly what you were looking for.

@mpkorstanje
Copy link
Contributor

So where do those values come from?

@ilslv
Copy link
Contributor Author

ilslv commented Nov 17, 2021

@mpkorstanje as I mentioned before, implementation abstracts over input type: https://github.com/ilslv/cucumber-expressions-1/blob/main/src/parse.rs#L427-L435
Rather than accepting String, it accepts something that implements Offset. So LocatedSpan implements Offset by remembering it and slicing inner String.

@tyranron
Copy link

@mpkorstanje the idea is that we carry the location information along while descending in a parser.

  1. Given the string as input:

    let err = Expression::regex("I have ({int}) cucumbers in my belly").expect_err();
  2. We do wrap it into a LocatedSpan before feeding to actual parser (which operates on an already location capable inputs):

    parse::expression(Spanned::new(value))

    The LocatedSpan::new() sets the positions as:

    offset starts at 0, line starts at 1, and column starts at 1.

  3. Then we descent in a parser and input sub-parts to other functions. When we slice that sub-part of an input, we adjust the offset appropriately. In Rust this implementation detail is abstracted behind Slice implementation, but it can be viewed here: the offset is adjusted when we call slicing operator on a LocatedSpan wrapper.

  4. So, at the moment of emitting an error, we have all the location information in-place. Again, in Rust implementation, the error value itself holds a failed piece of input along with its location info (as long as it's wrapped into LocatedSpan), which, in turn, allows us to call those .offset() and .len() methonds on the returned error value.

@mpkorstanje
Copy link
Contributor

So when encountering a { inside an optional, how do you find the matching } to span the entire parameter without parsing? For example:

Hello (optional {parameter}) world
                ^--------^

Best I can find is:

            Some('{') => {
                if let Ok((_, par)) = peek(parameter)(input.clone()) {
                    return Error::ParameterInOptional(
                        input.take(par.0.input_len() + 2),
                    )
                    .failure();
                }
                return Error::UnescapedReservedCharacter(input.take(1))
                    .failure();
            }

And it looks like peek(parameter) is doing the parsing.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 17, 2021

@mpkorstanje

already described here

This one is pretty easy to tackle: If I'm started to parse optional there can't be unescaped {. Once I encounter one, I try to parse parameter. If successful, that means I have an entire parameter inside of optional. Otherwise this is an error too, just another one: unescaped reserved character.


And it looks like peek(parameter) is doing the parsing.

yes, it tries to parse parameter

Can you please formulate all the questions about Rust implementation, so I can tackle them all at once in depth? Because it looks like we are walking in circles.

@mpkorstanje
Copy link
Contributor

Right so you are parsing invalid ASTs and rejecting them early. This rejection seems rather ad-hoc and I don't see a good reason to do it.

The moment we'd add a test case with multiple AST problems you'd have to change the entire architecture. For example:

Hello (optional {parameter}) (optional {parameter}) 
                ^--------^             ^--------^

@ilslv
Copy link
Contributor Author

ilslv commented Nov 17, 2021

@mpkorstanje

This rejection seems rather ad-hoc and I don't see a good reason to do it.

Most other parsers do exactly the same thing. This is not an ad-hoc solution, but the only possible way to implement zero-cost performant parser. This entire conversation still doesn't have good reason to parse entire superset AST.

The moment we'd add a test case with multiple AST problems you'd have to change the entire architecture.

No, we still can continue parsing and collecting error and return batch of them. This won't require architectural changes. The only reason I've implemented this way is to close to existing implementations, which error early too. Example

Also I don't see error messages as a part of language specification.

@mattwynne
Copy link
Member

Would any of you be interested in an ensemble programming session where we work on this together? I feel like it might speed up the communication a bit if we could work on it in real time, and it could be fun!

I know some people prefer working asynchronously though, so no pressure, I just thought I'd suggest it.

If folks are interested I can do the work to figure out our schedules and find a good time.

@mpkorstanje
Copy link
Contributor

Fundamentally, to keep the implementation and spec close together this means rewriting the existing implementations. This takes time. I currently don't have that sort of time to commit to this.

I'm being asked to make a judgement call about using the superset of Cucumber expressions or exact set in the EBNF. I can't make this judgement call. To cut the task down to size it might even be convenient to in a first iteration make the grammar context-free. And in a second iteration remove the redundant parts of the grammar.

@ilslv
Copy link
Contributor Author

ilslv commented Nov 18, 2021

@mpkorstanje if I understand correctly you want to make every PR atomic. This means after every merge implementation should closely resemble grammar. I don't consider myself comfortable enough this any other implantation language, especially taking into account number of users. So only help I can do is to provide context-free grammar in separate peaces.

So I see following roadmap:

  1. Slightly fix current grammar to remove errors and ambiguities
    This part doesn't require implementation tinkering, so can be performed by me.
  2. Make grammar context-free while still describing superset grammar and correct implementations
    I guess we are talking about this grammar
cucumber-expression = ( alternation | optional | parameter | text )*
alternation         = alternative*, ( "/" + alternative* )+
alternative         = optional | text
optional            = "(", option*, ")"
option              = optional | parameter | text
parameter           = "{", name*, "}"
name                = whitespace | .
text                = whitespace | ")" | "}" | .
  1. If all goes well, maybe update grammar to describe exact Cucumber Expressions and update implementations accordingly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants