Some minor bugs in the Antlr grammar. #50776
Bug: #50776
Change-Id: I3c81a675ca883d39e617d7eb5c71cce9c7b03562
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/276640
Reviewed-by: William Hesse <whesse@google.com>
Commit-Queue: Erik Ernst <eernst@google.com>
@modulovalue, thanks for your input!
Indeed, good catch! Fixed in https://dart-review.googlesource.com/c/sdk/+/276640. However, many of the things you noted are actually expressed in the given way for a reason. Here's why:
If we do that then we can't parse some programs. The reason for this is that we can have things like … You could say that we should report an error if there is any whitespace between the … In any case, the syntax needs to be specified in a way that enables parsing, and then we can have extra rules about whitespace (intended to prevent code that "looks funny"). There are a bunch of other cases which are similar, and the same explanation applies there. The occurrence of '[' ']' in the … So the main points we can make here are the following: …
About: >>>=, >>=, >=

Consider the following example:

```dart
void main() {
  List<List<List<int>>>== 2;
}
```

DartPad seems to recognize the >>>= in …
Yes, it should compile. Cf. #50794.
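The clash in the example above can be sketched with a toy maximal-munch tokenizer (a hypothetical illustration, not the Dart SDK's lexer): if the compound operators are single lexer tokens and the lexer always takes the longest match, the closing brackets of nested generics fuse with a following '=':

```python
import re

# Longest spellings first, so the scan below is maximal munch.
OPERATORS = [">>>=", ">>>", ">>=", ">>", ">=", ">", "<", "="]

def tokenize(src):
    tokens, i = [], 0
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        m = re.match(r"[A-Za-z_]\w*|\d+", src[i:])
        if m:                      # identifier or number literal
            tokens.append(m.group())
            i += m.end()
            continue
        for op in OPERATORS:       # first (longest) spelling wins
            if src.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            raise ValueError(f"unexpected character {src[i]!r}")
    return tokens

# Maximal munch fuses the generic's closing brackets with '=':
print(tokenize("List<List<int>>= 2"))
# ['List', '<', 'List', '<', 'int', '>>=', '2']
```

This is why a lexer-level '>>>=' token conflicts with generics: the parser needs two separate '>' tokens plus '=' here, but the greedy lexer has already committed to '>>='.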
About: '[]'

Thank you for pointing out my error. Another reason why my proposal about changing '[' ']' to '[]' is wrong: operator signatures need to allow whitespace between '[' and ']'. I missed that. Here is an adjusted version:

Wouldn't the principle of maximal munch (i.e. always consuming the longest input possible), as required by the Dart grammar, ensure that we could change

```
listLiteral
    : CONST? typeArguments? '[' elements? ']'
    ;
```

to

```
listLiteral
    : CONST? typeArguments? ('[' elements? ']' | INDEX)
    ;

INDEX
    : '[]'
    ;
```

and operator signatures could be adjusted to accept both a '[' token followed by a ']' token and a single '[]' token, e.g. change

```
operatorSignature
    : type? OPERATOR operator formalParameterList
    ;

operator
    : '~'
    | binaryOperator
    | '[' ']'
    ...
    ;
```

to

```
operatorSignature
    : type? OPERATOR (operator | '[' ']') formalParameterList
    ;

operator
    : '~'
    | binaryOperator
    | INDEX
    ...
    ;
```

Do you think that this could work, or am I still missing something?
We'd need to consider the lexical as well as the context free level.

At the lexical level it is (I think) universally true that a lexer is expected to consume the longest possible prefix of the remaining input for each token that it produces. If we're using a parser generator then it's typically also a matter of ordering the lexical rules correctly: if we have a rule that …

I don't think we can make a similar statement at the context free level. For example:

```
<Start> ::= <N> 'a';
<N> ::= 'a' <N> | 'a';
```

If we're parsing 'aaa' according to this grammar, we'll derive … In other words, we certainly can't say that the derivations from any particular non-terminal during a parsing process will consume the longest possible part of the remaining input.

If we're using a table driven LR parser (like the one generated by a very old tool like YACC), the information which is used to decide whether we'd consume ('shift') the next token on the input or not is in the table and in the state of the parser (which is computed from the current parser stack using another part of the table), and that table is computed based on global properties of the grammar. In other words, we need to consider every single rule of the entire grammar in order to decide whether we'll be greedy or not in any given concrete situation.

I think we'll have to say that 'maximal munch' just doesn't make sense at the context free level.
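The 'aaa' example can be made concrete with a tiny backtracking recursive-descent parser (an illustrative sketch, not any real tool): the greedy derivation of <N> consumes all three characters and fails, and the parse only succeeds after <N> backs off to two:

```python
# Grammar:
#   Start ::= N 'a'
#   N     ::= 'a' N | 'a'

def parse_n(src, i):
    """Yield every end position N can reach from i, longest match first."""
    if i < len(src) and src[i] == 'a':
        yield from parse_n(src, i + 1)  # try 'a' N (the greedy path) first
        yield i + 1                     # then the short alternative 'a'

def parse_start(src):
    for j in parse_n(src, 0):          # longest candidate for N tried first
        if j < len(src) and src[j] == 'a' and j + 1 == len(src):
            return j                   # number of characters N consumed
    return None                        # no derivation covers the input

print(parse_start("aaa"))  # 2: N matched 'aa', not the greedy 'aaa'
```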
A certain amount of adjustments like that might be possible, but I'm pretty sure the complexity will explode. For instance, if '>>>=' is unconditionally turned into a … I think it might work for …

I think it's much simpler and safer to use a grammar which will allow for the actual structure to be detected as soon as possible. E.g., if we are parsing the compound assignment operator '>>>=' as a context free construct containing the four tokens '>' '>' '>' '=' …

It is not difficult to enforce additional rules about whitespace after parsing, as a regular compile-time error. For instance, we could require that the compound assignment operator occurs as '>>>=' in the source code, with no spaces between the punctuation characters. Another example: I'm pretty sure we'd want to say that symbols like …
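The "parse loosely, then restrict whitespace afterwards" idea can be sketched as follows (hypothetical helper names, not the actual compiler): tokens carry source offsets, so adjacency is a trivial post-parse check:

```python
from typing import NamedTuple

class Tok(NamedTuple):
    text: str
    offset: int  # index of the token's first character in the source

def lex(src):
    """Toy lexer: every non-space character is its own token with an offset."""
    return [Tok(c, i) for i, c in enumerate(src) if not c.isspace()]

def adjacent(tokens):
    """True iff each token starts exactly where the previous one ended."""
    return all(a.offset + len(a.text) == b.offset
               for a, b in zip(tokens, tokens[1:]))

print(adjacent(lex(">>>=")))   # True: the four tokens spell '>>>=' exactly
print(adjacent(lex(">>> =")))  # False: whitespace -> compile-time error
```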
So that we are on the same page, can you elaborate why you are referring separately to a lexical and a context free level?

```
MULTI_LINE_COMMENT
    : '/*' (MULTI_LINE_COMMENT | .)*? '*/'
    { skip(); }
    ;
```

Multi line comments nest, so the full lexical grammar can't be regular. To lex multi line comments (and strings with interpolations) we need a stack. Once we have a stack, we have a PDA and can therefore recognize context free languages. Therefore, any lexer for the full lexical grammar of Dart needs to be able to recognize context free languages? (This isn't entirely correct; I think a DPDA is enough, but that's probably not relevant for this comment.)
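The stack needed for nested comments alone is very modest; a sketch (not Dart's real lexer) needs only a nesting-depth counter, i.e. a single stack symbol counted in an integer:

```python
def skip_block_comment(src, i):
    """i points just past an opening '/*'; return the index past its '*/'."""
    depth = 1
    while i < len(src) and depth > 0:
        if src.startswith("/*", i):
            depth += 1              # nested comment opens
            i += 2
        elif src.startswith("*/", i):
            depth -= 1              # one level closes
            i += 2
        else:
            i += 1                  # ordinary comment content
    if depth:
        raise SyntaxError("unterminated comment")
    return i

src = "/* outer /* inner */ still outer */ rest"
print(src[skip_block_comment(src, 2):])  # ' rest'
```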
It is true that ANTLR allows "lexical" rules to go beyond regular languages, but historically, in many other parser generator tools, and, crucially, in the lexer that Dart tools use, lexer rules are regular. So we don't want to specify the language in a way that only works with ANTLR, and hence we should only use the regular part of the expressive power of ANTLR in Dart.g.

Next, even ANTLR makes a distinction between the lexical level and the context free level: the former recognizes sequences of characters (nothing ignored within each token, but whitespace ignored between tokens). The context free level never gets to see the whitespace, so we can't make context free rules depend on whitespace (unless we use special ANTLR-specific tricks). So it does matter for the treatment of whitespace whether a given rule is lexical or context free.
Good catch! ;-) That's true, the rule about … This is a particularly simple case, however: it is sufficient for a lexer to count the number of occurrences of '/*' and '*/' …

We have a similar issue at the next level, by the way: the rule about …

However, the fact that Dart lexical analysis isn't quite covered by regular languages doesn't change the fact that we need to consider …
Have you considered e.g.:

```dart
const a = """ /* /* ${ /* } */ 0 } */ """;
```

If we just count the number of /* and */ then we will not be able to correctly recognize that there's just one comment. So we need to recognize strings to recognize comments. So I think the statement above, that counting is sufficient, can't be true because of this example.

But that's a different issue and I agree that it's not related to this issue. (I have a custom lexer for Dart that seems to be correct. I handle nested comments by pushing an extra state on the lexer's stack that I use to switch to a different DFA for parsing multi line comment content.)

I agree with most of what you said, but there are some parts that I think I disagree with. However, I'm not absolutely sure that I'm right without code to support my opinion. I'm in the process of building an incremental autogenerated parser for Dart, and I have some ideas for how to gracefully handle these issues. I will have to see how that works out.

I'm glad that this issue uncovered something. It would be fine with me if we closed it in its current state.
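For what it's worth, the string-vs-comment interaction from the example above can be handled with a small mode stack. The following sketch (hypothetical and heavily simplified: only triple-quoted strings, no escapes, no nested braces in interpolations) counts only the comments that are real, i.e. not part of string content:

```python
def count_real_comments(src):
    stack = ["code"]          # current lexing modes, innermost last
    comments = depth = i = 0  # depth = block-comment nesting level
    while i < len(src):
        mode = stack[-1]
        if mode == "code" and depth == 0 and src.startswith('"""', i):
            stack.append("string"); i += 3      # string literal opens
        elif mode == "string" and src.startswith('"""', i):
            stack.pop(); i += 3                 # string literal closes
        elif mode == "string" and src.startswith("${", i):
            stack.append("code"); i += 2        # interpolation: back to code
        elif mode == "code" and depth == 0 and src[i] == "}" and len(stack) > 1:
            stack.pop(); i += 1                 # interpolation closes
        elif mode == "code" and src.startswith("/*", i):
            depth += 1; i += 2                  # (possibly nested) comment opens
        elif mode == "code" and depth > 0 and src.startswith("*/", i):
            depth -= 1; i += 2
            if depth == 0:
                comments += 1                   # an outermost comment completed
        else:
            i += 1                              # ordinary character
    return comments

# The '/*' before the interpolation is string text; only '/* } */' inside
# the interpolation is a real comment, and its '}' does not end the
# interpolation because it is inside that comment.
print(count_real_comments('const a = """ /* /* ${ /* } */ 0 } */ """;'))  # 1
```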
The counting of …
Aha, it sounds like that could be equivalent to the recognizer that I mentioned.
It certainly revealed that the treatment of whitespace in the tools isn't identical to the treatment in the specification documents and in Dart.g. It's very useful to discover that sort of thing, so thanks for the detailed and useful input! I'll close the issue, and then we'll just create new issues if and when something else comes up. ;-)
This change introduces actions that, until implemented, serve as documentation for all the locations where whitespace is considered significant. See dart-lang#50776 for context around significant operator whitespace. See https://github.com/dart-lang/language/blob/main/accepted/3.0/records/feature-specification.md#ambiguity-with-metadata-annotations for context around significant metadata whitespace.

It seems to me that there are two ambiguities related to metadata and records: one can be observed when whitespace is present in front of the arguments and a record follows that metadatum, and one when there's no whitespace. We need to disambiguate both. We disambiguate the case where whitespace is present by requiring the arguments to be preceded by no whitespace, and we disambiguate the case where whitespace is not present by requiring the other metadata productions to have trailing whitespace.
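The whitespace-based rule described above amounts to a one-character adjacency check after the metadata name; a minimal sketch (hypothetical helper, not the actual implementation):

```python
def metadata_takes_arguments(src, name_end):
    """name_end: index just past the metadata name, e.g. past '@m'.

    '(' immediately after the name is the metadatum's argument list;
    '(' separated by whitespace starts a new expression (e.g. a record).
    """
    return name_end < len(src) and src[name_end] == "("

print(metadata_takes_arguments("@m(1, 2) void f() {}", 2))   # True: arguments
print(metadata_takes_arguments("@m (1, 2) void f() {}", 2))  # False: record follows
```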
I think these are some minor bugs in the ANTLR grammar:

sdk/tools/spec_parser/Dart.g, lines 788 to 789 in 17ccbec:

- '>' '>' '>' '=' should be a single literal, i.e. '>>>=', e.g. because … is not valid Dart code.
- '>' '>' '=' should be a single literal, i.e. '>>=', e.g. because … is not valid Dart code.

sdk/tools/spec_parser/Dart.g, lines 864 to 865 in 17ccbec:

- '>' '>' '>' should be a single literal, i.e. '>>>', e.g. because … is not valid Dart code.
- '>' '>' should be a single literal, i.e. '>>', e.g. because … is not valid Dart code.

sdk/tools/spec_parser/Dart.g, line 830 in 17ccbec:

- '>' '=' should be a single literal, i.e. '>=', e.g. because … is not valid Dart code.

sdk/tools/spec_parser/Dart.g, lines 441 to 442 in 17ccbec:

- '[' ']' should be a single literal, i.e. '[]', e.g. because … is not valid Dart code.
- '[' ']' '=' should be a single literal, i.e. '[]=', e.g. because … is not valid Dart code.

sdk/tools/spec_parser/Dart.g, line 1354 in 17ccbec:

- The topLevelDefinition in the partDeclaration should allow leading metadata, i.e. … , e.g. because … is valid Dart code.