New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed tagName tokenizer such that finding EOF within tagName will r… #46
Conversation
…eturn a text info. This will ensure that dangling unclosed tags will still show up in the DOM tree
@jmesserly @sigmundch could ya'll take a look? |
Hi, thanks for sending! this project attempts to be a spec-compliant HTML parser. HTML syntax is described here: https://html.spec.whatwg.org/multipage/syntax.html perhaps we could discuss the context where this came up and determine an alternate solution. |
@@ -570,7 +570,8 @@ class HtmlTokenizer implements Iterator<Token> { | |||
} else if (data == ">") { | |||
emitCurrentToken(); | |||
} else if (data == EOF) { | |||
_addToken(new ParseErrorToken("eof-in-tag-name")); | |||
_addToken(new CharactersToken("<" + currentTagToken.name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change is not spec compliant. this should be an error, see: https://html.spec.whatwg.org/multipage/syntax.html#tag-name-state
Thank you for the response! The context of this issue is to find a way to still access some sort of DOM node in the case that there is an unclosed at the end of an HTML document. For example, if If that isn't possible by the html spec, we can see if there's a workaround for this. |
This gets into a pretty complex landscape of IDE editors & language parsers. There is often a tension between the language spec and IDE features. Normally you can require correct syntax before a compiler will proceed further, IDEs do not have this luxury as they're dealing with incorrect code most of the time (after all, it's constantly being edited). A typical good solution is to issue an error, but have the parser attempt to recover. Unfortunately the HTML spec is particularly problematic in this respect, because it defines precise error recovery steps (that are used by browsers, and all other HTML compliant tools). That doesn't give us a lot of leeway. Also I believe you will hit this problem in many more cases. Almost anything in the tokenizer&parser that issues a parse error will presumably be a similar problem for your IDE. At some point there was consideration of using an Angular-specific HTML parser (@matanlurey -- were we chatting about this?), that would be a good place to put recovery logic. If that doesn't work we can certainly consider adding modes/options to control package:html's parsing and have its error recovery/DOM nodes work in a different way. I know, for example, there was desire for a round-trip whitespace preserving serializer that is different from the HTML spec serializer. So, I'm really super happy to have this package include more useful HTML tools. Just don't want to change the default parse behavior in a way that makes us spec incompatible. If that makes sense. Cheers :) - Jenny |
Thank you Jenny for a detailed explanation! I agree that the HTML standard compliant spec should not be broken for the sake of a language/IDE. I believe I can implement a temporary workaround solution by accessing the parser errors and looking for 'eof-in-tag-name' specific error. I'll check up on the Angular-specific HTML parser with @matanlurey and later incorporate it into our analyzer. I'll close this PR then since it's not a viable solution. |
ohh, yes if that error token helps that sounds great! feel free to add additional info to it as well if that helps. |
Will do, I'll have to do some analysis on it and re-open this ticket if necessary. |
Hey Jenny, Is the compat: quirks option also part of the spec, or is that made to do more or less what we're doing? Where errors are recovered from more gracefully? I believe we are using that, if that helps. I think Matan's parser is still a ways out, and unfortunately this blocks autocompletion in what might be the most common use case. We don't want to break the spec but if there's any precedent for going around it in the parser so far we would totally use it. |
quirks mode is specified: https://dom.spec.whatwg.org/#concept-document-quirks I would not recommend that in general, it's old IE parser recovery tricks. |
if you want to add a new parsing mode that's fine, as long as it's not the default, and someone can test/maintain/specify (at least at a high level) what that code path is supposed to do. |
Requested change to add Text node on unclosed tag names found at the very end of the document.
This is needed for https://github.com/dart-lang/angular_analyzer_plugin for autocompletion of HTML tags with user-defined selectors and exportAs.