Changed tagName tokenizer such that finding EOF within tagName will r… #46

mk13 · 2016-12-31T08:03:14Z

Requested change: add a Text node for an unclosed tag name found at the very end of the document.
This is needed for https://github.com/dart-lang/angular_analyzer_plugin for autocompletion of HTML tags with user-defined selectors and exportAs.

…eturn a text info. This will ensure that dangling unclosed tags will still show up in the DOM tree

kevmoo · 2016-12-31T22:11:30Z

@jmesserly @sigmundch could y'all take a look?

jmesserly · 2017-01-03T20:35:17Z

Hi, thanks for sending!

this project attempts to be a spec-compliant HTML parser.

HTML syntax is described here: https://html.spec.whatwg.org/multipage/syntax.html
I will look up this step for tokenization, but my guess is this change is not spec-compliant.

perhaps we could discuss the context where this came up and determine an alternate solution.

jmesserly · 2017-01-03T20:36:56Z

lib/src/tokenizer.dart

@@ -570,7 +570,8 @@ class HtmlTokenizer implements Iterator<Token> {
    } else if (data == ">") {
      emitCurrentToken();
    } else if (data == EOF) {
-      _addToken(new ParseErrorToken("eof-in-tag-name"));
+      _addToken(new CharactersToken("<" + currentTagToken.name));


this change is not spec compliant. this should be an error, see: https://html.spec.whatwg.org/multipage/syntax.html#tag-name-state

mk13 · 2017-01-03T21:53:16Z

Thank you for the response! The context of this issue is to find a way to still access some sort of DOM node when there is an unclosed tag at the end of an HTML document.

For example, suppose <di sits at the end of the HTML document, before it has been completed to its full form: <div>. The <di gets dropped entirely by the tokenizer and is never added to the DOM tree.

If the HTML spec doesn't allow that, we can see if there's a workaround for this.
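
To make the dropped <di concrete, here is a toy sketch in Dart. It is not package:html's HtmlTokenizer: toyTokenize, Tok, and Kind are hypothetical names, and the model only understands plain text and simple start tags. It only illustrates the behavior described above, where EOF inside a tag name yields an error token and no token at all for the partial tag.

```dart
// Toy model only: handles plain text and simple start tags, nothing else.
enum Kind { characters, startTag, parseError }

class Tok {
  final Kind kind;
  final String data;
  const Tok(this.kind, this.data);

  @override
  String toString() => '${kind.name}($data)';
}

/// Hypothetical simplified tokenizer, not package:html's HtmlTokenizer.
List<Tok> toyTokenize(String input) {
  final tokens = <Tok>[];
  var i = 0;
  while (i < input.length) {
    final lt = input.indexOf('<', i);
    if (lt < 0) {
      // No more tags: the rest of the input is text.
      tokens.add(Tok(Kind.characters, input.substring(i)));
      break;
    }
    if (lt > i) tokens.add(Tok(Kind.characters, input.substring(i, lt)));
    final gt = input.indexOf('>', lt);
    if (gt < 0) {
      // EOF inside the tag name: report the error and discard the partial tag.
      tokens.add(const Tok(Kind.parseError, 'eof-in-tag-name'));
      break;
    }
    tokens.add(Tok(Kind.startTag, input.substring(lt + 1, gt)));
    i = gt + 1;
  }
  return tokens;
}

void main() {
  print(toyTokenize('<div>hi<di'));
  // [startTag(div), characters(hi), parseError(eof-in-tag-name)]
}
```

Nothing in the token list carries the text di, which is why no node for it can reach the DOM tree.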

jmesserly · 2017-01-03T22:23:30Z

This gets into a pretty complex landscape of IDE editors & language parsers.

There is often a tension between the language spec and IDE features. Normally you can require correct syntax before a compiler will proceed further; IDEs do not have this luxury as they're dealing with incorrect code most of the time (after all, it's constantly being edited).

A typical good solution is to issue an error, but have the parser attempt to recover.

Unfortunately the HTML spec is particularly problematic in this respect, because it defines precise error recovery steps (that are used by browsers, and all other HTML compliant tools). That doesn't give us a lot of leeway.

Also, I believe you will hit this problem in many more cases. Almost anything in the tokenizer and parser that issues a parse error will presumably be a similar problem for your IDE.

At some point there was consideration of using an Angular-specific HTML parser (@matanlurey -- were we chatting about this?), that would be a good place to put recovery logic.

If that doesn't work we can certainly consider adding modes/options to control package:html's parsing and have its error recovery/DOM nodes work in a different way. I know, for example, there was desire for a round-trip whitespace preserving serializer that is different from the HTML spec serializer.

So, I'm really super happy to have this package include more useful HTML tools. Just don't want to change the default parse behavior in a way that makes us spec incompatible. If that makes sense.

Cheers :) - Jenny

mk13 · 2017-01-03T22:44:07Z

Thank you, Jenny, for the detailed explanation! I agree that compliance with the HTML standard should not be broken for the sake of a language/IDE. I believe I can implement a temporary workaround by accessing the parser errors and looking for the specific 'eof-in-tag-name' error.

I'll check up on the Angular-specific HTML parser with @matanlurey and later incorporate it into our analyzer. I'll close this PR then since it's not a viable solution.
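
A minimal sketch of that workaround, assuming the parser's error codes have already been collected into a list of strings (how they are gathered from the parser is left out here). danglingTagName is a hypothetical helper, not part of package:html. Since the error fires at EOF, the unfinished tag must start at the last < in the source.

```dart
/// Hypothetical helper, not part of package:html.
/// Returns the partial tag name left dangling at the end of [source],
/// or null if [errorCodes] does not contain 'eof-in-tag-name'.
String? danglingTagName(String source, Iterable<String> errorCodes) {
  if (!errorCodes.contains('eof-in-tag-name')) return null;
  // The error is raised at EOF, so the unfinished tag starts at the last '<'.
  final start = source.lastIndexOf('<');
  if (start < 0) return null;
  return source.substring(start + 1);
}

void main() {
  // Error codes as they might be collected from the parser's error list.
  print(danglingTagName('<p>hello</p><di', ['eof-in-tag-name'])); // di
  print(danglingTagName('<p>hello</p>', [])); // null
}
```

If the source ended in </di and the same error code were reported, the helper would return /di, so a caller would need to strip the leading slash.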

jmesserly · 2017-01-03T22:49:50Z

ohh, yes if that error token helps that sounds great! feel free to add additional info to it as well if that helps.

mk13 · 2017-01-03T23:01:53Z

Will do, I'll have to do some analysis on it and re-open this ticket if necessary.

MichaelRFairhurst · 2017-01-04T18:30:12Z

Hey Jenny,

Is the compat: quirks option also part of the spec, or is that made to do more or less what we're doing, i.e. recovering from errors more gracefully? I believe we are using that, if that helps.

I think Matan's parser is still a ways out, and unfortunately this blocks autocompletion in what might be the most common use case. We don't want to break the spec but if there's any precedent for going around it in the parser so far we would totally use it.

jmesserly · 2017-01-05T01:00:50Z

quirks mode is specified: https://dom.spec.whatwg.org/#concept-document-quirks

I would not recommend that in general; it's old IE parser recovery tricks.

jmesserly · 2017-01-05T01:02:09Z

if you want to add a new parsing mode that's fine, as long as it's not the default, and someone can test/maintain/specify (at least at a high level) what that code path is supposed to do.

Changed tagName tokenizer such that finding EOF within tagName will r…

b88f856

…eturn a text info. This will ensure that dangling unclosed tags will still show up in the DOM tree

googlebot added the cla: yes Google CLA signed label Dec 31, 2016

mk13 mentioned this pull request Dec 31, 2016

Issue140 autocomplete html tags refs #140 dart-archive/angular_analyzer_plugin#202

Merged

kevmoo requested review from jmesserly and sigmundch December 31, 2016 22:11

jmesserly suggested changes Jan 3, 2017

mk13 closed this Jan 3, 2017
