Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed tagName tokenizer such that finding EOF within tagName will r… #46

Closed
wants to merge 1 commit into from

Conversation

mk13
Copy link

@mk13 mk13 commented Dec 31, 2016

Requested change to add Text node on unclosed tag names found at the very end of the document.
This is needed for https://github.com/dart-lang/angular_analyzer_plugin for autocompletion of HTML tags with user-defined selectors and exportAs.

…eturn a text info. This will ensure that dangling unclosed tags will still show up in the DOM tree
@kevmoo
Copy link
Member

kevmoo commented Dec 31, 2016

@jmesserly @sigmundch could ya'll take a look?

@jmesserly
Copy link
Contributor

Hi, thanks for sending!

this project attempts to be a spec-compliant HTML parser.

HTML syntax is described here: https://html.spec.whatwg.org/multipage/syntax.html
I will look up this step for tokenization, but my guess is this change is not spec-compliant.

perhaps we could discuss the context where this came up and determine an alternate solution.

@@ -570,7 +570,8 @@ class HtmlTokenizer implements Iterator<Token> {
} else if (data == ">") {
emitCurrentToken();
} else if (data == EOF) {
_addToken(new ParseErrorToken("eof-in-tag-name"));
_addToken(new CharactersToken("<" + currentTagToken.name));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change is not spec compliant. this should be an error, see: https://html.spec.whatwg.org/multipage/syntax.html#tag-name-state

@mk13
Copy link
Author

mk13 commented Jan 3, 2017

Thank you for the response! The context of this issue is to find a way to still access some sort of DOM node in the case that there is an unclosed at the end of an HTML document.

For example, if <di is at the end of the html document before completion of its full form: <div>. The <di gets dropped entirely by the tokenizer and is never added into the complete DOM tree.

If that isn't possible by the html spec, we can see if there's a workaround for this.

@jmesserly
Copy link
Contributor

This gets into a pretty complex landscape of IDE editors & language parsers.

There is often a tension between the language spec and IDE features. Normally you can require correct syntax before a compiler will proceed further, IDEs do not have this luxury as they're dealing with incorrect code most of the time (after all, it's constantly being edited).

A typical good solution is to issue an error, but have the parser attempt to recover.

Unfortunately the HTML spec is particularly problematic in this respect, because it defines precise error recovery steps (that are used by browsers, and all other HTML compliant tools). That doesn't give us a lot of leeway.

Also I believe you will hit this problem in many more cases. Almost anything in the tokenizer&parser that issues a parse error will presumably be a similar problem for your IDE.

At some point there was consideration of using an Angular-specific HTML parser (@matanlurey -- were we chatting about this?), that would be a good place to put recovery logic.

If that doesn't work we can certainly consider adding modes/options to control package:html's parsing and have its error recovery/DOM nodes work in a different way. I know, for example, there was desire for a round-trip whitespace preserving serializer that is different from the HTML spec serializer.

So, I'm really super happy to have this package include more useful HTML tools. Just don't want to change the default parse behavior in a way that makes us spec incompatible. If that makes sense.

Cheers :) - Jenny

@mk13
Copy link
Author

mk13 commented Jan 3, 2017

Thank you Jenny for a detailed explanation! I agree that the HTML standard compliant spec should not be broken for the sake of a language/IDE. I believe I can implement a temporary workaround solution by accessing the parser errors and looking for 'eof-in-tag-name' specific error.

I'll check up on the Angular-specific HTML parser with @matanlurey and later incorporate it into our analyzer. I'll close this PR then since it's not a viable solution.

@mk13 mk13 closed this Jan 3, 2017
@jmesserly
Copy link
Contributor

ohh, yes if that error token helps that sounds great! feel free to add additional info to it as well if that helps.

@mk13
Copy link
Author

mk13 commented Jan 3, 2017

Will do, I'll have to do some analysis on it and re-open this ticket if necessary.

@MichaelRFairhurst
Copy link

Hey Jenny,

Is the compat: quirks option also part of the spec, or is that made to do more or less what we're doing? Where errors are recovered from more gracefully? I believe we are using that, if that helps.

I think Matan's parser is still a ways out, and unfortunately this blocks autocompletion in what might be the most common use case. We don't want to break the spec but if there's any precedent for going around it in the parser so far we would totally use it.

@jmesserly
Copy link
Contributor

quirks mode is specified: https://dom.spec.whatwg.org/#concept-document-quirks

I would not recommend that in general, it's old IE parser recovery tricks.

@jmesserly
Copy link
Contributor

if you want to add a new parsing mode that's fine, as long as it's not the default, and someone can test/maintain/specify (at least at a high level) what that code path is supposed to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Google CLA signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants