New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix grammar bug with XML namespaces that are also valid HTML tags and tag names that start with a valid HTML tag before a hyphen #150

Merged
merged 6 commits into from Sep 17, 2017

Conversation

Projects
None yet
2 participants
@MoritzKn
Contributor

MoritzKn commented Jan 10, 2017

Description of the Change

Using XML namespaces in HTML is generally tolerated by the grammar even though it's not valid HTML. For this reason the begin regex in line 328 contains a colon. However when using a namespace that is also a valid HTML tag, this rule won't apply and instead a rule for a HTML tag specific scope like meta.tag.inline.any.html will apply and only capture the namespace as tag name and capture the actual tag name as attribute. This PR adds a lookbehind after the regexes for specific HTML tags to make sure only
actual tag names are captured and no namespaces.
This PR also adds specs for this behavior.

Before this PR, this:

<div:foo>

would be falsely tokenized as:

  • <: text.html.basic, meta.tag.block.any.html, punctuation.definition.tag.begin.html
  • div: text.html.basic, meta.tag.block.any.html, entity.name.tag.block.any.html
  • :foo: text.html.basic, meta.tag.block.any.html, entity.other.attribute-name.html
  • >: text.html.basic, meta.tag.block.any.html, punctuation.definition.tag.end.html

Now it will be tokenized as just:

  • <: text.html.basic, meta.tag.block.any.html, punctuation.definition.tag.begin.html
  • div:foo: text.html.basic, meta.tag.other.html, entity.name.tag.other.html
  • >: text.html.basic, meta.tag.block.any.html, punctuation.definition.tag.end.html

Alternate Designs

  • None

Benefits

  • This fixes a problem with the JSP Grammar which is based on the HTML grammar. JSP allows custom tags which are written as <ns:tag> and if the namespace is for example a, this caused problems.

Possible Drawbacks

  • I am aware of none

Applicable Issues

  • I am aware of none
@MoritzKn

This comment has been minimized.

Show comment
Hide comment
@MoritzKn

MoritzKn Jan 10, 2017

Contributor

I just realized this problem also applies for hyphens and added another commit to fix that too.
I also found a issue for this: #123

Contributor

MoritzKn commented Jan 10, 2017

I just realized this problem also applies for hyphens and added another commit to fix that too.
I also found a issue for this: #123

@MoritzKn MoritzKn changed the title from Fix grammar bug with XML namespaces that are also valid HTML tags to Fix grammar bug with XML namespaces that are also valid HTML tags and tag names that start with a valid HTML tag before a hyphen Jan 10, 2017

@50Wliu 50Wliu added the needs-review label Jan 10, 2017

@50Wliu

This comment has been minimized.

Show comment
Hide comment
@50Wliu

50Wliu Jun 13, 2017

Member

If I am understanding the problem correctly, would the following change to the non-script/style rules produce the same result?

  • Replacing (?:[:-])\\b with (?=\\s|/?>)
Member

50Wliu commented Jun 13, 2017

If I am understanding the problem correctly, would the following change to the non-script/style rules produce the same result?

  • Replacing (?:[:-])\\b with (?=\\s|/?>)
@MoritzKn

This comment has been minimized.

Show comment
Hide comment
@MoritzKn

MoritzKn Jun 14, 2017

Contributor

@50Wliu Yeah, should work too. Why would you prefer this?

Contributor

MoritzKn commented Jun 14, 2017

@50Wliu Yeah, should work too. Why would you prefer this?

@50Wliu

This comment has been minimized.

Show comment
Hide comment
@50Wliu

50Wliu Jun 14, 2017

Member

We have slowly been transitioning away from using \\b to denote "end-of-word-match". It works well in text documents, but not as well in many programming languages. In addition, the positive lookahead is clearer in its purpose: only allow a space or a closing bracket. Contrast that with \\b (or the negative lookahead + \\b), which is harder to understand at first glance.

Member

50Wliu commented Jun 14, 2017

We have slowly been transitioning away from using \\b to denote "end-of-word-match". It works well in text documents, but not as well in many programming languages. In addition, the positive lookahead is clearer in its purpose: only allow a space or a closing bracket. Contrast that with \\b (or the negative lookahead + \\b), which is harder to understand at first glance.

@MoritzKn

This comment has been minimized.

Show comment
Hide comment
@MoritzKn

MoritzKn Jun 15, 2017

Contributor

Makes sense. Should I change it?

Contributor

MoritzKn commented Jun 15, 2017

Makes sense. Should I change it?

@50Wliu

This comment has been minimized.

Show comment
Hide comment
@50Wliu

50Wliu Jun 15, 2017

Member

Not quite yet - let me figure out something for the script/style tags so that you only need to push one commit.

Member

50Wliu commented Jun 15, 2017

Not quite yet - let me figure out something for the script/style tags so that you only need to push one commit.

@50Wliu

This comment has been minimized.

Show comment
Hide comment
@50Wliu

50Wliu Jun 15, 2017

Member

Okay - I think it should be safe to remove the (?![^>]*/>) in those patterns. See if you can find a problem with the following regexes:

  • (?:^\\s+)?(<)((?i:style))(?=\\s|>)
  • (?:^\\s+)?(<)((?i:script))(?=\\s+[^>]*\\btype\\s*=\\s*[\'"]?text/(x-handlebars|(x-(handlebars-)?|ng-)?template|html)[\'"]?)
  • (?:^\\s+)?(<)((?i:script))(?=\\s+[^>]*\\btype\\s*=\\s*[\'"]?text/coffeescript[\'"]?)
  • (?:^\\s+)?(<)((?i:script))(?=\\s|>)
Member

50Wliu commented Jun 15, 2017

Okay - I think it should be safe to remove the (?![^>]*/>) in those patterns. See if you can find a problem with the following regexes:

  • (?:^\\s+)?(<)((?i:style))(?=\\s|>)
  • (?:^\\s+)?(<)((?i:script))(?=\\s+[^>]*\\btype\\s*=\\s*[\'"]?text/(x-handlebars|(x-(handlebars-)?|ng-)?template|html)[\'"]?)
  • (?:^\\s+)?(<)((?i:script))(?=\\s+[^>]*\\btype\\s*=\\s*[\'"]?text/coffeescript[\'"]?)
  • (?:^\\s+)?(<)((?i:script))(?=\\s|>)
@MoritzKn

This comment has been minimized.

Show comment
Hide comment
@MoritzKn

MoritzKn Jun 21, 2017

Contributor

@50Wliu I had to alter two of your regexes to prevent the pattern from matching self closing tags. I also added some tests for that. I'm not entirely happy with my solution but it's the best i came up with.

What I changed:
Original: (?![:-])\\b(?![^>]*/>)
Your solution: (?=\\s|>)
My solution: (?:(?=\\s)(?![^>]*/>)|(?=>))

Contributor

MoritzKn commented Jun 21, 2017

@50Wliu I had to alter two of your regexes to prevent the pattern from matching self closing tags. I also added some tests for that. I'm not entirely happy with my solution but it's the best i came up with.

What I changed:
Original: (?![:-])\\b(?![^>]*/>)
Your solution: (?=\\s|>)
My solution: (?:(?=\\s)(?![^>]*/>)|(?=>))

@50Wliu

This comment has been minimized.

Show comment
Hide comment
@50Wliu

50Wliu Jun 22, 2017

Member

I also thought about self-closing tags, but decided to use the simplified pattern because neither style nor script can be self-closing. What do you think?

Member

50Wliu commented Jun 22, 2017

I also thought about self-closing tags, but decided to use the simplified pattern because neither style nor script can be self-closing. What do you think?

@MoritzKn

This comment has been minimized.

Show comment
Hide comment
@MoritzKn

MoritzKn Jun 22, 2017

Contributor

On the one hand the regex is pretty complex right now and a simpler one would be nice, on the other hand if someone still uses self-closing style or script tags, it will break everything without this check.
So because I have no clear opinion, how about just leaving it how it is right now?

Contributor

MoritzKn commented Jun 22, 2017

On the one hand the regex is pretty complex right now and a simpler one would be nice, on the other hand if someone still uses self-closing style or script tags, it will break everything without this check.
So because I have no clear opinion, how about just leaving it how it is right now?

@50Wliu

This comment has been minimized.

Show comment
Hide comment
@50Wliu

50Wliu Sep 17, 2017

Member

@MoritzKn sorry for the delayed response. I think the number of people who are using self-closing style and script tags is so low that we won't have to worry about them (in addition, why would you ever have <style/> or <script/>? What does that even do?).

Member

50Wliu commented Sep 17, 2017

@MoritzKn sorry for the delayed response. I think the number of people who are using self-closing style and script tags is so low that we won't have to worry about them (in addition, why would you ever have <style/> or <script/>? What does that even do?).

MoritzKn added some commits Jan 10, 2017

Fix grammar bug with XML namespaces that are also valid HTML tags
Using XML namespaces in HTML is generally tolerated by the grammar
even though it's not valid HTML. For this reason the `begin` regex
in line 328 contains a colon. However when using a namespace that
is also a valid HTML tag, this rule won't apply and instead a
rule for a HTML tag specific scope like `meta.tag.inline.any.html`
will apply and only capture the namespace as tag name and capture
the actual tag name as attribute. This commit adds a lookbehind
after the regexes for specific HTML tags to make sure only
actual tag names are captured and no namespaces.

This commit also contains specs for this behavior.
Fix problem with hyphens in tagnames
This fixes a problem that occurs when the tag name contains a
hyphen and the part before the hyphen is a valid html tag.
@MoritzKn

This comment has been minimized.

Show comment
Hide comment
@MoritzKn

MoritzKn Sep 17, 2017

Contributor

@50Wliu:

Sorry for the delayed response.

No problem. You're doing a great job managing all the community contributions. This sure is a whole lot of work; many thanks for that! 🥇

I think the number of people who are using self-closing style and script tags is so low that we won't have to worry about them

I updated the PR, but once I rebased I noticed that 6916c01 already fixed the issue for those tags, so I just left it like it was.

Contributor

MoritzKn commented Sep 17, 2017

@50Wliu:

Sorry for the delayed response.

No problem. You're doing a great job managing all the community contributions. This sure is a whole lot of work; many thanks for that! 🥇

I think the number of people who are using self-closing style and script tags is so low that we won't have to worry about them

I updated the PR, but once I rebased I noticed that 6916c01 already fixed the issue for those tags, so I just left it like it was.

@50Wliu 50Wliu added under-review and removed needs-review labels Sep 17, 2017

@MoritzKn

This comment has been minimized.

Show comment
Hide comment
@MoritzKn

MoritzKn Sep 17, 2017

Contributor

Good point. I changed the spec.

Contributor

MoritzKn commented Sep 17, 2017

Good point. I changed the spec.

@50Wliu

This comment has been minimized.

Show comment
Hide comment
@50Wliu

50Wliu Sep 17, 2017

Member

Thanks!

Member

50Wliu commented Sep 17, 2017

Thanks!

@50Wliu 50Wliu merged commit 3eff4ee into atom:master Sep 17, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment