Add HTML5 tags support to Javadoc antlr4 grammar and related classes #3332

Closed
baratali opened this Issue Jun 30, 2016 · 25 comments

Comments

@baratali
Contributor

baratali commented Jun 30, 2016

ATTENTION: final fix changed token type values, so all plugins/extensions who use JavadocTokenTypes must recompile. Sorry.

We should extend void(singleton) elements list and list of elements with optional end tag.

List of void elements:
basefont, frame, isindex - are not supported in HTML5; but they have to stay in Javadoc grammar.

HTML5 HTML4
area area
base base
br br
col col
hr hr
img img
input input
link link
meta meta
param param
embed
keygen
source
track
wbr
basefont
frame
isindex

List of elements with optional end tag:

HTML5 HTML4
html html
head head
body body
li li
dt dt
dd dd
p p
option option
colgroup colgroup
thead thead
tbody tbody
tfoot tfoot
tr tr
td td
th th
optgroup
rb
rt
rtc
rp

@blerner

This comment has been minimized.

Show comment
Hide comment
@blerner

blerner May 18, 2017

Out of curiosity: why hard-code the list of tags at all? Per the javadoc spec, https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javadoc.html#CHDBGGIH,

Write Comments in HTML

The text must be written in HTML with HTML entities and HTML tags. You can use whichever version of HTML your browser supports.

And per the HTML5 spec, https://www.w3.org/TR/html5/syntax.html#parsing-main-inbody, any element with an unknown name is handled uniformly by simply creating an HTMLUnknownElement (https://www.w3.org/TR/html5/syntax.html#create-an-element-for-the-token). The only restriction on such tags is that they can't be self-closing. So, rather than having hundreds of nearly-identical lines of code in JavadocParser.g4 for each of the various node cases, why not just hard-code the list of self-closing tags, and then just require that the remaining tags match up?

(The one major complication I can see in parsing Javadoc with HTML tags in it is if the Javadoc contains Foo<T> bits of Java code that weren't wrapped in {@code ...}, which makes it look like a <T> tag has been opened. This could be tricky to handle, so maybe that's why the list of tags is hardcoded?)

blerner commented May 18, 2017

Out of curiosity: why hard-code the list of tags at all? Per the javadoc spec, https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javadoc.html#CHDBGGIH,

Write Comments in HTML

The text must be written in HTML with HTML entities and HTML tags. You can use whichever version of HTML your browser supports.

And per the HTML5 spec, https://www.w3.org/TR/html5/syntax.html#parsing-main-inbody, any element with an unknown name is handled uniformly by simply creating an HTMLUnknownElement (https://www.w3.org/TR/html5/syntax.html#create-an-element-for-the-token). The only restriction on such tags is that they can't be self-closing. So, rather than having hundreds of nearly-identical lines of code in JavadocParser.g4 for each of the various node cases, why not just hard-code the list of self-closing tags, and then just require that the remaining tags match up?

(The one major complication I can see in parsing Javadoc with HTML tags in it is if the Javadoc contains Foo<T> bits of Java code that weren't wrapped in {@code ...}, which makes it look like a <T> tag has been opened. This could be tricky to handle, so maybe that's why the list of tags is hardcoded?)

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani May 18, 2017

Member

There are bunch of nuances with parsing , but our goal is not to parse but make AST that is convenient for Check creation and allow user with less code create new Check. We keep certain tokens hardcoded to create special AST token, all unknown tags will be in AST to but as general tags.

http://checkstyle.sourceforge.net/writingjavadocchecks.html#Javadoc_parser_behavior_for_current_HTML_version_and_new_HTML_version

If you are interested, read whole page, it describes some points.

Member

romani commented May 18, 2017

There are bunch of nuances with parsing , but our goal is not to parse but make AST that is convenient for Check creation and allow user with less code create new Check. We keep certain tokens hardcoded to create special AST token, all unknown tags will be in AST to but as general tags.

http://checkstyle.sourceforge.net/writingjavadocchecks.html#Javadoc_parser_behavior_for_current_HTML_version_and_new_HTML_version

If you are interested, read whole page, it describes some points.

@blerner

This comment has been minimized.

Show comment
Hide comment
@blerner

blerner May 19, 2017

Thanks for the link. Yes, parsing is always surprisingly subtle. The main reason I thought the repetitive grammar rules might be a problem was because of the example I mentioned above. A student of mine commented out a huge chunk of code containing Foo<T> text...but instead of using /* ... */, used /** ... **/, which marked the code as a Javadoc block But, parsing this comment (as part of other checkstyle rules in an auto-grader for the relevant homework assignment) took unbelievably long, and I had to force-quit the autograder. I truncated the comment to 20 lines, and parsing sped up. Some searching led me to https://groups.google.com/forum/#!topic/antlr-discussion/8uclWFreMSs and #1064. I'm using v7.7 of checkstyle, so I know I have that fix as part of the library...but I'm still getting really bad performance. But as part of reading that group post and issue report, I tried to debug what might be wrong with the grammar, leading me to this suggestion.

I can provide a simplified repro of the problem in a separate issue, if that's more useful...?

blerner commented May 19, 2017

Thanks for the link. Yes, parsing is always surprisingly subtle. The main reason I thought the repetitive grammar rules might be a problem was because of the example I mentioned above. A student of mine commented out a huge chunk of code containing Foo<T> text...but instead of using /* ... */, used /** ... **/, which marked the code as a Javadoc block But, parsing this comment (as part of other checkstyle rules in an auto-grader for the relevant homework assignment) took unbelievably long, and I had to force-quit the autograder. I truncated the comment to 20 lines, and parsing sped up. Some searching led me to https://groups.google.com/forum/#!topic/antlr-discussion/8uclWFreMSs and #1064. I'm using v7.7 of checkstyle, so I know I have that fix as part of the library...but I'm still getting really bad performance. But as part of reading that group post and issue report, I tried to debug what might be wrong with the grammar, leading me to this suggestion.

I can provide a simplified repro of the problem in a separate issue, if that's more useful...?

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach May 19, 2017

Member

commented out a huge chunk of code containing Foo text...but ... used /** ... **/, which marked the code as a Javadoc block

@blerner Your issue sounds very similar to #3727 (comment) which hung in a similar area and failed.
I never looked into where it was hanging, but is most likely a different area than the original text grammar issue.

Member

rnveach commented May 19, 2017

commented out a huge chunk of code containing Foo text...but ... used /** ... **/, which marked the code as a Javadoc block

@blerner Your issue sounds very similar to #3727 (comment) which hung in a similar area and failed.
I never looked into where it was hanging, but is most likely a different area than the original text grammar issue.

@blerner

This comment has been minimized.

Show comment
Hide comment
@blerner

blerner May 19, 2017

Yes...that does sound similar. The reduced test case I was using was basically

class DontCare {
/**
  @Test
  public void someTestMethod() {
    List<Foo> x;
    List<Foo> x;
    ...add more copies of this line to make the runtime worse...
    feefiefoefum
    feefiefoefum
    ...and add more copies of this line, too.
  }
**/
}

As far as I can tell:

  • You need at least two List<Foo> lines before the parser chokes
  • With ~20 lines of "feefiefoefum", each additional List<Foo> line adds ~2 seconds to the parse time
  • With more lines of "feefiefoefum", the runtime grows accordingly
  • If I replace every List<Foo> with List?Foo?, so that the text is identical in length, the runtime is ~ 1sec (including the startup time of my surrounding application, so, basically instantly).

blerner commented May 19, 2017

Yes...that does sound similar. The reduced test case I was using was basically

class DontCare {
/**
  @Test
  public void someTestMethod() {
    List<Foo> x;
    List<Foo> x;
    ...add more copies of this line to make the runtime worse...
    feefiefoefum
    feefiefoefum
    ...and add more copies of this line, too.
  }
**/
}

As far as I can tell:

  • You need at least two List<Foo> lines before the parser chokes
  • With ~20 lines of "feefiefoefum", each additional List<Foo> line adds ~2 seconds to the parse time
  • With more lines of "feefiefoefum", the runtime grows accordingly
  • If I replace every List<Foo> with List?Foo?, so that the text is identical in length, the runtime is ~ 1sec (including the startup time of my surrounding application, so, basically instantly).
@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani May 19, 2017

Member

Let's move this case in separate issue and do discussion in separate thread/issue

Member

romani commented May 19, 2017

Let's move this case in separate issue and do discussion in separate thread/issue

@Vladlis

This comment has been minimized.

Show comment
Hide comment
Member

Vladlis commented Jul 24, 2017

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Jul 25, 2017

Member

In html5 void tags are allowed to be self-closing, so the following rule has to be removed

We still need it to support older HTML versions as grammar will not enforce a specific HTML version, right? If so, we can't break compatibility with older versions.

Member

rnveach commented Jul 25, 2017

In html5 void tags are allowed to be self-closing, so the following rule has to be removed

We still need it to support older HTML versions as grammar will not enforce a specific HTML version, right? If so, we can't break compatibility with older versions.

@Vladlis

This comment has been minimized.

Show comment
Hide comment
@Vladlis

Vladlis Jul 25, 2017

Member

We'll only break compatibility by allowing void self-closing tags like <br/> which is a part of update to html5, as well as we are going to break it by adding new tags.

Member

Vladlis commented Jul 25, 2017

We'll only break compatibility by allowing void self-closing tags like <br/> which is a part of update to html5, as well as we are going to break it by adding new tags.

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Aug 7, 2017

Member

@rnveach , @Vladlis

We still need it to support older HTML versions as grammar will not enforce a specific HTML version, right?

we have to support HTML4, it will be in javadoc for long time, most of java engineers do not know all nuances of html5 features. So breaking compatibility is not expected for now. If it is not avoidable - need to be explained to users somewhere in xdoc and too all maintainers to confirm.

Member

romani commented Aug 7, 2017

@rnveach , @Vladlis

We still need it to support older HTML versions as grammar will not enforce a specific HTML version, right?

we have to support HTML4, it will be in javadoc for long time, most of java engineers do not know all nuances of html5 features. So breaking compatibility is not expected for now. If it is not avoidable - need to be explained to users somewhere in xdoc and too all maintainers to confirm.

@ps-sp

This comment has been minimized.

Show comment
Hide comment
@ps-sp

ps-sp Aug 7, 2017

Collaborator

@Vladlis

so the following rule has to be removed:

Please correct me if wrong, but that rule is about void elements not having an end tag, and in html 5 too void tags must not have an end tag. So it shouldn't be removed.

As this page says:

Void elements only have a start tag; end tags must not be specified for void elements.

Collaborator

ps-sp commented Aug 7, 2017

@Vladlis

so the following rule has to be removed:

Please correct me if wrong, but that rule is about void elements not having an end tag, and in html 5 too void tags must not have an end tag. So it shouldn't be removed.

As this page says:

Void elements only have a start tag; end tags must not be specified for void elements.

@Vladlis Vladlis closed this Aug 7, 2017

@Vladlis Vladlis reopened this Aug 7, 2017

@Vladlis

This comment has been minimized.

Show comment
Hide comment
@Vladlis

Vladlis Aug 7, 2017

Member

@ps-sp yes, sorry, I've misunderstood wrongSinletonTag rule

Member

Vladlis commented Aug 7, 2017

@ps-sp yes, sorry, I've misunderstood wrongSinletonTag rule

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 12, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 12, 2017

@ps-sp

This comment has been minimized.

Show comment
Hide comment
@ps-sp

ps-sp Aug 12, 2017

Collaborator

As has been discussed with @Vladlis PR shall be created for each new HTML tag to be added for better PR management.

Collaborator

ps-sp commented Aug 12, 2017

As has been discussed with @Vladlis PR shall be created for each new HTML tag to be added for better PR management.

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 13, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 13, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 13, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 13, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 14, 2017

@Vladlis Vladlis moved this from To Do to In Progress in Javadoc style coverage and parser optimization Aug 16, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 17, 2017

@romani romani added the new feature label Aug 21, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 21, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 21, 2017

Vladlis added a commit that referenced this issue Aug 21, 2017

ArneLimburg pushed a commit to ArneLimburg/checkstyle that referenced this issue Aug 23, 2017

ArneLimburg pushed a commit to ArneLimburg/checkstyle that referenced this issue Aug 23, 2017

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Jun 2, 2018

Member

@ps-sp , thanks a lot for you help.

Member

romani commented Jun 2, 2018

@ps-sp , thanks a lot for you help.

@romani romani removed the GSoC2017 label Jun 3, 2018

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Jun 5, 2018

Member

@ps-sp , please do us big favor make all this changes during this month to make only one release be with damaging for compatibility.

We appreciate your activity in PRs to finish updates, but PRs could wait a bit more, but this Issue is better to do in one release.
Thanks in advance.

Member

romani commented Jun 5, 2018

@ps-sp , please do us big favor make all this changes during this month to make only one release be with damaging for compatibility.

We appreciate your activity in PRs to finish updates, but PRs could wait a bit more, but this Issue is better to do in one release.
Thanks in advance.

@ps-sp

This comment has been minimized.

Show comment
Hide comment
@ps-sp

ps-sp Jun 5, 2018

Collaborator

@romani Will make a PR ASAP.

Collaborator

ps-sp commented Jun 5, 2018

@romani Will make a PR ASAP.

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Jun 9, 2018

Member

@ps-sp , ping.

Member

romani commented Jun 9, 2018

@ps-sp , ping.

@ps-sp

This comment has been minimized.

Show comment
Hide comment
@ps-sp

ps-sp Jun 9, 2018

Collaborator

@romani
Yeah sorry, couldn't work. Will hopefully have a PR by tomorrow.

Collaborator

ps-sp commented Jun 9, 2018

@romani
Yeah sorry, couldn't work. Will hopefully have a PR by tomorrow.

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 11, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 11, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 11, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 11, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 11, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 11, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 11, 2018

@ps-sp ps-sp referenced this issue Jun 11, 2018

Merged

Issue 3332 final #5908

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 13, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 13, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 13, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 13, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 13, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jun 14, 2018

rnveach added a commit that referenced this issue Jun 14, 2018

rnveach added a commit that referenced this issue Jun 14, 2018

rnveach added a commit that referenced this issue Jun 14, 2018

rnveach added a commit that referenced this issue Jun 14, 2018

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Jun 14, 2018

Member

Final fix was merged

Member

rnveach commented Jun 14, 2018

Final fix was merged

@rnveach rnveach closed this Jun 14, 2018

@rnveach rnveach added this to the 8.11 milestone Jun 14, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment