New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Javadoc grammar: Delete child nodes in Javadoc TEXT node #3170

Closed
baratali opened this Issue May 12, 2016 · 7 comments

Comments

Projects
None yet
3 participants
@baratali
Contributor

baratali commented May 12, 2016

TEXT node has children of CHAR and WS types. Those children are redundant and make AST more complicated. I don't see any cases when such nodes can be useful.

There are two ways to solve this:

  • edit ANTLR grammar for Javadoc comments
  • OR do post-processing of the Javadoc parse tree and clear all TEXT nodes

Example
Javadoc comment:

Represents a Javadoc tag. Provides methods to query what type of tag it is.
@author Oliver Burn

AST:

JAVADOC -> Represents a Javadoc tag. Provides methods to query what type of tag it is.\r\n@author Oliver Burn<EOF> [0:0]
|--TEXT -> Represents a Javadoc tag. Provides methods to query what type of tag it is. [0:0]
|   |--CHAR -> R [0:0]
|   |--CHAR -> e [0:1]
|   |--CHAR -> p [0:2]
|   |--CHAR -> r [0:3]
|   |--CHAR -> e [0:4]
|   |--CHAR -> s [0:5]
|   |--CHAR -> e [0:6]
|   |--CHAR -> n [0:7]
|   |--CHAR -> t [0:8]
|   |--CHAR -> s [0:9]
|   |--WS ->   [0:10]
|   |--CHAR -> a [0:11]
|   |--WS ->   [0:12]
|   |--CHAR -> J [0:13]
|   |--CHAR -> a [0:14]
|   |--CHAR -> v [0:15]
|   |--CHAR -> a [0:16]
|   |--CHAR -> d [0:17]
|   |--CHAR -> o [0:18]
|   |--CHAR -> c [0:19]
|   |--WS ->   [0:20]
|   |--CHAR -> t [0:21]
|   |--CHAR -> a [0:22]
|   |--CHAR -> g [0:23]
|   |--CHAR -> . [0:24]
|   |--WS ->   [0:25]
|   |--CHAR -> P [0:26]
|   |--CHAR -> r [0:27]
|   |--CHAR -> o [0:28]
|   |--CHAR -> v [0:29]
|   |--CHAR -> i [0:30]
|   |--CHAR -> d [0:31]
|   |--CHAR -> e [0:32]
|   |--CHAR -> s [0:33]
|   |--WS ->   [0:34]
|   |--CHAR -> m [0:35]
|   |--CHAR -> e [0:36]
|   |--CHAR -> t [0:37]
|   |--CHAR -> h [0:38]
|   |--CHAR -> o [0:39]
|   |--CHAR -> d [0:40]
|   |--CHAR -> s [0:41]
|   |--WS ->   [0:42]
|   |--CHAR -> t [0:43]
|   |--CHAR -> o [0:44]
|   |--WS ->   [0:45]
|   |--CHAR -> q [0:46]
|   |--CHAR -> u [0:47]
|   |--CHAR -> e [0:48]
|   |--CHAR -> r [0:49]
|   |--CHAR -> y [0:50]
|   |--WS ->   [0:51]
|   |--CHAR -> w [0:52]
|   |--CHAR -> h [0:53]
|   |--CHAR -> a [0:54]
|   |--CHAR -> t [0:55]
|   |--WS ->   [0:56]
|   |--CHAR -> t [0:57]
|   |--CHAR -> y [0:58]
|   |--CHAR -> p [0:59]
|   |--CHAR -> e [0:60]
|   |--WS ->   [0:61]
|   |--CHAR -> o [0:62]
|   |--CHAR -> f [0:63]
|   |--WS ->   [0:64]
|   |--CHAR -> t [0:65]
|   |--CHAR -> a [0:66]
|   |--CHAR -> g [0:67]
|   |--WS ->   [0:68]
|   |--CHAR -> i [0:69]
|   |--CHAR -> t [0:70]
|   |--WS ->   [0:71]
|   |--CHAR -> i [0:72]
|   |--CHAR -> s [0:73]
|   `--CHAR -> . [0:74]
|--NEWLINE -> \r\n [0:75]
|--JAVADOC_TAG -> @author Oliver Burn [1:0]
|   |--AUTHOR_LITERAL -> @author [1:0]
|   |--WS ->   [1:7]
|   `--DESCRIPTION -> Oliver Burn [1:8]
|       `--TEXT -> Oliver Burn [1:8]
|           |--CHAR -> O [1:8]
|           |--CHAR -> l [1:9]
|           |--CHAR -> i [1:10]
|           |--CHAR -> v [1:11]
|           |--CHAR -> e [1:12]
|           |--CHAR -> r [1:13]
|           |--WS ->   [1:14]
|           |--CHAR -> B [1:15]
|           |--CHAR -> u [1:16]
|           |--CHAR -> r [1:17]
|           `--CHAR -> n [1:18]
`--EOF -> <EOF> [1:19]

@romani romani changed the title from Delete child nodes in Javadoc TEXT node to Javadoc grammar: Delete child nodes in Javadoc TEXT node May 14, 2016

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Nov 12, 2016

Member

I don't think javadoc antlr can be changed easily. Using skip or the hidden channel causes TEXT to lose that text, the combinations of CHAR. We use CHAR in alot of places in the grammar to just remove the token and embed it in TEXT.

Clearing the children of TEXT during the token conversion seems the easiest, but this will probably make CHAR an unreachable and non-existant token.

The following checks fail with the proposed change: JavadocParagraphCheck, JavadocTagContinuationIndentationCheck, SummaryJavadocCheck

@romani
We don't use CHAR directly in these checks, but we count the children of TEXT.
Do you know what this counting should accomplish?
My current thought right now is it is looking for TEXT with just WS (whitespace), but I just wanted to verify.
Edit: And since we combine multiple WS together, this means it can be more than just one space.

Member

rnveach commented Nov 12, 2016

I don't think javadoc antlr can be changed easily. Using skip or the hidden channel causes TEXT to lose that text, the combinations of CHAR. We use CHAR in alot of places in the grammar to just remove the token and embed it in TEXT.

Clearing the children of TEXT during the token conversion seems the easiest, but this will probably make CHAR an unreachable and non-existant token.

The following checks fail with the proposed change: JavadocParagraphCheck, JavadocTagContinuationIndentationCheck, SummaryJavadocCheck

@romani
We don't use CHAR directly in these checks, but we count the children of TEXT.
Do you know what this counting should accomplish?
My current thought right now is it is looking for TEXT with just WS (whitespace), but I just wanted to verify.
Edit: And since we combine multiple WS together, this means it can be more than just one space.

@rnveach rnveach self-assigned this Nov 12, 2016

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Nov 12, 2016

Member

Do you know what this counting should accomplish?

This just check that there is a some content, not very smart but it works, i hope.

I don't think javadoc antlr can be changed easily. Using skip or the hidden channel causes TEXT to lose that text, the combinations of CHAR

Java grammar just do not create token for this, it might reasonable to look at java grammar or ask a question at antlr forum

Member

romani commented Nov 12, 2016

Do you know what this counting should accomplish?

This just check that there is a some content, not very smart but it works, i hope.

I don't think javadoc antlr can be changed easily. Using skip or the hidden channel causes TEXT to lose that text, the combinations of CHAR

Java grammar just do not create token for this, it might reasonable to look at java grammar or ask a question at antlr forum

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Nov 13, 2016

Member

Java grammar just do not create token for this

Grammars aren't the same. Antlr4 is different, it doesn't support AST anymore. This is why we convert it in JavadocDetailNodeParser.
Looking online has said to use custom visitors, which is basically what we are doing. http://stackoverflow.com/a/15185289
We can still ask if you wish.

Member

rnveach commented Nov 13, 2016

Java grammar just do not create token for this

Grammars aren't the same. Antlr4 is different, it doesn't support AST anymore. This is why we convert it in JavadocDetailNodeParser.
Looking online has said to use custom visitors, which is basically what we are doing. http://stackoverflow.com/a/15185289
We can still ask if you wish.

rnveach added a commit to rnveach/checkstyle that referenced this issue Nov 13, 2016

rnveach added a commit to rnveach/checkstyle that referenced this issue Nov 13, 2016

rnveach added a commit to rnveach/checkstyle that referenced this issue Nov 13, 2016

rnveach added a commit to rnveach/checkstyle that referenced this issue Nov 13, 2016

rnveach added a commit to rnveach/checkstyle that referenced this issue Nov 13, 2016

rnveach added a commit to rnveach/checkstyle that referenced this issue Nov 13, 2016

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Nov 15, 2016

Member

@rnveach ,

We do not split attribute values (ATTR_VALUE) into CHAR+WS.
Can we do the same for javadoc text ?

The reason why i think we should look at approach of grammar changes is that we waste a execution time to wrap all chars in separate object. But both approaches are not public to user so post-processing approach is ok as fall-back.

/var/tmp$ cat Filter.java 
/**
 * An interface  <i htr="value" />.
 */
public interface Filter {
}

/var/tmp$ java -jar checkstyle-7.2-SNAPSHOT-all.jar -J Filter.java
INTERFACE_DEF -> INTERFACE_DEF [4:0]
|--MODIFIERS -> MODIFIERS [4:0]
|   |--BLOCK_COMMENT_BEGIN -> /* [1:0]
|   |   |--COMMENT_CONTENT -> *\n * An interface  <i htr="value" />.\n  [1:2]
|   |   |   `--JAVADOC -> \n * An interface  <i htr="value" />.\n <EOF> [1:0]
|   |   |       |--NEWLINE -> \n [1:0]
|   |   |       |--LEADING_ASTERISK ->  * [2:0]
|   |   |       |--TEXT ->  An interface   [2:2]
|   |   |       |   |--WS ->   [2:2]
|   |   |       |   |--CHAR -> A [2:3]
|   |   |       |   |--CHAR -> n [2:4]
|   |   |       |   |--WS ->   [2:5]
|   |   |       |   |--CHAR -> i [2:6]
|   |   |       |   |--CHAR -> n [2:7]
|   |   |       |   |--CHAR -> t [2:8]
|   |   |       |   |--CHAR -> e [2:9]
|   |   |       |   |--CHAR -> r [2:10]
|   |   |       |   |--CHAR -> f [2:11]
|   |   |       |   |--CHAR -> a [2:12]
|   |   |       |   |--CHAR -> c [2:13]
|   |   |       |   |--CHAR -> e [2:14]
|   |   |       |   `--WS ->    [2:15]
|   |   |       |--HTML_ELEMENT -> <i htr="value" /> [2:17]
|   |   |       |   `--SINGLETON_ELEMENT -> <i htr="value" /> [2:17]
|   |   |       |       `--SINGLETON_TAG -> <i htr="value" /> [2:17]
|   |   |       |           |--OPEN -> < [2:17]
|   |   |       |           |--HTML_TAG_NAME -> i [2:18]
|   |   |       |           |--WS ->   [2:19]
|   |   |       |           |--ATTRIBUTE -> htr="value" [2:20]
|   |   |       |           |   |--HTML_TAG_NAME -> htr [2:20]
|   |   |       |           |   |--EQUALS -> = [2:23]
|   |   |       |           |   `--ATTR_VALUE -> "value" [2:24]
|   |   |       |           |--WS ->   [2:32]
|   |   |       |           `--SLASH_CLOSE -> /> [2:33]
|   |   |       |--TEXT -> . [2:35]
|   |   |       |   `--CHAR -> . [2:35]
|   |   |       |--NEWLINE -> \n [2:36]
|   |   |       |--TEXT ->   [3:0]
|   |   |       |   `--WS ->   [3:0]
|   |   |       `--EOF -> <EOF> [3:1]
|   |   `--BLOCK_COMMENT_END -> */ [3:1]
|   `--LITERAL_PUBLIC -> public [4:0]
|--LITERAL_INTERFACE -> interface [4:7]
|--IDENT -> Filter [4:17]
`--OBJBLOCK -> OBJBLOCK [4:24]
    |--LCURLY -> { [4:24]
    `--RCURLY -> } [5:0]

Some links:
https://github.com/checkstyle/checkstyle/blob/master/src/main/resources/com/puppycrawl/tools/checkstyle/grammars/javadoc/JavadocLexer.g4#L380

Member

romani commented Nov 15, 2016

@rnveach ,

We do not split attribute values (ATTR_VALUE) into CHAR+WS.
Can we do the same for javadoc text ?

The reason why i think we should look at approach of grammar changes is that we waste a execution time to wrap all chars in separate object. But both approaches are not public to user so post-processing approach is ok as fall-back.

/var/tmp$ cat Filter.java 
/**
 * An interface  <i htr="value" />.
 */
public interface Filter {
}

/var/tmp$ java -jar checkstyle-7.2-SNAPSHOT-all.jar -J Filter.java
INTERFACE_DEF -> INTERFACE_DEF [4:0]
|--MODIFIERS -> MODIFIERS [4:0]
|   |--BLOCK_COMMENT_BEGIN -> /* [1:0]
|   |   |--COMMENT_CONTENT -> *\n * An interface  <i htr="value" />.\n  [1:2]
|   |   |   `--JAVADOC -> \n * An interface  <i htr="value" />.\n <EOF> [1:0]
|   |   |       |--NEWLINE -> \n [1:0]
|   |   |       |--LEADING_ASTERISK ->  * [2:0]
|   |   |       |--TEXT ->  An interface   [2:2]
|   |   |       |   |--WS ->   [2:2]
|   |   |       |   |--CHAR -> A [2:3]
|   |   |       |   |--CHAR -> n [2:4]
|   |   |       |   |--WS ->   [2:5]
|   |   |       |   |--CHAR -> i [2:6]
|   |   |       |   |--CHAR -> n [2:7]
|   |   |       |   |--CHAR -> t [2:8]
|   |   |       |   |--CHAR -> e [2:9]
|   |   |       |   |--CHAR -> r [2:10]
|   |   |       |   |--CHAR -> f [2:11]
|   |   |       |   |--CHAR -> a [2:12]
|   |   |       |   |--CHAR -> c [2:13]
|   |   |       |   |--CHAR -> e [2:14]
|   |   |       |   `--WS ->    [2:15]
|   |   |       |--HTML_ELEMENT -> <i htr="value" /> [2:17]
|   |   |       |   `--SINGLETON_ELEMENT -> <i htr="value" /> [2:17]
|   |   |       |       `--SINGLETON_TAG -> <i htr="value" /> [2:17]
|   |   |       |           |--OPEN -> < [2:17]
|   |   |       |           |--HTML_TAG_NAME -> i [2:18]
|   |   |       |           |--WS ->   [2:19]
|   |   |       |           |--ATTRIBUTE -> htr="value" [2:20]
|   |   |       |           |   |--HTML_TAG_NAME -> htr [2:20]
|   |   |       |           |   |--EQUALS -> = [2:23]
|   |   |       |           |   `--ATTR_VALUE -> "value" [2:24]
|   |   |       |           |--WS ->   [2:32]
|   |   |       |           `--SLASH_CLOSE -> /> [2:33]
|   |   |       |--TEXT -> . [2:35]
|   |   |       |   `--CHAR -> . [2:35]
|   |   |       |--NEWLINE -> \n [2:36]
|   |   |       |--TEXT ->   [3:0]
|   |   |       |   `--WS ->   [3:0]
|   |   |       `--EOF -> <EOF> [3:1]
|   |   `--BLOCK_COMMENT_END -> */ [3:1]
|   `--LITERAL_PUBLIC -> public [4:0]
|--LITERAL_INTERFACE -> interface [4:7]
|--IDENT -> Filter [4:17]
`--OBJBLOCK -> OBJBLOCK [4:24]
    |--LCURLY -> { [4:24]
    `--RCURLY -> } [5:0]

Some links:
https://github.com/checkstyle/checkstyle/blob/master/src/main/resources/com/puppycrawl/tools/checkstyle/grammars/javadoc/JavadocLexer.g4#L380

@baratali

This comment has been minimized.

Show comment
Hide comment
@baratali

baratali Nov 16, 2016

Contributor

We do not split attribute values (ATTR_VALUE) into CHAR+WS.
Can we do the same for javadoc text ?

I think we cannot because ATTR_VALUE and TEXT are different entities.
ATTR_VALUE is a lexer rule (or token) and is defined as a regexp (sequence of symbols)
TEXT is a parser rule and is defined as a sequence of tokens.
ANTLR4 uses different types/interfaces in ParseTree for lexer and parser rules. Parser rule object doesn't even have its own input text (http://www.antlr.org/api/Java/org/antlr/v4/runtime/RuleContext.html#getText()).

We also cannot skip CHAR and WS tokens because we lose input text which is covered by them.

Maybe we can do something with inline java code to change TEXT parser rule content (https://github.com/checkstyle/checkstyle/blob/master/src/main/resources/com/puppycrawl/tools/checkstyle/grammars/javadoc/JavadocParser.g4#L919). But, as I know, ANTLR4 doesn't provide standard ways to change parseTree structure in grammar, so that inline java code can be a little tricky.

Contributor

baratali commented Nov 16, 2016

We do not split attribute values (ATTR_VALUE) into CHAR+WS.
Can we do the same for javadoc text ?

I think we cannot because ATTR_VALUE and TEXT are different entities.
ATTR_VALUE is a lexer rule (or token) and is defined as a regexp (sequence of symbols)
TEXT is a parser rule and is defined as a sequence of tokens.
ANTLR4 uses different types/interfaces in ParseTree for lexer and parser rules. Parser rule object doesn't even have its own input text (http://www.antlr.org/api/Java/org/antlr/v4/runtime/RuleContext.html#getText()).

We also cannot skip CHAR and WS tokens because we lose input text which is covered by them.

Maybe we can do something with inline java code to change TEXT parser rule content (https://github.com/checkstyle/checkstyle/blob/master/src/main/resources/com/puppycrawl/tools/checkstyle/grammars/javadoc/JavadocParser.g4#L919). But, as I know, ANTLR4 doesn't provide standard ways to change parseTree structure in grammar, so that inline java code can be a little tricky.

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Nov 18, 2016

Member

@baratali , thanks a lot.

Member

romani commented Nov 18, 2016

@baratali , thanks a lot.

rnveach added a commit to rnveach/checkstyle that referenced this issue Nov 18, 2016

rnveach added a commit to rnveach/checkstyle that referenced this issue Nov 18, 2016

rnveach added a commit to rnveach/checkstyle that referenced this issue Nov 18, 2016

romani added a commit that referenced this issue Nov 19, 2016

@romani romani added this to the 7.3 milestone Nov 19, 2016

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Nov 19, 2016

Member

fix is merged.

Member

romani commented Nov 19, 2016

fix is merged.

@romani romani closed this Nov 19, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment