New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JavadocParser: inconsistent AST tree with and without SINGLETON_ELEMENT #3810

Closed
romani opened this Issue Feb 8, 2017 · 6 comments

Comments

@romani
Member

romani commented Feb 8, 2017

from #3570.

$ cat issue.javadoc 
 * <p>
 * Test 1
 * <br/>
 * Test 2
 * </p>

$ java -jar checkstyle-7.4-all.jar -j issue.javadoc
JAVADOC ->  * <p>\n * Test 1\n * <br/>\n * Test 2\n * </p>\n<EOF> [0:0]
|--LEADING_ASTERISK ->  * [0:0]
|--TEXT ->   [0:2]
|--HTML_ELEMENT -> <p> [0:3]
|   `--P_TAG_OPEN -> <p> [0:3]
|       |--OPEN -> < [0:3]
|       |--P_HTML_TAG_NAME -> p [0:4]
|       `--CLOSE -> > [0:5]
|--NEWLINE -> \n [0:6]
|--LEADING_ASTERISK ->  * [1:0]
|--TEXT ->  Test 1 [1:2]
|--NEWLINE -> \n [1:9]
|--LEADING_ASTERISK ->  * [2:0]
|--TEXT ->   [2:2]
|--HTML_ELEMENT -> <br/> [2:3]
|   `--SINGLETON_ELEMENT -> <br/> [2:3]
|       `--BR_TAG -> <br/> [2:3]
|           |--OPEN -> < [2:3]
|           |--BR_HTML_TAG_NAME -> br [2:4]
|           `--SLASH_CLOSE -> /> [2:6]
|--NEWLINE -> \n [2:8]
|--LEADING_ASTERISK ->  * [3:0]
|--TEXT ->  Test 2 [3:2]
|--NEWLINE -> \n [3:9]
|--LEADING_ASTERISK ->  * [4:0]
|--TEXT ->   [4:2]
|--HTML_ELEMENT -> </p> [4:3]
|   `--P_TAG_CLOSE -> </p> [4:3]
|       |--OPEN -> < [4:3]
|       |--SLASH -> / [4:4]
|       |--P_HTML_TAG_NAME -> p [4:5]
|       `--CLOSE -> > [4:6]
|--NEWLINE -> \n [4:7]
`--EOF -> <EOF> [5:0]

$ cat issue.javadoc 
 * <p>
 * Test 1
 * Test 2
 * </p>

$ java -jar checkstyle-7.4-all.jar -j issue.javadoc
JAVADOC ->  * <p>\n * Test 1\n * Test 2\n * </p>\n<EOF> [0:0]
|--LEADING_ASTERISK ->  * [0:0]
|--TEXT ->   [0:2]
|--HTML_ELEMENT -> <p>\n * Test 1\n * Test 2\n * </p> [0:3]
|   `--PARAGRAPH -> <p>\n * Test 1\n * Test 2\n * </p> [0:3]
|       |--P_TAG_OPEN -> <p> [0:3]
|       |   |--OPEN -> < [0:3]
|       |   |--P_HTML_TAG_NAME -> p [0:4]
|       |   `--CLOSE -> > [0:5]
|       |--NEWLINE -> \n [0:6]
|       |--LEADING_ASTERISK ->  * [1:0]
|       |--TEXT ->  Test 1 [1:2]
|       |--NEWLINE -> \n [1:9]
|       |--LEADING_ASTERISK ->  * [2:0]
|       |--TEXT ->  Test 2 [2:2]
|       |--NEWLINE -> \n [2:9]
|       |--LEADING_ASTERISK ->  * [3:0]
|       |--TEXT ->   [3:2]
|       `--P_TAG_CLOSE -> </p> [3:3]
|           |--OPEN -> < [3:3]
|           |--SLASH -> / [3:4]
|           |--P_HTML_TAG_NAME -> p [3:5]
|           `--CLOSE -> > [3:6]
|--NEWLINE -> \n [3:7]
`--EOF -> <EOF> [4:0]

if we add tag that has open and closed tag.

$ cat issue.javadoc 
 * <p>
 * Test 1
 * <font>smth</font>
 * Test 2
 * </p>

$ java -jar checkstyle-7.4-all.jar -j issue.javadoc
JAVADOC ->  * <p>\n * Test 1\n * <font>smth</font>\n * Test 2\n * </p>\n<EOF> [0:0]
|--LEADING_ASTERISK ->  * [0:0]
|--TEXT ->   [0:2]
|--HTML_ELEMENT -> <p>\n * Test 1\n * <font>smth</font>\n * Test 2\n * </p> [0:3]
|   `--PARAGRAPH -> <p>\n * Test 1\n * <font>smth</font>\n * Test 2\n * </p> [0:3]
|       |--P_TAG_OPEN -> <p> [0:3]
|       |   |--OPEN -> < [0:3]
|       |   |--P_HTML_TAG_NAME -> p [0:4]
|       |   `--CLOSE -> > [0:5]
|       |--NEWLINE -> \n [0:6]
|       |--LEADING_ASTERISK ->  * [1:0]
|       |--TEXT ->  Test 1 [1:2]
|       |--NEWLINE -> \n [1:9]
|       |--LEADING_ASTERISK ->  * [2:0]
|       |--TEXT ->   [2:2]
|       |--HTML_TAG -> <font>smth</font> [2:3]
|       |   |--HTML_ELEMENT_OPEN -> <font> [2:3]
|       |   |   |--OPEN -> < [2:3]
|       |   |   |--HTML_TAG_NAME -> font [2:4]
|       |   |   `--CLOSE -> > [2:8]
|       |   |--TEXT -> smth [2:9]
|       |   `--HTML_ELEMENT_CLOSE -> </font> [2:13]
|       |       |--OPEN -> < [2:13]
|       |       |--SLASH -> / [2:14]
|       |       |--HTML_TAG_NAME -> font [2:15]
|       |       `--CLOSE -> > [2:19]
|       |--NEWLINE -> \n [2:20]
|       |--LEADING_ASTERISK ->  * [3:0]
|       |--TEXT ->  Test 2 [3:2]
|       |--NEWLINE -> \n [3:9]
|       |--LEADING_ASTERISK ->  * [4:0]
|       |--TEXT ->   [4:2]
|       `--P_TAG_CLOSE -> </p> [4:3]
|           |--OPEN -> < [4:3]
|           |--SLASH -> / [4:4]
|           |--P_HTML_TAG_NAME -> p [4:5]
|           `--CLOSE -> > [4:6]
|--NEWLINE -> \n [4:7]
`--EOF -> <EOF> [5:0]

Looks like problem is related only to existence of SINGLETON_ELEMENT.
It is a bug in parser or there should be good explanation of this behavior and all checks need to be aware of it.

ATTENTION (update after merge of fix):
Breaking compatibility changes are due to changes in AST structure.

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Feb 9, 2017

Member

I will look into this issue to try and figure out why it is doing this.

@romani This could be why our Javadoc parser is slower than Java.
In Java, when we see a token lets say {, we instantly know what it is. All cases will have a list of statements after it followed by an end }. There is no exceptions or context needed.
In Javadoc, when we see almost any tag, <p> for now, we can't tell if it is a PARAGRAPH token or a HTML_ELEMENT until we almost literally finish parsing the entire document. We need to know if there is a </p> tag or if it remains open. We can't assume the first </p> is what we want because we could be nesting these tags in themselves and we want to maintain hierarchy. It probably has to keep retrying different methods until it gets the one that matches the entire document.
This should also explain the hang in #3727 as it hangs and fails on looking for an end tag (#3727 (comment)) because the document is so huge. Remember it has to almost go through the entire document multiple times looking for different patterns before it succeeds and only fails if all patterns fail.

I'm thinking if we look at the number of adaptivePredict calls or how long it stays in one, we will see this is where all the time is being spent.

Member

rnveach commented Feb 9, 2017

I will look into this issue to try and figure out why it is doing this.

@romani This could be why our Javadoc parser is slower than Java.
In Java, when we see a token lets say {, we instantly know what it is. All cases will have a list of statements after it followed by an end }. There is no exceptions or context needed.
In Javadoc, when we see almost any tag, <p> for now, we can't tell if it is a PARAGRAPH token or a HTML_ELEMENT until we almost literally finish parsing the entire document. We need to know if there is a </p> tag or if it remains open. We can't assume the first </p> is what we want because we could be nesting these tags in themselves and we want to maintain hierarchy. It probably has to keep retrying different methods until it gets the one that matches the entire document.
This should also explain the hang in #3727 as it hangs and fails on looking for an end tag (#3727 (comment)) because the document is so huge. Remember it has to almost go through the entire document multiple times looking for different patterns before it succeeds and only fails if all patterns fail.

I'm thinking if we look at the number of adaptivePredict calls or how long it stays in one, we will see this is where all the time is being spent.

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Feb 9, 2017

Member

if we can not make behavior consistent, we need to drop support of whole tag recognition (should be discussed and explained in detail).
Documentation HAS to be updated to explain nuances of result AST, in other case implementation of javadoc Check become nightmare.

I restored "Known API issues" at http://checkstyle.sourceforge.net/writingjavadocchecks.html

Member

romani commented Feb 9, 2017

if we can not make behavior consistent, we need to drop support of whole tag recognition (should be discussed and explained in detail).
Documentation HAS to be updated to explain nuances of result AST, in other case implementation of javadoc Check become nightmare.

I restored "Known API issues" at http://checkstyle.sourceforge.net/writingjavadocchecks.html

romani added a commit that referenced this issue Feb 9, 2017

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Feb 9, 2017

Member

When forcing grammar into paragraph token, this exception comes up: IllegalArgumentException: [ERROR:0] Javadoc comment at column XX has parse error. Details: no viable alternative at input '<br' while parsing PARAGRAPH

So it is acting like paragraph doesn't have brTag recognition.
brTag is only defined in singletonElement, and singletonElement is only defined in htmlElement and that goes directly back to javadoc.
So it is true, paragraph definition currently says br can't be inside it. This is why it drops 1 paragraph for 2 htmlElements.

It looks like we can fix current issue by adding brTag to paragraph, but we could have similar issues in other tags because who knows if we missed a child tag somewhere else.

Member

rnveach commented Feb 9, 2017

When forcing grammar into paragraph token, this exception comes up: IllegalArgumentException: [ERROR:0] Javadoc comment at column XX has parse error. Details: no viable alternative at input '<br' while parsing PARAGRAPH

So it is acting like paragraph doesn't have brTag recognition.
brTag is only defined in singletonElement, and singletonElement is only defined in htmlElement and that goes directly back to javadoc.
So it is true, paragraph definition currently says br can't be inside it. This is why it drops 1 paragraph for 2 htmlElements.

It looks like we can fix current issue by adding brTag to paragraph, but we could have similar issues in other tags because who knows if we missed a child tag somewhere else.

bamapookie added a commit to bamapookie/checkstyle that referenced this issue Feb 10, 2017

@rnveach rnveach added the easy label Mar 2, 2017

@romani romani removed the easy label Mar 2, 2017

@Vladlis Vladlis added the GSoC2017 label May 10, 2017

@Vladlis Vladlis changed the title from JvadocParser: inconsistent AST tree with and without SINGLETON_ELEMENT to JavadocParser: inconsistent AST tree with and without SINGLETON_ELEMENT May 15, 2017

@Vladlis Vladlis added the javadoc label May 16, 2017

@Vladlis Vladlis moved this from To Do to In Progress in Javadoc style coverage and parser optimization Jul 6, 2017

@Vladlis Vladlis moved this from To Do to In Progress in Javadoc style coverage and parser optimization Jul 10, 2017

@ps-sp

This comment has been minimized.

Show comment
Hide comment
@ps-sp

ps-sp Jul 20, 2017

Collaborator

@rnveach

When forcing grammar into paragraph...

I am unable to decipher in what way exactly did you force grammar into paragraph.

t looks like we can fix current issue by adding brTag to paragraph...

Shouldn't we add singletonElement in paragraph rather. Replace singletonTag in all nested elements with singletonElement

Collaborator

ps-sp commented Jul 20, 2017

@rnveach

When forcing grammar into paragraph...

I am unable to decipher in what way exactly did you force grammar into paragraph.

t looks like we can fix current issue by adding brTag to paragraph...

Shouldn't we add singletonElement in paragraph rather. Replace singletonTag in all nested elements with singletonElement

@rnveach

This comment has been minimized.

Show comment
Hide comment
@rnveach

rnveach Jul 20, 2017

Member

I am unable to decipher in what way exactly did you force grammar into paragraph.

I don't remember exactly what I did, but I modified Java's memory at runtime as I think it was never getting to paragraph tag because adaptivePredict was not letting it in.

Shouldn't we add

I will leave fix to you.

Member

rnveach commented Jul 20, 2017

I am unable to decipher in what way exactly did you force grammar into paragraph.

I don't remember exactly what I did, but I modified Java's memory at runtime as I think it was never getting to paragraph tag because adaptivePredict was not letting it in.

Shouldn't we add

I will leave fix to you.

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jul 20, 2017

@Vladlis Vladlis moved this from In Progress to Review in Javadoc style coverage and parser optimization Jul 21, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jul 26, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jul 26, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Jul 26, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 9, 2017

ps-sp added a commit to ps-sp/checkstyle that referenced this issue Aug 9, 2017

romani added a commit that referenced this issue Aug 21, 2017

@romani romani added this to the 8.2 milestone Aug 21, 2017

@romani

This comment has been minimized.

Show comment
Hide comment
@romani

romani Aug 21, 2017

Member

fix is merged.

Member

romani commented Aug 21, 2017

fix is merged.

@romani romani closed this Aug 21, 2017

ArneLimburg pushed a commit to ArneLimburg/checkstyle that referenced this issue Aug 23, 2017

ArneLimburg pushed a commit to ArneLimburg/checkstyle that referenced this issue Aug 23, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment