Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML: Issue with spacing around <programlisting> #8590

Closed
Stewori opened this issue Jun 8, 2021 · 22 comments
Closed

XML: Issue with spacing around <programlisting> #8590

Stewori opened this issue Jun 8, 2021 · 22 comments
Labels

Comments

@Stewori
Copy link

Stewori commented Jun 8, 2021

Describe the bug
Processing Java code

/**
 * Tests if blah with a {@code Path}. On blah
 * &quot;{@code foo/bar}&quot; ends with &quot;{@code foo/bar}&quot; and &quot;{@code bar}&quot;. It does
 * not end with &quot;{@code r}&quot; or &quot;{@code /bar}&quot;. Note that blah and so the {@code
 * Path}&quot;{@code foo/bar}&quot; with the {@code String} &quot;{@code bar/}&quot; returns
 * {@code true}.
 */
public class Test {}

results in the following xml snippet:

    <detaileddescription>
<para>Tests if blah with a<programlisting><codeline><highlight class="normal">Path<sp/></highlight></codeline>
</programlisting> . On blah "<programlisting><codeline><highlight class="normal">foo/bar<sp/></highlight></codeline>
</programlisting> " ends with "<programlisting><codeline><highlight class="normal">foo/bar<sp/></highlight></codeline>
</programlisting> " and "<programlisting><codeline><highlight class="normal">bar<sp/></highlight></codeline>
</programlisting> ". It does not end with "<programlisting><codeline><highlight class="normal">r<sp/></highlight></codeline>
</programlisting> " or "<programlisting><codeline><highlight class="normal">/bar<sp/></highlight></codeline>
</programlisting> ". Note that blah and so the<programlisting><codeline><highlight class="normal">Path<sp/></highlight></codeline>
</programlisting> "<programlisting><codeline><highlight class="normal">foo/bar<sp/></highlight></codeline>
</programlisting> " with the<programlisting><codeline><highlight class="normal">String<sp/></highlight></codeline>
</programlisting> "<programlisting><codeline><highlight class="normal">bar/<sp/></highlight></codeline>
</programlisting> " returns <programlisting><codeline><highlight class="keyword">true</highlight><highlight class="normal"><sp/></highlight></codeline>
</programlisting> . </para>
    </detaileddescription>

Observe that there is never put a space before <programlisting>, regardless of whether there originally was a space in the doc comment. In similar manner, there is always put a space after </programlisting>, regardless of whether there originally was a space in the doc comment. This requires tedious postprocessing to avoid sloppy-looking results. To demonstrate this, consider we render the xml block completely as plain text (<sp/> is turned into space). One would get

Tests if blah with aPath  . On blah "foo/bar  " ends with "foo/bar  " and "bar  ". It does not end with "r  " or "/bar  ". Note that blah and so thePath  "foo/bar  " with theString  "bar/  " returns true  . 

Note that this is not due to handling of <sp/>. Leaving them in place would yield

Tests if blah with aPath<sp/> . On blah "foo/bar<sp/> " ends with "foo/bar<sp/> " and "bar<sp/> ". It does not end with "r<sp/> " or "/bar<sp/> ". Note that blah and so thePath<sp/> "foo/bar<sp/> " with theString<sp/> "bar/<sp/> " returns true<sp/> . 

(I noticed there is also a synthetic space before </para> but that one is easy to deal with.)

Expected behavior
I expect that the original space characters before and after <(/)programlisting> are preserved. No original spaces shall be omitted, no additional "synthetic" spaces shall be introduced.

To Reproduce
Find a Java file and doxygen config attached. The issue is demonstrated like mentioned above.
After running doxygen on the example, the files in the subfolder xml contain the snippet shown above.
issue_programlisting_spacing.zip

Version
Tested with doxygen clone from 2021-05-26:

/blah/doxygen/build/bin/doxygen -v
1.9.2 (da0a6b42c349bc06b2804a4b0e8f55c7690f5cf9*)

lsb_release -a:

LSB Version:	core-9.20160110ubuntu0.2-amd64:core-9.20160110ubuntu0.2-noarch:security-9.20160110ubuntu0.2-amd64:security-9.20160110ubuntu0.2-noarch
Distributor ID:	LinuxMint
Description:	Linux Mint 18 Sarah
Release:	18
Codename:	sarah

Additional context
Consider again the plaintext rendering:

Tests if blah with aPath  . On blah "foo/bar  " ends with "foo/bar  " and "bar  ". It does not end with "r  " or "/bar  ". Note that blah and so thePath  "foo/bar  " with theString  "bar/  " returns true  . 

In most cases one can repair the spacing based on heuristics. However look at Path "foo/bar. There is no way to tell whether it is originally Path" foo/bar or Path "foo/bar, as both sides of " have uncertain spacing (or lack of it) now. One would have to do an even/odd count on all " to distinguish opening from closing ones. This can be error-prone on long and complex docstrings and even more so if it were single quotes.
While maybe somehow feasible, solving this in postprocessing is a nontrivial and overly complex task.

@albert-github
Copy link
Collaborator

Some analysis.

Lets summarize the seen and not seen spaces based on the smaller line:

Tests {@code Path}. BLAH

so we see here the following spaces :

  • before the {
  • after the code code, signalling the end of the command name, so should not appear in the result (but what to do when there are 2 spaces? looks like doxygen will remove both of them.)
  • a space after the .

what is the result in the doxygen xml output:

<para>Tests<programlisting><codeline><highlight class="normal">Path<sp/></highlight></codeline>
</programlisting> . BLAH </para>

here we see the following:

  • a space between the </programlisting> and the .
  • a space after the . (as expected)
  • a space before the </para>]
  • a special space <sp/> before the </highlight>

Interesting to see here is also the result of doxygen's comment conversion, which results in:

Tests @code Path @endcode . BLAH

i.e.

  • {@code is replaced by @code
  • the corresponding } is replaced by @endcode

This is a difficult problem as it looks like the javadoc @code command is intended for inline comments and the doxygen @code is for creating blocks of code in the resulting output.

In the JavaDoc documentation we read:

Inline tags - Can be placed anywhere in the main description or in the comments for block tags. Inline tags are denoted by curly braces: {@tag}.

and

{@code text}
Equivalent to {@literal}.

A small experiment, replacing the {@code with <code> and } with </code> results from comment conversion:

<para>Tests <computeroutput> Path</computeroutput>. BLAH </para>

So we see that we now have an inline type of output (see the HTML output) and that we don't have a program listing anymore but computer output as output type and that most extra spaces are gone, just a space after the <code> is present (but probably easy to get rid of by redefining the lex rule <CComment>"{@code"/[ \t\n] into <CComment>"{@code"[ \t\n] or maybe even <CComment>"{@code"[ \t\n]+ and ignoring the white space, might though interfere with wanted initial space in the code).

@doxygen should we replace the {@code with <code>? looks to me cleaner and more conform the javadoc definition.
@Stewori in case the proposed change should be implemented how much white space should be removed after the original {@code?

(The space before the </para> is a different problem)

@Stewori
Copy link
Author

Stewori commented Jun 9, 2021

in case the proposed change should be implemented how much white space should be removed after the original {@code?

I am not really sure (I was more concerned about the space before {@code). I will check how Javadoc handles it. (Edit: Javadoc deletes exactly one space after {@code, which is required as a delimiter. I think that's indeed the best solution since it has least information loss.)

To make the issue clear more systematically, and make clear I don't bother about adding a white space or not, but rather about information loss.

/**
 * a {@code xxx} b
 * a{@code xxx} b
 * a {@code xxx}b
 * a{@code xxx}b
 * a{@code xxx }b
 * a{@code  xxx}b
 * a{@code  xxx }b
 *
 * a  {@code xxx}  b
 * a{@code xxx}  b
 * a  {@code xxx}b
 */
public class Test {}

becomes all the same (almost):

    <detaileddescription>
<para>a<programlisting><codeline><highlight class="normal">xxx<sp/></highlight></codeline>
</programlisting> b a<programlisting><codeline><highlight class="normal">xxx<sp/></highlight></codeline>
</programlisting> b a<programlisting><codeline><highlight class="normal">xxx<sp/></highlight></codeline>
</programlisting> b a<programlisting><codeline><highlight class="normal">xxx<sp/></highlight></codeline>
</programlisting> b a<programlisting><codeline><highlight class="normal">xxx<sp/><sp/></highlight></codeline>
</programlisting> b a<programlisting><codeline><highlight class="normal">xxx<sp/></highlight></codeline>
</programlisting> b a<programlisting><codeline><highlight class="normal">xxx<sp/><sp/></highlight></codeline>
</programlisting> b</para>
<para>a<programlisting><codeline><highlight class="normal">xxx<sp/></highlight></codeline>
</programlisting> b a<programlisting><codeline><highlight class="normal">xxx<sp/></highlight></codeline>
</programlisting> b a<programlisting><codeline><highlight class="normal">xxx<sp/></highlight></codeline>
</programlisting> b </para>
    </detaileddescription>

That is information loss. One cannot know which was the original text intended by the author.
Expected behavior:

    <detaileddescription>
<para>a <programlisting><codeline><highlight class="normal">xxx</highlight></codeline>
</programlisting> b a<programlisting><codeline><highlight class="normal">xxx</highlight></codeline>
</programlisting> b a <programlisting><codeline><highlight class="normal">xxx</highlight></codeline>
</programlisting>b a<programlisting><codeline><highlight class="normal">xxx</highlight></codeline>
</programlisting>b a<programlisting><codeline><highlight class="normal">xxx<sp/></highlight></codeline>
</programlisting>b a<programlisting><codeline><highlight class="normal"><sp/>xxx</highlight></codeline>
</programlisting>b a<programlisting><codeline><highlight class="normal"><sp/>xxx<sp/></highlight></codeline>
</programlisting>b</para>
<para>a  <programlisting><codeline><highlight class="normal">xxx</highlight></codeline>
</programlisting>  b a<programlisting><codeline><highlight class="normal">xxx</highlight></codeline>
</programlisting>  b a  <programlisting><codeline><highlight class="normal">xxx</highlight></codeline>
</programlisting>b</para>
    </detaileddescription>

I am fine, however, with normalizing double(or more?)-spaces to one, but that's up to you. I think this is the behavior that all other tags already have.

should we replace the {@code with <code>?

Changing the result to <computeroutput> may break existing code (including my own). It would not be too difficult to fix this in my case, but for the overall community it may be more difficult. I'm not sure if that's worth it. Finally, <computeroutput> would, unlike <programlisting>, not feature syntax highlighting. So, for me it would be perfectly fine to stick with <programlisting> if the spacing was preserved like demonstrated.

@albert-github
Copy link
Collaborator

The change from replacing {@code with <code> instead of @code has a number of components:

  • information would be inline and not as a separate block so more inline with the javadoc definition
  • change of <programlisting><codeline><highlight ...> to <computeroutput>:
    • <programlisting> to <computeroutput> could probably be easily overcome in doxygen by some redefinition for this case (maybe defining @code{inline})
    • <codeline> signals that we have a line might be left out in case @code{inline}
    • <highlight> could also automatically be introduced / remain with @code{inline}

so by introducing @code{inline} could probably solve the problem by just leaving out the <codeline> tag.
most likely some "fiddling" regarding the spaces has to be done, good to know that javadoc strips away only a single space just after the @code.

The remark: Changing the result to <computeroutput> may break existing code is important, though I think the current implementation in doxygen (block versus inline) isn't correct accoring to the javadoc definition for {@code. So it would break existing code as it is fixing a bug (and thus can, in my opinion, current incorrect behavior).

@ doxygen what is your opinion?

@doxygen
Copy link
Owner

doxygen commented Jun 13, 2021

@albert-github I agree that the current implementation is wrong and should not have used @code...@encode as a replacement for {@code ...}, but rather <code>..</code> indeed.

I also found an article describing the various ways code fragments can be written in Javadoc: https://reflectoring.io/howto-format-code-snippets-in-javadoc/

@Stewori
Copy link
Author

Stewori commented Jun 13, 2021

Would the <pre> + @code variant from the linked document be supported? @code would have to preserve indentation and newlines for that. I don't think that just replacing it by <code> would be sufficient for that, would it?

@albert-github
Copy link
Collaborator

albert-github commented Jun 13, 2021

I've created 3 variants based on the current master (1.9.2 (6200bc8*)):

  • the current master, see the attachment directory example the directory current
  • instead of translation to @code ... @endcode a translation to <pre>@code ... @endcode</pre> (current_pre directory)
  • instead of translation to @code ... @endcode a translation to <code> ... </code> (current_code directory)

The resulting conversion by means of -d commentcnv is also supplied in the files commentcnv.res on the directories.

When just looking at the HTML output the <code> variant looks reasonable good, the <pre> variant looks just as bad (in my opinion) as the original version.
I didn't look at the XML output, nor at maybe some small space problems, but my conclusion is that, at the moment, the <code> is the best variant.

The results: results.tar.gz

@doxygen
Copy link
Owner

doxygen commented Jun 14, 2021

We could probably also treat <pre>{@code as a different token than a standalone {@code and replace the first with @code and the second with <code>. Only issue is what to do if the end token is not matching (i.e. a } with extra stuff before the </pre>)

@Stewori
Copy link
Author

Stewori commented Jun 14, 2021

@albert-github in the computeroutput variant there still appears to be a bug:

    <detaileddescription>
<para>Tests if blah with a <computeroutput>Path</computeroutput>. On blah "<computeroutput>foo/bar</computeroutput>" ends with "<computeroutput>foo/bar</computeroutput>" and "<computeroutput>bar</computeroutput>". It does not end with "<computeroutput>r</computeroutput>" or "<computeroutput>/bar</computeroutput>". Note that blah and so the <computeroutput> * Path</computeroutput>"<computeroutput>foo/bar</computeroutput>" with the <computeroutput>String</computeroutput> "<computeroutput>bar/</computeroutput>" returns <computeroutput>true</computeroutput>. </para>
    </detaileddescription>

How does the asterisk in <computeroutput> * Path</computeroutput> arise?

The preformatted/programlisting version looks good to me in terms of spacing! However one cannot distinguish <pre>{@code from plain {@code any more, or would the former one result in a double-<pre>? Anyway, it is misleading for inline-use of {@code.

We could probably also treat <pre>{@code as a different token than a standalone {@code

Nasty things like <pre> blah {@code will occur. Not a waterproof solution I suspect :(

Would it be feasible to represent {@code as <computeroutput> but preserving spaces and newlines, e.g. by representing them as <sp/> and <br/> respectively? This would be my favorite behavior, because then whoever is processing the xml is free to choose whether to ignore the linebreaks (by replacing <br/> with spaces) or not. All information would be present.
Second favorite would be to keep @code ... @endcode but somehow fix the spacing to be accurate.

@Stewori
Copy link
Author

Stewori commented Jun 14, 2021

I did some experiments by myself (using a filter plus an alias) to check out the <code> variant with multi-line. It appears that <code> -> <computeroutput> actually does preserve spaces (e.g. indentation) and newlines. I was not aware of that (thought all would be normalized to single space).
Then I agree that using <code> is the best approach (if the surprising asterisk is resolved). One can process a surrounding <pre> in xml processing according to ones needs.

In the approach with the filter and alias I get issues with unescaped sharp brackets though, e.g. {@code <p>example</p> does not work properly. However, I hope @albert-github's approach gets this right.

Another pitfall: if unescaped, certain tags may be recognized by doxygen and get tags on xml-level, e.g. {@code <U>}. This can cause a real mess as it is turned into an open-ended <computeroutput><underline> (not really open-ended; eventually doxygen closes it at some point, but the mess is already complete).
Javadoc authors usually do not expect that sharp bracket escaping is necessary in {@code ... }, so Doxygen has to apply the escaping when turning {@code ... } into <code>...</code>. @albert-github I did not yet look at your approach and whether it accounts for this. Will have a look tomorrow.

@albert-github
Copy link
Collaborator

You mention 3 problems here

  • the asterisk problem
  • the p tag problem
  • the not colosed tag problem (here shown with the U tag)

I used the following:

/**
 * Test with newline at end so "inline" comment continues at next line: {@code
 * Path} rest of comment.
 *
 * test with p tag {@code <p>example</p>}.
 *
 * test with u tag {@code <U>}.
 *
 * some more test ....
 */
public class Test1 {}

And as implementation (commentcnv.l):
instead of:

<CComment>"{@code"/[ \t\n]           {
                                     copyToOutput(yyscanner,"@code",5); 

use:

<CComment>"{@code"[ \t\n]           {
                                     copyToOutput(yyscanner,"<code>",6); 
                                     if (yytext[yyleng-1] == '\n') copyToOutput(yyscanner,"\n",1);

and (in rule <VerbatimCode>"}" {)instead of:

                                         copyToOutput(yyscanner," @endcode ",10);

use:

                                         copyToOutput(yyscanner,"</code>",7);

Which results in:

image

and the following warnings:

.../test1.java:7: warning: found </code> tag while expecting </u>
.../test1.java:10: warning: end of comment block while expecting command </u>
.../test1.java:10: warning: end of comment block while expecting command </code>

Thus the asterisk problem is solved but the other tags are still interpreted.

@doxygen should we use something like <code class="verbatim"> where the new class="verbatim" would leave the special commands (i.e. starting with @, \, <..>) uninterpreted?

@Stewori
Copy link
Author

Stewori commented Jun 15, 2021

... would leave the special commands (...) uninterpreted?

However, this should not affect autolink functionality (i.e. unlike @verbatim ... @endverbatim would). Especially in code samples, autolink is a useful feature.

@doxygen
Copy link
Owner

doxygen commented Jun 19, 2021

We should then probably have a new internal command for javadoc style processing, i.e. map {@code...} to @jdcode..@endjdcode or something like that, or alternatively use @code{javadoc}...@endcode

@albert-github
Copy link
Collaborator

@Stewori
One thing that bothers me a little bit is how are } signs used / handled in the {@code ... } part in javadoc?
Most likely with \}, though it looks like this is currently not handled properly in the {@code ... } either.
We have to be careful with this as \} is seen in doxygen as end of a grouping.

@Stewori
Copy link
Author

Stewori commented Jun 20, 2021

@doxygen may I suggest to use a neutral name, i.e. not Java-specific, since a command for inline code is probably useful more generally.

@albert-github Ithink I once read in the spec that the } character is invalid in inline javadoc tags. However, I observed that in reality javadoc parses {, } inside @code in balanced manner. I think doxygen does this as well in @command{...} environments. So no special handling is required here. Note that in javadoc, \} does not count as escaping. One has to use html-style escaping instead, or \u... unicode notation.

@albert-github
Copy link
Collaborator

  • I second the neutral name
  • @Stewori
    • the doxygen situation with @command{...} is a bit different, the {...} part here are "just" defined options and these don't contain curly brackets
    • Another situation is the situation is the situation with ALIASES setting, but matching brackets is a very difficult problem as for aprogram / rule it is very hard to determine what is intended. When you find a reference to " I think I once read in the spec that the } character is invalid in inline javadoc tags." I think this would be useful (I didn't find it when I searched I didn't find anything, I think best is to declare it invalid (so stopping the {@code part as soon as a } is encountered.

@Stewori
Copy link
Author

Stewori commented Jun 20, 2021

When you find a reference to " I think I once read in the spec that the } character is invalid in inline javadoc tags."

Unfortunately (or maybe fortunately) I cannot spot it any more. Probably it was in an old version of the spec. There appears to be a bug fix about this topic: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4965490

Anyway, in the current spec, the balanced parsing of {...} is explicitly specified. See https://docs.oracle.com/en/java/javase/16/docs/specs/javadoc/doc-comment-spec.html#inline-tags.
(I think that wasn't so clear in older versions. Edit: Indeed the balanced parsing was not explicitly specified for Java 14 and earlier, see https://docs.oracle.com/en/java/javase/14/docs/specs/javadoc/doc-comment-spec.html#inline-tags. However, javadoc used to handle it this way already in Java 14 and several earlier versions, not sure how many exactly, but I guess at least down to Java 5.).

Stopping at the first } would cause a lot of trouble and would be far worse than the current situation with its spacing issue.

While looking at the spec I found that {@code ...} is specified to be similar to <code>{@literal ...}</code>. So, given the name question, may I suggest to call the new command "literal", i.e. @literal{...} and let it just do the escaping of certain critical characters (<, > and & I think) and autolink while leaving out any other interpretation of the content except that {, } are parsed in balanced manner.

Then, as a side effect this can support the @literal javadoc tag by mapping {@literal ...} to @literal{...}. It would also solve the inline code issue - obviously - by mapping {@code ...} to <code>@literal{...}</code>.

This approach would allow to close #4953 as well.

@albert-github
Copy link
Collaborator

This remains a difficult problem. I think we should split the problem in a number of parts

  • converting {@code ...} to another doxygen command
  • adding the option inline to the doxygen @code command
  • adding the option linkonly / literal / javadoc or whatever name we choose to the doxygen @code command`

Rationale:

  • with the last 2 points we could replace the {@code ...} with @code{inline,linkonly} ... @endocode
  • split the effort in a number of independent pieces of work
  • enhance the doxygen command @code for other usages as well (I think especially the inline could be useful.

@Stewori
Copy link
Author

Stewori commented Jun 21, 2021

+1 on the idea to do this via options, as it is the most modular solution. linkonly is supposed to indicate whether autolink shall be applied, right?

I am aware that the suggestion to do this in one go with adding @literal support may be asked too much. It really just popped up because I happened to read how closely related the @code and @literal commands in Javadoc actually are. So I thought it might be a good fit. Feel free to ignore this idea.
In case you are willing to follow that idea, however:
Then linkonly and literal should be two separate options. literal would indicate to render the result in code-font or not. For xml output I suggest to simply add a literal="yes" attribute to the resulting <computeroutput> tag to indicate whether the literal option was present.

@albert-github
Copy link
Collaborator

  • indeed linkonly only the autolink

  • @literal is a new issue, but can probably be handled in commentcnv.l as well, but I also want to mention my comment in the issue JavaDoc @literal is not recognized (Origin: bugzilla #688386) #4953:

    Not all JavaDoc commands are valid in doxygen, doxygen is not a clone of
    JavaDoc and, I think, it is doubtful whether or not all JavaDoc commands
    should be and can be valid in doxygen.

    but still it should be investigated (later)

albert-github added a commit to albert-github/doxygen that referenced this issue Dec 7, 2021
In the JavaDoc documentation we see:
```
{@code  text}
    Equivalent to <code>{@literal}</code>.
```
and
```
{@literal  text}
    Displays text without interpreting the text as HTML markup or nested javadoc tags
```

These commands have been implemented by means of transforming them into doxygen commands.

This implementation also:
- the `{@literal ..}` command, doxygen#4953 JavaDoc @literal is not recognized (Origin: bugzilla #688386)
- doxygen#4949 JavaDoc @code makes a block element instead of inline element (Origin: bugzilla #688381)
- doxygen#4952 JavaDoc @code generates link (Origin: bugzilla #688385)
@albert-github
Copy link
Collaborator

I've just pushed a proposed patch, pull request #8949

doxygen added a commit that referenced this issue Dec 8, 2021
issue #8590 XML: Issue with spacing around <programlisting>
@albert-github albert-github added the fixed but not released Bug is fixed in github, but still needs to make its way to an official release label Dec 9, 2021
@albert-github
Copy link
Collaborator

Code has been integrated in master on GitHub (please don't close the issue as this will be done at the moment of an official release).

@doxygen
Copy link
Owner

doxygen commented Dec 31, 2021

This issue was previously marked 'fixed but not released',
which means it should be fixed in doxygen version 1.9.3.
Please verify if this is indeed the case. Reopen the
issue if you think it is not fixed and please include any additional information
that you think can be relevant (preferably in the form of a self-contained example).

@doxygen doxygen removed the fixed but not released Bug is fixed in github, but still needs to make its way to an official release label Dec 31, 2021
@doxygen doxygen closed this as completed Dec 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants