@SingingBush
Contributor

Fixes a few Javadoc comments for functions, but most of the work is on the HTML markup in the RegularExpression class Javadoc.

* <li>Character
* <dl>
* <dt class="REGEX"><kbd>.</kbd> (A period)
* <dt><code>.</code> (A period)
Contributor

Not sure why class="REGEX" is removed here. It seems OK to leave in place.

Contributor Author

Custom CSS isn't ideal within Javadoc. Javadoc is most commonly displayed directly within an IDE, where any referenced CSS class will not be used. Also, kbd was not the right choice for what's being documented, and changing to a code tag (which is the best option here) is likely to have affected any style tied to the REGEX class.

Contributor Author

In fact, looking at site.css, there isn't even a .REGEX class defined. I recommend ditching site.css entirely anyway, for the reasons above. Perhaps reading published Javadocs was useful up until about twenty years ago, but any editor worth using will just render the Javadoc into a tooltip these days.

Contributor

Class attributes aren't just for CSS, nor is site.css the only CSS that can be applied to this.

I also don't think it's reasonable to assume that people only use certain IDEs to browse and render this. Search is a thing.

Contributor Author

It would be good to know whether the class is actually used or is a holdover from the past. Even for generated HTML, a sensible DOM structure would be better than custom CSS rules. I'll take another look at the generated docs with and without the class name to see whether any CSS rules are being applied.

Contributor Author

I've looked into this using browser dev tools on the generated HTML docs and can confirm that no CSS rules are being brought in for a REGEX class. I checked the dt element as well as its child elements. I also double-checked the build/docs/javadocs/xerces2/stylesheet.css file; there's no such class.

Contributor

  1. the class attribute is not just for CSS
  2. We do not and cannot know all CSS stylesheets that might be applied to this

Contributor Author

OK, I'll put the class name back on the dt and, as per the other comment, plan to drop the improper use of kbd.

+ * <dt class="REGEX"><kbd>[</kbd><var>R<sub>1</sub></var><var>R<sub>2</sub></var><var>...</var><var>R<sub>n</sub></var><kbd>]</kbd> (without <a href="#COMMA_OPTION">"," option</a>)
+ * <dt class="REGEX"><kbd>[</kbd><var>R<sub>1</sub></var><kbd>,</kbd><var>R<sub>2</sub></var><kbd>,</kbd><var>...</var><kbd>,</kbd><var>R<sub>n</sub></var><kbd>]</kbd> (with <a href="#COMMA_OPTION">"," option</a>)
+ * <dt><code>[R1R2...Rn]</code> (without a {@link #SPECIAL_COMMA} option)</dt>
+ * <dt><code>[R1,R2,...,Rn]</code> (with a {@link #SPECIAL_COMMA} option)</dt>
Contributor

sub element should be OK

Contributor Author

I can put it back, but it doesn't look right. When making these changes I was viewing the output a lot. It's worth viewing the rendered output when reviewing these HTML changes.

Contributor

The existing doc is not super well written, but I think it does need to be clear that n is not literal, and the subscript does that.

Contributor Author

I'll have a go at doing the markup with sub and see how it looks.

Contributor Author

@SingingBush SingingBush Nov 1, 2025

I took a quick look at this. The existing approach is ok for html shown in a browser:

image

(new at top, old at bottom in this image)

...but doesn't work out so well within an IDE (old at top, showing a different line):
image

I am happy to put some of this stuff back if the HTML results are the main concern, but perhaps it's worth finding a compromise here so that the markup is more readable from an IDE.

For example, by ditching the kbd tag (which is supposed to be for keyboard input anyway), the result is much more readable and retains the var and sub:
image
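
Since the screenshots are not reproduced here, the following is a rough sketch of the kind of markup that compromise implies: kbd swapped for code, with var, sub, and the class attribute on dt retained. It is not the merged Javadoc. The class `RegexDocSketch`, the helper `usesCommaOption`, the value given to `SPECIAL_COMMA`, and the `dd` text are all placeholders invented for illustration.

```java
/** Hypothetical holder class; it exists only so the Javadoc below has something to attach to. */
final class RegexDocSketch {

    /** Placeholder for the option constant that the Javadoc links to; the value here is arbitrary. */
    static final int SPECIAL_COMMA = 1;

    /**
     * Sketch of one definition-list entry using code instead of kbd.
     * <dl>
     *   <dt class="REGEX"><code>[</code><var>R<sub>1</sub></var><var>R<sub>2</sub></var><var>...</var><var>R<sub>n</sub></var><code>]</code>
     *       (without the {@link #SPECIAL_COMMA} option)</dt>
     *   <dd>(description unchanged)</dd>
     * </dl>
     *
     * @param options bit flags, hypothetical for this sketch
     * @return true if the placeholder comma flag is set
     */
    static boolean usesCommaOption(int options) {
        return (options & SPECIAL_COMMA) != 0;
    }
}
```

The subscripts stay inside var, so it remains clear that n is not a literal character, while the literal brackets are plain code.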

Contributor Author

I've made a start on the approach suggested in the previous comment and pushed the work in progress. There's some more to sort out, which will hopefully get done over the weekend.

* <p>This range matches the character.</p>
* </li>
* <li><code>C1-C2</code>
* <p>This range matches a character which has a code point that is >= <var>C1</var>'s code point and &lt;= <var>C2</var>'s code point.</p>
Contributor

You're still using var here, which you took out in most other places.

Contributor Author

I left it here as it rendered OK when I viewed it. In many of the places where var was used, it was within a section that really needed a code block, but because kbd was used the structure of the markup was getting badly messed up. I am happy to change it to code, though.

@SingingBush SingingBush force-pushed the javadoc/XERCES-1781-part-6 branch from 676bf71 to ca157c8 November 1, 2025 13:21
@SingingBush
Contributor Author

SingingBush commented Nov 2, 2025

It's worth taking a look at this now. For the comment lines that have a + at the start, I'm not sure whether whoever added it expected those lines to be left out of the Javadoc, but they do get rendered. With that in mind, should they be removed altogether?
If they are not to be removed, then the + should be removed, as it messes up the generated doc (see the related lines under Character class).

@SingingBush SingingBush requested a review from elharo November 2, 2025 11:19
* @param useNrage Ignored.
* @return This returns no NrageToken.
* @param useNrange ignored
* @return a {@link RangeToken}, returns no NRANGE token
Contributor

delete "a {@link RangeToken},"

is "no NRANGE token" supposed to be one enum?

Contributor Author

Since the previous text was NrageToken, I just rewrote it using the name of the int representation of that token, Token.NRANGE.

Deleting "a {@link RangeToken}," now.

@SingingBush SingingBush requested a review from elharo November 2, 2025 15:39
* <ul>
* <li><code>\ooo</code> (Octal character representations)</li>
* <li><code>\G</code>, <code>\C</code>, <code>\lc</code></li>
* <li><code>\ uc</code>, <code>\L</code>, <code>\U</code></li>
Contributor

I think that's an extra space between \ and uc that shouldn't be there

Contributor Author

Potentially. I thought the same when changing it from <kbd>\u005c u</kbd>. I don't think \uc is a thing, unless the c means a char, which should be represented as a hexadecimal value. I'll push a commit that makes it \uc, which is better than the current situation.

Contributor

Ah, so this was originally an encoded backslash. The backslash probably didn't need to be encoded here, since this was not in a string literal, where it would need to be encoded.

Contributor Author

@SingingBush SingingBush Nov 3, 2025

It seems that removing the space has broken the build:

   [xjavac] Compiling 712 source files to /home/runner/work/xerces-j/xerces-j/build/classes
   [xjavac] /home/runner/work/xerces-j/xerces-j/build/src/org/apache/xerces/impl/xpath/regex/RegularExpression.java:92: error: illegal unicode escape
   [xjavac]  *    <li><code>\uc</code>, <code>\L</code>, <code>\U</code></li>
   [xjavac]                    ^
   [xjavac] /home/runner/work/xerces-j/xerces-j/build/src/org/apache/xerces/impl/xpath/regex/RegularExpression.java:73: error: illegal unicode escape
   [xjavac]  *   <li><code>,</code> : The parser treats a comma in a character class as a range separator.
   [xjavac]                                                                                       ^
   [xjavac] 2 errors

I can use <code>\u005cuc</code> to fix it.

Contributor

Yes, the original was correct. \u is recognized as the start of a Unicode escape at a very early stage by the Java lexical analyzer.

Contributor Author

so I should put it back to <code>\ uc</code>?

Contributor

No, \u005cuc is correct. The tokenizer will read that as \uc

Unicode escapes are processed before anything else happens.
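
To illustrate the point about escapes being translated first, here is a minimal sketch. The class `UnicodeEscapeDemo` and its methods are hypothetical and are not part of Xerces. The Javadoc comment spells the token with the escaped backslash so that the file compiles, and the string literal shows the usual doubled-backslash form.

```java
final class UnicodeEscapeDemo {

    /**
     * Javadoc for a hypothetical method. The token is written here as
     * <code>\u005cuc</code>: the compiler translates the escape to a single
     * backslash before it recognizes the comment, and that backslash cannot
     * begin another escape, so the rendered text is backslash-u-c.
     * Writing a literal backslash directly in front of "uc" in this comment
     * would fail to compile with "illegal unicode escape".
     */
    static String tokenAsStringLiteral() {
        // Doubling the backslash keeps the 'u' from starting a Unicode escape,
        // so this literal holds three characters: backslash, 'u', 'c'.
        return "\\uc";
    }

    static String escapedLetter() {
        // The escape for U+0041 is translated before tokenization, so this is "A".
        return "\u0041";
    }

    public static void main(String[] args) {
        System.out.println(tokenAsStringLiteral().length()); // prints 3
        System.out.println(escapedLetter());                 // prints A
    }
}
```

The same translation applies inside comments, which is why the build failed at the Javadoc lines even though no string literal was involved.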

Contributor Author

cool, \u005cuc is in the current changes so running CI should be fine

@SingingBush SingingBush requested a review from elharo November 3, 2025 16:29
@elharo elharo merged commit dcacb20 into apache:main Nov 5, 2025
4 checks passed
@SingingBush SingingBush deleted the javadoc/XERCES-1781-part-6 branch November 5, 2025 19:58