pageutils sectionID function removing dots and colons: why? #2580

Open
trollkotze opened this issue Nov 18, 2018 · 3 comments

Comments

@trollkotze
Contributor

trollkotze commented Nov 18, 2018

In inc/pageutils.php, line 231, there is a function named sectionID:

function sectionID($title,&$check) {
    $title = str_replace(array(':','.'),'',cleanID($title));
    $new = ltrim($title,'0123456789_-');
    if(empty($new)){
        $title = 'section'.preg_replace('/[^0-9]+/','',$title); //keep numbers from headline
    }else{
        $title = $new;
    }

    if(is_array($check)){
        // make sure titles are unique
        if (!array_key_exists ($title,$check)) {
            $check[$title] = 0;
        } else {
            $title .= ++ $check[$title];
        }
    }

    return $title;
}

This function is apparently used to transform fragment identifiers in URLs and the id attributes of section headers into some "allowed" format by removing, among other things, dots and colons.

It is almost exclusively called from within the function _headerToLink (defined identically in inc/parser/xhtml.php and inc/parser/metadata.php), which, despite its name, is also used for rendering the href of links written in wiki syntax like this:

Anchorlink: [[#mv.01.01]]
which becomes (simplified)
Anchorlink: <a href="#mv0101">mv.01.01</a>
(i.e. the dots are removed from the href)

(called, for example, from within the functions locallink and internallink in inc/parser/xhtml.php)

So it is not possible to define anchor ids containing dots, for example, and link to them.
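
To make the effect concrete, here is a rough JavaScript port of the function quoted above, for illustration only. The `cleanIdStub` helper is a hypothetical stand-in for DokuWiki's `cleanID` (which does considerably more); it is assumed here only to trim and lowercase the title. The `Map` passed as `check` plays the role of the PHP array.

```javascript
// Hypothetical stand-in for DokuWiki's cleanID(); the real function does far more.
function cleanIdStub(title) {
    return title.trim().toLowerCase();
}

// Illustrative port of sectionID(); `check` is an optional Map of ids already used.
function sectionId(title, check) {
    // Strip colons and dots, as str_replace(array(':','.'),'',...) does.
    title = cleanIdStub(title).replace(/[:.]/g, '');
    // Equivalent of ltrim($title,'0123456789_-').
    const trimmed = title.replace(/^[0-9_-]+/, '');
    if (trimmed === '') {
        // Nothing left after trimming: keep only the digits of the headline.
        title = 'section' + title.replace(/[^0-9]+/g, '');
    } else {
        title = trimmed;
    }
    if (check instanceof Map) {
        // Make sure ids are unique by appending a counter to repeats.
        if (!check.has(title)) {
            check.set(title, 0);
        } else {
            const count = check.get(title) + 1;
            check.set(title, count);
            title += count;
        }
    }
    return title;
}

console.log(sectionId('mv.01.01'));  // mv0101
console.log(sectionId('1.2'));       // section12
```

Under these assumptions, a link target such as `mv.01.01` and a header with the same text both end up as `mv0101`, so the link still resolves, but the dotted form never appears in the generated HTML.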

I wonder what the rationale behind this behaviour is, if there is one, and whether it should in fact only pertain to the anchor ids of TOC-relevant section headers, which, for some reason, should not be allowed to contain for example dots and colons. (I understand that colons are used for the namespace hierarchy when not using URL rewrites, so there is an argument for disallowing colons in section ids for better legibility, but since the fragment is separated by a hash '#', this is not a technical necessity.)

Dots and colons are not illegal characters for fragment links and id tags in the XHTML 1.0 specification (see section "C.8. Fragment Identifiers" there). So I see no reason why they are stripped here.

In a DokuWiki installation that I help administer, we need dots in some fragment URLs (including some section headers). So I changed inc/pageutils.php, line 231
from

    $title = str_replace(array(':','.'),'',cleanID($title));

to

    $title = str_replace(array(':'),'',cleanID($title));

I have not experienced any problems since then, and can now use dots in DokuWiki hash links and section ids.

If there is no actual technical reason why dots (and maybe also colons) should be disallowed, I propose incorporating that change into the official DokuWiki source.

@ssahara
Collaborator

ssahara commented Nov 19, 2018

Removing dots from headline IDs was implemented around 2009, likely to avoid conflicts with the CSS class selector (.)

@splitbrain
Collaborator

Also see the discussion in the old bug tracker on that: https://bugs.dokuwiki.org/1627.html

@trollkotze
Contributor Author

trollkotze commented Nov 19, 2018

Thanks. I see. I already thought that might be what's behind it: caution with regard to JavaScript/jQuery and CSS selector syntax.

However, in this case the caution seems to be based on a misreading of the W3C spec for CSS2. (Or rather a miswriting: The spec formulation is really badly phrased at this point.)

In the old bug tracker (https://bugs.dokuwiki.org/1627.html), HåkanS quotes the W3C spec:

from: http://www.w3.org/TR/CSS2/syndata.html

In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A1 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, or a hyphen followed by a digit. Identifiers can also contain escaped characters and { any ISO 10646 character as a numeric code } (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".

(highlighting of parts in bold by me, as well as marking a self-contained expression with { ... } to avoid a reference error when parsing)

So the first sentence says "only", but the very next sentence continues with "also", which is really confusing. Moreover, the qualification "as a numeric code" applies only to "any ISO 10646 character", as can be concluded from the examples given there.

I found this website does a good job at phrasing this all more clearly: https://mathiasbynens.be/notes/css-escapes

=== Identifiers and strings in CSS ===

The spec defines identifiers using a token diagram. They may contain the symbols from a to z, from A to Z, from 0 to 9, underscores (_), hyphens -, non-ASCII symbols or escape sequences for any symbol. They cannot start with a digit, or a hyphen (-) followed by a digit. Identifiers require at least one symbol (i.e. the empty string is not a valid identifier).

These are the relevant diagrams ("railroad diagrams"):
https://drafts.csswg.org/css-syntax/#ident-token-diagram
https://drafts.csswg.org/css-syntax/#escape-diagram

The latter diagram makes it clear that any non-newline and non-hex (i.e., not 0-9 and A-F [case insensitive]) character can be inserted in an escaped way, by simply preceding it with a backslash. (Even standard "allowed" characters (except for newline and hex characters), which would not need escaping, can be inserted in this way, e.g. '\g' in a CSS identifier is equivalent to 'g'. [Note that '\n' for example is not escaped to a newline in CSS, but simply to 'n'.])

The standard JavaScript DOM methods, like document.querySelector and document.querySelectorAll, as well as jQuery, also handle these escaped CSS identifiers without problems. The only thing to keep in mind: since JavaScript itself uses backslash escapes in string literals in a very similar way, one has to write double backslashes if the backslash is meant to reach the selector parser as an escape in the CSS identifier.
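
A small sketch of the double-backslash point. `escapeIdForSelector` is a hypothetical helper, not part of DokuWiki or jQuery, and it deliberately handles only dots and colons; in browsers, `CSS.escape()` covers the general case.

```javascript
// Hypothetical helper: backslash-escape dots and colons in an id so that it can
// be used inside a CSS selector. Only these two characters are handled here.
function escapeIdForSelector(id) {
    return id.replace(/[.:]/g, '\\$&');
}

// In JavaScript source each backslash is doubled; the string itself holds one.
const literal = '#mv\\.01\\.01';
const built = '#' + escapeIdForSelector('mv.01.01');

console.log(literal);            // #mv\.01\.01
console.log(built === literal);  // true

// In a browser the selector would then be used as, for example:
// document.querySelector(built);
```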

With all that said: This only tells us what's allowed in CSS, and how it must be escaped in CSS (and JavaScript dealing with CSS selectors).

When simply generating HTML and filling the "id" and "class" attributes, one does not have to worry about escaping. So <div id="something.with.dots:or.even.colons"></div> is perfectly legal and could be referenced by this CSS:

#something\.with\.dots\:or\.even\.colons {
}

The CSS standard would allow any character in an escaped form inside identifiers. But HTML or XHTML have more restrictions on what can be used in "id" and "class" attributes:

From the XHTML 1.0 standard under the heading "C.8. Fragment Identifiers":

since the set of legal values for attributes of type ID is much smaller than for those of type CDATA, the type of the name attribute has been changed to NMTOKEN (link inserted by me: referring to a website in German language). This attribute is constrained such that it can only have the same values as type ID, or as the Name production in XML 1.0 Section 2.3, production 5.

/.../

Note that the collection of legal values in XML 1.0 Section 2.3, production 5 is much larger than that permitted to be used in the ID and NAME types defined in HTML 4. When defining fragment identifiers to be backward-compatible, only strings matching the pattern [A-Za-z][A-Za-z0-9:_.-]* should be used.

(emphasis mine; I think the class attribute allows the same set of characters, but I don't want to look it up now)

So using dots and colons inside class and id attributes causes no problems: they can be addressed in CSS using escapes (i.e. \. or \:), they are allowed by XHTML 1.0, and they are even backwards-compatible with HTML 4.
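
As a quick check, the backward-compatible pattern quoted from the XHTML 1.0 text can be applied directly as a regular expression. This is only a sketch: the function name is made up for illustration, and the `^`/`$` anchors are added so that the whole string has to match.

```javascript
// Pattern quoted from XHTML 1.0, section C.8, anchored to match the whole string.
const BACKWARD_COMPATIBLE_ID = /^[A-Za-z][A-Za-z0-9:_.-]*$/;

// Hypothetical helper name, used here for illustration only.
function isBackwardCompatibleId(id) {
    return BACKWARD_COMPATIBLE_ID.test(id);
}

console.log(isBackwardCompatibleId('mv.01.01'));                            // true
console.log(isBackwardCompatibleId('something.with.dots:or.even.colons'));  // true
console.log(isBackwardCompatibleId('0101'));                                // false
```

Ids containing dots or colons pass; the only case rejected here is the one starting with a digit.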

trollkotze added a commit to trollkotze/dokuwiki that referenced this issue Mar 20, 2022
A verbose explanation of why colons and periods should be allowed in section IDs, and why they do not cause problems, can be found in this 3-year-old issue: dokuwiki#2580
I have been running a DokuWiki with this small modification for 3 years; allowing '.' and ':' in section ids, anchor links to them, etc. has caused no problems in all this time.