Some space characters missing from glossary/whitespace.md #642
Comments
I have mixed feelings about this… As far as I can tell, the definition of whitespace is mostly used in cases like "the accessible name is not only whitespace". And it of course makes sense to say that On the one hand, I do agree that a "ZERO WIDTH SPACE" is kind of a space and having On the other hand, the "not only whitespace" check is just a very basic test that there is actually an accessible name and not just a lazy developer/designer adding We are clearly not checking whether the accessible name makes sense or not. This does require manual testing anyway. "Not only whitespace" is the basic minimum check we can do, but is extremely bare-bones. Manual testing (needed to catch Next, I feel that this is getting into corner cases. Why would somebody write I mean, I do know how to produce SPACE or NO-BREAK SPACE on my keyboard. I have no clue how to produce EM QUAD or ZERO WIDTH SPACE and if I want one, I would probably have to copy/paste it from some Unicode page… So, putting Lastly, there is likely a reason as to why Unicode chose to not put the I think it is way easier for us to rely on existing definitions that are going to be used by everybody rather than to come up with our own definitions. And I notice that you've forgotten (and what about characters like So, to sum up, I think that this would only be useful in a handful of corner cases and is thus not worth the cost of using a custom definition rather than Unicode's definition. |
I agree it's unlikely anyone would type this, but U+FEFF zero width no-break space is quite common because it also functions as the byte-order-mark. Every file saved as 'Unicode' or 'UTF-8' by Windows Notepad starts with this character. Every file saved as 'UTF-16' by macOS TextEdit starts with this character. A blank file saved as Unicode by Notepad contains the single character U+FEFF zero width no-break space. This can get onto web pages in many ways - for example by concatenating files or using server side includes:
|
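As an illustration of the concatenation case described above, here is a minimal Python sketch. The two "files" are hypothetical in-memory byte strings, not taken from the thread; the point is only that stripping a leading byte order mark does not remove one that ends up in the middle of the combined text, and that U+FEFF is a format character (Cf) rather than whitespace.

```python
import codecs
import unicodedata

BOM = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE, also used as the byte order mark

# Two hypothetical files, each saved as "UTF-8 with BOM".
header_bytes = "Hello".encode("utf-8-sig")
footer_bytes = "World".encode("utf-8-sig")

# Naive byte-level concatenation, as a build step or server-side include might do.
# Decoding with "utf-8-sig" strips only the leading BOM; the second one survives.
combined = (header_bytes + footer_bytes).decode("utf-8-sig")

# A blank file that contains nothing but the UTF-8 BOM decodes to a single U+FEFF.
blank_file_text = codecs.BOM_UTF8.decode("utf-8")

print(repr(combined))             # shows the embedded U+FEFF between the two words
print(unicodedata.category(BOM))  # general category of U+FEFF
print(BOM.isspace())              # whether Python treats it as whitespace
```

A check based only on whitespace would therefore treat a name consisting of a stray U+FEFF as non-empty.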
OK. I didn't know it was used in so many places… Then accidental usage may happen… If we want to test these, I think I would prefer to keep This might look a bit more cumbersome but I think it is a good idea to have a definition of whitespace which is the same as the commonly accepted one (Unicode, currently). We could make this easier to read by having a definition of "textual content", or something similar, which would be "not only whitespace or zero width characters" and then the rules would be about having "an accessible name with textual content". From an automation/maintainability/scalability point of view, it is still a bit annoying to have an explicit list that may evolve rather than relying on a property of the character. If Unicode decides to add or remove the |
A good compromise might be Unicode White_Space code points plus U+FEFF zero width no-break space. That's simple to maintain, and the exception is easy to explain. It's worth noting there's very little consistency between W3C specs on the definition of whitespace - HTML has 2 different definitions (in a single spec), and CSS has 2 different definitions - see this issue: I've not checked exactly where whitespace.md is used, but referencing it needs care due to the inconsistent definitions of whitespace in various W3C specs. For example, it might seem reasonable to change the 'ID is unique' rule applicability from:
to
but that wouldn't match the definition of whitespace used by the HTML spec for parsing |
Hmmmm… The difference between exactly empty string ( Which does point in the direction of flagging And digging a bit more into Unicode leads me to a solution which may make everybody (or at least both of us) happy: Unicode has character categories, such as lowercase letters or uppercase letters. Some of these categories specifically contain characters that are non-printable: The five characters you pointed out in this issue all belong to the Cf category. I am not totally sure about all characters in that one (I'm more sure about the Cc category), but maybe the good way to do it would be to prevent names that contain only whitespace, Cc, Cf. (I don't know anything about Arabic and can't decide whether it makes sense to reject Doing it that way will still leave the burden of maintaining the characters list to Unicode while rejecting a bunch of "obviously" wrong things (if an accessible name is composed only of BELL, CANCEL and RIGHT-TO-LEFT MARK that seems to be a big problem…) And the current definition of whitespace (having the |
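The category-based idea in the comment above could be sketched roughly as follows. This is only an assumption about how such a check might look, not an agreed rule: the function name `has_textual_content` is hypothetical, and whitespace is approximated here by the Zs, Zl and Zp categories (the control characters that carry the White_Space property already fall under Cc).

```python
import unicodedata

# Categories treated as "not meaningful on their own" in this sketch (an assumption):
# Cc = control, Cf = format, Zs/Zl/Zp = space, line and paragraph separators.
NOT_MEANINGFUL_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}


def has_textual_content(name: str) -> bool:
    """Return True if at least one character falls outside the rejected categories."""
    return any(
        unicodedata.category(ch) not in NOT_MEANINGFUL_CATEGORIES for ch in name
    )


# The five zero-width characters listed in the issue, all in category Cf.
zero_width = "\u200b\u200c\u200d\u2060\ufeff"

print(has_textual_content(zero_width))            # only format characters
print(has_textual_content("\u0007\u0018\u200f"))  # BELL, CANCEL, RIGHT-TO-LEFT MARK
print(has_textual_content("Submit"))              # ordinary letters
```

Relying on `unicodedata.category` leaves the maintenance of the character lists to Unicode, which is the design goal stated in the comment; the results depend on the Unicode version bundled with the Python interpreter.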
It does seem like there may need to be a rethink in how 'whitespace' is used in rules and/or defined in the glossary. @audreymaniez at #597 suggests considering a more positive wording for rules, and includes a couple of relevant examples. The thing most related rules seem to be checking is that something like an accessible name or text alternative is not made up solely of characters that are not meaningful on their own. This could be whitespace characters, or the control and format characters @Jym77 mentions. Are there any other Unicode character sets that could fall under the umbrella of 'not meaningful on their own'? I agree with Jean-Yves that it would be good to use a widely recognised grouping(s) of characters, rather than create another. |
This is hard to say, depending on what we call "not meaningful on their own" :-D Unicode character property is a nice place to start looking at it. But then, I'm also a bit afraid of getting into the hairy parts of Unicode. Typically, surrogate characters (category |
I guess there are several things to consider:
Unicode 12.1.0 has a section '4.12 Characters with Unusual Properties' which covers 1) and probably 3) as well: Would using the list in 4.12 along with characters with the White_Space property to cover 2) resolve the issue? |
Some of this may be difficult due to how Unicode is represented in the DOM: This is an issue for any characters in the range U+10000 to U+10FFFF (some of which are listed in '4.12 Characters with Unusual Properties'). |
You are right with the surrogates… (thanks UCS-2/UTF-16 for adding to the fun…) There are "not meaningful by themselves" characters out of the BMP (e.g., language tag characters, …) Because they are out of the BMP, they will be represented by 2 surrogate characters in JavaScript and in DOMStrings which essentially use UTF-16. Thus, they will not be caught by anything that allows surrogates, and we can't prevent surrogates as they can also be used in a meaningful way by themselves. However, let's not lose focus. Whatever set of characters we decide to be "not meaningful by themselves", we're still going to have a lot of accessible names that are tagged as OK but aren't. No definition of "bad" character will rule out I do not think that the list of characters in 4.12 is enough. It does not contain most stuff from the Cc category which are nonetheless obviously not meaningful by themselves. I have mixed feelings about rejecting the combining characters. Specifically, I'm afraid that there exists a combination of one combining character plus a whitespace which would produce something that makes sense. Thus it would be bad to reject names consisting only of combining characters and whitespace. It seems that 4.12 is a subset of categories Cf and Mc (plus U+0020 SPACE). I think it would be easier to reject Cc, Cf, Zs, Zl, Zp, (Mc) rather than "cherry picking" the full 4.12 (as in easier to check). |
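To make the surrogate point concrete, the following Python sketch encodes a language tag character as UTF-16 and inspects the resulting code units. The helper names are hypothetical; U+E0001 LANGUAGE TAG is used as one example of a character outside the BMP, and the 0xD800 to 0xDFFF surrogate range is the standard UTF-16 one.

```python
import unicodedata


def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units that a DOMString-like representation would hold."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit: int) -> bool:
    """True for code units in the UTF-16 surrogate range."""
    return 0xD800 <= unit <= 0xDFFF


language_tag = "\U000E0001"  # U+E0001 LANGUAGE TAG, outside the BMP

units = utf16_code_units(language_tag)
print([hex(u) for u in units])             # two code units for one code point
print(all(is_surrogate(u) for u in units))
print(unicodedata.category(language_tag))  # category of the code point itself
```

A check that works on code points sees a single format character, while a check that works on UTF-16 code units sees only a surrogate pair, so any category-based test needs to iterate by code point.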
Something else to take into account…
Thus, there are places where it is probably important to use "whitespace" (whatever the AAMs mean by that…) and not something else. |
We are primarily using the "whitespace" definition when talking about accessible names.
Here of course there is no definition for any of the terms "carriage returns", "newlines", "tabs", "form-feeds" or "white space". |
I believe we discussed this in a previous call. Thanks. |
My current position is that the term "whitespace" is used in several places in the specs (as pointed out in my previous message, and Anne's message) and therefore we need to stick to the same definition that they use rather than come up with our own (i.e. close the issue). Which is of course easier said than done given that the specs that use "whitespace" do not really define it… That is, if the specs say "aria-roledescription [should not be mapped if it] is empty or whitespace characters", then we cannot change the definition of whitespace to mean "whitespace or zero-width spaces" since that would lead to checks being inconsistent with the specs… |
There are ambiguous situations if you stick to the same definition as the specs without additional guidance. In this example which whitespace definition or definitions do you use?
Note: the headers algorithm uses different whitespace definitions at different steps - ASCII whitespace for splitting
I think you may need 2 or 3 whitespace definitions to handle spec inconsistencies:
Edit: it might be clearer if the last 2 definitions use the phrase 'non-visible characters' rather than 'user-perceivable whitespace' |
More whitespace related issues - the normative definitions of 'empty' are all different for:
|
There's a PR in place for the 'empty cells' definition used by the HTML table headers calculation - it's changing to use ASCII whitespace instead of Unicode Zs whitespace, and is now aligned with the The inconsistency between the |
In terms of resolving this issue, I think there are 2 kinds of whitespace here:
At the moment the Whitespace glossary definition is used for both, which has caused a problem in the 'Autocomplete valid' rule (it's flagging invalid autocomplete values as inapplicable). I'd propose doing the following:
The exact nature of characters the user perceives as whitespace is open for debate, but syntactic whitespace isn't since it has a normative definition. It's also easier to discuss the Whitespace glossary definition in user terms once syntax is excluded from the definition. |
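The gap between syntactic and user-perceived whitespace can be shown with a small Python sketch. It assumes the "ASCII whitespace" set used for token splitting is TAB, LF, FF, CR and SPACE, as defined in the WHATWG Infra Standard; the function name and the sample attribute values are hypothetical.

```python
import re

# ASCII whitespace as used for syntactic token splitting (assumed from the Infra Standard):
# U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, U+0020 SPACE.
_ASCII_WHITESPACE_RUN = re.compile("[\t\n\f\r ]+")


def split_on_ascii_whitespace(value: str) -> list:
    """Split an attribute value into tokens on ASCII whitespace only."""
    return [token for token in _ASCII_WHITESPACE_RUN.split(value) if token]


ascii_separated = "shipping  street-address\t"
nbsp_separated = "shipping\u00a0street-address"  # NO-BREAK SPACE between the words

print(split_on_ascii_whitespace(ascii_separated))  # two tokens
print(split_on_ascii_whitespace(nbsp_separated))   # one token: NBSP is not a separator
print(nbsp_separated.split())                      # Python's Unicode-aware split sees two
```

A value separated by NO-BREAK SPACE is a single, likely invalid, token syntactically, even though a broader whitespace definition would split it, which is why mixing the two definitions in one glossary entry can misclassify attribute values.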
I agree with @dd8's analysis and solution: And we should definitely not mix these. |
I've compiled a Gist of normative definitions of whitespace from different specs: Of note:
|
@dd8 can you make a PR for this? |
A PR is blocked by these issues: |
There are some zero-width Unicode characters that don't have the Unicode White_Space property, but should be treated equivalently to whitespace when detecting empty text alternatives and labels:
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NO-BREAK SPACE
The Unicode chars with the White_Space property are:
0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
0020 ; White_Space # Zs SPACE
0085 ; White_Space # Cc <control-0085>
00A0 ; White_Space # Zs NO-BREAK SPACE
1680 ; White_Space # Zs OGHAM SPACE MARK
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
2028 ; White_Space # Zl LINE SEPARATOR
2029 ; White_Space # Zp PARAGRAPH SEPARATOR
202F ; White_Space # Zs NARROW NO-BREAK SPACE
205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
3000 ; White_Space # Zs IDEOGRAPHIC SPACE
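The two lists above can be combined in a short Python sketch. The ranges are copied from the White_Space listing and the extra characters from the zero-width list; the names `has_white_space_property` and `is_blank` are hypothetical, and treating the empty string as blank is an assumption of this sketch.

```python
# Code point ranges with the Unicode White_Space property, copied from the list above.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2028), (0x2029, 0x2029),
    (0x202F, 0x202F), (0x205F, 0x205F), (0x3000, 0x3000),
]

# The five zero-width characters from the issue, none of which has White_Space.
ZERO_WIDTH_CHARACTERS = "\u200b\u200c\u200d\u2060\ufeff"


def has_white_space_property(ch: str) -> bool:
    """True if the single character ch falls in one of the White_Space ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in WHITE_SPACE_RANGES)


def is_blank(text: str, extra: str = ZERO_WIDTH_CHARACTERS) -> bool:
    """True if text holds only White_Space or extra characters (empty text counts as blank)."""
    return all(has_white_space_property(ch) or ch in extra for ch in text)


total = sum(high - low + 1 for low, high in WHITE_SPACE_RANGES)
print(total)                                        # number of White_Space code points listed
print(is_blank(" \u00a0\u200b\ufeff"))              # whitespace mixed with zero-width characters
print(is_blank(" \u00a0\u200b\ufeff", extra=""))    # White_Space only: zero-width chars count as content
```

Passing `extra=""` gives the plain Unicode definition, so the same function can illustrate both sides of the proposal.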