Some space characters missing from glossary/whitespace.md #642

Open
dd8 opened this issue Jul 1, 2019 · 22 comments
Assignees
Labels
Blocked Blocked by another PR/Issue Definition

Comments

@dd8
Collaborator

dd8 commented Jul 1, 2019

There are some zero-width Unicode characters that don't have the Unicode White_Space property, but should be treated equivalently to whitespace when detecting empty text alternatives and labels:

U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NO-BREAK SPACE

The Unicode chars with the White_Space property are:

0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
0020 ; White_Space # Zs SPACE
0085 ; White_Space # Cc <control-0085>
00A0 ; White_Space # Zs NO-BREAK SPACE
1680 ; White_Space # Zs OGHAM SPACE MARK
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
2028 ; White_Space # Zl LINE SEPARATOR
2029 ; White_Space # Zp PARAGRAPH SEPARATOR
202F ; White_Space # Zs NARROW NO-BREAK SPACE
205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
3000 ; White_Space # Zs IDEOGRAPHIC SPACE
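
To make the gap concrete, here is a minimal Python sketch, assuming the White_Space ranges quoted above are the complete list. The names `WHITE_SPACE_RANGES`, `has_white_space_property` and `is_only_white_space` are made up for illustration and are not part of any rule or tool.

```python
# Inclusive code point ranges copied from the White_Space list quoted above.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]

# The five zero-width characters listed at the top of this issue.
ZERO_WIDTH = ["\u200B", "\u200C", "\u200D", "\u2060", "\uFEFF"]


def has_white_space_property(char: str) -> bool:
    """True if the single character falls in one of the ranges above."""
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in WHITE_SPACE_RANGES)


def is_only_white_space(text: str) -> bool:
    """True for the empty string or a string made only of White_Space characters."""
    return all(has_white_space_property(char) for char in text)


for char in ZERO_WIDTH:
    # None of the zero-width characters is caught by a White_Space-only check.
    print(f"U+{ord(char):04X}", has_white_space_property(char))
```

With this check, a name consisting of a single U+200B passes a "not only whitespace" test even though nothing is rendered.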

@Jym77
Collaborator

Jym77 commented Jul 10, 2019

I have mixed feelings about this…

As far as I can tell, the definition of whitespace is mostly used in cases like "the accessible name is not only whitespaces". And it of course makes sense to say that <img alt=" "> is not helping anybody…


On the one hand, I do agree that a "ZERO WIDTH SPACE" is kind of a space and having <img alt="[ZERO WIDTH SPACE]"> is not helping at all.


On the other hand, the "not only whitespaces" check is just a very basic test that there is actually an accessible name and not just a lazy developer/designer adding alt=" " all over the place. But this is an extremely basic test. <img alt="aaaaaa"> is not only whitespaces and is not really helping either.

We are clearly not checking whether the accessible name makes sense or not. This does require manual testing anyway. "Not only whitespace" is the basic minimum check we can do, but is extremely bare-bones. Manual testing (needed to catch <img alt="aaaaa">) will catch <img alt="[ZERO WIDTH SPACE]">, so I don't feel a strong urge to add the case.


Next, I feel that this is getting into corner cases. Why would somebody write <img alt="[ZERO WIDTH SPACE]">? I can see a lazy person being told to add alt to all images and doing <img alt=" "> all over the place just to meet some goal that has been poorly set. But why go the extra mile to put a ZERO WIDTH SPACE when using "aaaaa" is enough to pass the check?

I mean, I do know how to produce SPACE or NO-BREAK SPACE on my keyboard. I have no clue how to produce EM QUAD or ZERO WIDTH SPACE and if I want one, I would probably have to copy/paste it from some Unicode page… So, putting <img alt="[ZERO WIDTH SPACE]"> is requiring some actual work to specifically go and get that special character. Thus, I do not see this happening accidentally (while " " could be a typo for ""). And I am at a loss as to why somebody would do that intentionally.


Lastly, there is likely a reason why Unicode chose not to put the White_Space property on the ZERO WIDTH SPACE. I have not looked into it, and I don't know why they chose to do so. But I'm pretty sure that this is a deliberate choice on their side. And I am also pretty sure that Unicode is way more competent than I am (and probably than any of us is) in deciding what should or shouldn't be a whitespace.

I think it is way easier for us to rely on existing definitions that are going to be used by everybody rather than to come up with our own definitions.


And I notice that you've forgotten U+180E MONGOLIAN VOWEL SEPARATOR in your list. Which is kinda the reason why we should rely on Unicode property rather than try and build our own list…

(and what about characters like U+2423 OPEN BOX that are specifically designed to make space visible?)


So, to sum up, I think that this would only be useful in a handful of corner cases and is thus not worth the cost of using a custom definition rather than Unicode's definition.

@dd8
Collaborator Author

dd8 commented Jul 11, 2019

Next, I feel that this is getting into corner cases. Why would somebody write <img alt="[ZERO WIDTH SPACE]">?

I agree it's unlikely anyone would type this, but U+FEFF zero width no-break space is quite common because it also functions as the byte-order-mark.

Every file saved as 'Unicode' or 'UTF-8' by Windows Notepad starts with this character. Every file saved as 'UTF-16' by macOS TextEdit starts with this character. A blank file saved as Unicode by Notepad contains the single character U+FEFF zero width no-break space.

This can get onto web pages in many ways - for example by concatenating files or using server side includes:

<label><!--#include virtual="label.txt" --></label>
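
As a rough illustration of how a "blank" file can end up as a non-empty label, here is a small Python sketch. It uses Python's `utf-8-sig` codec as a stand-in for an editor that writes a byte-order mark; the variable names are hypothetical, and the include step is simulated by decoding the bytes as plain UTF-8.

```python
import codecs

# An empty document saved with a UTF-8 byte-order mark (BOM).
blank_file_bytes = "".encode("utf-8-sig")

# A consumer that treats the bytes as plain UTF-8 keeps the BOM as the
# character U+FEFF instead of discarding it.
included_text = blank_file_bytes.decode("utf-8")

# Trimming whitespace does not help: U+FEFF is not a whitespace character.
label_is_empty_after_trim = included_text.strip() == ""

print(blank_file_bytes == codecs.BOM_UTF8)  # the file holds only the BOM bytes
print(repr(included_text))                  # the label text is a single U+FEFF
print(label_is_empty_after_trim)            # the label is not "empty" after trimming
```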

@Jym77
Collaborator

Jym77 commented Jul 11, 2019

OK. I didn't know it was used in so many places… Then accidental usage may happen…


If we want to test these, I think I would prefer to keep whitespace as "Unicode white spaces" and add another glossary entry for "zero width characters" or something, and change the rules from "is not only whitespaces" to "is not only whitespaces or zero width characters".

This might look a bit more cumbersome but I think it is a good idea to have a definition of whitespace which is the same as the commonly accepted one (Unicode, currently).


We could make this easier to read by having a definition of "textual content", or something similar, which would be "not only whitespaces or zero width characters" and then the rules would be about having "an accessible name with textual content".


From an automation/maintainability/scalability point of view, it is still a bit annoying to have an explicit list that may evolve rather than relying on a property of the character. If Unicode decides to add or remove the White_Space property on some characters, any automated check using this will be automatically updated for the new list. If we decide to add or remove characters to the explicit list of "bad characters", the change needs to be cascaded to all automated tools and that requires some work and takes some time.

@dd8
Collaborator Author

dd8 commented Jul 11, 2019

A good compromise might be Unicode White_Space code points plus U+FEFF zero width no-break space. That's simple to maintain, and the exception is easy to explain.

It's worth noting there's very little consistency between W3C specs on the definition of whitespace - HTML has 2 different definitions (in a single spec), and CSS has 2 different definitions - see this issue:
w3c/accname#55

I've not checked exactly where whitespace.md is used, but referencing it needs care due to the inconsistent definitions of whitespace in various W3C specs. For example, it might seem reasonable to change the 'ID is unique' rule applicability from:

'Any id attribute which is not the empty string ("")'

to

'Any id attribute which is not whitespace.md'

but that wouldn't match the definition of whitespace used by the HTML spec for parsing id attributes.

@Jym77
Collaborator

Jym77 commented Jul 12, 2019

Hmmmm…


The difference between exactly empty string ("") and string with no content (" ") is important in some places. I'm specifically thinking about the alt attribute where alt="" marks things as decorative while alt=" " doesn't (I'm not sure how id is handled).

Which does point in the direction of flagging alt="[ZERO WIDTH SPACE]" as bad since this is not humanly different from alt="" but is going to be handled completely differently by AT and mess up things…


And digging a bit more into Unicode leads me to a solution which may make everybody (or at least both of us) happy:

Unicode has character categories, such as lowercase letters or uppercase letters. Some of these categories specifically contain characters that are non-printable:

The five characters you pointed out in this issue all belong to the Cf category. I am not totally sure about all characters in that one (I'm more sure about the Cc category), but maybe the good way to do it would be to prevent names that contain only whitespace, Cc and Cf characters.

(I don't know anything about Arabic and can't decide whether it makes sense to reject alt="[ARABIC FOOTNOTE MARKER]" which seems to have a glyph despite being a format character)

Doing it that way will still leave the burden of maintaining the characters list to Unicode while rejecting a bunch of "obviously" wrong things (if an accessible name is composed only of BELL, CANCEL and RIGHT-TO-LEFT MARK that seems to be a big problem…)


And since the current definition of whitespace (having the White_Space property) includes all of the Separator categories (Zs, Zl, Zp) plus some characters of the Cc category, we may want to streamline even further by forbidding names consisting only of characters in the "Cc, Cf, Zs, Zl, Zp" categories.
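
A category-based check along these lines could look like the following Python sketch. It relies on the standard `unicodedata.category` function, so the result depends on the Unicode version bundled with the interpreter; `REJECTED_CATEGORIES` and `has_visible_content` are hypothetical names, not part of any rule.

```python
import unicodedata

# General categories proposed above as "not enough on their own".
REJECTED_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}


def has_visible_content(name: str) -> bool:
    """True if at least one character is outside the rejected categories."""
    return any(
        unicodedata.category(char) not in REJECTED_CATEGORIES for char in name
    )


print(has_visible_content("\u200b"))           # ZERO WIDTH SPACE only
print(has_visible_content("\x07\x18\u200f"))   # BELL, CANCEL, RIGHT-TO-LEFT MARK
print(has_visible_content(":-)"))              # punctuation still counts as content
```

Note that this rejects every Cf character when it appears alone, including format characters that do have a glyph, which is the open question raised above about the Arabic footnote marker.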

@EmmaJP
Collaborator

EmmaJP commented Jul 12, 2019

It does seem like there may need to be a rethink in how 'whitespace' is used in rules and/or defined in the glossary. @audreymaniez at #597 suggests considering a more positive wording for rules, and includes a couple of relevant examples.

The thing most related rules seem to be checking is that something like an accessible name or text alternative is not made up of characters that are not meaningful on their own. This could be whitespace characters, or the control and format characters @Jym77 mentions. Are there any other Unicode character sets that could fall under the umbrella of 'not meaningful on their own'?

I agree with Jean-Yves that it would be good to use a widely recognised grouping(s) of characters, rather than create another.

@Jym77
Collaborator

Jym77 commented Jul 15, 2019

Are there any other Unicode character sets that could fall under the umbrella of 'not meaningful on their own'?

This is hard to say, depending on what we call "not meaningful on their own" :-D
Especially, current examples accept alt=":-)" where we can argue that punctuation symbols are not really meaningful on their own…

Unicode character property is a nice place to start looking at it.

But then, I'm also a bit afraid of getting into the hairy parts of Unicode. Typically, surrogate characters (category Cs) are not meaningful on their own but become meaningful 2 by 2 (and should always be used 2 by 2). Trying to write that into rules might become more work than needed. Similarly, I'm inclined to say that combining characters are not meaningful on their own, but I am far from sure that they cannot be combined with some other "not meaningful" character (another combining character, a whitespace, …) and produce something meaningful.

@dd8
Collaborator Author

dd8 commented Jul 15, 2019

I guess there are several things to consider:

  1. invisible characters (e.g. zero width space, BIDI formatting characters)
  2. whitespace characters (e.g. space, tab, non-breaking space)
  3. combining characters (https://en.wikipedia.org/wiki/Combining_character) which make no sense when used alone (because they modify the previous character)
  4. silent punctuation (e.g. commas and full stops are not usually voiced, but user can change settings to read these)

Unicode 12.1.0 has a section '4.12 Characters with Unusual Properties' which covers 1) and probably 3) as well:
https://www.unicode.org/versions/Unicode12.1.0/ch04.pdf

Would using the list in 4.12 along with characters with the White_Space property to cover 2) resolve the issue?

@dd8
Collaborator Author

dd8 commented Jul 15, 2019

Some of this may be difficult due to how Unicode is represented in the DOM:
https://mathiasbynens.be/notes/javascript-encoding#comment-5

This is an issue for any characters in the range U+10000 to U+10FFFF (some of which are listed in '4.12 Characters with Unusual Properties').

@Jym77
Collaborator

Jym77 commented Jul 16, 2019

You are right with the surrogates… (thanks UCS-2/UTF-16 for adding to the fun…)

There are "not meaningful by themselves" characters outside the BMP (e.g., language tag characters, …). Because they are outside the BMP, they will be represented by 2 surrogate characters in JavaScript and in DOMStrings, which essentially use UTF-16. Thus, they will not be caught by anything that allows surrogates, and we can't prevent surrogates as they can also be used in a meaningful way by themselves.
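
The surrogate problem can be reproduced outside the browser. The Python sketch below takes U+E0001 LANGUAGE TAG as an example of a character outside the BMP, and compares its category as a whole code point with the categories of its UTF-16 code units, which is roughly what a check working unit by unit on a DOMString would see. The variable names are illustrative only.

```python
import unicodedata

language_tag = "\U000E0001"  # U+E0001 LANGUAGE TAG, outside the BMP

# Seen as one code point, it is a format character.
whole_category = unicodedata.category(language_tag)

# Seen as UTF-16 code units, it is a pair of surrogates.
utf16 = language_tag.encode("utf-16-be")
code_units = [
    int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)
]
unit_categories = [unicodedata.category(chr(unit)) for unit in code_units]

print(whole_category)                        # category of the code point
print([f"{unit:04X}" for unit in code_units])  # the two UTF-16 code units
print(unit_categories)                       # category of each unit alone
```

A check that inspects code units one at a time only ever sees category Cs here, so it cannot tell this format character apart from any other character outside the BMP.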


However, let's not lose focus. Whatever set of characters we decide to be "not meaningful by themselves", we're still going to have a lot of accessible names that are tagged as OK but aren't. No definition of "bad" character will rule out etduneiesnau as an accessible name, and we do have a lot of passed examples using :-) as accessible name (thus, using only characters in your set "4. silent punctuation"). We're trying to improve on the "not only whitespaces" description.


I do not think that the list of characters in 4.12 is enough. It does not contain most stuff from the Cc category which are nonetheless obviously not meaningful by themselves.


I have mixed feelings about rejecting the combining characters. Specifically, I'm afraid that there exists a combination of one combining character plus a whitespace which would produce something that makes sense. Thus it would be bad to reject names consisting only of combining characters and whitespaces.


It seems that 4.12 is a subset of categories Cf and Mc (plus U+0020 SPACE). I think it would be easier to reject Cc, Cf, Zs, Zl, Zp, (Mc) rather than "cherry picking" the full 4.12 (as in easier to check).

@Jym77
Collaborator

Jym77 commented Jul 19, 2019

Something else to take into account…

  • The Core AAM uses the term "whitespace" (without any definition) in one place:

aria-roledescription [should not be mapped if it] is empty or whitespace characters

  • The SVG AAM uses the term "whitespace" (without any definition) in one place (twice):

SVG user agents MUST provide an accessible object in the accessibility tree for rendered SVG elements that meet any of the following criteria (…):

  • It has at least one direct child title element or desc element that is not empty after trimming whitespace. (…)
  • It has a non-empty (after trimming whitespace) aria-label attribute or aria-roledescription attribute.

Thus, there are places where it is probably important to use "whitespace" (whatever the AAMs mean by that…) and not something else.

@annethyme
Collaborator

annethyme commented Jul 25, 2019

We are primarily using the "whitespace" definition when talking about accessible names.
In the Accessible Name and Description Computation 1.1 it mentions:

Flat string
A string of characters where all carriage returns, newlines, tabs, and form-feeds are replaced with a single space, and multiple spaces are reduced to a single space. The string contains only character data; it does not contain any markup.
[..]
C. Otherwise, if computing a name, and if the current node has an aria-label attribute whose value is not the empty string, nor, when trimmed of white space, is not the empty string:

Here of course there is no definition for any of the terms "carriage returns", "newlines", "tabs", "form-feeds" or "white space".
That is what we tried to do with the whitespace definition.
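
For reference, one literal reading of the "flat string" definition quoted above can be sketched in Python as follows. This is an interpretation, not the normative algorithm, and `flat_string` is a hypothetical helper name.

```python
import re


def flat_string(text: str) -> str:
    """One literal reading of the 'flat string' definition quoted above."""
    # Replace each carriage return, newline, tab and form feed with a space...
    replaced = re.sub(r"[\r\n\t\f]", " ", text)
    # ...then reduce runs of U+0020 SPACE to a single space.
    return re.sub(r" {2,}", " ", replaced)


print(repr(flat_string("a\r\n\tb  c")))
print(repr(flat_string("a\u00a0\u00a0b")))  # no-break spaces are left untouched
```

Under this reading only four control characters and U+0020 are affected, so no-break spaces and zero-width characters pass through unchanged.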

@jeeyyy jeeyyy removed the Agenda item label Aug 7, 2019
@jeeyyy
Collaborator

jeeyyy commented Aug 7, 2019

@Jym77

I believe we discussed this in a previous call.
Can you please take ownership of this one and propose a way forward (if any), or close the issue otherwise.

Thanks.

@Jym77
Collaborator

Jym77 commented Aug 19, 2019

My current position is that the term "whitespace" is used in several places in the specs (as pointed out in my previous message, and Anne's message) and therefore we need to stick to the same definition that they use rather than come up with our own (i.e. close the issue). Which is of course easier said than done given that the specs that use "whitespace" do not really define it…

That is, if the specs say "aria-roledescription [should not be mapped if it] is empty or whitespace characters", then we cannot change the definition of whitespace to mean "whitespace or zero-width spaces" since that would lead to checks being inconsistent with the specs…

@dd8
Collaborator Author

dd8 commented Aug 20, 2019

There are ambiguous situations if you stick to the same definition as the specs without additional guidance. In this example which whitespace definition or definitions do you use?

<style>
th:last-child::after { content: "\00a0"; } /* non-breaking space */

/* escaped form feed - bypasses https://www.w3.org/TR/css-syntax-3/#input-preprocessing using escape */
td:first-child::before { content: "\000c"; } /* form feed */
td:last-child::after { content: "\00a0"; }  /* non-breaking space */
</style>

<table>
 <tr>
  <th>&#x2003;</th> <!-- EM space in HTML -->
  <th></th> <!-- non-breaking space inserted by CSS content: -->
 </tr> 
 <tr>
  <td></td> <!-- CSS content is form feed -->
  <td></td> <!-- CSS content is non-breaking space -->
 </tr> 
</table>
  1. For the first-child th above the accessible headers algorithm is very specific about what happens - the TH containing an EM space is removed from the headers list at step 4, because the 'empty cell' definition ignores cells containing Unicode White_Space:
    https://html.spec.whatwg.org/multipage/tables.html#header-and-data-cell-semantics

Note: the headers algorithm uses different whitespace definitions at different steps - ASCII whitespace for splitting headers at step 3, then Unicode White_Space to determine if the header cell should be ignored at step 4.

  2. It's much less clear what happens for the last-child th - there's no indication of how (if at all) the headers algorithm interacts with the AccName algorithm which does take CSS content: into account.

  3. For the td cells with CSS content: which whitespace definition do you use:

  4. Finally, the AccName calculation itself is inconsistent on whitespace handling - some steps ignore whitespace only strings, others don't:
    Inconsistent name calculations for input elements w3c/html-aam#231 (comment)

I think you may need 2 or 3 whitespace definitions to handle spec inconsistencies:

Edit: it might be clearer if the last 2 definitions use the phrase 'non-visible characters' rather than 'user-perceivable whitespace'

@dd8
Collaborator Author

dd8 commented Aug 23, 2019

More whitespace related issues - the normative definitions of 'empty' are all different for:

@dd8
Collaborator Author

dd8 commented Aug 28, 2019

There's a PR in place for the 'empty cells' definition used by the HTML table headers calculation - it's changing to use ASCII whitespace instead of Unicode Zs whitespace, and is now aligned with the th:empty definition in Selectors Level 4
whatwg/html#4860

The inconsistency between the empty-cells: property in CSS and the 'empty cells' definition is noted as a legacy feature by the CSS spec authors.
whatwg/html#4854 (comment)

@dd8
Collaborator Author

dd8 commented Aug 28, 2019

In terms of resolving this issue, I think there are 2 kinds of whitespace here:

  1. syntactic whitespace defined in normative specs (which is different in CSS, SVG and HTML)
  2. characters the user perceives as whitespace

At the moment the Whitespace glossary definition is used for both, which has caused a problem in the 'Autocomplete valid' rule (it's flagging invalid autocomplete values as inapplicable)

I'd propose doing the following:

  1. reference the appropriate normative definition for the exact syntax and type of whitespace needed for the rule (e.g. for 'Autocomplete valid' reference https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#space-separated-tokens).

  2. Add a warning to the Whitespace glossary definition saying it should not be used for syntactic whitespace (e.g. don't use for spaces between IDs in aria-labelledby, spaces between tokens in autocomplete and spaces between attributes in HTML tags)

The exact nature of characters the user perceives as whitespace is open for debate, but syntactic whitespace isn't since it has a normative definition. It's also easier to discuss the Whitespace glossary definition in user terms once syntax is excluded from the definition.
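
The difference between syntactic and user-perceived whitespace shows up as soon as a value is tokenised. The Python sketch below splits on ASCII whitespace as the HTML standard defines it (TAB, LF, FF, CR, SPACE) and compares that with Python's own `str.split()`, which also splits on other whitespace such as the no-break space. `split_on_ascii_whitespace` is a hypothetical helper, not the HTML parsing algorithm itself.

```python
import re
from typing import List

# ASCII whitespace per the HTML standard: TAB, LF, FF, CR and SPACE.
ASCII_WHITESPACE_RUN = re.compile(r"[\t\n\f\r ]+")


def split_on_ascii_whitespace(value: str) -> List[str]:
    """Split a space-separated token list, ignoring empty tokens."""
    return [token for token in ASCII_WHITESPACE_RUN.split(value) if token]


value = "given-name\u00a0family-name"  # separated by a NO-BREAK SPACE

print(split_on_ascii_whitespace(value))  # one token: U+00A0 is not a separator
print(value.split())                     # two tokens: str.split() is broader
```

A rule that reuses a broad glossary definition for token separators would therefore see two tokens where the syntactic definition sees one.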

@Jym77
Collaborator

Jym77 commented Aug 29, 2019

I agree with @dd8's analysis and solution:
there is (are) "whitespace" as per specs which are "syntactic" whitespace and can be different things depending on context (sometimes ASCII whitespace, sometimes more) and we should link that when needed, and there is our own definition of "whitespace" which is what we consider as an easy "no go" for names and can include other things than what the specs say.

And we should definitely not mix these.

@dd8
Collaborator Author

dd8 commented Aug 29, 2019

I've compiled a Gist of normative definitions of whitespace from different specs:
https://gist.github.com/dd8/8a8149c2ec7093dcf8caae6b9645ac0b

Of note:

  • there's a bug in the HTML 4.01 Recommendation that defines U+000C both as whitespace and an unused code point
  • there's very little agreement on non-ASCII whitespace between different specs

@jeeyyy
Collaborator

jeeyyy commented Oct 22, 2019

@dd8 can you make a PR for this?

@jeeyyy jeeyyy assigned dd8 and unassigned Jym77 Oct 22, 2019
@dd8
Collaborator Author

dd8 commented Oct 23, 2019

A PR is blocked by these issues:

w3c/accname#55
w3c/html-aam#238

@WilcoFiers WilcoFiers added the Blocked Blocked by another PR/Issue label May 11, 2021
6 participants