Clarify wording in spec for character groups #604

Closed
wooorm opened this issue Sep 10, 2019 · 9 comments

Comments

@wooorm
Contributor

wooorm commented Sep 10, 2019

Problem

The part of the spec that defines character groups uses inconsistent labels: there’s whitespace and Unicode whitespace, but ASCII punctuation and punctuation.

The rest of the spec uses additional groups, or potentially confusing names. For example, taking the leaf blocks:

  • thematic breaks can contain (not at the start) spaces and tabs but may not contain non-whitespace characters except for the marker, leaving room for non-space-or-tab whitespace?
  • ATX headings must have a space character after the opening #, but leading and trailing whitespace is stripped from the content, so #\talpha is invalid but # \talpha is fine?
  • Setext heading underlines can have trailing spaces, but not other trailing whitespace?
  • Link reference definition labels can include an unlimited number of line endings (see #586, “Newlines inside link labels?”)
  • Blank lines only include spaces or tabs, not other whitespace

Maybe there are very good reasons for these, but I feel that (a) for someone implementing the spec, it would help to streamline the names that are used, and (b) for a user, it would help to have fewer types of whitespace.

Solution

The Unicode groups (“Unicode whitespace” and “punctuation”) are only referenced for emphasis/importance. Maybe they can be moved down? That would make it clearer that whitespace / punctuation is about ASCII, as defined “above”.

Maybe it’s also a good idea to not include line endings in whitespace. There are several cases where “whitespace” is used, but line endings cannot occur (e.g., GH-586, although the rest of the link reference definition spec is very good at mentioning that one line ending is allowed).
That way the spec can be explicit about “space or tab characters”, “whitespace”, “whitespace or line endings”, etc.

If this is of interest, I can work on this!

@jgm
Member

jgm commented Sep 11, 2019

The Unicode groups (“Unicode whitespace” and “punctuation”) are only referenced for emphasis/importance. Maybe they can be moved down?

You mean move these to the Emphasis section? I'm not sure. Actually I think there's some advantage to having Unicode Whitespace defined in the same place as Whitespace; that helps you see that there is a distinction being made between them.

Maybe it’s also a good idea to not include line endings in whitespace.

Maybe so. But one would have to be very careful not to break anything in making this change. It might also be worth treating FormFeed specially, and not making it whitespace -- I don't remember if there was a solid reason for that.

#\talpha is invalid but # \talpha is fine?

Well, actually you can use a tab there. Tabs are covered by this passage:

Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.

Since this is a context where whitespace helps define block structure, the tab acts as if it were expanded to spaces. Now, I agree, it would be much better if the spec were much more explicit about tabs, throughout. The reason it isn't is historical: originally we assumed a preprocessed source in which tabs had already been expanded to spaces. #386 recommends an overhaul of the spec, being explicit about tabs instead of relying on this passage. If you're interested in doing that, it would be welcome.
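To make the tab rule quoted above concrete, here is a minimal sketch of how a tab advances the column with a tab stop of 4. The helper name `column_after` is made up for illustration and is not taken from the spec or from any implementation.

```python
TAB_STOP = 4  # tab stop named in the quoted spec passage


def column_after(line: str, index: int, tab_stop: int = TAB_STOP) -> int:
    """Return the virtual column reached after consuming line[:index].

    A tab advances to the next multiple of tab_stop; every other
    character advances by exactly one column.
    """
    column = 0
    for ch in line[:index]:
        if ch == "\t":
            column += tab_stop - (column % tab_stop)
        else:
            column += 1
    return column


# In "#\talpha" the "#" ends at column 1 and the tab then runs to column 4,
# so the tab behaves like three spaces after the opening "#".
heading = "#\talpha"
tab_columns = column_after(heading, 2) - column_after(heading, 1)
```

Under that reading, `#\talpha` has the same block structure as `#` followed by three spaces and `alpha`.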

Setext heading underlines can have trailing spaces, but not other trailing whitespace?

See above.

thematic breaks can contain (not at the start) spaces and tabs but may not contain non-whitespace characters except for the marker, leaving room for non-space-or-tab whitespace?

Agreed, that's a bit of a wart that could be cleaned up.

@wooorm
Contributor Author

wooorm commented Sep 11, 2019

You mean move these to the Emphasis section? I'm not sure.

Fine too! And how about updating the names, such as:

  • ASCII whitespace / Unicode whitespace; ASCII punctuation / Unicode punctuation
  • Whitespace / Unicode whitespace; punctuation / Unicode punctuation

not include line endings in whitespace

Maybe so. But one would have to be very careful not to break anything in making this change. It might also be worth treating FormFeed specially, and not making it whitespace -- I don't remember if there was a solid reason for that.

Is there a reason line tabulation and form feed are included in whitespace at all? If those weren’t there, it would be easier to disambiguate between “spaces or tabs”, “line endings”, or “whitespace” (being both).
Also: line tab isn’t part of Unicode whitespace, but form feed is 🤔
Having line tab and form feed in there also leaves the question whether they can indent things, like tabs.
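The mismatch between the two groups can be checked mechanically. The sketch below assumes the definitions as worded in spec version 0.29 (whitespace: space, tab, newline, line tabulation, form feed, carriage return; Unicode whitespace: the Zs category plus tab, carriage return, newline, form feed). The constant and function names are hypothetical.

```python
import unicodedata

# "Whitespace character", assuming the 0.29 wording of the spec.
WHITESPACE = {"\u0020", "\u0009", "\u000A", "\u000B", "\u000C", "\u000D"}

# Code points that "Unicode whitespace character" lists in addition to
# the Zs general category (again assuming the 0.29 wording).
UNICODE_WHITESPACE_EXTRAS = {"\u0009", "\u000A", "\u000C", "\u000D"}


def is_whitespace(ch: str) -> bool:
    return ch in WHITESPACE


def is_unicode_whitespace(ch: str) -> bool:
    return unicodedata.category(ch) == "Zs" or ch in UNICODE_WHITESPACE_EXTRAS


# Characters that are "whitespace" but not "Unicode whitespace".
only_plain = sorted(ch for ch in WHITESPACE if not is_unicode_whitespace(ch))
```

With these assumed definitions, `only_plain` contains just line tabulation (U+000B), which is the inconsistency noted above.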

Tabs are covered by this passage:

Right, I suspected that, but because other places explicitly name the tab, it leaves room to wonder what should happen in places where it isn’t named. And line tab and form feed make this more confusing.


Thanks for the context!

@jgm
Member

jgm commented Sep 12, 2019

Is there a reason line tabulation and form feed are included in whitespace at all?

I've been trying to remember that. None I can think of at the moment. I'd be inclined to eject them.

But this would make "ASCII whitespace" problematic, since one might assume this label to apply to all ASCII whitespace characters.

Also: line tab isn’t part of Unicode whitespace, but form feed is

Not sure why. That seems irrational.

Anyway, I think a general cleanup in the area would make sense, but it should include the space/tab issue noted above.

@wooorm
Contributor Author

wooorm commented Sep 12, 2019

To summarise, are we agreeing on:

  • Use the names whitespace / Unicode whitespace; punctuation / Unicode punctuation
  • Whitespace would be spaces, tabs, carriage return, line feed
  • We revisit every place that space / tab / whitespace is used and carefully decide what can be used: space characters, space characters or tab characters, or whitespace characters (potentially including up to X line endings)

I can work on it.


Anyway, I think a general cleanup in the area would make sense, but it should include the space/tab issue noted above.

Are you talking about solving GH-386 together with the above summary, or..?


I’d also like to suggest using a separate word for “space” if it’s about the expanded size. E.g., take an ATX heading (“The opening # character may be indented 0-3 spaces”) in combination with a block quote marker (“the character > together with a following space”). In the case of >\t# Alpha, 1 “space” of the tab is used for the block quote, and three for the heading.
Perhaps (a) space size / space or (b) space / space character?

@jgm
Member

jgm commented Sep 12, 2019

I'm not agreed on "ASCII punctuation" -> "punctuation." I think specifying ASCII is important; too many people will misunderstand if it's not explicit there.

As for "whitespace", it's not the best word, but "ASCII whitespace" also seems wrong if we're excluding e.g. FF. I'm not sure it's bad just to take it as a technical term.

carefully decide what can be used

I think in most cases this is already decided; the work would mainly be replacing talk of spaces with talk of spaces or tabs, and this could get difficult or confusing in cases where only part of a tab may be used (list indentation for example, or #386).

@Crissov
Contributor

Crissov commented Sep 13, 2019

“basic whitespace” := whitespace characters in US-ASCII / ISO 646 / C0 / Basic Latin block? They are all in Unicode and in almost all other encodings.
“whitespace” := every character with certain Unicode properties.

@wooorm
Contributor Author

wooorm commented Oct 1, 2019

Rethinking this, I now prefer to be explicit: ASCII punctuation and Unicode punctuation.

For whitespace, if we have Unicode whitespace and line endings, and are dropping line tab and form feed from whitespace, that only leaves spaces and tabs. In that case we can also be explicit everywhere about whether just spaces, spaces and tabs, or even line endings are allowed. This would resolve the whitespace issue.
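As a rough illustration of the grouping proposed here, the explicit classes could be written as separate predicates. This is only a sketch of the proposal in this comment, not the wording that ended up in the spec. All names are hypothetical, and the Unicode punctuation check assumes the "P" general categories.

```python
import string
import unicodedata

# The 32 ASCII punctuation characters, as a set so that only single
# characters can match.
ASCII_PUNCTUATION = frozenset(string.punctuation)


def is_space_or_tab(ch: str) -> bool:
    return ch in (" ", "\t")


def is_line_ending_char(ch: str) -> bool:
    # Characters that make up a line ending (LF, CR, or CR followed by LF).
    return ch in ("\n", "\r")


def is_ascii_punctuation(ch: str) -> bool:
    return ch in ASCII_PUNCTUATION


def is_unicode_punctuation(ch: str) -> bool:
    # Assumption: ASCII punctuation plus any code point whose Unicode
    # general category starts with "P" (Pc, Pd, Pe, Pf, Pi, Po, Ps).
    return is_ascii_punctuation(ch) or unicodedata.category(ch).startswith("P")
```

With the groups split this way, form feed and line tabulation fall into none of the ASCII classes, so each rule has to name exactly which of "space", "space or tab", or "line ending" it accepts.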

@jgm
Member

jgm commented Oct 1, 2019

Sounds reasonable to me.

@wooorm
Contributor Author

wooorm commented May 23, 2020

Closed by GH-618.

@wooorm wooorm closed this as completed May 23, 2020