Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[spec/interpreter/test] Align definition of newline with Unicode reco… #1684

Merged
merged 7 commits into from
Oct 25, 2023

Conversation

rossberg
Copy link
Member

So far, the Wasm text format only recognises LF as a newline. This PR tweaks that definition to align with Section 5.8 of the Unicode standard, which recommends treating CR, LF, CRLF, and NEL as equivalent newlines in text files, in order to be agnostic to newline translations performed by various tools and conversions between OSes and character sets.

This only affects the interpretation of line comments, where previously, code like

;; comment<CR>
(instr)

would (surprisingly) treat the instruction as part of the comment, but no longer does now. While theoretically a breaking change, it shouldn't affect any well-formatted code. It hence seems okay to consider this a bug fix.

@sunfishcode
Copy link
Member

I propose not recognizing NEL (U+85) as a newline character. It is not recognized as a newline in popular text formats, tools, or editors, so it would be inconvenient to work with if files using NEL newlines were in circulation. An incomplete list of popular text formats that don't recognize NEL:

@rossberg
Copy link
Member Author

Yeah, I'm not feeling strongly. I'd assume there is value in following the official Unicode recommendation, even though NEL hardly matters in practice. I believe the motivation for that recommendation is maximising compatibility with legacy character sets, but I'd be fine limiting that concern to ASCII.

That said, there are various other codepoints in Unicode that represent forms of line breaks, and there likely is no consistency in how editors and tools treat those. So it's doubtful if that form of alignment is even possible, independent of NEL.

@sunfishcode
Copy link
Member

Yes, the situations with plain-CR, LS, and PS are complex, and there's no way to avoid some problems no matter what we do here. But, we can still avoid other problems, which I propose we do here.

@tlively
Copy link
Member

tlively commented Oct 16, 2023

Do NEL characters ever appear in practice? If not, following unicode would not cause any additional problems in practice and we would get the benefits of just doing what unicode recommends (at least for this particular sliver of the design space).

@sunfishcode
Copy link
Member

sunfishcode commented Oct 24, 2023

Do NEL characters ever appear in practice? If not, following unicode would not cause any additional problems in practice and we would get the benefits of just doing what unicode recommends (at least for this particular sliver of the design space).

I'm not aware of any practical benefits of following this particular Unicode recommendation here. NEL's only known use in practice is in text converted from EBCDIC. My understanding is that Unicode added the recommendation about NEL for the benefit of interoperability with EBCDIC documents. I don't believe this is relevant to Wasm.

Additional examples of things that don't interpret U+85 (NEL) as a newline include JS, CSS, HTML, Python, and many more.

@tlively
Copy link
Member

tlively commented Oct 25, 2023

I do think that being able to offload all decision making in this area to unicode is an advantage, but I'll admit it's a small one. I guess being consistent with so many existing text formats is probably a larger advantage, so not treating NEL as a newline sounds fine to me.

@rossberg
Copy link
Member Author

Okay, I don't care enough, so I just removed NEL. PTAL.

\unicode{09} ~|~ \unicode{0A} ~|~ \unicode{0D} \\
\Tnewline ~|~ \unicode{09} \\
\production{newline} & \Tnewline &::=&
\unicode{0A} ~|~ \unicode{0D} ~|~ \unicode{0D}~\unicode{0A} \\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we got rid of the last case here, then it would parse as two newlines in a row, which I don't think would break anything. Is it worth making that simplification, or is it better to keep that case for consistency with the standard definition of newline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, currently that should not observably change anything (though an implementation taking that literally would report incorrect line numbers). But yeah, I think it is preferable to use the standard definition for clarity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting and unfortunate that git and GitHub treat this as a binary file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume their heuristic is based on the appearance of 0x00 and other control characters.

@rossberg rossberg merged commit 43d405f into main Oct 25, 2023
5 checks passed
@rossberg rossberg deleted the newline branch October 25, 2023 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants