Description
This is currently under implementation: implementation issue, feature specification.
Solution to #1.
To make long number literals more readable, allow authors to inject digit group separators inside numbers.
Examples with different possible separators:
100 000 000 000 000 000 000 // space
100,000,000,000,000,000,000 // comma
100.000.000.000.000.000.000 // period
100'000'000'000'000'000'000 // apostrophe (C++)
100_000_000_000_000_000_000 // underscore (many programming languages).
The syntax must work even with just a single separator, so the separator can't be anything that can already validly separate two expressions (which excludes all infix operators and the comma), and it should not already be part of a number literal (which excludes the decimal point).
So, the comma and decimal point are probably never going to work, even if they are already the standard "thousands separator" in text in different parts of the world.
Space separation is dangerous because it's hard to see whether it's just a space or an accidental tab character. If we allow spacing, should we allow arbitrary whitespace, including line terminators? If so, then this suddenly becomes quite dangerous. Forget a comma at the end of a line in a multiline list, and two adjacent integers are automatically combined (we already have that problem with strings). So, probably not a good choice, even if it is the preferred formatting for print text.
The apostrophe is also the string single-quote character. We don't currently allow adjacent numbers and strings, but if we ever do, then this syntax becomes ambiguous. It's still possible (we disambiguate by assuming it's a digit separator). It is currently used by C++14 as a digit group separator, so it is definitely possible.
That leaves underscore, which could be the start of an identifier. Currently 100_000 would be tokenized as "integer literal 100" followed by "identifier _000". However, users would never write an identifier adjacent to another token that contains identifier-valid characters (unlike strings, which have clear delimiters that do not occur anywhere else), so this is unlikely to happen in practice. Underscore is already used by a large number of programming languages including Java, Swift, and Python.
We also want to allow multiple separators for higher-level grouping, e.g.:
100__000_000_000__000_000_000
For this purpose, the underscore extends gracefully. So does space, but it has the disadvantage that it collapses when inserted into HTML, whereas '' looks odd.
For ease of reading and ease of parsing, we should only allow a digit separator that actually separates digits - it must occur between two digits of the number, not at the end or beginning, and if used in double literals, not adjacent to the . or e{+,-,} characters, or next to an x in a hexadecimal literal.
Examples
100__000_000__000_000__000_000 // one hundred million million millions!
0x4000_0000_0000_0000
0.000_000_000_01
0x00_14_22_01_23_45 // MAC address
555_123_4567 // US Phone number
Invalid literals:
100_
0x_00_14_22_01_23_45
0._000_000_000_1
100_.1
1.2e_3
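The "separator only between digits" rule can be sketched as a pair of regular expressions. This is only an illustration of the rule, not the proposed implementation: the function name isValidNumberLiteral and the pattern names are hypothetical, and the patterns assume the literal has already been isolated from the surrounding source.

```dart
// A digit group: one or more digits, optionally followed by runs of
// underscores that are each followed by more digits. A separator can
// therefore only ever sit between two digits.
const _digits = r'\d+(?:_+\d+)*';
const _hexDigits = r'[0-9a-fA-F]+(?:_+[0-9a-fA-F]+)*';

// Decimal integer or double literal: digits, optional fraction, optional
// exponent. No separator is allowed next to '.', 'e', or the sign.
final _decimalLiteral = RegExp(
    '^(?:$_digits(?:\\.$_digits)?|\\.$_digits)(?:[eE][+-]?$_digits)?\$');

// Hexadecimal literal: no separator is allowed directly after the 'x'.
final _hexLiteral = RegExp('^0[xX]$_hexDigits\$');

/// Hypothetical helper: returns true if [source] follows the
/// "separator only between digits" rule described above.
bool isValidNumberLiteral(String source) =>
    _hexLiteral.hasMatch(source) || _decimalLiteral.hasMatch(source);
```

Under these assumptions, every literal in the Examples list is accepted and every entry in the invalid list is rejected.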
An identifier like _100 is a valid identifier, and _100._100 is a valid member access. If users learn the "separator only between digits" rule quickly, this will likely not be an issue.
Implementation issues
Should be trivial to implement at the parsing level. The only issue is that a parser might need to copy the digits (without the separators) before calling a parse function, where currently it might get away with pointing a native parse function directly at its input bytes.
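A minimal sketch of the copying step, assuming the scanner has already validated the literal and knows its start and end offsets in the source. The function name parseDecimalIntLiteral and its parameters are hypothetical and do not correspond to any existing front-end API.

```dart
/// Hypothetical scanner helper: parses the decimal integer literal found at
/// source[start..end), skipping '_' separators. Assumes the range has
/// already been validated as a well-formed literal.
int parseDecimalIntLiteral(String source, int start, int end) {
  const underscore = 0x5F; // code unit of '_'
  final digits = StringBuffer();
  for (var i = start; i < end; i++) {
    final unit = source.codeUnitAt(i);
    // Copy only the digits; separators carry no numeric meaning.
    if (unit != underscore) digits.writeCharCode(unit);
  }
  return int.parse(digits.toString());
}
```

For example, calling it on the source text `x = 1_000_000;` with offsets 4 and 13 yields 1000000. The extra buffer is the cost mentioned above: without separators, the parse function could read the source range directly.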
This should have no effect after the parsing.
Style guides might introduce a preference for digit grouping (say, numbers with more than six digits should use separators) so a formatter or linter may want access to the actual source as well as the numerical value. The front end should make this available for source processing tools.
Library issues
Should int.parse/double.parse accept inputs with underscores? I think it's fine to not accept such input. It is not generated by int.toString(), and if a user has a string containing such an input, they can remove underscores manually before calling int.parse. That is not an option for source code literals.
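Removing the underscores manually could look like the sketch below. The helper name parseIntWithSeparators is hypothetical; it is deliberately lenient and does not enforce the "separator only between digits" rule, so a string such as "_100_" would also be accepted.

```dart
/// Hypothetical user-side helper: strips every '_' and then defers to
/// int.tryParse, which returns null for input that is not a valid integer.
int? parseIntWithSeparators(String input) =>
    int.tryParse(input.replaceAll('_', ''));
```

This keeps int.parse itself unchanged, so callers who never see underscores pay nothing for the feature.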
I'd prefer to keep int.parse as efficient as possible, which means not adding a special case in the inner loop.
In JavaScript, parsing uses the built-in parseInt or Number functions, which do not accept underscores, so accepting them would add (another) overhead for code compiled to JavaScript.
Related work
Java digit separators.