New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix Issue 10001 - std.format insert underscores into numbers #5303
Conversation
This is a known enhancement request: https://issues.dlang.org/show_bug.cgi?id=10001 |
@burner: have a look at https://github.com/dlang-bots/dlang-bot for letting the bot and changelog know about the issue |
std/format.d
Outdated
@@ -958,6 +994,12 @@ if (is(Unqual!Char == Char)) | |||
int precision = UNSPECIFIED; | |||
|
|||
/** | |||
Breaks. Its value defines how many digits are printed between | |||
underscors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: underscores.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this. The only thing I'm not quite comfortable with is the Underscore specifier being inserted between the width and the precision -- IMO width and precision should be kept together, and _
inserted after that.
Also, the change to the grammar docs needs to be complete -- the Underscore
production needs to be linked to the rest of the grammar somehow.
std/format.d
Outdated
$(I empty) | ||
$(B '_') | ||
$(B '_') $(I Integer) | ||
$(B '_*') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This grammar change is incomplete. Where does Underscore
fit in with the rest of the grammar?
And on that note, I see in your unittest examples that you inserted it between the width and the precision (i.e., %5_4.5d
). Personally, I don't like this. I think a better ordering would be to keep the width and precision together, and tack on the _
afterwards, like this: %5.5_4d
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, ddoc is being stupid and rendering the _
as blanks. You have to write it as __
(two underscores) instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The given grammar is just a suggestion. The parser actually allows you to give multiple order of width, precision and underscore. To fix that fact, I added another assert at the last unittest.
Thanks for the underscore tip.
Also, this is a side note and not really the problem of this PR: the format string grammar in the docs doesn't match the code! For example, this compiles and runs with no error and prints
And this prints
Yet according to the grammar, these ought to be malformed format strings. I think this code needs a cleanup (not necessarily in this PR, though). |
12479a9
to
e0d8c00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing: right now the unittests and examples only show using _
with integral types. What does it do when you have %s
? Does it do something sane? Should it be allowed?
Nice idea. I'll bring up a related closed PR from the past: #3377 |
I think at the very least, one should be able to format with non-underscores, e.g. |
Hmm. This makes me wonder if we're going down a slippery slope of adding Though I can see (re)designing the specifier syntax so that it's possible to specify which character to use as separator. Perhaps |
There would be no nice way to use alphanumeric separators, though I can't see any reason why anybody would need that ( |
I think since this is for human readability benefit, if we are going to support adding separators, we should at least support what humans are used to reading. I don't think we need to support arbitrary characters. I think it would be possible to define both |
I'm willing to add a single user defined character version. Where a char is from ascii without digits. No dchar stuff. IMO everything is else is just gone be overkill. |
Using locale-dependent interpretation of I agree that allowing arbitrary separators would be unnecessary. I also recommend using |
As for dots vs. commas, in Russian, for example, the decimal mark is a comma and the grouping separator is the dot. So you'd write 1.000.000,50 instead of 1,000,000.50. FYI: https://en.wikipedia.org/wiki/Decimal_mark alludes to a proposed standard to use spaces as separator instead of dots or commas. |
P.S., I seem to have gotten Russian and Spanish mixed up. Russian apparently uses spaces for grouping, whereas Spanish uses dots for grouping, and both use commas for decimal. |
1e703ed
to
3ad6c74
Compare
you can't have '.' in the string as they donate the start of the precision flag. And we can not enforce any order of width, precision, and underscore as no order was every enforced. |
And you can't have space either, because it already means something else in the format string. :-( Looks like we may just have to resort to a separate number-formatting function. Also, when used with non-digit strings, |
As for no order ever being enforced, that was never publicly stated in the documentation. It also allows for nonsensical specs like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something is screwy with that assert on line 2257. It's failing on stuff it shouldn't fail on.
Also, the code seems somewhat crude, as there are loops that really aren't necessary and could be replaced by a single expression. Or by standard stuff from std.algorithm
, which ought to reduce the likelihood of bugs and missing corner cases, as well as make the code a bit more concise and clearer to read.
std/format.d
Outdated
} | ||
|
||
assert(firstDigit != size_t.max); | ||
assert(dot != size_t.max); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right:
writefln("%.0f", 3.14);
prints 3
, but:
writefln("%,.0f", 3.14);
causes this assert to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, here's another failing case that doesn't involve zero precision:
writefln("%3g", 1_000_000.123456); // prints "1e+06"
writefln("%3,g", 1_000_000.123456); // assert fails
std/format.d
Outdated
{ | ||
++separatorScoreCnt; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoa, is this loop necessary?? Couldn't you just simplify this to a simple divide-and-round? Namely:
size_t separatorScoreCnt = (firstLen == 0) ? 0 :
(firstLen - 1) / fs.separators;
std/format.d
Outdated
{ | ||
++separatorScoreCnt; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this one too. I didn't work this one through, but surely this is computable via a suitable integer division rather than a loop!
std/format.d
Outdated
// plus, minus prefix | ||
for (j = 0; j < separatorScoreCnt && buf[j] == ' '; ++j) | ||
{ | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this be simplified to some combination of find
(which is already imported by this function)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really sure how to use find here, but I'm open to ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Apparently find
isn't quite what we need here. How about:
import std.string : indexOf;
auto j = buf[0 .. separatorScoreCnt].indexOf(' ');
? Would that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was a little different but basically yes, thanks
std/format.d
Outdated
} | ||
put(w, '.'); | ||
|
||
// digists after dot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: spelling: digits
07222d4
to
57a6485
Compare
std/format.d
Outdated
{ | ||
++separatorScoreCnt; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this loop basically equivalent to:
ptrdiff_t mantissaLen = afterDotLen - (dot + 1);
separatorScoreCnt = (mantissaLen > 0) ? (mantissaLen - 1) / fs.separators : 0;
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P.S. afterDotLen
seems a bit misleading, because it's actually an index to the end of the mantissa. So perhaps a better name might be afterDotIdx
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
57a6485
to
cf584f0
Compare
Thanks! ping @schveiguy any other comments before we merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the typo, gotta be honest, I don't know how the code works. So I'll assume from the unit tests and from @quickfur's thorough review that it works correctly :) Just fix the typo, and this is good to go.
std/format.d
Outdated
@@ -215,6 +223,18 @@ $(I FormatChar): | |||
preceding the actual argument, is taken as the precision. | |||
If it is negative, it is as if there was no $(I Precision) specifier.) | |||
|
|||
$(DT $(I Separator)) | |||
$(DD Inserts the seperator symbols ',' every $(I X) digits, from right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seperator -> separator
ping @burner Let's fix that typo, clean up the git commit log, and let's merge this already. |
// rest | ||
if (ePos != -1) | ||
{ | ||
put(w, buf[afterDotIdx .. len]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the autotester is reporting a range violation on this line.
std/format.d
Outdated
if (fs.flSeparator && dot != -1) | ||
{ | ||
ptrdiff_t firstDigit = buf.indexOfAny("0123456789"); | ||
ptrdiff_t ePos = buf.indexOf('e'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found the problem: buf
is a static array of length 512, so the above searches of buf
will actually search all 512 bytes, since in D, we do not stop upon encountering a null terminator. So lines 2236, 2239, and 2240 should use buf[0 .. len].indexOf(...)
rather than buf.indexOf(...)
.
If e
is not found in the formatted string, for example, the above indexOf
call will search past the end of the formatted string. If you're unlucky enough that previous data included an e
(note that buf
is void-initialized -- yet another reason why auto-zeroing local variables is a good idea), then ePos
will be an invalid index. So later on when you try to slice buf[ePos .. len]
it will throw a RangeError because ePos > len
.
cf584f0
to
37346b9
Compare
@quickfur back from a holiday, applied your fix, fixed the typo. Lets see what the autotester says. |
37346b9
to
4ed58df
Compare
Yay, autotester passed! |
I'm not understanding the coverage report, though. Why is the entire block from l.2238 to l.2297 marked in red? Even though we clearly cover this code in the unittests (which is why it was failing before when we had Or is that because the local unittests didn't cover it, but it just happened to be covered by a use case in another Phobos module? |
std/format.d
Outdated
} | ||
else | ||
{ | ||
// "." was specified, but nothing after it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nitpick: misleading comment, should be "," not ".".
Also, please simplify the commit message. There's no need to list all the previous commit messages now that it's all squashed. Nobody reading the commit log would know (or care) what "rewrite", "minor fix", "typo deleted old code" refer to, for example. It's good enough to just say that this (squashed) commit implements feature XYZ and leave it at that. |
4ed58df
to
7ddc85d
Compare
not sure what is happening with the coverage. I manually checked and every line is hit, except the throw statements. |
No problem, I was just wondering. Anyway, I think that's good enough already. Let's merge!! |
@quickfur, @schveiguy thank you |
Just found some missed cases this morning: https://issues.dlang.org/show_bug.cgi?id=17459 |
I'll have a look |
Big numbers are hard to read. D allows to insert underscores into numbers to aid readability.
This PR gives std.format the ability to insert underscores into numbers.
The numbers of digits between underscores can be specified similar to the way a precision is given.
Todo: