fix Issue 10001 - std.format insert underscores into numbers #5303

burner · 2017-03-23T16:40:13Z

Big numbers are hard to read. D allows underscores to be inserted into numeric literals to aid readability.
This PR gives std.format the ability to insert underscores into numbers.

string tmp = format("%_d", 1234567);
assert(tmp == "1_234_567");

tmp = format("%_4f", 1234567.891011);
assert(tmp == "123_4567.8910_11");

The number of digits between underscores can be specified in much the same way as a precision.
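For reference, here is a minimal sketch using the comma-based flag that this PR ended up with after review (see the final commit message further down). It assumes a Phobos release that includes this change, and it sticks to integers, since the handling of floating-point output was still being worked out in the review.

```d
import std.format : format;

void main()
{
    // Default grouping: a separator every three digits, counted from the right.
    assert(format("%,d", 1234567) == "1,234,567");

    // An integer after the comma sets the group size, much like a precision.
    assert(format("%,4d", 1234567) == "123,4567");
}
```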

Todo:

Complete documentation
squash commits
parse scientific float representation and insert break chars

quickfur · 2017-03-23T20:01:21Z

This is a known enhancement request: https://issues.dlang.org/show_bug.cgi?id=10001

wilzbach · 2017-03-23T20:06:14Z

@burner: have a look at https://github.com/dlang-bots/dlang-bot for letting the bot and changelog know about the issue

quickfur · 2017-03-23T20:08:19Z

std/format.d

@@ -958,6 +994,12 @@ if (is(Unqual!Char == Char))
    int precision = UNSPECIFIED;

    /**
+       Breaks. Its value defines how many digits are printed between
+       underscors.


Typo: underscores.

quickfur

I like this. The only thing I'm not quite comfortable with is the Underscore specifier being inserted between the width and the precision -- IMO width and precision should be kept together, and _ inserted after that.

Also, the change to the grammar docs needs to be complete -- the Underscore production needs to be linked to the rest of the grammar somehow.

quickfur · 2017-03-23T20:18:22Z

std/format.d

+    $(I empty)
+    $(B '_')
+    $(B '_') $(I Integer)
+    $(B '_*')


This grammar change is incomplete. Where does Underscore fit in with the rest of the grammar?

And on that note, I see in your unittest examples that you inserted it between the width and the precision (i.e., %5_4.5d). Personally, I don't like this. I think a better ordering would be to keep the width and precision together, and tack on the _ afterwards, like this: %5.5_4d.

Also, ddoc is being stupid and rendering the _ as blanks. You have to write it as __ (two underscores) instead.

The given grammar is just a suggestion. The parser actually accepts width, precision, and underscore in any order. To pin that down, I added another assert to the last unittest.

Thanks for the underscore tip.

quickfur · 2017-03-23T20:39:12Z

Also, this is a side note and not really the problem of this PR: the format string grammar in the docs doesn't match the code! For example, this compiles and runs with no error and prints SRSLY?!:

import std.stdio;
void main() {
        writefln("%+#+-#+-#s", "SRSLY?!");
}

And this prints WAT:

import std.stdio;
void main() {
        writefln("%+#-.1+0#-.2+#0#-.4#-+s", "WAT");
}

Yet according to the grammar, these ought to be malformed format strings.

I think this code needs a cleanup (not necessarily in this PR, though).

quickfur

One more thing: right now the unittests and examples only show using _ with integral types. What does it do when you have %s? Does it do something sane? Should it be allowed?

schveiguy · 2017-03-24T19:18:24Z

Nice idea. I'll bring up a related closed PR from the past: #3377

schveiguy · 2017-03-24T19:22:05Z

I think at the very least, one should be able to format with non-underscores, e.g. 1,000,000.

quickfur · 2017-03-24T19:53:37Z

Hmm. This makes me wonder if we're going down a slippery slope of adding _ for 1_000_000, then , for 1,000,000, then . for certain locales where they write 1.000.000... where will it end?

Though I can see (re)designing the specifier syntax so that it's possible to specify which character to use as separator. Perhaps %_3,s for the comma variant... but the trouble is that with the current implementation, which (IMO wrongly) permits any order of flags including repeats (contrary to the stated grammar), you'd have an ambiguity if you wanted to use . as separator (e.g., %_3.0s: does that mean _3. and 0, or _3 and .0?).

quickfur · 2017-03-24T19:55:38Z

There would be no nice way to use alphanumeric separators, though I can't see any reason why anybody would need that (1a000a000 seems like a really silly thing to do... if you really wanted that you could just write your own formatting function for it).

schveiguy · 2017-03-25T20:06:46Z

> This makes me wonder if we're going down a slippery slope

I think since this is for human readability benefit, if we are going to support adding separators, we should at least support what humans are used to reading.

I don't think we need to support arbitrary characters. I think it would be possible to define both _ and , specifiers, which would separate with the respective characters. Using . is problematic, since you have the decimal place specifier. However, in locales that use . as a separator, they wouldn't be using it for a decimal place as well, so you would need a specialized formatter. There's also the possibility that using , means to use whatever the user's locale defines.

burner · 2017-03-26T12:08:12Z

I'm willing to add a version with a single user-defined character, where the char is any ASCII character except a digit. No dchar stuff. IMO everything else is just going to be overkill.
The grammar would be something like _[^[:digit:]]

quickfur · 2017-03-27T05:15:37Z

Using locale-dependent interpretation of , would be best, but that requires being able to query the current locale from the OS and being able to interpret it sufficiently to tell how to render ,. And Phobos currently doesn't have the facility to query locale information.

I agree that allowing arbitrary separators would be unnecessary.

I also recommend using , by default and _ optionally, since we really should be catering to default human-readable conventions like 1,000,000 instead of D-programmer-specific 1_000_000.

quickfur · 2017-03-27T05:21:26Z

As for dots vs. commas, in Russian, for example, the decimal mark is a comma and the grouping separator is the dot. So you'd write 1.000.000,50 instead of 1,000,000.50.

FYI: https://en.wikipedia.org/wiki/Decimal_mark alludes to a proposed standard to use spaces as separator instead of dots or commas.

quickfur · 2017-03-27T05:23:51Z

P.S., I seem to have gotten Russian and Spanish mixed up. Russian apparently uses spaces for grouping, whereas Spanish uses dots for grouping, and both use commas for decimal.

burner · 2017-03-27T11:44:50Z

You can't have '.' in the string, as it denotes the start of the precision flag. And we cannot enforce any order of width, precision, and underscore, as no order was ever enforced.

quickfur · 2017-03-27T17:31:47Z

And you can't have space either, because it already means something else in the format string. :-( Looks like we may just have to resort to a separate number-formatting function.

Also, when used with non-digit strings, %_s seems kinda pointless. Maybe it's better to implement this in a separate number-formatting function?

quickfur · 2017-03-27T17:34:06Z

As for no order ever being enforced, that was never publicly stated in the documentation. It also allows for nonsensical specs like %1.1.1.1.1+-1.1s, which should not be allowed IMO. I think it should be OK to enforce a specific order. The current implementation is a mess. I think it's very bad that the documentation (grammar) and code don't match right now.

quickfur

Something is screwy with that assert on line 2257. It's failing on stuff it shouldn't fail on.

Also, the code seems somewhat crude, as there are loops that really aren't necessary and could be replaced by a single expression. Or by standard stuff from std.algorithm, which ought to reduce the likelihood of bugs and missing corner cases, as well as make the code a bit more concise and clearer to read.

quickfur · 2017-04-05T18:35:49Z

std/format.d

+        }
+
+        assert(firstDigit != size_t.max);
+        assert(dot != size_t.max);


This doesn't look right:

writefln("%.0f", 3.14);

prints 3, but:

writefln("%,.0f", 3.14);

causes this assert to fail.

Also, here's another failing case that doesn't involve zero precision:

writefln("%3g", 1_000_000.123456); // prints "1e+06"
writefln("%3,g", 1_000_000.123456); // assert fails

quickfur · 2017-04-05T18:48:31Z

std/format.d

+            {
+                ++separatorScoreCnt;
+            }
+        }


Whoa, is this loop necessary?? Couldn't you just simplify this to a simple divide-and-round? Namely:

size_t separatorScoreCnt = (firstLen == 0) ? 0 : (firstLen - 1) / fs.separators;
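To see why the division gives the same count as a loop, here is a small self-contained sketch. The names countByLoop and countByDivision are made up for illustration and are not the PR's code; the loop is only a plausible stand-in for the one under review.

```d
import std.stdio : writeln;

// Hypothetical stand-in for the loop: count one separator for every
// position whose remaining digits to the right form whole groups.
size_t countByLoop(size_t digits, size_t group)
{
    size_t count = 0;
    foreach (i; 1 .. digits)
    {
        if ((digits - i) % group == 0)
            ++count;
    }
    return count;
}

// The single-expression form suggested in the review.
size_t countByDivision(size_t digits, size_t group)
{
    return digits == 0 ? 0 : (digits - 1) / group;
}

void main()
{
    // Both forms agree for every small digit count and group size.
    foreach (group; 1 .. 6)
        foreach (digits; 0 .. 40)
            assert(countByLoop(digits, group) == countByDivision(digits, group));

    // Seven digits in groups of three (1,234,567) need two separators.
    writeln(countByDivision(7, 3));
}
```

The loop counts the multiples of the group size between 1 and digits - 1, which is exactly what the integer division computes.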

quickfur · 2017-04-05T18:50:52Z

std/format.d

+            {
+                ++separatorScoreCnt;
+            }
+        }


And this one too. I didn't work this one through, but surely this is computable via a suitable integer division rather than a loop!

quickfur · 2017-04-05T18:52:02Z

std/format.d

+        // plus, minus prefix
+        for (j = 0; j < separatorScoreCnt && buf[j] == ' '; ++j)
+        {
+        }


Couldn't this be simplified to some combination of find (which is already imported by this function)?

I'm not really sure how to use find here, but I'm open to ideas.

Hmm. Apparently find isn't quite what we need here. How about:

import std.string : indexOf;
auto j = buf[0 .. separatorScoreCnt].indexOf(' ');

? Would that work?

it was a little different but basically yes, thanks

quickfur · 2017-04-05T18:52:32Z

std/format.d

+        }
+        put(w, '.');
+
+        // digists after dot


Nitpick: spelling: digits

quickfur · 2017-04-06T17:13:41Z

std/format.d

+                {
+                    ++separatorScoreCnt;
+                }
+            }


Isn't this loop basically equivalent to:

ptrdiff_t mantissaLen = afterDotLen - (dot + 1);
separatorScoreCnt = (mantissaLen > 0) ? (mantissaLen - 1) / fs.separators : 0;

?

P.S. afterDotLen seems a bit misleading, because it's actually an index to the end of the mantissa. So perhaps a better name might be afterDotIdx?

changed both

quickfur · 2017-04-07T23:46:32Z

Thanks!

ping @schveiguy any other comments before we merge?

schveiguy

Other than the typo, gotta be honest, I don't know how the code works. So I'll assume from the unit tests and from @quickfur's thorough review that it works correctly :) Just fix the typo, and this is good to go.

schveiguy · 2017-04-10T22:06:17Z

std/format.d

@@ -215,6 +223,18 @@ $(I FormatChar):
        preceding the actual argument, is taken as the precision.
        If it is negative, it is as if there was no $(I Precision) specifier.)

+        $(DT $(I Separator))
+        $(DD Inserts the seperator symbols ',' every $(I X) digits, from right


seperator -> separator

quickfur · 2017-04-14T02:19:02Z

ping @burner

Let's fix that typo, clean up the git commit log, and let's merge this already.

quickfur · 2017-04-14T02:21:13Z

std/format.d

+        // rest
+        if (ePos != -1)
+        {
+            put(w, buf[afterDotIdx .. len]);


Hmm, the autotester is reporting a range violation on this line.

quickfur · 2017-04-14T02:29:33Z

std/format.d

+    if (fs.flSeparator && dot != -1)
+    {
+        ptrdiff_t firstDigit = buf.indexOfAny("0123456789");
+        ptrdiff_t ePos = buf.indexOf('e');


Found the problem: buf is a static array of length 512, so the above searches of buf will actually search all 512 bytes, since in D, we do not stop upon encountering a null terminator. So lines 2236, 2239, and 2240 should use buf[0 .. len].indexOf(...) rather than buf.indexOf(...).

If e is not found in the formatted string, for example, the above indexOf call will search past the end of the formatted string. If you're unlucky enough that previous data included an e (note that buf is void-initialized -- yet another reason why auto-zeroing local variables is a good idea), then ePos will be an invalid index. So later on when you try to slice buf[ePos .. len] it will throw a RangeError because ePos > len.
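A minimal sketch of the failure mode, assuming a buffer prefilled with 'e' to imitate stale data (the real buf is void-initialized, so its contents are arbitrary). The helper exponentPos is hypothetical and only wraps the indexOf call.

```d
import std.string : indexOf;

// Hypothetical helper: position of the exponent marker, or -1 if absent.
ptrdiff_t exponentPos(const(char)[] formatted)
{
    return formatted.indexOf('e');
}

void main()
{
    // Fill the buffer with 'e' to imitate leftover data.
    char[512] buf = 'e';
    string digits = "1000000.5";
    immutable len = digits.length;
    buf[0 .. len] = digits[];

    // Searching the whole buffer finds a stale 'e' just past the text.
    assert(exponentPos(buf[]) == cast(ptrdiff_t) len);

    // Searching only the formatted part correctly reports "not found".
    assert(exponentPos(buf[0 .. len]) == -1);
}
```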

burner · 2017-04-18T09:46:48Z

@quickfur back from a holiday, applied your fix, fixed the typo. Let's see what the autotester says.

quickfur · 2017-04-18T16:09:12Z

Yay, autotester passed!

quickfur · 2017-04-18T16:11:54Z

I'm not understanding the coverage report, though. Why is the entire block from l.2238 to l.2297 marked in red, even though we clearly cover this code in the unittests (which is why it was failing before when we had buf.indexOf(...) instead of buf[0..len].indexOf(...))?

Or is that because the local unittests didn't cover it, but it just happened to be covered by a use case in another Phobos module?

quickfur · 2017-04-18T16:16:55Z

std/format.d

+                }
+                else
+                {
+                    // "." was specified, but nothing after it


Small nitpick: misleading comment, should be "," not ".".

quickfur · 2017-04-18T16:22:06Z

Also, please simplify the commit message. There's no need to list all the previous commit messages now that it's all squashed. Nobody reading the commit log would know (or care) what "rewrite", "minor fix", "typo deleted old code" refer to, for example. It's good enough to just say that this (squashed) commit implements feature XYZ and leave it at that.

burner · 2017-04-19T07:26:00Z

not sure what is happening with the coverage. I manually checked and every line is hit, except the throw statements.

quickfur · 2017-04-19T18:38:32Z

No problem, I was just wondering.

Anyway, I think that's good enough already. Let's merge!!

burner · 2017-04-19T21:57:01Z

@quickfur, @schveiguy thank you

quickfur · 2017-06-01T17:30:12Z

Just found some missed cases this morning: https://issues.dlang.org/show_bug.cgi?id=17459

burner · 2017-06-01T20:49:18Z

I'll have a look

burner added Needs Review Needs Work labels Mar 23, 2017

burner added this to the 2.075.0 milestone Mar 23, 2017

quickfur changed the title ~~std.format insert underscores into numbers~~ Fixes issue 10001: std.format insert underscores into numbers Mar 23, 2017

quickfur reviewed Mar 23, 2017

quickfur requested changes Mar 23, 2017

quickfur changed the title ~~Fixes issue 10001: std.format insert underscores into numbers~~ fix Issue 10001 - std.format insert underscores into numbers Mar 23, 2017

burner force-pushed the origin/formatunderscore branch from 12479a9 to e0d8c00 Compare March 24, 2017 09:52

dlang-bot added the Bug Fix label Mar 24, 2017

quickfur approved these changes Mar 24, 2017

quickfur reviewed Mar 24, 2017

burner force-pushed the origin/formatunderscore branch 2 times, most recently from 1e703ed to 3ad6c74 Compare March 27, 2017 09:33

burner removed Needs Work Bug Fix labels Mar 27, 2017

quickfur requested changes Apr 5, 2017

burner force-pushed the origin/formatunderscore branch 2 times, most recently from 07222d4 to 57a6485 Compare April 6, 2017 11:50

quickfur requested changes Apr 6, 2017

burner force-pushed the origin/formatunderscore branch from 57a6485 to cf584f0 Compare April 7, 2017 09:45

quickfur approved these changes Apr 7, 2017

schveiguy approved these changes Apr 10, 2017

quickfur reviewed Apr 14, 2017

burner force-pushed the origin/formatunderscore branch from cf584f0 to 37346b9 Compare April 18, 2017 09:44

burner force-pushed the origin/formatunderscore branch from 37346b9 to 4ed58df Compare April 18, 2017 10:01

quickfur reviewed Apr 18, 2017

format with comma formatspec to add separator into numbers

7ddc85d

burner force-pushed the origin/formatunderscore branch from 4ed58df to 7ddc85d Compare April 19, 2017 07:06

quickfur added auto-merge and removed Needs Review labels Apr 19, 2017

dlang-bot merged commit 5de9af2 into dlang:master Apr 19, 2017

burner deleted the origin/formatunderscore branch April 19, 2017 21:57

quickfur mentioned this pull request Jun 8, 2017

Added %t tuple format #5249

Closed
