Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix Issue 10001 - std.format insert underscores into numbers #5303

Merged
merged 1 commit into from Apr 19, 2017

Conversation

burner
Copy link
Member

@burner burner commented Mar 23, 2017

Big numbers are hard to read. D allows to insert underscores into numbers to aid readability.
This PR gives std.format the ability to insert underscores into numbers.

string tmp = format("%_d", 1234567);
assert(tmp == "1_234_567");

tmp = format("%_4f", 1234567.891011);
assert(tmp == "123_4567.8910_11");

The numbers of digits between underscores can be specified similar to the way a precision is given.

Todo:

  • Complete documentation
  • squash commits
  • parse scientific float representation and insert break chars

@burner burner added this to the 2.075.0 milestone Mar 23, 2017
@quickfur quickfur changed the title std.format insert underscores into numbers Fixes issue 10001: std.format insert underscores into numbers Mar 23, 2017
@quickfur
Copy link
Member

This is a known enhancement request: https://issues.dlang.org/show_bug.cgi?id=10001

@wilzbach
Copy link
Member

@burner: have a look at https://github.com/dlang-bots/dlang-bot for letting the bot and changelog know about the issue

std/format.d Outdated
@@ -958,6 +994,12 @@ if (is(Unqual!Char == Char))
int precision = UNSPECIFIED;

/**
Breaks. Its value defines how many digits are printed between
underscors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: underscores.

Copy link
Member

@quickfur quickfur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. The only thing I'm not quite comfortable with is the Underscore specifier being inserted between the width and the precision -- IMO width and precision should be kept together, and _ inserted after that.

Also, the change to the grammar docs needs to be complete -- the Underscore production needs to be linked to the rest of the grammar somehow.

std/format.d Outdated
$(I empty)
$(B '_')
$(B '_') $(I Integer)
$(B '_*')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This grammar change is incomplete. Where does Underscore fit in with the rest of the grammar?

And on that note, I see in your unittest examples that you inserted it between the width and the precision (i.e., %5_4.5d). Personally, I don't like this. I think a better ordering would be to keep the width and precision together, and tack on the _ afterwards, like this: %5.5_4d.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, ddoc is being stupid and rendering the _ as blanks. You have to write it as __ (two underscores) instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The given grammar is just a suggestion. The parser actually allows you to give multiple order of width, precision and underscore. To fix that fact, I added another assert at the last unittest.

Thanks for the underscore tip.

@quickfur quickfur changed the title Fixes issue 10001: std.format insert underscores into numbers fix Issue 10001 - std.format insert underscores into numbers Mar 23, 2017
@quickfur
Copy link
Member

Also, this is a side note and not really the problem of this PR: the format string grammar in the docs doesn't match the code! For example, this compiles and runs with no error and prints SRSLY?!:

import std.stdio;
void main() {
        writefln("%+#+-#+-#s", "SRSLY?!");
}

And this prints WAT:

import std.stdio;
void main() {
        writefln("%+#-.1+0#-.2+#0#-.4#-+s", "WAT");
}

Yet according to the grammar, these ought to be malformed format strings.

I think this code needs a cleanup (not necessarily in this PR, though).

Copy link
Member

@quickfur quickfur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing: right now the unittests and examples only show using _ with integral types. What does it do when you have %s? Does it do something sane? Should it be allowed?

@schveiguy
Copy link
Member

Nice idea. I'll bring up a related closed PR from the past: #3377

@schveiguy
Copy link
Member

I think at the very least, one should be able to format with non-underscores, e.g. 1,000,000.

@quickfur
Copy link
Member

Hmm. This makes me wonder if we're going down a slippery slope of adding _ for 1_000_000, then , for 1,000,000, then . for certain locales where they write 1.000.000... where will it end?

Though I can see (re)designing the specifier syntax so that it's possible to specify which character to use as separator. Perhaps %_3,s for the comma variant... but the trouble with that is that the current implementation, which (IMO wrongly) permits any order of flags including repeats (contrary to the stated grammar), you'd have an ambiguity if you wanted to use . as separator (e.g., %_3.0s: does that mean _3. and 0, or _3 and .0?).

@quickfur
Copy link
Member

quickfur commented Mar 24, 2017

There would be no nice way to use alphanumeric separators, though I can't see any reason why anybody would need that (1a000a000 seems like a really silly thing to do... if you really wanted that you could just write your own formatting function for it).

@schveiguy
Copy link
Member

This makes me wonder if we're going down a slippery slope

I think since this is for human readability benefit, if we are going to support adding separators, we should at least support what humans are used to reading.

I don't think we need to support arbitrary characters. I think it would be possible to define both _ and , specifiers, which would separate with the respective characters. Using . is problematic, since you have the decimal place specifier. However, in locales that use . as a separator, they wouldn't be using it for a decimal place as well, so you would need a specialized formatter. There's also the possibility that using , means to use whatever the user's locale defines.

@burner
Copy link
Member Author

burner commented Mar 26, 2017

I'm willing to add a single user defined character version. Where a char is from ascii without digits. No dchar stuff. IMO everything is else is just gone be overkill.
Grammar would be something like _[^:digit:]

@quickfur
Copy link
Member

Using locale-dependent interpretation of , would be best, but that requires being able to query the current locale from the OS and being able to interpret it sufficiently to tell how to render ,. And Phobos currently doesn't have the facility to query locale information.

I agree that allowing arbitrary separators would be unnecessary.

I also recommend using , by default and _ optionally, since we really should be catering to default human-readable conventions like 1,000,000 instead of D-programmer-specific 1_000_000.

@quickfur
Copy link
Member

As for dots vs. commas, in Russian, for example, the decimal mark is a comma and the grouping separator is the dot. So you'd write 1.000.000,50 instead of 1,000,000.50.

FYI: https://en.wikipedia.org/wiki/Decimal_mark alludes to a proposed standard to use spaces as separator instead of dots or commas.

@quickfur
Copy link
Member

P.S., I seem to have gotten Russian and Spanish mixed up. Russian apparently uses spaces for grouping, whereas Spanish uses dots for grouping, and both use commas for decimal.

@burner burner force-pushed the origin/formatunderscore branch 2 times, most recently from 1e703ed to 3ad6c74 Compare March 27, 2017 09:33
@burner
Copy link
Member Author

burner commented Mar 27, 2017

you can't have '.' in the string as they donate the start of the precision flag. And we can not enforce any order of width, precision, and underscore as no order was every enforced.

@quickfur
Copy link
Member

And you can't have space either, because it already means something else in the format string. :-( Looks like we may just have to resort to a separate number-formatting function.

Also, when used with non-digit strings, %_s seems kinda pointless. Maybe it's better to implement this in a separate number-formatting function?

@quickfur
Copy link
Member

As for no order ever being enforced, that was never publicly stated in the documentation. It also allows for nonsensical specs like %1.1.1.1.1+-1.1s, which should not be allowed IMO. I think it should be OK to enforce a specific order. The current implementation is a mess. I think it's very bad that the documentation (grammar) and code don't match right now.

Copy link
Member

@quickfur quickfur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is screwy with that assert on line 2257. It's failing on stuff it shouldn't fail on.

Also, the code seems somewhat crude, as there are loops that really aren't necessary and could be replaced by a single expression. Or by standard stuff from std.algorithm, which ought to reduce the likelihood of bugs and missing corner cases, as well as make the code a bit more concise and clearer to read.

std/format.d Outdated
}

assert(firstDigit != size_t.max);
assert(dot != size_t.max);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right:

writefln("%.0f", 3.14);

prints 3, but:

writefln("%,.0f", 3.14);

causes this assert to fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, here's another failing case that doesn't involve zero precision:

writefln("%3g", 1_000_000.123456);  // prints "1e+06"
writefln("%3,g", 1_000_000.123456); // assert fails

std/format.d Outdated
{
++separatorScoreCnt;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoa, is this loop necessary?? Couldn't you just simplify this to a simple divide-and-round? Namely:

size_t separatorScoreCnt = (firstLen == 0) ? 0 :
                           (firstLen - 1) / fs.separators;

std/format.d Outdated
{
++separatorScoreCnt;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one too. I didn't work this one through, but surely this is computable via a suitable integer division rather than a loop!

std/format.d Outdated
// plus, minus prefix
for (j = 0; j < separatorScoreCnt && buf[j] == ' '; ++j)
{
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this be simplified to some combination of find (which is already imported by this function)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure how to use find here, but I'm open to ideas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Apparently find isn't quite what we need here. How about:

import std.string : indexOf;
auto j = buf[0 .. separatorScoreCnt].indexOf(' ');

? Would that work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was a little different but basically yes, thanks

std/format.d Outdated
}
put(w, '.');

// digists after dot
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: spelling: digits

@burner burner force-pushed the origin/formatunderscore branch 2 times, most recently from 07222d4 to 57a6485 Compare April 6, 2017 11:50
std/format.d Outdated
{
++separatorScoreCnt;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this loop basically equivalent to:

ptrdiff_t mantissaLen = afterDotLen - (dot + 1);
separatorScoreCnt = (mantissaLen > 0) ? (mantissaLen - 1) / fs.separators : 0;

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. afterDotLen seems a bit misleading, because it's actually an index to the end of the mantissa. So perhaps a better name might be afterDotIdx?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed both

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

@quickfur
Copy link
Member

quickfur commented Apr 7, 2017

Thanks!

ping @schveiguy any other comments before we merge?

Copy link
Member

@schveiguy schveiguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the typo, gotta be honest, I don't know how the code works. So I'll assume from the unit tests and from @quickfur's thorough review that it works correctly :) Just fix the typo, and this is good to go.

std/format.d Outdated
@@ -215,6 +223,18 @@ $(I FormatChar):
preceding the actual argument, is taken as the precision.
If it is negative, it is as if there was no $(I Precision) specifier.)

$(DT $(I Separator))
$(DD Inserts the seperator symbols ',' every $(I X) digits, from right
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seperator -> separator

@quickfur
Copy link
Member

ping @burner

Let's fix that typo, clean up the git commit log, and let's merge this already.

// rest
if (ePos != -1)
{
put(w, buf[afterDotIdx .. len]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the autotester is reporting a range violation on this line.

std/format.d Outdated
if (fs.flSeparator && dot != -1)
{
ptrdiff_t firstDigit = buf.indexOfAny("0123456789");
ptrdiff_t ePos = buf.indexOf('e');
Copy link
Member

@quickfur quickfur Apr 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found the problem: buf is a static array of length 512, so the above searches of buf will actually search all 512 bytes, since in D, we do not stop upon encountering a null terminator. So lines 2236, 2239, and 2240 should use buf[0 .. len].indexOf(...) rather than buf.indexOf(...).

If e is not found in the formatted string, for example, the above indexOf call will search past the end of the formatted string. If you're unlucky enough that previous data included an e (note that buf is void-initialized -- yet another reason why auto-zeroing local variables is a good idea), then ePos will be an invalid index. So later on when you try to slice buf[ePos .. len] it will throw a RangeError because ePos > len.

@burner
Copy link
Member Author

burner commented Apr 18, 2017

@quickfur back from a holiday, applied your fix, fixed the typo. Lets see what the autotester says.

@quickfur
Copy link
Member

Yay, autotester passed!

@quickfur
Copy link
Member

quickfur commented Apr 18, 2017

I'm not understanding the coverage report, though. Why is the entire block from l.2238 to l.2297 marked in red? Even though we clearly cover this code in the unittests (which is why it was failing before when we had buf.indexOf(...) instead of buf[0..len].indexOf(...)).

Or is that because the local unittests didn't cover it, but it just happened to be covered by a use case in another Phobos module?

std/format.d Outdated
}
else
{
// "." was specified, but nothing after it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick: misleading comment, should be "," not ".".

@quickfur
Copy link
Member

quickfur commented Apr 18, 2017

Also, please simplify the commit message. There's no need to list all the previous commit messages now that it's all squashed. Nobody reading the commit log would know (or care) what "rewrite", "minor fix", "typo deleted old code" refer to, for example. It's good enough to just say that this (squashed) commit implements feature XYZ and leave it at that.

@burner
Copy link
Member Author

burner commented Apr 19, 2017

not sure what is happening with the coverage. I manually checked and every line is hit, except the throw statements.

@quickfur
Copy link
Member

No problem, I was just wondering.

Anyway, I think that's good enough already. Let's merge!!

@dlang-bot dlang-bot merged commit 5de9af2 into dlang:master Apr 19, 2017
@burner
Copy link
Member Author

burner commented Apr 19, 2017

@quickfur, @schveiguy thank you

@burner burner deleted the origin/formatunderscore branch April 19, 2017 21:57
@quickfur
Copy link
Member

quickfur commented Jun 1, 2017

Just found some missed cases this morning: https://issues.dlang.org/show_bug.cgi?id=17459

@burner
Copy link
Member Author

burner commented Jun 1, 2017

I'll have a look

@quickfur quickfur mentioned this pull request Jun 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants