Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: add support for '_' as thousands separator to math command #8496

Closed
EdwardBetts opened this issue Nov 28, 2021 · 17 comments
Closed

Comments

@EdwardBetts
Copy link
Contributor

It would be nice if I could use '_' as a thousands separator as a parameter to the math command. I'd like to be able to write 2000 as 2_000. I'm not asking for the output to change.

Python, JavaScript, Perl and Ruby all support '_' as a thousands separator. It makes it easier to read large numbers.

Example of suggested usage:

$ math 2_000 + 2
2002
$
@zanchey zanchey added this to the fish-future milestone Nov 28, 2021
@triallax
Copy link
Contributor

triallax commented Dec 1, 2021

Does it have to be for thousands only? Surely it can be used in other positions.

@floam
Copy link
Member

floam commented Dec 1, 2021

why not ,?

@faho
Copy link
Member

faho commented Dec 1, 2021

Because math uses that to separate arguments to functions. It is unavailable.

@EdwardBetts
Copy link
Contributor Author

Does it have to be for thousands only? Surely it can be used in other positions.

Yes, you're right, any position. I just checked, Javascript, Perl, Python and Ruby support '_' in any position within a number.

@mqudsi
Copy link
Contributor

mqudsi commented Dec 1, 2021

why not ,?
Because math uses that to separate arguments to functions. It is unavailable.

Also because that's very locale-specific behavior.

@andmis
Copy link
Contributor

andmis commented Dec 26, 2021

I looked into this a bit.

This was added to JavaScript recently, and they have a good summary of prior art in their proposal. tl;dr: most languages use an underscore, allow it to be placed between digits, and disallow multiple underscores in a row, leading or trailing underscores, underscores next to 0x or 0b, etc.

The implementation of fish's math builtin uses (a ported-to-C++ and otherwise-modified) tinyexpr for expression parsing and evaluation.

Out-of-the-box, to parse out numeric literals, the tokenizer in tinyexpr invokes strtod when it arrives at a decimal digit or a dot (.). The function strtod also advances the next-character-pointer in the expression being parsed.

Our modified tinyexpr invokes fish_wcstod, which is a wrapper around strtod and std::wcstod that uses the narrow-string code path when it can, for performance reasons according to a comment in the code. (Incidentally, AFAICT this means that our modified tinyexpr has quadratic runtime, since fish_wcstod decides whether to use the fast code path by checking whether every character in the string it's given is ASCII.) Neither strtod nor std::wcstod can be configured to do what we want (allow underscore dividers).

It looks like there are a few options:

  • Don't use the standard-library strtod/std::wcstod -- add our own implementation to the codebase having the same functionality and treating underscores differently. Use this modified strtod from our tinyexpr. We only use tinyexpr to implement the math builtin. Cover our implementation with tests. This doesn't seem so bad. Or:
  • Before passing an expression to tinyexpr, preprocess it to remove (some) underscores. There are two problems with this. First, it's not obvious which underscores to remove. It's not enough to remove ones between decimal digits because of hex. Second, we would need to keep track of the modifications to the expression for diagnostics in case of error. I drafted this solution -- note the erroneous position of the diagnostic caret in the examples below:
[2] ~/repos/fish-shell/am/build-debug $ math 1_0 + 2_0
30
[2] ~/repos/fish-shell/am/build-debug $ math 1_0 + 2_0 blargh
math: Error: Unknown function
'1_0 + 2_0 blargh'
              ^
[2] ~/repos/fish-shell/am/build-debug [1] $ math 1_0 + 2_0 + 3_0 blargh
math: Error: Unknown function
'1_0 + 2_0 + 3_0 blargh'
                   ^
[2] ~/repos/fish-shell/am/build-debug [1] $ math 1_0 + 2_0 + 3_0_0_0_0 blargh
math: Error: Missing operator
'1_0 + 2_0 + 3_0_0_0_0 blargh'
             ^
[2] ~/repos/fish-shell/am/build-debug [1] $ math 1_0 + 2_0 + 3_0 + 4_0 + 5_0 bla
math: Error: Unknown function
'1_0 + 2_0 + 3_0 + 4_0 + 5_0 bla'
                          ^
  • Something else?

Thoughts?

@krobelus
Copy link
Member

I think both solutions will work. Reimplementing strtod/std::wcstod feels like the cleanest option, then we don't need weird special logic.

Before passing an expression to tinyexpr, preprocess it to remove (some) underscores. There are two problems with this. First, it's not obvious which underscores to remove. It's not enough to remove ones between decimal digits because of hex.

The obvious way is to consume something like ([0-9_]+|0x[0-9a-fA-F_]+|0b[01_]+), and convert all of that. Also, we have the same problem if we avoid strtog/wcstod, right?

disallow multiple underscores in a row, leading or trailing underscores, underscores next to 0x or 0b, etc.

that sounds reasonable; probably easiest to implement inside a reimplementation of wcstod

Second, we would need to keep track of the modifications to the expression for diagnostics in case of error. I drafted this solution -- note the erroneous position of the diagnostic caret in the examples below:

I think the wrong positions could be fixed, even if we keep using strtod/wcstod.

@andmis
Copy link
Contributor

andmis commented Dec 26, 2021

@ridiculousfish when you implemented a700aca, did you happen to look into what about the macOS implementation of wcstod actually made it so slow?

@ridiculousfish
Copy link
Member

wcstod is slow on Mac because it always invokes _simple_salloc which allocates a page from the kernel (not via malloc!). All the time is spent allocating and freeing this one page.

@andmis
Copy link
Contributor

andmis commented Dec 27, 2021

I took a pass hand-rolling a wcstod clone that does what we want: https://github.com/andrey-mishchenko/fish-shell/commit/4fa69ca09262bf39b8684c849e14566e83b11a86

This is a draft and isn't really tested yet, certainly has at least one bug in it, would need unit tests, would need to be renamed and cleaned up, and so on. But before I do all that, is this the kind of solution you guys have in mind?

@krobelus
Copy link
Member

krobelus commented Dec 27, 2021

I haven't taken a detailed look yet but overall this feels fine.
It's not so complex and by doing the parsing ourselves we avoid any performance issues from memory allocation.
In the final version we should briefly document where we deviate from std::wcstod.

commit 4fa69ca
Author: Andrey Mishchenko mishchea@gmail.com
Date: Sun Dec 26 13:16:28 2021 -0500

Draft of fish_wcstod_permissive

diff --git a/src/tinyexpr.cpp b/src/tinyexpr.cpp
index e76955d45..bd0e134f3 100644
--- a/src/tinyexpr.cpp
+++ b/src/tinyexpr.cpp
@@ -272,7 +272,7 @@ static void next_token(state *s) {

     /* Try reading a number. */
     if ((s->next[0] >= '0' && s->next[0] <= '9') || s->next[0] == '.') {

- s->value = fish_wcstod(s->next, const_cast<wchar_t **>(&s->next));
+ s->value = fish_wcstod_permissive(s->next, const_cast<wchar_t **>(&s->next));

I wonder if we want to switch over our other relevant caller (test).
If not (the safe choice) we can still try to add a "bool allow_underscore" flag and drop the current fish_wcstod so we have fewer code paths to worry about.

         s->type = TOK_NUMBER;
     } else {
         /* Look for a function call. */

diff --git a/src/wutil.cpp b/src/wutil.cpp
index 55c8affef..ae70ea942 100644
--- a/src/wutil.cpp
+++ b/src/wutil.cpp
@@ -711,6 +711,152 @@ double fish_wcstod(const wchar_t *str, wchar_t **endptr) {
return std::wcstod(str, endptr);
}

+double fish_wcstod_permissive(const wchar_t *str, wchar_t **endptr) {

We might wanna have the API deviate from wcstod, which reports errors via errno, so it's not thread-safe.
I guess it doesn't matter because we ignore errors (only overflow?).

+ const wchar_t *orig = str;
+ while (*str && iswspace(*str)) ++str;
+
+ // As per the documentation of strtod, this is what we are supposed to do if we fail to perform
+ // any conversion.
+#define FISH_WCSTOD_NO_CONVERSION() {
+ if (endptr) { *endptr = const_cast<wchar_t *>(orig); }
+ return 0;
+}

should avoid the macro with a wrapper function or a lambda

+
+ if (!*str) FISH_WCSTOD_NO_CONVERSION();
+
+ auto is_sign = [](wchar_t wc) -> bool {
+ return wc == L'+' || wc == L'-';
+ };
+
+ // Consume leading +/-.
+ bool significand_negative = false;
+ if (is_sign(*str)) {
+ significand_negative = (*str == L'-');
+ ++str;
+ }
+
+ // We do not handle NaN or floating-point infinity. So we aren't quite compatible with strtod.
+ // But if we did handle these, this would be the place to insert the logic.
+
+ // Consume leading 0x/0X.
+ bool hex = false;
+ if (!wcsncasecmp(str, L"0x", 2)) {
+ hex = true;
+ ++str;
+ ++str;
+
+ if (!iswxdigit(*str)) {
+ // Consume the (signed) zero, but don't consume the X/x.
+ if (endptr) { *endptr = const_cast<wchar_t *>(str - 1); }
+ return significand_negative ? -0.0 : 0.0;
+ }
+ }
+
+ const int base = hex ? 16 : 10;
+ double significand = 0;
+ bool seen_decimal_point = false;
+ int exponent_shift = 0;
+ bool exponent_negative = false;
+ int exponent = 0;
+
+#define FISH_WCSTOD_DONE(endptr_val) {
+ if (endptr) { *endptr = const_cast<wchar_t *>(endptr_val); }
+ return significand *
+ pow(base, (exponent_negative ? -1 : 1) * exponent + exponent_shift) *
+ (significand_negative ? -1 : 1);
+}
+
+ // Confirm leading character is a digit.
+ // We implicitly rely on this in underscore_legal.
+ const auto is_digit = [hex](wchar_t wc) -> bool {
+ if (hex) {
+ return iswxdigit(wc);
+ } else {
+ return iswdigit(wc);
+ }
+ };
+
+ if (!is_digit(*str)) {
+ FISH_WCSTOD_NO_CONVERSION();
+ }
+
+ const auto value = [](wchar_t wc) -> int {
+ if (L'0' <= wc && wc <= L'9') {
+ return (wchar_t)(wc - L'0');
+ } else if (L'A' <= wc && wc <= L'F') {
+ return 10 + (wchar_t)(wc - L'A');
+ } else {
+ assert(L'a' <= wc && wc <= L'f');
+ return 10 + (wchar_t)(wc - L'a');
+ }
+ };
+
+ const auto underscore_legal = [is_digit](const wchar_t p) {
+ assert(p == L'_');
+ return is_digit(
(p - 1)) && is_digit(
(p + 1));

I think the *(p - 1) breaks if someone calls us with "_123"

+ };
+
+ while (is_digit(*str) || *str == L'.' || *str == L'_' && underscore_legal(str)) {
+ if (is_digit(*str)) {
+ significand *= base;
+ if (seen_decimal_point) {
+ exponent_shift -= hex ? 4 : 1;

TIL hex literals have a binary exponent, maybe add a comment mentioning that.

I haven't looked at the stdlib implementation but I wonder if we are losing more precision than necessary with this computation.
significand can get fairly large (if the input is like 0.100000001), and computations of small numbers are more precise.

+ }
+ significand += value(*str);
+ } else if (*str == L'.') {
+ if (seen_decimal_point) {
+ FISH_WCSTOD_DONE(str);
+ } else {
+ seen_decimal_point = true;
+ }
+ }
+ ++str;
+ }
+
+ const auto is_exp_char = [hex](wint_t wc) -> bool {
+ if (hex) {
+ return wc == L'P' || wc == L'p';
+ } else {
+ return wc == L'E' || wc == L'e';
+ }
+ };
+
+ if (!is_exp_char(*str)) {
+ FISH_WCSTOD_DONE(str);
+ } else {
+ ++str;
+ }
+
+ // Consume leading +/-.
+ if (is_sign(*str)) {
+ exponent_negative = (str == L'-');
+ ++str;
+ }
+
+ // Confirm leading character is a digit.
+ // We implicitly rely on this in underscore_legal below.
+ if (!is_digit(str)) {
+ // Ok, back out from consuming the Pp? in the previous 1 or 2 chars.
+ if (is_sign(
(str - 1))) {
+ assert(is_exp_char(
(str - 2)));
+ FISH_WCSTOD_DONE(str - 2);

not sure if *(str - 2) is guaranteed to be in-bound, we should maybe clarify

+ } else {
+ assert(is_exp_char(*(str - 1)));
+ FISH_WCSTOD_DONE(str - 1);
+ }
+ }
+
+ while (is_digit(*str) || *str == L'_' && underscore_legal(str)) {
+ if (is_digit(*str)) {
+ exponent *= base;
+ exponent += value(*str);
+ }
+ ++str;
+ }
+
+ FISH_WCSTOD_DONE(str);
+}
+
file_id_t file_id_t::from_stat(const struct stat &buf) {
file_id_t result = {};
result.device = buf.st_dev;
diff --git a/src/wutil.h b/src/wutil.h
index 139f5c8aa..d54135274 100644
--- a/src/wutil.h
+++ b/src/wutil.h
@@ -133,6 +133,7 @@ long long fish_wcstoll(const wchar_t *str, const wchar_t **endptr = nullptr, int
unsigned long long fish_wcstoull(const wchar_t *str, const wchar_t **endptr = nullptr,
int base = 10);
double fish_wcstod(const wchar_t *str, wchar_t **endptr);
+double fish_wcstod_permissive(const wchar_t *str, wchar_t **endptr);

/// Class for representing a file's inode. We use this to detect and avoid symlink loops, among
/// other things. While an inode / dev pair is sufficient to distinguish co-existing files, Linux

@faho
Copy link
Member

faho commented Dec 27, 2021

reports errors via errno, so it's not thread-safe.

Errno is thread-safe. This is required by POSIX, on linux it's typically a thread-local variable.

Of course we still might want to change the API, but we would do so for ergonomics, features or performance, not safety.

@ridiculousfish
Copy link
Member

Do not hand-roll strtod(). The naive approach of just value * 10 + digit will introduce significant floating point error. fish should be at least as accurate as the system-provided strtod() (even though probably nobody is using fish for precise floating point calculations).

Look up "How to Read Floating Point Numbers Accurately" by Clinger to see how gnarly this problem is.

I suggest finding and modifying another implementation of strtod, e.g. from musl or BSD or Python.

@andmis
Copy link
Contributor

andmis commented Dec 27, 2021

Yeah, I think it makes sense to start with an existing implementation.

Just a note, though: the value * base + digit approach is not as bad as it sounds, and I did try hard to be careful. For example, suppose we have a very large number in hex (with no fractional part or exponent for simplicity -- there is code in the implementation dealing with these and this discussion is still correct). Doing value * 16 + digit loses no precision at all until you get to 2^53 (on a IEEE-754 64-bit float) (since *16 does not affect the significand and +digit will be exact for integers until 2^53). And value * 16 + digit will never touch the significand again once the exponent gets big enough that single-hex-digit addition does not change the value of the float. So as far as I can tell this implementation actually gives the closest double to a given hex value, modulo perhaps the single value * 16 + digit operation where we transition from exact-integer-representation to so-big-that-single-digit-addition-changes-nothing (although I think that operation should round correctly as well). The key thing here is that in hex, moving the decimal point does not change the binary representation of the significand in the floating-point representation, so value * base + digit isn't actually losing us precision.

It's true that the decimal situation is more complicated, because value * 10 + digit is changing the value of the significand in the floating-point rep, and this is where errors can accumulate. Sorry, I should have called this out. I think it would be straightforward to give a very-tight upper-bound on the error between the input and the resulting float by stopping the value * 10 + digit accumulation as soon as we know that the +digit part will have no effect on the significand, and replace the *10 with a +1 to an exponent_shift, and then just do significand * pow(10, exponent + exponent_shift) at the end. This turns the overall calculation into a pow with bounded error, and a single float multiplication which also has a bounded error. This is pretty close to the way I implemented it.

But I think you're right that (1) we should be at least as accurate as existing implementations and (2) this is a lot of bit twiddling to deal with ourselves from scratch, and so it makes sense to start with an existing implementation.

Thanks for the paper reference, I haven't seen that before.

@andmis
Copy link
Contributor

andmis commented Jan 2, 2022 via email

@andmis
Copy link
Contributor

andmis commented Jan 2, 2022

Actually here are the commits, this implementation is just in the hex half of the parser: https://github.com/andrey-mishchenko/fish-shell/commits/8496-underscore-dividers

@krobelus
Copy link
Member

krobelus commented Jan 2, 2022

We may end up wanting to take the approach of preprocessing the input string to support the functionality in this ticket.

Sure. I don't expect performance to matter, plus we can keep the common case (no underscore) as before.
(also we could even allocate the extra buffer on the stack)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants