Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use exponential format for decimals with relatively large scale #121

Merged
merged 1 commit into from
Mar 6, 2024

Conversation

declanvk
Copy link

@declanvk declanvk commented Jan 8, 2024

This PR is related to some of the comments in #108, but there is not a specific issue created separately for this. Please let me know if you want me to create an issue first for some discussion.


The first commit of the PR is mostly just adding tests to capture the current behavior.

The second commit is the more important, it adds a different formatting code path to the fmt::Display impl, so that numbers that have a large (absolute) scale relative to their integer representation length will be displayed in exponential format. The second commit message has more detail.

I chose the (abs_int.len() as i64 - self.scale - 1) as the criteria for the cutoff because I didn't want it to trigger only based on the scale. I imagined a scenario where the scale is large (either negative or positive), but it is inflated because the int_val has a large number of digits.

I'm specifically targeting cases where someone could input a small string like 1E100000 and then crash a program by making it do a huge allocation.


Thanks for making and maintaining the bigdecimal crate! We find it very useful

@akubera
Copy link
Owner

akubera commented Jan 17, 2024

This is much needed, thanks. I think started that at some point, but I can't find the commits 😬 .

Like the other traits, I'd like to break this out into an impl_fmt.rs file. So I'll do that in trunk and rebase your branch onto that to keep accurate attribution.

I'll break your function-scope format_full_scale and format_exponential out into module-scope, so they may be individually, without dependency on EXPONENTIAL_FORMAT_THRESHOLD.

@akubera
Copy link
Owner

akubera commented Jan 17, 2024

Regarding EXPONENTIAL_FORMAT_THRESHOLD, I think I'll make that a compile-time configurable; like RUST_BIGDECIMAL_DEFAULT_PRECISION.

Most people wont care, but someone might find it useful.

It looks like Python's default decimal formatter only keeps 5 leading zeros before going to exponential form:

>>> Decimal("1e-6")
Decimal('0.000001')
>>> Decimal("1e-7")
Decimal('1E-7')

>>> Decimal("11e-7")
Decimal('0.0000011')
>>> Decimal("11e-8")
Decimal('1.1E-7')

And trailing zeros are precision-dependent, which makes sense as those could indicate accuracy:

>>> Decimal("1e0")
Decimal('1')
>>> Decimal("1e1")
Decimal('1E+1')
>>> Decimal("1e3")
Decimal('1E+3')

>>> Decimal("10e1")
Decimal('1.0E+2')

>>> Decimal("10000e3")
Decimal('1.0000E+7')

__str__ follows same rules:

>>> str(Decimal("1e-6"))
'0.000001'
>>> str(Decimal("1e-7"))
'1E-7'
>>> str(Decimal("100e7"))
'1.00E+9'

@akubera
Copy link
Owner

akubera commented Jan 17, 2024

I found my feature/impl-fmt branch. I rebased that onto trunk and I'll get your changes in there somewhere.

@akubera akubera self-assigned this Jan 17, 2024
@akubera
Copy link
Owner

akubera commented Jan 17, 2024

I applied your changes to the split file instead of src/lib.rs.

exponential-format

Only significant change was moving the format_exponential out of the fmt method, and leaving the other code in the function as-is (i.e. jump to format_exponential if length condition holds, otherwise use the old code)

I may change this back to your two-function approach, and move the whole thing into yet another function that parameterizes the threshold-value, so we can write nicer tests.

@akubera
Copy link
Owner

akubera commented Jan 17, 2024

One thing to note: I set EXPONENTIAL_FORMAT_THRESHOLD = 2 (small value) and the exponential form is chosen when given a value like 7000:

println!("{}", BigDecimal::from(7000)); -> "7.000E3"

Is that expected behavior?

@declanvk
Copy link
Author

Regarding EXPONENTIAL_FORMAT_THRESHOLD, I think I'll make that a compile-time configurable; like RUST_BIGDECIMAL_DEFAULT_PRECISION.

Sounds good to me! That is a cool technique you're using in the build.rs

Is that expected behavior?

This makes sense, but I did not anticipate it ahead of time. Sinceabs_int = "7000" and scale = 0, then we get "7000".len() - 0 > 2 which evaluates to true and goes to the exponential path.

I think this points out a weakness in the condition, I previously assumed that the trailing zeroes would not have this affect.

What is your preference on this? Should I make it count the abs_int.len() - num_trailing_zeroes?

@akubera
Copy link
Owner

akubera commented Jan 17, 2024

The special case scale == 0 always means the BigDecimal is an integer, and the BigInt may be printed verbatim. So that's one.

@akubera
Copy link
Owner

akubera commented Jan 18, 2024

Comparing to other libraries:

// Java
import java.math.*;
class BigDecimalTest {
    public static void main(String[] args) {
        String[] strs = new String[]{
            "0.123",
            "0.000001",
            "0.0000001",
            "100",
            "100e3",
            "1000000000",
            "1000000000.00000",
            "1000000000e1",
        };
        for (String s : strs) {
           BigDecimal n = new BigDecimal(s);
           System.out.println(s + " -> " + n);
        }
    }
}

outputs

0.123 -> 0.123
0.000001 -> 0.000001
0.0000001 -> 1E-7
100 -> 100
100e3 -> 1.00E+5
1000000000 -> 1000000000
1000000000.00000 -> 1000000000.00000
1000000000e1 -> 1.000000000E+10
# Python
from decimal import Decimal

strs = [
    "0.123",
    "0.000001",
    "0.0000001",
    "100",
    "100e3",
    "1000000000",
    "1000000000.00000",
    "1000000000e1",
]

for s in strs:
    n = Decimal(s)
    print(f"{s} -> {n}")
0.123 -> 0.123
0.000001 -> 0.000001
0.0000001 -> 1E-7
100 -> 100
100e3 -> 1.00E+5
1000000000 -> 1000000000
1000000000.00000 -> 1000000000.00000
1000000000e1 -> 1.000000000E+10

(good, it's the same)

Ruby:

require 'bigdecimal'

strs = [
    "0.123",
    "0.000001",
    "0.0000001",
    "100",
    "100e3",
    "1000000000",
    "1000000000.00000",
    "1000000000e1",
]

for s in strs do
  n = BigDecimal s
  print(s, " -> ", n, "\n")
end
0.123 -> 0.123e0
0.000001 -> 0.1e-5
0.0000001 -> 0.1e-6
100 -> 0.1e3
100e3 -> 0.1e6
1000000000 -> 0.1e10
1000000000.00000 -> 0.1e10
1000000000e1 -> 0.1e11

...... well I regret trying that one. They ALL start with 0. !?

@akubera
Copy link
Owner

akubera commented Jan 18, 2024

0.1  ->  0.1
0.10  ->  0.10
0.100  ->  0.100
0.1000  ->  0.1000
...
0.100000000  ->  0.100000000
...
0.100000000000000000000000000  ->  0.100000000000000000000000000

So trailing zeros don't matter.

I think the rule is

// compare scale to zero
match self.scale.cmp(&0) {
   // simply print integer if scale is zero
   Equal => print(abs_int),

   // decimal point is "to the right" of the end of the integer  
   // always print exponential form
   // (example 100e3 => 100xxx. )
   Less => print_exponential_form(...),

   // greater means decimal point is to the left of the end of the integer 
   // we should do exponential form if the decimal point is past the left of
   // the start (significant) digit 
   //       123e-2 => 1.23         (scale - len = 2 - 3 = -1)
   //       123e-5 => 0.00123      (scale - len = 5 - 3 = 2)
   //       123e-9 => 0.000000123  (scale - len = 9 - 3 = 6 > threshold => 1.23e-7)
   _ => if self.scale - abs_int.len() as i64  > THRESHOLD  {
          print_exponential_form(...)
        } else {
          print_full_form(...)
        }
}

@akubera
Copy link
Owner

akubera commented Jan 18, 2024

Ok: We have decimal with digits

          dddddddddd

Only thing that determines exponential is placement of the decimal point. Value or number of digits do not matter:

          dddddddddd0…00.    <- exponential
          dddddddddd.        <- full (no decimal point to be printed)
          ddddd.dddd         <- full
         .dddddddddd         <- full
     .0000dddddddddd         <- full
.0000…0000dddddddddd     <- exponential if n > threshold
 |-- n --|

So I'm going with

// check if scale is outside of "boundary"
if scale < 0 || (scale as usize > abs_int.len() + threshold) {
    exponential
} else {
    full
}

@akubera
Copy link
Owner

akubera commented Jan 18, 2024

These tests are passing for threshold = 2 👏

    impl_case!(case_0d123: "0.123" => "0.123");
    impl_case!(case_0d0123: "0.0123" => "0.0123");
    impl_case!(case_0d00123: "0.00123" => "0.00123");
    impl_case!(case_0d000123: "0.000123" => "1.23E-4");

    impl_case!(case_123d: "123." => "123");
    impl_case!(case_123de1: "123.e1" => "1.23E3");
test impl_fmt::test_display::case_0d000123 ... ok
test impl_fmt::test_display::case_123d ... ok
test impl_fmt::test_display::case_0d00123 ... ok
test impl_fmt::test_display::case_123de1 ... ok
test impl_fmt::test_display::case_0d123 ... ok
test impl_fmt::test_display::case_0d0123 ... ok

the last one there is almost consistent with Python. Do we force the +?

>>> Decimal("123.e1")
Decimal('1.23E+3')

@declanvk
Copy link
Author

Nice work on adjusting the rule! I like the latest iteration, focusing in on the placement of the digits really brings it into focus

the last one there is almost consistent with Python. Do we force the +?

Yeah I think go for it! Looking at the python and java examples, it seems like they always include the sign on the exponent

src/impl_fmt.rs Outdated
}

if exponent != 0 {
abs_int += "E";
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Do you want to make this lowercase e to match the other notations in this file? Or prefer to match other big decimal libraries with an uppercase E?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update the others to use "E"

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they've all been addressed.

src/impl_fmt.rs Outdated
}


fn dynamically_format_decimal(
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea,not blocking: Another idea I was playing around with was incorporating the use of the alternate format flag #, see https://doc.rust-lang.org/std/fmt/index.html#sign0.

We could use the alternate format to choose a different format, or force the output into the format_exponential or format_full_scale branches

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're saying println!("{:#}", d) should use "format_full_scale"? (No threshold check?)

I like it.

That paired with std::fmt::{UpperExp,LowerExp} should give the user full control.

println!("{:e}", d);    //  1.2e+1
println!("{:E}", d);    //  1.2E+1
println!("{:#}", d);    //  12
println!("{}", d);      //  12
println!("{:?}", d);    //  BigDecimal("12E+0")

I've amended format_exponential to take an "e_symbol" parameter so we can choose between e & E (and any other values as required in the future, though I'm not sure if any localization changes that particular char).

Do we leave the case that a user printing "{:#}", BigDecimal("1e-1000000")) will print many zeros? I.e. do we have a larger threshold for number of zeros for "alternate" mode?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes break a lot of tests so I'm going through those before pushing what I've done.

src/impl_fmt.rs Outdated Show resolved Hide resolved
@declanvk
Copy link
Author

I left some nonblocking comments (on my own PR hahaha), let me know what you think! I can go ahead and implement anything, I feel bad making you do all this work!

Thanks again for looking at this

@akubera
Copy link
Owner

akubera commented Jan 18, 2024

I didn't mean to force push onto your branch, but if you like the changes I'm glad I did.
Thanks for your comments, they look good.
I'll rebase again to correct the last old-rust compatibility issue.

I removed the updated threshold check, I'm working on redoing that commit to also update the tests.

I'll look at all this again tonight.

@declanvk
Copy link
Author

I didn't mean to force push onto your branch, but if you like the changes I'm glad I did.

I'm glad you did! The changes you made look good

@akubera akubera force-pushed the exponential-format branch 2 times, most recently from 3b33646 to 379e1c2 Compare January 21, 2024 03:06
@akubera
Copy link
Owner

akubera commented Jan 21, 2024

I'm about ready to call this done. I wrote a little code-writing python script to generate unit-test mods so the Rust bigdecimal formatting of {:XYZ} aligns with the same format in Python.

@akubera
Copy link
Owner

akubera commented Mar 6, 2024

This was merged and released last night in 0.4.3.

I forgot to use the GitHub ui, oops. To make it look merged I'm going to merge it now into trunk and then force push updates away.

@akubera akubera merged commit 2ac1c62 into akubera:trunk Mar 6, 2024
11 checks passed
@declanvk
Copy link
Author

Thanks for all your work on this @akubera !

@declanvk declanvk deleted the exponential-format branch March 11, 2024 16:39
@akubera
Copy link
Owner

akubera commented Mar 11, 2024

'twas a long time coming. Let me know right away if there's any problems!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants