
Re-implemented the ecvt function. #12894

Merged
merged 21 commits into dotnet:master from mazong1123:fix-10651 on Sep 13, 2017

Conversation

@mazong1123 (Collaborator) commented Jul 19, 2017

Instead of leveraging snprintf, re-implement the ecvt function according to the paper: https://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf

Note:

  1. This commit won't fix any existing bug.
  2. This is a raw implementation of the paper. The performance gain on Linux is only about 10%. We could tune the performance further.

UPDATE:

Now the new implementation is at least 2x faster than the old one.

Fix #10651

Re-implemented the ecvt function.

@mazong1123 changed the title from "Re-implemented the ecvt function." to "[WIP] Re-implemented the ecvt function." on Jul 19, 2017

@mazong1123 (Collaborator, Author) commented Jul 19, 2017

@tarekgh @jkotas This commit is a raw implementation of the paper https://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf . The performance gain is not large (only about 10%). I'll tune the performance further.

I created this PR now so that I can go through the CI checks and fix any issues early.

Note I did not change the "15 digit first, then 17 digit precision" round trip conversion rule. I think that should be done in another PR.

@mazong1123 (Collaborator, Author) commented Jul 19, 2017

@tarekgh @jkotas PTAL.
I ran some tests on this PR. Following are the details:

Accuracy

I ran the following code against this PR:

using System;

namespace dtost
{
    class Program
    {
        static void TestDoubleConversion(double number)
        {
            string str = number.ToString("R");
            double back = double.Parse(str);
            if (number != back)
            {
                Console.WriteLine("Test failed: converted str: {0}. expected double {1}", back, number);
            }
            else
            {
                Console.WriteLine("Pass");
            }
        }

        static void Main(string[] args)
        {
            TestDoubleConversion(104234.343);
            TestDoubleConversion(-1.7976931348623157E+308);
            TestDoubleConversion(1.7976931348623157e+308);
            TestDoubleConversion(1.7976931348623157e+308);
            TestDoubleConversion(70.9228162514264339123);
            TestDoubleConversion(3.1415926535897937784612345);
            TestDoubleConversion(29999999999999792458.0);
            TestDoubleConversion(1000.4999999999999999999);
            TestDoubleConversion(1000.9999999999999999999);
            TestDoubleConversion(999.99999999999999999999);
            TestDoubleConversion(0.84551240822557006);
            TestDoubleConversion(1.0);
            TestDoubleConversion(10.0);
            TestDoubleConversion(0);
            TestDoubleConversion(0.0);
            TestDoubleConversion(0.0000000000000199);
            TestDoubleConversion(-0.0000123);
        }
    }
}

The output is:

Pass
Pass
Pass
Pass
Pass
Pass
Pass
Pass
Pass
Pass
Test failed: converted str: 0.84551240822557. expected double 0.84551240822557
Pass
Pass
Pass
Pass
Pass
Pass

Running the code under the current .NET Core release gives exactly the same result. The only failure is caused by a known issue: https://stackoverflow.com/questions/24299692/why-is-a-round-trip-conversion-via-a-string-not-safe-for-a-double

So the accuracy of the new implementation should be the same as the current .NET Core release.

Performance

I ran the following micro-benchmark:

using System;
using System.Diagnostics;
using System.Globalization;

namespace hlold
{
    class Program
    {
        static void Main(string[] args)
        {
            double number = 104234.343;
            CultureInfo cultureInfo = new CultureInfo("fr");
            for (int i = 0; i < 20; i++)
            {
                Stopwatch watch = new Stopwatch();
                watch.Start();
                for (int j = 0; j < 10000; j++)
                {
                    number.ToString(cultureInfo); number.ToString(cultureInfo); number.ToString(cultureInfo);
                    number.ToString(cultureInfo); number.ToString(cultureInfo); number.ToString(cultureInfo);
                    number.ToString(cultureInfo); number.ToString(cultureInfo); number.ToString(cultureInfo);
                }

                Console.Write(watch.ElapsedMilliseconds);
                Console.WriteLine();
            }
        }
    }
}

The new implementation is 3x faster than the old one:

Output of new implementation:

52
52
50
51
50
51
51
51
51
50
50
50
51
50
50
59
50
50
51
51

Output of old implementation:

176
195
182
193
186
190
186
189
184
189
179
193
187
193
181
191
193
185
198
180

If we change the number to double.MinValue / 2, the new implementation is still 2x faster than the old one:

Output of new implementation:

191
189
188
190
192
190
188
189
187
190
192
187
188
187
188
189
192
188
187
187

Output of old implementation:

392
382
381
383
386
400
390
381
381
387
390
382
382
393
386
382
386
386
390
382

So the new implementation is at least 2x faster than the old one. I think this shows that the new algorithm is much more efficient than the old one.

@mazong1123 changed the title from "[WIP] Re-implemented the ecvt function." to "Re-implemented the ecvt function." on Jul 19, 2017

@dotnet deleted a comment from dotnet-bot on Jul 19, 2017

@dotnet deleted a comment from dotnet-bot on Jul 19, 2017

@tarekgh (Member) commented Jul 19, 2017

@mazong1123 this is fantastic; the results look promising. I have enabled the whole corefx repo to run against your changes so we know whether we get any regressions. Also, I am not seeing any large static data being used, which is good news too. If corefx passes, we can move to the next step of checking with legal about using the algorithm and doing the official code review, and then we can finalize the work.

Thanks a lot for getting this done. Good work!

@tarekgh (Member) commented Jul 19, 2017

@safern could you please take a look at the CI failure in ComponentModel?

Ubuntu x64 Checked Build and Test (Jit - CoreFx)

https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x64_checked_ubuntu_corefx_baseline_prtest/62/consoleFull#-21281798079fe3b83-f408-404c-b9e7-9207d232e5fc

this is not urgent but it'll be good if you have a look when you have a chance.

@mazong1123 you may ignore the failure in "Ubuntu x64 Checked Build and Test (Jit - CoreFx)" for now as it looks unrelated to your changes.

@safern (Member) commented Jul 19, 2017

> @safern could you please take a look at the CI failure in ComponentModel?

Will do later :)

@tannergooding (Member) commented Jul 19, 2017

I'm going to give the implementation a review pass sometime later this evening.

const UINT32* pRhsEnd = pRhsCurrent + length;

while (pRhsCurrent != pRhsEnd)
{

@tarekgh (Member) commented Jul 19, 2017

can we just memcpy this part instead of looping?

@mikedn (Contributor) commented Jul 20, 2017

It may be best to measure to see if there's any performance difference.

@tannergooding (Member) commented Jul 20, 2017

memcpy should do the right thing, regardless of size. That being said, the current max size is only 35 bytes, so the perf difference is barely measurable. But this size is way too small and should be well over 138 bytes (with the exact size varying based on the maximum number of digits we consider). Correction: the current max size is 35 dwords (140 bytes).

@tarekgh (Member) commented Jul 20, 2017

Why don't we just memcpy based on m_len, even if you think the perf difference is barely measurable?
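For illustration, a minimal sketch of the memcpy-based copy being discussed, assuming the m_blocks/m_len layout shown in the review snippets (the helper name here is hypothetical, not part of the PR):

// Hypothetical helper: copy the blocks of rhs in one call instead of looping.
// UINT32 is CoreCLR's 32-bit unsigned integer type; m_len counts blocks in use.
void BigNum::CopyBlocksFrom(const BigNum& rhs)
{
    memcpy(m_blocks, rhs.m_blocks, rhs.m_len * sizeof(UINT32));
    m_len = rhs.m_len;
}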




while (true)
{
if (*pCurrentLeftBlock > *pCurrentRightBlock)

@tarekgh (Member) commented Jul 19, 2017

nit: maybe we can do it as

INT64 diff = (INT64) (*pCurrentLeftBlock) - (INT64) (*pCurrentRightBlock);
if (diff != 0)
return diff;

@tannergooding (Member) commented Jul 19, 2017

👍, but we should validate that this doesn't perform any more poorly on 32-bit platforms

const UINT32* pCurrentLeftBlock = lhs.m_blocks + lastIndex;
const UINT32* pCurrentRightBlock = rhs.m_blocks + lastIndex;

while (true)

@tarekgh (Member) commented Jul 19, 2017


nit: while (pCurrentLeftBlock <= pLeftStartBlock) is better here and remove if (pCurrentLeftBlock == pLeftStartBlock) from the loop

@mazong1123 (Collaborator, Author) commented Jul 20, 2017

I think you meant while (pCurrentLeftBlock >= pLeftStartBlock) :). I'll take this optimization.
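For illustration, a minimal sketch of how the restructured loop might read, assuming equal-length operands and that pLeftStartBlock points at the first (least significant) block of lhs; this is a sketch of the suggestion, not the PR's exact code:

// Compare from the most significant block down; the first differing block decides.
const UINT32* pLeftStartBlock    = lhs.m_blocks;
const UINT32* pCurrentLeftBlock  = lhs.m_blocks + lastIndex;
const UINT32* pCurrentRightBlock = rhs.m_blocks + lastIndex;

while (pCurrentLeftBlock >= pLeftStartBlock)
{
    if (*pCurrentLeftBlock != *pCurrentRightBlock)
    {
        return (*pCurrentLeftBlock > *pCurrentRightBlock) ? 1 : -1;
    }

    --pCurrentLeftBlock;
    --pCurrentRightBlock;
}

return 0; // all blocks are equal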


void BigNum::ShiftLeft(UINT64 input, int shift, BigNum& output)
{
int shiftBlocks = shift / 32;

@tarekgh (Member) commented Jul 19, 2017

int shiftBlocks = shift / 32;

shift >> 5

@tannergooding (Member) commented Jul 19, 2017

👍, although the compiler should also be doing this optimization for us.

@tannergooding (Member) commented Jul 20, 2017

Actually, we should probably just leave this. The compiler will optimize the next statement as well (into either a single idiv or into a couple of shifts and a subtract)

@mikedn (Contributor) commented Jul 20, 2017

Using / and % is fine; no native compiler will generate actual div instructions for this code. However, I would suggest making shift unsigned, because unsigned division by a constant is cheaper than signed division. The native compiler may figure this one out too, but it requires interprocedural optimization, so it's somewhat less certain.

@mazong1123 (Collaborator, Author) commented Jul 20, 2017

@mikedn Good catch. I think making shift unsigned is more reasonable here.
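For illustration, a minimal sketch of the signature change being discussed, with the shift count unsigned so that the /32 and %32 reduce to a plain shift and mask (names follow the review snippet; the body is elided):

void BigNum::ShiftLeft(UINT64 input, UINT32 shift, BigNum& output)
{
    // With an unsigned count these compile to a simple SHR and AND.
    UINT32 shiftBlocks       = shift / 32;  // same as shift >> 5
    UINT32 remainingBitShift = shift % 32;  // same as shift & 31

    // ... shift whole blocks first, then the remaining bits ...
}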

realExponent = -1074;
mantissaHighBitIdx = BigNum::LogBase2(realMantissa);
}

@tannergooding (Member) commented Jul 19, 2017

You can shortcut Zero, Infinity, Indeterminate, and NaN by this point

@mazong1123 (Collaborator, Author) commented Jul 20, 2017

These special cases have already been handled before the value is passed to this method: https://github.com/dotnet/coreclr/blob/master/src/classlibnative/bcltype/number.cpp#L146

I think adding asserts is worthwhile, though.

@tannergooding (Member) commented Jul 20, 2017

It looks like zero and negative zero are not handled by the outer code (just NaN/Inf).

@mazong1123 (Collaborator, Author) commented Jul 21, 2017

Yes, we should shortcut zeros here. Thanks. I'd like to borrow the code from https://github.com/dotnet/coreclr/blob/master/src/pal/src/cruntime/misctls.cpp#L185-L193

However, I do not quite understand the comment at https://github.com/dotnet/coreclr/blob/master/src/pal/src/cruntime/misctls.cpp#L184

// we have issue #10290 tracking fixing the sign of 0.0 across the platforms

Did this code introduce issue #10290?

@tannergooding (Member) commented Jul 21, 2017

@mazong1123, no. #10290 is tracking a future improvement (and possible breaking change) that -0.0 and 0.0 should be treated differently when converting to a string.

The code you referenced is fine and is matching the current expected behavior.
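For illustration, a minimal sketch of the kind of zero shortcut being discussed; the output fields (digits, decimalExponent) are placeholders for whatever the surrounding conversion routine actually fills in, so treat this purely as an assumption-laden sketch:

// 0.0 and -0.0 both have all exponent and mantissa bits clear; only the sign
// bit may differ, and per the current behavior they are treated the same
// (see issue #10290 for the sign question).
UINT64 bits = *(UINT64*)&value;
if ((bits & 0x7FFFFFFFFFFFFFFFull) == 0)
{
    digits[0] = '0';          // hypothetical output buffer
    digits[1] = '\0';
    *decimalExponent = 0;     // hypothetical output parameter
    return;
}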

UINT64 realMantissa = 0;
int realExponent = 0;
UINT32 mantissaHighBitIdx = 0;
if (((FPDOUBLE*)&value)->exp > 0)

@tannergooding (Member) commented Jul 19, 2017

I would expect you to be special casing denormal numbers, since the implicit bit is zero instead of one.

@mazong1123 (Collaborator, Author) commented Jul 23, 2017

I think I have included the denormal numbers in the else condition. According to https://en.wikipedia.org/wiki/Double-precision_floating-point_format

In the case of subnormals (e=0) the double-precision number is described by:
(-1)^sign * 2^(1 - 1023) * 0.fraction = (-1)^sign * 2^(-1022) * 0.fraction

2^(-1022) * 0.fraction = 2^(-1022) * mantissa / 2^52 = mantissa * 2^(-1074). So you can see in the else block, realExponent = -1074, realMantissa = mantissa.

I will add comments here.
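For illustration, a minimal sketch of the normal/subnormal split being described; the FPDOUBLE access follows the review snippet, while the normal-number branch is reconstructed here from the IEEE 754 double layout and may differ in detail from the PR's code:

UINT64 mantissa = *((UINT64*)&value) & 0x000FFFFFFFFFFFFFull; // low 52 fraction bits

UINT64 realMantissa = 0;
int realExponent = 0;
UINT32 mantissaHighBitIdx = 0;

if (((FPDOUBLE*)&value)->exp > 0)
{
    // Normal: implicit leading 1, value = (2^52 + fraction) * 2^(exp - 1023 - 52).
    realMantissa = mantissa | (1ull << 52);
    realExponent = ((FPDOUBLE*)&value)->exp - 1075;
    mantissaHighBitIdx = 52;
}
else
{
    // Subnormal: no implicit bit, value = fraction * 2^(-1074).
    realMantissa = mantissa;
    realExponent = -1074;
    mantissaHighBitIdx = BigNum::LogBase2(realMantissa);
}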

}
} m_initializer;

UINT8 m_len;

@tannergooding (Member) commented Jul 19, 2017

nit: Either make this UINT32 or place it after m_blocks (larger fields should come first). The resulting struct is going to be 8-bytes in either case...

@mazong1123 (Collaborator, Author) commented Jul 22, 2017

I'll place m_len after m_blocks. Thanks.

@mikedn (Contributor) commented Jul 22, 2017

You really should make it UINT32 rather than placing it after m_blocks. In general, the only advantage of using small integral types is storage size. In this case you can't actually save any storage due to alignment requirements. Instead you risk making the code larger, because the compiler may need to insert widening conversions in some cases.

@mazong1123 (Collaborator, Author) commented Jul 23, 2017

@mikedn Sounds reasonable. But that means I need to extend the static data as well. Something like BIGSIZE and BIGPOWER10NUM should be extended to UINT32.

Anyway, I'll change m_len and the related static data to UINT32 to see if there's any big impact. I don't want to rely on the compiler for alignment, but the size of the static data should not grow much either.

@mikedn (Contributor) commented Jul 23, 2017

Something like BIGSIZE and BIGPOWER10NUM should be extended to UINT32.

Yes, but only for consistency. I would expect the compiler to treat those as constants so the impact on the generated code should be exactly 0. In fact, they should be constexpr, provided that the version(s) of VC++ used to build this code doesn't barf on it.

static const UINT8 BIGSIZE = 35;
static const UINT8 UINT32POWER10NUM = 8;
static const UINT8 BIGPOWER10NUM = 6;
static UINT32 m_power10UInt32Table[UINT32POWER10NUM];

@tannergooding (Member) commented Jul 19, 2017

static UINT32 const m_power10UInt32Table[] =
{
// Declare data inline
};

@mazong1123 (Collaborator, Author) commented Jul 22, 2017

I have to use constexpr to accomplish this. The code compiles on my current Ubuntu 16.04 box, but I'm not sure whether it compiles on all platforms. I'll make the change and see if it passes the CI later.

@tannergooding (Member) commented Jul 22, 2017

Using constexpr would likely work, but it shouldn't be required.

This should just be a UINT32 constant for the length and then a sequence of bytes representing the data. You likely need a second lookup table that contains the starting index for each entry, but it should just be a cast from UINT32* to BigNum* at that point.
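For reference, a sketch of what the inline-initialized table could look like: eight UINT32 entries (UINT32POWER10NUM) covering 10^0 through 10^7, all of which fit in 32 bits. The actual contents of the PR's table are not shown in this thread, so the values below are an assumption based on the declared size:

static constexpr UINT32 m_power10UInt32Table[UINT32POWER10NUM] =
{
    1,          // 10^0
    10,         // 10^1
    100,        // 10^2
    1000,       // 10^3
    10000,      // 10^4
    100000,     // 10^5
    1000000,    // 10^6
    10000000,   // 10^7
};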


BigNum::BigNum()
:m_len(0)
{

@tannergooding (Member) commented Jul 19, 2017

a comment about explicitly not zeroing the memory due to perf reasons might be nice

break;
}

--pCurrentLeftBlock;

@tannergooding (Member) commented Jul 19, 2017

nit: Lengths of both are the same, you can use a single index to do the access

return *this;
}

int BigNum::Compare(const BigNum& lhs, UINT32 value)

@tannergooding (Member) commented Jul 19, 2017

Is having a compare really better than individual ==, !=, <, <=, >, and >= operators?

@mazong1123 (Collaborator, Author) commented Jul 21, 2017

I can overload these operators but I think we should do this enhancement in another PR because a simple compare method should be enough to implement the algorithm.

@tannergooding (Member) commented Jul 21, 2017

I was mostly asking from a perf perspective. It seems like we are only actually reusing the Compare value in one place, every other place we are just doing a single comparison, which is likely cheaper than the full Compare check.

@mazong1123 (Collaborator, Author) commented Jul 22, 2017

I'll remove BigNum::Compare(const BigNum& lhs, UINT32 value) since it is not used now.

BigNum::Compare(const BigNum& lhs, const BigNum& rhs) is still useful when comparing numerator and denominator so I'll keep it.

int shiftBlocks = shift / 32;
int remaningToShiftBits = shift % 32;

for (int i = 0; i < shiftBlocks; ++i)

@tannergooding (Member) commented Jul 19, 2017

memset(0) would be much faster

@tannergooding (Member) commented Jul 19, 2017

(memset over the entire range that is)

@mikedn (Contributor) commented Jul 20, 2017

Might be, it really depends on how large these bignums usually get. I suspect not very large.

@tannergooding (Member) commented Jul 20, 2017

As per the memcpy comment, memset should do the right thing regardless of size. The current max size is only 35 bytes, but I also believe this is considerably too small. Correction: the current max size is 35 dwords (140 bytes).

@mazong1123 (Collaborator, Author) commented Jul 21, 2017

I'll change the loop to memset.
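For illustration, the memset form being agreed on here, assuming shiftBlocks low blocks of the output need to be zeroed (a sketch, not the PR's final code):

// Zero the low shiftBlocks blocks in one call rather than one block per iteration.
memset(output.m_blocks, 0, shiftBlocks * sizeof(UINT32));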

return true;
}

for (UINT8 i = 0; i < m_len; ++i)

@tannergooding (Member) commented Jul 19, 2017

it seems like we could trivially detect all the cases where the value would become zero and only rely on m_len == 0

@mazong1123 (Collaborator, Author) commented Jul 22, 2017

Did you mean we should ensure that whenever m_len > 0 we have at least one block != 0, so that IsZero() only needs to check m_len == 0? I think IsZero() should not depend on that assumption, since it would rely on an implementation detail outside of it.

@tannergooding (Member) commented Jul 22, 2017

IsZero is part of the BigNum class, so it would only be dependent on a convention set for its own class.

This probably isn't too important so we can investigate later.
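For illustration, a minimal sketch of the convention-based IsZero under discussion, assuming the class maintains the invariant that a zero value is always stored with m_len == 0 (that invariant is exactly what is being debated, so this is not necessarily what the PR does):

bool BigNum::IsZero() const
{
    // Relies on the invariant that no operation leaves m_len > 0 with all blocks zero.
    return m_len == 0;
}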


void BigNum::SetUInt64(UINT64 value)
{
m_len = 0;

@tannergooding (Member) commented Jul 19, 2017

m_blocks[0] = value & 0xFFFFFFFF;
m_blocks[1] = value >> 32;
m_len = (m_blocks[1] == 0) ? 1 : 2;

@mazong1123 (Collaborator, Author) commented Jul 22, 2017

Nice! Thanks!


void BigNum::ExtendBlock(UINT32 newBlock)
{
m_blocks[m_len] = newBlock;

@tannergooding (Member) commented Jul 19, 2017

There are a few cases where an ExtendBlock(UINT32 blockValue, UINT32 numBlocks) would be useful. It can memset and increment m_len the appropriate count all at once.

@mazong1123 (Collaborator, Author) commented Jul 22, 2017

I tried this change, but it slightly slows down the performance. So I'd like to keep this method and use memset when there are bulk blocks that need to be extended.

@tannergooding (Member) commented Jul 22, 2017

Did you create a second ExtendBlock method or just replace this one entirely? I was suggesting that we create a second method and call it if the number of blocks being set (we loop the current method in a couple places) is larger than some threshold.
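For illustration, a minimal sketch of the bulk overload tannergooding describes; the name ExtendBlocks and its shape are assumptions taken from the comment above, not code from the PR:

// Hypothetical bulk version: append numBlocks blocks that all hold blockValue.
// For blockValue == 0 this is a single memset; m_len is bumped once at the end.
void BigNum::ExtendBlocks(UINT32 blockValue, UINT32 numBlocks)
{
    if (blockValue == 0)
    {
        memset(m_blocks + m_len, 0, numBlocks * sizeof(UINT32));
    }
    else
    {
        for (UINT32 i = 0; i < numBlocks; ++i)
        {
            m_blocks[m_len + i] = blockValue;
        }
    }

    m_len += numBlocks;
}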

@tannergooding (Member) left a comment

A few minor nits, but overall 👍

@mazong1123 (Collaborator, Author) commented Aug 13, 2017

@tannergooding I've updated code according to the feedback. Please take a look. Thanks!

@mazong1123 (Collaborator, Author) commented Aug 13, 2017

@dotnet-bot test Ubuntu x64 corefx_baseline
@dotnet-bot test Windows_NT x86 corefx_baseline
@dotnet-bot test Tizen armel Cross Release Build

@mazong1123 (Collaborator, Author) commented Aug 15, 2017

> Do I need to put the license mentioned in #12894 (comment) anywhere? At the beginning of number.cpp or DoubleToNumberWorker?
>
> I am asking and will get back to you.

@tarekgh Do we have any update?

@tarekgh (Member) commented Aug 15, 2017

> Do we have any update?

We are still following up here. It looks like there is some process we need to follow. We'll let you know what needs to be done when we have more details. Thanks for your patience.

@tarekgh (Member) commented Aug 25, 2017

We are still following up here. We'll update as soon as we have news.

@tarekgh (Member) commented Sep 12, 2017

@tarekgh (Member) commented Sep 13, 2017

@mazong1123 thanks a lot for your effort and for getting this ready. Finally, I am going to merge it :-)

@tarekgh merged commit 602ebe9 into dotnet:master on Sep 13, 2017

24 checks passed

CROSS Check
CentOS7.1 x64 Debug Build and Test
FreeBSD x64 Checked Build
OSX10.12 x64 Checked Build and Test
Tizen armel Cross Debug Build
Tizen armel Cross Release Build
Ubuntu arm Cross Release Build
Ubuntu x64 Checked Build and Test
Ubuntu x64 Checked Build and Test (Jit - CoreFx)
Ubuntu x64 Formatting
Ubuntu16.04 arm Cross Debug Build
Windows_NT x64 CoreCLR Perf Tests Correctness
Windows_NT x64 Debug Build and Test
Windows_NT x64 Formatting
Windows_NT x64 Release Priority 1 Build and Test
Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness
Windows_NT x64 min_opt ryujit CoreCLR Perf Tests Correctness
Windows_NT x86 Checked Build and Test
Windows_NT x86 Checked Build and Test (Jit - CoreFx)
Windows_NT x86 CoreCLR Perf Tests Correctness
Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness
Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness
Windows_NT x86 min_opt legacy_backend CoreCLR Perf Tests Correctness
Windows_NT x86 min_opt ryujit CoreCLR Perf Tests Correctness

@mazong1123 deleted the mazong1123:fix-10651 branch on Sep 13, 2017

@jkotas (Member) commented Sep 13, 2017

This introduced a build break on Windows ARM (this config does not build by default in CI):

src\classlibnative\bcltype\bignum.cpp(595): error C3861: 'BitScanReverse64': identifier not found [D:\j\workspace\arm_cross_che---9fe37e18\bin\obj\Windows_NT.arm.Checked\src\classlibnative\bcltype\bcltype.vcxproj]

@jkotas (Member) commented Sep 13, 2017

Fix: #13932

@mazong1123 (Collaborator, Author) commented Sep 13, 2017

@jkotas how did you find the build break? Is there any way to check the ARM build in CI?

@jkotas (Member) commented Sep 13, 2017

No worries. I happened to notice it on a different job. There are custom jobs that can be triggered manually based on the nature of the change.

@dotnet-bot help

prints the list. You will find the Windows ARM build somewhere in there. Obviously, it is not cost effective to run all of them for each job.
