Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Float decoding problem #69

Closed
gmnash opened this issue Dec 19, 2012 · 28 comments
Closed

Float decoding problem #69

gmnash opened this issue Dec 19, 2012 · 28 comments

Comments

@gmnash
Copy link

gmnash commented Dec 19, 2012

I am seeing a floating point problem that I do not see in json or cjson. The first two tests below pass while the one using ujson fails. I am running Ubuntu 12.04 if that helps.

import unittest
import json
import cjson
import ujson


class TestUJsonFloat(unittest.TestCase):
    def test_json(self):
        sut = {u'a': 4.56}
        encoded = json.dumps(sut)
        decoded = json.loads(encoded)
        self.assertEqual(sut, decoded)

    def test_cjson(self):
        sut = {u'a': 4.56}
        encoded = cjson.encode(sut)
        decoded = cjson.decode(encoded)
        self.assertEqual(sut, decoded)

    def test_ujson(self):
        sut = {u'a': 4.56}
        encoded = ujson.encode(sut)
        decoded = ujson.decode(encoded)
        self.assertEqual(sut, decoded)
@jskorpan
Copy link

Unless there's some rounding issue hidden somewhere in UltraJSONs floating point decoder (and/or encoder) I really can't tell if this is an artifact caused by the very technical limitations of floating point numbers.

Math geniuses to the rescue please :)

Parts of this code is borrowed from the MODP_ASCII project at http://code.google.com/p/stringencoders/

@gmnash
Copy link
Author

gmnash commented Dec 21, 2012

I hacked around with the code a little and tried using the standard strtod() function inside decode_numeric(), and this passes the test. This is the same function that Python uses under the covers too, for string->float conversion.

FASTCALL_ATTR JSOBJ FASTCALL_MSVC decode_numeric ( struct DecoderState *ds)
{   
    char* end = 0;
    double d = strtod(ds->start, &end);
    ds->lastType = JT_DOUBLE;
    ds->start = end;
    RETURN_JSOBJ_NULLCHECK(ds->dec->newDouble (d));
}

This is obviously just a test to prove that the value is correct. More work would be required to support integers, NaN, Infinity, etc.

I know this is a departure from using the stringencoders code, but it seems to return the correct value. What do you think about the approach of using the standard C functions instead?

@joewalnes
Copy link

+1 for using strtod(). I'm trying to migrate a whole bunch of code from cjson to ujson, and this is the only blocker.

-Joe

I work with nashg482.

@jskorpan
Copy link

jskorpan commented Jan 7, 2013

I think I have solved this problem and if we're lucky also further speeded up decoding of integers and decimals. Needs a bit more testing. I'll check back shortly.

@jskorpan
Copy link

Resolved in master branch

@gmnash
Copy link
Author

gmnash commented Jan 16, 2013

Thanks! When will it be available via pip?

@jskorpan
Copy link

Now ☺
//JT

From: Graham Nash [mailto:notifications@github.com]
Sent: den 16 januari 2013 17:23
To: esnme/ultrajson
Cc: Jonas Tärnström
Subject: Re: [ultrajson] Float decoding problem (#69)

Thanks! When will it be available via pip?


Reply to this email directly or view it on GitHubhttps://github.com//issues/69#issuecomment-12326275.

@gmnash
Copy link
Author

gmnash commented Jan 16, 2013

It looks like it cuts off floating point numbers after 5 decimal places:

import unittest
import json
import ujson
import cjson


class TestUJsonFloat(unittest.TestCase):
    def test_json(self):
        sut = {u'a': 4.567891}
        encoded = json.dumps(sut)
        decoded = json.loads(encoded)
        self.assertEqual(sut, decoded)

    def test_cjson(self):
        sut = {u'a': 4.567891}
        encoded = cjson.encode(sut)
        decoded = cjson.decode(encoded)
        self.assertEqual(sut, decoded)

    def test_ujson(self):
        sut = {u'a': 4.567891}
        encoded = ujson.encode(sut)
        decoded = ujson.decode(encoded)
        self.assertEqual(sut, decoded)

@jskorpan jskorpan reopened this Jan 17, 2013
@jskorpan
Copy link

There's a kwArg option for that to encode/dumps called double_precision.
Default is 5, I'm fine with changing that default if that seems more reasonable-

@gmnash
Copy link
Author

gmnash commented Jan 17, 2013

I isolated the following case that breaks using the kwArg:

import unittest
import json
import ujson


class TestUJsonFloat(unittest.TestCase):
    def test_random(self):
        sut = {u'a': -528656961.4399388}
        encoded = json.dumps(sut)
        decoded = json.loads(encoded)
        self.assertEqual(sut, decoded)

        encoded = ujson.encode(sut, double_precision=100)
        decoded = ujson.decode(encoded)
        self.assertEqual(sut, decoded)

For reference I have been using this for testing:

import random
import unittest
import json
import ujson


class TestUJsonFloat(unittest.TestCase):
    def test_random_range(self):
        random.seed(0)
        JSON_MAX = pow(2, 53)
        for i in range(0, 100000):
            value = random.uniform(-JSON_MAX, JSON_MAX)
            sut = {u'a': value}

            try:
                encoded = json.dumps(sut)
                decoded = json.loads(encoded)
                self.assertEqual(sut, decoded)
            except Exception as e:
                print "json: i={}, value={}, error={}".format(i, value, e)
                raise

            try:
                encoded = ujson.encode(sut, double_precision=100)
                decoded = ujson.decode(encoded)
                self.assertEqual(sut, decoded)
            except Exception as e:
                print "ujson: i={}, value={}, error={}".format(i, value, e)
                raise

@jskorpan
Copy link

Interesting error, I’ll look into it shortly

In the mean time I'm pulling 1.28 from PyPi as I deem the error serious.

//JT

From: Graham Nash [mailto:notifications@github.com]
Sent: den 17 januari 2013 16:36
To: esnme/ultrajson
Cc: Jonas Tärnström
Subject: Re: [ultrajson] Float decoding problem (#69)

I isolated the following case that breaks using the kwArg:

import unittest

import json

import ujson

class TestUJsonFloat(unittest.TestCase):

def test_random(self):

    sut = {u'a': -528656961.4399388}

    encoded = json.dumps(sut)

    decoded = json.loads(encoded)

    self.assertEqual(sut, decoded)



    encoded = ujson.encode(sut, double_precision=100)

    decoded = ujson.decode(encoded)

    self.assertEqual(sut, decoded)

For reference I have been using this for testing:

import random

import unittest

import json

import ujson

class TestUJsonFloat(unittest.TestCase):

def test_random(self):

    for i in range(0, 1000):

        value = random.uniform(-1000000000, 1000000000)

        sut = {u'a': value}



        try:

            encoded = json.dumps(sut)

            decoded = json.loads(encoded)

            self.assertEqual(sut, decoded, "json: i={}, value={}".format(i, value))

        except Exception as e:

            print "ujson: i={}, value={}, error={}".format(i, value, e)

            raise



        try:

            encoded = ujson.encode(sut, double_precision=100)

            decoded = ujson.decode(encoded)

            self.assertEqual(sut, decoded, "ujson: i={}, value={}".format(i, value))

        except Exception as e:

            print "ujson: i={}, value={}, error={}".format(i, value, e)

            raise


Reply to this email directly or view it on GitHubhttps://github.com//issues/69#issuecomment-12372848.

@jskorpan
Copy link

I've hopefully fixed the issues now. I did some redesign of the mantissa decoder.

The expected funtionality at the moment is that it should decode all numeric values without decimals or exponents as 32-bit signed integers if they are 9 digits long or less and as 64-bit signed integers for all numbers up to LLONG_MIN and LLONG_MAX

Looking forwards to you guys tearing it apart for the sake of further narrowing down and defining its capabilities from a can and can not do perspective. I would be especially interested in test cases for where the decoding to double precision floats would be consider failing, possibly based on comparsion.

Big thanks to @gmnash Keep it up!

@gmnash
Copy link
Author

gmnash commented Jan 22, 2013

The test I posted previously, test_random_range, still fails after a few iterations on the latest origin/master. Is there a reason to not try strtod? Are there performance problems, and if so are there any benchmarks available? As floating point is notoriously hard, it might be a good idea to start with a battle hardened solution that we know works first.

@jskorpan
Copy link

I wouldn't want to relay on external functions since we can't be sure how they perform. Also, I'd rather have fast and consistent numeric decoder in ujson than a potentially slow one which supports everything.

On 22 jan 2013, at 17:17, "Graham Nash" <notifications@github.commailto:notifications@github.com> wrote:

The test I posted previously, test_random_range, still fails after a few iterations on the latest origin/master. Is there a reason to not try strtod? Are there performance problems, and if so are there any benchmarks available? As floating point is notoriously hard, it might be a good idea to start with a battle hardened solution that we know works first.


Reply to this email directly or view it on GitHubhttps://github.com//issues/69#issuecomment-12551554.

@gmnash
Copy link
Author

gmnash commented Jan 22, 2013

Wouldn't it be prudent to empirically test how strtod performs before
dismissing it out of hand? Wouldn't it be inconsistent to have asymmetry
in encode/decode? By supports everything do you mean all valid JSON
numbers? Do you plan on listing the values for which the asymmetry occurs
in the README? Do you feel confident you can find them all?

On Tue, Jan 22, 2013 at 10:55 AM, Jonas Tärnström
notifications@github.comwrote:

I wouldn't want to relay on external functions since we can't be sure how
they perform. Also, I'd rather have fast and consistent numeric decoder in
ujson than a potentially slow one which supports everything.

On 22 jan 2013, at 17:17, "Graham Nash" <notifications@github.com<mailto:
notifications@github.com>> wrote:

The test I posted previously, test_random_range, still fails after a few
iterations on the latest origin/master. Is there a reason to not try
strtod? Are there performance problems, and if so are there any benchmarks
available? As floating point is notoriously hard, it might be a good idea
to start with a battle hardened solution that we know works first.


Reply to this email directly or view it on GitHub<
https://github.com/esnme/ultrajson/issues/69#issuecomment-12551554>.


Reply to this email directly or view it on GitHubhttps://github.com//issues/69#issuecomment-12553897.

@jskorpan
Copy link

Valid questions, I’m concerned about the speed of strtod and its individual implementations.
I’ll do some trials in a day or two, thanks for your engagement. I see your point.

//JT

From: Graham Nash [mailto:notifications@github.com]
Sent: den 22 januari 2013 18:21
To: esnme/ultrajson
Cc: Jonas Tärnström
Subject: Re: [ultrajson] Float decoding problem (#69)

Wouldn't it be prudent to empirically test how strtod performs before
dismissing it out of hand? Wouldn't it be inconsistent to have asymmetry
in encode/decode? By supports everything do you mean all valid JSON
numbers? Do you plan on listing the values for which the asymmetry occurs
in the README? Do you feel confident you can find them all?

On Tue, Jan 22, 2013 at 10:55 AM, Jonas Tärnström
notifications@github.comwrote:

I wouldn't want to relay on external functions since we can't be sure how
they perform. Also, I'd rather have fast and consistent numeric decoder in
ujson than a potentially slow one which supports everything.

On 22 jan 2013, at 17:17, "Graham Nash" <notifications@github.com<mailto:
notifications@github.com>> wrote:

The test I posted previously, test_random_range, still fails after a few
iterations on the latest origin/master. Is there a reason to not try
strtod? Are there performance problems, and if so are there any benchmarks
available? As floating point is notoriously hard, it might be a good idea
to start with a battle hardened solution that we know works first.


Reply to this email directly or view it on GitHub<
https://github.com/esnme/ultrajson/issues/69#issuecomment-12551554>.


Reply to this email directly or view it on GitHubhttps://github.com//issues/69#issuecomment-12553897.


Reply to this email directly or view it on GitHubhttps://github.com//issues/69#issuecomment-12555199.

@jskorpan
Copy link

Added precise_float=True (default false) option to decoder to override default behavior and use strtod for all numbers that has decimals or exponents.

Integer behavoir will still be retained.

@gmnash
Copy link
Author

gmnash commented Feb 14, 2013

My tests now pass. I don't see a consistent performance difference using precise_float=True. Have you noticed a difference? If not, wouldn't it make more sense from a usability standpoint (and code maintenance) to make the code path of precise_float=True the only path?

Also when do you plan on doing a release to PyPI?

Thanks again. Looking forward to switching off of cjson and json!

@jskorpan
Copy link

I’ll consider your suggestion. My biggest scare are users using the existing “imprecise” floating point decoder taking a completely unnecessary performance hit for an improvement they didn’t ask for. I’m going to do some benchmarks on this later on.

The PyPI release is basically due next week I guess. I’m just not in the need to rush this one.
//JT

From: Graham Nash [mailto:notifications@github.com]
Sent: den 14 februari 2013 16:46
To: esnme/ultrajson
Cc: Jonas Tärnström
Subject: Re: [ultrajson] Float decoding problem (#69)

My tests now pass. I don't see a consistent performance difference using precise_float=True. Have you noticed a difference? If not, wouldn't it make more sense from a usability standpoint (and code maintenance) to make the code path of precise_float=True the only path?

Also when do you plan on doing a release to PyPI?

Thanks again. Looking forward to switching off of cjson and json!


Reply to this email directly or view it on GitHubhttps://github.com//issues/69#issuecomment-13557415.

@gmnash
Copy link
Author

gmnash commented Feb 15, 2013

Sounds good. Let me know how those benchmarks turn out. Thanks again.

@gmnash
Copy link
Author

gmnash commented Mar 15, 2013

How did the benchmarks turn out? Are we close to a release? Thanks.

@jskorpan
Copy link

I haven't been able to devote the time needed for this sorry.
Please use the kwarg switch if that is suitable or patch the code behavior.

//JT

On 15 mar 2013, at 22:58, "Graham Nash" <notifications@github.commailto:notifications@github.com> wrote:

How did the benchmarks turn out? Are we close to a release? Thanks.


Reply to this email directly or view it on GitHubhttps://github.com//issues/69#issuecomment-14987146.

@trottier
Copy link

trottier commented May 3, 2013

So, quick question: why does "precision" here seem to mean "places after the decimal point"? Shouldn't it just be the total number of significant digits, no matter where they are?

@jskorpan
Copy link

jskorpan commented May 6, 2013

You are probably right, we’ve attacked this problem from more of a performance perspective than what’s mathematically a correct definition of precision.

Feel free contribute to a name change for the argument, but for backward compatibility reasons we’re going to have to keep the old name as well.

From: Leo Trottier [mailto:notifications@github.com]
Sent: den 3 maj 2013 20:22
To: esnme/ultrajson
Cc: Jonas Tärnström
Subject: Re: [ultrajson] Float decoding problem (#69)

So, quick question: why does "precision" here seem to mean "places after the decimal point"? Shouldn't it just be the total number of significant digits, no matter where they are?


Reply to this email directly or view it on GitHubhttps://github.com//issues/69#issuecomment-17410062.

@trottier
Copy link

trottier commented May 6, 2013

I was under the impression that relative scale independence was at the heart of the very idea of "floating point".

Otherwise, how will ujson handle 1.4590808237e100 ? or 1.45909280972e-50 ? What does "precision", as ujson is using it, even mean in such cases?

Indeed, if I do:

ujson.dumps(1e-40) ==> '0.0'

Contrast this to:

json.dumps(1e-40) ==> '1e-40'

@vlovich
Copy link

vlovich commented Oct 23, 2014

There are lots of reasons that strtod should not be used for parsing JSON numbers (or really any numbers), & it has nothing to do with performance.

  1. The API only has ERANGE for reporting overflow/underflow. You can't know which exactly.
  2. 64-bit integers (which are allowed in JSON) are broken: you only get 53 bits. At one point I remember there was a huge bug where facebook was using 64-bit numbers in their IDs.
  3. Integers are broken: you can't expose in the API if you had to truncate to get an integer value.
  4. Infinities & NaN are unhandled (which is supported by the standard)
  5. It supports a hex-notation which is unsupported by JSON
  6. It can't tell you if you lost precision due to floating-point representation.

If you want a fully conforming JSON number parsing implementation, you should check out:
https://github.com/openwebos/libpbnjson/blob/e80704d2f1f36a4dc666926a6b5e1be09959c009/src/pbnjson_c/jvalue/num_conversion.c

@jskorpan
Copy link

I guess this is a discussion which ultimately depends on how closely you consider JSON tied to JavaScript. Any numeric parsing implementation widely surpassing the capabilities of what JavaScript could handle is from our perspective not worth pursuing as we would expect most JSON to end up in JavaScript land anyhow.

If the intention is to marshal fixed point precision numerals outside of the JavaScript world, maybe there are other alternatives to JSON or UltraJSON

@vlovich
Copy link

vlovich commented Oct 23, 2014

UltraJSON is in the C/C++/Python world outside of Javascript.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants