Fixed decimal serialization and deserialization #361

jancespivo · 2019-07-31T08:59:14Z

Hi,
this PR fixes decimal serialization and deserialization. It is also about 2x faster than the original implementation.
It is related to #360.

I think the serialization of fixed bytes in prepare_fixed_decimal can also be simplified. That will follow in another PR.

Best regards
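
For readers unfamiliar with the encoding under discussion: the Avro decimal logical type stores the unscaled integer as big-endian two's-complement bytes, and for a `bytes` schema the payload is preceded by a zigzag varint length. The sketch below illustrates that layout only. The function names are hypothetical and this is not the code from the PR; it assumes Python 3 (`int.to_bytes`) and finite decimals.

```python
from decimal import Decimal


def unscaled_int(value, scale):
    """Return value * 10**scale as an exact int (finite decimals only)."""
    sign, digits, exponent = value.as_tuple()
    shift = exponent + scale
    if shift < 0:
        raise ValueError("value has more fractional digits than scale allows")
    magnitude = int("".join(map(str, digits))) * 10 ** shift
    return -magnitude if sign else magnitude


def encode_long(n):
    """Zigzag + variable-length encoding of an integer in the 64-bit range."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_bytes_decimal(value, scale):
    """Length prefix followed by the big-endian two's-complement payload."""
    unscaled = unscaled_int(value, scale)
    # Reserve one bit for the sign. For exact negative powers of two such as
    # -128 this yields one byte more than strictly minimal, which is still a
    # correct two's-complement encoding.
    size = (unscaled.bit_length() + 8) // 8
    payload = unscaled.to_bytes(size, byteorder="big", signed=True)
    return encode_long(len(payload)) + payload
```

With scale 3, this sketch reproduces the expected bytes of the test cases quoted in the review further down, for example `Decimal("-123.456")` and `Decimal("9999999999999999.456")`.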

codecov · 2019-07-31T14:37:03Z

Codecov Report

Merging #361 into master will decrease coverage by 0.93%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #361      +/-   ##
==========================================
- Coverage   97.98%   97.05%   -0.94%     
==========================================
  Files          26       26              
  Lines        1740     1732       -8     
==========================================
- Hits         1705     1681      -24     
- Misses         35       51      +16

| Impacted Files | Coverage Δ | |
|---|---|---|
| fastavro/_logical_writers_py.py | `98.98% <100%> (-0.09%)` | ⬇️ |
| fastavro/_read_py.py | `97.49% <100%> (-1.19%)` | ⬇️ |
| fastavro/six.py | `94.65% <100%> (-2.65%)` | ⬇️ |
| fastavro/_read_common.py | `78.57% <0%> (-14.29%)` | ⬇️ |
| fastavro/_write_py.py | `95.13% <0%> (-2.66%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 415f140...c36cf3d.

scottbelden · 2019-07-31T14:59:32Z

tests/test_logical_types.py

+        (Decimal("-123.456"), b'\x06\xfe\x1d\xc0'),
+        (Decimal("3245.234"), b'\x061\x84\xb2'),
+        (Decimal("-3245.234"), b'\x06\xce{N'),
+        (Decimal("9999999999999999.456"), b'\x12\x00\x8a\xc7#\x04\x89\xe7\xfd\xe0'),


scottbelden (Jul 31, 2019): The tests are complaining because this line is too long. Can you add a # noqa here?

jancespivo (Aug 1, 2019): I reformatted the lines. It's not complaining now ;)

scottbelden · 2019-07-31T15:01:19Z

Thanks! It's definitely much simpler, but I'm not sure we can switch to this, as it seems the to_bytes function wasn't added until Python 3.2 (https://docs.python.org/3.7/library/stdtypes.html#int.to_bytes), so I have a feeling that the Python 2.7 tests will fail.
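
One way to get the same bytes without `int.to_bytes` / `int.from_bytes` is to go through a hex string, which works on both Python 2.7 and Python 3. The sketch below is only an illustration of that idea; the helper names are hypothetical and it is not necessarily what this PR ends up using.

```python
import binascii


def int_to_bytes(value, length):
    """Signed big-endian two's-complement encoding without int.to_bytes."""
    limit = 1 << (8 * length - 1)
    if not -limit <= value < limit:
        raise OverflowError("int too big to convert")
    if value < 0:
        # Map the negative value onto its two's-complement representation.
        value += 1 << (8 * length)
    hex_digits = "%0*x" % (2 * length, value)
    return binascii.unhexlify(hex_digits)


def int_from_bytes(data):
    """Signed big-endian two's-complement decoding without int.from_bytes."""
    if not data:
        return 0
    value = int(binascii.hexlify(data), 16)
    if bytearray(data)[0] & 0x80:
        # High bit set: the value is negative.
        value -= 1 << (8 * len(data))
    return value
```

On Python 3 the results can be cross-checked against the built-ins, e.g. `int_to_bytes(-123456, 3)` against `(-123456).to_bytes(3, "big", signed=True)`.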

jancespivo · 2019-08-01T13:35:46Z

You're welcome :) I implemented to_bytes for Python 2.7. I will also implement from_bytes.

I can add some benchmarks comparing the original version with the new one.
There are two possibilities:

1. Create and run the benchmarks only locally on my computer and post the results here.
2. Add the benchmarks to the test suite. This is tricky because the CI builds can run on different machines, so the test runs aren't generally comparable.

I suggest adding the benchmarks to the test suite but not running them in CI. What do you think? It should be a separate PR, because we also want to run them against the original (not fixed) code.

scottbelden · 2019-08-02T03:15:55Z

It looks like the tests are failing on Python 2, so maybe something needs to be tweaked there.

As for benchmarks, we currently don't run any during CI (that would probably be a good thing to do, but it just hasn't happened). If you have a simple script you've been using for the benchmarks and can paste it here, I would love to run it locally as well.
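
For a plugin-free local check, a minimal sketch using only the standard library's `timeit` could look like the following. It reports the minimum per-call time over several repeats. The helper names and the sample workload are hypothetical; to compare branches, the workload would be replaced by the project's own serialize call and the script run once per checkout (Python 3 assumed).

```python
import timeit


def best_time_us(func, *args, repeat=5, number=1000):
    """Return the minimum per-call time in microseconds over `repeat` runs."""
    timer = timeit.Timer(lambda: func(*args))
    runs = timer.repeat(repeat=repeat, number=number)
    return min(runs) / number * 1e6


def encode_unscaled(unscaled):
    """Sample workload: signed big-endian encoding of an unscaled integer."""
    size = (unscaled.bit_length() + 8) // 8
    return unscaled.to_bytes(size, byteorder="big", signed=True)


if __name__ == "__main__":
    for sample in (456, -456, 9999999999999999456):
        print(sample, "%.4f us" % best_time_us(encode_unscaled, sample))
```

Taking the minimum rather than the mean reduces the influence of background noise on the machine, but numbers from different machines are still not comparable.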

jancespivo · 2019-08-02T20:00:49Z

Hi, I hope it is finally fixed :)

The benchmarks I used (with the pytest-benchmark plugin):

from decimal import Decimal

import pytest

# serialize() and deserialize() are assumed to be the helper functions
# already defined in tests/test_logical_types.py.


@pytest.fixture(name='schema_bytes_decimal')
def schema_bytes_decimal_fixture(scale):
    # `scale` is supplied by the parametrize decorators on each benchmark.
    return {
        "name": "n",
        "namespace": "namespace",
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 20,
        "scale": scale,
    }


@pytest.mark.benchmark(
    group='serialize',
    disable_gc=True,
    warmup=True,
)
@pytest.mark.parametrize('scale', [3, 20])
@pytest.mark.parametrize(
    'input_data',
    [
        Decimal("0.456"),
        Decimal("-0.456"),
        Decimal("9999999999999999.456"),
        # Note: this value (and the matching bytes in the deserialize
        # benchmark) has 15 nines, although its id below shows 16.
        Decimal("-999999999999999.456"),
    ],
    ids=['0.456', '-0.456', '9999999999999999.456', '-9999999999999999.456']
)
def test_bytes_decimal_serialize_benchmark(schema_bytes_decimal, input_data, benchmark):
    benchmark(serialize, schema_bytes_decimal, input_data)


@pytest.mark.benchmark(
    group='deserialize',
    disable_gc=True,
    warmup=True,
)
@pytest.mark.parametrize('scale', [3, 20])
@pytest.mark.parametrize(
    'input_data',
    [
        b'\x04\x01\xc8',
        b'\x04\xfe8',
        b'\x12\x00\x8a\xc7#\x04\x89\xe7\xfd\xe0',
        b'\x10\xf2\x1fILX\x9c\x02 ',
    ],
    ids=['0.456', '-0.456', '9999999999999999.456', '-9999999999999999.456']
)
def test_bytes_decimal_deserialize_benchmark(schema_bytes_decimal, input_data, benchmark):
    benchmark(deserialize, schema_bytes_decimal, input_data)

jancespivo · 2019-08-04T18:08:31Z

I ran the benchmarks on the original and new versions of the code in Python 3.7 and 2.7:
pytest -m "benchmark" --benchmark-only --benchmark-sort=fullname --benchmark-columns=min

Python 3.7:
Original version (master):

--------------------------- benchmark 'deserialize': 8 tests ---------------------------
Name (time in us)                                                          Min          
----------------------------------------------------------------------------------------
test_bytes_decimal_deserialize_benchmark[-0.456-20]                    29.9440 (1.03)   
test_bytes_decimal_deserialize_benchmark[-0.456-3]                     29.6970 (1.02)   
test_bytes_decimal_deserialize_benchmark[-9999999999999999.456-20]     30.5760 (1.05)   
test_bytes_decimal_deserialize_benchmark[-9999999999999999.456-3]      30.6320 (1.05)   
test_bytes_decimal_deserialize_benchmark[0.456-20]                     29.2770 (1.01)   
test_bytes_decimal_deserialize_benchmark[0.456-3]                      29.0600 (1.0)    
test_bytes_decimal_deserialize_benchmark[9999999999999999.456-20]      30.3200 (1.04)   
test_bytes_decimal_deserialize_benchmark[9999999999999999.456-3]       29.8300 (1.03)   
----------------------------------------------------------------------------------------

--------------------------- benchmark 'serialize': 8 tests ---------------------------
Name (time in us)                                                        Min          
--------------------------------------------------------------------------------------
test_bytes_decimal_serialize_benchmark[-0.456-20]                    11.1190 (1.60)   
test_bytes_decimal_serialize_benchmark[-0.456-3]                      7.0790 (1.02)   
test_bytes_decimal_serialize_benchmark[-9999999999999999.456-20]     16.6010 (2.40)   
test_bytes_decimal_serialize_benchmark[-9999999999999999.456-3]      12.2310 (1.76)   
test_bytes_decimal_serialize_benchmark[0.456-20]                     10.7450 (1.55)   
test_bytes_decimal_serialize_benchmark[0.456-3]                       6.9300 (1.0)    
test_bytes_decimal_serialize_benchmark[9999999999999999.456-20]      15.5570 (2.24)   
test_bytes_decimal_serialize_benchmark[9999999999999999.456-3]       11.9630 (1.73)   
--------------------------------------------------------------------------------------

New version (fix/decimal):

--------------------------- benchmark 'deserialize': 8 tests ---------------------------
Name (time in us)                                                          Min          
----------------------------------------------------------------------------------------
test_bytes_decimal_deserialize_benchmark[-0.456-20]                    30.3420 (1.03)   
test_bytes_decimal_deserialize_benchmark[-0.456-3]                     29.3890 (1.0)    
test_bytes_decimal_deserialize_benchmark[-9999999999999999.456-20]     30.9710 (1.05)   
test_bytes_decimal_deserialize_benchmark[-9999999999999999.456-3]      30.6860 (1.04)   
test_bytes_decimal_deserialize_benchmark[0.456-20]                     29.5810 (1.01)   
test_bytes_decimal_deserialize_benchmark[0.456-3]                      32.9650 (1.12)   
test_bytes_decimal_deserialize_benchmark[9999999999999999.456-20]      33.5360 (1.14)   
test_bytes_decimal_deserialize_benchmark[9999999999999999.456-3]       30.7180 (1.05)   
----------------------------------------------------------------------------------------

--------------------------- benchmark 'serialize': 8 tests --------------------------
Name (time in us)                                                       Min          
-------------------------------------------------------------------------------------
test_bytes_decimal_serialize_benchmark[-0.456-20]                    5.9400 (1.07)   
test_bytes_decimal_serialize_benchmark[-0.456-3]                     5.5970 (1.01)   
test_bytes_decimal_serialize_benchmark[-9999999999999999.456-20]     9.6490 (1.73)   
test_bytes_decimal_serialize_benchmark[-9999999999999999.456-3]      8.6090 (1.55)   
test_bytes_decimal_serialize_benchmark[0.456-20]                     5.9640 (1.07)   
test_bytes_decimal_serialize_benchmark[0.456-3]                      5.5630 (1.0)    
test_bytes_decimal_serialize_benchmark[9999999999999999.456-20]      8.9890 (1.62)   
test_bytes_decimal_serialize_benchmark[9999999999999999.456-3]       8.7640 (1.58)   
-------------------------------------------------------------------------------------

So in Python 3.7, deserialization is about equally fast in both versions. Serialization is faster in the new version, and the gain is more significant with a bigger scale and more digits in the number. This is mainly because two for loops, whose iteration counts depend on the scale and the number of digits, were eliminated.
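
To illustrate the kind of per-digit and per-scale loops this refers to, here is a hypothetical sketch (not the code removed by the PR) that builds the unscaled integer step by step, next to a version that leaves the iteration to built-in operations. Both function names are made up for this example.

```python
def unscaled_with_loops(digits, zeros_to_append):
    """Build the unscaled integer one step per digit and per padding zero."""
    result = 0
    for digit in digits:              # work grows with the number of digits
        result = result * 10 + digit
    for _ in range(zeros_to_append):  # work grows with the scale padding
        result *= 10
    return result


def unscaled_direct(digits, zeros_to_append):
    """Same value, with the iteration pushed into built-in operations."""
    return int("".join(map(str, digits))) * 10 ** zeros_to_append
```

Both functions return the same value for the same input; how much the second form saves depends on the interpreter and the input size, so it is worth measuring rather than assuming.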

Python 2.7:
Original version (master):

--------------------------- benchmark 'deserialize': 8 tests ---------------------------
Name (time in us)                                                          Min          
----------------------------------------------------------------------------------------
test_bytes_decimal_deserialize_benchmark[-0.456-20]                    19.7887 (1.0)    
test_bytes_decimal_deserialize_benchmark[-0.456-3]                     19.7887 (1.0)    
test_bytes_decimal_deserialize_benchmark[-9999999999999999.456-20]     20.9808 (1.06)   
test_bytes_decimal_deserialize_benchmark[-9999999999999999.456-3]      20.9808 (1.06)   
test_bytes_decimal_deserialize_benchmark[0.456-20]                     19.7887 (1.0)    
test_bytes_decimal_deserialize_benchmark[0.456-3]                      19.7887 (1.0)    
test_bytes_decimal_deserialize_benchmark[9999999999999999.456-20]      21.9345 (1.11)   
test_bytes_decimal_deserialize_benchmark[9999999999999999.456-3]       20.9808 (1.06)   
----------------------------------------------------------------------------------------

--------------------------- benchmark 'serialize': 8 tests ---------------------------
Name (time in us)                                                        Min          
--------------------------------------------------------------------------------------
test_bytes_decimal_serialize_benchmark[-0.456-20]                     9.7752 (1.74)   
test_bytes_decimal_serialize_benchmark[-0.456-3]                      5.9605 (1.06)   
test_bytes_decimal_serialize_benchmark[-9999999999999999.456-20]     18.1198 (3.23)   
test_bytes_decimal_serialize_benchmark[-9999999999999999.456-3]      10.9673 (1.96)   
test_bytes_decimal_serialize_benchmark[0.456-20]                      8.9407 (1.60)   
test_bytes_decimal_serialize_benchmark[0.456-3]                       5.6028 (1.0)    
test_bytes_decimal_serialize_benchmark[9999999999999999.456-20]      16.9277 (3.02)   
test_bytes_decimal_serialize_benchmark[9999999999999999.456-3]       12.8746 (2.30)   
--------------------------------------------------------------------------------------

New version (fix/decimal):

--------------------------- benchmark 'deserialize': 8 tests ---------------------------
Name (time in us)                                                          Min          
----------------------------------------------------------------------------------------
test_bytes_decimal_deserialize_benchmark[-0.456-20]                    19.7887 (1.04)   
test_bytes_decimal_deserialize_benchmark[-0.456-3]                     19.0735 (1.0)    
test_bytes_decimal_deserialize_benchmark[-9999999999999999.456-20]     20.9808 (1.10)   
test_bytes_decimal_deserialize_benchmark[-9999999999999999.456-3]      20.9808 (1.10)   
test_bytes_decimal_deserialize_benchmark[0.456-20]                     19.7887 (1.04)   
test_bytes_decimal_deserialize_benchmark[0.456-3]                      19.7887 (1.04)   
test_bytes_decimal_deserialize_benchmark[9999999999999999.456-20]      20.9808 (1.10)   
test_bytes_decimal_deserialize_benchmark[9999999999999999.456-3]       20.9808 (1.10)   
----------------------------------------------------------------------------------------

--------------------------- benchmark 'serialize': 8 tests ---------------------------
Name (time in us)                                                        Min          
--------------------------------------------------------------------------------------
test_bytes_decimal_serialize_benchmark[-0.456-20]                     7.0333 (1.02)   
test_bytes_decimal_serialize_benchmark[-0.456-3]                      6.9141 (1.0)    
test_bytes_decimal_serialize_benchmark[-9999999999999999.456-20]     11.9209 (1.72)   
test_bytes_decimal_serialize_benchmark[-9999999999999999.456-3]      10.9673 (1.59)   
test_bytes_decimal_serialize_benchmark[0.456-20]                      6.9141 (1.0)    
test_bytes_decimal_serialize_benchmark[0.456-3]                       6.9141 (1.0)    
test_bytes_decimal_serialize_benchmark[9999999999999999.456-20]      11.9209 (1.72)   
test_bytes_decimal_serialize_benchmark[9999999999999999.456-3]       10.9673 (1.59)   
--------------------------------------------------------------------------------------

The situation is similar in Python 2.7. Deserialization is about equally fast in both versions (btw, it is faster than in Python 3.7). The performance gain during serialization is significant only with a bigger scale and/or more digits. There is almost no performance loss :)
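
For completeness, the reverse direction can be sketched in the same spirit: read the zigzag varint length, interpret the payload as a signed big-endian integer, and attach the scale. This is a minimal illustration of the byte layout with hypothetical function names, assuming Python 3 (`int.from_bytes`); it is not fastavro's reader.

```python
from decimal import Decimal


def decode_long(data, pos=0):
    """Read one zigzag varint from data; return (value, next_position)."""
    acc = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos


def decode_bytes_decimal(data, scale):
    """Decode a length-prefixed two's-complement payload into a Decimal."""
    length, pos = decode_long(data)
    payload = data[pos:pos + length]
    unscaled = int.from_bytes(payload, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    # Building from a tuple keeps every digit, independent of the
    # decimal context precision.
    return Decimal((sign, digits, -scale))
```

With scale 3, the four byte strings used in the deserialize benchmark decode to 0.456, -0.456, 9999999999999999.456 and -999999999999999.456 respectively.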

jancespivo · 2019-08-26T07:09:39Z

@scottbelden Hi, is there a plan to publish a new version? We would like to use it :)

scottbelden · 2019-08-26T14:36:56Z

Should be up now as 0.22.4

jancespivo mentioned this pull request Jul 31, 2019

Added tests for decimal serialization #360

Closed

jancespivo force-pushed the fix/decimal branch from 6d09b97 to c9a4f65 Compare July 31, 2019 14:37

jancespivo force-pushed the fix/decimal branch from c9a4f65 to b15fd27 Compare July 31, 2019 14:55

scottbelden reviewed Jul 31, 2019

View reviewed changes

jancespivo force-pushed the fix/decimal branch 3 times, most recently from 98e67a4 to 5af3c62 Compare August 1, 2019 13:18

jancespivo force-pushed the fix/decimal branch 4 times, most recently from 16b5342 to 5570470 Compare August 1, 2019 19:07

Fixed decimal serialization and deserialization

c36cf3d

jancespivo force-pushed the fix/decimal branch from 5570470 to c36cf3d Compare August 2, 2019 14:42

scottbelden merged commit 84ed2cb into fastavro:master Aug 22, 2019
