
ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays #3257

Closed
wesm wants to merge 4 commits into apache:master from wesm:ARROW-3928

Conversation

@wesm (Member) commented Dec 23, 2018

This adds a deduplicate_objects option to all of the to_pandas methods. It works with string types, date types (when date_as_object=True), and time types.

I also made it so that ScalarMemoTable can be used with string_view, for more efficient memoization in this case.

I made the default for deduplicate_objects True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is, i.e. where the overhead of hashing makes things slower.
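Conceptually, the deduplication amounts to memoizing the Python objects while the column is being materialized. Below is a minimal Python-level sketch of the idea only; the actual implementation lives in C++ (arrow_to_pandas.cc, with ScalarMemoTable from hashing.h specialized for string_view), and the helper name here is hypothetical:

import numpy as np

def binary_to_object_array(raw_values, deduplicate_objects=True):
    # Python-level sketch only: the real conversion happens in C++
    # (arrow_to_pandas.cc) and memoizes with ScalarMemoTable<string_view>.
    out = np.empty(len(raw_values), dtype=object)
    if deduplicate_objects:
        memo = {}  # raw UTF-8 bytes -> the single Python str built for that value
        for i, raw in enumerate(raw_values):
            obj = memo.get(raw)
            if obj is None:
                obj = memo[raw] = raw.decode('utf8')
            out[i] = obj
    else:
        for i, raw in enumerate(raw_values):
            out[i] = raw.decode('utf8')  # a fresh str for every element
    return out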

Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:

In [50]: import pandas.util.testing as tm                                                                 

In [51]: unique_values = [tm.rands(10) for i in range(1000)]                                                                 

In [52]: values = unique_values * 10000                                                                                      

In [53]: arr = pa.array(values)                                                                                              

In [54]: timeit arr.to_pandas()                                                                                              
236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: timeit arr.to_pandas(deduplicate_objects=False)                                                                     
730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Almost 3 times faster in this case. The difference in memory use is even more drastic:

In [44]: unique_values = [tm.rands(10) for i in range(1000)]                                                                 

In [45]: values = unique_values * 10000                                                                                      

In [46]: arr = pa.array(values)                                                                                              

In [49]: %memit result11 = arr.to_pandas()                                                                                   
peak memory: 1505.89 MiB, increment: 76.27 MiB

In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)                                                          
peak memory: 2202.29 MiB, increment: 696.11 MiB

As you can see, this is a huge problem. If our bug reports about Parquet memory use problems are any indication, users have been suffering from this issue for a long time.

When the strings are mostly unique, things are slower, as expected, and the peak memory use is higher because of the hash table:

In [17]: unique_values = [tm.rands(10) for i in range(500000)]                                                               

In [18]: values = unique_values * 2                                                                                          

In [19]: arr = pa.array(values)                                                                                              

In [20]: timeit result = arr.to_pandas()                                                                                     
177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)                                                            
70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: %memit result8 = arr.to_pandas()                                                                                    
peak memory: 644.39 MiB, increment: 92.23 MiB

In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)                                                           
peak memory: 610.85 MiB, increment: 58.41 MiB

In real-world work, data with many duplicated strings is the most common case. Given the massive memory savings and the moderate effect on performance, it makes sense to have deduplication enabled by default.

@wesm added the WIP ("PR is work in progress") label on Dec 23, 2018
@wesm (Member, Author) commented Dec 23, 2018

I'll remove the WIP label once I've added some asv benchmarks for this.

@wesm (Member, Author) commented Dec 23, 2018

The memory profiler can be a bit noisy depending on the state of the GC. The math on memory usage didn't look right, so I re-ran with 10M values and the results look more correct (roughly a 600 MB difference between on and off).
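For anyone reproducing these numbers: the %memit magic in the transcripts above comes from the memory_profiler package, which has to be loaded as an IPython extension before use. A minimal sketch, assuming memory_profiler is installed:

import pandas.util.testing as tm
import pyarrow as pa

unique_values = [tm.rands(10) for i in range(1000)]
values = unique_values * 10000
arr = pa.array(values)

# In an IPython session:
#   %load_ext memory_profiler
#   %memit arr.to_pandas()
#   %memit arr.to_pandas(deduplicate_objects=False)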

@codecov-io commented Dec 23, 2018

Codecov Report

Merging #3257 into master will increase coverage by 1.23%.
The diff coverage is 93.37%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3257      +/-   ##
==========================================
+ Coverage   88.49%   89.72%   +1.23%     
==========================================
  Files         540      481      -59     
  Lines       72917    68928    -3989     
==========================================
- Hits        64527    61847    -2680     
+ Misses       8283     7081    -1202     
+ Partials      107        0     -107
Impacted Files Coverage Δ
python/pyarrow/lib.pxd 0% <ø> (ø) ⬆️
cpp/src/arrow/type_traits.h 95.23% <ø> (ø) ⬆️
python/pyarrow/tests/test_convert_pandas.py 95.18% <100%> (+0.14%) ⬆️
python/pyarrow/compat.py 77.63% <100%> (ø) ⬆️
cpp/src/arrow/util/hashing.h 99.26% <100%> (+0.01%) ⬆️
cpp/src/arrow/type.cc 91.46% <100%> (-0.03%) ⬇️
cpp/src/arrow/type.h 89.67% <100%> (+0.06%) ⬆️
cpp/src/arrow/python/arrow_to_pandas.cc 91.7% <100%> (+0.02%) ⬆️
python/pyarrow/pandas_compat.py 98.01% <100%> (ø) ⬆️
python/pyarrow/table.pxi 71.1% <50%> (+0.23%) ⬆️
... and 62 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2849f46...d9a8870.

wesm added 3 commits December 26, 2018 13:52
…converting to Python objects

@wesm removed the WIP ("PR is work in progress") label on Dec 26, 2018
@wesm (Member, Author) commented Dec 26, 2018

Running benchmarks on master. I hacked around ARROW-4117 and wasn't able to resolve ARROW-4118 when running these:

[ 50.00%] ··· ============ ==========
              --             total   
              ------------ ----------
               uniqueness   1000000  
              ============ ==========
                 0.001      79.6±0ms 
                  0.01      76.8±0ms 
                  0.1       84.9±0ms 
                  0.5       77.8±0ms 
              ============ ==========

This PR:

[ 25.00%] ··· convert_pandas.ToPandasStrings.time_to_pandas_dedup                                                    ok
[ 25.00%] ··· ============ ==========
              --             total
              ------------ ----------
               uniqueness   1000000
              ============ ==========
                 0.001      25.7±0ms
                  0.01      30.7±0ms
                  0.1       96.7±0ms
                  0.5       183±0ms
              ============ ==========

[ 50.00%] ··· convert_pandas.ToPandasStrings.time_to_pandas_no_dedup                                                 ok
[ 50.00%] ··· ============ ==========
              --             total
              ------------ ----------
               uniqueness   1000000
              ============ ==========
                 0.001      77.9±0ms
                  0.01      77.7±0ms
                  0.1       77.7±0ms
                  0.5       77.6±0ms
              ============ ==========
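For reference, a sketch of what the convert_pandas.ToPandasStrings asv benchmark could look like, assuming the parameter names and values shown in the output above; the setup details are guesses, not necessarily the exact benchmark added in this PR:

import pandas.util.testing as tm
import pyarrow as pa

class ToPandasStrings(object):
    # Hypothetical sketch matching the asv output above.
    param_names = ['uniqueness', 'total']
    params = ([0.001, 0.01, 0.1, 0.5], [1000000])

    def setup(self, uniqueness, total):
        # total strings overall, with uniqueness * total distinct values
        nunique = int(total * uniqueness)
        unique_values = [tm.rands(10) for i in range(nunique)]
        self.arr = pa.array(unique_values * (total // nunique), type=pa.string())

    def time_to_pandas_dedup(self, uniqueness, total):
        self.arr.to_pandas()

    def time_to_pandas_no_dedup(self, uniqueness, total):
        self.arr.to_pandas(deduplicate_objects=False)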

@xhochy (Member) commented Dec 27, 2018

@wesm Running benchmarks with asv run --python=same should work around these problems, as it will not create a new installation.

@xhochy (Member) left a comment

+1, LGTM

@wesm (Member, Author) commented Dec 27, 2018

Thanks. I will take care of the date_as_object deprecation issue in a follow-up PR.

