
ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays #3257

Closed
wesm wants to merge 4 commits into apache:master from wesm:ARROW-3928

Conversation

@wesm (Member) commented Dec 23, 2018

This adds a deduplicate_objects option to all of the to_pandas methods. It works with string types, date types (when date_as_object=True), and time types.

I also made it so that ScalarMemoTable can be used with string_view, for more efficient memoization in this case.

I made the default for deduplicate_objects True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is, i.e. where the overhead of hashing makes things slower.
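Conceptually, the deduplication amounts to memoizing the Python objects while the column is being materialized. Below is a minimal Python-level sketch of the idea only; the actual implementation lives in C++ (arrow_to_pandas.cc, with ScalarMemoTable from hashing.h specialized for string_view), and the helper name here is hypothetical:

import numpy as np

def binary_to_object_array(raw_values, deduplicate_objects=True):
    # Python-level sketch only: the real conversion happens in C++
    # (arrow_to_pandas.cc) and memoizes with ScalarMemoTable<string_view>.
    out = np.empty(len(raw_values), dtype=object)
    if deduplicate_objects:
        memo = {}  # raw UTF-8 bytes -> the single Python str built for that value
        for i, raw in enumerate(raw_values):
            obj = memo.get(raw)
            if obj is None:
                obj = memo[raw] = raw.decode('utf8')
            out[i] = obj
    else:
        for i, raw in enumerate(raw_values):
            out[i] = raw.decode('utf8')  # a fresh str for every element
    return out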

Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:

In [50]: import pandas.util.testing as tm                                                                 

In [51]: unique_values = [tm.rands(10) for i in range(1000)]                                                                 

In [52]: values = unique_values * 10000                                                                                      

In [53]: arr = pa.array(values)                                                                                              

In [54]: timeit arr.to_pandas()                                                                                              
236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: timeit arr.to_pandas(deduplicate_objects=False)                                                                     
730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Almost 3 times faster in this case. The difference in memory use is even more drastic:

In [44]: unique_values = [tm.rands(10) for i in range(1000)]                                                                 

In [45]: values = unique_values * 10000                                                                                      

In [46]: arr = pa.array(values)                                                                                              

In [49]: %memit result11 = arr.to_pandas()                                                                                   
peak memory: 1505.89 MiB, increment: 76.27 MiB

In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)                                                          
peak memory: 2202.29 MiB, increment: 696.11 MiB

As you can see, this is a huge problem. If our bug reports about Parquet memory use problems are any indication, users have been suffering from this issue for a long time.

When the strings are mostly unique, things are slower, as expected, and the peak memory use is higher because of the hash table:

In [17]: unique_values = [tm.rands(10) for i in range(500000)]                                                               

In [18]: values = unique_values * 2                                                                                          

In [19]: arr = pa.array(values)                                                                                              

In [20]: timeit result = arr.to_pandas()                                                                                     
177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)                                                            
70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: %memit result8 = arr.to_pandas()                                                                                    
peak memory: 644.39 MiB, increment: 92.23 MiB

In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)                                                           
peak memory: 610.85 MiB, increment: 58.41 MiB

In real-world work, data with many duplicated strings is the most common case. Given the massive memory savings and the moderate effect on performance, it makes sense to have deduplication enabled by default.

@wesm added the WIP ("PR is work in progress") label on Dec 23, 2018
@wesm (Member, Author) commented Dec 23, 2018

I'll remove the WIP label once I've added some asv benchmarks for this.

@wesm (Member, Author) commented Dec 23, 2018

The memory profiler can be a bit noisy depending on the state of the GC. The math on memory usage didn't look right, so I re-ran with 10M values and the results look more correct (roughly a 600 MB difference between on and off).
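For anyone reproducing these numbers: the %memit magic in the transcripts above comes from the memory_profiler package, which has to be loaded as an IPython extension before use. A minimal sketch, assuming memory_profiler is installed:

import pandas.util.testing as tm
import pyarrow as pa

unique_values = [tm.rands(10) for i in range(1000)]
values = unique_values * 10000
arr = pa.array(values)

# In an IPython session:
#   %load_ext memory_profiler
#   %memit arr.to_pandas()
#   %memit arr.to_pandas(deduplicate_objects=False)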

@codecov-io commented Dec 23, 2018

Codecov Report

Merging #3257 into master will increase coverage by 1.23%.
The diff coverage is 93.37%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3257      +/-   ##
==========================================
+ Coverage   88.49%   89.72%   +1.23%     
==========================================
  Files         540      481      -59     
  Lines       72917    68928    -3989     
==========================================
- Hits        64527    61847    -2680     
+ Misses       8283     7081    -1202     
+ Partials      107        0     -107
Impacted Files Coverage Δ
python/pyarrow/lib.pxd 0% <ø> (ø) ⬆️
cpp/src/arrow/type_traits.h 95.23% <ø> (ø) ⬆️
python/pyarrow/tests/test_convert_pandas.py 95.18% <100%> (+0.14%) ⬆️
python/pyarrow/compat.py 77.63% <100%> (ø) ⬆️
cpp/src/arrow/util/hashing.h 99.26% <100%> (+0.01%) ⬆️
cpp/src/arrow/type.cc 91.46% <100%> (-0.03%) ⬇️
cpp/src/arrow/type.h 89.67% <100%> (+0.06%) ⬆️
cpp/src/arrow/python/arrow_to_pandas.cc 91.7% <100%> (+0.02%) ⬆️
python/pyarrow/pandas_compat.py 98.01% <100%> (ø) ⬆️
python/pyarrow/table.pxi 71.1% <50%> (+0.23%) ⬆️
... and 62 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2849f46...d9a8870.

wesm added 3 commits December 26, 2018 13:52
…converting to Python objects

@wesm removed the WIP ("PR is work in progress") label on Dec 26, 2018
@wesm (Member, Author) commented Dec 26, 2018

Running benchmarks on master. I hacked around ARROW-4117 and wasn't able to resolve ARROW-4118 when running these:

[ 50.00%] ··· ============ ==========
              --             total   
              ------------ ----------
               uniqueness   1000000  
              ============ ==========
                 0.001      79.6±0ms 
                  0.01      76.8±0ms 
                  0.1       84.9±0ms 
                  0.5       77.8±0ms 
              ============ ==========

This PR:

[ 25.00%] ··· convert_pandas.ToPandasStrings.time_to_pandas_dedup                                                    ok
[ 25.00%] ··· ============ ==========
              --             total
              ------------ ----------
               uniqueness   1000000
              ============ ==========
                 0.001      25.7±0ms
                  0.01      30.7±0ms
                  0.1       96.7±0ms
                  0.5       183±0ms
              ============ ==========

[ 50.00%] ··· convert_pandas.ToPandasStrings.time_to_pandas_no_dedup                                                 ok
[ 50.00%] ··· ============ ==========
              --             total
              ------------ ----------
               uniqueness   1000000
              ============ ==========
                 0.001      77.9±0ms
                  0.01      77.7±0ms
                  0.1       77.7±0ms
                  0.5       77.6±0ms
              ============ ==========
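For reference, a sketch of what the convert_pandas.ToPandasStrings asv benchmark could look like, assuming the parameter names and values shown in the output above; the setup details are guesses, not necessarily the exact benchmark added in this PR:

import pandas.util.testing as tm
import pyarrow as pa

class ToPandasStrings(object):
    # Hypothetical sketch matching the asv output above.
    param_names = ['uniqueness', 'total']
    params = ([0.001, 0.01, 0.1, 0.5], [1000000])

    def setup(self, uniqueness, total):
        # total strings overall, with uniqueness * total distinct values
        nunique = int(total * uniqueness)
        unique_values = [tm.rands(10) for i in range(nunique)]
        self.arr = pa.array(unique_values * (total // nunique), type=pa.string())

    def time_to_pandas_dedup(self, uniqueness, total):
        self.arr.to_pandas()

    def time_to_pandas_no_dedup(self, uniqueness, total):
        self.arr.to_pandas(deduplicate_objects=False)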

@xhochy (Member) commented Dec 27, 2018

@wesm Running benchmarks with asv run --python=same should work around these problems, as it will not create a new installation.

@xhochy (Member) left a comment

+1, LGTM

@wesm (Member, Author) commented Dec 27, 2018

Thanks. I will take care of the date_as_object deprecation issue in a follow-up PR.

