ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays #3257
wesm wants to merge 4 commits into apache:master
Conversation
I'll remove the WIP label once I've added some asv benchmarks for this.
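(For context, asv benchmarks are plain classes with `setup` and `time_*` methods. Below is a hypothetical sketch of what such a benchmark could look like; the class name and data are made up for illustration and this is not necessarily what was added to the suite.)

```python
# Hypothetical asv benchmark sketch -- illustrative only, not the benchmark
# actually added to the pyarrow suite.
import numpy as np
import pyarrow as pa


class ToPandasStringDeduplication(object):
    """Time Array.to_pandas on low-cardinality string data,
    with and without object deduplication."""

    param_names = ['deduplicate_objects']
    params = [True, False]

    def setup(self, deduplicate_objects):
        # 1,000,000 strings of length 10 drawn from 1,000 unique values
        unique_values = ['{:010d}'.format(i) for i in range(1000)]
        values = np.random.choice(unique_values, size=10 ** 6).tolist()
        self.arr = pa.array(values, type=pa.string())

    def time_to_pandas(self, deduplicate_objects):
        self.arr.to_pandas(deduplicate_objects=deduplicate_objects)
```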
The memory profiler can be a bit noisy depending on the state of the gc. The math on memory usage didn't look right, so I re-ran with 10M values and the results look more correct (about a 600MB difference between on/off).
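A rough sketch of how the on/off difference could be measured with memory_profiler (the actual measurement script isn't shown in this thread; the data setup below is illustrative and assumes only the new deduplicate_objects keyword described in this PR):

```python
import gc

import numpy as np
import pyarrow as pa
from memory_profiler import memory_usage

# 10M strings of length 10 with only 1,000 unique values (illustrative data)
unique_values = ['{:010d}'.format(i) for i in range(1000)]
arr = pa.array(np.random.choice(unique_values, size=10 ** 7).tolist(),
               type=pa.string())

def convert(deduplicate):
    arr.to_pandas(deduplicate_objects=deduplicate)

for deduplicate in (True, False):
    gc.collect()  # reduce noise from objects the gc has not yet reclaimed
    samples = memory_usage((convert, (deduplicate,)), interval=0.1)
    print('deduplicate_objects={}: peak increase ~{:.0f} MiB'.format(
        deduplicate, max(samples) - min(samples)))
```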
Codecov Report
@@            Coverage Diff            @@
##           master    #3257     +/-  ##
=========================================
+ Coverage   88.49%   89.72%   +1.23%
=========================================
  Files         540      481      -59
  Lines       72917    68928    -3989
=========================================
- Hits        64527    61847    -2680
+ Misses       8283     7081    -1202
+ Partials      107        0     -107
Continue to review full report at Codecov.
…converting to Python objects
Change-Id: I0136ddb1498ac007509680ba9f9b3327e7e11a18
Change-Id: I834e91d6a2474bfc67c504dc1bd0497c08869563
Change-Id: Iab72262741fcbe85e19f25e321ee80d06d81f7c2
Running benchmarks on master (I hacked around ARROW-4117 and wasn't able to resolve ARROW-4118 when running these). This PR:
@wesm Running benchmarks with
Thanks. I will take care of the |
This adds a deduplicate_objects option to all of the to_pandas methods. It works with string types, date types (when date_as_object=True), and time types.

I also made it so that ScalarMemoTable can be used with string_view, for more efficient memoization in this case.

I made the default for deduplicate_objects True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is, i.e. where the overhead of hashing makes things slower.

Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:
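(The benchmark snippet itself isn't reproduced in this excerpt; a minimal sketch of that comparison, assuming only the new deduplicate_objects keyword described above, might look like this:)

```python
import time

import numpy as np
import pyarrow as pa

# 10,000,000 strings of length 10 drawn from only 1,000 unique values
unique_values = ['{:010d}'.format(i) for i in range(1000)]
values = np.random.choice(unique_values, size=10 ** 7).tolist()
arr = pa.array(values, type=pa.string())

for deduplicate in (True, False):
    start = time.perf_counter()
    arr.to_pandas(deduplicate_objects=deduplicate)
    print('deduplicate_objects={}: {:.2f}s'.format(
        deduplicate, time.perf_counter() - start))
```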
Almost 3 times faster in this case. The difference in memory use is even more drastic:
As you can see, this is a huge problem. If our bug reports about Parquet memory use problems are any indication, users have been suffering from this issue for a long time.
When the strings are mostly unique, things are slower, as expected, and the peak memory use is higher because of the hash table:
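(To see where the crossover point mentioned earlier falls, the same kind of sketch can be run while varying the number of unique values; again this is illustrative code that assumes only the deduplicate_objects keyword:)

```python
import time

import numpy as np
import pyarrow as pa

# Vary cardinality from nearly-all-duplicates to nearly-all-unique
for n_unique in (10 ** 3, 10 ** 5, 10 ** 6, 10 ** 7):
    unique_values = ['{:010d}'.format(i) for i in range(n_unique)]
    arr = pa.array(np.random.choice(unique_values, size=10 ** 7).tolist(),
                   type=pa.string())
    timings = []
    for deduplicate in (True, False):
        start = time.perf_counter()
        arr.to_pandas(deduplicate_objects=deduplicate)
        timings.append(time.perf_counter() - start)
    print('n_unique={}: dedup on {:.2f}s, off {:.2f}s'.format(
        n_unique, *timings))
```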
In real-world workloads, data with many duplicated strings is the most common case. Given the massive reduction in memory use and the moderate performance improvements, it makes sense to have this enabled by default.