
Reduce memory usage of concat (large)utf8 #348

Merged

merged 2 commits into apache:master on May 29, 2021

Conversation

ritchie46
Contributor

partial solution for #347

This PR precomputes the required length of the buffers that store the raw string data. This way we allocate only the minimum memory needed, and concatenation is faster because no reallocation takes place.
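The idea can be sketched as follows. This is a simplified, standalone illustration, not the actual arrow-rs implementation: `StrArray` and its methods are hypothetical stand-ins for Arrow's `StringArray`, which stores all string bytes in one contiguous values buffer plus an offsets buffer.

```rust
/// Stand-in for Arrow's StringArray: contiguous string bytes plus
/// i32 offsets (offsets.len() == number of strings + 1).
struct StrArray {
    values: Vec<u8>,
    offsets: Vec<i32>,
}

impl StrArray {
    fn from_strs(strs: &[&str]) -> Self {
        let mut values = Vec::new();
        let mut offsets = vec![0i32];
        for s in strs {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        Self { values, offsets }
    }

    /// Total number of string bytes held by this array.
    fn value_bytes(&self) -> usize {
        self.values.len()
    }
}

fn concat(arrays: &[StrArray]) -> StrArray {
    // Precompute the exact capacities in one pass over the inputs,
    // then allocate once -- no reallocation while copying the data.
    let byte_cap: usize = arrays.iter().map(|a| a.value_bytes()).sum();
    let len: usize = arrays.iter().map(|a| a.offsets.len() - 1).sum();

    let mut values = Vec::with_capacity(byte_cap);
    let mut offsets = Vec::with_capacity(len + 1);
    offsets.push(0i32);
    for a in arrays {
        let base = values.len() as i32;
        values.extend_from_slice(&a.values);
        // Shift each input offset by the bytes already written.
        offsets.extend(a.offsets[1..].iter().map(|o| o + base));
    }
    StrArray { values, offsets }
}
```

For concat this estimate is exact, because every input byte ends up in the output; that is what makes the single up-front allocation safe here.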

@codecov-commenter

codecov-commenter commented May 25, 2021

Codecov Report

Merging #348 (257bc59) into master (5ac771a) will increase coverage by 0.05%.
The diff coverage is 99.30%.

❗ Current head 257bc59 differs from the pull request's most recent head b3bfb5d. Consider uploading reports for commit b3bfb5d to get more accurate results.

@@            Coverage Diff             @@
##           master     #348      +/-   ##
==========================================
+ Coverage   82.56%   82.61%   +0.05%     
==========================================
  Files         162      162              
  Lines       44063    44197     +134     
==========================================
+ Hits        36379    36512     +133     
- Misses       7684     7685       +1     
Impacted Files Coverage Δ
arrow/src/array/array_boolean.rs 90.90% <ø> (ø)
arrow/src/csv/reader.rs 89.91% <99.07%> (+1.54%) ⬆️
arrow/src/array/transform/mod.rs 89.18% <100.00%> (+0.24%) ⬆️
arrow/src/compute/kernels/concat.rs 100.00% <100.00%> (ø)
parquet/src/arrow/arrow_writer.rs 98.11% <100.00%> (+0.03%) ⬆️
parquet/src/column/writer.rs 93.29% <100.00%> (ø)
parquet/src/encodings/encoding.rs 94.85% <0.00%> (-0.20%) ⬇️
arrow/src/array/transform/boolean.rs 84.61% <0.00%> (+7.69%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ac771a...b3bfb5d.

@alamb
Contributor

alamb commented May 25, 2021

The MIRI failure is unrelated to this PR: #345

@Dandandan
Contributor

@ritchie46 do we have some (micro) benchmark results, like:

cargo bench --bench concatenate_kernel -- "concat str"

@ritchie46
Contributor Author

@ritchie46 do we have some (micro) benchmark results, like:

Gnuplot not found, using plotters backend
concat str 10k          time:   [162.54 us 162.73 us 163.00 us]                           
                        change: [-3.7750% -3.6483% -3.5084%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

concat str 50% nulls 10k                                                                            
                        time:   [619.71 us 621.16 us 623.28 us]
                        change: [-1.1095% -0.9350% -0.7141%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  7 (7.00%) high severe

concat str 5% nulls 10k time:   [391.61 us 391.71 us 391.81 us]                                    
                        change: [-1.3337% -1.2446% -1.1878%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

Performance-wise it neither helps nor hurts much, so the main benefit is memory efficiency.

@Dandandan
Contributor

LGTM thanks @ritchie46

Great improvement / idea.
Did some tests locally with 1025 items in the array, getting a small improvement in timing there too (3-5%).

@ritchie46
Contributor Author

Did some tests locally with 1025 items in the array, getting a small improvement in timing there too (3-5%).

Nice!

@alamb
Contributor

alamb commented May 29, 2021

MIRI test failure is not related to this PR (we subsequently disabled it until we can fix the errors)

@alamb merged commit e801d4b into apache:master on May 29, 2021
@alamb
Contributor

alamb commented May 29, 2021

Thanks @ritchie46

@jorgecarleitao
Member

jorgecarleitao commented May 29, 2021

I did not have the time to review this, but I think that this will over-allocate in all cases where MutableArrayData's API is used for anything other than concat.

In particular, IMO this over-allocates in:

  • the filter kernel, as it now pre-allocates the maximum size of the string array, even if we only keep a single slot.
  • the zip kernel: we now allocate falsy.len() + truthy.len(), even though each output slot is picked from only one of the two.

@ritchie46
Contributor Author

I did not realize this @jorgecarleitao. What do you think about creating a builder pattern (with a prealloc option) for the concat kernel case and reverting the new implementation to the old behavior?
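The follow-up (#382, merged later in this thread) took roughly this direction via `MutableArrayData::with_capacities`: pre-allocation becomes an explicit, opt-in capacity hint, so only callers that can compute the exact output size (like concat) use it. A minimal sketch of the idea, with simplified stand-in types rather than the actual arrow-rs API:

```rust
/// Opt-in capacity hint. Callers that know the exact output size
/// (concat) pass it up front; everyone else (filter, zip, ...)
/// keeps the old lazily-growing behavior.
enum Capacities {
    /// Grow on demand (old behavior).
    Dynamic,
    /// Pre-allocate `item_capacity` slots and `bytes_capacity`
    /// string bytes -- concat can compute both exactly.
    Binary { item_capacity: usize, bytes_capacity: usize },
}

/// Stand-in for a mutable UTF-8 array under construction.
struct MutableUtf8 {
    values: Vec<u8>,
    offsets: Vec<i32>,
}

impl MutableUtf8 {
    fn with_capacities(cap: Capacities) -> Self {
        match cap {
            Capacities::Dynamic => Self {
                values: Vec::new(),
                offsets: vec![0],
            },
            Capacities::Binary { item_capacity, bytes_capacity } => {
                let mut offsets = Vec::with_capacity(item_capacity + 1);
                offsets.push(0);
                Self {
                    values: Vec::with_capacity(bytes_capacity),
                    offsets,
                }
            }
        }
    }

    fn push(&mut self, s: &str) {
        self.values.extend_from_slice(s.as_bytes());
        self.offsets.push(self.values.len() as i32);
    }
}
```

This keeps the memory win for concat while leaving filter and zip unaffected, since they simply pass `Capacities::Dynamic`.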

alamb pushed a commit that referenced this pull request Jun 4, 2021
* reduce memory needed for concat

* reuse code for str allocation buffer
@alamb added the label "cherry-picked (PR that was backported to active release; will be included in maintenance release)" on Jun 4, 2021
alamb added a commit that referenced this pull request Jun 9, 2021
…se (#411)

* Reduce memory usage of concat (large)utf8 (#348)

* reduce memory needed for concat

* reuse code for str allocation buffer

* make sure that only concat preallocates buffers (#382)

* MutableArrayData::with_capacities

* better pattern matching

* add binary capacities

* add list child data

* add struct capacities

* add panic for dictionary type

* change dictionary capacity enum variant

Co-authored-by: Ritchie Vink <ritchie46@gmail.com>