
Reduce memory usage of concat (large)utf8 #348

Merged

merged 2 commits into apache:master on May 29, 2021

Conversation

ritchie46
Contributor

partial solution for #347

This PR precomputes the required length of the buffers that store the raw string data. This way we allocate only the minimum memory needed, and concatenation is faster because no reallocation takes place.
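The idea can be sketched as follows. This is a simplified, standalone illustration, not the actual arrow-rs implementation: `StrArray` and its methods are hypothetical stand-ins for Arrow's `StringArray`, which stores all string bytes in one contiguous values buffer plus an offsets buffer.

```rust
/// Stand-in for Arrow's StringArray: contiguous string bytes plus
/// i32 offsets (offsets.len() == number of strings + 1).
struct StrArray {
    values: Vec<u8>,
    offsets: Vec<i32>,
}

impl StrArray {
    fn from_strs(strs: &[&str]) -> Self {
        let mut values = Vec::new();
        let mut offsets = vec![0i32];
        for s in strs {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        Self { values, offsets }
    }

    /// Total number of string bytes held by this array.
    fn value_bytes(&self) -> usize {
        self.values.len()
    }
}

fn concat(arrays: &[StrArray]) -> StrArray {
    // Precompute the exact capacities in one pass over the inputs,
    // then allocate once -- no reallocation while copying the data.
    let byte_cap: usize = arrays.iter().map(|a| a.value_bytes()).sum();
    let len: usize = arrays.iter().map(|a| a.offsets.len() - 1).sum();

    let mut values = Vec::with_capacity(byte_cap);
    let mut offsets = Vec::with_capacity(len + 1);
    offsets.push(0i32);
    for a in arrays {
        let base = values.len() as i32;
        values.extend_from_slice(&a.values);
        // Shift each input offset by the bytes already written.
        offsets.extend(a.offsets[1..].iter().map(|o| o + base));
    }
    StrArray { values, offsets }
}
```

For concat this estimate is exact, because every input byte ends up in the output; that is what makes the single up-front allocation safe here.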

@codecov-commenter

codecov-commenter commented May 25, 2021

Codecov Report

Merging #348 (257bc59) into master (5ac771a) will increase coverage by 0.05%.
The diff coverage is 99.30%.

❗ Current head 257bc59 differs from the pull request's most recent head b3bfb5d. Consider uploading reports for commit b3bfb5d to get more accurate results.

@@            Coverage Diff             @@
##           master     #348      +/-   ##
==========================================
+ Coverage   82.56%   82.61%   +0.05%     
==========================================
  Files         162      162              
  Lines       44063    44197     +134     
==========================================
+ Hits        36379    36512     +133     
- Misses       7684     7685       +1     
Impacted Files Coverage Δ
arrow/src/array/array_boolean.rs 90.90% <ø> (ø)
arrow/src/csv/reader.rs 89.91% <99.07%> (+1.54%) ⬆️
arrow/src/array/transform/mod.rs 89.18% <100.00%> (+0.24%) ⬆️
arrow/src/compute/kernels/concat.rs 100.00% <100.00%> (ø)
parquet/src/arrow/arrow_writer.rs 98.11% <100.00%> (+0.03%) ⬆️
parquet/src/column/writer.rs 93.29% <100.00%> (ø)
parquet/src/encodings/encoding.rs 94.85% <0.00%> (-0.20%) ⬇️
arrow/src/array/transform/boolean.rs 84.61% <0.00%> (+7.69%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ac771a...b3bfb5d.

@alamb
Contributor

alamb commented May 25, 2021

The MIRI failure is unrelated to this PR: #345

@Dandandan
Contributor

@ritchie46 do we have some (micro) benchmark results, like:

cargo bench --bench concatenate_kernel -- "concat str"

@ritchie46
Contributor Author

@ritchie46 do we have some (micro) benchmark results, like:

Gnuplot not found, using plotters backend
concat str 10k          time:   [162.54 us 162.73 us 163.00 us]                           
                        change: [-3.7750% -3.6483% -3.5084%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

concat str 50% nulls 10k                                                                            
                        time:   [619.71 us 621.16 us 623.28 us]
                        change: [-1.1095% -0.9350% -0.7141%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  7 (7.00%) high severe

concat str 5% nulls 10k time:   [391.61 us 391.71 us 391.81 us]                                    
                        change: [-1.3337% -1.2446% -1.1878%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

Performance-wise it neither helps nor hurts much, so the main benefit is memory efficiency.

@Dandandan
Contributor

LGTM thanks @ritchie46

Great improvement / idea.
Did some tests locally with 1025 items in the array, getting a small improvement in timing there too (3-5%).

@ritchie46
Contributor Author

Did some tests locally with 1025 items in the array, getting a small improvement in timing there too (3-5%).

Nice!

@alamb
Contributor

alamb commented May 29, 2021

MIRI test failure is not related to this PR (we subsequently disabled it until we can fix the errors)

@alamb merged commit e801d4b into apache:master on May 29, 2021
@alamb
Contributor

alamb commented May 29, 2021

Thanks @ritchie46

@jorgecarleitao
Member

jorgecarleitao commented May 29, 2021

I did not have the time to review this, but I think that this will over-allocate in all cases where MutableArrayData's API is used for anything other than concat.

In particular, IMO this over-allocates in:

  • the filter kernel, as it now pre-allocates the maximum size of the string array, even if we only keep a single slot.
  • the zip kernel: we now allocate falsy.len() + truthy.len(), even though each output slot is picked from only one of the two.

@ritchie46
Contributor Author

I did not realize this @jorgecarleitao. What do you think about creating a builder pattern (with a prealloc option) for the concat kernel case and reverting the new implementation to the old behavior?
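The follow-up (#382, merged later in this thread) took roughly this direction via `MutableArrayData::with_capacities`: pre-allocation becomes an explicit, opt-in capacity hint, so only callers that can compute the exact output size (like concat) use it. A minimal sketch of the idea, with simplified stand-in types rather than the actual arrow-rs API:

```rust
/// Opt-in capacity hint. Callers that know the exact output size
/// (concat) pass it up front; everyone else (filter, zip, ...)
/// keeps the old lazily-growing behavior.
enum Capacities {
    /// Grow on demand (old behavior).
    Dynamic,
    /// Pre-allocate `item_capacity` slots and `bytes_capacity`
    /// string bytes -- concat can compute both exactly.
    Binary { item_capacity: usize, bytes_capacity: usize },
}

/// Stand-in for a mutable UTF-8 array under construction.
struct MutableUtf8 {
    values: Vec<u8>,
    offsets: Vec<i32>,
}

impl MutableUtf8 {
    fn with_capacities(cap: Capacities) -> Self {
        match cap {
            Capacities::Dynamic => Self {
                values: Vec::new(),
                offsets: vec![0],
            },
            Capacities::Binary { item_capacity, bytes_capacity } => {
                let mut offsets = Vec::with_capacity(item_capacity + 1);
                offsets.push(0);
                Self {
                    values: Vec::with_capacity(bytes_capacity),
                    offsets,
                }
            }
        }
    }

    fn push(&mut self, s: &str) {
        self.values.extend_from_slice(s.as_bytes());
        self.offsets.push(self.values.len() as i32);
    }
}
```

This keeps the memory win for concat while leaving filter and zip unaffected, since they simply pass `Capacities::Dynamic`.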

alamb pushed a commit that referenced this pull request Jun 4, 2021
* reduce memory needed for concat

* reuse code for str allocation buffer
@alamb added the label "cherry-picked (PR that was backported to active release; will be included in maintenance release)" on Jun 4, 2021
alamb added a commit that referenced this pull request Jun 9, 2021
…se (#411)

* Reduce memory usage of concat (large)utf8 (#348)

* reduce memory needed for concat

* reuse code for str allocation buffer

* make sure that only concat preallocates buffers (#382)

* MutableArrayData::with_capacities

* better pattern matching

* add binary capacities

* add list child data

* add struct capacities

* add panic for dictionary type

* change dictionary capacity enum variant

Co-authored-by: Ritchie Vink <ritchie46@gmail.com>