ARROW-11037: [Rust] Optimized creation of string array from iterator. #9016

Closed
wants to merge 1 commit into from
Closed

ARROW-11037: [Rust] Optimized creation of string array from iterator. #9016

wants to merge 1 commit into from

Conversation

@jorgecarleitao (Member)

Avoids a memcopy from two `Vec<T>` to `Buffer`, by building the buffers on the fly.
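
For context, a minimal sketch of the two construction strategies for the offsets buffer (a reconstruction for illustration, not the PR's actual diff; the `i32` offset type, the imports, and the final no-copy `Buffer` conversion are assumptions):

```rust
use arrow::buffer::{Buffer, MutableBuffer};
use arrow::datatypes::ToByteSlice;

// Before (sketch): accumulate into a Vec, then memcopy its bytes into a Buffer.
fn offsets_via_vec(v: &[&str]) -> Buffer {
    let mut offsets: Vec<i32> = Vec::with_capacity(v.len() + 1);
    let mut length_so_far = 0i32;
    offsets.push(length_so_far);
    for s in v {
        length_so_far += s.len() as i32;
        offsets.push(length_so_far);
    }
    Buffer::from(offsets.to_byte_slice()) // the extra memcopy happens here
}

// After (sketch): write into a MutableBuffer on the fly; converting it into a
// Buffer can hand over the allocation instead of copying it.
fn offsets_on_the_fly(v: &[&str]) -> Buffer {
    let mut offsets =
        MutableBuffer::new((v.len() + 1) * std::mem::size_of::<i32>());
    let mut length_so_far = 0i32;
    offsets.extend_from_slice(length_so_far.to_byte_slice());
    for s in v {
        length_so_far += s.len() as i32;
        offsets.extend_from_slice(length_so_far.to_byte_slice());
    }
    offsets.into() // assumed no-copy conversion (`freeze()` in older versions)
}
```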

@codecov-io commented Dec 26, 2020

Codecov Report

Merging #9016 (6e8d2ff) into master (86cf246) will decrease coverage by 0.00%.
The diff coverage is 87.50%.

```
@@            Coverage Diff             @@
##           master    #9016      +/-   ##
==========================================
- Coverage   82.60%   82.60%   -0.01%
==========================================
  Files         204      204
  Lines       50175    50176       +1
==========================================
- Hits        41447    41446       -1
- Misses       8728     8730       +2
```

| Impacted Files | Coverage Δ |
| --- | --- |
| rust/arrow/src/array/array_string.rs | 88.88% <87.50%> (-1.06%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Dandandan (Contributor)

I think the previous time I tried this in a kernel, I got worse perf results than copying them from the Vecs, but maybe the buffer implementation is faster now? Does this code have a benchmark?

@jorgecarleitao (Member, Author)

That is a good point. I assumed that such a memcopy would always be worse for sufficiently large arrays, but I have not validated this.

@jorgecarleitao marked this pull request as draft December 27, 2020 12:33
@jorgecarleitao (Member, Author)

I am moving this to draft until the benches are in place. @Dandandan is right that this is not necessarily an improvement. I am investigating this.

In `rust/arrow/src/array/array_string.rs`:

```diff
-let mut values = Vec::new();
+let mut offsets =
+    MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+let mut values = MutableBuffer::new(0);
```

@Dandandan (Contributor)

Might be interesting to test here whether it is cheaper to calculate the offsets and the total size for the values buffer first, based on the final length_so_far, so that the values buffer doesn't require extra allocations.

@Dandandan (Contributor)

So something like:

```rust
// offsets and length_so_far are assumed from the surrounding code
for s in &v {
    length_so_far = length_so_far + OffsetSize::from_usize(s.len()).unwrap();
    offsets.extend_from_slice(length_so_far.to_byte_slice());
}

// length_so_far is an OffsetSize, so it needs converting to usize here
let mut values = MutableBuffer::new(length_so_far.to_usize().unwrap());
for s in &v {
    values.extend_from_slice(s.as_bytes());
}
```

@Dandandan (Contributor) commented Dec 27, 2020

Probably writing the offset buffer is also faster after converting it to a slice using `typed_data_mut::<OffsetSize>()` and iterating it with `iter_mut`; this way it can skip the capacity checks. Also, the current conversion to a byte slice may add some overhead?
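
A rough sketch of that suggestion (the reviewer's idea rather than code from this PR; it assumes the `MutableBuffer` of the time exposes `resize` alongside the `typed_data_mut::<OffsetSize>()` mentioned above, and that `v` and `OffsetSize` come from the surrounding function):

```rust
let num_offsets = v.len() + 1;
let byte_len = num_offsets * std::mem::size_of::<OffsetSize>();
let mut offsets = MutableBuffer::new(byte_len);
offsets.resize(byte_len); // assumed to zero-fill the buffer up to byte_len

// Reinterpret the byte buffer as &mut [OffsetSize] and write through iter_mut:
// no per-element capacity checks and no conversion to a byte slice.
let offsets_slice = offsets.typed_data_mut::<OffsetSize>();
let mut length_so_far = OffsetSize::from_usize(0).unwrap();
offsets_slice[0] = length_so_far; // first offset is always 0
for (s, offset) in v.iter().zip(offsets_slice.iter_mut().skip(1)) {
    length_so_far = length_so_far + OffsetSize::from_usize(s.len()).unwrap();
    *offset = length_so_far;
}
```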

@jorgecarleitao (Member, Author)

Both great ideas. If it is ok with you, I will leave them to future work, as I am focusing my work on MutableBuffer atm.

@jorgecarleitao (Member, Author)

I.e. the general direction I am pushing towards is to stop using Vec and replace it with MutableBuffer, to avoid extra allocations. Once that is in place, we can replace these with iter_mut.

> also current conversion to a byte slice may add some overhead?

I believe so, as `to_byte_slice` returns a slice whose size is unknown to the compiler, both for `&T` and `&[T]`. Either the compiler optimizes it, or there is an extra cost. I benched a 10% difference in building buffers when I introduced a method that was not doing bound checks on these. IMO we should be using `to_le_bytes`, and have a method `MutableBuffer::push<ToByteSlice>`. I think we need the `byteorder` crate because AFAIK std's `to_le_bytes` does not have a trait (which we need).
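
A hypothetical sketch of that direction (none of this is arrow's actual API; `ToLeBytes` and `ByteBuffer` are invented stand-ins, illustrating why std's inherent `to_le_bytes` is not enough for a generic `push` and a trait, whether crate-local or from `byteorder`, is needed):

```rust
// std's to_le_bytes is an inherent method on each integer type; a generic
// push needs a trait to abstract over it.
trait ToLeBytes {
    type Bytes: AsRef<[u8]>;
    fn to_le_bytes(&self) -> Self::Bytes;
}

impl ToLeBytes for i32 {
    type Bytes = [u8; 4];
    fn to_le_bytes(&self) -> [u8; 4] {
        i32::to_le_bytes(*self)
    }
}

// stand-in for arrow's MutableBuffer, for illustration only
struct ByteBuffer {
    data: Vec<u8>,
}

impl ByteBuffer {
    // the proposed push: the conversion to bytes happens at a size known at
    // compile time, so no slice of unknown length reaches the copy.
    fn push<T: ToLeBytes>(&mut self, item: &T) {
        self.data.extend_from_slice(item.to_le_bytes().as_ref());
    }
}
```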

@jorgecarleitao (Member, Author)

I have re-opened this PR as #9032 shows that creating a buffer from a mutable buffer is 2x faster.

@alamb (Contributor) commented Dec 31, 2020

The full set of Rust CI tests did not run on this PR :(

Can you please rebase this PR against apache/master to pick up the changes in #9056 so that they do?

I apologize for the inconvenience.

@andygrove added the needs-rebase label Jan 1, 2021
@jorgecarleitao removed the needs-rebase label Jan 2, 2021
@alamb (Contributor) left a comment

Looks good to me. Thanks @jorgecarleitao -- I didn't test this PR out or run the benchmarks, but I think your results are good enough for me.

I did read the code and I understand the changes. 👍

@alamb closed this in f7d47a3 Jan 4, 2021
jorgecarleitao added a commit that referenced this pull request Jan 19, 2021
This PR refactors `MutableBuffer::extend_from_slice` to remove the need to call `to_byte_slice` on every call, thereby removing a level of indirection that does not allow the compiler to optimize out some code.

This is the second performance improvement originally presented in #8796 and, together with #9027, brings the performance of `MutableBuffer` to the same level as `Vec<u8>`, in particular for building buffers on the fly.

Basically, when converting to a byte slice `&[u8]`, the compiler loses the type size information and thus needs to perform extra checks; it can't just optimize out the code.

This PR adopts the same API as `Vec<T>::extend_from_slice`, but since our buffers are byte buffers (i.e. a la `Vec<u8>`), I made the signatures

```rust
pub fn extend_from_slice<T: ToByteSlice>(&mut self, items: &[T])
pub fn push<T: ToByteSlice>(&mut self, item: &T)
```

i.e. it consumes something that can be converted to a byte slice, but internally makes the conversion to bytes (as `to_byte_slice` was doing).
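
A minimal stand-in sketch of that shape (not arrow's actual implementation; `T: Copy` here loosely approximates the real `ToByteSlice` bound, which restricts to padding-free primitive types):

```rust
use std::mem::size_of;

// stand-in for arrow's MutableBuffer, for illustration only
struct ByteBuffer {
    data: Vec<u8>,
}

impl ByteBuffer {
    // Because this is generic over T, the element size is a compile-time
    // constant, so the compiler can specialize the copy instead of handling
    // an opaque &[u8] whose element size it no longer knows.
    fn extend_from_slice<T: Copy>(&mut self, items: &[T]) {
        let len = items.len() * size_of::<T>();
        // sketch: sound only for padding-free primitives, which is what the
        // real ToByteSlice bound guarantees
        let bytes =
            unsafe { std::slice::from_raw_parts(items.as_ptr() as *const u8, len) };
        self.data.extend_from_slice(bytes);
    }
}
```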

Credits for the root cause analysis that led to this PR go to @Dandandan, [originally fielded here](#9016 (comment)).

> [...] current conversion to a byte slice may add some overhead? - @Dandandan

Benches (against master, so both this PR and #9044):

```
Switched to branch 'perf_buffer'
Your branch and 'origin/perf_buffer' have diverged,
and have 6 and 1 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 1m 00s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/buffer_create-915da5f1abaf0471
Gnuplot not found, using plotters backend
mutable                 time:   [463.11 us 463.57 us 464.07 us]
                        change: [-19.508% -18.571% -17.526%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe

mutable prepared        time:   [527.84 us 528.46 us 529.14 us]
                        change: [-13.356% -12.522% -11.790%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

Benchmarking from_slice: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
from_slice              time:   [1.1968 ms 1.1979 ms 1.1991 ms]
                        change: [-6.8697% -6.2029% -5.5812%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

from_slice prepared     time:   [917.49 us 918.89 us 920.60 us]
                        change: [-6.5111% -5.9102% -5.3038%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
```

Closes #9076 from jorgecarleitao/perf_buffer

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
kszucs pushed a commit that referenced this pull request Jan 25, 2021
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Avoids a memcopy from 2 `Vec<T>` to `Buffer`, by building the buffers on the fly.

Closes apache#9016 from jorgecarleitao/optimize_string

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021