
ARROW-11572: [Rust] Add a kernel for division by single scalar #9454

Closed
wants to merge 7 commits into master from abreis/divide-scalar

Conversation

@abreis (Contributor) commented Feb 9, 2021

This PR proposes a `divide_scalar` kernel that divides numeric arrays by a single scalar.

Benchmarks show ~40-50% gains:

```
# features = []
divide 512              time:   [2.3210 us 2.3345 us 2.3490 us]
divide_scalar 512       time:   [1.4374 us 1.4425 us 1.4485 us] (-38%)
divide_nulls 512        time:   [2.1718 us 2.1799 us 2.1894 us]
divide_scalar_nulls 512 time:   [1.3888 us 1.3959 us 1.4036 us] (-36%)

# features = ["simd"]
divide 512              time:   [1.0221 us 1.0348 us 1.0481 us]
divide_scalar 512       time:   [468.04 ns 471.36 ns 475.19 ns] (-54%)
divide_nulls 512        time:   [960.20 ns 964.30 ns 969.15 ns]
divide_scalar_nulls 512 time:   [471.33 ns 476.41 ns 482.09 ns] (-51%)
```

The speedups are due to:

- checking for `DivideByZero` only once;
- not having to combine two null bitmaps;
- using `Simd::splat()` to fill the divisor chunks.

Tests are pretty bare right now; if you think this is worth merging, I'll write a few more.
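For illustration, here's a minimal sketch of the idea over plain slices (not the kernel itself, which operates on Arrow buffers and null bitmaps):

```rust
// Sketch only: in the real kernel, Arrow buffers hold the values and the
// input's null bitmap is reused as-is, since the divisor has no nulls.
fn divide_scalar_sketch(values: &[i32], divisor: i32) -> Result<Vec<i32>, String> {
    // The divide-by-zero check runs once, not once per element.
    if divisor == 0 {
        return Err("DivideByZero".to_string());
    }
    // With the check hoisted out, the loop body is a pure division that
    // the compiler (or explicit SIMD via `splat`) can vectorize.
    Ok(values.iter().map(|v| v / divisor).collect())
}
```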

@codecov-io commented Feb 9, 2021

Codecov Report

Merging #9454 (6b52026) into master (5fc0e5e) will decrease coverage by 0.01%.
The diff coverage is 70.58%.


```
@@            Coverage Diff             @@
##           master    #9454      +/-   ##
==========================================
- Coverage   82.27%   82.26%   -0.01%
==========================================
  Files         244      244
  Lines       55393    55427      +34
==========================================
+ Hits        45573    45599      +26
- Misses       9820     9828       +8
```
| Impacted Files | Coverage Δ |
| --- | --- |
| rust/arrow/src/buffer/immutable.rs | 96.82% <ø> (ø) |
| rust/arrow/src/compute/kernels/arithmetic.rs | 87.36% <70.58%> (-2.28%) ⬇️ |
| rust/parquet/src/encodings/encoding.rs | 95.05% <0.00%> (+0.19%) ⬆️ |
| rust/arrow/src/array/transform/fixed_binary.rs | 84.21% <0.00%> (+5.26%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5fc0e5e...6b52026.

@abreis marked this pull request as ready for review February 9, 2021 23:57
@nevi-me (Contributor) commented Feb 11, 2021

This is a good change, thanks @abreis.

I think adding scalar equivalents to functions where we've previously converted scalars to arrays ties us up in the future.

The C++ implementation solves this issue by using a `Datum` that's loosely:

```rust
enum Datum {
    Array(ArrayRef),
    Scalar(ScalarRef),
}
```

This would allow us to avoid `{kernel}_scalar` variants for all kernels that currently take 2 or more arrays as inputs.

What are your thoughts @jorgecarleitao @alamb @andygrove @ritchie46?

@abreis (Contributor, Author) commented Feb 11, 2021

IMO, that would be the superior approach (though it's a much larger rewrite).

I tried to mock up what it could look like for the numeric kernels:

```rust
// Datum with concrete types for the arithmetic module
enum DatumNumeric<'a, T>
where
    T: ArrowNumericType,
{
    Array(&'a PrimitiveArray<T>),
    Scalar(T::Native),
}

// lets the user pass arrays without interacting with Datum
impl<'a, T> From<&'a PrimitiveArray<T>> for DatumNumeric<'a, T>
where
    T: ArrowNumericType,
{
    fn from(array: &'a PrimitiveArray<T>) -> Self {
        DatumNumeric::Array(array)
    }
}

// can match and run specialized code for array/array, array/scalar, etcetera
fn datum_math_divide<'a, T, DN>(left: DN, right: DN) -> Result<PrimitiveArray<T>>
where
    T: ArrowNumericType,
    T::Native: Div<Output = T::Native> + Zero,
    DN: Into<DatumNumeric<'a, T>>,
{
    use DatumNumeric::*;
    match (left.into(), right.into()) {
        (Array(left), Array(right)) => todo!("array/array"),
        (Array(array), Scalar(divisor)) => todo!("array/scalar"),
        _ => todo!(),
    }
}

fn test_datum_divide() {
    let a = Int32Array::from(vec![15, 15, 8, 1, 9]);
    let b = Int32Array::from(vec![5, 6, 8, 9, 1]);
    let c = datum_math_divide(&a, &b).unwrap(); // Works, same interface as before
}
```

However, Rust doesn't like the `impl From` for scalars: I tried `impl<'a, T> From<T::Native> for DatumNumeric<'a, T> where T: ArrowNumericType`, but it fails coherence checking, presumably because the compiler can't rule out that some `T::Native` is `DatumNumeric` itself, which would overlap with core's blanket `impl<T> From<T> for T`. Perhaps there's a better way to accomplish this.

@alamb (Contributor) commented Feb 11, 2021

I think @jorgecarleitao was doing something similar to the Datum approach in DataFusion (#9376). I realize it is not exactly the same thing, but I think the `ColumnarValue` of that PR is very close to the `Datum` described by @nevi-me.

@ritchie46 (Contributor)

As @abreis' mock-up shows, it can be an elegant way to reduce the public API surface.

@jorgecarleitao (Member)

@abreis , I am not convinced that that is sufficient, unfortunately, because it excludes all types that are not Numeric (i.e. all dates and times for primitives, as well as all other logical types).

We could of course offer a DatumTemporal and a DatumList and so on and so forth, but IMO that is introducing more complexity, as we now start having structs for all kinds of logical types.

Generally, the logical operation performed on the data depends on its DataType. Thus, the scalar needs to contain that logical data type information.

In my opinion, the data structure should be something like what DataFusion has, but instead of having one enum variant per logical type, we should have one enum variant per physical type, i.e.

```rust
enum ScalarValue {
    Int32(Option<i32>, DataType), // int32, date32, time32
    Int64(Option<i64>, DataType), // int64, date64, time64, timestamp
    List(Option<Vec<ScalarValue>>, DataType),
}
```

The `Option` indicates the validity of the value; the `DataType` represents the logical type.

This corresponds to the notion that each variant needs to be treated fundamentally differently because its physical layout (i.e. at the machine level) is different. This is how Rust handles these things with generics, since it relies on type information to compile the instructions.

This would allow us to write our numerical operations neatly using a generic over `ArrowNativeType`, as `T::Native` would have a one-to-one correspondence with the physical type in the enum.
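To make this concrete, a self-contained sketch (with a stand-in `DataType` so it runs on its own; arrow's real enum is much larger):

```rust
// Stand-in for arrow's DataType, just to keep this sketch self-contained.
#[derive(Debug, PartialEq)]
enum DataType {
    Int32,
    Date32,
}

#[derive(Debug, PartialEq)]
enum ScalarValue {
    Int32(Option<i32>, DataType), // int32, date32, time32
    Int64(Option<i64>, DataType), // int64, date64, time64, timestamp
}

// One match arm per physical layout; the logical DataType rides along
// and only needs to be consulted where semantics actually differ.
fn add_one(scalar: ScalarValue) -> ScalarValue {
    match scalar {
        ScalarValue::Int32(v, dt) => ScalarValue::Int32(v.map(|x| x + 1), dt),
        ScalarValue::Int64(v, dt) => ScalarValue::Int64(v.map(|x| x + 1), dt),
    }
}

fn main() {
    // The same physical code path serves both int32 and date32 scalars.
    assert_eq!(
        add_one(ScalarValue::Int32(Some(41), DataType::Date32)),
        ScalarValue::Int32(Some(42), DataType::Date32),
    );
}
```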

We still have the problem that a `PrimitiveArray` currently needs to be downcast to its logical, not physical, representation, and thus we still have a lot of useless downcasts. Specifically, `PrimitiveArray<Int32Type>` and `PrimitiveArray<Date32Type>` have the same in-memory representation; their only difference is a constant value at `ArrayData::DataType`. There is no reason to have 2 variants here: we just need one variant and two values (the datatypes). Point 10 of my proposal addresses this by splitting the logical and physical representations, i.e. making it `PrimitiveArray<i32>`. Regardless, this is just to explain why IMO we should aim at one variant per physical type, not per logical type.
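The redundancy is easy to see with the existing API; a small illustrative sketch using the arrow crate:

```rust
use arrow::array::{Array, Date32Array, Int32Array};
use arrow::datatypes::DataType;

fn main() {
    // Same physical layout (an i32 values buffer), different logical type:
    let ints = Int32Array::from(vec![18678]);
    let dates = Date32Array::from(vec![18678]);
    assert_eq!(ints.data_type(), &DataType::Int32);
    assert_eq!(dates.data_type(), &DataType::Date32);
    // The stored values are bit-identical; only the DataType constant
    // distinguishes the two arrays.
    assert_eq!(ints.value(0), dates.value(0));
}
```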

@abreis (Contributor, Author) commented Feb 12, 2021

> @abreis , I am not convinced that that is sufficient, unfortunately, because it excludes all types that are not Numeric (i.e. all dates and times for primitives, as well as all other logical types).
>
> We could of course offer a DatumTemporal and a DatumList and so on and so forth, but IMO that is introducing more complexity, as we now start having structs for all kinds of logical types.

I agree. I just realized that `ArrowPrimitiveType` exists, though, so `Datum` can be generic over that instead.

Here is a more fleshed-out mock of a `Datum` type. The signature of `math_divide` enforces a consistent `T` across array/array and array/scalar combinations, as well as the method's output.

Base type:

```rust
#[derive(Debug)]
pub enum Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    Array(&'a PrimitiveArray<T>),
    Scalar(Option<T::Native>),
}
```

`From` impls for both arrays and nullable scalars:

```rust
impl<'a, T> From<&'a PrimitiveArray<T>> for Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    fn from(array: &'a PrimitiveArray<T>) -> Self {
        Datum::Array(array)
    }
}

impl<'a, T> From<Option<T::Native>> for Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    fn from(scalar: Option<T::Native>) -> Self {
        Datum::Scalar(scalar)
    }
}
```

A (user-facing) method for math division:

```rust
pub fn math_divide<'a1, 'a2, T, DL, DR>(
    left: DL,
    right: DR,
) -> Result<PrimitiveArray<T>>
where
    T: ArrowNumericType,
    T::Native: Div<Output = T::Native> + Zero,
    DL: Into<Datum<'a1, T>>, // left and right may have different lifetimes
    DR: Into<Datum<'a2, T>>, // but `T` must be the same
{
    use Datum::*;
    match (left.into(), right.into()) {
        (Array(left), Array(right)) => todo!(),     // array/array
        (Array(array), Scalar(divisor)) => todo!(), // array/scalar
        _ => todo!(),
    }
}
```

Test code:

```rust
fn test_datum_divide() {
    let array1 = Int32Array::from(vec![15, 15, 8, 1, 9]);
    let array2 = Int32Array::from(vec![5, 6, 8, 9, 1]);
    let scalar = Some(8i32);

    let a_over_a = math_divide(&array1, &array2).unwrap(); // works
    let a_over_s = math_divide(&array1, scalar).unwrap(); // also works
}
```

@jorgecarleitao I'm aware this doesn't address the second half of your comment. I'm just exploring this idea for the current version of arrow.

@abreis (Contributor, Author) commented Feb 15, 2021

Given that `compute::kernels::comparison` already makes liberal use of `<op>_scalar` methods, I'd like to propose the following as a way forward for this PR:

  1. Merge `divide_scalar` as-is, plus other eventual PRs for arithmetic scalar ops (`add_scalar`, etcetera).
  2. Propose the use of `Datum` in a separate PR as a way to generalize kernel inputs for arrays and scalars.
  3. If accepted, make every `<op>_scalar` method in the codebase simply delegate to its (now `Datum`-using) base method.
  4. Mark the `_scalar` methods with `#[deprecated]` (a rough sketch follows this list), and remove them altogether in a future major release.
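For steps 3 and 4, something like this, building on the `Datum`/`math_divide` mock from my earlier comment (names are from that mock, not the actual codebase):

```rust
// Hypothetical shim: the old entry point delegates to the Datum-based
// kernel and stays deprecated until a future major release removes it.
#[deprecated(note = "use math_divide with a scalar Datum instead")]
pub fn divide_scalar<T>(
    array: &PrimitiveArray<T>,
    divisor: T::Native,
) -> Result<PrimitiveArray<T>>
where
    T: ArrowNumericType,
    T::Native: Div<Output = T::Native> + Zero,
{
    // `Some(divisor)` converts into Datum::Scalar via the From impl above.
    math_divide(array, Some(divisor))
}
```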

@jorgecarleitao (Member) left a review comment

I left two small suggestions. This looks really good :)

Also, note that there is a potential optimization for all SIMD kernels here, where we create an uninitialized buffer and write to it directly from SIMD (instead of creating a zeroed buffer and writing to it).

@abreis (Contributor, Author) commented Feb 16, 2021

> I left two small suggestions. This looks really good :)
>
> Also, note that there is a potential optimization for all SIMD kernels here, where we create an uninitialized buffer and write to it directly from SIMD (instead of creating a zeroed buffer and writing to it).

Thanks!

I went to try and implement this, but it looks like it's already optimized as you suggest. The `simd_` methods do:

```rust
let mut result = MutableBuffer::new(buffer_size).with_bitset(buffer_size, false);
```

to prepare the return buffer; this uses `memory::allocate_aligned` internally, which does not initialize the newly allocated memory region.

@alamb (Contributor) commented Feb 19, 2021

@abreis I think this PR is ready to go but it needs a rebase. Can you please do so and I'll merge it in over the weekend?

@alamb added the needs-rebase label (a PR that needs to be rebased by the author) Feb 19, 2021
@abreis (Contributor, Author) commented Feb 19, 2021

Apologies, I've been swamped this week. I want to add a couple more tests and take a look at Jorge's comments before it's merged. I'll ping you once I do. Thanks!

@alamb (Contributor) commented Feb 19, 2021

Makes sense. Thanks @abreis !

@abreis (Contributor, Author) commented Feb 20, 2021

@alamb Thanks, this should be good to go now.

I want to point out that the latest benchmarks (#9454 (comment)) suggest the compiler's auto-vectorized code might be ~2-5% faster than our hand-written SIMD implementation of this particular method. I have little experience with high-performance code, though, so I can't say whether that will always be the case.

@jorgecarleitao (Member) commented Feb 21, 2021

> I went to try and implement this, but it looks like it's already optimized as you suggest. The `simd_` methods do:
>
> ```rust
> let mut result = MutableBuffer::new(buffer_size).with_bitset(buffer_size, false);
> ```
>
> to prepare the return buffer; this uses `memory::allocate_aligned` internally, which does not initialize the newly allocated memory region.

Note that `with_bitset(buffer_size, false)` is a malloc + memset. So we effectively do malloc + memset (zeros) + memset (values).

One easy win is to use `MutableBuffer::from_len_zeroed()`, which replaces malloc + memset (zeros) with calloc. This is likely faster, as it is a single call, though LLVM may already optimize this.

What I was thinking of was using `MutableBuffer::with_capacity(buffer_size)` and introducing a method to write SIMD types (e.g. `i64x16`) directly to the `MutableBuffer`, allowing us to convert all our SIMD kernels to malloc + memset (values).

This is not for this PR, though; it was just a comment that we may be able to do more here. :)
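Roughly, the three allocation strategies side by side (the direct SIMD write method does not exist yet, so `extend_from_slice` stands in for it here):

```rust
use arrow::buffer::MutableBuffer;

fn main() {
    let buffer_size = 1024;

    // Today's simd_ kernels: malloc, then an explicit memset to zero.
    let zeroed_twice = MutableBuffer::new(buffer_size).with_bitset(buffer_size, false);

    // Easy win: a single zeroed allocation (calloc under the hood).
    let zeroed_once = MutableBuffer::from_len_zeroed(buffer_size);
    assert_eq!(zeroed_twice.len(), zeroed_once.len());

    // The proposal: reserve capacity only, then write computed lanes
    // straight into the buffer, touching each byte exactly once.
    let mut values_only = MutableBuffer::new(buffer_size);
    values_only.extend_from_slice(&[0u64; 16]); // e.g. one i64x16 chunk
}
```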

@alamb (Contributor) left a review comment

Thanks for the contribution @abreis -- really good stuff.

@alamb closed this in aebabca Feb 21, 2021
alamb pushed a commit that referenced this pull request Feb 23, 2021
…nel in arrow

This is a small PR to make DataFusion use the just-merged `divide_scalar` arrow kernel (#9454).

Performance-wise:
* on the `arrow` side, this specialized kernel is ~40-50% faster than the standard `divide`, mostly due to not having to check for divide-by-zero on every row;
* on the `datafusion` side, it can now skip the `scalar.to_array_of_size(num_rows)` allocation, which should be a decent win for operations on large arrays.

The eventual goal is to have `op_scalar` variants for every arithmetic operation — `divide` will show the biggest performance gains but all variants should save DataFusion a (possibly expensive) allocation.

Closes #9543 from abreis/datafusion-divide-scalar

Authored-by: Andre Braga Reis <andre@brg.rs>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed 2 commits to sirensolutions/arrow that referenced this pull request Jun 7, 2021

michalursa pushed 2 commits to michalursa/arrow that referenced this pull request Jun 13, 2021