
ARROW-11572: [Rust] Add a kernel for division by single scalar #9454

Closed
wants to merge 7 commits into master from abreis/divide-scalar

Conversation

@abreis (Contributor) commented Feb 9, 2021

This PR proposes a `divide_scalar` kernel that divides numeric arrays by a single scalar.

Benchmarks show ~40-50% gains:

```
# features = []
divide 512              time:   [2.3210 us 2.3345 us 2.3490 us]
divide_scalar 512       time:   [1.4374 us 1.4425 us 1.4485 us] (-38%)
divide_nulls 512        time:   [2.1718 us 2.1799 us 2.1894 us]
divide_scalar_nulls 512 time:   [1.3888 us 1.3959 us 1.4036 us] (-36%)

# features = ["simd"]
divide 512              time:   [1.0221 us 1.0348 us 1.0481 us]
divide_scalar 512       time:   [468.04 ns 471.36 ns 475.19 ns] (-54%)
divide_nulls 512        time:   [960.20 ns 964.30 ns 969.15 ns]
divide_scalar_nulls 512 time:   [471.33 ns 476.41 ns 482.09 ns] (-51%)
```

The speedups are due to:

- checking for `DivideByZero` only once;
- not having to combine two null bitmaps;
- using `Simd::splat()` to fill the divisor chunks.

Tests are pretty bare right now; if you think this is worth merging, I'll write a few more.
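For illustration, here's a minimal sketch of the idea over plain slices (not the kernel itself, which operates on Arrow buffers and null bitmaps):

```rust
// Sketch only: in the real kernel, Arrow buffers hold the values and the
// input's null bitmap is reused as-is, since the divisor has no nulls.
fn divide_scalar_sketch(values: &[i32], divisor: i32) -> Result<Vec<i32>, String> {
    // The divide-by-zero check runs once, not once per element.
    if divisor == 0 {
        return Err("DivideByZero".to_string());
    }
    // With the check hoisted out, the loop body is a pure division that
    // the compiler (or explicit SIMD via `splat`) can vectorize.
    Ok(values.iter().map(|v| v / divisor).collect())
}
```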

@codecov-io commented Feb 9, 2021

Codecov Report

Merging #9454 (6b52026) into master (5fc0e5e) will decrease coverage by 0.01%.
The diff coverage is 70.58%.


```
@@            Coverage Diff             @@
##           master    #9454      +/-   ##
==========================================
- Coverage   82.27%   82.26%   -0.01%
==========================================
  Files         244      244
  Lines       55393    55427      +34
==========================================
+ Hits        45573    45599      +26
- Misses       9820     9828       +8
```
| Impacted Files | Coverage Δ |
| --- | --- |
| rust/arrow/src/buffer/immutable.rs | 96.82% <ø> (ø) |
| rust/arrow/src/compute/kernels/arithmetic.rs | 87.36% <70.58%> (-2.28%) ⬇️ |
| rust/parquet/src/encodings/encoding.rs | 95.05% <0.00%> (+0.19%) ⬆️ |
| rust/arrow/src/array/transform/fixed_binary.rs | 84.21% <0.00%> (+5.26%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5fc0e5e...6b52026.

@abreis marked this pull request as ready for review February 9, 2021 23:57
@nevi-me (Contributor) commented Feb 11, 2021

This is a good change, thanks @abreis.

I think adding scalar equivalents to functions where we've previously converted scalars to arrays ties us up in the future.

The C++ implementation solves this issue by using a `Datum` that's loosely:

```rust
enum Datum {
    Array(ArrayRef),
    Scalar(ScalarRef),
}
```

This would allow us to avoid `{kernel}_scalar` variants for all kernels that currently take 2 or more arrays as inputs.

What are your thoughts @jorgecarleitao @alamb @andygrove @ritchie46?

@abreis (Contributor, Author) commented Feb 11, 2021

IMO, that would be the superior approach (though it's a much larger rewrite).

I tried to mock up what it could look like for the numeric kernels:

```rust
// Datum with concrete types for the arithmetic module
enum DatumNumeric<'a, T>
where
    T: ArrowNumericType,
{
    Array(&'a PrimitiveArray<T>),
    Scalar(T::Native),
}

// lets the user pass arrays without interacting with Datum
impl<'a, T> From<&'a PrimitiveArray<T>> for DatumNumeric<'a, T>
where
    T: ArrowNumericType,
{
    fn from(array: &'a PrimitiveArray<T>) -> Self {
        DatumNumeric::Array(array)
    }
}

// can match and run specialized code for array/array, array/scalar, etcetera
fn datum_math_divide<'a, T, DN>(left: DN, right: DN) -> Result<PrimitiveArray<T>>
where
    T: ArrowNumericType,
    T::Native: Div<Output = T::Native> + Zero,
    DN: Into<DatumNumeric<'a, T>>,
{
    use DatumNumeric::*;
    match (left.into(), right.into()) {
        (Array(left), Array(right)) => todo!("array/array"),
        (Array(array), Scalar(divisor)) => todo!("array/scalar"),
        _ => todo!(),
    }
}

fn test_datum_divide() {
    let a = Int32Array::from(vec![15, 15, 8, 1, 9]);
    let b = Int32Array::from(vec![5, 6, 8, 9, 1]);
    let c = datum_math_divide(&a, &b).unwrap(); // Works, same interface as before
}
```

However, Rust doesn't like the `impl From` for scalars: I tried `impl<'a, T> From<T::Native> for DatumNumeric<'a, T> where T: ArrowNumericType`, but it fails coherence checking, presumably because the compiler can't rule out that some `T::Native` is `DatumNumeric` itself, which would overlap with core's blanket `impl<T> From<T> for T`. Perhaps there's a better way to accomplish this.

@alamb (Contributor) commented Feb 11, 2021

I think @jorgecarleitao was doing something similar to the Datum approach in DataFusion (#9376). I realize it is not exactly the same thing, but I think the `ColumnarValue` of that PR is very close to the `Datum` described by @nevi-me.

@ritchie46 (Contributor)

As @abreis' mock-up shows, it can be an elegant way to reduce the public API surface.

@jorgecarleitao (Member)

@abreis , I am not convinced that that is sufficient, unfortunately, because it excludes all types that are not Numeric (i.e. all dates and times for primitives, as well as all other logical types).

We could of course offer a DatumTemporal and a DatumList and so on and so forth, but IMO that is introducing more complexity, as we now start having structs for all kinds of logical types.

Generally, the logical operation performed on the data depends on its DataType. Thus, the scalar needs to contain that logical data type information.

In my opinion, the data structure should be something like what DataFusion has, but instead of having one enum variant per logical type, we should have one enum variant per physical type, i.e.

```rust
enum ScalarValue {
    Int32(Option<i32>, DataType), // int32, date32, time32
    Int64(Option<i64>, DataType), // int64, date64, time64, timestamp
    List(Option<Vec<ScalarValue>>, DataType),
}
```

The `Option` indicates the validity of the value; the `DataType` represents the logical type.

This corresponds to the notion that each variant needs to be treated fundamentally differently because its physical layout (i.e. at the machine level) is different. This is how Rust handles these things with generics, since it relies on type information to compile the instructions.

This would allow us to write our numerical operations neatly using a generic over `ArrowNativeType`, as `T::Native` would have a one-to-one correspondence with the physical type in the enum.
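To make this concrete, a self-contained sketch (with a stand-in `DataType` so it runs on its own; arrow's real enum is much larger):

```rust
// Stand-in for arrow's DataType, just to keep this sketch self-contained.
#[derive(Debug, PartialEq)]
enum DataType {
    Int32,
    Date32,
}

#[derive(Debug, PartialEq)]
enum ScalarValue {
    Int32(Option<i32>, DataType), // int32, date32, time32
    Int64(Option<i64>, DataType), // int64, date64, time64, timestamp
}

// One match arm per physical layout; the logical DataType rides along
// and only needs to be consulted where semantics actually differ.
fn add_one(scalar: ScalarValue) -> ScalarValue {
    match scalar {
        ScalarValue::Int32(v, dt) => ScalarValue::Int32(v.map(|x| x + 1), dt),
        ScalarValue::Int64(v, dt) => ScalarValue::Int64(v.map(|x| x + 1), dt),
    }
}

fn main() {
    // The same physical code path serves both int32 and date32 scalars.
    assert_eq!(
        add_one(ScalarValue::Int32(Some(41), DataType::Date32)),
        ScalarValue::Int32(Some(42), DataType::Date32),
    );
}
```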

We still have the problem that a `PrimitiveArray` currently needs to be downcast to its logical, not physical, representation, and thus we still have a lot of useless downcasts. Specifically, `PrimitiveArray<Int32Type>` and `PrimitiveArray<Date32Type>` have the same in-memory representation; their only difference is a constant value at `ArrayData::DataType`. There is no reason to have 2 variants here: we just need one variant and two values (the datatypes). Point 10 of my proposal addresses this by splitting the logical and physical representations, i.e. making it `PrimitiveArray<i32>`. Regardless, this is just to explain why IMO we should aim at one variant per physical type, not per logical type.
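The redundancy is easy to see with the existing API; a small illustrative sketch using the arrow crate:

```rust
use arrow::array::{Array, Date32Array, Int32Array};
use arrow::datatypes::DataType;

fn main() {
    // Same physical layout (an i32 values buffer), different logical type:
    let ints = Int32Array::from(vec![18678]);
    let dates = Date32Array::from(vec![18678]);
    assert_eq!(ints.data_type(), &DataType::Int32);
    assert_eq!(dates.data_type(), &DataType::Date32);
    // The stored values are bit-identical; only the DataType constant
    // distinguishes the two arrays.
    assert_eq!(ints.value(0), dates.value(0));
}
```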

@abreis (Contributor, Author) commented Feb 12, 2021

> @abreis , I am not convinced that that is sufficient, unfortunately, because it excludes all types that are not Numeric (i.e. all dates and times for primitives, as well as all other logical types).
>
> We could of course offer a DatumTemporal and a DatumList and so on and so forth, but IMO that is introducing more complexity, as we now start having structs for all kinds of logical types.

I agree. I just realized that `ArrowPrimitiveType` exists, though, so `Datum` can be generic over that instead.

Here is a more fleshed-out mock of a `Datum` type. The signature of `math_divide` enforces a consistent `T` across array/array and array/scalar combinations, as well as the method's output.

Base type:

```rust
#[derive(Debug)]
pub enum Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    Array(&'a PrimitiveArray<T>),
    Scalar(Option<T::Native>),
}
```

`From` impls for both arrays and nullable scalars:

```rust
impl<'a, T> From<&'a PrimitiveArray<T>> for Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    fn from(array: &'a PrimitiveArray<T>) -> Self {
        Datum::Array(array)
    }
}

impl<'a, T> From<Option<T::Native>> for Datum<'a, T>
where
    T: ArrowPrimitiveType,
{
    fn from(scalar: Option<T::Native>) -> Self {
        Datum::Scalar(scalar)
    }
}
```

A (user-facing) method for math division:

```rust
pub fn math_divide<'a1, 'a2, T, DL, DR>(
    left: DL,
    right: DR,
) -> Result<PrimitiveArray<T>>
where
    T: ArrowNumericType,
    T::Native: Div<Output = T::Native> + Zero,
    DL: Into<Datum<'a1, T>>, // left and right may have different lifetimes
    DR: Into<Datum<'a2, T>>, // but `T` must be the same
{
    use Datum::*;
    match (left.into(), right.into()) {
        (Array(left), Array(right)) => todo!(),     // array/array
        (Array(array), Scalar(divisor)) => todo!(), // array/scalar
        _ => todo!(),
    }
}
```

Test code:

```rust
fn test_datum_divide() {
    let array1 = Int32Array::from(vec![15, 15, 8, 1, 9]);
    let array2 = Int32Array::from(vec![5, 6, 8, 9, 1]);
    let scalar = Some(8i32);

    let a_over_a = math_divide(&array1, &array2).unwrap(); // works
    let a_over_s = math_divide(&array1, scalar).unwrap(); // also works
}
```

@jorgecarleitao I'm aware this doesn't address the second half of your comment. I'm just exploring this idea for the current version of arrow.

@abreis (Contributor, Author) commented Feb 15, 2021

Given that `compute::kernels::comparison` already makes liberal use of `<op>_scalar` methods, I'd like to propose the following as a way forward for this PR:

  1. Merge `divide_scalar` as-is, plus other eventual PRs for arithmetic scalar ops (`add_scalar`, etcetera).
  2. Propose the use of `Datum` in a separate PR as a way to generalize kernel inputs for arrays and scalars.
  3. If accepted, make every `<op>_scalar` method in the codebase simply delegate to its (now `Datum`-using) base method.
  4. Mark the `_scalar` methods with `#[deprecated]` (a rough sketch follows this list), and remove them altogether in a future major release.
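For steps 3 and 4, something like this, building on the `Datum`/`math_divide` mock from my earlier comment (names are from that mock, not the actual codebase):

```rust
// Hypothetical shim: the old entry point delegates to the Datum-based
// kernel and stays deprecated until a future major release removes it.
#[deprecated(note = "use math_divide with a scalar Datum instead")]
pub fn divide_scalar<T>(
    array: &PrimitiveArray<T>,
    divisor: T::Native,
) -> Result<PrimitiveArray<T>>
where
    T: ArrowNumericType,
    T::Native: Div<Output = T::Native> + Zero,
{
    // `Some(divisor)` converts into Datum::Scalar via the From impl above.
    math_divide(array, Some(divisor))
}
```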

@jorgecarleitao (Member) left a review comment

I left two small suggestions. This looks really good :)

Also, note that there is a potential optimization for all SIMD kernels here, where we create an uninitialized buffer and write to it directly from SIMD (instead of creating a zeroed buffer and writing to it).

@abreis (Contributor, Author) commented Feb 16, 2021

> I left two small suggestions. This looks really good :)
>
> Also, note that there is a potential optimization for all SIMD kernels here, where we create an uninitialized buffer and write to it directly from SIMD (instead of creating a zeroed buffer and writing to it).

Thanks!

I went to try and implement this, but it looks like it's already optimized as you suggest. The `simd_` methods do:

```rust
let mut result = MutableBuffer::new(buffer_size).with_bitset(buffer_size, false);
```

to prepare the return buffer; this uses `memory::allocate_aligned` internally, which does not initialize the newly allocated memory region.

@alamb (Contributor) commented Feb 19, 2021

@abreis I think this PR is ready to go but it needs a rebase. Can you please do so and I'll merge it in over the weekend?

@alamb added the needs-rebase label (a PR that needs to be rebased by the author) Feb 19, 2021
@abreis (Contributor, Author) commented Feb 19, 2021

Apologies, I've been swamped this week. I want to add a couple more tests and take a look at Jorge's comments before it's merged. I'll ping you once I do. Thanks!

@alamb (Contributor) commented Feb 19, 2021

Makes sense. Thanks @abreis !

@abreis (Contributor, Author) commented Feb 20, 2021

@alamb Thanks, this should be good to go now.

I want to point out that the latest benchmarks (#9454 (comment)) suggest the compiler's auto-vectorized code might be ~2-5% faster than our hand-written SIMD implementation of this particular method. I have little experience with high-performance code, though, so I can't say whether that will always be the case.

@jorgecarleitao (Member) commented Feb 21, 2021

> I went to try and implement this, but it looks like it's already optimized as you suggest. The `simd_` methods do:
>
> ```rust
> let mut result = MutableBuffer::new(buffer_size).with_bitset(buffer_size, false);
> ```
>
> to prepare the return buffer; this uses `memory::allocate_aligned` internally, which does not initialize the newly allocated memory region.

Note that `with_bitset(buffer_size, false)` is a malloc + memset. So we effectively do malloc + memset (zeros) + memset (values).

One easy win is to use `MutableBuffer::from_len_zeroed()`, which replaces malloc + memset (zeros) with calloc. This is likely faster, as it is a single call, though LLVM may already optimize this.

What I was thinking of was using `MutableBuffer::with_capacity(buffer_size)` and introducing a method to write SIMD types (e.g. `i64x16`) directly to the `MutableBuffer`, allowing us to convert all our SIMD kernels to malloc + memset (values).

This is not for this PR, though; it was just a comment that we may be able to do more here. :)
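Roughly, the three allocation strategies side by side (the direct SIMD write method does not exist yet, so `extend_from_slice` stands in for it here):

```rust
use arrow::buffer::MutableBuffer;

fn main() {
    let buffer_size = 1024;

    // Today's simd_ kernels: malloc, then an explicit memset to zero.
    let zeroed_twice = MutableBuffer::new(buffer_size).with_bitset(buffer_size, false);

    // Easy win: a single zeroed allocation (calloc under the hood).
    let zeroed_once = MutableBuffer::from_len_zeroed(buffer_size);
    assert_eq!(zeroed_twice.len(), zeroed_once.len());

    // The proposal: reserve capacity only, then write computed lanes
    // straight into the buffer, touching each byte exactly once.
    let mut values_only = MutableBuffer::new(buffer_size);
    values_only.extend_from_slice(&[0u64; 16]); // e.g. one i64x16 chunk
}
```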

@alamb (Contributor) left a review comment

Thanks for the contribution @abreis -- really good stuff.

@alamb closed this in aebabca Feb 21, 2021
alamb pushed a commit that referenced this pull request Feb 23, 2021
…nel in arrow

This is a small PR to make DataFusion use the just-merged `divide_scalar` arrow kernel (#9454).

Performance-wise:
* on the `arrow` side, this specialized kernel is ~40-50% faster than the standard `divide`, mostly due to not having to check for divide-by-zero on every row;
* on the `datafusion` side, it can now skip the `scalar.to_array_of_size(num_rows)` allocation, which should be a decent win for operations on large arrays.

The eventual goal is to have `op_scalar` variants for every arithmetic operation — `divide` will show the biggest performance gains but all variants should save DataFusion a (possibly expensive) allocation.

Closes #9543 from abreis/datafusion-divide-scalar

Authored-by: Andre Braga Reis <andre@brg.rs>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed 2 commits to sirensolutions/arrow that referenced this pull request Jun 7, 2021

michalursa pushed 2 commits to michalursa/arrow that referenced this pull request Jun 13, 2021