Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation string concat #1720

Merged
merged 7 commits into from May 25, 2022
Merged

Implementation string concat #1720

merged 7 commits into from May 25, 2022

Conversation

ismailmaj
Copy link
Contributor

Which issue does this PR close?

Closes #1540 .

Rationale for this change

What changes are included in this PR?

Add an implementation for string concat and a single test.

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 21, 2022
@ismailmaj ismailmaj marked this pull request as ready for review May 22, 2022 08:54
@codecov-commenter
Copy link

codecov-commenter commented May 22, 2022

Codecov Report

Merging #1720 (e9bea71) into master (5cf06bf) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head e9bea71 differs from pull request most recent head bdad3b0. Consider uploading reports for the commit bdad3b0 to get more accurate results

@@            Coverage Diff             @@
##           master    #1720      +/-   ##
==========================================
+ Coverage   83.32%   83.34%   +0.02%     
==========================================
  Files         196      197       +1     
  Lines       55961    56031      +70     
==========================================
+ Hits        46627    46698      +71     
+ Misses       9334     9333       -1     
Impacted Files Coverage Δ
arrow/src/compute/kernels/string.rs 100.00% <100.00%> (ø)
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cf06bf...bdad3b0. Read the comment docs.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good start, thank you, I've left a few comments. Let me know if anything is unclear

@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
Ok(make_array(mutable.freeze()))
}

// Elementwise concatenation of StringArrays
pub fn string_concat<Offset: OffsetSizeTrait>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be in a new string module, as opposed to concat, as I could see it causing confusion that this performs element-wise concatenation, when the concat kernel above concatenates arrays?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps @alamb has some thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tustvold to create a new compute kernel. Also, list, binary, fixed size list and fix size binary can all be supported.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree that a new string (or utf8 module) would be good here.

There are a bunch of other potential candidates in datafusion to move into arrow-rs (kudos to @ovr for implementing many of them initially): https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/string_expressions.rs

arrow/src/compute/kernels/concat.rs Outdated Show resolved Hide resolved
arrow/src/compute/kernels/concat.rs Outdated Show resolved Hide resolved
None
}
})
.collect::<GenericStringArray<Offset>>())
Copy link
Contributor

@tustvold tustvold May 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be significantly faster to do something like (not at all tested)

// TODO: Handle non-zero offset in source ArrayData

if left.len() != right.len() {
    return Err(...)
}

let nulls = match (left.data().null_bitmap(), right.data.null_bitmap()) {
  (Some(left), Some(right)) = Some((left & right)?)
  (Some(left), None) => Some(left),
  (None, Some(right)) => Some(right),
  (None, None) => None,
};
let left_offsets = left.value_offsets();
let right_offsets = right.value_offsets();

let left_values = left.value_data().as_slice();
let right_values = right.value_data().as_slice();

let left_iter = left_offsets.windows(2);
let right_iter = right_offsets.windows(2);

let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
let mut output_values = BufferBuilder::<u8>::new(left_data.len() + right_data.len());
output_offsets.append(0);

for (left, right) in left_iter.zip(right_iter) {
  output_values.append_slice(&left_values[left[0]..left[1]]);
  output_values.append_slice(&right_values[right[0]..right[1]]);)
  
  output_offsets.append(output_values.len()); // With checked cast
}

let mut builder = ArrayDataBuilder::new(Offset::get_data_type)
  .len(left.len())
  .add_buffer(output_offsets.finish())
  .add_buffer(output_values.finish());

if let Some(nulls) = nulls {
  builder = builder.null_bit_buffer(nulls);
}

// SAFETY - offsets valid by construction
unsafe {builder.build_unchecked()}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could file a follow-on issue to add benchmarks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple thoughts, because I spent some time digging into this:

  1. @alamb's example in Implementation string concat #1720 suggests that "valid_str" + None => None, where both implementations in this PR go for "valid_str" + None => "valid_str". As a user, I'd prefer the later, but I thought I'd make a note of it. If the former is chosen, then compute::util has combine_option_bitmap() which works nicely for the null handling
  2. How do we handle cases when concatenating two arrays results in the offsets overflowing? There should probably be a test case for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"valid_str" + None => None

This is the standard SQL semantic and it is how most other arrow kernels I know of behave, thus I think it would be best for consistency

arrow/src/compute/kernels/concat.rs Outdated Show resolved Hide resolved
@ismailmaj
Copy link
Contributor Author

@tustvold Thank you for the review and guidance!
I've pushed something that is faster and close to your implementation.
I'm still unfamiliar with the codebase, so I'm sure there are still some mistakes.

I'll add more test cases and support for non-zero offset later.

@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
Ok(make_array(mutable.freeze()))
}

// Elementwise concatenation of StringArrays
pub fn string_concat<Offset: OffsetSizeTrait>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tustvold to create a new compute kernel. Also, list, binary, fixed size list and fix size binary can all be supported.

@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
Ok(make_array(mutable.freeze()))
}

// Elementwise concatenation of StringArrays
pub fn string_concat<Offset: OffsetSizeTrait>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should support concatenate arbitrary number of arrays. We could file a follow-on issue to track it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HaoYang670 would you like to file the follow on issues, or shall I?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will.

None
}
})
.collect::<GenericStringArray<Offset>>())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could file a follow-on issue to add benchmarks.

@HaoYang670
Copy link
Contributor

@Ismail-Maj Really a good start! Thank you for contributing.

@tustvold
Copy link
Contributor

Looking good 😄 I think the major thing this now needs to handle is sliced inputs, and tests.

@ismailmaj
Copy link
Contributor Author

ismailmaj commented May 24, 2022

Only tests left to do (and move the code from concat.rs) I believe.
@tustvold do you think this code can be ported to handle other DataTypes?

@alamb
Copy link
Contributor

alamb commented May 24, 2022

Only tests left to do (and move the code from concat.rs) I believe.

@tustvold do you think this code can be ported to handle other DataTypes?

I personally suggest completing this PR and getting it merged first, and then add other DataTypes as follow on tickets (you might find others interested in working on them as well, if you don't have time)

@ismailmaj ismailmaj requested review from tustvold and alamb May 25, 2022 07:48
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, assuming CI passes, nice work 👍

Will leave open for a bit longer in case anyone else wants to review, otherwise I'll likely merge this afternoon (BST).

@tustvold
Copy link
Contributor

@Ismail-Maj could you possibly merge the latest master to bring in #1743

@ismailmaj
Copy link
Contributor Author

@tustvold it should be good now.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really nice to me -- thank you @Ismail-Maj ❤️

)
.unwrap();

let expected = [None, Some("foofar"), Some("barfaz")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement string_concat kernel
8 participants