Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10959: [C++] Add scalar string join kernel #8990

Closed
wants to merge 5 commits into from

Conversation

maartenbreddels
Copy link
Contributor

@jorisvandenbossche I've implemented this kernel as a binary (arity) kernel, so the input list array and the separator input string array can both be an array (see python test).

I did not implement the case where the input list is a scalar, and the separator an array, since I don't think that's very common.

And note that the kernel is named binary_join because it takes string-like and binary-like inputs.

@github-actions
Copy link

@maartenbreddels
Copy link
Contributor Author

Failures seem unrelated.
@pitrou this is ready for review assuming @jorisvandenbossche agrees with the current behavior

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @maartenbreddels . Here are some comments.

cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
@ianmcook
Copy link
Member

Regarding "join" in the name:
+1 for consistency with Python's join function
-1 because as Arrow gains more database-like features, the word "join" is likely to confuse people. "concat" might be better.

Regarding "binary" in the name:
IMO that is likely to cause users to get the impression that this is not for strings.

It would also be good for the name to reflect that this operates on a list array.

How about renaming it list_concat?

@westonpace
Copy link
Member

I'll just point out that "join" is not a python-ism. There is a string join in Java, Rust, C#, JavaScript, etc. and it is consistently called join. I think R is the only language I can find that doesn't call it "join".

@ianmcook
Copy link
Member

I'll just point out that "join" is not a python-ism. There is a string join in Java, Rust, C#, JavaScript, etc. and it is consistently called join. I think R is the only language I can find that doesn't call it "join".

SQL—which probably has more users than all those languages combined 😁

@ianmcook
Copy link
Member

I'll just point out that "join" is not a python-ism. There is a string join in Java, Rust, C#, JavaScript, etc. and it is consistently called join. I think R is the only language I can find that doesn't call it "join".

SQL—which probably has more users than all those languages combined 😁

Snark aside, that's a good point, and since users who are thinking in the SQL way will always be at least one abstraction layer removed from the C++ library, that makes me think "join" is fine 👍

@jorisvandenbossche
Copy link
Member

In SQL it's called CONCAT? (https://www.w3schools.com/sql/func_mysql_concat.asp, although this doesn't have the concept of a join separator)

@ianmcook
Copy link
Member

In SQL it's called CONCAT? (https://www.w3schools.com/sql/func_mysql_concat.asp, although this doesn't have the concept of a join separator)

In SQL, concat_ws is the variant with a separator

@pitrou
Copy link
Member

pitrou commented May 10, 2021

@maartenbreddels Do you want to update this PR?

@pitrou
Copy link
Member

pitrou commented Jun 2, 2021

I'm going to take this up.

@pitrou pitrou force-pushed the ARROW-10959 branch 2 times, most recently from f06549e to 9353e4e Compare June 3, 2021 10:56
@pitrou
Copy link
Member

pitrou commented Jun 3, 2021

I adressed review comments, added a benchmark, and made binary_join 3x faster (in particular by preallocating data at the right size).

@pitrou
Copy link
Member

pitrou commented Jun 3, 2021

Still a TODO to address (also support large_list inputs).

Edit: done

@pitrou pitrou requested a review from bkietz June 3, 2021 11:52
cpp/src/arrow/array/builder_binary.h Outdated Show resolved Hide resolved
docs/source/cpp/compute.rst Outdated Show resolved Hide resolved
@@ -113,6 +113,11 @@ void CheckScalarBinary(std::string func_name, std::shared_ptr<Array> left_input,
std::shared_ptr<Array> expected,
const FunctionOptions* options = nullptr);

void CheckScalarBinary(std::string func_name, std::shared_ptr<Array> left_input,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep the public overloads to a minimum, these should probably just take Datums. Added https://issues.apache.org/jira/browse/ARROW-12953

cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
Comment on lines +2608 to +2875
std::shared_ptr<Array> string_array;
RETURN_NOT_OK(builder->Finish(&string_array));
*out = *string_array->data();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't lose anything for string kernels which always do their own allocation, but it's worth noting that *out is always a correctly shaped ArrayData. We could also safely write

Suggested change
std::shared_ptr<Array> string_array;
RETURN_NOT_OK(builder->Finish(&string_array));
*out = *string_array->data();
RETURN_NOT_OK(builder->FinishInternal(const_cast<std::shared_ptr<ArrayData>*>(&out->array())));

@pitrou pitrou closed this in 2820b25 Jun 7, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
@jorisvandenbossche I've implemented this kernel as a binary (arity) kernel, so the input list array *and* the separator input string array can both be an array (see python test).

I did not implement the case where the input list is a scalar, and the separator an array, since I don't think that's very common.

And note that the kernel is named `binary_join` because it takes string-like and binary-like inputs.

Closes apache#8990 from maartenbreddels/ARROW-10959

Lead-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants