Add concat_ws Spark function #8854
Found that Spark supports mixing string and array arguments, so this patch needs to be enhanced to handle that case:
SELECT concat_ws(',', 'a', 'b', array('c', 'd'), 'e');
"a,b,c,d,e"
The latest code supports this case.
@rui-mo, could you take a look?
.. spark:function:: concat_ws(separator, [string]/[array<string>], ...) -> varchar

Returns the concatenation for ``string`` & all elements in ``array<string>``, separated by
``separator``. Only accepts constant ``separator``. It takes variable number of remaining
Does Spark allow non-constant separator?
Confirmed, yes. Just updated the code to support this. Thanks!
@@ -20,6 +20,21 @@ Unless specified otherwise, all functions return NULL if at least one of the arg
If ``n < 0``, the result is an empty string.
If ``n >= 256``, the result is equivalent to chr(``n % 256``).

.. spark:function:: concat_ws(separator, [string]/[array<string>], ...) -> varchar

Returns the concatenation for ``string`` & all elements in ``array<string>``, separated by
nit: & to and?
``separator``. Only accepts constant ``separator``. It takes variable number of remaining
arguments. And ``string`` & ``array<string>`` can be used in combination. If ``separator``
is NULL, returns NULL, regardless of the following inputs. If only ``separator`` (not a
NULL) is provided or all remaining inputs are NULL, returns an empty string. ::
Would you clarify the behavior when the separator is not NULL, but string or array(string) contain NULL? Maybe also include an example.
Just clarified and added an example.
@rui-mo, could you review again? Thanks!
Thanks.
velox/functions/sparksql/String.cpp
Outdated
const std::vector<exec::VectorFunctionArg>& inputArgs,
const core::QueryConfig& /*config*/) {
auto numArgs = inputArgs.size();
VELOX_USER_CHECK(
nit: VELOX_USER_CHECK_GE
velox/functions/sparksql/String.cpp
Outdated
numArgs >= 1,
"concat_ws requires one arguments at least, but got {}.",
numArgs);
for (auto& arg : inputArgs) {
nit: const &
velox/functions/sparksql/String.cpp
Outdated
}
}

// String arg.
This is a bit obscure, could we use a complete sentence as comment?
velox/functions/sparksql/String.cpp
Outdated
@@ -103,6 +103,229 @@ class Length : public exec::VectorFunction {
}
};

void doApply(
I feel we are lacking a description of the concatenation process; it may be better to add one.
Thanks.
velox/functions/sparksql/String.cpp
Outdated
separatorDecoded = exec::LocalDecodedVector(context, *args[0], rows);
}

// Calculate the total number of bytes in the result.
For better readability, it seems several helper functions could be extracted to achieve a single purpose.
@rui-mo, could you review this PR again? Thanks!
Thanks.
of the following inputs. For non-NULL ``separator``, if no remaining input or all remaining
inputs are NULL, returns an empty string. ::

    SELECT concat_ws('~', 'a', 'b', 'c'); -- 'a~b~c'
How does concat_ws behave when an element is an empty string or an empty array, e.g. concat_ws('~', 'a', 'b', '') or concat_ws('~', [], ['d'])?
Just confirmed. Empty string & its neighbors are also separated in Spark. Just fixed this and updated doc.
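For readers following the thread, the semantics discussed so far can be summarized in a small standalone sketch. It is not the Velox implementation: the names OptString, OptArray, Arg and concatWs are hypothetical, std::optional stands in for SQL NULL, and treating a NULL or empty array argument as contributing nothing is an assumption that the thread does not confirm explicitly.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for nullable VARCHAR and nullable ARRAY<VARCHAR>.
using OptString = std::optional<std::string>;
using OptArray = std::optional<std::vector<OptString>>;
using Arg = std::variant<OptString, OptArray>;

// Reference semantics as described in the doc text under review:
// - NULL separator: the result is NULL, whatever the other inputs are.
// - NULL strings and NULL array elements are skipped.
// - Empty strings are kept and still separated from their neighbors.
// - No remaining inputs, or only NULL inputs: the result is an empty string.
OptString concatWs(const OptString& separator, const std::vector<Arg>& args) {
  if (!separator.has_value()) {
    return std::nullopt;
  }
  std::string result;
  bool first = true;
  auto append = [&](const std::string& value) {
    if (!first) {
      result += *separator;
    }
    result += value;
    first = false;
  };
  for (const auto& arg : args) {
    if (const auto* str = std::get_if<OptString>(&arg)) {
      if (str->has_value()) {
        append(**str);
      }
    } else {
      assert(std::holds_alternative<OptArray>(arg));
      const auto& array = std::get<OptArray>(arg);
      if (!array.has_value()) {
        continue;  // Assumption: a NULL array argument is skipped entirely.
      }
      for (const auto& element : *array) {
        if (element.has_value()) {
          append(*element);
        }
      }
    }
  }
  return result;
}
```

Under these assumptions, concatWs with separator '~' and inputs 'a', 'b', '' produces 'a~b~', which matches the empty-string behavior confirmed in the reply above.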
velox/functions/sparksql/String.cpp
Outdated
auto element = elementsDecoded.valueAt<StringView>(offset + j);
if (!element.empty()) {
  allElements++;
  totalResultBytes += element.size();
It seems empty elements are skipped here. Are empty elements concatenated in Spark?
Just fixed. Thanks!
velox/functions/sparksql/String.cpp
Outdated
decodedSeparator);

// Allocate a string buffer.
auto rawBuffer = flatResult.getRawStringBufferWithSpace(totalResultBytes);
Is totalResultBytes the exact size? If it is, maybe use getRawStringBufferWithSpace(totalResultBytes, true).
Fixed. Thanks!
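The exact-size question above relies on a two-pass pattern: first compute the total number of result bytes, then allocate once and copy. A minimal standalone sketch of that pattern follows. It uses std::string in place of Velox's string buffer, and the helper names joinedSize and joinExact are hypothetical; it illustrates the idea only and says nothing about how the PR's code is structured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Pass 1: compute the exact byte count of the joined result. Empty parts
// are counted too, so each still gets a separator next to it.
std::size_t joinedSize(
    const std::vector<std::string_view>& parts,
    std::string_view separator) {
  if (parts.empty()) {
    return 0;
  }
  std::size_t total = separator.size() * (parts.size() - 1);
  for (const auto part : parts) {
    total += part.size();
  }
  return total;
}

// Pass 2: allocate exactly once, then copy separators and parts in order.
std::string joinExact(
    const std::vector<std::string_view>& parts,
    std::string_view separator) {
  std::string out(joinedSize(parts, separator), '\0');
  char* cursor = out.data();
  for (std::size_t i = 0; i < parts.size(); ++i) {
    if (i > 0) {
      cursor = std::copy(separator.begin(), separator.end(), cursor);
    }
    cursor = std::copy(parts[i].begin(), parts[i].end(), cursor);
  }
  // The size from pass 1 is exact only if the write cursor ends on the
  // last byte of the buffer.
  assert(cursor == out.data() + out.size());
  return out;
}
```

The closing assertion is the point of the sketch: if the counting pass skipped empty elements while the writing pass did not, the two passes would disagree and the "exact size" claim would no longer hold.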
velox/functions/sparksql/String.cpp
Outdated
// Allocate a string buffer.
auto rawBuffer = flatResult.getRawStringBufferWithSpace(totalResultBytes);
rows.applyToSelected([&](int row) {
int -> vector_size_t or auto
@rui-mo, fixed your comments. Thanks!
Hi @mbasmanova, Rui has no more comments. Could you please spare some time to review? Thanks!
@PHILO-HE Thank you for adding this function. Some comments.
Returns the concatenation for ``string`` and all elements in ``array<string>``, separated
by ``separator``. The type of ``separator`` is VARCHAR . It can take variable number of
remaining arguments, where ``string`` & ``array<string>`` can be used in combination. NULL
& -> and
by ``separator``. The type of ``separator`` is VARCHAR . It can take variable number of
remaining arguments, where ``string`` & ``array<string>`` can be used in combination. NULL
element is skipped in the concatenation. If ``separator`` is NULL, returns NULL, regardless
of the followinginputs. For non-NULL ``separator``, if no remaining input or all remaining
followinginputs -> following inputs
"unix_timestamp", | ||
// Skip concat_ws due to the below issue:
// We use "any" type in its signature to allow mixed
// using of VARCHAR & ARRAY<VARCHAR>. But the fuzzer
& -> and
'mixed using' -> 'a mix of VARCHAR and ARRAY arguments'
velox/functions/sparksql/String.cpp
Outdated
exec::FunctionSignatureBuilder()
    .returnType("varchar")
    .constantArgumentType("varchar")
    .argumentType("any")
I wonder if this function would be better defined as a special form.
velox/functions/sparksql/String.cpp
Outdated
@@ -103,6 +103,286 @@ class Length : public exec::VectorFunction {
}
};

class ConcatWs : public exec::VectorFunction {
 public:
  explicit ConcatWs(const std::optional<std::string>& separator)
Can this function be implemented as a simple function? If not, let's move it into a separate file.
@@ -54,7 +54,12 @@ int main(int argc, char** argv) {
"chr",
"replace",
"might_contain",
"unix_timestamp"};
"unix_timestamp",
// Skip concat_ws due to the below issue:
Without fuzzer coverage it will be hard to ensure the implementation is correct.
velox/functions/sparksql/String.cpp
Outdated
// varchar, [varchar], [array(varchar)], ... -> varchar.
exec::FunctionSignatureBuilder()
    .returnType("varchar")
    .constantArgumentType("varchar")
This signature says that separator must be constant, but it seems that implementation also allows non-constant separator.
velox/functions/sparksql/String.cpp
Outdated
DecodedVector& arrayDecoded = *decodedArrays[i].get();
auto baseArray = arrayDecoded.base()->as<ArrayVector>();
auto rawSizes = baseArray->rawSizes();
auto rawOffsets = baseArray->rawOffsets();
auto indices = arrayDecoded.indices();
auto elements = baseArray->elements();
exec::LocalSelectivityVector nestedRows(context, elements->size());
nestedRows.get()->setAll();
exec::LocalDecodedVector elementsHolder(
    context, *elements, *nestedRows.get());
auto& elementsDecoded = *elementsHolder.get();
Perhaps, rearrange the code to avoid doing this per row and do that per batch of rows instead.
namespace facebook::velox::functions::sparksql {

class ConcatWsCallToSpecialForm : public exec::FunctionCallToSpecialForm {
I don't have much context here. Can we register this as a simple function?
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!
Hey @PHILO-HE, is there any progress on this PR? Looking forward to support for this function!
Add the concat_ws Spark function, which returns the concatenation of its
inputs, separated by a separator (the first argument). It accepts a variable
number of VARCHAR or ARRAY<VARCHAR> arguments, and the two types can be
mixed in a single call.
This function is similar to ConcatFunction, except that concat_ws requires
a separator and supports ARRAY arguments and mixed argument types.
This PR is based on #6292 (author: @unigof), with a few
improvements to align with Spark's behavior.
Doc link.