Add concat_ws Spark function #8854

Open: wants to merge 8 commits into base: main

Contributor

@PHILO-HE PHILO-HE commented Feb 26, 2024

Add the concat_ws Spark function, which returns the concatenation of its
inputs, separated by a separator (the first argument). It accepts a variable
number of VARCHAR or ARRAY arguments, and the two types can be used in
combination.

This function is similar to ConcatFunction, except that concat_ws requires a
separator and supports ARRAY arguments as well as a mix of both types.

This PR is based on #6292 (author: @unigof). There are a few
improvements to align with Spark's behavior.

Doc link.
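
To make the described behavior concrete, here is a minimal standalone sketch of joining a mix of string and array arguments. It uses only the C++ standard library and is not the Velox implementation; `ConcatArg` and `concatWsMixed` are hypothetical names chosen for illustration, and NULL handling is left out.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical argument type: either a single string or an array of strings.
using ConcatArg = std::variant<std::string, std::vector<std::string>>;

// Joins every string and every array element, in order, with the separator.
std::string concatWsMixed(
    const std::string& separator,
    const std::vector<ConcatArg>& args) {
  std::string result;
  bool first = true;
  auto append = [&](const std::string& value) {
    if (!first) {
      result += separator;
    }
    result += value;
    first = false;
  };
  for (const auto& arg : args) {
    if (const auto* str = std::get_if<std::string>(&arg)) {
      append(*str);
    } else {
      // Arrays are flattened: each element is treated like a string argument.
      for (const auto& element : std::get<std::vector<std::string>>(arg)) {
        append(element);
      }
    }
  }
  return result;
}

// Usage: mirrors concat_ws(',', 'a', 'b', array('c', 'd'), 'e').
void concatWsMixedExample() {
  std::vector<ConcatArg> args;
  args.emplace_back(std::string("a"));
  args.emplace_back(std::string("b"));
  args.emplace_back(std::vector<std::string>{"c", "d"});
  args.emplace_back(std::string("e"));
  assert(concatWsMixed(",", args) == "a,b,c,d,e");
}
```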

@facebook-github-bot added the CLA Signed label Feb 26, 2024
@PHILO-HE
Contributor Author

Found that Spark supports mixing String and Array arguments. This patch needs to be enhanced to handle that.

SELECT concat_ws(',', 'a', 'b', array('c', 'd'), 'e');
"a,b,c,d,e"

@PHILO-HE PHILO-HE force-pushed the concat_ws branch 3 times, most recently from b2bca03 to 7efd217 Compare March 26, 2024 14:49
@PHILO-HE PHILO-HE force-pushed the concat_ws branch 9 times, most recently from 3033eb8 to a1f105b Compare April 10, 2024 02:58
@PHILO-HE
Contributor Author

The latest code now supports the mixed String/Array case described in my earlier comment.

@PHILO-HE
Contributor Author

@rui-mo, could you review this?

@PHILO-HE PHILO-HE force-pushed the concat_ws branch 2 times, most recently from 379108a to 8203b08 Compare April 10, 2024 03:11
.. spark:function:: concat_ws(separator, [string]/[array<string>], ...) -> varchar

Returns the concatenation for ``string`` & all elements in ``array<string>``, separated by
``separator``. Only accepts constant ``separator``. It takes variable number of remaining
Collaborator

Does Spark allow non-constant separator?

Contributor Author

Confirmed, yes. Just updated the code to support this. Thanks!
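
As a rough illustration of what supporting a non-constant separator involves, the sketch below reads the separator per row while still allowing a single shared value. It is plain C++ rather than Velox code; `SeparatorColumn` and `concatWsRows` are hypothetical names, and NULL values are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical separator column: one shared value when constant,
// otherwise one value per row.
struct SeparatorColumn {
  std::vector<std::string> values;
  bool isConstant;

  const std::string& at(std::size_t row) const {
    return values[isConstant ? 0 : row];
  }
};

// Joins the strings of each row using that row's separator.
std::vector<std::string> concatWsRows(
    const SeparatorColumn& separators,
    const std::vector<std::vector<std::string>>& rows) {
  std::vector<std::string> results;
  results.reserve(rows.size());
  for (std::size_t row = 0; row < rows.size(); ++row) {
    const std::string& separator = separators.at(row);
    std::string joined;
    for (std::size_t i = 0; i < rows[row].size(); ++i) {
      if (i > 0) {
        joined += separator;
      }
      joined += rows[row][i];
    }
    results.push_back(std::move(joined));
  }
  return results;
}

// Usage: the same rows joined with a constant and a per-row separator.
void concatWsRowsExample() {
  const std::vector<std::vector<std::string>> rows{{"a", "b"}, {"c", "d"}};
  const auto constant = concatWsRows(SeparatorColumn{{","}, true}, rows);
  assert(constant[0] == "a,b" && constant[1] == "c,d");
  const auto perRow = concatWsRows(SeparatorColumn{{"-", "+"}, false}, rows);
  assert(perRow[0] == "a-b" && perRow[1] == "c+d");
}
```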

@@ -20,6 +20,21 @@ Unless specified otherwise, all functions return NULL if at least one of the arg
If ``n < 0``, the result is an empty string.
If ``n >= 256``, the result is equivalent to chr(``n % 256``).

.. spark:function:: concat_ws(separator, [string]/[array<string>], ...) -> varchar

Returns the concatenation for ``string`` & all elements in ``array<string>``, separated by
Collaborator

nit: & to and?

``separator``. Only accepts constant ``separator``. It takes variable number of remaining
arguments. And ``string`` & ``array<string>`` can be used in combination. If ``separator``
is NULL, returns NULL, regardless of the following inputs. If only ``separator`` (not a
NULL) is provided or all remaining inputs are NULL, returns an empty string. ::
Collaborator

Would you clarify the behavior when the separator is not NULL, but string or array(string) contain NULL? Maybe also include an example.

Contributor Author

Just clarified and added an example.
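
The NULL rules discussed in this thread can be sketched as follows, using `std::optional` to stand in for a nullable VARCHAR. This is an illustrative model of the documented behavior (NULL separator yields NULL, NULL inputs are skipped), not the code in this PR; `NullableString` and `concatWsWithNulls` are hypothetical names.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a nullable VARCHAR value.
using NullableString = std::optional<std::string>;

// Returns NULL (std::nullopt) for a NULL separator; otherwise joins the
// non-NULL inputs, which yields an empty string when nothing is left.
std::optional<std::string> concatWsWithNulls(
    const NullableString& separator,
    const std::vector<NullableString>& inputs) {
  if (!separator.has_value()) {
    return std::nullopt;
  }
  std::string result;
  bool first = true;
  for (const auto& input : inputs) {
    if (!input.has_value()) {
      continue; // NULL inputs are skipped and add no separator.
    }
    if (!first) {
      result += *separator;
    }
    result += *input;
    first = false;
  }
  return result;
}

// Usage: a NULL in the middle is skipped, and a NULL separator yields NULL.
void concatWsWithNullsExample() {
  const std::vector<NullableString> inputs{
      NullableString{"a"}, std::nullopt, NullableString{"c"}};
  assert(*concatWsWithNulls(NullableString{"~"}, inputs) == "a~c");
  assert(!concatWsWithNulls(std::nullopt, inputs).has_value());
}
```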

@PHILO-HE PHILO-HE force-pushed the concat_ws branch 2 times, most recently from c2638ed to bbfa05e Compare April 12, 2024 03:35
@PHILO-HE
Contributor Author

@rui-mo, could you review again? Thanks!

Collaborator

@rui-mo left a comment

Thanks.

const std::vector<exec::VectorFunctionArg>& inputArgs,
const core::QueryConfig& /*config*/) {
auto numArgs = inputArgs.size();
VELOX_USER_CHECK(
Collaborator

nit: VELOX_USER_CHECK_GE

numArgs >= 1,
"concat_ws requires one arguments at least, but got {}.",
numArgs);
for (auto& arg : inputArgs) {
Collaborator

nit: const &

}
}

// String arg.
Collaborator

This is a bit obscure. Could we use a complete sentence as the comment?

@@ -103,6 +103,229 @@ class Length : public exec::VectorFunction {
}
};

void doApply(
Collaborator

I feel we are lacking a description of the concatenation process. It would be better to add one.

Collaborator

@rui-mo left a comment

Thanks.

separatorDecoded = exec::LocalDecodedVector(context, *args[0], rows);
}

// Calculate the total number of bytes in the result.
Collaborator

@rui-mo Apr 15, 2024

For better readability, it seems several helper functions could be extracted, each serving a single purpose.
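
One possible shape for such helpers, sketched in plain C++ under the assumption of a two-pass approach: one helper computes the exact result size, another copies bytes into a buffer allocated once. The names `joinedSizeInBytes`, `appendBytes`, and `joinTwoPass` are hypothetical and do not come from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Helper 1: exact number of bytes in the joined result.
std::size_t joinedSizeInBytes(
    const std::string& separator,
    const std::vector<std::string>& elements) {
  if (elements.empty()) {
    return 0;
  }
  std::size_t total = separator.size() * (elements.size() - 1);
  for (const auto& element : elements) {
    total += element.size();
  }
  return total;
}

// Helper 2: copies one value into the buffer and returns the next position.
char* appendBytes(char* out, const std::string& value) {
  std::memcpy(out, value.data(), value.size());
  return out + value.size();
}

// Pass 1 sizes the buffer exactly; pass 2 fills it without reallocation.
std::string joinTwoPass(
    const std::string& separator,
    const std::vector<std::string>& elements) {
  std::string result(joinedSizeInBytes(separator, elements), '\0');
  char* out = result.data();
  for (std::size_t i = 0; i < elements.size(); ++i) {
    if (i > 0) {
      out = appendBytes(out, separator);
    }
    out = appendBytes(out, elements[i]);
  }
  // The write position must land exactly at the end of the buffer.
  assert(out == result.data() + result.size());
  return result;
}
```

Because the first pass is exact, the final position check doubles as a guard against the two passes drifting apart when one of them is changed.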

@PHILO-HE PHILO-HE force-pushed the concat_ws branch 2 times, most recently from cabeb65 to 7433ee6 Compare April 16, 2024 02:26
@PHILO-HE
Contributor Author

@rui-mo, could you review this pr again? Thanks!

Collaborator

@rui-mo left a comment

Thanks.

of the following inputs. For non-NULL ``separator``, if no remaining input or all remaining
inputs are NULL, returns an empty string. ::

SELECT concat_ws('~', 'a', 'b', 'c'); -- 'a~b~c'
Collaborator

@rui-mo Apr 16, 2024

How does concat_ws behave when an element is an empty string or an empty array, like concat_ws('~', 'a', 'b', ''), or concat_ws('~', [], ['d'])?

Contributor Author

Just confirmed. In Spark, an empty string is still separated from its neighbors. I fixed this and updated the doc.

auto element = elementsDecoded.valueAt<StringView>(offset + j);
if (!element.empty()) {
allElements++;
totalResultBytes += element.size();
Collaborator

Here it seems empty elements are skipped. Are empty elements concatenated in Spark?

Contributor Author

Just fixed. Thanks!
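
The effect of the `!element.empty()` check questioned above can be shown with a small standalone sketch. The `skipEmpty` flag is a hypothetical switch added only to contrast the two behaviors; per the discussion, Spark corresponds to `skipEmpty == false`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Joins elements with the separator. When skipEmpty is true, empty strings
// are dropped (the behavior questioned in review); when false, they are kept
// and still separated from their neighbors.
std::string joinElements(
    const std::string& separator,
    const std::vector<std::string>& elements,
    bool skipEmpty) {
  std::string result;
  bool first = true;
  for (const auto& element : elements) {
    if (skipEmpty && element.empty()) {
      continue;
    }
    if (!first) {
      result += separator;
    }
    result += element;
    first = false;
  }
  return result;
}

// Usage: mirrors concat_ws('~', 'a', 'b', '') under both behaviors.
void joinElementsExample() {
  const std::vector<std::string> elements{"a", "b", ""};
  assert(joinElements("~", elements, /*skipEmpty=*/false) == "a~b~");
  assert(joinElements("~", elements, /*skipEmpty=*/true) == "a~b");
}
```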

decodedSeparator);

// Allocate a string buffer.
auto rawBuffer = flatResult.getRawStringBufferWithSpace(totalResultBytes);
Collaborator

Is totalResultBytes the exact size? Maybe use getRawStringBufferWithSpace(totalResultBytes, true) if it is.

Contributor Author

Fixed. Thanks!


// Allocate a string buffer.
auto rawBuffer = flatResult.getRawStringBufferWithSpace(totalResultBytes);
rows.applyToSelected([&](int row) {
Collaborator

int -> vector_size_t or auto

Contributor Author

@PHILO-HE left a comment

@rui-mo, I have addressed your comments. Thanks!

@PHILO-HE
Contributor Author

Hi @mbasmanova, Rui has no more comments. Could you please spare some time to review? Thanks!

Contributor

@mbasmanova left a comment

@PHILO-HE Thank you for adding this function. Some comments.


Returns the concatenation for ``string`` and all elements in ``array<string>``, separated
by ``separator``. The type of ``separator`` is VARCHAR . It can take variable number of
remaining arguments, where ``string`` & ``array<string>`` can be used in combination. NULL
Contributor

& -> and

by ``separator``. The type of ``separator`` is VARCHAR . It can take variable number of
remaining arguments, where ``string`` & ``array<string>`` can be used in combination. NULL
element is skipped in the concatenation. If ``separator`` is NULL, returns NULL, regardless
of the followinginputs. For non-NULL ``separator``, if no remaining input or all remaining
Contributor

followinginputs -> following inputs

"unix_timestamp",
// Skip concat_ws due to the below issue:
// We use "any" type in its signature to allow mixed
// using of VARCHAR & ARRAY<VARCHAR>. But the fuzzer
Contributor

& -> and

'mixed using' -> 'a mix of VARCHAR and ARRAY arguments'

exec::FunctionSignatureBuilder()
.returnType("varchar")
.constantArgumentType("varchar")
.argumentType("any")
Contributor

I wonder if this function would be better defined as a special form.

@@ -103,6 +103,286 @@ class Length : public exec::VectorFunction {
}
};

class ConcatWs : public exec::VectorFunction {
public:
explicit ConcatWs(const std::optional<std::string>& separator)
Contributor

Can this function be implemented as a simple function? If not, let's move it into a separate file.

@@ -54,7 +54,12 @@ int main(int argc, char** argv) {
 "chr",
 "replace",
 "might_contain",
-"unix_timestamp"};
+"unix_timestamp",
+// Skip concat_ws due to the below issue:
Contributor

Without fuzzer coverage it will be hard to ensure the implementation is correct.

// varchar, [varchar], [array(varchar)], ... -> varchar.
exec::FunctionSignatureBuilder()
.returnType("varchar")
.constantArgumentType("varchar")
Contributor

This signature says that the separator must be constant, but it seems that the implementation also allows a non-constant separator.

Comment on lines +124 to +134
DecodedVector& arrayDecoded = *decodedArrays[i].get();
auto baseArray = arrayDecoded.base()->as<ArrayVector>();
auto rawSizes = baseArray->rawSizes();
auto rawOffsets = baseArray->rawOffsets();
auto indices = arrayDecoded.indices();
auto elements = baseArray->elements();
exec::LocalSelectivityVector nestedRows(context, elements->size());
nestedRows.get()->setAll();
exec::LocalDecodedVector elementsHolder(
context, *elements, *nestedRows.get());
auto& elementsDecoded = *elementsHolder.get();
Contributor

Perhaps, rearrange the code to avoid doing this per row and do that per batch of rows instead.
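
A simplified picture of that rearrangement, assuming a flattened array layout with per-row offsets and sizes: the raw pointers are fetched once for the whole batch, and the row loop only indexes into them. `ArrayColumnSketch` and `joinArrayRows` are hypothetical names; this does not use Velox's `DecodedVector` or `ArrayVector`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flattened array column: row i owns the elements in
// [offsets[i], offsets[i] + sizes[i]).
struct ArrayColumnSketch {
  std::vector<int32_t> offsets;
  std::vector<int32_t> sizes;
  std::vector<std::string> elements;
};

std::vector<std::string> joinArrayRows(
    const std::string& separator,
    const ArrayColumnSketch& column) {
  // Per-batch setup: done once, outside the row loop.
  const int32_t* rawOffsets = column.offsets.data();
  const int32_t* rawSizes = column.sizes.data();
  const std::string* rawElements = column.elements.data();

  std::vector<std::string> results(column.offsets.size());
  for (std::size_t row = 0; row < results.size(); ++row) {
    // Per-row work: only index arithmetic and appends.
    std::string& out = results[row];
    for (int32_t j = 0; j < rawSizes[row]; ++j) {
      if (j > 0) {
        out += separator;
      }
      out += rawElements[rawOffsets[row] + j];
    }
  }
  return results;
}

// Usage: three rows holding ['a', 'b'], [], and ['c'].
void joinArrayRowsExample() {
  const ArrayColumnSketch column{{0, 2, 2}, {2, 0, 1}, {"a", "b", "c"}};
  const auto joined = joinArrayRows(",", column);
  assert(joined[0] == "a,b");
  assert(joined[1].empty());
  assert(joined[2] == "c");
}
```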
