Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STRING_AGG SQL aggregator function #11241

Merged
merged 10 commits into from
Aug 10, 2021
Merged

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented May 12, 2021

Description

This PR adds STRING_AGG, which is sort of like ARRAY_AGG added in #11157 but it ignores null values, with an ARRAY_TO_STRING function on the output array to turn it into a string value.

example:

SELECT STRING_AGG(l1, '-'), STRING_AGG(DISTINCT l1, ',') FROM numfoo

output:

["7-325323-0-0-0-0", "0,7,325323"]
Function Notes Default
STRING_AGG(expr, separator, [size]) Collects all values of expr into a single STRING, ignoring null values. Each value is joined by the separator which must be a literal STRING. An optional size in bytes can be supplied to limit aggregation size (default of 1024 bytes). If the aggregated string grows larger than the maximum size in bytes, the query will fail. Use of ORDER BY within the STRING_AGG expression is not currently supported, and the ordering of results within the output string may vary depending on processing order. null if druid.generic.useDefaultValueForNull=false, otherwise ''
STRING_AGG(DISTINCT expr, separator, [size]) Collects all distinct values of expr into a single STRING, ignoring null values. Each value is joined by the separator which must be a literal STRING. An optional size in bytes can be supplied to limit aggregation size (default of 1024 bytes). If the aggregated string grows larger than the maximum size in bytes, the query will fail. Use of ORDER BY within the STRING_AGG expression is not currently supported, and the ordering of results within the output string may vary depending on processing order. null if druid.generic.useDefaultValueForNull=false, otherwise ''

Like ARRAY_AGG this is an expression aggregator, so is not well optimized, but will fill in the gaps for now.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

OperandTypes.sequence(
StringUtils.format("'%s'(expr, separator)", NAME),
OperandTypes.ANY,
OperandTypes.LITERAL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this also take care of the cases if the literal is wrapped in a scalar function (like concat('a', 'b'))?

Copy link
Member Author

@clintropolis clintropolis Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain if this can (or should) handle an expression for the separator parameter, I'll do some investigation and potentially update this signature accordingly and add a test case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I'm going to hold off on updating this PR until #11280 goes in, since it has a slight refactor/improvement of the underlying expression aggregator that will require some minor changes in this PR.

)
);
} else {
return Aggregation.create(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the if-else can be collapsed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it could if i pull out the two expressions into a separate conditional block; I did not make this change yet because it seems like minor shifting of the condition but will modify here (as well as ARRAY_AGG since I did the same thing there) if you feel more strongly about it.

}
maxSizeBytes = ((Number) RexLiteral.value(maxBytes)).intValue();
}
final DruidExpression arg = arguments.get(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be aggregationColumnArg

@@ -65,7 +65,7 @@
// minimum permitted agg size is 10 bytes so it is at least large enough to hold primitive numerics (long, double)
// | expression type byte | is_null byte | primitive value (8 bytes) |
private static final int MIN_SIZE_BYTES = 10;
private static final HumanReadableBytes DEFAULT_MAX_SIZE_BYTES = new HumanReadableBytes(1L << 10);
public static final HumanReadableBytes DEFAULT_MAX_SIZE_BYTES = new HumanReadableBytes(1L << 10);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like an extra change - can't find its usage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it wasn't meant to be, its used in tests now

Copy link
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @clintropolis 👍

@clintropolis clintropolis merged commit 9af7ba9 into apache:master Aug 10, 2021
@clintropolis clintropolis deleted the string-agg branch August 10, 2021 20:47
@clintropolis
Copy link
Member Author

thanks for review @rohangarg and @jihoonson 👍

@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants