Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding new config for disabling group by on multiValue column #12253

Merged
merged 5 commits into from
Feb 16, 2022

Conversation

cryptoe
Copy link
Contributor

@cryptoe cryptoe commented Feb 10, 2022

Description

As part of #12078 one of the followup's was to have a specific config which does not allow accidental unnesting of multi value columns if such columns become part of the grouping key.
Added a config groupByEnableMultiValueUnnesting which can be set in the query context.

The default value of groupByEnableMultiValueUnnesting is true, therefore it does not change the current engine behavior.
If groupByEnableMultiValueUnnesting is set to false, the query will fail if it encounters a multi-value column in the grouping key.


Key changed/added classes in this PR
  • GroupByQueryConfig
  • GroupByQueryEngineV2
  • GroupByQueryEngine

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

query.getDimensions());
if (!(query.getContextValue(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, true))
&& !allSingleValueDims) {
throw new ISE(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should name the dimension so this error message is more actionable. People are going to need to know which dimension is causing the problem, so they can either remove it, set it to array mode, or process it in some other way.

How about:

  1. Change hasNoExplodingDimensions to findExplodingDimensions (i.e., return the names of the exploding dimensions).
  2. Change the error message to:

Encountered multi-value dimensions [%s] that cannot be processed with %s set to false. Consider changing these dimensions to arrays or setting %s to true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acked. Agreed such messaging would be better for the users

Copy link
Contributor Author

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @gianm . Please have a look at findAllProbableExplodingDimensions

query.getDimensions());
if (!(query.getContextValue(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, true))
&& !allSingleValueDims) {
throw new ISE(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acked. Agreed such messaging would be better for the users

## Disable GroupBy on multivalue columns

As grouping on multivalue columns causes implicit unnest, users can avoid this behaviour by setting
`groupByEnableMultiValueUnnesting` in the query context to `false`. This will result the query to error out.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave this out of the docs until we the array-based story is complete. Then we can document this, array types, array functions, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we can keep this documented but change the error message to not mention array-based dimensions. In that case, we can change the error message to this:

Encountered multi-value dimension [%s] that cannot be processed with %s set to false. Consider setting %s to true for unnesting behavior, or using an expression to create a scalar from the multi-value dimension.

For the docs, a couple style points:

  1. The rest of this page uses second person ("you can…") rather than third ("users can…") so we should stick to that.
  2. We usually use US spelling in documentation (e.g. behavior instead of behaviour).

So I'd go with:

You can disable the implicit unnesting behavior for groupBy by setting groupByEnableMultiValueUnnesting: false in your query context. In this mode, the groupBy engine will return an error instead of completing the query. This is a safety feature for situations where you believe that all dimensions are singly-valued and want the engine to reject any multi-valued dimensions that were inadvertently included.

Also, all documented groupBy parameters should be included in the groupbyquery.md document as well, under "GroupBy v2 configurations". So if you mention this here it should be mentioned in the main doc too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest draft I have kept the documentation and changed the error messaging a bit.
Update the "GroupBy v2 configurations" as well.

*/
public static boolean hasNoExplodingDimensions(
public static List<String> findAllProbableExplodingDimensions(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, looking through this I just realized that this function can't tell if dimensions are definitely multi-valued or not. It can only tell if they might be multi-valued. (But they also might not be!)

So I think we should move the check; instead of checking here, we should check row by row as we actually run the query. If a multi-value row is ever encountered, then at that point we should throw the error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch. Updated the check while doing row by row iteration.

@@ -764,6 +783,9 @@ protected void aggregateMultiValueDims(Grouper<ByteBuffer> grouper)
);

if (doAggregate) {
// this check is done during the row aggregation as a dimension can become multi-value col if
// {@link org.apache.druid.segment.column.ColumnCapabilities} is unkown.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: javadoc links don't work here, maybe just say "column capabilities", also "unkown" -> "unknown"

if (multiValuesSize == 0) {
if (!grouper.aggregate(GroupByColumnSelectorStrategy.GROUP_BY_MISSING_VALUE).isOk()) {
return;
}
} else {
if (multiValuesSize > 1) {
// this check is done during the row aggregation as a dimension can become multi-value col if
// {@link org.apache.druid.segment.column.ColumnCapabilities} is unkown.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same nit and typo

@gianm
Copy link
Contributor

gianm commented Feb 16, 2022

Looks good after the changes. Thanks!

@abhishekagarwal87 abhishekagarwal87 merged commit 5794331 into apache:master Feb 16, 2022
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants