UNION ALLs in MSQ #14981

Merged


LakshSingla
Contributor

@LakshSingla LakshSingla commented Sep 13, 2023

Description

(Note: Unless specified otherwise, this description uses "union" and "union all" interchangeably; both mean UNION ALL in SQL.)

This PR updates the following:

  1. UnionDataSource can have data sources apart from the TableDataSource. This will be used for MSQ only, since MSQ, in theory, can plan arbitrary unions. Also, it is required to plan unions in MSQ in the DataSourcePlan.
  2. Disallow top-level union alls in MSQ. This is because the SQL layer executes the top-level unions sequentially, which doesn't make sense for an async engine like MSQ. More information and examples follow below.
  3. Add the ability to plan a UnionDataSource in MSQ. This will provide MSQ feature parity with the native engine, as it will allow the
  4. (Will be broken into a separate PR) Add the ability to plan arbitrary data sources as unions (this will be an engine feature). However, this requires alignment on how much leeway we are willing to support: do unions with different column names get planned, do unions with different types get planned, etc.

1

MSQ needs to plan the individual data sources of the union and perform a replace operation so that each data source can be represented by the input specs that it requires. This warrants UnionDataSource to accept other data sources as its children. The current methods in the UnionDataSource have been refactored to perform the original checks only when the data source is used in certain contexts, like the native stack.
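A minimal sketch of the idea behind this point, not the actual Druid classes: the union accepts any child data source, and the table-only check moves out of construction into an accessor that a native-stack caller would use. All names here (`SketchDataSource`, `SketchTable`, `SketchSubquery`, `SketchUnion`, `tableNamesForNativeStack`) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Druid's data source hierarchy; not the real API.
interface SketchDataSource {}

final class SketchTable implements SketchDataSource {
    final String name;
    SketchTable(String name) { this.name = name; }
}

final class SketchSubquery implements SketchDataSource {
    final String sql;
    SketchSubquery(String sql) { this.sql = sql; }
}

final class SketchUnion implements SketchDataSource {
    private final List<SketchDataSource> children;

    // Construction only rejects an empty union; it does not insist on tables.
    SketchUnion(List<SketchDataSource> children) {
        if (children.isEmpty()) {
            throw new IllegalArgumentException("union needs at least one child");
        }
        this.children = new ArrayList<>(children);
    }

    List<SketchDataSource> getChildren() { return children; }

    // Table-only check, applied only when a caller actually needs table names.
    List<String> tableNamesForNativeStack() {
        List<String> names = new ArrayList<>();
        for (SketchDataSource child : children) {
            if (!(child instanceof SketchTable)) {
                throw new IllegalStateException("native stack supports unions of tables only");
            }
            names.add(((SketchTable) child).name);
        }
        return names;
    }
}
```

With this shape, a union of a table and a subquery can be constructed and handed to a planner that understands it, while a caller that requires tables still fails fast.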


2

MSQ currently doesn't support UNION queries. However, in the query stack, there are two types of UNIONs:

  1. UnionDataSource - Very limited. MSQ detects this and throws a QueryNotSupported fault, which is the expected behavior
  2. Top-level union - Works around the shortcomings of 1.

However, 2) is executed sequentially by the SQL layer, and the results are appended in order. For a simple query like

SELECT * FROM foo
UNION ALL
SELECT * FROM foo2 

SQL would execute SELECT * FROM foo and SELECT * FROM foo2 and concatenate the results.
This works fine for engines that produce results synchronously, like sql-native, where we return the results directly. However, for MSQ, which produces results asynchronously, the concatenation logic doesn't work as expected, since we don't wait for the first query to finish and fetch its results before submitting the second query.

To make matters worse, the SQL layer submits the first query, gets the query ID back as the result, and then executes the second query (which fails). Therefore, only part of the query is submitted successfully, and we might even get incorrect results back.
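The mismatch can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `TopLevelUnionSketch`, `syncEngine`, and `asyncEngine` are made-up names, the "engines" return hard-coded strings, and the failure of the second submission is not modeled. It shows only that appending whatever each branch returns yields rows for a synchronous engine and task identifiers for an asynchronous one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class TopLevelUnionSketch {
    // Runs each branch in order and appends whatever the engine hands back.
    static List<String> runTopLevelUnion(List<String> branches, Function<String, List<String>> engine) {
        List<String> out = new ArrayList<>();
        for (String sql : branches) {
            out.addAll(engine.apply(sql));
        }
        return out;
    }

    // Synchronous toy engine: returns the (fake) rows of the query.
    static List<String> syncEngine(String sql) {
        return Arrays.asList("row1 of [" + sql + "]", "row2 of [" + sql + "]");
    }

    // Asynchronous toy engine: returns only an id for the submitted task.
    static List<String> asyncEngine(String sql) {
        return Arrays.asList("taskId-for-[" + sql + "]");
    }

    public static void main(String[] args) {
        List<String> branches = Arrays.asList("SELECT * FROM foo", "SELECT * FROM foo2");
        System.out.println(runTopLevelUnion(branches, TopLevelUnionSketch::syncEngine));
        System.out.println(runTopLevelUnion(branches, TopLevelUnionSketch::asyncEngine));
    }
}
```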

This PR introduces the engine feature ALLOW_TOP_LEVEL_UNION_ALL, which dictates whether the planner can plan the query using top-level union alls. MSQ disallows this, so the queries are forced to plan using the union data source, which fails with a QueryNotSupported fault.

This flag will also be useful once we start supporting unions in MSQ, which we'd want to execute exclusively using UnionDataSource; the flag would tie in seamlessly with the query paths we'd want to take when planning unions then.
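As a rough illustration of how such a flag gates the planning path, here is a simplified, self-contained sketch. Only the flag name ALLOW_TOP_LEVEL_UNION_ALL comes from this PR; the `EngineFeatureSketch` class, its `Engine` enum, and `unionPlanFor` are hypothetical, and the real planner decides through planner rules rather than a single method.

```java
import java.util.EnumSet;
import java.util.Set;

final class EngineFeatureSketch {
    // Only the flag named in this PR; any other engine features are omitted here.
    enum Feature { ALLOW_TOP_LEVEL_UNION_ALL }

    // Hypothetical engines with the feature sets described in the PR text.
    enum Engine {
        NATIVE(EnumSet.of(Feature.ALLOW_TOP_LEVEL_UNION_ALL)),
        MSQ(EnumSet.noneOf(Feature.class));

        private final Set<Feature> features;

        Engine(Set<Feature> features) { this.features = features; }

        boolean featureAvailable(Feature feature) { return features.contains(feature); }
    }

    // Chooses which union path the planner may take for the given engine.
    static String unionPlanFor(Engine engine) {
        return engine.featureAvailable(Feature.ALLOW_TOP_LEVEL_UNION_ALL)
            ? "top-level UNION ALL"
            : "UnionDataSource";
    }
}
```

Under these assumptions, `unionPlanFor(Engine.NATIVE)` picks the top-level path and `unionPlanFor(Engine.MSQ)` falls through to the UnionDataSource path.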

With the change, the following query:

Screenshot 2023-09-14 at 2 07 34 AM

Native tasks plan the query as before (top-level union all):

Screenshot 2023-09-14 at 2 07 39 AM

MSQ tasks can't plan the query with a top-level union all, so they use the UnionDataSource to plan it, which then ultimately fails with QueryNotSupported in MSQ:

Screenshot 2023-09-14 at 2 07 48 AM

3

Check out the changes in the DataSourcePlan, which allow the union to be planned in the SQL stack.
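A heavily simplified sketch of what "planning a union" can mean here, assuming each child is a plain table: every child is planned in order, its input spec is recorded, and the child is replaced by a reference to the corresponding input number. The class `UnionPlanSketch`, its `Plan` type, and the string formats are hypothetical and do not mirror the actual DataSourcePlan code.

```java
import java.util.ArrayList;
import java.util.List;

final class UnionPlanSketch {
    // Result of planning: the rewritten children plus the input specs they read from.
    static final class Plan {
        final List<String> rewrittenChildren = new ArrayList<>();
        final List<String> inputSpecs = new ArrayList<>();
    }

    // Plans each child in order; child i is replaced by a reference to input number i.
    static Plan planUnion(List<String> childTables) {
        Plan plan = new Plan();
        for (String table : childTables) {
            int inputNumber = plan.inputSpecs.size();
            plan.inputSpecs.add("TableInputSpec(" + table + ")");
            plan.rewrittenChildren.add("input#" + inputNumber);
        }
        return plan;
    }
}
```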


4

TBD


Release note

MSQ can execute UNION ALL queries with UnionDataSource.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@LakshSingla
Contributor Author

DruidSortUnionRule has a defensive check, so I can't add more tests to satisfy code coverage. I think there's little value in trying to satisfy it.

@LakshSingla LakshSingla added Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 and removed MSQ labels Sep 14, 2023
+ "SELECT * FROM foo\n")
.setExpectedRowSignature(rowSignature)
.setExpectedDataSource("foo1")
.setExpectedMSQFault(QueryNotSupportedFault.instance())
Contributor

Hmm… I am confused about why this yields a QueryNotSupportedFault. Shouldn't it fail to plan, and generate a planner error instead of an MSQ fault?

Contributor Author

It does plan using the UnionDataSource, which then goes into MSQ.

Contributor Author

Added a comment on how it is getting planned.

Contributor

I guess this test can be removed. It should be planned using a UnionDataSource.
Can we also add the NativeQuery for assertion?

DruidException.Persona.ADMIN,
DruidException.Category.INVALID_INPUT,
"general"
).expectMessageIs("Query planning failed for unknown reason, our best guess is this "
Contributor

Where is this error coming from? Looking at the code for the union rule, I would think it can't happen, because it's generated by isCompatible, which isn't called when the ALLOW_TOP_LEVEL_UNION_ALL feature is missing. The error should be something about UNION ALL being unsupported for this engine.

Contributor Author

@LakshSingla LakshSingla Sep 14, 2023

It tries to plan the query using the UnionDataSourceRule, then goes into isCompatible, which overwrites the planning error that was already set.
This gets planned using UnionDataSourceRule, and not the top-level union all, since the column names match and isCompatible returns.

Contributor Author

@LakshSingla LakshSingla Sep 14, 2023

I'll rename the two test cases that I added

Contributor Author

Added comments and renamed the test cases. Hope they clarify the confusion

@LakshSingla LakshSingla removed the MSQ label Sep 15, 2023
@LakshSingla LakshSingla changed the title Disallow top-level UNION ALLs in MSQ Unions in MSQ Sep 29, 2023
@LakshSingla LakshSingla changed the title Unions in MSQ UNION ALLs in MSQ Sep 29, 2023
// No need to set the planning error here
return false;
}
if (!firstColumnNames.equals(secondColumnNames)) {
Contributor

Due to casting, Calcite might change the name to something like EXPR$0. In such a case, this does not allow onMatch to trigger. What's our plan of action for handling such cases?

Contributor Author

I am debating whether to act stringently and only allow the same names and types (no implicit casts) when using union all with this rule as well. Is there a way to detect that an implicit cast was done? If so, we can build something around it to remap it to the original variable name. Otherwise, it would be a hassle for users, since they won't be able to reference the original column.

Sidenote: I am debating whether to pull the planning changes out of the PR into their separate PR.

Contributor Author

If the union is nested inside a subquery, it would be difficult for the upper callers to reference it once it has changed into the form EXPR$0 due to implicit casting. Therefore, if we can identify that it has been cast implicitly, we should remap it back to the original column name; otherwise, I am debating whether we should include both the name and the type check.
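For illustration only, the stricter option discussed in this thread (same names and same types, no implicit casts) could look roughly like the sketch below. It is not the rule's actual code: `StrictUnionCheckSketch`, `sameNamesAndTypes`, and `signature` are hypothetical, and row signatures are modeled as ordered maps from column name to a type name string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

final class StrictUnionCheckSketch {
    // True only when both sides list the same columns, in the same order, with the same types.
    static boolean sameNamesAndTypes(Map<String, String> first, Map<String, String> second) {
        return new ArrayList<>(first.entrySet()).equals(new ArrayList<>(second.entrySet()));
    }

    // Helper that builds an ordered signature from alternating name/type arguments.
    static Map<String, String> signature(String... namesAndTypes) {
        Map<String, String> sig = new LinkedHashMap<>();
        for (int i = 0; i + 1 < namesAndTypes.length; i += 2) {
            sig.put(namesAndTypes[i], namesAndTypes[i + 1]);
        }
        return sig;
    }
}
```

Under this check, a branch whose column was renamed to EXPR$0 by a cast would be rejected, as would a branch whose column has the same name but a different type.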

Contributor

@cryptoe cryptoe left a comment

Went through the initial code. Left some initial review.

@@ -36,13 +37,17 @@
import java.util.function.Function;
import java.util.stream.Collectors;

/**
* TODO(laksh):
Contributor

Let's add a Javadoc here.

*/
@Test
@Override
public void testUnionIsUnplannable()
Contributor

Is this still required?

}

@Test
public void testUnionOnSubqueries()
Contributor

I guess this test can be marked with ignore till we have the new calcite rule in place.

@@ -170,6 +171,18 @@ public static DataSourcePlan forDataSource(
minStageNumber,
broadcast
);
} else if (dataSource instanceof UnionDataSource) {
Contributor

Please update the MSQ known issues and the docs wherever we are calling union all unsupported in MSQ.

@LakshSingla LakshSingla added this to the 28.0 milestone Oct 3, 2023
Contributor

@cryptoe cryptoe left a comment

Changes lgtm!!


/**
* Doesn't pass through Druid however the planning error is different as it rewrites to a union datasource.
* This test is disabled because MSQ wants to support union datasources, and it makes little sense to add highly
Contributor

This comment seems outdated.

if (!(input instanceof TableDataSource)) {
throw DruidException.defensive("should be table");
}
return Iterables.getOnlyElement(input.getTableNames());
Contributor

Nit: Let's avoid using Iterables.getOnlyElement(). Let's use CollectionUtils.getOnlyElement().

@LakshSingla LakshSingla merged commit 549ef56 into apache:master Oct 9, 2023
81 checks passed
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
MSQ now supports UNION ALL with UnionDataSource
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
MSQ now supports UNION ALL with UnionDataSource
Labels
Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying
4 participants