Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSQ: Validate that strings and string arrays are not mixed. #15920

Merged
merged 10 commits into from Mar 13, 2024

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Feb 17, 2024

When multi-value strings and string arrays coexist in the same column, it causes problems with "classic MVD" style queries such as:

  select * from wikipedia -- fails at runtime
  select count(*) from wikipedia where flags = 'B' -- fails at planning time
  select flags, count(*) from wikipedia group by 1 -- fails at runtime

To avoid these problems, this patch adds type verification for INSERT and REPLACE. It is targeted: the only type changes that are blocked are string-to-array and array-to-string. There is also a way to exclude certain columns from the type checks, if the user really knows what they're doing.

The patch includes new documentation about arrayIngestMode; see docs/querying/arrays.md. Other mentions of it throughout the docs are updated to link to this section.

Error scenarios:

Write SQL ARRAY to existing ARRAY field in arrayIngestMode: mvd

Cannot write into field[arrayString] using type[VARCHAR ARRAY] and arrayIngestMode[mvd], since the existing type is[VARCHAR ARRAY]. Try setting arrayIngestMode to[array] to retain the SQL type[VARCHAR ARRAY]. See https://druid.apache.org/docs/latest/querying/arrays#arrayingestmode for more details about this check and how to override it if needed.

Write SQL VARCHAR to existing ARRAY field in arrayIngestMode: mvd

Cannot write into field[arrayString] using type[VARCHAR] and arrayIngestMode[mvd], since the existing type is[VARCHAR ARRAY]. Try setting arrayIngestMode to[array] and adjusting your query to make this column an ARRAY instead of VARCHAR. See https://druid.apache.org/docs/latest/querying/arrays#arrayingestmode for more details about this check and how to override it if needed.

Write SQL VARCHAR to existing ARRAY field in arrayIngestMode: array

Cannot write into field[arrayString] using type[VARCHAR] and arrayIngestMode[array], since the existing type is[VARCHAR ARRAY]. Try adjusting your query to make this column an ARRAY instead of VARCHAR. See https://druid.apache.org/docs/latest/querying/arrays#arrayingestmode for more details about this check and how to override it if needed.

Write SQL ARRAY to existing VARCHAR field in arrayIngestMode: array

Cannot write into field[mvString] using type[VARCHAR ARRAY] and arrayIngestMode[array], since the existing type is[VARCHAR]. Try wrapping this field using ARRAY_TO_MV(...) AS "mvString". See https://druid.apache.org/docs/latest/querying/arrays#arrayingestmode for more details about this check and how to override it if needed.

When multi-value strings and string arrays coexist in the same column,
it causes problems with "classic MVD" style queries such as:

  select * from wikipedia -- fails at runtime
  select count(*) from wikipedia where flags = 'B' -- fails at planning time
  select flags, count(*) from wikipedia group by 1 -- fails at runtime

To avoid these problems, this patch adds type verification for INSERT
and REPLACE. It is targeted: the only type changes that are blocked are
string-to-array and array-to-string. There is also a way to exclude
certain columns from the type checks, if the user really knows what
they're doing.
@github-actions github-actions bot added Area - Batch Ingestion Area - Querying Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Feb 17, 2024
@gianm
Copy link
Contributor Author

gianm commented Feb 17, 2024

This patch does not yet include docs and tests, so marking it draft. The idea is there though.

@gianm gianm marked this pull request as draft February 17, 2024 02:46
@gianm
Copy link
Contributor Author

gianm commented Feb 17, 2024

Depending on which direction you're going in, the errors look like either:

Cannot write into field[flags] using type[VARCHAR ARRAY] and arrayIngestMode[mvd], since the existing type is[VARCHAR ARRAY]. Try setting arrayIngestMode to[array] to retain the SQL type[VARCHAR ARRAY]

Or:

Cannot write into field[flags] using type[VARCHAR ARRAY] and arrayIngestMode[array], since the existing type is[VARCHAR]. Try wrapping this field using ARRAY_TO_MV(...) AS "flags"

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

)
{
if (targetTable == null) {
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this doesn't necessarily need to be done as part of this PR, but I do want to see at some point that this would have a validation failure for the case where the output type has any ARRAY and arrayIngestMode isn't set to array, for the effect of preventing people from writing new insert queries that are incorrectly relying on the mvd behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more about the validations that are in effect when arrayIngestMode: array, since those validations would help us change the default.

Maybe when we change the default, we can also consider making it so when people explicitly set arrayIngestMode: mvd, they need to use ARRAY_TO_MV. (I think that's what you're suggesting here.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe when we change the default, we can also consider making it so when people explicitly set arrayIngestMode: mvd, they need to use ARRAY_TO_MV. (I think that's what you're suggesting here.)

yea, I'm mostly thinking about it in terms of forcing people to write the correct queries as soon as possible in either mode to make things less painful down the road, and was thinking it would be nice to prevent bad habits for new tables before they have a chance to get started since those habits are going to have to change.

Its ok to wait until we change the default i guess, which this change should allow us to do much sooner (druid 30 hopefully)

@@ -152,6 +154,7 @@ public class MultiStageQueryContext
public static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.MVD;

public static final String CTX_SKIP_TYPE_VERIFICATION = "skipTypeVerification";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we have docs for this parameter? If so, it'll be another parameter that we'd have to support in MSQ. If not, then I don't see a way for users to set this parameter, both of which seem ugly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do without this parameter, since the user can use ARRAY_TO_MV or MV_TO_ARRAY functions to preserve the behaviour?

Copy link
Contributor Author

@gianm gianm Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this parameter is to allow someone to explicitly decide to ingest mismatched types. I don't think ARRAY_TO_MV and MV_TO_ARRAY would be helpful, since they're more something that users could use to make types line up properly.

One scenario for this parameter might be that someone is migrating from MVD to array, and has updated their queries to work with mismatched array and MVD types: they added MV_TO_ARRAY around all their MVDs, and have migrated to UNNEST, etc. They want to start ingesting arrays rather than MVDs for new data, but don't want to go back and rewrite all their old data, because that would be too expensive.

I do agree it's not really ideal for it to be a new context parameter. I just couldn't think of another approach to allowing validation bypass for specific columns, without a table-level catalog.

Maybe some new SQL syntax?

I wonder if you have any thoughts here? (or anyone else?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the context flag seems ok to me, especially since i don't think we're quite yet in a place where a lot of people will want to be using a mix of mvds and arrays, we are missing optimized versions of MV_TO_ARRAY and ARRAY_TO_MV to make things a lot more efficient to use a mix of stuff.

I feel like once we've transitioned to array being default arrayIngestMode, given some time for it to settle, and then eventually remove arrayIngestMode to drop all of the other behaviors, we might be able to consider removing this validation in favor of pushing it to the catalog since the old habits would be gone and at that point it should also be a lot less painful to work with a mix of stuff. But we can revisit once arrayIngestMode is removed.

final ColumnType newDruidType =
DimensionSchemaUtils.getDimensionType(Calcites.getColumnTypeForRelDataType(newSqlType), arrayIngestMode);

if (newDruidType.isArray() && oldDruidType.is(ValueType.STRING)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like we are building a DDL layer, but for specific types, and for MSQ ingestion, which seems kinda off, given that Druid allows a flexible schema. However, I also think that most of the cases triggering this would be accidental, rather than deliberate. Do we plan on getting rid of these validations, once arrayIngestMode = array is the only acceptable argument to MSQ ingestion, and we are not stuck between supporting incorrect behavior, and allowing only correct but incompatible behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation is really just to prevent accidents that break queries, due a pretty unique situation:

  • due to the complexity around arrayIngestMode it is easy to accidentally ingest a list of strings as the "wrong" type accidentally (MVD when you meant ARRAY or vice-versa).
  • when a single column has both MVD and ARRAY in it, existing queries can break for the reasons I mentioned in the PR description.

This situation is problematic but also only really exists for the MVD/array combinaton. So I don't intent this to be the start of a DDL layer, just a point fix to this particular problem. I do think we want a catalog layer that can be part of more extensive type validations, but that should be opt-in & user-driven & more flexible than some hard-coded checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation!

@cryptoe cryptoe added this to the 29.0.1 milestone Feb 21, 2024
@gianm gianm marked this pull request as ready for review March 13, 2024 05:49
@gianm
Copy link
Contributor Author

gianm commented Mar 13, 2024

Taking this out of draft since I've added tests and docs. I have also added error message examples to the PR description.

This causes them to be stored as [multi-value strings](multi-value-dimensions.md), using the same `STRING` column type
as regular scalar strings. SQL `BIGINT ARRAY` and `DOUBLE ARRAY` cannot be loaded under `arrayIngestMode: mvd`. This
is the default behavior when `arrayIngestMode` is not provided in your query context, although the default behavior
may change to `array` in a future release.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: will change

Comment on lines +123 to +128
) EXTEND (
"timestamp" VARCHAR,
"label" VARCHAR,
"arrayString" VARCHAR ARRAY,
"arrayLong" BIGINT ARRAY,
"arrayDouble" DOUBLE ARRAY
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for fixing these up, i wrote these docs before i fixed up EXTEND to support array types 🤘

)
{
if (targetTable == null) {
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe when we change the default, we can also consider making it so when people explicitly set arrayIngestMode: mvd, they need to use ARRAY_TO_MV. (I think that's what you're suggesting here.)

yea, I'm mostly thinking about it in terms of forcing people to write the correct queries as soon as possible in either mode to make things less painful down the road, and was thinking it would be nice to prevent bad habits for new tables before they have a chance to get started since those habits are going to have to change.

Its ok to wait until we change the default i guess, which this change should allow us to do much sooner (druid 30 hopefully)

{
if (queryType == null) {
// if schema information is not available, create a string dimension
return ColumnType.STRING;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if this should return null to make createDimensionSchema decided to use the string schema (in case down the line we ever want to switch that to the auto schema)

Copy link
Contributor Author

@gianm gianm Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I held this condition over from the previous code, but I'm not sure that the type can actually ever be null. This method is used in a couple places:

  • by MSQ ControllerImpl, which checks for null before passing it; if it's null a "No type for column" exception is thrown.
  • by the new validation code I added; but here null only arises if Calcites.getColumnTypeForRelDataType(newSqlType) is null, which seems unlikely for a SQL type that appears in an already-planned INSERT or REPLACE statement. The fact that it's already been planned means we should have Druid types corresponding to all the SQL types.

So, maybe we could remove the @Nullable and make this an error.

Should I try that now, or want to leave it for later?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I should have looked at the callers, that makes sense, leaving for now and making an error later seems fine to me

Copy link
Contributor

@vogievetsky vogievetsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just built and plaid with it. Looks great.

@gianm gianm merged commit 256160a into apache:master Mar 13, 2024
83 checks passed
sreemanamala pushed a commit to sreemanamala/druid that referenced this pull request Mar 14, 2024
…5920)

* MSQ: Validate that strings and string arrays are not mixed.

When multi-value strings and string arrays coexist in the same column,
it causes problems with "classic MVD" style queries such as:

  select * from wikipedia -- fails at runtime
  select count(*) from wikipedia where flags = 'B' -- fails at planning time
  select flags, count(*) from wikipedia group by 1 -- fails at runtime

To avoid these problems, this patch adds type verification for INSERT
and REPLACE. It is targeted: the only type changes that are blocked are
string-to-array and array-to-string. There is also a way to exclude
certain columns from the type checks, if the user really knows what
they're doing.

* Fixes.

* Tests and docs and error messages.

* More docs.

* Adjustments.

* Adjust message.

* Fix tests.

* Fix test in DV mode.
cryptoe added a commit that referenced this pull request Mar 19, 2024
#15920) (#16160)

* Cherry-picking 15920-to-29.0.1

* Fixing extra test case which got added as part of merge

---------

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants