MSQ arrayIngestMode to control if arrays are ingested as ARRAY, MVD, or an exception #15093
This change is important but we need to do it less disruptively. People are using For this reason the change needs to be opt-in. I suggest documenting the context parameter and swapping the default so the old behavior is retained. In addition I'd suggest writing up a doc page about how people can migrate from MVDs to string arrays (with examples of how to rewrite queries) and pointing people at that in the docs for this parameter. And showing people how to use |
Another thought: is it possible to add explicit dimension schemas for the various types that can be generated by "auto"? In MSQ we know the exact type we want, so it seems odd and circuitous to use "auto".
I've started doing some work on this, but it's sort of non-trivial and shouldn't be part of this PR, so using 'auto' is currently the only way to ingest array columns.
clintropolis
left a comment
lgtm 👍
Are there any tests for trying to insert numeric arrays in mvd mode or any type of arrays in none mode?
public static final boolean DEFAULT_USE_AUTO_SCHEMAS = false;
public static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.NONE;
Default should be MVD.
cryptoe
left a comment
Please add the document changes for this as well.
public static final boolean DEFAULT_USE_AUTO_SCHEMAS = false;
public static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.NONE;
The default mode should be MVD IMHO since it will not break stuff.
I personally disagree, I think we should default to none, then people explicitly choose to use MVDs or arrays
mostly because the behavior of MVD is totally incorrect
I am working on updating the docs/examples to refer to the parameter, and on separating them into ones that explicitly store MVDs using the ARRAY_TO_MV function and ones that only use ARRAY_ functions, which become examples of storing array typed columns instead.
MVD mode is not going to have examples, because it should be the first to go and no one should rely on this behavior; again, the behavior is incorrect.
IMHO we cannot break users' ingestion SQL queries, so I propose the default mode should be MVD.
The patch should also emit a warning from the controller and show it on the console if a row signature with an array is detected. The warning should clearly lay out the migration path.
There are a lot of layers built out in organizations. A change, even if it's just adding a context flag, sometimes takes weeks to reach production. With this warning we effectively nudge the user that we are going to break compatibility soon.
"String arrays can not be ingested when '%s' is set to '%s'. Either set '%s' in query context "
+ "to 'array' to ingest the string array as an array, or set it to 'mvd' to ingest the string array "
+ "as MVD (which is legacy behaviour and not recommmended)",
MultiStageQueryContext.CTX_ARRAY_INGEST_MODE,
StringUtils.toLowerCase(arrayIngestMode.name()),
MultiStageQueryContext.CTX_ARRAY_INGEST_MODE
I don't think we should recommend MVD mode at all here, instead we should always recommend array mode and suggest using ARRAY_TO_MV if people want to store things as a MVD.
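One way the suggestion above might look is sketched below. This is only an illustration: the class name ArrayIngestErrorSketch, the method stringArrayError, and the exact message wording are hypothetical and are not taken from the PR. Only the "arrayIngestMode" key, the 'array' value, and the ARRAY_TO_MV function come from this thread.

```java
import java.util.Locale;

// Hypothetical sketch, not the code in this PR: an error message that only
// recommends 'array' mode and points to ARRAY_TO_MV for users who want MVDs.
final class ArrayIngestErrorSketch
{
  // Context key quoted in the diff under review.
  static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";

  // Builds the message for a string array that was rejected under currentMode.
  static String stringArrayError(String currentMode)
  {
    return String.format(
        Locale.ROOT,
        "String arrays can not be ingested when '%s' is set to '%s'. Set '%s' to 'array' in the query context "
        + "to ingest the string array as an array, or wrap the expression in ARRAY_TO_MV to store it as an MVD.",
        CTX_ARRAY_INGEST_MODE,
        currentMode.toLowerCase(Locale.ROOT),
        CTX_ARRAY_INGEST_MODE
    );
  }

  public static void main(String[] args)
  {
    System.out.println(stringArrayError("NONE"));
  }
}
```

The point of the sketch is that the message never mentions setting the mode to 'mvd', so the legacy mode is not advertised to new users.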
Yes, there are tests verifying that arrays cannot be inserted in none mode, and numeric arrays cannot be inserted in MVD mode.
Thanks, @clintropolis for updating & aligning the description with what we actually merged, and not the stale changes.
MSQ uses the string dimension schema for ARRAY<STRING> typed columns, which creates MVDs instead of string arrays as required. Therefore someone trying to ingest columns of type ARRAY<STRING> from an external data source or another data source would get STRING columns in the newly generated segments. This patch changes the following: - Use auto dimension schema to ingest the ARRAY<STRING> columns, which will create columns with the desired type. - Add an undocumented flag ingestStringArraysAsMVDs to preserve the legacy behavior. Legacy behaviour is turned on by default. - Create MSQArraysInsertTest and refactor some of the tests in MSQInsertTest.
Description
MSQ uses the string dimension schema for ARRAY<STRING> typed columns, which creates MVDs instead of string arrays, which would be correct. Therefore someone trying to ingest columns of type ARRAY<STRING> from an external data source or another data source would get STRING columns in the newly generated segments.
This patch adds an arrayIngestMode query context parameter with the following behavior:
- array: uses auto dimension schema to correctly ingest the ARRAY<STRING> columns as ARRAY<STRING> columns, and also allows numeric ARRAY types to be ingested.
- mvd: currently the default, to ease the transition. ARRAY<STRING> will continue to ingest as STRING typed MVDs, but logs will warn operators that this is a deprecated mode. Numeric arrays cannot be ingested in this mode.
- none: neither string nor numeric ARRAY types can be ingested. This mode will be used as a forcing mechanism to make people choose explicitly whether they want MVDs or ARRAY typed columns, and in either case suggests setting the mode to array and explicitly using ARRAY_TO_MV if users still want MVDs.
Release note
MSQ supports a new query context parameter, arrayIngestMode, which specifies the behavior of MSQ while ingesting arrays. Please refer to the docs in the release for the detailed behavior of each option. The default 'mvd' is the existing behaviour of MSQ and therefore doesn't require immediate intervention; however, it is subject to removal in future releases.
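The three arrayIngestMode values from the description can be summarized as a small decision table. The following is a minimal, self-contained sketch of that table, assuming the merged default of mvd. The class ArrayIngestModeSketch, the ColumnKind and Outcome enums, and the helper methods are hypothetical names made up for illustration; they are not the classes changed in this PR, and real ingestion goes through Druid's dimension schemas rather than an enum like Outcome.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the behavior described above; not the PR's classes.
final class ArrayIngestModeSketch
{
  enum ArrayIngestMode { NONE, MVD, ARRAY }

  // Illustration-only classification of the column being ingested.
  enum ColumnKind { STRING_ARRAY, NUMERIC_ARRAY }

  // Illustration-only result of applying a mode to a column.
  enum Outcome { ARRAY_COLUMN, STRING_MVD_WITH_WARNING, REJECTED }

  static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
  // The description states that mvd is currently the default.
  static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.MVD;

  // Reads the mode from a query-context-like map, falling back to the default.
  // An unrecognized value makes valueOf throw IllegalArgumentException.
  static ArrayIngestMode modeFromContext(Map<String, Object> context)
  {
    Object value = context.get(CTX_ARRAY_INGEST_MODE);
    if (value == null) {
      return DEFAULT_ARRAY_INGEST_MODE;
    }
    return ArrayIngestMode.valueOf(value.toString().toUpperCase(Locale.ROOT));
  }

  // Applies the rules from the description to one column.
  static Outcome decide(ArrayIngestMode mode, ColumnKind kind)
  {
    switch (mode) {
      case ARRAY:
        // Both string and numeric arrays are stored as ARRAY typed columns.
        return Outcome.ARRAY_COLUMN;
      case MVD:
        // String arrays keep the legacy MVD behavior with a deprecation warning;
        // numeric arrays cannot be ingested in this mode.
        return kind == ColumnKind.STRING_ARRAY ? Outcome.STRING_MVD_WITH_WARNING : Outcome.REJECTED;
      default:
        // NONE: no array type can be ingested.
        return Outcome.REJECTED;
    }
  }

  public static void main(String[] args)
  {
    ArrayIngestMode mode = modeFromContext(Collections.<String, Object>singletonMap(CTX_ARRAY_INGEST_MODE, "array"));
    for (ColumnKind kind : ColumnKind.values()) {
      System.out.println(mode + " / " + kind + " -> " + decide(mode, kind));
    }
  }
}
```

The same table covers the cases asked about in review: any array in none mode, and numeric arrays in mvd mode, both map to a rejection.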