docs: Migration guide for MVDs to arrays #16516
nice start, definitely not an easy task for a complicated subject. as mentioned in a few of my comments, i think we should lean more into deferring to the existing arrays docs and https://github.com/apache/druid/blob/master/docs/querying/multi-value-dimensions.md, because i think a lot of things are too complicated to summarize in a concise way.
docs/release-info/migr-mvd-array.md
Outdated
|---|---|---|
| Equality filter | Matches the entire array value | Matches if any value within the array matches the filter |
| Null filter | Matches rows where the entire array value is null | Matches rows where the array is empty (considered as null) but does not match arrays with empty (`“”`) values |
| Range filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Not directly supported |
arrays support range filters using comparison of the sort order of the whole array, e.g. `[1, 2, 3] < [1, 2, 4]` if i write `... WHERE arrayColumn < ARRAY[1, 2, 3]` or whatever. Array overlap is a different filter that checks if an array contains any of the elements of some other array.
Multi-value dimensions also support range filters if matching as individual string values.
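As a rough illustration of the distinction drawn in this comment, the queries below contrast a whole-array range comparison, an overlap check, and a range filter on an MVD. This is only a sketch: the tables `array_example` and `mvd_example` and the columns `arrayLong` and `tags` are borrowed from examples elsewhere in this thread and are assumed to exist with those types, and the literal values are arbitrary. Run each query separately.

```sql
-- Array column: the comparison uses the sort order of the whole array,
-- so a row holding [1, 2, 3] satisfies this filter because [1, 2, 3] < [1, 2, 4].
SELECT *
FROM "array_example"
WHERE "arrayLong" < ARRAY[1, 2, 4]

-- Array column: ARRAY_OVERLAP is a different check. It matches rows whose
-- array shares at least one element with the given array.
SELECT *
FROM "array_example"
WHERE ARRAY_OVERLAP("arrayLong", ARRAY[1, 4])

-- MVD column: per the comment above, a range filter is evaluated against
-- the individual string values of each row.
SELECT *
FROM "mvd_example"
WHERE "tags" >= 't3'
```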
docs/release-info/migr-mvd-array.md
Outdated
| Filtering and grouping | <ul><li>Filters and groupings match the entire array value</li><li>Can be used as GROUP BY keys, grouping based on the entire array value</li></ul> | <ul><li>Filters match any value within the array</li><li>Grouping generates a group for each individual value, similar to an implicit UNNEST</li></ul> |
| Conversion | Convert an MVD to an array using [MV_TO_ARRAY](../querying/sql-multivalue-string-functions.md) | Convert an array to an MVD using [ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv) |

### Query differences between arrays and MVDs
i worry this section is a bit more confusing than it is helpful, there are a lot of examples of behavior and differences between arrays and mvds in https://github.com/apache/druid/blob/master/docs/querying/arrays.md and https://github.com/apache/druid/blob/master/docs/querying/multi-value-dimensions.md
currently this table isn't super clear if it is talking about native or SQL filters, and in some cases maybe is talking about both? (like array_overlap isn't a native filter, but there is a native `arrayContainsElement` which, when combined with a native `or` filter, can construct ARRAY_OVERLAP).
I fear this content is too deep to summarize into this table and we might be better off delegating into the in depth docs which have examples
We'll pull out this table and instead highlight three particular example differences side-by-side
- ARRAY_CONTAINS
- ARRAY_OVERLAP
- UNNEST
And will point users to the relevant docs for more information on query differences.
This section will also include a description of the biggest difference (comment)
docs/release-info/migr-mvd-array.md
Outdated
| Query type | Array | MVD |
|---|---|---|
| Equality filter | Matches the entire array value | Matches if any value within the array matches the filter |
I think it isn't clear enough that the biggest difference between arrays and mvds is how you interact with them in SQL. like array types you treat as SQL arrays, while mvds you treat as SQL VARCHAR. Like if i have two columns with the same data, some array `['a', 'b', 'c']`, then for the array type the only thing that matches that row is `WHERE array = ARRAY['a', 'b', 'c']`, while any of `WHERE mvd = 'a'`, `WHERE mvd = 'b'`, `WHERE mvd = 'c'` would match the row.
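The contrast described in this comment can be written out as full queries. This is a minimal sketch that assumes two tables, `array_example` and `mvd_example` (names taken from examples later in this thread), each with a `tags` column holding the same values `['a', 'b', 'c']`, stored as an array in one table and as an MVD in the other. Run each query separately.

```sql
-- Array column: the value is a SQL ARRAY, so only whole-array equality
-- matches the row.
SELECT *
FROM "array_example"
WHERE "tags" = ARRAY['a', 'b', 'c']

-- MVD column: the value is treated as SQL VARCHAR, so an equality filter
-- on any one of the individual strings matches the same row.
SELECT *
FROM "mvd_example"
WHERE "tags" = 'a'
```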
docs/release-info/migr-mvd-array.md
Outdated
| Equality filter | Matches the entire array value | Matches if any value within the array matches the filter |
| Null filter | Matches rows where the entire array value is null | Matches rows where the array is empty (considered as null) but does not match arrays with empty (`“”`) values |
| Range filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Not directly supported |
| Contains filter | Use [ARRAY_CONTAINS](../querying/sql-functions.md#array_contains) | Use WHERE filter |
is "Contains filter" referring to the native `arrayContainsElement` filter? Also "Use WHERE filter" isn't correct, perhaps you meant "Use equality filter" instead?
docs/release-info/migr-mvd-array.md
Outdated
| Null filter | Matches rows where the entire array value is null | Matches rows where the array is empty (considered as null) but does not match arrays with empty (`“”`) values |
| Range filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Not directly supported |
| Contains filter | Use [ARRAY_CONTAINS](../querying/sql-functions.md#array_contains) | Use WHERE filter |
| Logical expression filters | Behaves like standard ANSI SQL on the entire array value, such as AND, OR, NOT. For example, `WHERE arrayLong = ARRAY[1,2,3] OR arrayLong = ARRAY[4,5,6]` | Matches a row if any value within the array matches the logical condition. For example, `WHERE tags = 't1' OR tags = 't3'` |
this seems confusing and should be left out i think, since the logical filters operate on the results of the child filters rather than directly on arrays or mvds
docs/release-info/migr-mvd-array.md
Outdated
| Range filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Not directly supported |
| Contains filter | Use [ARRAY_CONTAINS](../querying/sql-functions.md#array_contains) | Use WHERE filter |
| Logical expression filters | Behaves like standard ANSI SQL on the entire array value, such as AND, OR, NOT. For example, `WHERE arrayLong = ARRAY[1,2,3] OR arrayLong = ARRAY[4,5,6]` | Matches a row if any value within the array matches the logical condition. For example, `WHERE tags = 't1' OR tags = 't3'` |
| Column comparison filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Matches when the dimensions have any overlapping values. For example, `WHERE tags IN ('t1', 't2')` |
the native column comparison filter itself is very strange (and not used by sql at all), so i think we should drop it from here. It converts everything into a string array for comparison regardless of type, so it works more or less the same for both mvds and arrays, ish, but again is very weird and i feel like only makes things more confusing
docs/release-info/migr-mvd-array.md
Outdated
| Logical expression filters | Behaves like standard ANSI SQL on the entire array value, such as AND, OR, NOT. For example, `WHERE arrayLong = ARRAY[1,2,3] OR arrayLong = ARRAY[4,5,6]` | Matches a row if any value within the array matches the logical condition. For example, `WHERE tags = 't1' OR tags = 't3'` |
| Column comparison filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Matches when the dimensions have any overlapping values. For example, `WHERE tags IN ('t1', 't2')` |
| Behavior with SQL constructs | Follows standard SQL behavior with array functions like [ARRAY_CONTAINS](../querying/sql-functions.md#array_contains), [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Requires special SQL functions like [MV_FILTER_ONLY](../querying/sql-functions.md#mv_filter_only), [MV_FILTER_NONE](../querying/sql-functions.md#mv_filter_none) for precise filtering |
| Group by entire array | Groups the entire array as a single value | Not supported |
you can use `mv_to_array` to group mvds as arrays
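A short sketch of the point made in this comment, assuming an `mvd_example` table with an MVD column named `tags` (names borrowed from examples elsewhere in this thread). The first query shows the default MVD grouping; the second wraps the column in MV_TO_ARRAY so that each row's full set of values forms a single group key. Run each query separately.

```sql
-- Default MVD behavior: grouping produces one group per individual string
-- value (an implicit unnest).
SELECT "tags", COUNT(*) AS "cnt"
FROM "mvd_example"
GROUP BY 1

-- Converting to an array first groups on the entire value instead,
-- the way an array column would.
SELECT MV_TO_ARRAY("tags") AS "tags_array", COUNT(*) AS "cnt"
FROM "mvd_example"
GROUP BY 1
```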
| Group by individual values | Use [UNNEST](../querying/sql.md#unnest) to group by individual array elements | Automatically unnests groups by each individual value in the array |

## How to ingest data as arrays
again i think maybe should just refer to the other external docs which include examples of how to ingest both native and sql for both arrays and mvds
linking out to external docs for examples and more details
docs/release-info/migr-mvd-array.md
Outdated
#### Array

```sql
SELECT *
FROM "array_example"
WHERE ARRAY_OVERLAP(tags, ARRAY['t1', 't7'])
```

#### MVD

```sql
SELECT *
FROM "mvd_example"
WHERE MV_OVERLAP(tags, ARRAY['t1', 't7'])
```
@2bethere you mentioned adding an example for ARRAY_OVERLAP but these examples aren't so different between MVDs and arrays. Did you have another use case in mind?
The example I was thinking of is:
If you do WHERE mvd IN ('t1', 't2') today, how do you do that with ARRAY_CONTAINS/ARRAY_OVERLAP?
Another case is if you do WHERE mvd = 't1' AND mvd = 't2' today, how do you do that with ARRAY_CONTAINS/ARRAY_OVERLAP?
So not MV_OVERLAP -> ARRAY_OVERLAP, but converting WHERE conditions into ARRAY functions. Because I don't think WHERE array = 't1' AND array = 't2' works.
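One possible way to express the two conversions asked about here is sketched below. It assumes `mvd_example` has an MVD column `tags` and `array_example` has a string array column `tags`; both names come from the examples in this thread. The mapping relies on the documented behavior of ARRAY_OVERLAP (true when the arrays share any element) and ARRAY_CONTAINS with an array as the second argument (true when every listed element is present). Run each query separately.

```sql
-- MVD today: the row matches if any of its values is 't1' or 't2'.
SELECT *
FROM "mvd_example"
WHERE "tags" IN ('t1', 't2')

-- Array rewrite of the "any of" condition.
SELECT *
FROM "array_example"
WHERE ARRAY_OVERLAP("tags", ARRAY['t1', 't2'])

-- Array rewrite of the "has both 't1' and 't2'" condition.
SELECT *
FROM "array_example"
WHERE ARRAY_CONTAINS("tags", ARRAY['t1', 't2'])
```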
might be useful to include array equality (e.g. `WHERE tags = ARRAY['t1', 't2', 't3']` is i think equivalent to `WHERE MV_TO_ARRAY(tags) = ARRAY['t1', 't2', 't3']`) and array grouping (`SELECT label, tags FROM "arrayExample" GROUP BY 1,2` and `SELECT label, MV_TO_ARRAY(tags) FROM "mvd_example" GROUP BY 1, 2`)
@clintropolis if MVD equality checks individual strings, would this query always return zero results?
SELECT *
FROM "mvd_example"
WHERE tags = 't1' AND tags = 't2'
since a single string can't be two different values
@2bethere I updated the existing example to use WHERE tags = 't1' OR tags = 't2'
> @clintropolis if MVD equality checks individual strings, would this query always return zero results?

This one is tricky. In SQL yes, because SQL considers this a contradiction so "simplifies" it to always false so nothing matches. However, in native json queries you can write that filter, which provides a way to check that the mvd row contains all of the elements, similar to using `MV_CONTAINS` i suppose.
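For the SQL side of this answer, a minimal sketch of the "contains all of the elements" check using MV_CONTAINS, assuming the `mvd_example` table and `tags` MVD column used in the question above:

```sql
-- Matches rows whose MVD value includes both 't1' and 't2'.
-- Unlike WHERE tags = 't1' AND tags = 't2', this is not planned
-- as a contradiction.
SELECT *
FROM "mvd_example"
WHERE MV_CONTAINS("tags", ARRAY['t1', 't2'])
```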
## Query differences between arrays and MVDs

In SQL queries, Druid operates on arrays differently than MVDs.
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR). Even though multiple string values within the same MVD are still stored as a single field in the MVD column.
docs/release-info/migr-mvd-array.md
Outdated
| | Arrays | Multi-value dimensions (MVDs) |
|---|---|---|
| Data types | Supports VARCHAR, BIGINT, and DOUBLE types (ARRAY<STRING\>, ARRAY<LONG\>, ARRAY<DOUBLE\>) | Only supports arrays of strings (VARCHAR) |
| SQL compliance | Behaves like standard SQL arrays with SQL-compliant behavior | Does not behave like standard SQL arrays; requires special SQL functions |
"Does not behave like standard SQL arrays; requires special SQL functions"
Heh, this is a tough one to make a blurb about. Most of the special functions for mvds though are to allow for doing array like things to mvds, which isn't super clear. If not using special functions, mvds "best effort" behave like regular SQL `VARCHAR` (which is apparent from the other parts of the doc but also feels like maybe it should be here somehow). If it should be here, I also am not sure how important it is to also mention the parts that fall short of the "best effort" and have some unexpected/unintuitive oddness when doing that, such as implicit unnest when grouping and the filtering behavior of matching a row if any element matches the filter, which itself makes for potentially confusing results when grouping, as the other values in the row show up, which is why functions like `mv_filter_only` exist.
I tried to make this a little more clear and linked to the examples (also included a new example that shows the use case for `mv_filter_only`)
docs/release-info/migr-mvd-array.md
Outdated
* For MVD columns, Druid returns the row when an equality filter matches any value of the MVD.
For example, any of the following filters returns the row for the query:
`WHERE "mvd_column" = 'a'`
`WHERE "mvd_column" = 'b'`
`WHERE "mvd_column" = 'c'`
i wonder if it's worth mentioning that when grouping this means that you'll see rows that don't obviously match the filter you wrote because of the implicit unnest and the filtering occurring prior to the implicit unnest:
`... WHERE "mvd_column" = 'a'` would return 3 rows
a
b
c
I guess the linked examples show this behavior in more depth, but the biggest utility of this guide seems to be calling out the behavioral differences between the two, so maybe worth having here too
good point; I'll call this out and add another example
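The behavior raised in this thread could be illustrated roughly as follows. The sketch assumes an `mvd_example` table whose `mvd_column` holds a row with the values `['a', 'b', 'c']`; the table name and the `cnt` alias are illustrative. The second query uses MV_FILTER_ONLY, the function mentioned earlier in this review, to drop the values that were not asked for. Run each query separately.

```sql
-- The filter is applied to the row before the implicit unnest, so the
-- matching row still contributes groups for a, b, and c.
SELECT "mvd_column", COUNT(*) AS "cnt"
FROM "mvd_example"
WHERE "mvd_column" = 'a'
GROUP BY 1

-- MV_FILTER_ONLY keeps only the listed values, so grouping
-- produces a group for 'a' alone.
SELECT MV_FILTER_ONLY("mvd_column", ARRAY['a']) AS "mvd_column", COUNT(*) AS "cnt"
FROM "mvd_example"
WHERE "mvd_column" = 'a'
GROUP BY 1
```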
docs/release-info/migr-mvd-array.md
Outdated
Within `dimensionsSpec`, set `"useSchemaDiscovery": true`, and use `dimensions` to list the array inputs with type `auto`.
For an example, see [Ingesting arrays: Native batch and streaming ingestion](../querying/arrays.md#native-batch-and-streaming-ingestion).

* For SQL-based batch ingestion, include the [query context parameter](../multi-stage-query/reference.md#context-parameters) `"arrayIngestMode": "array"` and reference the relevant array type (`VARCHAR ARRAY`, `BIGINT ARRAY`, or `DOUBLE ARRAY`) in the column descriptors.
"in the column descriptors" i'm not sure its clear what this means and I don't see anything else referring to it in the docs. Do you mean in the extern/extend type declaration thingy?
Ideally everyone should always be using that for any array data type from the external file, and then using `ARRAY_TO_MV` if storing as an mvd instead of array. I believe the web-console data loader already does this these days, regardless of arrayIngestMode, since the data is in fact arrays in the source files.
"column descriptors" 😅 -- I got that phrase from
druid/docs/multi-stage-query/reference.md
Line 71 in e9f7233
3. A row signature, as a JSON-encoded array of column descriptors. Each column descriptor must have a |
updated the language and added the recommendation
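For readers following this thread, the recommendation discussed above (declare the array type in the EXTEND clause, then apply ARRAY_TO_MV only if the column should be stored as an MVD) might look like the sketch below. The datasource name, the column names, and the inline sample row are all made up for illustration, and the query assumes `"arrayIngestMode": "array"` is set in the query context rather than in the SQL text.

```sql
REPLACE INTO "array_example" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\"timestamp\":\"2023-01-01T00:00:00\",\"label\":\"row1\",\"tags\":[\"t1\",\"t2\",\"t3\"]}"}',
      '{"type":"json"}'
    )
  ) EXTEND (
    "timestamp" VARCHAR,
    "label" VARCHAR,
    "tags" VARCHAR ARRAY  -- the source field is declared as an array
  )
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "label",
  "tags"  -- stored as an array; use ARRAY_TO_MV("tags") AS "tags" to store an MVD instead
FROM "ext"
PARTITIONED BY DAY
```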
docs/release-info/migr-mvd-array.md
Outdated
| Data types | Supports VARCHAR, BIGINT, and DOUBLE types (ARRAY<STRING\>, ARRAY<LONG\>, ARRAY<DOUBLE\>) | Only supports arrays of strings (VARCHAR) |
| SQL compliance | Behaves like standard SQL arrays with SQL-compliant behavior | Behaves like SQL VARCHAR rather than standard SQL arrays and requires special SQL functions to achieve array-like behavior. See the [examples](#examples). |
| Ingestion | <ul><li>JSON arrays are ingested as Druid arrays</li><li>Managed through the query context parameter `arrayIngestMode` in SQL-based ingestion (supported options: `array`, `mvd`, `none`). Note that if you set this mode to `none`, Druid raises an exception if you try to store any type of array.</li></ul> | <ul><li>JSON arrays are ingested as multi-value dimensions</li><li>Managed using functions like [ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv) in SQL-based ingestion</li></ul> |
| Filtering and grouping | <ul><li>Filters and groupings match the entire array value</li><li>Can be used as GROUP BY keys, grouping based on the entire array value</li></ul> | <ul><li>Filters match any value within the array</li><li>Grouping generates a group for each individual value, similar to an implicit UNNEST</li></ul> |
should the array section mention explicitly using unnest to group on array elements given that the mvd section talks about implicit unnest?
good idea, adding
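A sketch of the explicit UNNEST suggested here, assuming an `array_example` table with a string array column `tags`; the aliases `u`, `tag`, and `cnt` are illustrative. It produces one group per array element, which is what an MVD column does implicitly when used as a GROUP BY key.

```sql
-- Expand each array into one row per element, then group on the element.
SELECT u."tag", COUNT(*) AS "cnt"
FROM "array_example"
CROSS JOIN UNNEST("tags") AS u("tag")
GROUP BY 1
```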
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
Thank you so much for making those changes. This looks really good!
## Querying arrays and MVDs

In SQL queries, Druid operates on arrays differently than MVDs.
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
Can a single value in an MVD column consist of multiple individual strings?
Yes, kind of like an array but not treated as a single entity
Left some questions/suggestions.
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
I left a few suggestions but the changes look good.
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Clint Wylie <cjwylie@gmail.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Benedict Jin <asdf2014@apache.org> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> Co-authored-by: Clint Wylie <cjwylie@gmail.com> Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> Co-authored-by: Benedict Jin <asdf2014@apache.org>
This PR adds a migration guide for Druid 30 to help users understand the differences and migrate from multi-value dimensions to arrays. Branches off of #16491.