docs: Migration guide for MVDs to arrays #16516

Merged Jun 13, 2024 (18 commits)

@vtlim (Member) commented May 29, 2024

This PR adds a migration guide for Druid 30 to help users understand the differences and migrate from multi-value dimensions to arrays. Branches off of #16491.

@clintropolis (Member) left a comment

nice start, definitely not an easy task for a complicated subject. as mentioned in a few of my comments, i think we should lean more into deferring to the existing in-depth docs such as https://github.com/apache/druid/blob/master/docs/querying/multi-value-dimensions.md because i think a lot of things are too complicated to summarize in a concise way.

docs/release-info/migr-mvd-array.md (outdated)
|---|---|---|
| Equality filter | Matches the entire array value | Matches if any value within the array matches the filter |
| Null filter | Matches rows where the entire array value is null | Matches rows where the array is empty (considered as null) but does not match arrays with empty (`""`) values |
| Range filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Not directly supported |
Member

arrays support range filters using comparison of the sort order of the whole array, e.g. [1, 2, 3] < [1, 2, 4] if i write ... WHERE arrayColumn < ARRAY[1, 2, 3] or whatever. Array overlap is a different filter that checks if an array contains any of the elements of some other array.

Multi-value dimensions also support range filters if matching as individual string values.
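
To make the distinction concrete, here is a minimal sketch of the three filters mentioned in this comment. The datasource names follow the examples elsewhere in this PR; the `arrayLong` column (assumed to be a BIGINT ARRAY) and the literal values are assumptions for illustration only. Run each statement separately.

```sql
-- Range filter on an array column: compares whole arrays by sort order.
SELECT *
FROM "array_example"
WHERE "arrayLong" < ARRAY[1, 2, 4]

-- ARRAY_OVERLAP is a different filter: it matches rows whose array
-- shares at least one element with the given array.
SELECT *
FROM "array_example"
WHERE ARRAY_OVERLAP("arrayLong", ARRAY[1, 2, 4])

-- Range filter on an MVD column: evaluated against individual string values,
-- so a row matches if any of its values falls in the range.
SELECT *
FROM "mvd_example"
WHERE "tags" > 't3'
```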

| Filtering and grouping | <ul><li>Filters and groupings match the entire array value</li><li>Can be used as GROUP BY keys, grouping based on the entire array value</li></ul> | <ul><li>Filters match any value within the array</li><li>Grouping generates a group for each individual value, similar to an implicit UNNEST</li></ul> |
| Conversion | Convert an MVD to an array using [MV_TO_ARRAY](../querying/sql-multivalue-string-functions.md) | Convert an array to an MVD using [ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv) |

### Query differences between arrays and MVDs
Member

i worry this section is a bit more confusing than it is helpful, there are a lot of examples of behavior and differences between arrays and mvds in https://github.com/apache/druid/blob/master/docs/querying/arrays.md and https://github.com/apache/druid/blob/master/docs/querying/multi-value-dimensions.md

currently this table isn't super clear if it is talking about native or SQL filters, and in some cases maybe is talking about both? (like array_overlap isn't a native filter, but there is a native arrayContainsElement which when combined with a native or filter can construct ARRAY_OVERLAP).

I fear this content is too deep to summarize into this table and we might be better off delegating into the in depth docs which have examples

Member Author

We'll pull out this table and instead highlight three particular example differences side-by-side

  • ARRAY_CONTAINS
  • ARRAY_OVERLAP
  • UNNEST

And will point users to the relevant docs for more information on query differences.

This section will also include a description of the biggest difference (comment)


| Query type | Array | MVD |
|---|---|---|
| Equality filter | Matches the entire array value | Matches if any value within the array matches the filter |
Member

I think it isn't clear enough that the biggest difference between arrays and mvds is how you interact with them in SQL. like array types you treat as SQL arrays, while mvds you treat as SQL VARCHAR. Like if i have two columns with the same data, some array ['a', 'b', 'c'], then for the array type the only thing that matches that row is WHERE array = ARRAY['a', 'b', 'c'], while any of WHERE mvd = 'a', WHERE mvd = 'b', WHERE mvd = 'c' would match the row.
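
Side by side, the point above could be sketched as follows. The column names `array_column` and `mvd_column` are hypothetical, and both are assumed to hold the same values `['a', 'b', 'c']` in a given row. Run each statement separately.

```sql
-- Array column: treated as a SQL ARRAY, so the filter must equal the whole array.
SELECT *
FROM "array_example"
WHERE "array_column" = ARRAY['a', 'b', 'c']

-- MVD column: treated as SQL VARCHAR, so the row matches if any value equals 'a'.
-- Filtering on 'b' or 'c' instead would match the same row.
SELECT *
FROM "mvd_example"
WHERE "mvd_column" = 'a'
```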

| Equality filter | Matches the entire array value | Matches if any value within the array matches the filter |
| Null filter | Matches rows where the entire array value is null | Matches rows where the array is empty (considered as null) but does not match arrays with empty (`""`) values |
| Range filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Not directly supported |
| Contains filter | Use [ARRAY_CONTAINS](../querying/sql-functions.md#array_contains)| Use WHERE filter |
Member

is "Contains filter" referring to the native arrayContainsElement filter? Also "Use WHERE filter" isn't correct, perhaps you meant "Use equality filter" instead?

| Null filter | Matches rows where the entire array value is null | Matches rows where the array is empty (considered as null) but does not match arrays with empty (`""`) values |
| Range filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Not directly supported |
| Contains filter | Use [ARRAY_CONTAINS](../querying/sql-functions.md#array_contains)| Use WHERE filter |
| Logical expression filters | Behaves like standard ANSI SQL on the entire array value, such as AND, OR, NOT. For example, `WHERE arrayLong = ARRAY[1,2,3] OR arrayLong = ARRAY[4,5,6]` | Matches a row if any value within the array matches the logical condition. For example, `WHERE tags = 't1' OR tags = 't3'` |
Member

this seems confusing and should be left out i think, since the logical filters operate on the results of the child filters rather than directly on arrays or mvds

| Range filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Not directly supported |
| Contains filter | Use [ARRAY_CONTAINS](../querying/sql-functions.md#array_contains)| Use WHERE filter |
| Logical expression filters | Behaves like standard ANSI SQL on the entire array value, such as AND, OR, NOT. For example, `WHERE arrayLong = ARRAY[1,2,3] OR arrayLong = ARRAY[4,5,6]` | Matches a row if any value within the array matches the logical condition. For example, `WHERE tags = 't1' OR tags = 't3'` |
| Column comparison filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Matches when the dimensions have any overlapping values. For example, `WHERE tags IN ('t1', 't2')` |
Member

the native column comparison filter itself is very strange (and not used by sql at all), so i think we should drop it from here. It converts everything into a string array for comparison regardless of type, so it works more or less the same for both mvds and arrays, ish, but again is very weird and i feel like only makes things more confusing

| Logical expression filters | Behaves like standard ANSI SQL on the entire array value, such as AND, OR, NOT. For example, `WHERE arrayLong = ARRAY[1,2,3] OR arrayLong = ARRAY[4,5,6]` | Matches a row if any value within the array matches the logical condition. For example, `WHERE tags = 't1' OR tags = 't3'` |
| Column comparison filter | Use [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Matches when the dimensions have any overlapping values. For example, `WHERE tags IN ('t1', 't2')` |
| Behavior with SQL constructs | Follows standard SQL behavior with array functions like [ARRAY_CONTAINS](../querying/sql-functions.md#array_contains), [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) | Requires special SQL functions like [MV_FILTER_ONLY](../querying/sql-functions.md#mv_filter_only), [MV_FILTER_NONE](../querying/sql-functions.md#mv_filter_none) for precise filtering |
| Group by entire array | Groups the entire array as a single value | Not supported |
Member

you can use mv_to_array to group mvds as arrays

| Group by individual values | Use [UNNEST](../querying/sql.md#unnest) to group by individual array elements | Automatically unnests groups by each individual value in the array |

## How to ingest data as arrays

Member

again i think maybe should just refer to the other external docs which include examples of how to ingest both native and sql for both arrays and mvds

Member Author

linking out to external docs for examples and more details

docs/release-info/migr-mvd-array.md (outdated)
Comment on lines 89 to 103
#### Array

```sql
SELECT *
FROM "array_example"
WHERE ARRAY_OVERLAP(tags, ARRAY['t1', 't7'])
```

#### MVD

```sql
SELECT *
FROM "mvd_example"
WHERE MV_OVERLAP(tags, ARRAY['t1', 't7'])
```
Member Author

@2bethere you mentioned adding an example for ARRAY_OVERLAP but these examples aren't so different between MVDs and arrays. Did you have another use case in mind?

Contributor

The example I was thinking of is:
If you do WHERE mvd IN ('t1', 't2') today, how do you do that with ARRAY_CONTAINS/ARRAY_OVERLAP?
Another case: if you do WHERE mvd = 't1' AND mvd = 't2' today, how do you do that with ARRAY_CONTAINS/ARRAY_OVERLAP?

So not MV_OVERLAP -> ARRAY_OVERLAP, but converting WHERE conditions into their array equivalents, because I don't think WHERE array = 't1' AND array = 't2' works.
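
One possible way to express those conversions, as a sketch rather than the wording the guide settled on. It assumes `tags` is an MVD in `mvd_example` and a VARCHAR ARRAY in `array_example`, and relies on the documented behavior that ARRAY_OVERLAP matches on any shared element while ARRAY_CONTAINS with an array argument requires all listed elements. Run each statement separately.

```sql
-- MVD today: the row matches if it holds 't1' or 't2'.
SELECT *
FROM "mvd_example"
WHERE "tags" IN ('t1', 't2')

-- Array counterpart of "any of these values": ARRAY_OVERLAP.
SELECT *
FROM "array_example"
WHERE ARRAY_OVERLAP("tags", ARRAY['t1', 't2'])

-- Array counterpart of "all of these values": ARRAY_CONTAINS with an array argument.
SELECT *
FROM "array_example"
WHERE ARRAY_CONTAINS("tags", ARRAY['t1', 't2'])
```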

Member

might be useful to include array equality (e.g. WHERE tags = ARRAY['t1', 't2', 't3'] is i think equivalent to WHERE MV_TO_ARRAY(tags) = ARRAY['t1', 't2', 't3'])

and array grouping
(SELECT label, tags FROM "arrayExample" GROUP BY 1,2 and SELECT label, MV_TO_ARRAY(tags) FROM "mvd_example" GROUP BY 1, 2)

Member Author

@clintropolis if MVD equality checks individual strings, would this query always return zero results?

SELECT *
FROM "mvd_example"
WHERE tags = 't1' AND tags = 't2'

since a single string can't be two different values

Member Author

@2bethere I updated the existing example to use WHERE tags = 't1' OR tags = 't2'

Member

> @clintropolis if MVD equality checks individual strings, would this query always return zero results?

This one is tricky, in SQL yes, because SQL considers this a contradiction so "simplifies" it to always false so nothing matches. However, in native json queries you can write that filter, which provides a way to check that the mvd row contains all of the elements, similar to using MV_CONTAINS i suppose.
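
For the SQL side, a sketch of the MV_CONTAINS alternative mentioned above, assuming `tags` is an MVD column in `mvd_example`. With an array argument, MV_CONTAINS checks that the row holds all of the listed values, which is the intent the contradictory `AND` filter cannot express in SQL.

```sql
-- Matches rows whose MVD holds both 't1' and 't2'.
-- Writing WHERE tags = 't1' AND tags = 't2' instead is simplified to
-- "always false" by the SQL planner, as described in the comment above.
SELECT *
FROM "mvd_example"
WHERE MV_CONTAINS("tags", ARRAY['t1', 't2'])
```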

## Query differences between arrays and MVDs

In SQL queries, Druid operates on arrays differently than MVDs.
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
Contributor

Suggested change
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR). Even though multiple string values within the same MVD are still stored as a single field in the MVD column.

| | Arrays| Multi-value dimensions (MVDs) |
|---|---|---|
| Data types | Supports VARCHAR, BIGINT, and DOUBLE types (ARRAY<STRING\>, ARRAY<LONG\>, ARRAY<DOUBLE\>) | Only supports arrays of strings (VARCHAR) |
| SQL compliance | Behaves like standard SQL arrays with SQL-compliant behavior | Does not behave like standard SQL arrays; requires special SQL functions |
Member

"Does not behave like standard SQL arrays; requires special SQL functions"

Heh, this is a tough one to make a blurb about. Most of the special functions for mvds though are to allow for doing array-like things to mvds, which isn't super clear. If not using special functions, mvds "best effort" behave like regular SQL VARCHAR (which is apparent from the other parts of the doc but also feels like maybe it should be here somehow). If it should be here, I'm also not sure how important it is to mention the parts that fall short of the "best effort" and have some unexpected/unintuitive oddness, such as the implicit unnest when grouping and the filtering behavior of matching a row if any element matches the filter. That itself makes for potentially confusing results when grouping, as the other values in the row show up, which is why functions like mv_filter_only exist.

Member Author

I tried to make this a little more clear and linked to the examples (also included a new example that shows the use case for mv_filter_only)

Comment on lines 54 to 58
* For MVD columns, Druid returns the row when an equality filter matches any value of the MVD.
For example, any of the following filters returns the row for the query:
`WHERE "mvd_column" = 'a'`
`WHERE "mvd_column" = 'b'`
`WHERE "mvd_column" = 'c'`
Member

i wonder if it's worth mentioning that when grouping this means that you'll see rows that don't obviously match the filter you wrote because of the implicit unnest and the filtering occurring prior to the implicit unnest:
... WHERE "mvd_column" = 'a'
would return 3 rows

a
b
c

I guess the linked examples show this behavior in more depth, but the biggest utility of this guide seems to be calling out the behavioral differences between the two, so maybe worth having here too
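
A sketch of that grouping behavior, assuming a hypothetical `mvd_example` datasource in which one row's `mvd_column` holds the values a, b, and c. The first query groups on every value of each matching row; the second uses MV_FILTER_ONLY to keep only the filtered value in the output. Run each statement separately.

```sql
-- The filter selects rows that contain 'a'; the implicit unnest then produces
-- a group for every value in those rows, including 'b' and 'c'.
SELECT "mvd_column", COUNT(*) AS "cnt"
FROM "mvd_example"
WHERE "mvd_column" = 'a'
GROUP BY 1

-- MV_FILTER_ONLY restricts the grouped values to the listed ones.
SELECT MV_FILTER_ONLY("mvd_column", ARRAY['a']) AS "mvd_column", COUNT(*) AS "cnt"
FROM "mvd_example"
WHERE "mvd_column" = 'a'
GROUP BY 1
```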

Member Author

good point; I'll call this out and add another example

Within `dimensionsSpec`, set `"useSchemaDiscovery": true`, and use `dimensions` to list the array inputs with type `auto`.
For an example, see [Ingesting arrays: Native batch and streaming ingestion](../querying/arrays.md#native-batch-and-streaming-ingestion).

* For SQL-based batch ingestion, include the [query context parameter](../multi-stage-query/reference.md#context-parameters) `"arrayIngestMode": "array"` and reference the relevant array type (`VARCHAR ARRAY`, `BIGINT ARRAY`, or `DOUBLE ARRAY`) in the column descriptors.
Member

"in the column descriptors" i'm not sure it's clear what this means and I don't see anything else referring to it in the docs. Do you mean in the extern/extend type declaration thingy?

Ideally everyone should always be using that for any array data type from the external file, and then using ARRAY_TO_MV if storing as an mvd instead of array. I believe the web-console data loader already does this these days, regardless of arrayIngestMode, since the data is in fact arrays in the source files.
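
As a rough sketch of that recommendation: declare the array type in the EXTEND clause, then decide at SELECT time whether to store the column as an array or an MVD. The inline input row, the datasource name, and the column names below are made up for illustration, and the query assumes the context parameter `"arrayIngestMode": "array"` is set on the request.

```sql
-- Assumes the query context includes "arrayIngestMode": "array".
REPLACE INTO "array_example" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\"timestamp\":\"2024-01-01T00:00:00Z\",\"label\":\"row1\",\"tags\":[\"t1\",\"t2\"]}"}',
      '{"type":"json"}'
    )
  ) EXTEND ("timestamp" VARCHAR, "label" VARCHAR, "tags" VARCHAR ARRAY)
)
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "label",
  "tags"  -- stored as an array column
FROM "ext"
PARTITIONED BY DAY
```

To store the same input as an MVD instead, the SELECT would use `ARRAY_TO_MV("tags") AS "tags"` in place of the bare `"tags"` column.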

Member Author

"column descriptors" 😅 -- I got that phrase from

3. A row signature, as a JSON-encoded array of column descriptors. Each column descriptor must have a

Member Author

updated the language and added the recommendation

| Data types | Supports VARCHAR, BIGINT, and DOUBLE types (ARRAY<STRING\>, ARRAY<LONG\>, ARRAY<DOUBLE\>) | Only supports arrays of strings (VARCHAR) |
| SQL compliance | Behaves like standard SQL arrays with SQL-compliant behavior | Behaves like SQL VARCHAR rather than standard SQL arrays and requires special SQL functions to achieve array-like behavior. See the [examples](#examples). |
| Ingestion | <ul><li>JSON arrays are ingested as Druid arrays</li><li>Managed through the query context parameter `arrayIngestMode` in SQL-based ingestion (supported options: `array`, `mvd`, `none`). Note that if you set this mode to `none`, Druid raises an exception if you try to store any type of array.</li></ul> | <ul><li>JSON arrays are ingested as multi-value dimensions</li><li>Managed using functions like [ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv) in SQL-based ingestion</li></ul> |
| Filtering and grouping | <ul><li>Filters and groupings match the entire array value</li><li>Can be used as GROUP BY keys, grouping based on the entire array value</li></ul> | <ul><li>Filters match any value within the array</li><li>Grouping generates a group for each individual value, similar to an implicit UNNEST</li></ul> |
Member

should the array section mention explicitly using unnest to group on array elements given that the mvd section talks about implicit unnest?
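
For reference, a sketch of the explicit form on an array column, assuming `tags` is a VARCHAR ARRAY in `array_example`; the aliases `t` and `tag` are arbitrary. This produces one group per array element, which is what an MVD does implicitly when grouped.

```sql
-- Explicit UNNEST: expand each array into one row per element, then group.
SELECT "tag", COUNT(*) AS "cnt"
FROM "array_example"
CROSS JOIN UNNEST("tags") AS t("tag")
GROUP BY 1
```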

Member Author

good idea, adding

docs/release-info/migr-mvd-array.md (outdated)
@2bethere (Contributor) left a comment

Thank you so much for making those changes. This looks really good!

## Querying arrays and MVDs

In SQL queries, Druid operates on arrays differently than MVDs.
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
Contributor

Suggested change
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
A value in an array column is treated as a single array entity (SQL ARRAY), whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).

Can a single value in an MVD column consist of multiple individual strings?

Member Author

Yes, kind of like an array but not treated as a single entity

@ektravel (Contributor) left a comment

Left some questions/suggestions.

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
@ektravel (Contributor) left a comment

I left a few suggestions but the changes look good.

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
@vtlim vtlim merged commit 836cdb4 into apache:master Jun 13, 2024
12 checks passed
techdocsmith pushed a commit to techdocsmith/druid that referenced this pull request Jun 13, 2024
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
vtlim added a commit that referenced this pull request Jun 13, 2024
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Benedict Jin <asdf2014@apache.org>
@vtlim vtlim deleted the docs-array-migration branch June 14, 2024 23:26

5 participants