Feature: Fixed size list nested type (ARRAY) #8983

Maxxen · 2023-09-18T19:43:41Z

ARRAY Physical Type

This PR adds a new nested physical+logical type optimized for representing fixed-size lists, with accompanying functions and full support for storage/joins/grouping/sorting. Internally, an ARRAY vector is somewhat like a combination of a STRUCT and a LIST: the vector itself does not store any data except for a validity mask (like structs), but it has a single child vector (like lists).

Unlike lists, however, there is no need to keep and store "list entries" with offsets and lengths, because the data of the array at position i will always start at array_size * i in the child vector. Therefore the child vector is always a flat vector allocated at a fixed STANDARD_VECTOR_SIZE * array_size capacity, and it is never shrunk or resized (unless the parent vector is too). This makes array vectors less memory efficient when they contain a lot of nulls or not enough elements to fill an entire vector, but in exchange child element access becomes faster and, most importantly, predictable, which could enable more optimized functions to be implemented over them (e.g. list lambdas, sorting, filtering, etc.).
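To make the layout concrete, here is a rough standalone sketch of the idea. It is not DuckDB's actual Vector class: the struct name, its members and the use of std::vector are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for an ARRAY vector: one validity flag
// per row and a single flat child buffer. Not DuckDB's real Vector class.
struct FixedSizeArrayVector {
    std::size_t array_size;     // number of elements in every array
    std::vector<bool> validity; // parent validity mask, one entry per row
    std::vector<int> child;     // flat child data: capacity * array_size elements

    FixedSizeArrayVector(std::size_t capacity, std::size_t array_size_p)
        : array_size(array_size_p), validity(capacity, true), child(capacity * array_size_p, 0) {
    }

    // Element j of the array in row `row`. The position is computed directly,
    // so no (offset, length) list entry has to be stored or looked up.
    int &At(std::size_t row, std::size_t j) {
        assert(j < array_size);
        return child[row * array_size + j];
    }
};
```

The child buffer is sized once from the capacity and never depends on how many rows are actually valid, which is where the memory trade-off mentioned above comes from.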

All the list functions should work with arrays, but the following array-native functions are also implemented:

array_value - Create an ARRAY containing the argument values.
array_cross_product - Compute the cross product of two arrays with length 3.
array_cosine_similarity - Compute the cosine similarity between two arrays.
array_distance - Compute the distance between two arrays.
array_inner_product - Compute the inner product between two arrays.
length (overloaded) - Returns the length of the array (always constant).
array_length - Returns the length of the array along a given dimension (useful for nested arrays).
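For reference, the math behind the numeric functions above can be sketched in a few lines. This is only an illustration of the formulas, not the code in this PR: the function names are made up, and reading "distance" as Euclidean distance is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative helpers only; names and signatures are hypothetical.
double InnerProduct(const std::vector<double> &a, const std::vector<double> &b) {
    assert(a.size() == b.size()); // arrays of one type always have equal length
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Assumption: "distance" means Euclidean (L2) distance.
double Distance(const std::vector<double> &a, const std::vector<double> &b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); i++) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// cos(a, b) = (a . b) / (|a| * |b|); undefined if either input is all zeros.
double CosineSimilarity(const std::vector<double> &a, const std::vector<double> &b) {
    return InnerProduct(a, b) / (std::sqrt(InnerProduct(a, a)) * std::sqrt(InnerProduct(b, b)));
}

// The cross product is only defined here for length-3 inputs.
std::array<double, 3> CrossProduct(const std::array<double, 3> &a, const std::array<double, 3> &b) {
    return {a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]};
}
```

Because both inputs have the same known length, none of these loops needs a per-row length lookup.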

Future work

`ARRAY` constructor and untangle `array_` as alias for list

We've basically inherited the concept of "arrays" from Postgres and currently have a lot of aliases for list functions (and tests!) using the array_ prefix. We should probably remove/rename these to make a clear separation between lists and this new array type.

We also have the ARRAY[] syntax which I think we also should change to basically alias array_value to create this new type instead of lists. As of now the "cleanest" way to create an array literal is to do a cast, e.g. [1,2,3]::INT[3], but that is of course less efficient.

Specialize list functions for arrays

I've made it so that LIST and ARRAY are implicitly castable to each other if their child elements are implicitly castable, which means we can use arrays with all the existing list functions (although by introducing a manual cast in their bind hooks, read on below). This works, but is pretty inefficient, not just because we have to allocate a bunch of list_entry_t's, but also because I think a lot of list functions could be implemented more easily and efficiently when you know all the lists have the same length, which is always the case for arrays. This could be done incrementally over time though, so no big deal.
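To show where the extra allocation comes from, here is a minimal sketch of what an array-to-list conversion has to materialize. The ListEntry struct and the function name are hypothetical stand-ins for illustration, not the actual cast code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a list entry: where a list starts in the child
// buffer and how many elements it has.
struct ListEntry {
    std::uint64_t offset;
    std::uint64_t length;
};

// An array vector needs none of this, but a list function expects one entry
// per row, so the cast has to write them all out even though every entry is
// fully determined by the row index and the array size.
std::vector<ListEntry> ArrayToListEntries(std::size_t row_count, std::size_t array_size) {
    std::vector<ListEntry> entries;
    entries.reserve(row_count);
    for (std::size_t row = 0; row < row_count; row++) {
        entries.push_back(ListEntry {static_cast<std::uint64_t>(row * array_size),
                                     static_cast<std::uint64_t>(array_size)});
    }
    return entries;
}
```

A function specialized for arrays could skip this step and index the child data directly.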

Example

-- Construct with the 'array_value' function
SELECT array_value(1,2,3);
┌──────────────────────┐
│ array_value(1, 2, 3) │
│      integer[3]      │
├──────────────────────┤
│ [1, 2, 3]            │
└──────────────────────┘

-- You can always implicitly cast to a list (and use list functions, like list_extract, '[i]')
SELECT array_value(1,2,3)[2];
┌─────────────────────────┐
│ array_value(1, 2, 3)[2] │
│          int32          │
├─────────────────────────┤
│                       2 │
└─────────────────────────┘

-- You can cast from list, but the dimensions have to match up!
SELECT [3,2,1]::INT[3];
┌──────────────────────────────────────────────┐
│ CAST(main.list_value(3, 2, 1) AS INTEGER[3]) │
│                  integer[3]                  │
├──────────────────────────────────────────────┤
│ [3, 2, 1]                                    │
└──────────────────────────────────────────────┘

SELECT [3,2,1]::INT[4];
Error: Conversion Error: Cannot cast list with length 3 to array with length 4

-- Arrays can of course also be nested!
SELECT array_value(array_value(1,2), array_value(3,4), array_value(5,6));
┌──────────────────────────────────────────────────────────────────────┐
│ array_value(array_value(1, 2), array_value(3, 4), array_value(5, 6)) │
│                            integer[2][3]                             │
├──────────────────────────────────────────────────────────────────────┤
│ [[1, 2], [3, 4], [5, 6]]                                             │
└──────────────────────────────────────────────────────────────────────┘

SELECT array_value({'a': 1, 'b': 2}, {'a': 3, 'b': 4});
┌─────────────────────────────────────────────────────────────────────────────────┐
│ array_value(main.struct_pack(a := 1, b := 2), main.struct_pack(a := 3, b := 4)) │
│                         struct(a integer, b integer)[2]                         │
├─────────────────────────────────────────────────────────────────────────────────┤
│ [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]                                            │
└─────────────────────────────────────────────────────────────────────────────────┘
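The length check behind the failing INT[4] cast above can be sketched as follows. This is a simplified illustration on plain std::vector values: the function name is made up, and only the error message text is taken from the output shown earlier.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: a list can only become an array if its length matches
// the target array size exactly.
std::vector<int> CastListToArray(const std::vector<int> &list, std::size_t array_size) {
    if (list.size() != array_size) {
        throw std::invalid_argument("Cannot cast list with length " + std::to_string(list.size()) +
                                    " to array with length " + std::to_string(array_size));
    }
    return list; // same elements, now known to have exactly array_size entries
}
```

In a vectorized setting the same check would have to hold for every list in the input, since all rows of an array column share one size.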

Generic binding and casts

It is currently a bit awkward to write functions for the list type as you usually want them to be somewhat polymorphic over the child type and do a lot of binding logic and casting in explicit bind functions when implementing e.g. a scalar. For arrays this is even more problematic since you usually also want your function to accept arrays - not just of any type - but also any size. So far I've implemented the new functions for the array type as taking LogicalType::ANY and then performing all type checking manually in the bind hook. While LogicalType::ANY works to signal placeholder types we would need something similar to signal "any number", basically like a non-type template parameter.

I think this could be an opportunity to look into extending our type system to handle both the nested ANY type better and perhaps also support non-type parameter placeholders/generic parameters.

Note that I think this is primarily useful for docs/error messages, I don't mind too much if you have to actually enforce the types manually when binding (similar to how ANY works now I guess).

Small array optimizations

An interesting idea is to allow "small" arrays of fixed-size types, e.g. INTEGER[4] or DOUBLE[2], to be stored in the parent vector's data as some sort of small blob/struct of up to 16/32 bytes, avoiding a child vector altogether. You would probably also need to store an additional validity mask for the array elements as an auxiliary vector buffer though. But since this would change so much of the execution/storage, maybe it would make more sense to have this as another separate physical type (like a TUPLE or something)? idk.

Add ARRAY to `all_types`, client libraries/integrations

Notably this type maps better to e.g. Arrow FixedSizeLists and other integrations like numpy arrays.

Maxxen · 2023-10-10T21:04:11Z

Alright, I've fixed all the issues @lnkuiper brought up and added his examples to the tests. Other work:

You can now roundtrip parquet/json/csv containing ARRAY types, although this is not optimized (I think parquet performs VARCHAR conversions for now).
For binding purposes it's possible to construct an array type of "unknown size", similar to LogicalType::ANY.
Implicit list->array casts are allowed if the target array type is of known size.
Squashed a whole bunch of bugs regarding child-vector validity at all levels of the system. Similar to structs, the validity of the child vector has to be synchronized with the parent (array) vector.
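One possible reading of "synchronized" is that every child element belonging to a NULL array row must itself be marked invalid. The sketch below illustrates that reading only. The helper name and the std::vector<bool> masks are assumptions, not the actual fix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: propagate NULL rows of the parent (array) vector down
// to the matching slice of the child validity mask.
void SyncChildValidity(const std::vector<bool> &parent_validity, std::size_t array_size,
                       std::vector<bool> &child_validity) {
    assert(child_validity.size() == parent_validity.size() * array_size);
    for (std::size_t row = 0; row < parent_validity.size(); row++) {
        if (parent_validity[row]) {
            continue; // valid rows keep whatever validity their elements have
        }
        for (std::size_t j = 0; j < array_size; j++) {
            child_validity[row * array_size + j] = false;
        }
    }
}
```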

After working on all this (the validity bug fixing in particular) I'm actually pretty confident that everything works as it should now. Maybe we can have a look tomorrow or later this week, @Mytherin?

taniabogatsch

Hi @Maxxen! Really cool PR! I looked at the first part of my review assignment haha maybe @taniabogatsch can do the list/cast/binding/frontend-y. I added comments, some nitpicks, and questions!

I'll add another review to cover binding and casts.

General comments

We should probably remove/rename these to make a clear separation between lists and this new array type.

I completely agree. As we've already discussed, some list functions even match the new ARRAY type better! Let's pick this up in a future PR and clean up our functions there.

An interesting idea is to allow "small" arrays of fixed-size types [...] to be stored in the parent vectors data [...]

I don't think this would have much impact performance-wise. We don't expect people to have only a few rows (or a single row) in a table with a small array size. With more rows, the child vector will be filled sufficiently for each processed data chunk, and our scalar function execution will carry the performance.

Other thoughts

This makes array vectors less memory efficient when they contain a lot of nulls or not enough elements to fill an entire vector, [...]

This comment directly made me wonder how much that impacts aggregation performance, as I fixed this for LIST by writing the custom list_segment implementation. But then I noticed we do not support ARRAY as an aggregate function yet. Would that even make sense to have in the future, as that would require all groups to have the same size?

D SELECT LIST(i) FROM t;
┌─────────────────────┐
│       list(i)       │
│      int32[][]      │
├─────────────────────┤
│ [[1, 2], [1, 2, 3]] │
└─────────────────────┘
D SELECT ARRAY(i) FROM t;
Error: Parser Error: syntax error at or near "i"
LINE 1: SELECT ARRAY(i) FROM t;

.github/config/uncovered_files.csv

src/common/extra_type_info.cpp

src/common/types.cpp

src/common/types/vector_buffer.cpp

taniabogatsch · 2023-10-11T07:47:10Z

src/core_functions/function_list.cpp

@@ -17,6 +17,7 @@
 #include "duckdb/core_functions/scalar/string_functions.hpp"
 #include "duckdb/core_functions/scalar/struct_functions.hpp"
 #include "duckdb/core_functions/scalar/union_functions.hpp"
+#include "duckdb/core_functions/scalar/array_functions.hpp"


Are these functions all supposed to go into the duckdb core functions? Or do we want to move some of them to the other functions?

idk, where else would they go? I thought everything in main duckdb is core_functions?

Yea, tbh, I am also confused about this. I just saw that we also had src/function/scalar/... when writing this comment and thought that they belonged there. But how/why do we have these two separate places?

All functions except for the super core functions should go into core_functions. The idea is that the core_functions will eventually move to an extension, so you can have DuckDB without this set of functions (although in general all DuckDB installations will include these by default). This is not possible for all functions, for example min and max are required to implement other machinery (e.g. subqueries) so they must ALWAYS be bundled with DuckDB.

src/common/types/value.cpp

src/core_functions/scalar/array/array_value.cpp

src/core_functions/scalar/array/array_functions.cpp

Mytherin

Looks great! Some minor comments - otherwise good to go from my side.

src/common/types.cpp

src/common/types/vector.cpp

Mytherin · 2023-10-11T15:28:04Z

src/core_functions/function_list.cpp

All functions except for the super core functions should go into core_functions. The idea is that the core_functions will eventually move to an extension, so you can have DuckDB without this set of functions (although in general all DuckDB installations will include these by default). This is not possible for all functions, for example min and max are required to implement other machinery (e.g. subqueries) so they must ALWAYS be bundled with DuckDB.

Maxxen · 2023-10-12T07:23:12Z

All green! @Mytherin

Mytherin · 2023-10-12T18:03:36Z

Thanks!

Maxxen added 30 commits June 1, 2023 22:08

initial scaffolding

96b366f

wip

0d51224

initial expression execution/casting working

9ce63d5

merge with master

e7d1da5

list/array conversion

b3d0049

cosine similarity lol

4946417

add initial storage

9e30921

super basic row operations, although needs to handle validity and nested types

4bd4ebd

fix broken vector copy op

7bf4767

got validity working

a72facf

nested types working

7be961e

format-fix

aa928c1

initial (new)row serialization works, does not work for nested arrays in arrays, needs refactor

65dd833

format fix

5dc8a67

more tests

643d743

add serialization, typename fixes, list conversion

c41f023

implemented resize, so now nested seems to work

771ae55

solved nested array buggit status

88d2712

fixed nested sort

9bc0569

finally got tuple data working

76b228b

joins work

cf68093

fix gather/scatter

80bd4fb

remove templating

fa8b251

fix faulty check in storage

7c2a143

Merge branch 'master' into types/fixedsizelist

5271749

format fix

e5163c1

added list cast, added array size limit

7bb601b

Merge branch 'master' into types/fixedsizelist

29dd47c

tidy warnings

baaf54b

solved a bug, improve codecov, still cant create >2048 length child vectors...

fec6dee

Mytherin changed the base branch from main to feature October 10, 2023 07:02

Maxxen added 5 commits October 10, 2023 15:47

more list funcs, fix child validity not being set in Vector::Reference and HeapGather

967212b

dont modify extension config

f381e11

dont set child validity unneccessarily

06538bd

add missing header

f631e51

merge

e4b4727

Maxxen marked this pull request as ready for review October 10, 2023 20:56

taniabogatsch suggested changes Oct 11, 2023

incorporate tanias feedback

db51712

github-actions bot marked this pull request as draft October 11, 2023 14:48

Mytherin approved these changes Oct 11, 2023

Maxxen added 3 commits October 11, 2023 18:14

add back array size limit, loosen child const vector

6189a52

merge

6497a3b

more headers

7322a64

Maxxen marked this pull request as ready for review October 11, 2023 17:42

Mytherin merged commit 1ed4d5a into duckdb:feature Oct 12, 2023
45 checks passed

thadguidry mentioned this pull request Oct 13, 2023

Array type with windowing like functionality? ohler55/ojg#143

Closed

szarnyasg added Needs Documentation Use for issues or PRs that require changes in the documentation and removed Needs Documentation Use for issues or PRs that require changes in the documentation labels Oct 23, 2023

szarnyasg mentioned this pull request Nov 14, 2023

Documentation for fixed size lists duckdb/duckdb-web#1517

Merged

szarnyasg added Needs Documentation Use for issues or PRs that require changes in the documentation and removed Needs Documentation Use for issues or PRs that require changes in the documentation labels Nov 14, 2023

duckdblabs-bot mentioned this pull request Nov 14, 2023

[duckdb/#8983] - Feature: Fixed size list nested type (ARRAY) needs documentation duckdb/duckdb-web#1521

Closed

This was referenced Feb 14, 2024

Upgrade DuckDB dialect to DuckDB 0.10.2 jOOQ/jOOQ#16287

Closed

Support DuckDB fixed size arrays jOOQ/jOOQ#16291

Open
