Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Fixed size list nested type (ARRAY) #8983

Merged
merged 78 commits into from
Oct 12, 2023

Conversation

Maxxen
Copy link
Member

@Maxxen Maxxen commented Sep 18, 2023

ARRAY Physical Type

This PR adds a new nested physical+logical type optimized for representing fixed-size-lists, with accompanying functions and full support for storage/joins/grouping/sorting. Internally an ARRAY vector is somewhat like a combination of a STRUCT and a LIST. The vector itself does not store any data except for a validity mask (like structs), but has a single child vector (like lists).

Unlike lists however there is no need to keep and store "list entries" with offsets and lengths because the data of the array at position i will always be at array_size * i. Therefore the child vector is always a flat vector and allocated at a fixed STANDARD_VECTOR_SIZE * array_size capacity, and is never shrunk or resized (unless the parent vector is too). This makes array vectors less memory efficient when they contain a lot of nulls or not enough elements to fill an entire vector, but in exchange they make child element access faster, but most importantly, predictable, which could enable for more optimized functions to be implemented over them (e.g. list lambdas, sorting, filtering, e.t.c.).

All the list functions should work with arrays, but the following array-native functions are implemented

  • array_value - Create an ARRAY containing the argument values
  • array_cross_product - Compute the cross product of two arrays with length 3.
  • array_cosine_similarity - Compute the cosine similarity between two arrays.
  • array_distance - Compute the distance between two arrays.
  • array_inner_product - Compute the inner product between two arrays.
  • length (overloaded) - Returns the length of the array (always constant).
  • array_length - Returns the length of the array along a given dimension. (useful for nested arrays).

Future work

ARRAY constructor and untangle array_ as alias for list

We've basically inherited the concept of "arrays" from Postgres and currently have a lot of aliases for list functions (and tests!) using the array_ prefix. We should probably remove/rename these to make a clear separation between lists and this new array type.

We also have the ARRAY[] syntax which I think we also should change to basically alias array_value to create this new type instead of lists. As of now the "cleanest" way to create an array literal is to do a cast, e.g. [1,2,3]::INT[3], but that is of course less efficient.

Specialize list functions for arrays

I've made it so that LIST and ARRAY are implicitly castable to each other if their child elements are implicitly castable, which makes it so that we can use arrays with all the existing list functions (although by introducing a manual cast in their bind hooks, read on below). This works, but is pretty inefficient, not just because we have to allocate a bunch of list_entry_t's but also because I think a lot of list functions could be implemented a lot easier and efficiently when you know all the list have the same length, which is always the case for arrays. This could be done incrementally over time though, so no big deal.

Example

-- Construct with the 'array_value' function
SELECT array_value(1,2,3);
┌──────────────────────┐
│ array_value(1, 2, 3) │
│      integer[3]      │
├──────────────────────┤
│ [1, 2, 3]            │
└──────────────────────┘

-- You can always implicitly cast to a list (and use list functions, like list_extract, '[i]')
SELECT array_value(1,2,3)[2];
┌─────────────────────────┐
│ array_value(1, 2, 3)[2] │
│          int32          │
├─────────────────────────┤
│                       2 │
└─────────────────────────┘

-- You can cast from list, but the dimensions have to match up!
SELECT [3,2,1]::INT[3];
┌──────────────────────────────────────────────┐
│ CAST(main.list_value(3, 2, 1) AS INTEGER[3]) │
│                  integer[3]                  │
├──────────────────────────────────────────────┤
│ [3, 2, 1]                                    │
└──────────────────────────────────────────────┘

SELECT [3,2,1]::INT[4];
Error: Conversion Error: Cannot cast list with length 3 to array with length 4

-- Arrays can of course also be nested!
SELECT array_value(array_value(1,2), array_value(3,4), array_value(5,6));
┌──────────────────────────────────────────────────────────────────────┐
│ array_value(array_value(1, 2), array_value(3, 4), array_value(5, 6)) │
│                            integer[2][3]                             │
├──────────────────────────────────────────────────────────────────────┤
│ [[1, 2], [3, 4], [5, 6]]                                             │
└──────────────────────────────────────────────────────────────────────┘

SELECT array_value({'a': 1, 'b': 2}, {'a': 3, 'b': 4});
┌─────────────────────────────────────────────────────────────────────────────────┐
│ array_value(main.struct_pack(a := 1, b := 2), main.struct_pack(a := 3, b := 4)) │
│                         struct(a integer, b integer)[2]                         │
├─────────────────────────────────────────────────────────────────────────────────┤
│ [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]                                            │
└─────────────────────────────────────────────────────────────────────────────────┘

Generic binding and casts

It is currently a bit awkward to write functions for the list type as you usually want them to be somewhat polymorphic over the child type and do a lot of binding logic and casting in explicit bind functions when implementing e.g. a scalar. For arrays this is even more problematic since you usually also want your function to accept arrays - not just of any type - but also any size. So far I've implemented the new functions for the array type as taking LogicalType::ANY and then performing all type checking manually in the bind hook. While LogicalType::ANY works to signal placeholder types we would need something similar to signal "any number", basically like a non-type template parameter.

I think this could be an opportunity to look into extending our type system to handle both the nested ANY type better and perhaps also support non-type parameter placeholders/generic parameters.

Note that I think this is primarily useful for docs/error messages, I don't mind too much if you have to actually enforce the types manually when binding (similar to how ANY works now I guess).

Small array optimizations

An interesting idea is to allow "small" arrays of fixed-size types, e.g. INTEGER[4], DOUBLE[2] to be stored in the parent vectors data as some sort of small blob/struct up to 16/32 bytes, avoiding having a child vector altogether. You would also probably need to store an additional validity mask for the array elements as an auxiliary vector buffer though. But since this would change so much of the execution/storage maybe it would make more sense to have this as another separate physical type? (like a TUPLE or something). idk.

Add ARRAY to all_types, client libraries/integrations

Notably this type maps better to e.g. Arrow FixedSizeLists and other integrations like numpy arrays.

@Mytherin Mytherin changed the base branch from main to feature October 10, 2023 07:02
@Maxxen Maxxen marked this pull request as ready for review October 10, 2023 20:56
@Maxxen
Copy link
Member Author

Maxxen commented Oct 10, 2023

Alright, I've fixed all the issues @lnkuiper brought up and added his examples to the tests. Other work:

  • You can now roundtrip parquet/json/csv containing ARRAY types, although not optimized (I think parquet performs VARCAHR conversions for now)
  • For binding purposes its possible to construct a array type of "unknown size", similar to LogicalType::ANY
  • implicit list->array casts are allowed if the target array type is of known size.
  • Squashed a whole bunch of bugs regarding child-vector validity at all levels of the system. Similar to structs the validity of the child vector have to be synchronized with the parent (array) vector.

After working on all this (the validity bug fixing in particular) Im actually pretty confident that everything works as it should now. Maybe we can have a look tomorrow or later this week @Mytherin ?

Copy link
Contributor

@taniabogatsch taniabogatsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Maxxen! Really cool PR! I looked at the first part of my review assignment haha maybe @taniabogatsch can do the list/cast/binding/frontend-y. I added comments, some nitpicks, and questions!

I'll add another review to cover binding and casts.

General comments

We should probably remove/rename these to make a clear separation between lists and this new array type.

I completely agree. As we've already discussed, some list functions even match the new ARRAY type better! Let's pick this up in a future PR and clean up our functions there.

An interesting idea is to allow "small" arrays of fixed-size types [...] to be stored in the parent vectors data [...]

I don't think that this makes a lot of impact performance-wise. We don't assume people to have a few/a single row in their table with a small array size. Then, the child vector will be filled sufficiently for each processed data chunk, and our scalar function execution will carry the performance.

Other thoughts

This makes array vectors less memory efficient when they contain a lot of nulls or not enough elements to fill an entire vector, [...]

This comment directly made me wonder how much that impacts aggregation performance, as I fixed this for LIST by writing the custom list_segment implementation. But then I noticed we do not support ARRAY as an aggregate function yet. Would that even make sense to have in the future, as that would require all groups to have the same size?

D SELECT LIST(i) FROM t;
┌─────────────────────┐
│       list(i)       │
│      int32[][]      │
├─────────────────────┤
│ [[1, 2], [1, 2, 3]] │
└─────────────────────┘
D SELECT ARRAY(i) FROM t;
Error: Parser Error: syntax error at or near "i"
LINE 1: SELECT ARRAY(i) FROM t;

.github/config/uncovered_files.csv Show resolved Hide resolved
src/common/extra_type_info.cpp Outdated Show resolved Hide resolved
src/common/types.cpp Show resolved Hide resolved
src/common/types/vector_buffer.cpp Outdated Show resolved Hide resolved
@@ -17,6 +17,7 @@
#include "duckdb/core_functions/scalar/string_functions.hpp"
#include "duckdb/core_functions/scalar/struct_functions.hpp"
#include "duckdb/core_functions/scalar/union_functions.hpp"
#include "duckdb/core_functions/scalar/array_functions.hpp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these functions all supposed to go into the duckdb core functions? Or do we want to move some of them to the other functions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk, where else would they go? I thought everything in main duckdb is core_functions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, tbh, I am also confused about this. I just saw that we also had src/function/scalar/... when writing this comment and thought that they belonged there. But how/why do we have these two separate places?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All functions except for the super core functions should go into core_functions. The idea is that the core_functions will eventually move to an extension, so you can have DuckDB without this set of functions (although in general all DuckDB installations will include these by default). This is not possible for all functions, for example min and max are required to implement other machinery (e.g. subqueries) so they must ALWAYS be bundled with DuckDB.

src/common/types/value.cpp Outdated Show resolved Hide resolved
src/core_functions/scalar/array/array_value.cpp Outdated Show resolved Hide resolved
@github-actions github-actions bot marked this pull request as draft October 11, 2023 14:48
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Some minor comments - otherwise good to go from my side.

src/common/types.cpp Outdated Show resolved Hide resolved
src/common/types/vector.cpp Outdated Show resolved Hide resolved
@@ -17,6 +17,7 @@
#include "duckdb/core_functions/scalar/string_functions.hpp"
#include "duckdb/core_functions/scalar/struct_functions.hpp"
#include "duckdb/core_functions/scalar/union_functions.hpp"
#include "duckdb/core_functions/scalar/array_functions.hpp"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All functions except for the super core functions should go into core_functions. The idea is that the core_functions will eventually move to an extension, so you can have DuckDB without this set of functions (although in general all DuckDB installations will include these by default). This is not possible for all functions, for example min and max are required to implement other machinery (e.g. subqueries) so they must ALWAYS be bundled with DuckDB.

@Maxxen Maxxen marked this pull request as ready for review October 11, 2023 17:42
@Maxxen
Copy link
Member Author

Maxxen commented Oct 12, 2023

All green! @Mytherin

@Mytherin Mytherin merged commit 1ed4d5a into duckdb:feature Oct 12, 2023
45 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

@szarnyasg szarnyasg added Needs Documentation Use for issues or PRs that require changes in the documentation and removed Needs Documentation Use for issues or PRs that require changes in the documentation labels Oct 23, 2023
@szarnyasg szarnyasg added Needs Documentation Use for issues or PRs that require changes in the documentation and removed Needs Documentation Use for issues or PRs that require changes in the documentation labels Nov 14, 2023
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 11, 2023
Merge pull request duckdb/duckdb#9164 from Mause/feature/jdbc-uuid-param
Merge pull request duckdb/duckdb#9185 from pdet/adbc_07
Merge pull request duckdb/duckdb#9126 from Maxxen/parquet-kv-metadata
Merge pull request duckdb/duckdb#9123 from lnkuiper/parquet_schema
Merge pull request duckdb/duckdb#9086 from lnkuiper/json_inconsistent_structure
Merge pull request duckdb/duckdb#8977 from Tishj/python_readcsv_multi_v2
Merge pull request duckdb/duckdb#9279 from hawkfish/nsdate-cast
Merge pull request duckdb/duckdb#8851 from taniabogatsch/binary_lambdas
Merge pull request duckdb/duckdb#8983 from Maxxen/types/fixedsizelist
Merge pull request duckdb/duckdb#9318 from Maxxen/fix-unused
Merge pull request duckdb/duckdb#9220 from hawkfish/exclude
Merge pull request duckdb/duckdb#9230 from Maxxen/json-plan-serialization
Merge pull request duckdb/duckdb#9011 from Tmonster/add_create_statement_support_to_fuzzer
Merge pull request duckdb/duckdb#9400 from Maxxen/array-fixes
Merge pull request duckdb/duckdb#8741 from Tishj/python_import_cache_upgrade
Merge fixes
Merge pull request duckdb/duckdb#9395 from taniabogatsch/lambda-performance
Merge pull request duckdb/duckdb#9427 from Tishj/python_table_support_replacement_scan
Merge pull request duckdb/duckdb#9516 from carlopi/fixformat
Merge pull request duckdb/duckdb#9485 from Maxxen/fix-parquet-serialization
Merge pull request duckdb/duckdb#9388 from chrisiou/issue217
Merge pull request duckdb/duckdb#9565 from Maxxen/fix-array-vector-sizes
Merge pull request duckdb/duckdb#9583 from carlopi/feature
Merge pull request duckdb/duckdb#8907 from cryoEncryp/new-list-functions
Merge pull request duckdb/duckdb#8642 from Virgiel/capi-streaming-arrow
Merge pull request duckdb/duckdb#8658 from Tishj/pytype_optional
Merge pull request duckdb/duckdb#9040 from Light-City/feature/set_mg
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Documentation Use for issues or PRs that require changes in the documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants