Feature: Fixed size list nested type (ARRAY) #8983
Conversation
… in arrays, needs refactor
Alright, I've fixed all the issues @lnkuiper brought up and added his examples to the tests. Other work:
After working on all this (the validity bug fixing in particular), I'm actually pretty confident that everything works as it should now. Maybe we can have a look tomorrow or later this week, @Mytherin?
Hi @Maxxen! Really cool PR! I looked at the first part of my review assignment, haha; maybe @taniabogatsch can do the list/cast/binding/frontend-y part. I added comments, some nitpicks, and questions!
I'll add another review to cover binding and casts.
General comments
We should probably remove/rename these to make a clear separation between lists and this new array type.
I completely agree. As we've already discussed, some list functions even match the new ARRAY type better! Let's pick this up in a future PR and clean up our functions there.
An interesting idea is to allow "small" arrays of fixed-size types [...] to be stored in the parent vectors data [...]
I don't think this has much performance impact. We don't expect people to have only a few rows (or a single row) with a small array size in their table; the child vector will be filled sufficiently for each processed data chunk, and our scalar function execution will carry the performance.
Other thoughts
This makes array vectors less memory efficient when they contain a lot of nulls or not enough elements to fill an entire vector, [...]
This comment directly made me wonder how much that impacts aggregation performance, as I fixed this for LIST by writing the custom list_segment implementation. But then I noticed we do not support ARRAY as an aggregate function yet. Would that even make sense to have in the future, as it would require all groups to have the same size?
D SELECT LIST(i) FROM t;
┌─────────────────────┐
│ list(i) │
│ int32[][] │
├─────────────────────┤
│ [[1, 2], [1, 2, 3]] │
└─────────────────────┘
D SELECT ARRAY(i) FROM t;
Error: Parser Error: syntax error at or near "i"
LINE 1: SELECT ARRAY(i) FROM t;
@@ -17,6 +17,7 @@
 #include "duckdb/core_functions/scalar/string_functions.hpp"
 #include "duckdb/core_functions/scalar/struct_functions.hpp"
 #include "duckdb/core_functions/scalar/union_functions.hpp"
+#include "duckdb/core_functions/scalar/array_functions.hpp"
Are these functions all supposed to go into the duckdb core functions? Or do we want to move some of them to the other functions?
idk, where else would they go? I thought everything in main duckdb is core_functions?
Yea, tbh, I am also confused about this. I just saw that we also had src/function/scalar/... when writing this comment and thought that they belonged there. But how/why do we have these two separate places?
All functions except for the super core functions should go into core_functions. The idea is that core_functions will eventually move to an extension, so you can have DuckDB without this set of functions (although in general all DuckDB installations will include these by default). This is not possible for all functions; for example, min and max are required to implement other machinery (e.g. subqueries), so they must ALWAYS be bundled with DuckDB.
Looks great! Some minor comments - otherwise good to go from my side.
All green! @Mytherin
Thanks!
ARRAY Physical Type
This PR adds a new nested physical + logical type optimized for representing fixed-size lists, with accompanying functions and full support for storage/joins/grouping/sorting.

Internally, an ARRAY vector is somewhat like a combination of a STRUCT and a LIST. The vector itself does not store any data except for a validity mask (like structs), but has a single child vector (like lists). Unlike lists, however, there is no need to keep and store "list entries" with offsets and lengths, because the data of the array at position i will always be at array_size * i. Therefore the child vector is always a flat vector allocated at a fixed STANDARD_VECTOR_SIZE * array_size capacity, and is never shrunk or resized (unless the parent vector is too). This makes array vectors less memory efficient when they contain a lot of nulls or not enough elements to fill an entire vector, but in exchange child element access is faster and, most importantly, predictable, which could enable more optimized functions to be implemented over them (e.g. list lambdas, sorting, filtering, etc.).

All the list functions should work with arrays, but the following array-native functions are implemented:
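As a rough illustration of the layout difference described above (illustrative Python with hypothetical names, not DuckDB internals), a LIST vector needs a per-row (offset, length) entry into its child data, while an ARRAY vector can compute every child position from the fixed array size:

```python
# Sketch of LIST vs ARRAY child indexing; not DuckDB code.

# LIST: each row stores an explicit (offset, length) into the child data.
list_entries = [(0, 2), (2, 3)]          # rows [[1, 2], [1, 2, 3]]
list_child = [1, 2, 1, 2, 3]

def list_element(row, j):
    offset, length = list_entries[row]
    assert j < length
    return list_child[offset + j]

# ARRAY: no entries needed; element j of row i is always at array_size * i + j.
array_size = 3
array_child = [1, 2, 3, 4, 5, 6]         # two INT[3] rows in one flat child vector

def array_element(row, j):
    return array_child[array_size * row + j]

print(list_element(1, 2))   # 3
print(array_element(1, 0))  # 4
```

The ARRAY case needs no auxiliary entry storage at all, which is exactly the memory-for-predictability trade-off the PR text describes.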
- array_value - Create an ARRAY containing the argument values.
- array_cross_product - Compute the cross product of two arrays with length 3.
- array_cosine_similarity - Compute the cosine similarity between two arrays.
- array_distance - Compute the distance between two arrays.
- array_inner_product - Compute the inner product between two arrays.
- length (overloaded) - Returns the length of the array (always constant).
- array_length - Returns the length of the array along a given dimension (useful for nested arrays).

Future work
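A hedged sketch of the semantics of the math functions listed above, written in pure Python for same-length inputs (the real implementations operate on flat child vectors, not Python lists):

```python
import math

def array_inner_product(a, b):
    assert len(a) == len(b)               # arrays must share the same fixed size
    return sum(x * y for x, y in zip(a, b))

def array_distance(a, b):
    # Euclidean distance between two equal-length arrays.
    assert len(a) == len(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def array_cosine_similarity(a, b):
    norm = math.sqrt(array_inner_product(a, a) * array_inner_product(b, b))
    return array_inner_product(a, b) / norm

def array_cross_product(a, b):
    assert len(a) == len(b) == 3          # only defined for length-3 arrays
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

print(array_inner_product([1, 2, 3], [4, 5, 6]))   # 32
print(array_cross_product([1, 0, 0], [0, 1, 0]))   # [0, 0, 1]
```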
ARRAY constructor and untangle array_ as alias for list

We've basically inherited the concept of "arrays" from Postgres and currently have a lot of aliases for list functions (and tests!) using the array_ prefix. We should probably remove/rename these to make a clear separation between lists and this new array type.

We also have the ARRAY[] syntax, which I think we should change to basically alias array_value so that it creates this new type instead of lists. As of now, the "cleanest" way to create an array literal is to do a cast, e.g. [1,2,3]::INT[3], but that is of course less efficient.

Specialize list functions for arrays
I've made it so that LIST and ARRAY are implicitly castable to each other if their child elements are implicitly castable, which lets us use arrays with all the existing list functions (although by introducing a manual cast in their bind hooks, read on below). This works, but is pretty inefficient: not just because we have to allocate a bunch of list_entry_t's, but also because I think a lot of list functions could be implemented more easily and efficiently when you know all the lists have the same length, which is always the case for arrays. This could be done incrementally over time though, so no big deal.

Example
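As a hedged illustration of that last point (hypothetical Python, not DuckDB code): a generic list kernel has to walk per-row offsets, while an array kernel can use one tight loop with a known stride, here shown for summing each row's elements:

```python
def list_sum(entries, child):
    # Generic LIST kernel: per-row (offset, length) indirection.
    return [sum(child[off:off + length]) for off, length in entries]

def array_sum(child, array_size):
    # ARRAY kernel: every row has the same length, so a fixed stride suffices
    # and no entry allocation is needed.
    n = len(child) // array_size
    return [sum(child[i * array_size:(i + 1) * array_size]) for i in range(n)]

print(list_sum([(0, 2), (2, 3)], [1, 2, 1, 2, 3]))  # [3, 6]
print(array_sum([1, 2, 3, 4, 5, 6], 3))             # [6, 15]
```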
Generic binding and casts

It is currently a bit awkward to write functions for the list type, as you usually want them to be somewhat polymorphic over the child type, and you end up doing a lot of binding logic and casting in explicit bind functions when implementing e.g. a scalar. For arrays this is even more problematic, since you usually also want your function to accept arrays not just of any type but also of any size. So far I've implemented the new functions for the array type as taking LogicalType::ANY and then performing all type checking manually in the bind hook. While LogicalType::ANY works to signal placeholder types, we would need something similar to signal "any number", basically like a non-type template parameter. I think this could be an opportunity to look into extending our type system to handle both the nested ANY type better and perhaps also support non-type parameter placeholders/generic parameters.

Note that I think this is primarily useful for docs/error messages; I don't mind too much if you have to actually enforce the types manually when binding (similar to how ANY works now, I guess).
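A sketch of the kind of manual bind-time check described above (all names here are hypothetical Python stand-ins; DuckDB's actual bind hooks are C++): the child type can hide behind an ANY-style placeholder, but the size must be checked by hand because there is no "any size" placeholder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArrayType:
    child: str      # child type name, e.g. "INTEGER"
    size: int       # fixed array size, e.g. 3

def bind_binary_array_function(left, right):
    # Manual checks a bind hook must perform today: both arguments are
    # arrays, and their fixed sizes match.
    if not isinstance(left, ArrayType) or not isinstance(right, ArrayType):
        raise TypeError("expected ARRAY arguments")
    if left.size != right.size:
        # An ANY-like placeholder covers the child type, but "any size"
        # has no placeholder, hence this explicit size check.
        raise TypeError(f"array sizes must match: {left.size} vs {right.size}")
    return ArrayType(left.child, left.size)

print(bind_binary_array_function(ArrayType("INTEGER", 3), ArrayType("INTEGER", 3)))
```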
Small array optimizations

An interesting idea is to allow "small" arrays of fixed-size types, e.g. INTEGER[4] or DOUBLE[2], to be stored in the parent vector's data as some sort of small blob/struct up to 16/32 bytes, avoiding having a child vector altogether. You would probably also need to store an additional validity mask for the array elements as an auxiliary vector buffer, though. But since this would change so much of the execution/storage, maybe it would make more sense to have this as another separate physical type (like a TUPLE or something)? idk.

Add ARRAY to all_types, client libraries/integrations

Notably, this type maps better to e.g. Arrow FixedSizeLists and other integrations like numpy arrays.
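The "small array" packing idea from the future-work notes above could look something like the following sketch (purely illustrative; the byte layout, the element validity mask, and the 16-byte budget are all assumptions, not anything implemented in this PR):

```python
import struct

# Hypothetical inline encoding for a small INTEGER[3] array: a one-byte
# element validity mask followed by three 4-byte little-endian integers,
# fitting in a 16-byte inline slot instead of a separate child vector.

def pack_int_array3(values):
    mask = 0
    ints = []
    for i, v in enumerate(values):
        if v is not None:
            mask |= 1 << i        # bit i set means element i is valid
        ints.append(0 if v is None else v)
    return struct.pack("<B3i", mask, *ints)   # 13 bytes, within a 16-byte slot

def unpack_int_array3(blob):
    mask, *ints = struct.unpack("<B3i", blob)
    return [v if mask & (1 << i) else None for i, v in enumerate(ints)]

blob = pack_int_array3([1, None, 3])
print(len(blob))                # 13
print(unpack_int_array3(blob))  # [1, None, 3]
```

This also shows why the PR text mentions needing an extra validity mask for the elements: once the child vector is gone, per-element nulls have to live somewhere inside (or alongside) the inline blob.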