-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Map restructure #5768
Map restructure #5768
Conversation
Sounds like a good change! Regarding:
Do you then also verify that keys don't contain duplicates? |
Yes! It goes through the MapConversionVerify function, that makes all those checks and returns the aproppiate error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Great rework. Some minor comments, otherwise this looks good to me:
Map Restructure
Currently MAP is internally stored as a struct containing two lists -> STRUCT{ [keys], [values] }
This restructure stores MAP as a list of structs -> LIST[ {key, val}, {key, val}, …] and allows map to be treated internally a list and therefore re-uses a lot of the list functionalities.
This is now in line with the Arrow structure:
The memory usage has diminished since now only one list_entry_t struct per map is needed instead of two.
Previous state (Two list_entry_t structs were required for the keys and values separately):
![Screenshot 2022-12-21 at 20 01 28](https://user-images.githubusercontent.com/69161963/209162728-d0b22d12-7c7b-42d4-a197-ed10c8758e1d.png)
This PR (One list_entry_t struct for the key/value structs):
![Screenshot 2022-12-21 at 20 08 34](https://user-images.githubusercontent.com/69161963/209162575-8441b28c-4e9c-445d-8480-b327991941cc.png)
This alteration also impacted other map functions, namely map_from_entries. Since the input is already parsed as a list of key/value pair structs, this memory could be referenced as a whole instead of moving each value into a new block of memory. This does however change the former functionality where NULL elements are (silently) skipped instead of throwing an error when performing this query on multiple rows. See #3600, I’ve also discussed this with @Tishj. To accommodate this functionality, it would not be possible to re-use the memory, since it impacts the lengths and offsets. Or perhaps use a selection vector
So in this PR I’ve opted for referencing the allocated memory and throwing an error when encountering a NULL value (keys can not be NULL)