Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map restructure #5768

Merged
merged 19 commits into from
Jan 2, 2023
Merged

Conversation

LindsayWray
Copy link

@LindsayWray LindsayWray commented Dec 22, 2022

Map Restructure

Currently MAP is internally stored as a struct containing two lists -> STRUCT{ [keys], [values] }
This restructure stores MAP as a list of structs -> LIST[ {key, val}, {key, val}, …] and allows map to be treated internally a list and therefore re-uses a lot of the list functionalities.

This is now in line with the Arrow structure:

Map data is nested data where each value is a variable number of key-item pairs. Its physical representation is the same as a list of {key, item} structs


The memory usage has diminished since now only one list_entry_t struct per map is needed instead of two.

Previous state (Two list_entry_t structs were required for the keys and values separately):
Screenshot 2022-12-21 at 20 01 28

This PR (One list_entry_t struct for the key/value structs):
Screenshot 2022-12-21 at 20 08 34

This alteration also impacted other map functions, namely map_from_entries. Since the input is already parsed as a list of key/value pair structs, 
this memory could be referenced as a whole instead of moving each value into a new block of memory. This does however change the former functionality where NULL elements are (silently) skipped instead of throwing an error when performing this query on multiple rows. See #3600, I’ve also discussed this with @Tishj. To accommodate this functionality, it would not be possible to re-use the memory, since it impacts the lengths and offsets. Or perhaps use a selection vector

So in this PR I’ve opted for referencing the allocated memory and throwing an error when encountering a NULL value (keys can not be NULL)

@Tishj
Copy link
Contributor

Tishj commented Dec 22, 2022

Sounds like a good change!

Regarding:

So in this PR I’ve opted for moving the memory and throwing an error when encountering a NULL value (keys can not be NULL)

Do you then also verify that keys don't contain duplicates?

@LindsayWray
Copy link
Author

Sounds like a good change!

Regarding:

So in this PR I’ve opted for moving the memory and throwing an error when encountering a NULL value (keys can not be NULL)

Do you then also verify that keys don't contain duplicates?

Yes! It goes through the MapConversionVerify function, that makes all those checks and returns the aproppiate error

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Great rework. Some minor comments, otherwise this looks good to me:

src/include/duckdb/common/types/vector.hpp Outdated Show resolved Hide resolved
src/storage/statistics/base_statistics.cpp Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants