Grand Row Unification #1791
Comments
One problem that just occurred to me is that we have no way of serialising/deserialising aggregate states. IIRC Postgres requires aggregate states to be compound types (e.g., |
We have discussed serialization/deserialization of aggregate states into BLOBs before for the purpose of enabling incremental computation of aggregates. Specifically, the idea was to have some functionality for extracting an aggregate state as a BLOB on the SQL level, rather than the actual result of the aggregate, e.g. you could do this:

    SELECT key, STATE(SUM(value))
    FROM table
    GROUP BY key

This would give you the aggregate state of a specific type ("sum_state") stored as a blob. We could then expose the other aggregate functions on this state, specifically:

    CREATE TABLE partial_state AS
    SELECT key, STATE(SUM(value)) state
    FROM 'file1.parquet'
    GROUP BY key

Then when new information comes in, you can do an upsert to either insert new states (if the key is not present) or combine (if the specific key is present):

    INSERT INTO partial_state
    SELECT key, STATE(SUM(value)) state
    FROM 'file2.parquet'
    GROUP BY key
    ON CONFLICT (key) DO UPDATE SET state=COMBINE_STATE(state, excluded.state)
    WHERE key=excluded.key;

At any point, you can retrieve the current aggregates by calling finalize on them:

    SELECT key, STATE_FINALIZE(state)
    FROM partial_state

Perhaps that idea could be semi-combined with this idea, if we are adding functions to serialize/deserialize states into binary objects.
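
To make the state/combine/finalize flow above concrete, here is a minimal sketch in Python rather than engine code. Everything in it is an assumption for illustration: the 9-byte blob layout (a "seen a value" flag plus a 64-bit total), the function names `state_sum`, `combine_state` and `state_finalize` (chosen to mirror the SQL above), and the dict standing in for the `partial_state` table.

```python
import struct
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical blob layout for a SUM state: a "seen any value" flag plus a
# 64-bit signed total, little-endian, no padding (9 bytes in total).
SUM_STATE = struct.Struct("<?q")


def state_sum(values: Iterable[Optional[int]]) -> bytes:
    """Build a serialised SUM state, skipping NULLs (None)."""
    seen, total = False, 0
    for v in values:
        if v is not None:
            seen, total = True, total + v
    return SUM_STATE.pack(seen, total)


def combine_state(left: bytes, right: bytes) -> bytes:
    """Merge two serialised states without looking at the original rows."""
    l_seen, l_total = SUM_STATE.unpack(left)
    r_seen, r_total = SUM_STATE.unpack(right)
    return SUM_STATE.pack(l_seen or r_seen, l_total + r_total)


def state_finalize(state: bytes) -> Optional[int]:
    """Turn a state into the aggregate result; an empty state yields NULL."""
    seen, total = SUM_STATE.unpack(state)
    return total if seen else None


def upsert(table: Dict[str, bytes], batch: Iterable[Tuple[str, bytes]]) -> None:
    """Dict-based stand-in for INSERT ... ON CONFLICT DO UPDATE."""
    for key, state in batch:
        table[key] = combine_state(table[key], state) if key in table else state
```

Loading a first batch with `upsert(table, [("a", state_sum([1, 2]))])` and later a second batch for the same key combines the two blobs, and `state_finalize(table["a"])` returns the sum over both batches. The flag is there so that a state that has only seen NULLs still finalizes to NULL.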
Move row initialisation for aggregate hash tables into layout-aware methods. Remove assumption that aggregate states are POD.
Add data column for hashes. Not used yet and a buffer manager test fails.
Use hash values stored in the data column instead of the head of the row.
Break out RowLayout into a separate file.
Start RowOperations sub-library.
Move aggregate destruction to RowOperations.
Move aggregate finalize to RowOperations.
Add combine and finalize to RowOperations.
Convert PerfectAggregateHashTable to use proper state management.
Only initialize states that will be used.
Pull aggregate alignment into the aggregation operations so that there are no longer these strange state dependencies and code replication.
Tidy and format.
Add RowLayout and use it to replace fixed offset computations.
Use RowLayout for data read operations.
Use RowLayout for scatter.
Factor out scatter code for reuse.
Move row gather operations to RowOperations.
Replace join internal scatter operations with the one in RowOperations. This makes sure that the validity mask is initialised correctly and all columns are written.
Fix stash madness.
Unify Gather operations
Fix indexing typo.
Factor out Match logic for reuse with joins.
Extend RowOperations::Match to handle arbitrary predicates.
Replace join comparisons with RowOperations::Match.
Add missing include.
Switch to using a RowDataCollection instead of a StringHeap.
Delegate nested row matching to the column matching logic. Fix incorrect short-circuiting for IS (NOT) DISTINCT tests (the values may not be valid if they are NULL).
Implement PositionComparator variants.
Convert STRUCT hashing from Values to recursive scalar evaluation.
Convert LIST hashing from Values to recursive scalar evaluation.
Fix selection typo.
Factor out common STRUCT code for reuse with LIST. Compile LIST variant, but don't deploy.
Switch over to recursive vectorised LIST comparisons. Fix bug in Is(Not)Distinct code where matches were not being selected.
Back out unnecessary Value changes.
Add tests for long strings and lists.
Remove pointless STRUCT child munging.
Implement nested Is (Not) Distinct comparators.
Add tests to verify grouping by nested types.
Refactor to avoid name collisions.
Fix confusing cancelling coding errors.
Vectorised Combine for LIST.
Implement (in)equality boolean comparison functions for nested types
Expose hash table selection logic as comparison/distinction predicates.
Replace duplicate code with calls to the original.
Fix LIST hashing and comparing to match the selection semantics of the functions being leveraged.
Fix STRUCT hashing and comparing to match the selection semantics of the functions being leveraged.
Add filtering by a constant inequality tests for STRUCT and LIST.
Add tests for column and constant distinctions.
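
Several of the commits above replace fixed offset computations with a RowLayout and route scatter/gather through shared code. As a rough illustration of that idea (a Python sketch, not the actual C++ classes), the block below assumes a row made of validity bytes followed by fixed-width columns. The class and function signatures, the validity-first ordering and the use of `struct` format codes for column types are all assumptions; hash columns and aggregate states are left out.

```python
import struct
from typing import List, Optional, Sequence


class RowLayout:
    """Hypothetical fixed-width row: [validity bytes][col 0][col 1]..."""

    def __init__(self, formats: Sequence[str]):
        # One struct codec per column, little-endian, no implicit padding.
        self.codecs = [struct.Struct("<" + f) for f in formats]
        # One validity bit per column, rounded up to whole bytes.
        self.validity_bytes = (len(formats) + 7) // 8
        self.offsets: List[int] = []
        offset = self.validity_bytes
        for codec in self.codecs:
            self.offsets.append(offset)
            offset += codec.size
        self.row_width = offset


def scatter(layout: RowLayout, rows: Sequence[Sequence[Optional[object]]]) -> bytearray:
    """Write rows into one buffer; every column offset comes from the layout."""
    buf = bytearray(layout.row_width * len(rows))  # zeroed: all columns start NULL
    for r, row in enumerate(rows):
        base = r * layout.row_width
        for c, value in enumerate(row):
            if value is None:
                continue  # validity bit stays 0
            buf[base + c // 8] |= 1 << (c % 8)
            layout.codecs[c].pack_into(buf, base + layout.offsets[c], value)
    return buf


def gather(layout: RowLayout, buf: bytearray, col: int) -> List[Optional[object]]:
    """Read one column back out of the row buffer, honouring validity bits."""
    out: List[Optional[object]] = []
    for base in range(0, len(buf), layout.row_width):
        if buf[base + col // 8] & (1 << (col % 8)):
            out.append(layout.codecs[col].unpack_from(buf, base + layout.offsets[col])[0])
        else:
            out.append(None)
    return out
```

With `RowLayout(["i", "q", "d"])` the offsets come out as 1, 5 and 13 with a row width of 21 bytes. Since both `scatter` and `gather` only consult the layout, aggregation, join and sort code could in principle share them instead of each keeping its own offset arithmetic.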
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been stale for 30 days with no activity.
There are 2-3 different row representations in the code, used for aggregation, joins and sorting, and it would be helpful architecturally if they could be combined in some fashion to avoid code duplication and standardise the representations.
There is an existing discussion covering the initial ideas. A quick summary of the requirements so far:
memcmp
representations
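
On the memcmp point, one common way to make row keys comparable byte-wise is to store integers big-endian with the sign bit flipped and to prefix each value with a NULL marker byte. The sketch below shows that technique in Python, where comparing `bytes` objects orders them the same way `memcmp` would. The function names, the one-byte NULL marker and the restriction to 32-bit integers are assumptions for illustration, not a description of the encoding used in the codebase.

```python
import struct
from typing import Optional


def encode_int32(value: Optional[int], nulls_first: bool = True) -> bytes:
    """Encode a nullable int32 so that byte-wise order equals value order."""
    null_byte, valid_byte = (b"\x00", b"\x01") if nulls_first else (b"\x01", b"\x00")
    if value is None:
        return null_byte + b"\x00" * 4
    # Big-endian puts the most significant byte first; flipping the sign bit
    # makes negative values sort before non-negative ones as unsigned bytes.
    return valid_byte + struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def encode_key(*values: Optional[int]) -> bytes:
    """Concatenate encoded columns into a single multi-column sort key."""
    return b"".join(encode_int32(v) for v in values)
```

Sorting rows by `encode_key(a, b)` then gives the same order as sorting by `(a, b)` with NULLs first, which is what lets a sort or comparison run over raw row bytes without interpreting column types.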