Grand Row Unification #1791

Closed
hawkfish opened this issue May 24, 2021 · 4 comments

@hawkfish
Contributor

hawkfish commented May 24, 2021

There are 2-3 different row representations in the code, used for aggregation, joins and sorting. It would be helpful architecturally if they could be combined in some fashion, both to avoid code duplication and to standardise the representations.

There is an existing discussion thread covering the initial ideas. A quick summary of the requirements so far:

  • Ability to handle both data and aggregates
  • Ability to handle either native or memcmp representations
  • Ability to spool to disk for external sorting and scalable joins and aggregation
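To make the "native or memcmp" requirement concrete, here is a minimal sketch of one common way to turn a native signed integer into a byte string whose `memcmp` order matches numeric order. The helper names `EncodeMemcmp` and `CompareEncoded` are hypothetical and are not taken from the codebase; the encoding shown (flip the sign bit, store big-endian) is a standard technique and only an assumption about what a unified row format might use.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: encode a signed 32-bit integer so that comparing the
// encoded bytes with memcmp gives the same order as comparing the integers.
std::array<uint8_t, 4> EncodeMemcmp(int32_t value) {
    // Flipping the sign bit maps INT32_MIN..INT32_MAX onto 0..UINT32_MAX,
    // so negative values sort before positive ones as unsigned numbers.
    const uint32_t bits = static_cast<uint32_t>(value) ^ 0x80000000u;
    // Store the most significant byte first, because memcmp compares bytes
    // from the lowest address upwards.
    return {{static_cast<uint8_t>(bits >> 24), static_cast<uint8_t>(bits >> 16),
             static_cast<uint8_t>(bits >> 8), static_cast<uint8_t>(bits)}};
}

// Compare two encoded keys using nothing but memcmp.
int CompareEncoded(const std::array<uint8_t, 4> &lhs, const std::array<uint8_t, 4> &rhs) {
    return std::memcmp(lhs.data(), rhs.data(), lhs.size());
}
```

The native form is cheaper to read back and aggregate on, while the encoded form lets a sort compare whole keys with a single `memcmp`, which is why a shared layout would need to describe both.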
@hawkfish
Contributor Author

One problem that just occurred to me is that we have no way of serialising/deserialising aggregate states. IIRC Postgres requires aggregate states to be compound types (e.g., STRUCTs?) which makes serialisation generic. To do this we would have to add two more aggregate methods for states.
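As a rough illustration of what the two extra aggregate methods could look like, the sketch below serialises and deserialises a state that owns heap memory. Everything here is hypothetical: `ListState`, `SerializeState` and `DeserializeState` are made-up names, and the byte format (a 64-bit count followed by the raw values, in host byte order) is just an assumption chosen to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical non-POD aggregate state: it owns heap memory, so copying the
// struct's bytes to disk would only save a pointer, not the values.
struct ListState {
    std::vector<int64_t> values;
};

// Write the state as [count][value 0][value 1]... in host byte order.
std::vector<uint8_t> SerializeState(const ListState &state) {
    const uint64_t count = state.values.size();
    std::vector<uint8_t> blob(sizeof(count) + count * sizeof(int64_t));
    std::memcpy(blob.data(), &count, sizeof(count));
    if (count > 0) {
        std::memcpy(blob.data() + sizeof(count), state.values.data(), count * sizeof(int64_t));
    }
    return blob;
}

// Rebuild a state from a blob produced by SerializeState.
ListState DeserializeState(const std::vector<uint8_t> &blob) {
    uint64_t count = 0;
    if (blob.size() < sizeof(count)) {
        throw std::invalid_argument("blob too small for a state header");
    }
    std::memcpy(&count, blob.data(), sizeof(count));
    // Reject blobs whose payload size does not match the stored count.
    const size_t payload = blob.size() - sizeof(count);
    if (payload % sizeof(int64_t) != 0 || payload / sizeof(int64_t) != count) {
        throw std::invalid_argument("blob size does not match state count");
    }
    ListState state;
    state.values.resize(static_cast<size_t>(count));
    if (count > 0) {
        std::memcpy(state.values.data(), blob.data() + sizeof(count), payload);
    }
    return state;
}
```

With a pair of methods like this per aggregate, the row code could spool states to disk without knowing anything about their internals.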

@Mytherin
Collaborator

We have discussed serialization/deserialization of aggregate states into BLOBs before for the purpose of enabling incremental computation of aggregates. Specifically the idea was to have some functionality for extracting an aggregate state as a BLOB on the SQL level, rather than the actual result of the aggregate, e.g. you could do this:

SELECT key, STATE(SUM(value))
FROM table
GROUP BY key

This would give you the aggregate state of a specific type ("sum_state") stored as a blob. We could then expose the other aggregate functions on this state, specifically combine and finalize. This allows you to perform incremental computation of aggregates. For example, you could create a table holding this state:

CREATE TABLE partial_state AS
SELECT key, STATE(SUM(value)) state
FROM 'file1.parquet'
GROUP BY key

Then when new information comes in, you can do an upsert to either insert new states (if key is not present) or combine (if the specific key is present):

INSERT INTO partial_state
SELECT key, STATE(SUM(value)) state
FROM 'file2.parquet'
GROUP BY key
ON CONFLICT (key) DO UPDATE SET state=COMBINE_STATE(state, excluded.state)
WHERE key=excluded.key;

At any point, you can retrieve the current aggregates by calling finalize on them:

SELECT key, STATE_FINALIZE(state)
FROM partial_state

Perhaps that idea could be semi-combined with this idea, if we are adding functions to serialize/deserialize states into binary objects.
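The update, combine and finalize flow described above can be sketched outside SQL as well. The example below uses AVG instead of SUM, because an average makes it obvious that the stored state (sum and count) differs from the final result. All names (`AvgState`, `Update`, `Combine`, `Finalize`, `Upsert`) are hypothetical and only mirror the roles of the proposed `STATE`, `COMBINE_STATE` and `STATE_FINALIZE` functions; this is a sketch under those assumptions, not the engine's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical partial state for AVG: enough information to keep merging.
struct AvgState {
    double sum = 0.0;
    uint64_t count = 0;
};

// Fold one input value into a state (what the aggregate does per row).
void Update(AvgState &state, double value) {
    state.sum += value;
    state.count++;
}

// Merge a source state into a target state (the role of COMBINE_STATE).
void Combine(const AvgState &source, AvgState &target) {
    target.sum += source.sum;
    target.count += source.count;
}

// Turn a state into the final result (the role of STATE_FINALIZE).
// An empty state yields 0.0 here for brevity; SQL AVG would yield NULL.
double Finalize(const AvgState &state) {
    return state.count == 0 ? 0.0 : state.sum / static_cast<double>(state.count);
}

using PartialStates = std::map<std::string, AvgState>;

// Upsert a batch of per-key states: operator[] creates an empty state for a
// new key, so combining into it behaves like an insert; for an existing key
// it behaves like ON CONFLICT ... DO UPDATE.
void Upsert(PartialStates &table, const PartialStates &batch) {
    for (const auto &entry : batch) {
        Combine(entry.second, table[entry.first]);
    }
}
```

Each incoming file would be aggregated into its own `PartialStates` batch and then merged with `Upsert`, and `Finalize` can be called at any point without disturbing the stored states.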

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Move row initialisation for aggregate hash tables
into layout-aware methods.
Remove assumption that aggregate states are POD.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Add data column for hashes.
Not used yet and a buffer manager test fails.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Use hash values stored in the data column
instead of the head of the row.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Break out RowLayout into a separate file.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Start RowOperations sub-library.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Move aggregate destruction to RowOperations.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Move aggregate finalize to RowOperations.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Add combine and finalize to RowOperations.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Convert PerfectAggregateHashTable to use
proper state management.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Only initialize states that will be used.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021
Pull aggregate alignment into the aggregation operations
so that there are no longer these strange state
dependencies and code replication.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 28, 2021
Mytherin added a commit that referenced this issue May 31, 2021
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021
Add RowLayout and use it to replace fixed offset computations.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021
Use RowLayout for data read operations.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021
Use RowLayout for scatter.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021
Factor out scatter code for reuse.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021
Move row gather operations to RowOperations.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021
Replace join internal scatter operations
with the one in RowOperations.
This makes sure that the validity mask
is initialised correctly and all columns are written.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021
Fix stash madness.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021
Unify Gather operations
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021
Fix indexing typo.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021
Factor out Match logic for reuse with joins.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021
Extend RowOperations::Match to handle arbitrary predicates.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021
Replace join comparisons with RowOperations::Match.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021
Add missing include.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021
Switch to using a RowDataCollection
instead of a StringHeap.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 9, 2021
Delegate nested row matching to the column matching logic.
Fix incorrect short-circuiting for IS (NOT) DISTINCT tests
(the values may not be valid if they are NULL).
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 9, 2021
Implement PositionComparator variants.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 9, 2021
Convert STRUCT hashing from Values
to recursive scalar evaluation.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 9, 2021
Convert LIST hashing from Values
to recursive scalar evaluation.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 10, 2021
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 10, 2021
Factor out common STRUCT code for reuse with LIST.
Compile LIST variant, but don't deploy.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 10, 2021
Switch over to recursive vectorised LIST comparisons.
Fix bug in Is(Not)Distinct code where matches were not
being selected.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 10, 2021
Back out unnecessary Value changes.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 11, 2021
Add tests for long strings and lists.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 11, 2021
Remove pointless STRUCT child munging.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 11, 2021
Implement nested Is (Not) Distinct comparators.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 11, 2021
Add tests to verify grouping by nested types.
Mytherin added a commit that referenced this issue Jun 12, 2021
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 12, 2021
Refactor to avoid name collisions.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 14, 2021
Fix confusing cancelling coding errors.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 14, 2021
Vectorised Combine for LIST.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 14, 2021
Implement (in)equality boolean comparison functions
for nested types
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 14, 2021
Expose hash table selection logic as comparison/distinction predicates.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 15, 2021
Replace duplicate code with calls to the original.
Mytherin added a commit that referenced this issue Jun 15, 2021
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 16, 2021
Fix LIST hashing and comparing to match the selection semantics
of the functions being leveraged.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 16, 2021
Fix STRUCT hashing and comparing to match the selection semantics
of the functions being leveraged.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 16, 2021
Add filtering by a constant inequality tests for STRUCT and LIST.
hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 16, 2021
Add tests for column and constant distinctions.
Mytherin added a commit that referenced this issue Jun 16, 2021
@github-actions

github-actions bot commented Aug 3, 2023

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Aug 3, 2023
@github-actions

github-actions bot commented Sep 3, 2023

This issue was closed because it has been stale for 30 days with no activity.

@github-actions github-actions bot closed this as not planned Sep 3, 2023