Grand Row Unification #1791

hawkfish · 2021-05-24T18:37:12Z

There are 2-3 different row representations in the code used for aggregation, joins and sorting, and it would be helpful architecturally if they could be combined in some fashion to avoid code duplication and standardise the representations.

There is an existing discussion covering the initial ideas. A quick summary of the requirements so far:

Ability to handle both data and aggregates
Ability to handle either native or memcmp representations
Ability to spool to disk for external sorting and scalable joins and aggregation
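To make the "native or memcmp" requirement concrete, here is a minimal Python sketch (not DuckDB's C++ code) of the two encodings for a signed 32-bit integer. The function names are hypothetical and chosen only for illustration. The memcmp form flips the sign bit and stores the value big-endian, so that plain bytewise comparison agrees with integer comparison; the native little-endian form does not have that property.

```python
import struct

def native_encode(value: int) -> bytes:
    """Native layout: little-endian signed 32-bit integer (hypothetical helper)."""
    return struct.pack("<i", value)

def memcmp_encode(value: int) -> bytes:
    """Order-preserving layout: flip the sign bit, then store big-endian,
    so bytewise comparison matches signed integer comparison."""
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)

def memcmp_decode(key: bytes) -> int:
    """Invert memcmp_encode and recover the signed value."""
    (raw,) = struct.unpack(">I", key)
    raw ^= 0x80000000
    return raw - (1 << 32) if raw >= (1 << 31) else raw

values = [3, -1, 256, 0, -70000, 2**31 - 1, -2**31]

# Sorting by the encoded bytes is what a memcmp-based sort would do.
by_memcmp = sorted(values, key=memcmp_encode)
by_native = sorted(values, key=native_encode)
```

Sorting on the memcmp keys reproduces the numeric order, whereas sorting on the native bytes does not, which is why a unified row format would need to support both forms.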


hawkfish · 2021-05-25T15:32:35Z

One problem that just occurred to me is that we have no way of serialising/deserialising aggregate states. IIRC Postgres requires aggregate states to be compound types (e.g., STRUCTs?), which makes serialisation generic. To support this we would have to add two more aggregate methods for states, one to serialise and one to deserialise.
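As a rough illustration of what two extra methods could look like, here is a Python sketch of an AVG-style aggregate with a serialise/deserialise pair alongside the usual lifecycle methods. The class, method names, and byte format are all assumptions made for this example; they are not DuckDB's aggregate function interface.

```python
import struct
from dataclasses import dataclass
from typing import Optional

@dataclass
class AvgState:
    total: float = 0.0
    count: int = 0

class AvgAggregate:
    """Hypothetical aggregate interface; names are illustrative only."""

    _FORMAT = "<dq"  # float64 running total, int64 row count

    @staticmethod
    def initialize() -> AvgState:
        return AvgState()

    @staticmethod
    def update(state: AvgState, value: float) -> None:
        state.total += value
        state.count += 1

    @staticmethod
    def combine(source: AvgState, target: AvgState) -> None:
        # Merge a partial state into another partial state.
        target.total += source.total
        target.count += source.count

    @staticmethod
    def finalize(state: AvgState) -> Optional[float]:
        return None if state.count == 0 else state.total / state.count

    # The two additional methods: state <-> opaque bytes.
    @staticmethod
    def serialize(state: AvgState) -> bytes:
        return struct.pack(AvgAggregate._FORMAT, state.total, state.count)

    @staticmethod
    def deserialize(blob: bytes) -> AvgState:
        total, count = struct.unpack(AvgAggregate._FORMAT, blob)
        return AvgState(total, count)

state = AvgAggregate.initialize()
for v in (1.0, 2.0, 3.0):
    AvgAggregate.update(state, v)

blob = AvgAggregate.serialize(state)        # e.g. written to a spill file
restored = AvgAggregate.deserialize(blob)   # read back later
```

With a pair like this, a state that is not plain data could still be spooled to disk and restored before `combine` or `finalize` is called.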

Mytherin · 2021-05-25T16:08:41Z

We have discussed serialization/deserialization of aggregate states into BLOBs before for the purpose of enabling incremental computation of aggregates. Specifically the idea was to have some functionality for extracting an aggregate state as a BLOB on the SQL level, rather than the actual result of the aggregate, e.g. you could do this:

SELECT key, STATE(SUM(value))
FROM table
GROUP BY key

This would give you the aggregate state of a specific type ("sum_state") stored as a blob. We could then expose the other aggregate functions on this state, specifically combine and finalize. This allows you to perform incremental computation of aggregates. For example, you could create a table holding this state:

CREATE TABLE partial_state AS
SELECT key, STATE(SUM(value)) state
FROM 'file1.parquet'
GROUP BY key

Then when new information comes in, you can do an upsert to either insert new states (if key is not present) or combine (if the specific key is present):

INSERT INTO partial_state
SELECT key, STATE(SUM(value)) state
FROM 'file2.parquet'
GROUP BY key
ON CONFLICT (key) DO UPDATE SET state=COMBINE_STATE(state, excluded.state)
WHERE key=excluded.key;

At any point, you can retrieve the current aggregates by calling finalize on them:

SELECT key, STATE_FINALIZE(state)
FROM partial_state

Perhaps that idea could be partly combined with this one, if we are adding functions to serialize/deserialize states into binary objects.
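The incremental flow described above can be simulated outside the database. The following Python sketch uses a dictionary in place of the `partial_state` table and small helper functions as stand-ins for the proposed `STATE(SUM(...))`, `COMBINE_STATE` and `STATE_FINALIZE`. The helpers, the 8-byte blob format, and the sample rows are assumptions for illustration, not an existing API.

```python
import struct
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def sum_state(values: Iterable[int]) -> bytes:
    """Stand-in for STATE(SUM(value)): the partial sum as an 8-byte blob."""
    return struct.pack("<q", sum(values))

def combine_state(a: bytes, b: bytes) -> bytes:
    """Stand-in for COMBINE_STATE: merge two partial sums."""
    (x,) = struct.unpack("<q", a)
    (y,) = struct.unpack("<q", b)
    return struct.pack("<q", x + y)

def state_finalize(state: bytes) -> int:
    """Stand-in for STATE_FINALIZE: turn a state into the aggregate result."""
    return struct.unpack("<q", state)[0]

def group_states(rows: Iterable[Tuple[str, int]]) -> Dict[str, bytes]:
    """Equivalent of SELECT key, STATE(SUM(value)) ... GROUP BY key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: sum_state(vals) for key, vals in groups.items()}

def upsert(table: Dict[str, bytes], new_states: Dict[str, bytes]) -> None:
    """Insert new keys; combine states for keys that already exist."""
    for key, state in new_states.items():
        if key in table:
            table[key] = combine_state(table[key], state)
        else:
            table[key] = state

file1 = [("a", 1), ("b", 10), ("a", 2)]   # made-up rows for the first batch
file2 = [("a", 4), ("c", 7)]              # made-up rows for the second batch

partial_state = group_states(file1)
upsert(partial_state, group_states(file2))
result = {key: state_finalize(s) for key, s in partial_state.items()}
```

The result matches a single `SUM ... GROUP BY` over both batches, without re-reading the first batch.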

Delegate nested row matching to the column matching logic. Fix incorrect short-circuiting for IS (NOT) DISTINCT tests (the values may not be valid if they are NULL).
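The short-circuiting issue mentioned here can be shown with a small Python sketch. It assumes a columnar layout where each value has a separate validity flag and the stored value of a NULL entry is arbitrary; the function names are hypothetical. The validity flags have to be consulted before the values, otherwise leftover bytes in a NULL slot can produce a wrong match.

```python
from typing import Any, List

def is_distinct_from(left: Any, left_valid: bool,
                     right: Any, right_valid: bool) -> bool:
    """IS DISTINCT FROM with explicit validity flags (illustrative helper)."""
    # Check validity first: the value of a NULL entry is undefined.
    if not left_valid or not right_valid:
        # NULL vs NULL is not distinct; NULL vs a value is distinct.
        return left_valid != right_valid
    return left != right

def select_not_distinct(lefts: List[Any], left_valid: List[bool],
                        rights: List[Any], right_valid: List[bool]) -> List[int]:
    """Return the row indices where left IS NOT DISTINCT FROM right."""
    return [
        i for i in range(len(lefts))
        if not is_distinct_from(lefts[i], left_valid[i], rights[i], right_valid[i])
    ]

# Row 1: left is NULL but its slot happens to hold the same bytes as right.
# Row 3: both are NULL but their slots hold different leftover values.
lefts       = [1, 99, 3, 0]
left_valid  = [True, False, True, False]
rights      = [1, 99, 4, 55]
right_valid = [True, True, True, False]

matches = select_not_distinct(lefts, left_valid, rights, right_valid)
```

Comparing the raw values before the flags would wrongly select row 1 and wrongly reject row 3.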

github-actions · 2023-08-03T00:33:06Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions · 2023-09-03T00:31:05Z

This issue was closed because it has been stale for 30 days with no activity.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

15842b2

Move row initialisation for aggregate hash tables into layout-aware methods. Remove assumption that aggregate states are POD.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

34c1235

Add data column for hashes. Not used yet and a buffer manager test fails.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

7f1d07d

Use hash values stored in the data column instead of the head of the row.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

2fa5ad1

Break out RowLayout into a separate file.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

d3ae22b

Start RowOperations sub-library.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

94a5a54

Move aggregate destruction to RowOperations.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

36fa4f3

Move aggregate finalize to RowOperations.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

2fed906

Add combine and finalize to RowOperations.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

24cd652

Convert PerfectAggregateHashTable to use proper state management.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

1310af6

Only initialize states that will be used.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 27, 2021

Issue duckdb#1791: Aggregate Row Layout

3dee4f7

Pull aggregate alignment into the aggregation operations so that there are no longer these strange state dependencies and code replication.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue May 28, 2021

Issue duckdb#1791: Aggregate Row Layout

d8170f3

Tidy and format.

Mytherin added a commit that referenced this issue May 31, 2021

Merge pull request #1813 from hawkfish/hawkfish-row-layout

cfec759

Issue #1791: Aggregate Row Layout

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021

Issue duckdb#1791: Join Row Layout

b9fb06c

Add RowLayout and use it to replace fixed offset computations.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021

Issue duckdb#1791: Join Row Layout

8915334

Use RowLayout for data read operations.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021

Issue duckdb#1791: Join Row Layout

c01ea51

Use RowLayout for scatter.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021

Issue duckdb#1791: Aggregate Row Layout

d9a933e

Factor out scatter code for reuse.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021

Issue duckdb#1791: Join Row Layout

0ca10a3

Move row gather operations to RowOperations.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021

Issue duckdb#1791: Join Row Layout

6329296

Replace join internal scatter operations with the one in RowOperations. This makes sure that the validity mask is initialised correctly and all columns are written.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021

Issue duckdb#1791: Join Row Layout

9ceeb88

Fix stash madness.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 1, 2021

Issue duckdb#1791: Join Row Layout

407182f

Unify Gather operations

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021

Issue duckdb#1791: Join Row Layout

bad0d82

Tidy includes.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021

Issue duckdb#1791: Join Row Layout

72ffda8

Fix indexing typo.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021

Issue duckdb#1791: Join Row Layout

d13b568

Factor out Match logic for reuse with joins.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021

Issue duckdb#1791: Join Row Layout

b26c85e

Extend RowOperations::Match to handle arbitrary predicates.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021

Issue duckdb#1791: Join Row Layout

68b4881

Replace join comparisons with RowOperations::Match.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021

Issue duckdb#1791: Join Row Layout

1e5cdb5

Add missing include.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 2, 2021

Issue duckdb#1791: Hashtable nested types

6018452

Switch to using a RowDataCollection instead of a StringHeap.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 9, 2021

Issue duckdb#1791: Nested row grouping

35d8de6

Implement PositionComparator variants.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 9, 2021

Issue duckdb#1791: Nested row grouping

c9e8c73

Convert STRUCT hashing from Values to recursive scalar evaluation.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 9, 2021

Issue duckdb#1791: Nested row grouping

cc467b8

Convert LIST hashing from Values to recursive scalar evaluation.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 10, 2021

Issue duckdb#1791: Nested row grouping

8760779

Fix selection typo.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 10, 2021

Issue duckdb#1791: Nested row grouping

4b51ce3

Factor out common STRUCT code for reuse with LIST. Compile LIST variant, but don't deploy.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 10, 2021

Issue duckdb#1791: Nested row grouping

a2e43b1

Switch over to recursive vectorised LIST comparisons. Fix bug in Is(Not)Distinct code where matches were not being selected.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 10, 2021

Issue duckdb#1791: Nested row joins

924d21e

Back out unnecessary Value changes.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 11, 2021

Issue duckdb#1791: Nested join payloads

250c176

Add tests for long strings and lists.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 11, 2021

Issue duckdb#1791: Nested row grouping

b674452

Remove pointless STRUCT child munging.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 11, 2021

Issue duckdb#1791: Nested row joins

6d12f34

Implement nested Is (Not) Distinct comparators.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 11, 2021

Issue duckdb#1791: Nested row aggregates

fe1dee8

Add tests to verify grouping by nested types.

Mytherin added a commit that referenced this issue Jun 12, 2021

Merge pull request #1845 from hawkfish/hawkfish-row-nested

6f24d93

#1791: Nested join payloads

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 12, 2021

Issue duckdb#1791: Nested row joins

906c519

Refactor to avoid name collisions.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 14, 2021

Issue duckdb#1791: Nested row joins

98e3d42

Fix confusing cancelling coding errors.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 14, 2021

Issue duckdb#1791: Nested hash table entries

bcf49f7

Vectorised Combine for LIST.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 14, 2021

Issue duckdb#1791: Nested row comparisons

d5d2194

Implement (in)equality boolean comparison functions for nested types

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 14, 2021

Issue duckdb#1791: Nested row predicates

1bcbfdc

Expose hash table selection logic as comparison/distinction predicates.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 15, 2021

Issue duckdb#1791: Nested row predicates

5ba45af

Replace duplicate code with calls to the original.

Mytherin added a commit that referenced this issue Jun 15, 2021

Merge pull request #1863 from hawkfish/hawkfish-nested-keys

82f9bf1

#1791: Nested hash table entries

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 16, 2021

Issue duckdb#1791: Nested row grouping

8243b45

Fix LIST hashing and comparing to match the selection semantics of the functions being leveraged.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 16, 2021

Issue duckdb#1791: Nested row grouping

aeec836

Fix STRUCT hashing and comparing to match the selection semantics of the functions being leveraged.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 16, 2021

Issue duckdb#1791: Nested row comparisons

8e10a9c

Add filtering by a constant inequality tests for STRUCT and LIST.

hawkfish pushed a commit to hawkfish/duckdb that referenced this issue Jun 16, 2021

Issue duckdb#1791: Nested row predicates

a4ed72f

Add tests for column and constant distinctions.

Mytherin added a commit that referenced this issue Jun 16, 2021

Merge pull request #1869 from hawkfish/hawkfish-nested-keys

f12c225

Issue #1791: Nested type predicates

hawkfish mentioned this issue Jun 16, 2021

Comparison semantics for nested types #1872

Closed

github-actions bot added the stale label Aug 3, 2023

github-actions bot closed this as not planned Sep 3, 2023
