exec: filtering aggregations #39241

jordanlewis · 2019-08-01T18:42:00Z

Add filtering aggregator support to the vectorized engine. This is an extra filter step that only includes rows in the aggregation if they pass a filter.

	// If set, this column index specifies a boolean argument; rows for which
	// this value is not true don't contribute to this aggregation. This enables
	// the filter clause, e.g.:
	//   SELECT SUM(x) FILTER (WHERE y > 1), SUM(x) FILTER (WHERE y < 1) FROM t
	FilterColIdx *uint32 `protobuf:"varint,4,opt,name=filter_col_idx,json=filterColIdx" json:"filter_col_idx,omitempty"`

Jira issue: CRDB-5586

The text was updated successfully, but these errors were encountered:

rohany · 2019-08-01T21:06:01Z

@jordanlewis whats your status on the case operator? I feel like the strategy for repeating the batch as done there to keep adding to a selection vector could be good here.

jordanlewis · 2019-08-01T21:23:43Z

Status is that it's kinda on hold til the FK stuff gets finished, and hoping to get some more reviews on it too. I saw your review (thanks!) but it's kinda complicated and I'd appreciate some more scrutiny :)

jordanlewis · 2019-08-01T21:24:18Z

I agree though that your proposed strategy could work. It actually might be as simple as planning a selection right before the aggregation step actually runs.

rohany · 2019-08-01T21:25:08Z

yeah, thats what I was thinking. if we just select out the rows that don't pass the filter then we might be good. No worries about my review -- i don't have much authority over the codebase yet.

yuzefovich · 2020-08-20T16:41:32Z

#50721 adds the support for filtering hash aggregation, but filtering ordered aggregation seems to be a lot harder (I think it might require plumbing the filtering logic right into the "heart" of each aggregate function).

One edge case is what happens when the whole group is filtered out? We still need to let aggregate function populate the output for such group, otherwise there will be a misalignment between any_not_null and all other functions. I think if we don't have any FILTER clauses, handling DISTINCT is relatively easy, but FILTER clause makes it extremely hard to not plumb the logic into each aggregate function.

It's been a while since I struggled with that (but I did spend like a day trying to get it working in all cases), and here is a snippet of what I had in the end. I can't easily describe why I came to realization that there is no way around plumbing the FILTERing logic into the aggregate function because I forgot and lost all the context, but at the time I was certain of it.

In the code snippet I was trying to modify the groups vector so that FILTERing would work in all cases (another case that comes to mind: how do you handle the scenario when the "head" of the group is filtered out? (the "head" is the first tuple in the group for which groups[i]==true)).

50721: colexec: add support for DISTINCT and FILTER hash aggregation r=yuzefovich a=yuzefovich This commit adds the support of DISTINCT and FILTERing hash aggregation. The approach is as follows: - to handle FILTER we run a selection operator on the input state - to handle DISTINCT we encode aggregation columns, one tuple at a time, and update the selection vector to include tuples we haven't yet seen - then we run the aggregation on the remaining selected tuples - and then restore the state with the original length and selection vector. Such handling of FILTER clause sounds good to me, but the handling of DISTINCT is somewhat unfortunate: we perform encoding one tuple at a time. Other approaches have been prototyped but showed worse performance: - using the vectorized hash table - the benefit of such approach is that we don't reduce ourselves to one tuple at a time (because we would be hashing the full columns at once), but the big disadvantage is that the full tuples are stored in the hash table (instead of an encoded representation) - using a single global map for a particular aggregate function that is shared among all aggregation groups - the benefit of such approach is that we only have a handful of map, but it turned out that such global map grows a lot bigger and has worse performance. Addresses: #39241. Addresses: #39242. Release note (sql change): Vectorized execution engine now natively supports DISTINCT and FILTERing hash aggregation. Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>

github-actions · 2023-09-20T11:08:00Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

jordanlewis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-vec SQL vectorized engine labels Aug 1, 2019

jordanlewis added this to Triage in [DEPRECATED] Old SQLExec board. Don't move stuff here via automation Aug 1, 2019

jordanlewis moved this from Triage to Backlog in [DEPRECATED] Old SQLExec board. Don't move stuff here Aug 1, 2019

asubiotto added this to Triage in BACKLOG, NO NEW ISSUES: SQL Execution via automation Oct 17, 2019

asubiotto removed this from Backlog in [DEPRECATED] Old SQLExec board. Don't move stuff here Oct 17, 2019

asubiotto moved this from Triage to Enhancements/Features in BACKLOG, NO NEW ISSUES: SQL Execution Oct 22, 2019

yuzefovich self-assigned this Jun 7, 2020

yuzefovich mentioned this issue Jun 26, 2020

colexec: add support for DISTINCT and FILTER hash aggregation #50721

Merged

yuzefovich removed their assignment Aug 28, 2020

jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021

github-actions bot added the no-issue-activity label Sep 20, 2023

yuzefovich removed the no-issue-activity label Sep 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

exec: filtering aggregations #39241

exec: filtering aggregations #39241

jordanlewis commented Aug 1, 2019 •

edited by cockroach-jira-scripts

rohany commented Aug 1, 2019

jordanlewis commented Aug 1, 2019

jordanlewis commented Aug 1, 2019

rohany commented Aug 1, 2019

yuzefovich commented Aug 20, 2020

github-actions bot commented Sep 20, 2023

exec: filtering aggregations #39241

exec: filtering aggregations #39241

Comments

jordanlewis commented Aug 1, 2019 • edited by cockroach-jira-scripts

rohany commented Aug 1, 2019

jordanlewis commented Aug 1, 2019

jordanlewis commented Aug 1, 2019

rohany commented Aug 1, 2019

yuzefovich commented Aug 20, 2020

github-actions bot commented Sep 20, 2023

jordanlewis commented Aug 1, 2019 •

edited by cockroach-jira-scripts