Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregation function that respects the order of data #62777

Closed
Alex-Cheng opened this issue Apr 19, 2024 · 8 comments
Closed

Aggregation function that respects the order of data #62777

Alex-Cheng opened this issue Apr 19, 2024 · 8 comments

Comments

@Alex-Cheng
Copy link
Contributor

(you don't have to strictly follow this form)

Use case

There is a task that consume and aggregate a bunch of rows in order, i.e. given rows R = {r1,r2,...,r_n} and a stateful function(similar to aggretion) a(x), I need to invoke a(r1), a(r2), ..., a(r_n) and then populate a final aggregation result. The result is impacted by the order of invocations. For getting a stable result, I have to make sure the invocations of a(r1), a(r2), ..., a(r_n) are strictly in order.

Describe the solution you'd like

I may need a setting to force AggregationTransform working on an strictly ordered data. Actually I had achieved this by the SQL pattern like:

select functionA(x) from (select `x` from `R_table` order by `x`) settings max_threads=1;

However, the solution does no longer work in the latest ClickHouse.

@canhld94
Copy link
Contributor

This can be done in several ways.

  1. groupArray -> arraySort -> arrayReduce
SELECT arrayReduce('functionA', arraySort(groupArray(x))) FROM R_table; 
  1. materialized CTE Materialized CTE #61086
WITH t AS MATERIALIZED (SELECT x FROM R_table ORDER BY x) SELECT functionA(x) FROM t;

@den-crane
Copy link
Contributor

den-crane commented Apr 19, 2024

@Alex-Cheng which aggregate function? https://fiddle.clickhouse.com/88a7eba5-01f5-41d9-9577-4865ca007def

Modern optimizer removes excessive order by if result of a query does not depend on it.
You may use limit to overcome optimizer.

select functionA(x) from (select `x` from `R_table` order by `x` 
   limit toUInt64(-1)   ------<<<<<<<<---
) settings max_threads=1

@Algunenano
Copy link
Member

CH is not designed to do this, and you will face multiple problems and changes in the future because of this:

  • CH is multithreaded by design, without any order guarantees. You can use max_threads=1
  • CH does not guarantee in which order it will read the data.
  • CH does not guarantee the size of the batches. It's usually max_block_size, but it's also not guaranteed.
  • CH also does not guarantee in which order the aggregation states will be merged.

You can try the suggested workarounds, but they might stop working at any time.

The only similar thing to what you are describing are window functions.

@Alex-Cheng
Copy link
Contributor Author

Alex-Cheng commented Apr 25, 2024

CH is not designed to do this, and you will face multiple problems and changes in the future because of this:

  • CH is multithreaded by design, without any order guarantees. You can use max_threads=1
  • CH does not guarantee in which order it will read the data.
  • CH does not guarantee the size of the batches. It's usually max_block_size, but it's also not guaranteed.
  • CH also does not guarantee in which order the aggregation states will be merged.

You can try the suggested workarounds, but they might stop working at any time.

The only similar thing to what you are describing are window functions.

Yes, the approaches you suggests were used for resolving the problem, e.g. max_threads=1 and max_block_size=<very big number>. However as same as you mentioned, it stops working after upgrading to 24.3. I understood that the aggregation functions should not depend on data order. However my aggregation algorithm requires data odering for my special requirement. I am wondering if there is same requirement that aggregation depends on data ordering?

Is my requirement too special?

Maybe I should transform the aggregation function into a regular function that accepts Array as inputs, just as mentioned by @canhld94 .

@Alex-Cheng
Copy link
Contributor Author

Alex-Cheng commented Apr 25, 2024

The approach suggested by @den-crane (thank you, it is very skillful workaround) is working and the solution requires the least workload. It could be a temporary solution for a while, and finally I need to think out of an alternative algorithm that does not depend on the order of input data.

@den-crane
Copy link
Contributor

den-crane commented Apr 25, 2024

@Alex-Cheng check the code of groupArray.
I think there is (or was) some trait which allows to mark a function as dependent of data order.
Earlier groupArray 100% respected the order of input blocks. And even now it's mentioned in the doc: https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/grouparray
I think this trait forbids the optimizer to change a plan (remove orderbys) or something like this.

@UnamedRus
Copy link
Contributor

Earlier groupArray 100% respected the order of input blocks.

With important exception of GROUP BY in external memory AFAIK

@Alex-Cheng
Copy link
Contributor Author

@Alex-Cheng check the code of groupArray. I think there is (or was) some trait which allows to mark a function as dependent of data order. Earlier groupArray 100% respected the order of input blocks. And even now it's mentioned in the doc: https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/grouparray I think this trait forbids the optimizer to change a plan (remove orderbys) or something like this.

The document says, "Creates an array of argument values. Values can be added to the array in any (indeterminate) order." And it also explains some cases where can still rely on the order of execution, but it requires "the subquery result is small enough."

In some cases, you can still rely on the order of execution. This applies to cases when SELECT comes from a subquery that uses ORDER BY if the subquery result is small enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants