[Python] Guarantee that group_by has stable ordering.
#36709
Comments
Test fix to not rely on ordering: apache/arrow#36709. Distinct aggregation supports aliasing.
Another effect of having keys/batches unordered is that |
The groupby feature is part of the Acero query execution engine (https://arrow.apache.org/docs/dev/cpp/streaming_execution.html), and in general Acero doesn't guarantee stable ordering of batches that are executed. Now, I can reproduce the difference that this was indeed stable in 12.0, while not always stable in 13.0. I am not sure if something changed in the 13.0 release cycle that might have caused this, but I think in general the new behaviour is what can be expected cc @westonpace |
Actually, looking at what changed in the groupby implementation the last months, I suppose my clean-up PR #34769 will have caused this. Before that, pyarrow's That should have been fully equivalent, but now I see that the So that will probably explain the difference in behaviour: in 12.0, the group_by method was not yet running in parallel, while now it is. The question is still whether we are fine with this change. We actually do have some (hash) aggregations that do depend on the input being ordered (e.g. first/last), but I don't think there is a way to "force" doing the calculation ordered for other aggregations (like We should probably at least expose |
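To illustrate why running the aggregation in parallel can change the key order without changing the aggregated values, here is a toy model in plain Python. It is only a sketch and not Acero's implementation: the helper names `partial_sums` and `merge` are hypothetical, and the "threads" are simulated by merging per-batch results in two different arrival orders.

```python
# Toy model only: this is NOT Acero's implementation, just an illustration
# of why merging per-batch partial results can reorder the output keys.

def partial_sums(batch):
    """Aggregate one batch into a dict of key -> sum (insertion-ordered)."""
    sums = {}
    for key, value in batch:
        sums[key] = sums.get(key, 0) + value
    return sums


def merge(partials):
    """Merge partial results in the order they happen to arrive."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged


batches = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("a", 4)],
]
partials = [partial_sums(batch) for batch in batches]

in_order = merge(partials)              # batches finish in input order
reordered = merge(reversed(partials))   # second batch finishes first

print(list(in_order.items()))
print(list(reordered.items()))
```

Both results contain the same sums per key, but the keys come out in a different order depending on which batch was merged first.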
+1 for exposing There are two different things here and I'm not sure which we are discussing:
Stable ordering of the keys is tricky since we are essentially using an "unordered hashmap" for our grouping. Changes in the order data arrives, or even just changes in the amount of data could, in theory, change the order of the resulting keys. Perhaps the easiest thing to do for now is to just sort the results by key. At some point we could investigate an ordered hashmap if desired. In classic SQL one generally needs to add an order-by clause to the end of the query to order by the keys (and the underlying implementation may just add a sort node or it may do something more clever with the groupby).
Stable ordering of the values is a bit trickier, especially in parallel. There are only a few aggregate functions which depend on this. In postgres there is special syntax for dealing with this case. Note, this is also something that can be expressed in Substrait.
Acero doesn't yet have the components that would be needed to do this. I think, at a minimum, you would want a way to force the aggregate "consume" operation to be serialized (with some kind of sequencing queue). Then you could use a regular order-by followed by the group-by node and get predictable results. Since you're paying the cost of ordering you could probably order first by the grouping keys, and then by the measure column. Then you could use a streaming group-by operator. Then, of course, there is a whole different can of worms about how to effectively wrap this all up in pyarrow :) |
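The order-by followed by a streaming group-by idea can be sketched in plain Python. This is only a conceptual model, not an Acero operator: the rows, the function name `streaming_first_last`, and the choice of a first/last aggregate are all assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

rows = [("b", 3), ("a", 5), ("b", 1), ("a", 2), ("c", 4)]

# Order by the grouping key first, then by the measure column.
ordered = sorted(rows, key=itemgetter(0, 1))


def streaming_first_last(sorted_rows):
    """Yield (key, first, last) per group from rows already sorted by key.

    Because equal keys are adjacent, each group can be emitted as soon as
    the key changes, without keeping a hash table of all groups.
    """
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        first = last = next(group)[1]
        for _, value in group:
            last = value
        yield key, first, last


result = list(streaming_first_last(ordered))
print(result)
```

Since the input is consumed serially and in sorted order, both the key order and the first/last values are predictable, at the cost of the up-front sort.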
With the current unordered hashmap implementation, would the same keys from the same group end up next to each other? E.g.: select col, sum(x) from tbl group by 1
This sounds pretty easy to fix by just slicing / memcpy-ing the columns. |
No. If the columns are keys then there will only be one output row for each combination of keys. So there is no concept of "keys from the same group".
This is odd syntax to me. However, it looks to be the same as
|
👍 >>> table.group_by('key').aggregate([('value', 'first')])
...
ArrowNotImplementedError: Using ordered aggregator in multiple threaded execution is not supported |
Some more data points:
The latter implies the issue is parallel execution, not an unordered hash map. I'm not clear on why grouping can't use ordered execution and have both features.
|
To be clear, I fully support allowing for both ordered keys and ordered values within keys. I'm just explaining why the behavior is what it is today. I didn't expect things to be quite as predictable as you found them. However, I can confirm that I get the same results you do. I wouldn't recommend putting too much faith in the way things happen to work now. The general expectation for group_by, in C++, is that it does not maintain order. There is no regression test in place to verify whatever current ordering behavior you are seeing, and it could very well change. If we want to explicitly maintain key and/or value order then I think those need to be features in their own right. |
Replaces the explicit batch iteration, with support for all aggregates. The fragment optimization is still significant enough to retain. Natively supports aliases and `first_last`; `use_threads=False` fixes the batch ordering regression with no noticeable performance difference: apache/arrow#36709.
…by to have stable ordering (#36768)
### Rationale for this change
Add a `use_threads` keyword to the `group_by` method on Table and pass it through to the `Declaration.to_table` call. Specifying `use_threads=False` gives stable ordering of the output, and is also required for certain aggregations (e.g. `"first"` will fail with the default of `use_threads=True`).
### Are these changes tested?
Yes, added a test (similar to the one we have for `filter`) that would fail (>50% of the time) if the output was no longer ordered.
* Closes: #36709
Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Describe the enhancement requested
As far as I can tell, it's not officially documented that grouping maintains the order of its keys, but it has seemed that way until recently. Starting with 13.0.0.dev, I've noticed ordering changes intermittently. I think this is an important feature of grouping, as with sorting.
This example outputs differences reliably on 13.0.0.dev516.
Component(s)
C++, Python