Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Group_By from Substrait Shuffles Column Names #33616

Closed
ksuarez1423 opened this issue Jan 11, 2023 · 4 comments · Fixed by #34551
Closed

[C++] Group_By from Substrait Shuffles Column Names #33616

ksuarez1423 opened this issue Jan 11, 2023 · 4 comments · Fixed by #34551
Assignees
Labels
Breaking Change Includes a breaking change to the API Component: C++ Type: bug
Milestone

Comments

@ksuarez1423
Copy link
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

When using a Substrait plan that includes a group_by at the end, it appears that the resulting table will have correct results, but the columns have their names shuffled. This occurs on nightly Arrow 11.0.0.dev406, and a reproducer is included below. The plan in question is included in blob form in the reproducer, but comes from the following Ibis code:

total_population = countries2['population'].sum().name('total_population')
ex_expr = countries2.group_by('continent').aggregate(
    [total_population, countries2['area_km2'].mean().name('average_area')]
)

The particular error is as follows: instead of the continent column storing continents in the output table, it stores total population. The total population column stores average area, and the average area column stores continents. This is interpretable, but likely problematic if this table is meant for ingestion elsewhere.

Reproducer:

import pyarrow as pa
import pyarrow.substrait as substrait
​
tab = pa.Table.from_pydict({
    "iso_alpha2": ["x", "y"],
    "iso_alpha3": ["a", "b"],
    "iso_numeric": [1, 2],
    "fips": ["aa", "bb"],
    "name": ["xx", "yy"],
    "capital": ["Olympia", "Denver"],
    "area_km2": [100.1, 200.1],
    "population": [1000, 2000],
    "continent": ["NA", "NA"]
    })
​
def table_provider(names):
    if len(names) == 1 and names[0] == "countries2":
        return tab
    raise Error(f"Unexpected table name: {names}")
​
query = b'\n\x02\x08\x01\x12\x0b\x1a\t\x08\x01\x10\x01\x1a\x03sum\x12\x0b\x1a\t\x08\x01\x10\x02\x1a\x03avg\x1a\xaa\x02\x12\xa7\x02\n\xf9\x01"\xf6\x01\x12\xad\x01\n\xaa\x01\n\x02\n\x00\x12\x95\x01\n\niso_alpha2\n\niso_alpha3\n\x0biso_numeric\n\x04fips\n\x04name\n\x07capital\n\x08area_km2\n\npopulation\n\tcontinent\x128\n\x04b\x02\x10\x01\n\x04b\x02\x10\x01\n\x04*\x02\x10\x01\n\x04b\x02\x10\x01\n\x04b\x02\x10\x01\n\x04b\x02\x10\x01\n\x04Z\x02\x10\x01\n\x04*\x02\x10\x01\n\x04b\x02\x10\x01\x18\x02:\x0c\n\ncountries2\x1a\x0c\n\n\x12\x08\n\x04\x12\x02\x08\x08"\x00"\x1a\n\x18\x08\x01 \x03*\x04:\x02\x10\x01:\x0c\x1a\n\x12\x08\n\x04\x12\x02\x08\x07"\x00"\x1a\n\x18\x08\x02 \x03*\x04Z\x02\x10\x01:\x0c\x1a\n\x12\x08\n\x04\x12\x02\x08\x06"\x00\x12\tcontinent\x12\x10total_population\x12\x0caverage_area'
​
result = substrait.run_query(query, table_provider=table_provider)
​
print(result.read_all())
​
# pyarrow.Table
# continent: int64
# total_population: double
# average_area: string
# ----
# continent: [[3000]]
# total_population: [[150.1]]
# average_area: [["NA"]]

Component(s)

C++

@westonpace
Copy link
Member

I'm pretty sure the reason this happens is that Acero's aggregate node outputs in agg1, ..., aggN, key1, ..., keyN order and the Substrait aggregate node expects output in key1, ..., keyN, agg1, ..., aggN order.

We could change the aggregate node but this would be a breaking change for things like pyarrow.group_by (admittedly, it may be a welcome change). @jorisvandenbossche

If we want to avoid the breaking change then we could patch this in the emit step. This is very similar to a problem faced by the asof join node which is being addressed in #14799 or the solution that we used for project where we provided a specialization of ProcessEmitInfo.

@westonpace
Copy link
Member

westonpace commented Jan 11, 2023

This is a duplicate of #32897 but this issue has a reproducer and more information so I am going to close the other one.

@westonpace
Copy link
Member

@jorisvandenbossche @amol- @AlenkaF any particular opinion on this change? I agree that keys followed by measures is more intuitive (this matches SQL). However, it would be a breaking change so I didn't want to make the change in pyarrow if not needed (I can fix Substrait without changing pyarrow behavior pretty easily)

@jorisvandenbossche
Copy link
Member

I am personally fine with changing this in the user-facing behaviour of pyarrow's group_by as well. I have always found it a peculiarity that the group keys are at the end of the result (being used to pandas where those columns are essentially put first)

westonpace added a commit that referenced this issue Apr 5, 2023
…e aggregates (#34551)

* Closes: #33616

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 12.0.0 milestone Apr 5, 2023
@wjones127 wjones127 added the Breaking Change Includes a breaking change to the API label May 8, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
… before aggregates (apache#34551)

* Closes: apache#33616

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
… before aggregates (apache#34551)

* Closes: apache#33616

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking Change Includes a breaking change to the API Component: C++ Type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants