Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39803: [C++][Acero] Fix AsOfJoin with differently ordered schemas than the output #39804

Merged
merged 3 commits into from
Feb 19, 2024

Conversation

JerAguilon
Copy link
Contributor

@JerAguilon JerAguilon commented Jan 25, 2024

Rationale for this change

Issue is described visually in #39803.

The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the column metadata for the asof-join key fields. This returns whether columns are fixed width, among other things.

The issue is we are passing the output_schema, rather than the input's schema.

If an input looks like

key_string_type,ts_int32_type,val

But our expected output schema looks like:

ts_int32,key_string_type,...

Then the hasher will think that the key_string_type's type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board.

What changes are included in this PR?

One line fix and test with string types.

Are these changes tested?

Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc

Before the change, notice that inputs 0 and 1 have mismatched hashes:

AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039]
...
AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]

And after, they do match:

AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
...
AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]

...which is exactly what you want, since the key column for both tables looks like ["0", "1", ..."9"]

Are there any user-facing changes?

Copy link

⚠️ GitHub issue #39803 has been automatically assigned in GitHub to PR creator.

@pitrou pitrou changed the title GH-39803: [C++][Acero] Fix key hashers for tables with differently ordered schemas than the output GH-39803: [C++][Acero] Fix AsOfJoin with differently ordered schemas than the output Feb 12, 2024
@@ -16,6 +16,7 @@
// under the License.

#include <gmock/gmock-matchers.h>
#include <iostream> // nocommit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required? Also, what is "nocommit" for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops - this was for some cout debugging, will remove

cpp/src/arrow/acero/asof_join_node_test.cc Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Feb 12, 2024

Also, can you please rebase from git main?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 12, 2024
@JerAguilon
Copy link
Contributor Author

Also, can you please rebase from git main?

Done and addressed comments. Thanks for the review

@pitrou
Copy link
Member

pitrou commented Feb 19, 2024

@github-actions crossbow submit -g cpp

Copy link

Revision: 11436fd

Submitted crossbow builds: ursacomputing/crossbow @ actions-4147b43cbc

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions

@pitrou pitrou merged commit 4b74b45 into apache:main Feb 19, 2024
30 of 35 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Feb 19, 2024
@pitrou
Copy link
Member

pitrou commented Feb 19, 2024

Thanks a lot for this @JerAguilon !

Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 4b74b45.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.

raulcd pushed a commit that referenced this pull request Feb 20, 2024
…than the output (#39804)

### Rationale for this change

Issue is described visually in #39803.

The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things.

The issue is we are passing the `output_schema`, rather than the input's schema.

If an input looks like 

```
key_string_type,ts_int32_type,val
```

But our expected output schema looks like:

```
ts_int32,key_string_type,...
```
Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board.

### What changes are included in this PR?

One line fix and test with string types.

### Are these changes tested?

Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc

Before the change, notice that inputs 0 and 1 have mismatched hashes:

```
AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039]
...
AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]

```

And after, they do match:

```
AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
...
AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
```

...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]`

### Are there any user-facing changes?

* Closes: #39803

Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…hemas than the output (apache#39804)

### Rationale for this change

Issue is described visually in apache#39803.

The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things.

The issue is we are passing the `output_schema`, rather than the input's schema.

If an input looks like 

```
key_string_type,ts_int32_type,val
```

But our expected output schema looks like:

```
ts_int32,key_string_type,...
```
Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board.

### What changes are included in this PR?

One line fix and test with string types.

### Are these changes tested?

Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc

Before the change, notice that inputs 0 and 1 have mismatched hashes:

```
AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039]
...
AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]

```

And after, they do match:

```
AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
...
AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
```

...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]`

### Are there any user-facing changes?

* Closes: apache#39803

Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…hemas than the output (apache#39804)

### Rationale for this change

Issue is described visually in apache#39803.

The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things.

The issue is we are passing the `output_schema`, rather than the input's schema.

If an input looks like 

```
key_string_type,ts_int32_type,val
```

But our expected output schema looks like:

```
ts_int32,key_string_type,...
```
Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board.

### What changes are included in this PR?

One line fix and test with string types.

### Are these changes tested?

Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc

Before the change, notice that inputs 0 and 1 have mismatched hashes:

```
AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039]
...
AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]

```

And after, they do match:

```
AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
...
AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
```

...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]`

### Are there any user-facing changes?

* Closes: apache#39803

Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Acero] As of join breaks depending on the key order
2 participants