Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Acero] As of join breaks depending on the key order #39803

Closed
JerAguilon opened this issue Jan 25, 2024 · 0 comments · Fixed by #39804
Closed

[Acero] As of join breaks depending on the key order #39803

JerAguilon opened this issue Jan 25, 2024 · 0 comments · Fixed by #39804
Assignees
Labels
Component: C++ Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Milestone

Comments

@JerAguilon
Copy link
Contributor

JerAguilon commented Jan 25, 2024

Describe the bug, including details regarding any error messages, version, and platform.

I discovered a bug on the asofjoin, wherein if you change the order of the columns and make types different, it breaks.

Suppose the left table has:

ts,key_str_type,l_val
...data here

And the right table has:

key_str,ts,r_val
...data here

This ends up breaking, since the key hasher metadata is based on the output schema, which looks like:

ts,key_str,l_val,r_val

...And since the right table's 0th column is a string type, the hashes end up being all over the place.

I have a repro, test, and fix that I will attach to this. But you can see the test outputs here for before and after the fix: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc

Component(s)

C++

@pitrou pitrou added this to the 15.0.1 milestone Feb 19, 2024
pitrou added a commit that referenced this issue Feb 19, 2024
…than the output (#39804)

### Rationale for this change

Issue is described visually in #39803.

The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things.

The issue is we are passing the `output_schema`, rather than the input's schema.

If an input looks like 

```
key_string_type,ts_int32_type,val
```

But our expected output schema looks like:

```
ts_int32,key_string_type,...
```
Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board.

### What changes are included in this PR?

One line fix and test with string types.

### Are these changes tested?

Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc

Before the change, notice that inputs 0 and 1 have mismatched hashes:

```
AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039]
...
AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]

```

And after, they do match:

```
AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
...
AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
```

...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]`

### Are there any user-facing changes?

* Closes: #39803

Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou modified the milestones: 15.0.1, 16.0.0 Feb 19, 2024
raulcd pushed a commit that referenced this issue Feb 20, 2024
…than the output (#39804)

### Rationale for this change

Issue is described visually in #39803.

The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things.

The issue is we are passing the `output_schema`, rather than the input's schema.

If an input looks like 

```
key_string_type,ts_int32_type,val
```

But our expected output schema looks like:

```
ts_int32,key_string_type,...
```
Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board.

### What changes are included in this PR?

One line fix and test with string types.

### Are these changes tested?

Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc

Before the change, notice that inputs 0 and 1 have mismatched hashes:

```
AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039]
...
AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]

```

And after, they do match:

```
AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
...
AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
```

...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]`

### Are there any user-facing changes?

* Closes: #39803

Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@amoeba amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Feb 27, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…hemas than the output (apache#39804)

### Rationale for this change

Issue is described visually in apache#39803.

The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things.

The issue is we are passing the `output_schema`, rather than the input's schema.

If an input looks like 

```
key_string_type,ts_int32_type,val
```

But our expected output schema looks like:

```
ts_int32,key_string_type,...
```
Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board.

### What changes are included in this PR?

One line fix and test with string types.

### Are these changes tested?

Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc

Before the change, notice that inputs 0 and 1 have mismatched hashes:

```
AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039]
...
AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]

```

And after, they do match:

```
AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
...
AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
```

...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]`

### Are there any user-facing changes?

* Closes: apache#39803

Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…hemas than the output (apache#39804)

### Rationale for this change

Issue is described visually in apache#39803.

The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things.

The issue is we are passing the `output_schema`, rather than the input's schema.

If an input looks like 

```
key_string_type,ts_int32_type,val
```

But our expected output schema looks like:

```
ts_int32,key_string_type,...
```
Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board.

### What changes are included in this PR?

One line fix and test with string types.

### Are these changes tested?

Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc

Before the change, notice that inputs 0 and 1 have mismatched hashes:

```
AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039]
...
AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]

```

And after, they do match:

```
AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
...
AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
```

...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]`

### Are there any user-facing changes?

* Closes: apache#39803

Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: C++ Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Projects
None yet
3 participants