New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Acero] As of join breaks depending on the key order #39803
Labels
Component: C++
Critical Fix
Bugfixes for security vulnerabilities, crashes, or invalid data.
Type: bug
Milestone
Comments
pitrou
added a commit
that referenced
this issue
Feb 19, 2024
…than the output (#39804) ### Rationale for this change Issue is described visually in #39803. The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things. The issue is we are passing the `output_schema`, rather than the input's schema. If an input looks like ``` key_string_type,ts_int32_type,val ``` But our expected output schema looks like: ``` ts_int32,key_string_type,... ``` Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board. ### What changes are included in this PR? One line fix and test with string types. ### Are these changes tested? Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc Before the change, notice that inputs 0 and 1 have mismatched hashes: ``` AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039] ... AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ``` And after, they do match: ``` AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ... AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ``` ...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]` ### Are there any user-facing changes? * Closes: #39803 Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
raulcd
pushed a commit
that referenced
this issue
Feb 20, 2024
…than the output (#39804) ### Rationale for this change Issue is described visually in #39803. The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things. The issue is we are passing the `output_schema`, rather than the input's schema. If an input looks like ``` key_string_type,ts_int32_type,val ``` But our expected output schema looks like: ``` ts_int32,key_string_type,... ``` Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board. ### What changes are included in this PR? One line fix and test with string types. ### Are these changes tested? Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc Before the change, notice that inputs 0 and 1 have mismatched hashes: ``` AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039] ... AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ``` And after, they do match: ``` AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ... AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ``` ...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]` ### Are there any user-facing changes? * Closes: #39803 Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
amoeba
added
the
Critical Fix
Bugfixes for security vulnerabilities, crashes, or invalid data.
label
Feb 27, 2024
zanmato1984
pushed a commit
to zanmato1984/arrow
that referenced
this issue
Feb 28, 2024
…hemas than the output (apache#39804) ### Rationale for this change Issue is described visually in apache#39803. The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things. The issue is we are passing the `output_schema`, rather than the input's schema. If an input looks like ``` key_string_type,ts_int32_type,val ``` But our expected output schema looks like: ``` ts_int32,key_string_type,... ``` Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board. ### What changes are included in this PR? One line fix and test with string types. ### Are these changes tested? Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc Before the change, notice that inputs 0 and 1 have mismatched hashes: ``` AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039] ... AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ``` And after, they do match: ``` AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ... AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ``` ...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]` ### Are there any user-facing changes? * Closes: apache#39803 Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
thisisnic
pushed a commit
to thisisnic/arrow
that referenced
this issue
Mar 8, 2024
…hemas than the output (apache#39804) ### Rationale for this change Issue is described visually in apache#39803. The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things. The issue is we are passing the `output_schema`, rather than the input's schema. If an input looks like ``` key_string_type,ts_int32_type,val ``` But our expected output schema looks like: ``` ts_int32,key_string_type,... ``` Then the hasher will think that the `key_string_type`'s type is an int32. This completely throws off hashes. Tests currently get away with it since we just use ints across the board. ### What changes are included in this PR? One line fix and test with string types. ### Are these changes tested? Yes. Can see the test run before and after changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc Before the change, notice that inputs 0 and 1 have mismatched hashes: ``` AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039] ... AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ``` And after, they do match: ``` AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ... AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709] ``` ...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]` ### Are there any user-facing changes? * Closes: apache#39803 Lead-authored-by: Jeremy Aguilon <jeraguilon@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Component: C++
Critical Fix
Bugfixes for security vulnerabilities, crashes, or invalid data.
Type: bug
Describe the bug, including details regarding any error messages, version, and platform.
I discovered a bug on the asofjoin, wherein if you change the order of the columns and make types different, it breaks.
Suppose the left table has:
And the right table has:
This ends up breaking, since the key hasher metadata is based on the output schema, which looks like:
...And since the right table's 0th column is a string type, the hashes end up being all over the place.
I have a repro, test, and fix that I will attach to this. But you can see the test outputs here for before and after the fix: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc
Component(s)
C++
The text was updated successfully, but these errors were encountered: