[C++] joins segfault when data contains list column #30074

Closed
asfimport opened this issue Oct 29, 2021 · 7 comments

asfimport commented Oct 29, 2021

When I run the R code below, it results in a segfault if one of the tables contains a list column.

library(arrow)
library(dplyr)

basic_tbl <- arrow_table(
  tibble::tibble(
    x = 1:3,
    y = c("a", "b", "c")
  )
)

basic_tbl2 <- arrow_table(
  tibble::tibble(
    x = 1:3,
    z = c(TRUE, FALSE, TRUE)
  )
)

list_tbl <- arrow_table(
  tibble::tibble(
    z = list(c("first", "list", "col", "row"), c("second row ", "here")),
    x = 1:2
  )
)

# works
left_join(basic_tbl, basic_tbl2) %>%
  collect()

# segfaults
left_join(basic_tbl, list_tbl) %>%
  collect()

Reporter: Nicola Crane / @thisisnic
Assignee: David Li / @lidavidm

Note: This issue was originally created as ARROW-14519. Please see the migration documentation for further details.

Jonathan Keane / @jonkeane:
It looks like it even segfaults when the join keys are specified with by = "x" (so it's not only that the list column is being used as a join key)!

David Li / @lidavidm:
This is because joining on lists is not supported, but the code path triggers an assertion instead of reporting an error. Also, it looks like the join code needs to pre-process all columns, so the presence of any unsupported type will cause this (as you found). We should at least raise an error instead of crashing, but I'm not familiar enough with the join code to know if we can handle unsupported types when they're not being used as the key.
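
To illustrate the point about pre-processing, here is a toy model of it, not Arrow's actual join code. All names (`TypeId`, `Column`, `Status`, `EncodeColumn`, `PrepareJoinInput`) are hypothetical. It assumes every column goes through a per-type encoding step whether or not it is a key, so a single unsupported payload column is enough to fail the whole join.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are not Arrow's real types.
enum class TypeId { kInt32, kString, kList };

struct Column {
  std::string name;
  TypeId type;
  bool is_key;  // true if the column is used as a join key
};

struct Status {
  bool ok;
  std::string message;
};

// Toy per-type encoding step: only types with a dedicated case succeed.
Status EncodeColumn(const Column& col) {
  switch (col.type) {
    case TypeId::kInt32:
    case TypeId::kString:
      return {true, ""};
    case TypeId::kList:
      return {false, "unsupported type in column '" + col.name + "'"};
  }
  return {false, "unknown type in column '" + col.name + "'"};
}

// Every column is encoded, key or not, so is_key is never consulted here.
Status PrepareJoinInput(const std::vector<Column>& columns) {
  for (const Column& col : columns) {
    Status st = EncodeColumn(col);
    if (!st.ok) return st;
  }
  return {true, ""};
}
```

With columns `x` (int32, key) and `z` (list, non-key), `PrepareJoinInput` in this sketch fails on `z` even though the join is keyed on `x` only, which matches the behaviour seen with by = "x".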

Neal Richardson / @nealrichardson:
Sounds related to ARROW-14181, yeah?

David Li / @lidavidm:
ARROW-14181 will expand the set of supported types but won't affect things here. (Also, the code path in question here already supported dictionaries.)

Michal Nowakiewicz / @michalursa:
We cannot easily support more types in hash join right now. That is because we transform and encode all the input values, key and non-key (row_encoder.h), so it would need another specialization for each additional type.

But we can return an error (from HashJoinSchema::ValidateSchemas where we check data types from input schemas and keys) instead of asserting.
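
A minimal sketch of what such an up-front check could look like, assuming a simplified schema model. It does not reproduce the real `HashJoinSchema::ValidateSchemas` signature or Arrow's `Status` class. `TypeId`, `Field`, `Status`, `IsSupportedForJoin`, and `ValidateJoinSchemas` are hypothetical names, and the list of supported types is an assumption made for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are not Arrow's real types.
enum class TypeId { kInt32, kBool, kString, kDictionary, kList };

struct Field {
  std::string name;
  TypeId type;
};

struct Status {
  bool ok;
  std::string message;
};

// Assumption for this sketch: everything except list types can be encoded.
bool IsSupportedForJoin(TypeId type) { return type != TypeId::kList; }

// Checks every field on both sides, key and non-key alike, and reports the
// first unsupported column as an error value instead of asserting.
Status ValidateJoinSchemas(const std::vector<Field>& left,
                           const std::vector<Field>& right) {
  const std::vector<Field>* sides[] = {&left, &right};
  for (const std::vector<Field>* side : sides) {
    for (const Field& field : *side) {
      if (!IsSupportedForJoin(field.type)) {
        return {false, "Data type of column '" + field.name +
                           "' is not supported in join"};
      }
    }
  }
  return {true, ""};
}
```

A caller would run this check before building the join and pass the message up to the bindings, so the R session would see an ordinary error rather than a crash.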

Neal Richardson / @nealrichardson:
I don't follow, why do we have to transform and encode columns that we aren't joining by? They're just along for the ride.

David Li / @lidavidm:
Issue resolved by pull request 11625
#11625
