[R] passing a schema causes open_dataset to fail on hive-partitioned csv files #31312
Comments
Dewey Dunnington / @paleolimbot:
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")
ds <- open_dataset(path, format="csv")
# skip the partitioning columns and it works
non_partitioning_cols <- setdiff(names(ds), "gear")
non_partitioning_schema <- ds$schema[non_partitioning_cols]
df <- open_dataset(path, format="csv", schema = non_partitioning_schema, skip_rows = 1)
df %>% collect()
#> # A tibble: 32 × 10
#> mpg cyl disp hp drat wt qsec vs am carb
#> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
#> 1 26 4 120. 91 4.43 2.14 16.7 0 1 2
#> 2 30.4 4 95.1 113 3.77 1.51 16.9 1 1 2
#> 3 15.8 8 351 264 4.22 3.17 14.5 0 1 4
#> 4 19.7 6 145 175 3.62 2.77 15.5 0 1 6
#> 5 15 8 301 335 3.54 3.57 14.6 0 1 8
#> 6 21.4 6 258 110 3.08 3.22 19.4 1 0 1
#> 7 18.7 8 360 175 3.15 3.44 17.0 0 0 2
#> 8 18.1 6 225 105 2.76 3.46 20.2 1 0 1
#> 9 14.3 8 360 245 3.21 3.57 15.8 0 0 4
#> 10 16.4 8 276. 180 3.07 4.07 17.4 0 0 3
#> # … with 22 more rows
Carl Boettiger / @cboettig:
> df %>% filter(gear < 3) %>% collect()
Error in lapply(args, function(x) { : object 'gear' not found
The primary incentive to hive-partition I thought was to benefit from …
Sam Albers / @boshek: I can confirm that this does not happen when …
Consider this reprex:
Create a dataset with hive partitions in csv format with write_dataset() (so cool!):
In the first call to open_dataset, we don't pass a schema and things work as expected.
However, csv files often need a schema to be read in correctly, particularly with partitioned data, where it is easy to 'guess' the wrong type. Passing the schema, though, confuses open_dataset: the grouping (partition) column isn't found in the individual files, even though it is listed in the schema!
Nor can we just omit the grouping column from the schema, since then it is effectively lost from the data.
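The three cases described above can be put side by side in one script. This is only a sketch assembled from the calls quoted in the comments, not the original reprex: the temporary directory name and the variable names (`result_full`, `file_schema`, `ds_no_gear`) are made up for illustration, and the full-schema call is wrapped in tryCatch() because its outcome is the behaviour under discussion and may differ between arrow versions.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Write mtcars as a hive-partitioned csv dataset (one directory per gear value).
# The directory name is an arbitrary choice for this sketch.
path <- file.path(tempdir(), "mtcars_hive_csv")
unlink(path, recursive = TRUE)
mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# Case 1: no schema. The partition column is recovered from the directory names.
ds <- open_dataset(path, format = "csv")
has_gear_inferred <- "gear" %in% names(ds)

# Case 2: full schema, including the partition column `gear`.
# This is the call reported to fail; capture the error instead of stopping.
result_full <- tryCatch(
  open_dataset(path, format = "csv", schema = ds$schema, skip_rows = 1) %>%
    collect(),
  error = function(e) e
)

# Case 3: schema restricted to the columns physically present in the files.
# In the output shown in the comments, `gear` is missing from the result.
file_schema <- ds$schema[setdiff(names(ds), "gear")]
ds_no_gear <- open_dataset(path, format = "csv", schema = file_schema, skip_rows = 1)
has_gear_explicit <- "gear" %in% names(ds_no_gear)
```

Inspecting `result_full` (an error condition or a data frame) and `has_gear_explicit` on a given arrow version shows which of the behaviours described in this issue still applies there.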
Reporter: Carl Boettiger / @cboettig
Note: This issue was originally created as ARROW-15879. Please see the migration documentation for further details.