[R] passing a schema causes open_dataset to fail on hive-partitioned csv files #31312

Open
Tracked by #33520
asfimport opened this issue Mar 9, 2022 · 5 comments

Comments

@asfimport

Consider this reprex:

Create a dataset with hive partitions in csv format with write_dataset() (so cool!):

library(arrow)
library(dplyr)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")

## works fine, even with 'collect()'
ds <- open_dataset(path, format="csv")

## but pass a schema, and things fail
df <- open_dataset(path, format="csv", schema = ds$schema, skip_rows=1)
df %>% collect()

In the first call to open_dataset, we don't pass a schema and things work as expected. 

However, csv files often need an explicit schema to be read in correctly, particularly with partitioned data, where it is easy to 'guess' the wrong type. Passing the schema confuses open_dataset, though, because the grouping column (partition column) isn't found in the individual files even though it is listed in the schema!

Nor can we just omit the grouping column from the schema, since then it is effectively lost from the data. 
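
To make the layout concrete, here is a small sketch that writes the same partitioned dataset and inspects what ends up on disk. The directory name `hive_csv_demo` under `tempdir()` is only an illustrative assumption; the point is that `gear` is encoded in the directory names and is absent from each file's header.

```r
library(arrow)
library(dplyr)

# Hypothetical scratch location, chosen only for this illustration
path <- file.path(tempdir(), "hive_csv_demo")
unlink(path, recursive = TRUE)

# Same write as in the reprex: one sub-directory per value of gear
mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# Relative paths of the files that were written
files <- list.files(path, recursive = TRUE)
print(files)

# Column names found inside the first csv file
header <- names(read.csv(file.path(path, files[1]), nrows = 1))
print(header)

# The partition column lives in the path, not in the file contents
print("gear" %in% header)
```

So a schema that lists `gear` names a column that none of the csv files actually contain, which seems consistent with the failure described above.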

Reporter: Carl Boettiger / @cboettig

Note: This issue was originally created as ARROW-15879. Please see the migration documentation for further details.

@asfimport

Dewey Dunnington / @paleolimbot:
It's not all that intuitive, but if you skip the partitioning column I think it works!

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")
ds <- open_dataset(path, format="csv")

# skip the partitioning columns and it works
non_partitioning_cols <- setdiff(names(ds), "gear")
non_partitioning_schema <- ds$schema[non_partitioning_cols]
df <- open_dataset(path, format="csv", schema = non_partitioning_schema, skip_rows = 1)
df %>% collect()
#> # A tibble: 32 × 10
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  carb
#>    <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
#>  1  26       4 120.     91  4.43  2.14  16.7     0     1     2
#>  2  30.4     4  95.1   113  3.77  1.51  16.9     1     1     2
#>  3  15.8     8 351     264  4.22  3.17  14.5     0     1     4
#>  4  19.7     6 145     175  3.62  2.77  15.5     0     1     6
#>  5  15       8 301     335  3.54  3.57  14.6     0     1     8
#>  6  21.4     6 258     110  3.08  3.22  19.4     1     0     1
#>  7  18.7     8 360     175  3.15  3.44  17.0     0     0     2
#>  8  18.1     6 225     105  2.76  3.46  20.2     1     0     1
#>  9  14.3     8 360     245  3.21  3.57  15.8     0     0     4
#> 10  16.4     8 276.    180  3.07  4.07  17.4     0     0     3
#> # … with 22 more rows

@asfimport

Dewey Dunnington / @paleolimbot:
Right! I see what you mean... we lose 'gear' here. Flagging @thisisnic again just in case there's something I missed with respect to the CSV reader here.

@asfimport

Carl Boettiger / @cboettig:
Sorry, my minimal example was too minimal. Yes, I had noticed that dropping the partition column works, but then I cannot filter() on the partition column before collect(). Continuing from your reprex, try:

> df %>% filter(gear < 3) %>% collect()
Error in lapply(args, function(x) { : object 'gear' not found 

I thought the primary incentive to hive-partition was to benefit from arrow's ability to skip parsing the files excluded by the filter. (Admittedly, hive partitioning is more of a parquet concept, I guess; I was initially very pleasantly surprised that write_dataset() would even partition in this way with format="csv", so very cool!)
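
For comparison, a minimal sketch of the behaviour being asked for, using the case that does work in this thread: opening the dataset without a schema and filtering on the partition column before collect(). The scratch path and the choice of `gear == 4` are assumptions made for illustration only.

```r
library(arrow)
library(dplyr)

# Hypothetical scratch location, chosen only for this illustration
path <- file.path(tempdir(), "hive_csv_filter_demo")
unlink(path, recursive = TRUE)

mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# No schema passed: the partition column is recovered from the directory names
ds <- open_dataset(path, format = "csv")

# Filter on the partition column before collecting
res <- ds %>% filter(gear == 4) %>% collect()
print(nrow(res))
```

The request in this issue is for the same filter to keep working when an explicit schema is supplied.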

@asfimport

Sam Albers / @boshek:
I did some digging, to the extent that I added a test that captures this failure here: #12831

I can confirm that this does not happen when format = 'parquet'. The error message is coming from here but that is about as far as I got. I think this is also related to ARROW-14743

@asfimport

Neal Richardson / @nealrichardson:
Confirmed that this is still an issue in 8.0.0
