[R] passing a schema causes open_dataset to fail on hive-partitioned csv files #31312

asfimport · 2022-03-09T04:53:47Z

Consider this reprex:

Create a dataset with hive partitions in csv format with write_dataset() (so cool!):

library(arrow)
library(dplyr)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")

## works fine, even with 'collect()'
ds <- open_dataset(path, format="csv")

## but pass a schema, and things fail
df <- open_dataset(path, format="csv", schema = ds$schema, skip_rows=1)
df %>% collect()

In the first call to open_dataset, we don't pass a schema and things work as expected.

However, csv files often need a schema to be read in correctly, particularly with partitioned data, where it is easy to 'guess' the wrong type. Passing the schema, though, confuses open_dataset, because the grouping column (the partition column) isn't found in the individual files even though it is mentioned in the schema!

Nor can we just omit the grouping column from the schema, since then it is effectively lost from the data.
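
To illustrate why the partition column is not found on the individual files, here is a minimal sketch. It assumes a fresh temporary directory from tempfile() rather than the "tmp" folder above, and the names files, partition_dirs and file_cols are only illustrative. It writes the same dataset and then looks at what actually lands on disk:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Assumption: a throwaway temp directory instead of "tmp"
path <- tempfile("hive_csv_")
mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# The partition value is encoded in the directory names (gear=<value>)
files <- list.files(path, recursive = TRUE, full.names = TRUE)
partition_dirs <- unique(basename(dirname(files)))
print(partition_dirs)

# Read one file with base R and inspect its header:
# the partition column is not stored inside the file itself
file_cols <- names(read.csv(files[1]))
print(file_cols)
print("gear" %in% file_cols)
```

So a schema that lists gear describes the dataset as a whole, not the contents of any single csv file, which seems to be where the csv reader trips up.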

Reporter: Carl Boettiger / @cboettig

PRs and other links:

GitHub Pull Request #12831

Note: This issue was originally created as ARROW-15879. Please see the migration documentation for further details.

asfimport · 2022-03-14T13:19:35Z

Dewey Dunnington / @paleolimbot:
It's not all that intuitive, but if you skip the partitioning column I think it works!

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")
ds <- open_dataset(path, format="csv")

# skip the partitioning columns and it works
non_partitioning_cols <- setdiff(names(ds), "gear")
non_partitioning_schema <- ds$schema[non_partitioning_cols]
df <- open_dataset(path, format="csv", schema = non_partitioning_schema, skip_rows = 1)
df %>% collect()
#> # A tibble: 32 × 10
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  carb
#>    <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
#>  1  26       4 120.     91  4.43  2.14  16.7     0     1     2
#>  2  30.4     4  95.1   113  3.77  1.51  16.9     1     1     2
#>  3  15.8     8 351     264  4.22  3.17  14.5     0     1     4
#>  4  19.7     6 145     175  3.62  2.77  15.5     0     1     6
#>  5  15       8 301     335  3.54  3.57  14.6     0     1     8
#>  6  21.4     6 258     110  3.08  3.22  19.4     1     0     1
#>  7  18.7     8 360     175  3.15  3.44  17.0     0     0     2
#>  8  18.1     6 225     105  2.76  3.46  20.2     1     0     1
#>  9  14.3     8 360     245  3.21  3.57  15.8     0     0     4
#> 10  16.4     8 276.    180  3.07  4.07  17.4     0     0     3
#> # … with 22 more rows

asfimport · 2022-03-14T13:21:13Z

Dewey Dunnington / @paleolimbot:
Right! I see what you mean... we lose 'gear' here. Flagging @thisisnic again just in case there's something I missed with respect to the CSV reader here.

asfimport · 2022-03-14T14:15:11Z

Carl Boettiger / @cboettig:
Sorry, my minimal example was too minimal. Yes, I had noticed that dropping the partition column works, but then I cannot filter() on the partition column before collect(). Continuing from your reprex, try:

> df %>% filter(gear < 3) %>% collect()
Error in lapply(args, function(x) { : object 'gear' not found

I thought the primary incentive to hive-partition was to benefit from arrow's ability to skip parsing the files excluded by the filter altogether. (Admittedly, hive partitioning is more of a parquet concept, I guess. I was initially very pleasantly surprised that write_dataset() would even partition in this way with format="csv", so very cool!)
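
For comparison, a minimal sketch of the filter-before-collect pattern when no schema is passed (the case reported above as working). It assumes a temporary directory from tempfile(), and the name four_gears is only illustrative. It uses gear == 4 rather than gear < 3 so that the result is not empty:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Assumption: a throwaway temp directory instead of "tmp"
path <- tempfile("hive_csv_")
mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# No schema passed: the partition column is inferred from the directory names
ds <- open_dataset(path, format = "csv")

# Filtering on the partition column before collect() works in this case
four_gears <- ds %>% filter(gear == 4) %>% collect()
print(nrow(four_gears))
print(unique(four_gears$gear))
```

This is the behaviour I would hope to keep once an explicit schema is supplied.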

asfimport · 2022-04-11T16:37:11Z

Sam Albers / @boshek:
I did some digging, to the extent that I added a test that captures this failure here: #12831

I can confirm that this does not happen when format = 'parquet'. The error message is coming from here, but that is about as far as I got. I think this is also related to ARROW-14743.
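
A minimal sketch of the parquet comparison, assuming a temporary directory from tempfile() and illustrative variable names (this is not the test added in the PR). It passes the full inferred schema, including the partition column, back to open_dataset():

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Assumption: a throwaway temp directory, same data as the csv reprex
path <- tempfile("hive_parquet_")
mtcars %>% group_by(gear) %>% write_dataset(path, format = "parquet")

# Infer the schema first, then pass it back explicitly (gear included)
ds <- open_dataset(path, format = "parquet")
df <- open_dataset(path, format = "parquet", schema = ds$schema)

res <- df %>% collect()
print(dim(res))
print("gear" %in% names(res))
```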

asfimport · 2022-05-13T18:30:50Z

Neal Richardson / @nealrichardson:
Confirmed that this is still an issue in 8.0.0

asfimport mentioned this issue Jan 11, 2023

[R] Datasets API interface improvements #33520

Open

13 tasks
