-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R] CsvConvertOptions include_columns and col_select should give better error message when used in open_dataset #31355
Comments
Dewey Dunnington / @paleolimbot: Pinging @thisisnic, since they've done the most work getting Re-rendering your excellent reprex below to be Jira friendly: library(tidyverse)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
tmpf <- tempfile()
dat <- tribble(
~key, ~val1,
"A", "1",
"B", "2",
)
write_csv(dat, tmpf)
read_csv_arrow(
tmpf,
convert_options = CsvConvertOptions$create(
include_columns = "key"
))
#> # A tibble: 2 × 1
#> key
#> <chr>
#> 1 A
#> 2 B
open_dataset(
tmpf, format = "csv",
convert_options = CsvConvertOptions$create(
include_columns = "key"
)) %>% collect()
#> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
#> ! Invalid: Multiple matches for FieldRef.Name(key) in key: [
#> "A",
#> "B"
#> ]
#> key: [
#> "A",
#> "B"
#> ]
#>
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1727 CheckNonMultiple(matches, root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1759 FindOneOrNone(root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:438 FieldRef(field->name()).GetOneOrNone(partial_batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/dataset/scanner.cc:865 compute::MakeExecBatch(*scan_options->dataset_schema, partial.record_batch.value)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484 iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:336 ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:347 ReadAll(&batches)
open_dataset(tmpf, format = "csv") %>%
select(key) %>%
collect()
#> # A tibble: 2 × 1
#> key
#> <chr>
#> 1 A
#> 2 B |
Nicola Crane / @thisisnic: |
I'm closing this issue due to the fact that we recently merged #15270 which allows users to pass in read/parse/convert options as lists - when users do this, we can manually check the values in these lists to see if they're compatible with the datasets API or not. We cannot do this with CsvOptions objects. We should consider direct instantiation of CsvOptions etc objects "off-label" usage, and encourage passing in items as lists only. |
I think there is a bug when reading a csv dataset where you don't want to read in all columns. As shown below, the identical code works in read_csv_arrow but errors in open_dataset. This can be worked around by reading in all columns and then selecting afterwards, but I am not sure if there is any performance advantage to omitting columns at the reading step.
Created on 2022-03-12 by the reprex package (v2.0.1)
I have tried this both with CRAN version 7 and the nightly version.
Environment: Windows 10
Reporter: Jameel Alsalam
Note: This issue was originally created as ARROW-15926. Please see the migration documentation for further details.
The text was updated successfully, but these errors were encountered: