Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] CsvConvertOptions include_columns and col_select should give better error message when used in open_dataset #31355

Closed
Tracked by #33520
asfimport opened this issue Mar 12, 2022 · 3 comments

Comments

@asfimport
Copy link

I think there is a bug when reading a csv dataset where you don't want to read in all columns. As shown below, the identical code works in read_csv_arrow but errors in open_dataset. This can be worked around by reading in all columns and then selecting afterwards, but I am not sure if there is any performance advantage to omitting columns at the reading step.

 

library(tidyverse)
library(arrow)
#
#> Attaching package: 'arrow'

tmpf <- tempfile()

dat <- tribble(
  ~key, ~val1,
  "A", "1",
  "B", "2",
)

write_csv(dat, tmpf)

1. works in read_csv_arrow, errors in open_dataset:
   
   read_csv_arrow(
     tmpf,
     convert_options = CsvConvertOptions$create(
       include_columns = "key"
     ))
#> # A tibble: 2 x 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B

open_dataset(
  tmpf, format = "csv",
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  )) %>% collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Multiple matches for FieldRef.Name(key) in key:   [
#>     "A",
#>     "B"
#>   ]
#> key:   [
#>     "A",
#>     "B"
#>   ]

1. Note that it does work to select after open_dataset, thus not a blocking issue:
   
   open_dataset(tmpf, format = "csv") %>%
     select(key) %>%
     collect()
#> # A tibble: 2 x 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B

Created on 2022-03-12 by the reprex package (v2.0.1)

 

I have tried this both with CRAN version 7 and the nightly version.

Environment: Windows 10
Reporter: Jameel Alsalam

Note: This issue was originally created as ARROW-15926. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Dewey Dunnington / @paleolimbot:
Thank you for reporting! It is definitely a confusing message that we need to fix. You're correct that the preferred idiom is csv_dataset %>% select(key) in this case, which should automatically only read the necessary columns.

Pinging @thisisnic, since they've done the most work getting read_csv_arrow() and open_dataset(format = "csv") to agree with eachother.

Re-rendering your excellent reprex below to be Jira friendly:

library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

tmpf <- tempfile()

dat <- tribble(
  ~key, ~val1,
  "A", "1",
  "B", "2",
)

write_csv(dat, tmpf)


read_csv_arrow(
  tmpf,
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  ))
#> # A tibble: 2 × 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B

open_dataset(
  tmpf, format = "csv",
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  )) %>% collect()
#> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
#> ! Invalid: Multiple matches for FieldRef.Name(key) in key:   [
#>     "A",
#>     "B"
#>   ]
#> key:   [
#>     "A",
#>     "B"
#>   ]
#> 
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1727  CheckNonMultiple(matches, root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1759  FindOneOrNone(root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:438  FieldRef(field->name()).GetOneOrNone(partial_batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/dataset/scanner.cc:865  compute::MakeExecBatch(*scan_options->dataset_schema, partial.record_batch.value)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484  iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:336  ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:347  ReadAll(&batches)

open_dataset(tmpf, format = "csv") %>%
  select(key) %>%
  collect()
#> # A tibble: 2 × 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B

@asfimport
Copy link
Author

Nicola Crane / @thisisnic:
I had a look into this, and although it is technically possible to pass the convert_options parameter through like this, I don't think we actually support that option with datasets.  As @paleolimbot  has said above, we should definitely improve the error message here - I'll update the ticket to reflect that.

@thisisnic
Copy link
Member

I'm closing this issue due to the fact that we recently merged #15270 which allows users to pass in read/parse/convert options as lists - when users do this, we can manually check the values in these lists to see if they're compatible with the datasets API or not. We cannot do this with CsvOptions objects. We should consider direct instantiation of CsvOptions etc objects "off-label" usage, and encourage passing in items as lists only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants