[R] CsvConvertOptions include_columns and col_select should give better error message when used in open_dataset #31355

asfimport · 2022-03-12T23:34:47Z

I think there is a bug when reading a csv dataset where you don't want to read in all columns. As shown below, the identical code works in read_csv_arrow but errors in open_dataset. This can be worked around by reading in all columns and then selecting afterwards, but I am not sure if there is any performance advantage to omitting columns at the reading step.

library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'

tmpf <- tempfile()

dat <- tribble(
  ~key, ~val1,
  "A", "1",
  "B", "2",
)

write_csv(dat, tmpf)

1. works in read_csv_arrow, errors in open_dataset:
   
   read_csv_arrow(
     tmpf,
     convert_options = CsvConvertOptions$create(
       include_columns = "key"
     ))
#> # A tibble: 2 x 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B

open_dataset(
  tmpf, format = "csv",
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  )) %>% collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Multiple matches for FieldRef.Name(key) in key:   [
#>     "A",
#>     "B"
#>   ]
#> key:   [
#>     "A",
#>     "B"
#>   ]

1. Note that it does work to select after open_dataset, thus not a blocking issue:
   
   open_dataset(tmpf, format = "csv") %>%
     select(key) %>%
     collect()
#> # A tibble: 2 x 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B

^{Created on 2022-03-12 by the reprex package (v2.0.1)}

I have tried this both with CRAN version 7 and the nightly version.

Environment: Windows 10
Reporter: Jameel Alsalam

_{Note: This issue was originally created as ARROW-15926. Please see the migration documentation for further details.}

asfimport · 2022-03-14T12:38:59Z

Dewey Dunnington / @paleolimbot:
Thank you for reporting! It is definitely a confusing message that we need to fix. You're correct that the preferred idiom is csv_dataset %>% select(key) in this case, which should automatically only read the necessary columns.

Pinging @thisisnic, since they've done the most work getting read_csv_arrow() and open_dataset(format = "csv") to agree with eachother.

Re-rendering your excellent reprex below to be Jira friendly:

library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

tmpf <- tempfile()

dat <- tribble(
  ~key, ~val1,
  "A", "1",
  "B", "2",
)

write_csv(dat, tmpf)


read_csv_arrow(
  tmpf,
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  ))
#> # A tibble: 2 × 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B

open_dataset(
  tmpf, format = "csv",
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  )) %>% collect()
#> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
#> ! Invalid: Multiple matches for FieldRef.Name(key) in key:   [
#>     "A",
#>     "B"
#>   ]
#> key:   [
#>     "A",
#>     "B"
#>   ]
#> 
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1727  CheckNonMultiple(matches, root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1759  FindOneOrNone(root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:438  FieldRef(field->name()).GetOneOrNone(partial_batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/dataset/scanner.cc:865  compute::MakeExecBatch(*scan_options->dataset_schema, partial.record_batch.value)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484  iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:336  ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:347  ReadAll(&batches)

open_dataset(tmpf, format = "csv") %>%
  select(key) %>%
  collect()
#> # A tibble: 2 × 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B

asfimport · 2022-03-18T14:29:22Z

Nicola Crane / @thisisnic:
I had a look into this, and although it is technically possible to pass the convert_options parameter through like this, I don't think we actually support that option with datasets. As @paleolimbot has said above, we should definitely improve the error message here - I'll update the ticket to reflect that.

thisisnic · 2023-01-12T11:52:18Z

I'm closing this issue due to the fact that we recently merged #15270 which allows users to pass in read/parse/convert options as lists - when users do this, we can manually check the values in these lists to see if they're compatible with the datasets API or not. We cannot do this with CsvOptions objects. We should consider direct instantiation of CsvOptions etc objects "off-label" usage, and encourage passing in items as lists only.

asfimport mentioned this issue Jan 11, 2023

[R] Datasets API interface improvements #33520

Open

13 tasks

thisisnic closed this as completed Jan 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R] CsvConvertOptions include_columns and col_select should give better error message when used in open_dataset #31355

[R] CsvConvertOptions include_columns and col_select should give better error message when used in open_dataset #31355

asfimport commented Mar 12, 2022

asfimport commented Mar 14, 2022

asfimport commented Mar 18, 2022

thisisnic commented Jan 12, 2023

[R] CsvConvertOptions include_columns and col_select should give better error message when used in open_dataset #31355

[R] CsvConvertOptions include_columns and col_select should give better error message when used in open_dataset #31355

Comments

asfimport commented Mar 12, 2022

asfimport commented Mar 14, 2022

asfimport commented Mar 18, 2022

thisisnic commented Jan 12, 2023