[R] how to enforce type conversion in open_dataset()

Here is a small example:

```r
library(arrow)
df_numbers <- tibble::tibble(number = c(1,2,3,"error", 4, 5, NA, 6))
str(df_numbers)
#> tibble [8 x 1] (S3: tbl_df/tbl/data.frame)
#>  $ number: chr [1:8] "1" "2" "3" "error" ...
write_parquet(df_numbers, "numbers.parquet")
open_dataset("numbers.parquet") 
#> FileSystemDataset with 1 Parquet file
#> number: string
open_dataset("numbers.parquet", schema(number = int8())) |> dplyr::collect()
#> Error in `dplyr::collect()`:
#> ! Invalid: Failed to parse string: 'error' as a scalar of type int8

```
The expected result is an integer column in which the non-integer values are converted to `NA`.

How can this type conversion be enforced using a schema definition in `open_dataset()`?
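For reference, here is one possible workaround, sketched under assumptions. It is not a schema-only solution: it opens the column as a string, replaces every value that does not look like an integer with `NA`, and only then casts. The regular expression and the `result` name are illustrative choices, and the sketch assumes the arrow dplyr bindings for `grepl()`, `if_else()` and `as.integer()` are available in the installed version.

```r
library(arrow)
library(dplyr)

# Recreate the example file from the report.
df_numbers <- tibble::tibble(number = c(1, 2, 3, "error", 4, 5, NA, 6))
write_parquet(df_numbers, "numbers.parquet")

# Keep the column as a string when opening the dataset, blank out anything
# that is not an optionally signed run of digits, then cast to integer.
# The query is lazy: nothing is read until collect() is called.
result <- open_dataset("numbers.parquet") |>
  mutate(
    number = as.integer(
      if_else(grepl("^-?[0-9]+$", number), number, NA_character_)
    )
  ) |>
  collect()

str(result)
```

Because the invalid strings are already missing by the time the cast runs, the parse error from the original call should not occur.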

Rationale: I would like to include this in a code chunk that imports a CSV dataset and saves it as a Parquet dataset (`open_dataset()` -> `write_dataset()`), with the type conversion based on a preset schema done in the same step, and all of this without loading all the data into memory.
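The intended pipeline might look roughly like the sketch below. The directory names `csv_dir` and `parquet_dir` and the sample CSV file are hypothetical and exist only to make the example self-contained. It assumes the `number` column is read from the CSV as a string (which type inference should do here because of the `"error"` value), and the conversion is done with a `mutate()` step rather than through the schema argument.

```r
library(arrow)
library(dplyr)

# Hypothetical input and output locations, used only for this sketch.
csv_dir <- file.path(tempdir(), "numbers_csv")
parquet_dir <- file.path(tempdir(), "numbers_parquet")
dir.create(csv_dir, showWarnings = FALSE)

# Small sample CSV with one non-integer value and one missing value.
write.csv(
  data.frame(number = c("1", "2", "3", "error", "4", "5", NA, "6")),
  file.path(csv_dir, "numbers.csv"),
  row.names = FALSE,
  quote = FALSE
)

# Open the CSV lazily, turn non-integer strings into NA, cast to integer,
# and write the result as a Parquet dataset without calling collect().
open_dataset(csv_dir, format = "csv") |>
  mutate(
    number = as.integer(
      if_else(grepl("^-?[0-9]+$", number), number, NA_character_)
    )
  ) |>
  write_dataset(parquet_dir, format = "parquet")

# Inspect the schema of the written dataset.
open_dataset(parquet_dir)$schema
```

Since `write_dataset()` is given the lazy query directly, the data should be processed in batches rather than materialised as a single R data frame.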

**Reporter**: [Zsolt Kegyes-Brassai](https://issues.apache.org/jira/browse/ARROW-16833) / @kbzsl

<sub>**Note**: *This issue was originally created as [ARROW-16833](https://issues.apache.org/jira/browse/ARROW-16833). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

[R] how to enforce type conversion in open_dataset() #32162
