[R] Support for csv options on open_dataset #30600

@asfimport

Description

There are a lot of gotchas arising from the heterogeneity in arrow's support for csv parsing options between read_csv_arrow() and open_dataset() (and further issues arising from migrating from readr::read_csv()).  Not sure if it's more helpful to report these in one place or as separate issues, but here are a few that keep tripping me up:

 

  • "na" (defining the na-character choices) is not implemented on open_dataset(), though it is on read_csv_arrow()

  • somewhat confusingly, open_dataset() does support null_strings, which appears to play the same role.  The docs, however, suggest that open_dataset() ... options are passed to dataset_factory().  I think those docs should link to https://arrow.apache.org/docs/r/reference/CsvReadOptions.html .  https://arrow.apache.org/docs/r/reference/FileFormat.html suggests that null_strings is not one of the recognized CsvReadOptions, but it seems that it now is.  I appreciate the challenge of supporting both the readr-like options and the native arrow option names here, but the functionality and documentation remain very confusing!

    Another gotcha: in the arrow 6.0 release, if we supply an arrow schema, open_dataset() assumes the first line of the csv is data and not column headers, so we have to pass skip=1.  I see the logic (the schema names the columns anyway, so if we're going with those names, why parse the names from the csv?), but it's surprising, since when reading without a schema we do not use skip=1, and it's natural to want to declare column types while preserving the csv column names.  The error messages on doing so aren't helpful: if you forget skip=1, you are just told that any column that is not a string is "the incorrect type".  The open_dataset() docs imply that we can use read_csv_arrow() options, which suggests that we could provide types using col_types instead of schema, but this appears not to be the case.  Also
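
For anyone trying to reproduce the schema gotcha, here is a minimal sketch of the workaround described above. The file name, column names, and types are made up for illustration, and whether skip=1 is accepted (or still needed) may depend on the arrow version, so treat this as an assumption rather than documented behaviour.

```r
library(arrow)

# Hypothetical two-column csv written to a temporary directory for illustration.
csv_dir <- tempfile("csv_dataset_")
dir.create(csv_dir)
writeLines(c("id,score", "1,2.5", "2,3.5"), file.path(csv_dir, "part-0.csv"))

# Declare the column types explicitly with an arrow schema.
sch <- schema(id = int32(), score = float64())

# With a schema supplied, the header row is treated as data (as reported
# above for arrow 6.0), so skip = 1 is passed to drop it.
ds <- open_dataset(csv_dir, format = "csv", schema = sch, skip = 1)

# Materialise the dataset as a data frame (requires dplyr to be installed).
df <- dplyr::collect(ds)
```

Dropping skip = 1 from the call above is what should trigger the unhelpful "incorrect type" style of error when the data is read, since the header strings "id" and "score" cannot be converted to the declared numeric types.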

Reporter: Carl Boettiger / @cboettig

Note: This issue was originally created as ARROW-15088. Please see the migration documentation for further details.
