Closed as not planned
Description
I am attempting to use open_dataset() on a large collection of CSV files in which a timestamp column sometimes holds a plain date (e.g. 2021-02-01) and sometimes a timestamp with a timezone designator (e.g. 2021-02-01T00:00:00Z).
readr reads both formats without trouble when the column type is set to datetime (col_types = "T", see below), but arrow::read_csv_arrow() requires tz="UTC" in the schema for one format and rejects it for the other, so no single schema is valid for both. This is easiest to see in a simple example:
x <- tempfile()
df <- data.frame(time = '2021-02-01T00:00:00Z')
readr::write_csv(df, x)
schema <- arrow::schema(time = arrow::timestamp("s", ""))
# ERROR: cannot parse without tz="UTC" in the schema:
arrow::read_csv_arrow(x, schema = schema, skip = 1)
df2 <- readr::read_csv(x, col_types = "T") # works fine

df <- data.frame(time = '2021-02-01')
readr::write_csv(df, x)
# ERROR: cannot parse with tz="UTC" in the schema:
schema <- arrow::schema(time = arrow::timestamp("s", "UTC"))
arrow::read_csv_arrow(x, schema = schema, skip = 1)
# Once again, readr has no issues:
df2 <- readr::read_csv(x, col_types = "T")
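Until both formats parse under a single schema, one possible workaround is to read the column as strings and convert it in R afterwards. The following is only a minimal sketch under that assumption: `parse_mixed` is a hypothetical helper (not part of arrow or readr), and it assumes the column contains only the two formats shown above, both interpreted as UTC.

```r
library(arrow)

# A file mixing both formats in the same column
x <- tempfile()
writeLines(c("time", "2021-02-01T00:00:00Z", "2021-02-01"), x)

# Read the column as plain strings so the CSV reader does no timestamp parsing
tbl <- read_csv_arrow(x, schema = schema(time = string()), skip = 1)

# Hypothetical helper: try the full timestamp format first,
# then fall back to the date-only format for values that did not match
parse_mixed <- function(s) {
  out <- as.POSIXct(strptime(s, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"))
  date_only <- is.na(out) & !is.na(s)
  out[date_only] <- as.POSIXct(strptime(s[date_only], "%Y-%m-%d", tz = "UTC"))
  out
}

parsed <- parse_mixed(tbl$time)
```

This gives up Arrow's native timestamp parsing for that column and does the conversion in R memory, so it may not suit datasets too large to collect.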
Reporter: Carl Boettiger / @cboettig
Watchers: Rok Mihevc / @rok
Note: This issue was originally created as ARROW-15124. Please see the migration documentation for further details.