[R] unify_schemas=FALSE does not improve open_dataset() read times #33312
Comments
Alessandro Molina / @amol-: Lines 135 to 142 in afc6840
Is this expected behaviour?
Weston Pace / @westonpace:
Carl Boettiger / @cboettig:
Carl Boettiger / @cboettig:
Carl Boettiger / @cboettig:

forecast_schema <- function() {
  arrow::schema(
    target_id = arrow::string(),
    datetime = arrow::timestamp("us", timezone = "UTC"),
    parameter = arrow::string(),
    variable = arrow::string(),
    prediction = arrow::float64(),
    family = arrow::string(),
    reference_datetime = arrow::string(),
    site_id = arrow::string(),
    model_id = arrow::string(),
    date = arrow::string()
  )
}

s3 <- arrow::s3_bucket("neon4cast-forecasts/parquet/phenology",
                       endpoint_override = "data.ecoforecast.org",
                       anonymous = TRUE)
ds <- arrow::open_dataset(s3, schema = forecast_schema())
I want to echo @cboettig about this bug. I am currently making design decisions about partitioning that are influenced by how slow the open_dataset() call is.
@westonpace did you end up finding anything for this?
Ah, it seems that the problem is that the default is already FALSE:
@westonpace Thanks! Yeah, the timing I see is similar to the time it takes to list the contents of the bucket recursively. As you noted there, performance is much better when we can work in the same 'datacenter' (i.e. have our MinIO host on a VM in the same datacenter as the compute), but we want to support our typical end user, who will usually be on a laptop requesting a small subset of the partitions. In some cases we can write wrapper functions that call open_dataset() directly on the desired partition rather than the dataset root; it feels hacky, but maybe that is indeed the best strategy(?). It's fast, but not nearly as ergonomic as letting arrow + dplyr::filter() select those paths from the dataset root.
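A minimal sketch of such a wrapper, assuming the dataset is Hive-partitioned by a `model_id=` key (the helper names and the partition key are illustrative assumptions, not anything from the arrow package): it builds the path to one partition and opens the dataset at that prefix, so arrow never has to list the thousands of sibling files under the root.

```r
# Build the path to a single Hive-style partition under the dataset root.
# (Helper name and the "model_id=" key are illustrative assumptions.)
partition_path <- function(root, model_id) {
  paste0(root, "/model_id=", model_id)
}

# Open only the requested partition instead of the dataset root.
open_partition <- function(model_id, schema) {
  s3 <- arrow::s3_bucket(
    partition_path("neon4cast-forecasts/parquet/phenology", model_id),
    endpoint_override = "data.ecoforecast.org",
    anonymous = TRUE
  )
  arrow::open_dataset(s3, schema = schema)
}
```

The trade-off is that the partition key moves out of dplyr::filter() and into the function's arguments, which is exactly the loss of ergonomics described above.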
open_dataset() provides the very helpful optional argument unify_schemas=FALSE, which should let arrow inspect a single parquet file instead of touching potentially thousands of parquet files to determine a consistent unified schema. This ought to give a substantial performance improvement in contexts where the schema is known in advance.
Unfortunately, in my tests it seems to have no impact on performance. Consider the following reprexes:
default, unify_schemas = TRUE: about 32 seconds for me.
manual, unify_schemas = FALSE: also takes about 32 seconds.
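The elided reprexes were roughly of this shape (a reconstruction, not the original code; the bucket and endpoint come from the snippet quoted earlier in the thread, and the two calls are wrapped in functions only so the sketch can be sourced without network access):

```r
# default: unify_schemas = TRUE, so arrow opens and inspects every
# parquet file under the prefix to unify their schemas
time_default <- function(s3) {
  system.time(arrow::open_dataset(s3))
}

# manual: explicit schema with unify_schemas = FALSE, which should
# need to inspect at most a single file
time_manual <- function(s3, schema) {
  system.time(arrow::open_dataset(s3, schema = schema,
                                  unify_schemas = FALSE))
}

# Usage (requires network access to the public bucket):
# s3 <- arrow::s3_bucket("neon4cast-forecasts/parquet/phenology",
#                        endpoint_override = "data.ecoforecast.org",
#                        anonymous = TRUE)
# time_default(s3)
# time_manual(s3, forecast_schema())
```

Both paths taking ~32 seconds is consistent with the time being dominated by the recursive listing of the bucket rather than by schema inspection, which matches the observation above.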
Reporter: Carl Boettiger / @cboettig
Note: This issue was originally created as ARROW-18114. Please see the migration documentation for further details.