
[R] unify_schemas=FALSE does not improve open_dataset() read times #33312

Open
asfimport opened this issue Oct 20, 2022 · 9 comments

Comments


open_dataset() provides a very helpful optional argument, unify_schemas=FALSE, which should let arrow inspect a single parquet file instead of touching potentially thousands of parquet files to determine a unified schema. This ought to give a substantial performance increase in contexts where the schema is known in advance.

Unfortunately, in my tests it seems to have no impact on performance.  Consider the following reprexes:

Default, unify_schemas=TRUE:

library(arrow)
ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", endpoint_override = "data.ecoforecast.org", anonymous=TRUE)
bench::bench_time({
  open_dataset(ex)
})

About 32 seconds for me.

Manual, unify_schemas=FALSE:

bench::bench_time({
  open_dataset(ex, unify_schemas = FALSE)
})

Also takes about 32 seconds.

Reporter: Carl Boettiger / @cboettig

Note: This issue was originally created as ARROW-18114. Please see the migration documentation for further details.


Alessandro Molina / @amol-:
@westonpace I checked that the R bindings seem to be properly passing InspectOptions:

arrow/r/src/dataset.cpp

Lines 135 to 142 in afc6840

std::shared_ptr<ds::Dataset> dataset___DatasetFactory__Finish1(
    const std::shared_ptr<ds::DatasetFactory>& factory, bool unify_schemas) {
  ds::FinishOptions opts;
  if (unify_schemas) {
    opts.inspect_options.fragments = ds::InspectOptions::kInspectAllFragments;
  }
  return ValueOrStop(factory->Finish(opts));
}

Is this the expected behaviour?


Weston Pace / @westonpace:
Yes, I would expect there to be a difference. I'll try and get some time to debug this week and figure out what is going on.


Carl Boettiger / @cboettig:
Thanks Weston! Any update here?


Carl Boettiger / @cboettig:
Any update on this? I think we could realize a pretty substantial performance boost in both time and possibly RAM if unify_schemas=FALSE allowed us to avoid touching all the parquet files before we need to!


Carl Boettiger / @cboettig:
Just an additional comment: this behavior also occurs whether the schema is specified manually or determined from the first parquet file via unify_schemas=FALSE. Here's another, more extreme example owing to an even larger number of partitions:

forecast_schema <- function() {
  arrow::schema(target_id = arrow::string(),
                datetime = arrow::timestamp("us", timezone = "UTC"),
                parameter = arrow::string(),
                variable = arrow::string(),
                prediction = arrow::float64(),
                family = arrow::string(),
                reference_datetime = arrow::string(),
                site_id = arrow::string(),
                model_id = arrow::string(),
                date = arrow::string())
}

s3 <- arrow::s3_bucket("neon4cast-forecasts/parquet/phenology", endpoint_override = "data.ecoforecast.org", anonymous=TRUE)
ds <- arrow::open_dataset(s3, schema=forecast_schema()) 
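For reference, a minimal sketch (same bucket as above) to time just the recursive file listing, which gives a rough lower bound on what any open_dataset() call against this root must pay for file discovery:

```r
# Sketch: time the recursive S3 listing alone. If this dominates, the
# bottleneck is file discovery, not schema inspection or unification.
s3 <- arrow::s3_bucket("neon4cast-forecasts/parquet/phenology",
                       endpoint_override = "data.ecoforecast.org",
                       anonymous = TRUE)
bench::bench_time({
  files <- s3$ls(recursive = TRUE)
})
length(files)  # number of objects the dataset factory must discover
```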


rqthomas commented Feb 3, 2023

I want to echo @cboettig about this bug. I am currently making design decisions about partitioning that are influenced by how slow the open_dataset calls are on a dataset with a unified schema. Any status update on this issue? Thanks so much!

@assignUser
Member

@westonpace did you end up finding anything for this?

@westonpace
Member

Ah, it seems that the problem is that the default is already FALSE:

#' @param unify_schemas logical: should all data fragments (files, `Dataset`s)
#' be scanned in order to create a unified schema from them? If `FALSE`, only
#' the first fragment will be inspected for its schema. Use this fast path
#' when you know and trust that all fragments have an identical schema.
#' The default is `FALSE` when creating a dataset from a directory path/URI or
#' vector of file paths/URIs (because there may be many files and scanning may
#' be slow) but `TRUE` when `sources` is a list of `Dataset`s (because there
#' should be few `Dataset`s in the list and their `Schema`s are already in
#' memory).

The default is FALSE when creating a dataset from a directory path/URI or vector of file paths/URIs (because there may be many files and scanning may be slow).
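So for a directory/URI source, the two benchmarked calls at the top of this issue hit the same fast path, which is why their timings match. A minimal sketch of the three variants (bucket reused from the first comment):

```r
library(arrow)
ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min",
                endpoint_override = "data.ecoforecast.org",
                anonymous = TRUE)

# Default for path/URI sources: only the first fragment's schema is inspected.
ds1 <- open_dataset(ex)

# Explicit FALSE: identical to the default here, so no speedup is expected.
ds2 <- open_dataset(ex, unify_schemas = FALSE)

# Explicit TRUE: inspect every fragment and unify schemas; the slow path.
ds3 <- open_dataset(ex, unify_schemas = TRUE)
```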

@cboettig
Contributor

@westonpace Thanks! Yeah, the timing I see is similar to the time it takes to list the bucket contents recursively (s3$ls(recursive = TRUE)), as you noted in #34145, so that listing overhead, rather than the unify_schemas process, probably explains the timings in the examples above. I'll keep an eye on whatever you come up with in #34213.

As you noted there, performance is much better when we can work in the same 'datacenter' (i.e. have our MinIO host be on a VM in the same datacenter as the compute), but we want to support our typical end user, who will usually be on a laptop requesting a small subset of the partitions. In some cases we can write wrapper functions that call open_dataset() directly on the desired partition rather than the dataset root. It feels hacky, but maybe that is indeed the best strategy(?) It's fast, but not nearly as ergonomic as letting arrow + dplyr::filter() select those paths from the dataset root.
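A minimal sketch of that wrapper idea, assuming a hive-style model_id partition; the helper name, partition key, and model name here are hypothetical:

```r
# Hypothetical helper: open only the partition the user asked for,
# skipping the recursive listing of the entire dataset root.
open_partition <- function(bucket, model_id) {
  # $path() narrows the filesystem to the one partition subdirectory
  sub <- bucket$path(paste0("model_id=", model_id))
  arrow::open_dataset(sub, schema = forecast_schema())
}

s3 <- arrow::s3_bucket("neon4cast-forecasts/parquet/phenology",
                       endpoint_override = "data.ecoforecast.org",
                       anonymous = TRUE)
ds <- open_partition(s3, "climatology")  # hypothetical model_id
```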
