
[R] unify_schemas=FALSE does not improve open_dataset() read times #33312

Open
asfimport opened this issue Oct 20, 2022 · 9 comments

Comments


open_dataset() provides a very helpful optional argument, unify_schemas=FALSE, which should let arrow inspect a single parquet file instead of touching potentially thousands of parquet files to determine a unified schema. This ought to give a substantial performance increase in contexts where the schema is known in advance.

Unfortunately, in my tests it seems to have no impact on performance.  Consider the following reprexes:

Default, unify_schemas=TRUE:

library(arrow)
ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min", endpoint_override = "data.ecoforecast.org", anonymous=TRUE)
bench::bench_time({
  open_dataset(ex)
})

About 32 seconds for me.

Manual, unify_schemas=FALSE:

bench::bench_time({
  open_dataset(ex, unify_schemas = FALSE)
})

Also takes about 32 seconds.

Reporter: Carl Boettiger / @cboettig

Note: This issue was originally created as ARROW-18114. Please see the migration documentation for further details.


Alessandro Molina / @amol-:
@westonpace I checked that the R bindings seem to be properly passing InspectOptions:

arrow/r/src/dataset.cpp

Lines 135 to 142 in afc6840

std::shared_ptr<ds::Dataset> dataset___DatasetFactory__Finish1(
    const std::shared_ptr<ds::DatasetFactory>& factory, bool unify_schemas) {
  ds::FinishOptions opts;
  if (unify_schemas) {
    opts.inspect_options.fragments = ds::InspectOptions::kInspectAllFragments;
  }
  return ValueOrStop(factory->Finish(opts));
}

Is this the expected behaviour?


Weston Pace / @westonpace:
Yes, I would expect there to be a difference. I'll try and get some time to debug this week and figure out what is going on.


Carl Boettiger / @cboettig:
Thanks Weston! Any update here?


Carl Boettiger / @cboettig:
Any update on this? I think we could realize a pretty substantial performance boost in both time and possibly RAM if unify_schemas=FALSE allowed us to avoid touching all the parquet files before we need to!


Carl Boettiger / @cboettig:
Just an additional comment: this behavior also occurs whether the schema is specified manually or determined from the first parquet file via unify_schemas=FALSE. Here's another, more extreme example owing to an even larger number of partitions:

forecast_schema <- function() {
  arrow::schema(target_id = arrow::string(),
                datetime = arrow::timestamp("us", timezone = "UTC"),
                parameter = arrow::string(),
                variable = arrow::string(),
                prediction = arrow::float64(),
                family = arrow::string(),
                reference_datetime = arrow::string(),
                site_id = arrow::string(),
                model_id = arrow::string(),
                date = arrow::string())
}

s3 <- arrow::s3_bucket("neon4cast-forecasts/parquet/phenology", endpoint_override = "data.ecoforecast.org", anonymous=TRUE)
ds <- arrow::open_dataset(s3, schema=forecast_schema()) 
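For reference, a minimal sketch (same bucket as above) to time just the recursive file listing, which gives a rough lower bound on what any open_dataset() call against this root must pay for file discovery:

```r
# Sketch: time the recursive S3 listing alone. If this dominates, the
# bottleneck is file discovery, not schema inspection or unification.
s3 <- arrow::s3_bucket("neon4cast-forecasts/parquet/phenology",
                       endpoint_override = "data.ecoforecast.org",
                       anonymous = TRUE)
bench::bench_time({
  files <- s3$ls(recursive = TRUE)
})
length(files)  # number of objects the dataset factory must discover
```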


rqthomas commented Feb 3, 2023

I want to echo @cboettig about this bug. I am currently making design decisions about partitioning that are influenced by how slow the open_dataset calls are on a dataset with a unified schema. Any status update on this issue? Thanks so much!

@assignUser
Member

@westonpace did you end up finding anything for this?

@westonpace
Member

Ah, it seems that the problem is that the default is already FALSE:

#' @param unify_schemas logical: should all data fragments (files, `Dataset`s)
#' be scanned in order to create a unified schema from them? If `FALSE`, only
#' the first fragment will be inspected for its schema. Use this fast path
#' when you know and trust that all fragments have an identical schema.
#' The default is `FALSE` when creating a dataset from a directory path/URI or
#' vector of file paths/URIs (because there may be many files and scanning may
#' be slow) but `TRUE` when `sources` is a list of `Dataset`s (because there
#' should be few `Dataset`s in the list and their `Schema`s are already in
#' memory).

The default is FALSE when creating a dataset from a directory path/URI or vector of file paths/URIs (because there may be many files and scanning may be slow).
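So for a directory/URI source, the two benchmarked calls at the top of this issue hit the same fast path, which is why their timings match. A minimal sketch of the three variants (bucket reused from the first comment):

```r
library(arrow)
ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min",
                endpoint_override = "data.ecoforecast.org",
                anonymous = TRUE)

# Default for path/URI sources: only the first fragment's schema is inspected.
ds1 <- open_dataset(ex)

# Explicit FALSE: identical to the default here, so no speedup is expected.
ds2 <- open_dataset(ex, unify_schemas = FALSE)

# Explicit TRUE: inspect every fragment and unify schemas; the slow path.
ds3 <- open_dataset(ex, unify_schemas = TRUE)
```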

@cboettig
Contributor

@westonpace Thanks! Yeah, the timing I see is similar to the time it takes to list the bucket contents recursively (s3$ls(recursive = TRUE)), as you noted in #34145, so that listing overhead, rather than the unify_schemas process, probably explains the timings in the examples above. I'll keep an eye on whatever you come up with in #34213.

As you noted there, performance is much better when we can work in the same 'datacenter' (i.e. have our MinIO host be on a VM in the same datacenter as the compute), but we want to support our typical end user, who will usually be on a laptop requesting a small subset of the partitions. In some cases we can write wrapper functions that call open_dataset() directly on the desired partition rather than the dataset root. It feels hacky, but maybe that is indeed the best strategy(?) It's fast, but not nearly as ergonomic as letting arrow + dplyr::filter() select those paths from the dataset root.
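A minimal sketch of that wrapper idea, assuming a hive-style model_id partition; the helper name, partition key, and model name here are hypothetical:

```r
# Hypothetical helper: open only the partition the user asked for,
# skipping the recursive listing of the entire dataset root.
open_partition <- function(bucket, model_id) {
  # $path() narrows the filesystem to the one partition subdirectory
  sub <- bucket$path(paste0("model_id=", model_id))
  arrow::open_dataset(sub, schema = forecast_schema())
}

s3 <- arrow::s3_bucket("neon4cast-forecasts/parquet/phenology",
                       endpoint_override = "data.ecoforecast.org",
                       anonymous = TRUE)
ds <- open_partition(s3, "climatology")  # hypothetical model_id
```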
