[SPARK-17059][SQL] Allow FileFormat to specify partition pruning strategy #14649
Conversation
Force-pushed 4665798 to 1a00d1f
Test build #3223 has finished for PR 14649 at commit
@@ -703,6 +703,16 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
      }
    }
  }

  test("SPARK-17059: Allow Allow FileFormat to specify partition pruning strategy") {
Typo, will fix.
Also, if my understanding is correct, we are picking up only a single file to read the footer (see ParquetFileFormat.scala#L217-L225) unless we merge schemas. So, it seems, due to this reason, writing the _metadata summary file would be needed for this to take effect.
Let me see if I can answer these in order:
Yep, this attempts to do footer processing before launching tasks, if the _metadata file is available.
This should only be touching the _metadata file; it reads all the footers that are collected in there to determine which file paths to prune out. It needs _metadata and not _common_metadata because _common_metadata doesn't include footer info.
Yes, we have generation of the summary metadata disabled by default, but it is configurable via parquet.enable.summary-metadata.
Oh, sorry for the second question, I missed that part. So, this one reads only the _metadata file?
Yep, that's it. Effectively it just reduces task overhead for querying large Parquet datasets and turns filter queries where the filter never applies into a no-op: no tasks generated (only if summary metadata is available).
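To make the mechanism being described concrete, here is a minimal sketch of the pruning idea (helper names like `mayContainMatches` and `prunePaths` are illustrative, not the PR's actual code): read the footers collected in the _metadata summary, check the pushed filters against each row group's statistics, and keep only the file paths that could contain matching rows.

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.Footer
import org.apache.parquet.hadoop.metadata.BlockMetaData
import org.apache.spark.sql.sources.Filter

// Hypothetical helper: returns true when a row group's column statistics
// cannot rule out the filter. A real implementation would compare the
// pushed filter against the block's min/max statistics.
def mayContainMatches(filter: Filter, block: BlockMetaData): Boolean = true

// Sketch of the pruning step: keep only the paths whose footers (as
// collected in the _metadata summary file) could satisfy every filter.
def prunePaths(footers: Seq[Footer], filters: Seq[Filter]): Set[Path] =
  footers.filter { footer =>
    footer.getParquetMetadata.getBlocks.asScala.exists { block =>
      filters.forall(mayContainMatches(_, block))
    }
  }.map(_.getFile).toSet
```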
Can we rerun tests please? Issue should be fixed now.
Jenkins add to whitelist
Jenkins test this please
Test build #63848 has finished for PR 14649 at commit
cc @rxin @liancheng Let me know what you think
(Thanks Sean for whitelisting!)
Test build #64086 has finished for PR 14649 at commit
cc @davies @cloud-fan as well
@@ -220,6 +220,21 @@ trait FileFormat {
  }

  /**
   * Allow FileFormat's to have a pluggable way to utilize pushed filters to eliminate partitions
typo: FileFormats
This looks very helpful @andreweduffy! I see you have logging for how effective the partition filtering is. Do you have any rough benchmarks of particular workflows that were improved by this PR?
(As I am already here), I also think this should be helpful, in particular for S3 with Parquet. However, I am wondering if this might be a Parquet-specific optimization only; we don't have a metadata file for other file-based data sources. So, my personal opinion is to put this within Parquet without adding another interface. If we can implement this for other data sources in the future, I think we can add this interface then, but not now.
@ash211 Anecdotally, it's been tested in production and was able to take queries that previously required a full dataset scan of 100+ partitions down to 1-2 partitions (depending on how large the split sizes were). This sped up the query on the order of several seconds; imaginably the savings would be larger for much larger datasets. The dataset we've tested on was also stored in a local-network HDFS cluster, not in S3, so given that read cost is much higher in that environment, I would expect the savings there to be greater as well.

@HyukjinKwon I can see the benefit of making this Parquet-specific; however, to get the full benefit of this change the logic needs to execute where tasks are created for the scan. We could do special-casing logic directly in the Parquet read path, but an interface on FileFormat leaves the door open for other formats.
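To make the interface question concrete, the pluggable hook being argued for might look roughly like the following (the method name `filterFiles` and its signature are illustrative; the PR's actual API may differ). A default implementation keeps every file, so formats without usable metadata see no behavior change, while ParquetFileFormat could override it to consult the _metadata footers.

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.sources.Filter

trait FileFormat {
  // ... existing FileFormat methods elided ...

  /**
   * Illustrative hook: given pushed-down data source filters, return only
   * the files that might contain matching rows. The default keeps every
   * file, so formats without usable metadata are unaffected.
   */
  def filterFiles(
      sparkSession: SparkSession,
      filters: Seq[Filter],
      files: Seq[FileStatus]): Seq[FileStatus] = files
}
```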
addressed typos |
Test build #64647 has finished for PR 14649 at commit
Sorry for the late reply.

Firstly, Spark SQL only reads footers of all Parquet files in the case of schema merging, which can be controlled by the SQL option spark.sql.parquet.mergeSchema.

Secondly, although you mentioned "partition pruning", what the code change in this PR performs is actually Parquet row group filtering, which is already a feature of Spark SQL.

Thirdly, partition pruning is already implemented in Spark SQL. Furthermore, since partition pruning is handled inside the framework of Spark SQL, not only data source filters but also arbitrary Catalyst expressions can be used to prune partitions.

That said, I don't see benefits from this PR. Did I miss something here?
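For reference, the two existing behaviors mentioned above are driven by real SQL options; a quick illustration (assuming a spark-shell SparkSession named `spark`):

```scala
// Schema merging is what triggers reading the footers of all part-files;
// it is off by default and controlled by a SQL option or a per-read option:
spark.conf.set("spark.sql.parquet.mergeSchema", "false")
val df = spark.read.option("mergeSchema", "true").parquet("/path/to/data")

// Row group filtering happens through Parquet filter pushdown, on by default:
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
```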
Hi @liancheng, regardless of the PR description and the implementation, I would like to describe what I thought, and I hope this is sensible in a way. I thought in exactly the same way as you described. However, if you look at the code closely, it actually only takes care of the case when there is a _metadata summary file. So, this is rather about reducing the files to touch within each partition (as far as I know, we introduced an optimization that collects small files and puts them together into each partition). I might be wrong, but this was the reason I thought it should be helpful anyway, in particular for S3, where touching files itself is expensive.
Hyukjin mostly sums it up. This uses the summary metadata for Parquet when available. Rather than performing row group level filtering, it actually filters out entire files when summary metadata is available. It does this when it's constructing the FileScanRDD, which means it only spawns tasks for files that match the predicate.

At work we were running into issues with S3 deployments where very large S3 datasets would take exceedingly long to load in Spark. Empirically, we're running this exact patch in production, and for many types of queries we see a very large decrease in tasks created and time spent fetching from S3. So this is mainly for the use case of short-lived RDDs (so doing .persist doesn't help you) that are backed by data in S3 (so eliminating read time is actually a significant speedup).
@HyukjinKwon @andreweduffy Thanks for the explanations! This makes much more sense to me now. Although _metadata can be neat for the read path, it's a troublemaker for the write path: writing summary files (either _metadata or _common_metadata) can be quite expensive when writing a large Parquet dataset, since it reads footers from all files and tries to merge them. This can be especially frustrating when appending a small amount of data to an existing large dataset.

However, I still agree that with an existing trustworthy _metadata file at hand, this patch is still very useful. I'll take a deeper look at this tomorrow.
Glad that helped, sorry if it wasn't more clear. Agreed that writing summary metadata isn't always the best. In this patch, it only ever performs the file pruning if the _metadata file exists for the dataset. At work we have it enabled, since we have a query-heavy workload where new data lands occasionally.
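For anyone reproducing this setup: summary-file generation is toggled on the write path via a Hadoop configuration key; a minimal example (assuming a SparkSession `spark` and a DataFrame `df`):

```scala
// The Hadoop-side switch below governs whether _metadata/_common_metadata
// summary files are produced when writing Parquet (Spark 2.x leaves it
// off by default).
spark.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "true")

df.write.parquet("/path/to/dataset")
```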
Hi @andreweduffy, this looks very interesting. @guyGerson and I have been working on this for a while; we agree it can significantly reduce the number of tasks and files/objects touched, and agree on its importance for querying object storage systems like S3 and OpenStack Swift. We also like your idea of introducing a new interface to handle all formats, not just Parquet, although the benefit currently seems limited to Parquet data with _metadata summaries. By making a small extension to your pull request, the same idea could work for Parquet data without _metadata summaries, as well as for all file formats supported by Spark SQL. We have found this to be very useful in real-world use cases, for example:

Madrid transportation scenario: we presented this at Spark Summit Europe 2015. This use case utilizes Elasticsearch to index object storage metadata and search for objects relevant to a query, for example a query looking for rush hour traffic over the past five years. See https://spark-summit.org/eu-2015/events/how-spark-enables-the-internet-of-things-efficient-integration-of-multiple-spark-components-for-smart-city-use-cases/ (slides 9-10).

IoT data archive scenario: we have an external DB with the following information about sensor state:

sensorID -> type, manufacturer, version, isActive

Note that for each case we plug in different code which does the filtering. The filtering is application-specific and not necessarily related to the file format (Parquet/CSV/JSON etc.). Our patch keeps the filter interface more generic, with changes spanning fileSourceInterfaces.scala and DataSourceScanExec.scala. @andreweduffy please let us know what you think about extending your pull request in this way.
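A rough sketch of the shape such a generic, application-pluggable filter interface could take (all names here are illustrative; the actual patch may differ):

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.sql.sources.Filter

// Illustrative application-pluggable pruning hook, decoupled from the
// file format: an implementation might consult an external index
// (Elasticsearch, a sensor-state DB, ...) to decide which objects are
// worth touching at all.
trait FileFilter {
  def accept(file: FileStatus, filters: Seq[Filter]): Boolean
}

// Example implementation keyed off an external sensor-state lookup
// (`isActiveSensor` is a stand-in for the external DB query):
class SensorFileFilter extends FileFilter {
  private def isActiveSensor(fileName: String): Boolean = true // stub
  override def accept(file: FileStatus, filters: Seq[Filter]): Boolean =
    isActiveSensor(file.getPath.getName)
}
```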
Hey @paulata, sounds like you want to make the pruner pluggable at runtime. That sounds like it would be good, though IMO it is a bit orthogonal to this change. This is more of an optimization than a feature request, i.e. you're "no worse off" with this change performance-wise and all pre-existing behavior is preserved, whereas yours is a new feature. You're welcome to reuse the code in this PR for your own needs, or to base your changes on top of it. I have my doubts that the PR in its current form will be merged into Spark, but it's already been merged into Palantir's fork (palantir#40), so hopefully the folks there will try to upstream it at some point in the future.
@andreweduffy, are you fine with #15835? If so, we might be able to close this.
Yep yep, closing this one
What changes were proposed in this pull request?
Different FileFormat implementations may be able to make intelligent decisions about which files need to be processed as part of a FileSourceScanExec, based on the pushedFilters available. This PR implements a way to do that pluggably, with an implementation for Parquet that applies the summary metadata to prevent task creation. This has a few performance benefits:

- Reduces task scheduling overhead when querying large Parquet datasets, since files that cannot match the pushed filters never produce tasks.
- Turns filter queries where the filter never applies into a no-op with no tasks generated (when summary metadata is available).
- Cuts down time spent reading from remote storage such as S3.
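As an illustration of the intended effect (the path and column name below are made up): with summary metadata present, a filter that no file can satisfy should schedule no tasks.

```scala
import spark.implicits._

// Hypothetical dataset: many Parquet part-files plus a _metadata summary.
val events = spark.read.parquet("/datasets/events")

// With this patch, files whose footer statistics cannot match the pushed
// filter are dropped before task creation; if nothing can match, the scan
// becomes a no-op with zero tasks launched.
events.filter($"ts" > "2016-09-01").count()
```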
How was this patch tested?
Existing tests and Spark Shell, plus a unit test for the Parquet implementation.