
[SPARK-17636][SPARK-25557][SQL] Parquet and ORC predicate pushdown in nested fields #27155

Closed
wants to merge 7 commits

Conversation


@emaynardigs commented Jan 9, 2020

What changes were proposed in this pull request?

Firstly, much of this PR is a rebase of #22535, much thanks to @dbtsai for his work.

Spark can now push down predicates on struct columns when reading Parquet and ORC tables.

Why are the changes needed?

There are significant performance gains to be had from pushing down predicates.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests were extended to cover the new functionality.

Sanity check tests:

//// Setup ////
spark.range(1000 * 1000).toDF("id").selectExpr("id", "STRUCT(id x, STRUCT(CAST(id AS STRING) z) y) nested").write.mode("overwrite").parquet("/tmp/data")
spark.range(1000 * 1000).toDF("id").selectExpr("id", "STRUCT(id x, STRUCT(CAST(id AS STRING) z) y) nested").write.mode("overwrite").orc("/tmp/data_orc")
// Crude benchmark: average wall-clock time in ms over 100 runs.
def hack_benchmark(f: () => Any): Double = {
	(1 to 100).map { _ =>
		val start = System.currentTimeMillis
		f()
		System.currentTimeMillis - start
	}.sum / 100.0
}


//// Without patch ////
scala> spark.read.parquet("/tmp/data").filter("nested.x = 100").explain
== Physical Plan ==
*(1) Project [id#0L, nested#1]
+- *(1) Filter (isnotnull(nested#1) && (nested#1.x = 100))
   +- *(1) FileScan parquet [id#0L,nested#1] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], PushedFilters: [IsNotNull(nested)], ReadSchema: struct<id:bigint,nested:struct<x:bigint,y:string>>

scala> spark.read.orc("/tmp/data_orc").filter("nested.x = 100").explain
== Physical Plan ==
*(1) Project [id#9253L, nested#9254]
+- *(1) Filter (isnotnull(nested#9254) && (nested#9254.x = 100))
   +- *(1) FileScan orc [id#9253L,nested#9254] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/tmp/data_orc], PartitionFilters: [], PushedFilters: [IsNotNull(nested)], ReadSchema: struct<id:bigint,nested:struct<x:bigint,y:string>>

scala> hack_benchmark(spark.read.parquet("/tmp/data").filter("nested.x < 100").count _)
res0: Double = 419.82 

scala> hack_benchmark(spark.read.orc("/tmp/data_orc").filter("nested.x < 100").count _)
res5: Double = 1525.83  

//// With patch ////
scala> spark.read.parquet("/tmp/data").filter("nested.x = 100").explain
== Physical Plan ==
*(1) Project [id#54L, nested#55]
+- *(1) Filter (isnotnull(nested#55) AND (nested#55.x = 100))
   +- BatchScan[id#54L, nested#55] ParquetScan Location: InMemoryFileIndex[file:/tmp/data], ReadSchema: struct<id:bigint,nested:struct<x:bigint,y:string>>, PushedFilters: [EqualTo(nested.x,100)]

scala> spark.read.orc("/tmp/data_orc").filter("nested.x = 100").explain
== Physical Plan ==
*(1) Project [id#0L, nested#1]
+- *(1) Filter (isnotnull(nested#1) AND (nested#1.x = 100))
   +- BatchScan[id#0L, nested#1] OrcScan Location: InMemoryFileIndex[file:/tmp/data_orc], ReadSchema: struct<id:bigint,nested:struct<x:bigint,y:struct<z:string>>>, PushedFilters: [EqualTo(nested.x,100)]
            
scala> hack_benchmark(spark.read.parquet("/tmp/data").filter("nested.x < 100").count _)
res0: Double = 192.15                                                           

scala> hack_benchmark(spark.read.orc("/tmp/data_orc").filter("nested.x < 100").count _)
res1: Double = 1029.57                                                   

Note the significant performance improvement and the inclusion of the filter in PushedFilters in both cases.
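For quick reference, the averaged timings (ms per call over 100 runs) from the runs above:

Format    Without patch    With patch
Parquet   419.82           192.15
ORC       1525.83          1029.57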

@dongjoon-hyun
Member

Thank you for making a PR, @emaynardigs
However, could you build and test locally first?

@emaynardigs
Author

Yep, the failure is related; I was mistakenly testing locally only against the v2 ORC code path, but Jenkins failed on the v1 OrcFilters. Updating the v1 code path and re-testing.

@dongjoon-hyun
Member

ok to test

@SparkQA

SparkQA commented Jan 10, 2020

Test build #116456 has finished for PR 27155 at commit 3e78632.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Jan 10, 2020

Test build #116479 has finished for PR 27155 at commit 3e78632.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2020

Test build #116501 has finished for PR 27155 at commit f612321.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun left a comment


First of all, you must add @dbtsai's authorship by adding a commit with his authorship.

The following is not a standard way to keep the authorship.

Firstly, much of this PR is a rebase of #22535, much thanks to @dbtsai for his work.

Second, you need to address all the existing comments in the original PR. In the PR description, could you explain what the improvement here is over the original PR? If there is nothing new here, we had better close this one and ask @dbtsai to update his original PR.

@emaynardigs
Author

emaynardigs commented Jan 11, 2020

> First of all, you must add @dbtsai's authorship by adding a commit with his authorship.
>
> The following is not a standard way to keep the authorship.
>
> > Firstly, much of this PR is a rebase of #22535, much thanks to @dbtsai for his work.
>
> Second, you need to address all the existing comments in the original PR. In the PR description, could you explain what the improvement here is over the original PR? If there is nothing new here, we had better close this one and ask @dbtsai to update his original PR.

Hi @dongjoon-hyun, thanks for the feedback! Actually, when viewing the original PR I was unaware that @dbtsai is a member and, in fact, your colleague. Your concern definitely makes sense.

Firstly, I should say that there are actually no commits here under the original author's ownership; the code has diverged to the point now where, while I took some ideas & code from the original PR, it was easier to do all of this manually. I called this a "rebase", but it is a rebase only in the abstract sense that a lot of code was copied and updated for the latest master and not in the sense of any source control.

Secondly, I'll try to elaborate on why I opened a new PR...

  1. The original PR was abandoned with no activity or reply from the author in a year. The original author was asked to update his PR and did not respond.
  2. The original PR was flawed, failed several tests, and included some strange choices (like always splitting on a field that contained '.') that I did not agree with.
  3. The original PR did not extend to ORC and only worked for Parquet. My company also uses another binary format that the original PR would have been incompatible with but that this new approach can work with.

As you asked, I have addressed the comments in the original PR. I've extended the functionality to ORC as well as Parquet, tested it myself, and written more unit tests (largely adapted from the original PR), and am currently writing more pending approval of the basic code here. I would not say there is nothing new here.

@dongjoon-hyun
Member

@emaynardigs . In this case, usually, we close the second PR (yours) because the original is still alive. You can retry this after the first one is closed.

> Firstly, I should say that there are actually no commits here under the original author's ownership; the code has diverged to the point now where,

@dongjoon-hyun
Member

I'll leave this PR to @dbtsai .

@emaynardigs
Author

emaynardigs commented Jan 12, 2020

Yes, surely we should close one PR. The other one is inactive, fails tests, and doesn’t merge cleanly. This one has none of those issues and has more functionality.

I don’t mind closing this one out if the other PR can get us to the same place just as quickly, but that seems like it would take more work at this point? Either way, let’s make sure there is an active PR for this issue which can be merged in. As I have no control over any other PR this is my submission towards that end.

@emaynardigs
Author

@dongjoon-hyun I see this has changes requested; do I need to make any changes here?

Or is this just pending review?

@dbtsai
Member

dbtsai commented Jan 15, 2020

Hello @emaynardigs ,

Thank you for your contribution, and I do value your work a lot. In fact, at Apple, we are still using an updated version of #22535, which is critical to our production job. As far as I know, Databricks's runtime also has an implementation with a similar approach to tackle this issue.

The reason why I am inactive on my previous PR is that I feel adding nested support to the current filter API is a short-term solution, since the design doesn't consider these complex use cases. For a better long-term solution, I would like to create a new set of FilterV2 APIs in the DSv2 framework that make nested columns first-class. + @cloud-fan @rdblue @viirya for feedback on this.

I already started to work on the FilterV2 API, and here is the WIP code: https://github.com/dbtsai/spark/pull/10/files. The change is bigger than I thought, and now I am debating whether we actually need a new FilterV2 framework.

Feedback and ideas are welcome.

Thanks.

@emaynardigs
Author

> Hello @emaynardigs ,
>
> Thank you for your contribution, and I do value your work a lot. In fact, at Apple, we are still using an updated version of #22535, which is critical to our production job. As far as I know, Databricks's runtime also has an implementation with a similar approach to tackle this issue.
>
> The reason why I am inactive on my previous PR is that I feel adding nested support to the current filter API is a short-term solution, since the design doesn't consider these complex use cases. For a better long-term solution, I would like to create a new set of FilterV2 APIs in the DSv2 framework that make nested columns first-class. + @cloud-fan @rdblue @viirya for feedback on this.
>
> I already started to work on the FilterV2 API, and here is the WIP code: https://github.com/dbtsai/spark/pull/10/files. The change is bigger than I thought, and now I am debating whether we actually need a new FilterV2 framework.
>
> Feedback and ideas are welcome.
>
> Thanks.

Hey @dbtsai no worries, actually I suspected the silence was because you had moved this into a fork and were running with it :)

Actually, I think the core approach you took here is sufficient for most cases, right? The crux of my change was just porting it to the new APIs and looking at the schema itself to unpack nested columns instead of looking at the column name (I needed this for ORC anyway). Then it was pretty easy to add ORC support, as we use a fork of ORC internally while you use Parquet.
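
As a rough sketch of what "looking at the schema itself" means here (illustrative only, not the code in this PR): a recursive descent over the StructType yields unambiguous field paths without ever parsing a dotted name.

// Illustrative sketch only (not this PR's code): derive leaf field paths by walking the schema.
import org.apache.spark.sql.types.{DataType, StructType}

def leafFieldPaths(dt: DataType, prefix: Seq[String] = Nil): Seq[Seq[String]] = dt match {
  case s: StructType => s.fields.toSeq.flatMap(f => leafFieldPaths(f.dataType, prefix :+ f.name))
  case _             => Seq(prefix) // non-struct leaf: the accumulated path identifies it
}

// e.g. leafFieldPaths(spark.read.parquet("/tmp/data").schema)
//   => Seq(Seq("id"), Seq("nested", "x"), Seq("nested", "y", "z"))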

What complex cases do you think break under this PR?

@emaynardigs
Author

@dbtsai checking in again -- is there an edge case that you think doesn't work here? It would be nice to have updated filters, but seeing as you yourself are running code very much like this in a fork, wouldn't the right thing to do be to merge it upstream?

@emaynardigs
Author

@dongjoon-hyun @dbtsai pinging again for review; it doesn't seem there is any progress on another PR and as @dbtsai pointed out these performance improvements can be very helpful for production workloads in their current state.

@dbtsai
Member

dbtsai commented Feb 14, 2020

@emaynardigs I have been distracted by other work, but I finally found some time to continue this. The other approach mentioned above will take longer, so I'm thinking of submitting a PR based on our internal version (a modified version of #22535), which has proven stable and been in production for a while.

I need some time to do some cleanup, and I'll submit a PR so we can collaborate. I'll add you as an author for the collaboration. WDYT?

BTW, are you using this internally at your company? How does it perform?

Thanks.

@emaynardigs
Author

emaynardigs commented Feb 14, 2020

@dbtsai

> so I'm thinking of submitting a PR based on our internal version (a modified version of #22535), which has proven stable and been in production for a while.

My main reservation is that #22535 relies on a dot in the name of the field, and so cannot support ORC. The key difference in this PR is that it actually inspects the type of the field and extends the same functionality to Parquet and ORC. It's also already rebased onto the current master and so merges cleanly. I also added more tests. If we merge #22535, we'll need another PR to undo this logic and implement the same for ORC again.
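
To make that concrete (a hypothetical example, not from either PR): once a filter column is flattened to a dotted string, a top-level column literally named a.b and a nested field b inside a struct a become indistinguishable unless the name is quoted.

// Hypothetical illustration of the dot ambiguity.
val flat   = spark.range(1).selectExpr("id AS `a.b`")           // one top-level column named "a.b"
val nested = spark.range(1).selectExpr("STRUCT(id AS b) AS a")  // struct column "a" with field "b"
// Both map to the dotted string "a.b" in a name-based filter, so a reader of the
// pushed-down predicate cannot tell which column it refers to without quoting rules.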

No; I intended to cherry-pick this back after it merges, but if it doesn't get merged we'll probably end up using it in our fork, much like you've done.

@dbtsai
Member

dbtsai commented Feb 28, 2020

A new PR, #27728, has been submitted; can you take a look? We can add the ORC implementation on top of that once the PR is merged.

@dbtsai
Member

dbtsai commented Mar 28, 2020

@emaynardigs #27728 is merged. Are you interested in rebasing this PR on top of that? It should not be hard to support ORC now that we have a proper framework for nested predicate pushdown.

@emaynardigs
Author

@dbtsai That PR still relies on a dot in the column name, as I called out above. Not sure why you don't just parse the schema, like I was already doing in this PR?

I may rebase, but as we're on a 2.x fork, any dependency on the v2 filters shipping in 3.x isn't really useful to us.

@dbtsai
Member

dbtsai commented Mar 31, 2020

In this PR, you also use dots when creating the source filter API. This doesn't handle column names that themselves contain dots, which would need to be quoted properly. Since we have a proper parser for multipart identifiers that is proven and used everywhere, it's much easier to use dots in the source filter APIs.

The implementation of each data source can be different. I chose to use the key as a dot-separated string in Parquet for simplicity, but you can always do the schema-based approach.

  private def translateLeafNodeFilter(predicate: Expression): Option[Filter] = {
    // Recursively try to find an attribute name from the top level that can be pushed down.
    def attrName(e: Expression): Option[String] = e match {
      case a: Attribute if a.dataType != StructType =>
        Some(a.name)
      case s: GetStructField if s.childSchema(s.ordinal).dataType != StructType =>
        attrName(s.child).map(_ + s".${s.childSchema(s.ordinal).name}")
      case _ =>
        None
    }

@emaynardigs closed this Apr 1, 2020