
[SPARK-23203][SQL]: DataSourceV2: Use immutable logical plans. #20387

Closed

Conversation

rdblue
Contributor

@rdblue rdblue commented Jan 24, 2018

What changes were proposed in this pull request?

SPARK-23203: DataSourceV2 should use immutable catalyst trees instead of wrapping a mutable DataSourceV2Reader. This commit updates DataSourceV2Relation and consolidates much of the DataSourceV2 API requirements for the read path in it. Instead of wrapping a reader that changes, the relation lazily produces a reader from its configuration.

This commit also updates the predicate and projection push-down. Instead of the implementation from SPARK-22197, this reuses the rule matching from the Hive and DataSource read paths (using PhysicalOperation) and copies most of the implementation of SparkPlanner.pruneFilterProject, with updates for DataSourceV2. By reusing the implementation from other read paths, this should have fewer regressions from other read paths and is less code to maintain.
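A minimal Scala sketch of that shape (hedged: this is not the exact code in the PR, it uses the DataSourceReader name from the released 2.3 API, and the reader construction is elided behind a placeholder):

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.sources.v2.DataSourceV2
import org.apache.spark.sql.sources.v2.reader.DataSourceReader

case class DataSourceV2Relation(
    source: DataSourceV2,
    options: Map[String, String],
    path: Option[String] = None,
    table: Option[TableIdentifier] = None,
    projection: Option[Seq[AttributeReference]] = None,
    filters: Option[Seq[Expression]] = None) extends LeafNode {

  // hypothetical helper: builds a fresh reader from `source` and pushes the
  // current projection and filters into it (details elided in this sketch)
  private def createReader(): DataSourceReader = ???

  // the reader is derived lazily from the relation's configuration; push-down
  // produces a new relation via copy() instead of mutating this one
  private lazy val reader: DataSourceReader = createReader()

  // the relation reports exactly what the reader returns, which may be only a
  // partial match of the requested projection
  override lazy val output: Seq[AttributeReference] = reader.readSchema().toAttributes
}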

The new push-down rules also support the following edge cases:

  • The output of DataSourceV2Relation should be what is returned by the reader, in case the reader can only partially satisfy the requested schema projection
  • The requested projection passed to the DataSourceV2Reader should include filter columns
  • The push-down rule may be run more than once if filters are not pushed through projections

How was this patch tested?

Existing push-down and read tests.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86598 has finished for PR 20387 at commit d3233e1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StreamingDataSourceV2Relation(

@rdblue rdblue changed the title SPARK-22386: DataSourceV2: Use immutable logical plans. [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immutable logical plans. Jan 24, 2018
@rdblue rdblue force-pushed the SPARK-22386-push-down-immutable-trees branch from d3233e1 to 9c4dcb5 on January 24, 2018 19:19
@rdblue
Contributor Author

rdblue commented Jan 24, 2018

@cloud-fan, please have a look at these changes. This will require follow-up for the Streaming side. I have yet to review the streaming interfaces for DataSourceV2, so I haven't made any changes there.

In our Spark build, I've also moved the write path to use DataSourceV2Relation, which I intend to do in a follow-up to this issue.

@rxin FYI.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86600 has finished for PR 20387 at commit 9c4dcb5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StreamingDataSourceV2Relation(

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86601 has finished for PR 20387 at commit ac58844.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86602 has finished for PR 20387 at commit 83203a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

source: DataSourceV2,
options: Map[String, String],
path: Option[String] = None,
table: Option[TableIdentifier] = None,
Contributor

why do we need these 2 parameters? Can't we get them from options when needed?

Contributor Author

We could keep these in options, but because they are the main two ways to identify tables, they should be easier to work with. I'd even suggest adding them to the DataSourceV2 read and write APIs.

Another benefit of adding these is that it is easier to use DataSourceV2Relation elsewhere. In our Spark build, I've added a rule to convert Hive relations to DataSourceV2Relation based on a table property. That's cleaner because we can pass the TableIdentifier instead of adding options to the map.

Contributor Author

I guess another way to say this is that it's better to set reliable path, database, and table parameters by passing them explicitly than to require all the places where DataSourceV2Relations are created to do the same thing. Better to standardize passing these options in v2Options, and it would be even better to pass these directly to the readers and writers.

Contributor

But not all data sources have a path and table name. If you feel strongly about it, we can add two methods that extract the path and table from options.

Contributor Author

That's why these are options. Passing either path or table name is the most common case, which we should have good support for. If tables are identified in other ways, that's supported.

path: Option[String] = None,
table: Option[TableIdentifier] = None,
projection: Option[Seq[AttributeReference]] = None,
filters: Option[Seq[Expression]] = None,
Contributor

so every time we add a new push down interface, we need to add parameters here too?

Contributor Author

I'm not sure I understand what you mean. When something is pushed, it creates a new immutable relation, so I think it has to be added to the relation. But I'm not sure that many things will be pushed besides the projection and filters. What are you thinking that we would need to add? Fragments of logical plan?

Assuming we add the ability to push parts of the logical plan, then this would need to have a reference to the part that was pushed down. I'm not sure that would be this relation class, a subclass, or something else, but I would be fine adding a third push-down option here. The number of things to push down isn't very large, is it?

Contributor

I like this pattern. I think it is important that the arguments to a query plan node are comprehensive so that it is easy to understand what is going on in the output of explain().

@cloud-fan
Contributor

overall I think it's a good idea to make the plan immutable.

@cloud-fan
Contributor

I dug into the commit history and recalled why I made these decisions:

  • having a mutable DataSourceV2Relation. This is mostly to avoid adding more and more constructor parameters to DataSourceV2Relation and to keep the code easy to maintain. I'm OK with making it immutable if there is a significant benefit.
  • not using PhysicalOperation. This is because we will add more push-down optimizations (e.g. limit, aggregate, join), and we have a specific push-down order for them. It's hard to extend PhysicalOperation to support more operators and specific push-down orders, so I created a new one. Eventually all data sources will be implemented as data source v2, so PhysicalOperation will go away.

The output of DataSourceV2Relation should be what is returned by the reader, in case the reader can only partially satisfy the requested schema projection

Good catch! Since DataSourceV2Reader is mutable, the output can't be fixed, as it may change when we apply data source optimizations. Using lazy val output ... can fix this.

The requested projection passed to the DataSourceV2Reader should include filter columns

I did this intentionally. If a column is only referred to by pushed filters, Spark doesn't need this column. Even if we require this column from the data source, we just read it out and wait for it to be pruned by the next operator.

The push-down rule may be run more than once if filters are not pushed through projections

This looks weird, do you have a query to reproduce this issue?

This updates DataFrameReader to parse locations that do not look like paths as table names and pass the result as "database" and "table" keys in v2 options.

Personally I'd suggest using spark.read.format("iceberg").option("table", "db.table").load(), since load is defined as def load(paths: String*), but I think your usage looks better. The communication protocol between Spark and a data source is options, so I'd suggest that we just propagate the paths parameter to options, and data source implementations are free to interpret the path option however they want, e.g. as table and database names.

@rdblue
Contributor Author

rdblue commented Jan 29, 2018

I'm ok to make it immutable if there is an significant benefit.

Mutable nodes violate a basic assumption of catalyst, that trees are immutable. Here's a good quote from the SIGMOD paper (by @rxin, @yhuai, and @marmbrus et al.):

In our experience, functional transformations on immutable trees make the whole optimizer very easy to reason about and debug. They also enable parallelization in the optimizer, although we do not yet exploit this.

Mixing mutable nodes into supposedly immutable trees is a bad idea. Other nodes in the tree assume that children do not change.

@rdblue
Contributor Author

rdblue commented Jan 29, 2018

I'd suggest that we just propagate the paths parameter to options, and data source implementations are free to interpret the path option however they want, e.g. as table and database names.

What about code paths that expect table names? In our branch, we've added support for converting Hive relations (which have a TableIdentifier, not a path) and using insertInto. Table names and paths are the two main ways to identify tables, and I think both should be supported.

This is a new API, so it doesn't matter that load and save currently use paths. We can easily update that to support tables. If we don't, then there will be no common way to refer to tables: some implementations will use table, some will pass db separately, and some might use database. Standardizing this and adding support in Spark will produce more consistent behavior across data sources.

@rdblue
Contributor Author

rdblue commented Jan 29, 2018

[The push-down rule may be run more than once if filters are not pushed through projections] looks weird, do you have a query to reproduce this issue?

One of the DataSourceV2 tests hit this. I thought it was a good thing to push a single node down at a time and not depend on order.

@rdblue
Contributor Author

rdblue commented Jan 29, 2018

It's hard to extend PhysicalOperation to support more operators and specific push-down orders, so I created a new one

I'm concerned about the new one. The projection support seems really brittle because it calls out specific logical nodes and scans the entire plan. If we are doing push-down wrong on the current v1 and Hive code paths, then I'd like to see a proposal for fixing that without these drawbacks.

I like that this PR pushes projections and filters just like the other paths. We should start there and add additional push-down as necessary.

@cloud-fan
Contributor

This is a new API...

Are you saying you want to add a new method in DataFrameReader that is different from load? In Scala, the parameter name is part of the method signature, so for def load(path: String) we can't change its semantics: the parameter is a path. It's fine if a data source implementation teaches its users that the path will be interpreted as database/table names, but this should not be a contract in Spark.

I do agree that Spark should set a standard for specifying database and table, as it's very common. We can even argue that path is not a general concept for data sources, but we still provide special APIs for path.

My proposal: how about we add a new method table in DataFrameReader? The usage would look like spark.read.format("iceberg").table("db.table").load(). What do you think? We should not specify database separately, as we may have catalog federation and a table name may have 3 parts: catalog.db.table. Let's keep it general and let the data source interpret it.

@cloud-fan
Contributor

I thought it was a good thing to push a single node down at a time and not depend on order.

The order must be taken care of. For example, we can't push down a limit through a Filter unless the entire filter is pushed into the data source. Generally, if we push down multiple operators into a data source, we should clearly define the order in which the data source applies them.
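A small illustration of that ordering constraint (hedged: assumes a SparkSession and a table t with an integer column a; not code from this PR):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-order").getOrCreate()
import spark.implicits._

// filter every row, then take 10: the result depends on all matching rows
val filtered = spark.table("t").filter($"a" > 5).limit(10)

// take 10 arbitrary rows, then filter: may return fewer rows, and different ones,
// which is why a limit cannot be reordered below a filter unless the filter is
// fully handled by the data source
val reordered = spark.table("t").limit(10).filter($"a" > 5)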

pushedFilters: Seq[Expression]) = {
val newReader = userSchema match {
case Some(s) =>
asReadSupportWithSchema.createReader(s, v2Options)
Contributor

I like this idea. Although DataSourceReader is mutable, we can create a new one every time we want to apply the operator pushdowns.
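A rough sketch of that pattern (hedged: names are simplified and this is not the PR's code; it uses the DataSourceV2 read interfaces as they ship in Spark 2.3):

import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.{DataSourceOptions, ReadSupport, ReadSupportWithSchema}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownFilters, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Build a brand-new reader whenever the pushed projection or filters change,
// so no reader is mutated after it has been asked for its schema.
def newReader(
    source: AnyRef,                        // a DataSourceV2 implementing a ReadSupport mixin
    v2Options: DataSourceOptions,
    userSchema: Option[StructType],
    requiredSchema: Option[StructType],
    filters: Array[Filter]): DataSourceReader = {
  val reader = (source, userSchema) match {
    case (s: ReadSupportWithSchema, Some(schema)) => s.createReader(schema, v2Options)
    case (s: ReadSupport, _)                      => s.createReader(v2Options)
    case _ => throw new IllegalArgumentException("source does not support reads")
  }
  reader match {
    case r: SupportsPushDownFilters =>
      val unsupported = r.pushFilters(filters)   // filters the source could not handle;
                                                 // Spark would re-apply these after the scan
    case _ =>
  }
  reader match {
    case r: SupportsPushDownRequiredColumns =>
      requiredSchema.foreach(schema => r.pruneColumns(schema))
    case _ =>
  }
  reader
}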

@rdblue
Contributor Author

rdblue commented Jan 31, 2018

@felixcheung, yes, we do already have a table option. That creates an UnresolvedRelation with the parsed table name as a TableIdentifier, which is not currently compatible with DataSourceV2 because there is no standard way to pass the identifier's db and table name.

Part of the intent here is to add support in DataSourceV2Relation for cases where we have a TableIdentifier, so that we can add a resolver rule that replaces UnresolvedRelation with DataSourceV2Relation. This is what we do in our Spark branch.
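A hypothetical sketch of such a resolver rule, using the relation constructor shown in the diff above (the rule name, isV2Table, and lookupSource are illustrative only, not code from this PR or from that branch):

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.sources.v2.DataSourceV2

object ResolveV2Tables extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
    case u: UnresolvedRelation if isV2Table(u.tableIdentifier) =>
      DataSourceV2Relation(
        source = lookupSource(u.tableIdentifier),   // find the DataSourceV2 implementation
        options = Map.empty,
        table = Some(u.tableIdentifier))            // pass the identifier directly, not through options
  }

  // placeholders: e.g. check a table property, and look up the source by name
  private def isV2Table(ident: TableIdentifier): Boolean = ???
  private def lookupSource(ident: TableIdentifier): DataSourceV2 = ???
}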

@cloud-fan, what is your objection to support like this?

@rdblue
Contributor Author

rdblue commented Jan 31, 2018

spark.read.format("iceberg").table("db.table").load()

I'm fine with this if you think it is confusing to parse the path as a table name in load. I think it is reasonable.

I'd still like to keep the Option[TableIdentifier] parameter on the relation, so that we can support table or insertInto on the write path.

@rdblue
Contributor Author

rdblue commented Jan 31, 2018

@cloud-fan, to your point about push-down order, I'm not saying that order doesn't matter at all, I'm saying that the push-down can run more than once and it should push the closest operators. That way, if you have a situation where operators can't be reordered but they can all be pushed, they all get pushed through multiple runs of the rule, each one further refining the relation.

If we do it this way, then we don't need to traverse the logical plan to find out what to push down. We continue pushing projections until the plan stops changing. This is how the rest of the optimizer works, so I think it is a better approach from a design standpoint.

My implementation also reuses more existing code that we have higher confidence in, which is a good thing. We can add things like limit pushdown later, by adding it properly to the existing code. I don't see a compelling reason to toss out the existing implementation, especially without the same level of testing.

@rdblue
Contributor Author

rdblue commented Jan 31, 2018

Let's keep it general and let the data source interpret it.

I think this is the wrong approach. The reason why we are using a special DataSourceOptions object is to ensure that data sources consistently ignore case when reading their own options. Consistency across data sources matters and we should be pushing for more consistency, not less.
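For reference, a quick illustration of that case-insensitivity rule, using the DataSourceOptions API as it ships in Spark 2.3:

import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.DataSourceOptions

val opts = new DataSourceOptions(Map("Table" -> "db.t", "paTh" -> "/data/t").asJava)
assert(opts.get("table").get == "db.t")    // key lookup ignores case
assert(opts.get("PATH").get == "/data/t")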

@rdblue
Contributor Author

rdblue commented Jan 31, 2018

@dongjoon-hyun, @gatorsmile, could you guys weigh in on this discussion? I'd like to get additional perspectives on the changes I'm proposing.

@cloud-fan
Contributor

Currently DataSourceOptions is the major way for Spark and users to pass information to the data source. It's very flexible and only defines one rule: the option key lookup should be case-insensitive.

I agree with your point that more consistency is better. It's annoying if every data source needs to define their own option keys for table and database, and tell users about it. It's good if Spark can define some rules about what option keys should be used for some common information.

My proposal:

class DataSourceOptions {
  ...
  
  def getPath(): String = get("path")

  def getTimeZone(): String = get("timeZone")

  def getTableName(): String = get("table")
}

We can keep adding these options since this won't break binary compatibility.

And then we just need to document it and tell both users and data source developers about how to specify and retrieve these common options.

Then I think we don't need to add table and database parameters to DataSourceV2Relation, because we can easily do relation.options.getTable.

BTW this doesn't change the API so I think it's fine to do it after 2.3.

@cloud-fan
Contributor

We can add things like limit pushdown later, by adding it properly to the existing code.

I tried and couldn't figure out how to do it with PhysicalOperation; that's why I built something new for data source v2 pushdown. I'm OK with reusing it if you can convince me PhysicalOperation is extensible, e.g. to support limit push-down.

@cloud-fan
Contributor

Hi @rdblue, I think we all agree that the plan should be immutable, but other parts are still under discussion. Can you send a new PR that focuses on making the plan immutable, so that we can merge that one first and continue to discuss the other parts in this PR?

@dongjoon-hyun
Member

+1 for @cloud-fan 's suggestion.

@rdblue rdblue force-pushed the SPARK-22386-push-down-immutable-trees branch from 3b55609 to 1a603db on February 17, 2018 21:18
@rdblue
Contributor Author

rdblue commented Feb 17, 2018

Thanks for the update! Enjoy your vacation, and thanks for letting me know.

@SparkQA

SparkQA commented Feb 18, 2018

Test build #87532 has finished for PR 20387 at commit 1a603db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val canonicalOutput: Seq[AttributeReference] = this.output
.map(a => QueryPlan.normalizeExprId(a, projection))

new DataSourceV2Relation(c.source, c.options, c.projection) {
Contributor

This is hacky but I don't have a better idea now, let's revisit it later.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in aadf953 Feb 20, 2018
@rdblue
Contributor Author

rdblue commented Feb 20, 2018

Thanks for all your help getting this committed, @cloud-fan!

asfgit pushed a commit that referenced this pull request Feb 21, 2018
…, but not supported.

## What changes were proposed in this pull request?

DataSourceV2 initially allowed user-supplied schemas when a source doesn't implement `ReadSupportWithSchema`, as long as the schema was identical to the source's schema. This is confusing behavior because changes to an underlying table can cause a previously working job to fail with an exception that user-supplied schemas are not allowed.

This reverts commit adcb25a0624, which was added to #20387 so that it could be removed in a separate JIRA issue and PR.

## How was this patch tested?

Existing tests.

Author: Ryan Blue <blue@apache.org>

Closes #20603 from rdblue/SPARK-23418-revert-adcb25a0624.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019

Author: Ryan Blue <blue@apache.org>

Closes apache#20387 from rdblue/SPARK-22386-push-down-immutable-trees.

(cherry picked from commit aadf953)

Conflicts:
	external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019

Author: Ryan Blue <blue@apache.org>

Closes apache#20387 from rdblue/SPARK-22386-push-down-immutable-trees.

(cherry picked from commit aadf953)

Conflicts:
	external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023

Author: Ryan Blue <blue@apache.org>

Closes apache#20387 from rdblue/SPARK-22386-push-down-immutable-trees.

Ref: LIHADOOP-48531
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…, but not supported.


Author: Ryan Blue <blue@apache.org>

Closes apache#20603 from rdblue/SPARK-23418-revert-adcb25a0624.

Ref: LIHADOOP-48531