
[SPARK-23315][SQL] failed to get output from canonicalized data source v2 related plans #20485

Closed
wants to merge 1 commit into apache:master from cloud-fan:canonicalize

Conversation

cloud-fan
Contributor

cloud-fan commented Feb 2, 2018

What changes were proposed in this pull request?

`DataSourceV2Relation` keeps a `fullOutput` and resolves the real output on demand by column name lookup, i.e.

```
lazy val output: Seq[Attribute] = reader.readSchema().map(_.name).map { name =>
  fullOutput.find(_.name == name).get
}
```

This will be broken after we canonicalize the plan, because all attribute names become "None", see https://github.com/apache/spark/blob/v2.3.0-rc1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L42

To fix this, `DataSourceV2Relation` should just keep `output`, and update the `output` when doing column pruning.
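
To see the failure concretely, here is a minimal standalone sketch of the broken lookup (hypothetical simplified types standing in for Spark's real classes, not the actual implementation):

```
// Hypothetical simplified model of the bug.
case class Attribute(name: String)

case class Relation(fullOutput: Seq[Attribute], requiredNames: Seq[String]) {
  // Resolve output by name lookup, like the old DataSourceV2Relation.
  lazy val output: Seq[Attribute] = requiredNames.map { name =>
    fullOutput.find(_.name == name).get
  }
  // Canonicalization replaces every attribute name with the same placeholder,
  // so the name lookup above can no longer find anything.
  def canonicalized: Relation =
    copy(fullOutput = fullOutput.map(_ => Attribute("none")))
}

object BugDemo extends App {
  val rel = Relation(Seq(Attribute("a"), Attribute("b")), Seq("a"))
  println(rel.output)               // List(Attribute(a)) -- works
  println(rel.canonicalized.output) // throws NoSuchElementException -- the bug
}
```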

How was this patch tested?

a new test case

case _ =>
  // No more columns to prune: still rebuild `output` from the reader's
  // current read schema, looking up the existing attributes by name.
  val nameToAttr = relation.output.map(_.name).zip(relation.output).toMap
  val newOutput = reader.readSchema().map(_.name).map(nameToAttr)
  relation.copy(output = newOutput)
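
Because `newOutput` reuses the attributes already present in `relation.output` rather than creating new ones, attribute identity is preserved and references from parent plan nodes remain valid; only the column set and order change to match the reader's read schema.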
Contributor Author

@rdblue This is the bug I mentioned before. Finally I figured out a way to fix it surgically: always run column pruning, even when no columns need to be pruned. This lets us correct the reader's required schema in case it was updated by someone else.
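
A rough sketch of that idea, continuing the hypothetical model from the description above (again simplified, not the actual Spark rule):

```
// The fixed relation stores `output` directly; pruning always rebuilds it
// from the reader's current read schema. Reuses `Attribute` from the
// earlier sketch.
case class PrunedRelation(output: Seq[Attribute])

def pruneColumns(relation: PrunedRelation, readerSchemaNames: Seq[String]): PrunedRelation = {
  // Look up the existing attributes by name, so each one is reused as-is.
  val nameToAttr = relation.output.map(a => a.name -> a).toMap
  PrunedRelation(output = readerSchemaNames.map(nameToAttr))
}

// When nothing was pruned this is a no-op; when the reader's schema was
// changed elsewhere, it corrects `output` to match.
```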

@cloud-fan
Contributor Author

@SparkQA

SparkQA commented Feb 2, 2018

Test build #86979 has finished for PR 20485 at commit 75950a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 2, 2018

Test build #86981 has finished for PR 20485 at commit 3aa0438.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Feb 2, 2018

To be clear, the purpose of this commit, like #20476, is just to get something working for the 2.3.0 release?

I just want to make sure since I think we should be approaching these problems with a better initial design for the integration. I'm fine getting this in to unblock a release, but if it isn't for that purpose then I think we should fix the design problems first.

@gatorsmile
Member

@rdblue This is another bug we found during the code review. The goal is to ensure Data Source API V2 is usable with at least the same feature set as Data Source API V1.

After getting more feedback about Data Source API V2 from the community, we will restart the discussion about the data source API design in the next release.

@rdblue
Contributor

rdblue commented Feb 2, 2018

Sounds fine to me, then.

My focus is on the long-term design issues. I still think that making plans immutable and reusing the existing push-down code as much as possible are the best way to get a reliable 2.3.0, but it is fine if those changes don't make the release.

@tdas
Contributor

tdas commented Feb 2, 2018

jenkins retest this please

@SparkQA

SparkQA commented Feb 2, 2018

Test build #86999 has finished for PR 20485 at commit 3aa0438.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Feb 3, 2018

Test build #87013 has finished for PR 20485 at commit 3aa0438.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 6, 2018

Test build #87086 has finished for PR 20485 at commit 3aa0438.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Feb 6, 2018

@gatorsmile @rdblue please review and LGTM this. This will unblock my PR, #20445.


case relation: DataSourceV2Relation => relation.reader match {
  case reader: SupportsPushDownRequiredColumns =>
    // TODO: Enable the below assert after we make `DataSourceV2Relation` immutable. Fow now
Member

Typo: Fow

@gatorsmile
Member

LGTM. Thanks! Merged to master/2.3.

asfgit pushed a commit that referenced this pull request Feb 6, 2018
[SPARK-23315][SQL] failed to get output from canonicalized data source v2 related plans

## What changes were proposed in this pull request?

`DataSourceV2Relation`  keeps a `fullOutput` and resolves the real output on demand by column name lookup. i.e.
```
lazy val output: Seq[Attribute] = reader.readSchema().map(_.name).map { name =>
  fullOutput.find(_.name == name).get
}
```

This will be broken after we canonicalize the plan, because all attribute names become "None", see https://github.com/apache/spark/blob/v2.3.0-rc1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L42

To fix this, `DataSourceV2Relation` should just keep `output`, and update the `output` when doing column pruning.

## How was this patch tested?

a new test case

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20485 from cloud-fan/canonicalize.

(cherry picked from commit b96a083)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
asfgit closed this in b96a083 Feb 6, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018
[SPARK-23315][SQL] failed to get output from canonicalized data source v2 related plans