
[SPARK-23315][SQL] failed to get output from canonicalized data source v2 related plans #20485

Closed
wants to merge 1 commit into apache:master from cloud-fan:canonicalize

Conversation

cloud-fan
Contributor

cloud-fan commented Feb 2, 2018

What changes were proposed in this pull request?

`DataSourceV2Relation` keeps a `fullOutput` and resolves the real output on demand by column name lookup, i.e.

```
lazy val output: Seq[Attribute] = reader.readSchema().map(_.name).map { name =>
  fullOutput.find(_.name == name).get
}
```

This will be broken after we canonicalize the plan, because all attribute names become "None", see https://github.com/apache/spark/blob/v2.3.0-rc1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L42

To fix this, `DataSourceV2Relation` should just keep `output`, and update the `output` when doing column pruning.
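
To see the failure concretely, here is a minimal standalone sketch of the broken lookup (hypothetical simplified types standing in for Spark's real classes, not the actual implementation):

```
// Hypothetical simplified model of the bug.
case class Attribute(name: String)

case class Relation(fullOutput: Seq[Attribute], requiredNames: Seq[String]) {
  // Resolve output by name lookup, like the old DataSourceV2Relation.
  lazy val output: Seq[Attribute] = requiredNames.map { name =>
    fullOutput.find(_.name == name).get
  }
  // Canonicalization replaces every attribute name with the same placeholder,
  // so the name lookup above can no longer find anything.
  def canonicalized: Relation =
    copy(fullOutput = fullOutput.map(_ => Attribute("none")))
}

object BugDemo extends App {
  val rel = Relation(Seq(Attribute("a"), Attribute("b")), Seq("a"))
  println(rel.output)               // List(Attribute(a)) -- works
  println(rel.canonicalized.output) // throws NoSuchElementException -- the bug
}
```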

How was this patch tested?

a new test case

case _ =>
  // No more columns to prune: still rebuild `output` from the reader's
  // current read schema, looking up the existing attributes by name.
  val nameToAttr = relation.output.map(_.name).zip(relation.output).toMap
  val newOutput = reader.readSchema().map(_.name).map(nameToAttr)
  relation.copy(output = newOutput)
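
Because `newOutput` reuses the attributes already present in `relation.output` rather than creating new ones, attribute identity is preserved and references from parent plan nodes remain valid; only the column set and order change to match the reader's read schema.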
Contributor Author

@rdblue This is the bug I mentioned before. Finally I figured out a way to fix it surgically: always run column pruning, even when no columns need to be pruned. This lets us correct the reader's required schema in case it was updated by someone else.
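
A rough sketch of that idea, continuing the hypothetical model from the description above (again simplified, not the actual Spark rule):

```
// The fixed relation stores `output` directly; pruning always rebuilds it
// from the reader's current read schema. Reuses `Attribute` from the
// earlier sketch.
case class PrunedRelation(output: Seq[Attribute])

def pruneColumns(relation: PrunedRelation, readerSchemaNames: Seq[String]): PrunedRelation = {
  // Look up the existing attributes by name, so each one is reused as-is.
  val nameToAttr = relation.output.map(a => a.name -> a).toMap
  PrunedRelation(output = readerSchemaNames.map(nameToAttr))
}

// When nothing was pruned this is a no-op; when the reader's schema was
// changed elsewhere, it corrects `output` to match.
```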

@cloud-fan
Contributor Author

@SparkQA

SparkQA commented Feb 2, 2018

Test build #86979 has finished for PR 20485 at commit 75950a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 2, 2018

Test build #86981 has finished for PR 20485 at commit 3aa0438.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Feb 2, 2018

To be clear, the purpose of this commit, like #20476, is just to get something working for the 2.3.0 release?

I just want to make sure since I think we should be approaching these problems with a better initial design for the integration. I'm fine getting this in to unblock a release, but if it isn't for that purpose then I think we should fix the design problems first.

@gatorsmile
Member

@rdblue This is another bug we found during the code review. The goal is to ensure Data Source API V2 is usable with at least the same feature set as Data Source API V1.

After getting more feedback about Data Source API V2 from the community, we will restart the discussion about the data source API design in the next release.

@rdblue
Contributor

rdblue commented Feb 2, 2018

Sounds fine to me, then.

My focus is on the long-term design issues. I still think that making plans immutable and reusing the existing push-down code as much as possible are the best way to get a reliable 2.3.0, but it is fine if those changes don't make the release.

@tdas
Contributor

tdas commented Feb 2, 2018

jenkins retest this please

@SparkQA

SparkQA commented Feb 2, 2018

Test build #86999 has finished for PR 20485 at commit 3aa0438.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Feb 3, 2018

Test build #87013 has finished for PR 20485 at commit 3aa0438.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 6, 2018

Test build #87086 has finished for PR 20485 at commit 3aa0438.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Feb 6, 2018

@gatorsmile @rdblue please review and LGTM this. This will unblock my PR, #20445.


case relation: DataSourceV2Relation => relation.reader match {
  case reader: SupportsPushDownRequiredColumns =>
    // TODO: Enable the below assert after we make `DataSourceV2Relation` immutable. Fow now
Member

Typo: Fow

@gatorsmile
Member

LGTM. Thanks! Merged to master/2.3.

asfgit pushed a commit that referenced this pull request Feb 6, 2018
[SPARK-23315][SQL] failed to get output from canonicalized data source v2 related plans

## What changes were proposed in this pull request?

`DataSourceV2Relation`  keeps a `fullOutput` and resolves the real output on demand by column name lookup. i.e.
```
lazy val output: Seq[Attribute] = reader.readSchema().map(_.name).map { name =>
  fullOutput.find(_.name == name).get
}
```

This will be broken after we canonicalize the plan, because all attribute names become "None", see https://github.com/apache/spark/blob/v2.3.0-rc1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L42

To fix this, `DataSourceV2Relation` should just keep `output`, and update the `output` when doing column pruning.

## How was this patch tested?

a new test case

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20485 from cloud-fan/canonicalize.

(cherry picked from commit b96a083)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
asfgit closed this in b96a083 Feb 6, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018
[SPARK-23315][SQL] failed to get output from canonicalized data source v2 related plans