
[SPARK-33482][SPARK-34756][SQL] Fix FileScan equality check #31848

Conversation

@peter-toth (Contributor) commented Mar 16, 2021

What changes were proposed in this pull request?

This bug was introduced by SPARK-30428 at Apache Spark 3.0.0.
This PR fixes FileScan.equals().

Why are the changes needed?

- Without this fix `FileScan.equals` doesn't take `fileIndex` and `readSchema` into account.
- Partition filters and data filters added to `FileScan` (in #27112 and #27157) caused the canonicalized forms of some `BatchScanExec` nodes not to match, which prevented some reuse possibilities.

Does this PR introduce any user-facing change?

Yes. Before this fix, incorrect reuse of `FileScan` (and thus `BatchScanExec`) could have happened, causing correctness issues.

How was this patch tested?

Added new UTs.

@peter-toth (Contributor, Author)

@dongjoon-hyun, you asked for a UT in #31820 (review). Shall I create a simple one focusing on `.equals()` (I wonder where the right place for such a test is), or shall I create an e2e test where a correctness issue is shown and fixed?

@@ -86,7 +86,7 @@ trait FileScan extends Scan

   override def equals(obj: Any): Boolean = obj match {
     case f: FileScan =>
-      fileIndex == f.fileIndex && readSchema == f.readSchema
+      fileIndex == f.fileIndex && readSchema == f.readSchema &&
+        ExpressionSet(partitionFilters) == ExpressionSet(f.partitionFilters) &&
+        ExpressionSet(dataFilters) == ExpressionSet(f.dataFilters)
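The effect of the change above can be sketched without Spark: comparing the filters with set semantics (which is what `ExpressionSet` gives) makes equality order-insensitive, while ignoring the filters entirely (the pre-fix behavior) lets two genuinely different scans compare equal. A minimal plain-Scala sketch with illustrative stand-in types (`Filter` and `Scan` here are not Spark's classes):

```scala
// Stand-ins for illustration only; Spark's real FileScan compares
// ExpressionSet(partitionFilters) and ExpressionSet(dataFilters).
case class Filter(column: String, value: Int)

case class Scan(files: Seq[String], partitionFilters: Seq[Filter], dataFilters: Seq[Filter]) {
  // Pre-fix behavior: filters ignored, so scans with different filters compare equal.
  def equalsBeforeFix(o: Scan): Boolean =
    files == o.files
  // Post-fix behavior: filters compared too, with set semantics (order-insensitive).
  def equalsAfterFix(o: Scan): Boolean =
    files == o.files &&
      partitionFilters.toSet == o.partitionFilters.toSet &&
      dataFilters.toSet == o.dataFilters.toSet
}

val a = Scan(Seq("f1"), Seq(Filter("p", 1), Filter("p", 2)), Nil)
val b = Scan(Seq("f1"), Seq(Filter("p", 2), Filter("p", 1)), Nil) // same filters, reordered
val c = Scan(Seq("f1"), Seq(Filter("p", 3)), Nil)                 // genuinely different filter
```

Before the fix, `a` and `c` would have been considered equal and one `BatchScanExec` could be reused for the other, which is the correctness issue this PR addresses.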
Contributor:
shall we canonicalize the filters here to fix the exchange reuse issue?

@peter-toth (Contributor, Author) commented Mar 16, 2021:

But don't we need the output of the ScanExec (BatchScanExec) node to do that?
I need to look into this a bit...

Contributor:

You are right. Then it's hard to canonicalize Scan implementations that are outside of Spark...

Contributor:

Thinking about it a bit more, do we really need output? We can canonicalize expr IDs to 0 and only look at the column name.
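The idea floated here can be sketched in plain Scala (illustrative names; `Attr` stands in for Spark's `AttributeReference`): zero out the expr ID before comparing, so two logically identical filters that only differ in their IDs become equal.

```scala
// Illustrative stand-ins, not Spark classes.
case class Attr(name: String, exprId: Long)
case class EqualTo(attr: Attr, value: Int)

// Canonicalize expr IDs to 0 so only the column name (and value) matter.
def normalizeIds(f: EqualTo): EqualTo = f.copy(attr = f.attr.copy(exprId = 0L))

val f1 = EqualTo(Attr("partition", 7L), 1)  // same predicate,
val f2 = EqualTo(Attr("partition", 42L), 1) // different expr IDs
```

`f1 != f2` as-is, but `normalizeIds(f1) == normalizeIds(f2)`, which is the comparison the discussion converges on.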

Contributor Author:

Yes, I was thinking about the same. I will try to do this today.

Contributor Author:

I was thinking about a change like in f65ebe3.
That way we don't need to duplicate some parts of the canonicalization logic defined in QueryPlan.normalizeExpressions and in Expression.canonicalized (Canonicalize.expressionReorder, Canonicalize.ignoreTimeZone)...

If this looks ok to you then I can add UTs for .equals() into a new FileScanSuite as @dongjoon-hyun suggested and move the e2e UT from #31820 to here as we don't need the other change in BatchScanExec.

Contributor:

SGTM

@github-actions github-actions bot added the SQL label Mar 16, 2021
@dongjoon-hyun (Member)

Thank you, @peter-toth . I prefer a simple one focusing on `.equals()` only.

> @dongjoon-hyun, you asked for a UT in #31820 (review), shall I create a simple one focusing on .equals() (I wonder where is the right place for such a test?) or shall I create an e2e test where a correctness issue is shown and fixed?

@dongjoon-hyun (Member)

For the following question, yes, it seems that we don't have a proper one yet. If we don't have a proper test suite, shall we make one like FileScanSuite? I guess the new test suite needs to cover all the following implementations: ParquetScan, OrcScan, CSVScan, JsonScan, TextScan, AvroScan.

> I wonder where is the right place for such a test?

val dataFiltersAttributes = AttributeSet(dataFilters).map(a => a.name -> a).toMap
val normalizedPartitionFilters = ExpressionSet(partitionFilters.map(
  QueryPlan.normalizeExpressions(_, output.map(a =>
    partitionFilterAttributes.getOrElse(a.name, a)))))
Contributor:

shall we consider case sensitivity here?

Contributor:

or the attr name inside the data/partitionFilterAttributes have been normalized already before?
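The case-sensitivity concern raised here can be illustrated with a hypothetical name-keyed lookup (the names and helper are illustrative; Spark's real behavior is governed by the session's `spark.sql.caseSensitive` setting): a case-sensitive map lookup misses a differently-cased output column unless both sides are normalized first.

```scala
case class Attr(name: String)

// A filter-attribute map keyed by the names as they appear in the filters.
val filterAttributes = Map("Partition" -> Attr("Partition"))

// Case-sensitive lookup misses "partition"; a case-insensitive comparison
// (here via equalsIgnoreCase, standing in for name normalization) finds it.
def lookup(outputName: String, caseSensitive: Boolean): Option[Attr] =
  if (caseSensitive) filterAttributes.get(outputName)
  else filterAttributes.collectFirst {
    case (k, v) if k.equalsIgnoreCase(outputName) => v
  }
```

With `caseSensitive = true` the lookup for `"partition"` returns nothing, while with `caseSensitive = false` it resolves to the `"Partition"` attribute.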

Contributor Author:

I think we shall, added in 776828e

@peter-toth (Contributor, Author)

> For the following question, yes, it seems that we don't have a proper one yet. If we don't have a proper test suite, shall we make one like FileScanSuite? I guess the new test suite needs to cover all the following implementations: ParquetScan, OrcScan, CSVScan, JsonScan, TextScan, AvroScan.
>
> I wonder where is the right place for such a test?

Thanks @dongjoon-hyun, I will try to update this PR with new tests this week.

…756-fix-filescan-equality-check

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@peter-toth peter-toth force-pushed the SPARK-34756-fix-filescan-equality-check branch from 3dc22fd to 9711da0 on March 18, 2021 17:50
@peter-toth peter-toth changed the title [SPARK-34756][SQL] Fix FileScan equality check [SPARK-33482][SPARK-34756][SQL] Fix FileScan equality check Mar 18, 2021
@peter-toth (Contributor, Author)

@dongjoon-hyun I added new UTs in 9711da0

output.map(a => dataFiltersAttributes.getOrElse(a.name, a)))))
(normalizedPartitionFilters, normalizedDataFilters)
}

override def equals(obj: Any): Boolean = obj match {
Contributor Author:

Shall I update hashCode() as well? Looks like we can easily come up with a better one...

@dongjoon-hyun (Member) commented Mar 18, 2021:

Shall we do that separately, because it's irrelevant to the correctness issue?
In general, we expect a performance improvement with that, don't we?
Apache Spark doesn't allow backporting performance improvements.

Member:

Please file a new JIRA and go for it, @peter-toth ! :)

@dongjoon-hyun (Member)

Thank you for update, @peter-toth !

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

trait FileScanSuiteBase extends SharedSparkSession {
Member:

This looks nice!

@SparkQA

SparkQA commented Mar 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40802/

@SparkQA

SparkQA commented Mar 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40802/

val readPartitionSchema = StructType(Seq(StructField("partition", IntegerType, false)))
val readPartitionSchemaNotEqual = StructType(Seq(
StructField("partition", IntegerType, false),
StructField("other", IntegerType, false)))
@dongjoon-hyun (Member) commented Mar 18, 2021:

In this case, shorter is better. Could you use the following style?

-    val dataSchema = StructType(Seq(
-      StructField("data", IntegerType, false),
-      StructField("partition", IntegerType, false),
-      StructField("other", IntegerType, false)))
-    val dataSchemaNotEqual = StructType(Seq(
-      StructField("data", IntegerType, false),
-      StructField("partition", IntegerType, false),
-      StructField("other", IntegerType, false),
-      StructField("new", IntegerType, false)))
-    val readDataSchema = StructType(Seq(StructField("data", IntegerType, false)))
-    val readDataSchemaNotEqual = StructType(Seq(
-      StructField("data", IntegerType, false),
-      StructField("other", IntegerType, false)))
-    val readPartitionSchema = StructType(Seq(StructField("partition", IntegerType, false)))
-    val readPartitionSchemaNotEqual = StructType(Seq(
-      StructField("partition", IntegerType, false),
-      StructField("other", IntegerType, false)))
+    val dataSchema = StructType.fromDDL("data INT, partition INT, other INT")
+    val dataSchemaNotEqual = StructType.fromDDL("data INT, partition INT, other INT, new INT")
+    val readDataSchema = StructType.fromDDL("data INT")
+    val readDataSchemaNotEqual = StructType.fromDDL("data INT, other INT")
+    val readPartitionSchema = StructType.fromDDL("partition INT")
+    val readPartitionSchemaNotEqual = StructType.fromDDL("partition INT, other INT")

Contributor Author:

Thanks @dongjoon-hyun, I updated the PR with this change in: d782723

@dongjoon-hyun (Member) left a comment:

Well, the PR is now merged back into a single PR for both issues.
Since this was merged at @cloud-fan 's request, I'm fine and I'll leave this to him.
Thank you, @peter-toth and @cloud-fan .

@github-actions github-actions bot added the AVRO label Mar 19, 2021
@SparkQA

SparkQA commented Mar 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40835/

@SparkQA

SparkQA commented Mar 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40835/

@@ -84,11 +85,25 @@ trait FileScan extends Scan

protected def seqToString(seq: Seq[Any]): String = seq.mkString("[", ", ", "]")

private lazy val (normalizedPartitionFilters, normalizedDataFilters) = {
  val output = readSchema().toAttributes.map(a => a.withName(normalizeName(a.name)))
Contributor:

Thinking about it again, the FileScan equality already considers fileIndex and readSchema, which means two file scans are only equal to each other if they read the same set of files and the same set of columns.

Given that, I think the expr IDs do not matter for filters; only the column name matters. Normal v2 sources use Filter, not Expression, and Filter does not have expr IDs either.

The data/partition filters are created in PruneFileSourcePartitions (see https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L51), and the column names inside the filters are already normalized w.r.t. the actual file scan output schema, so we don't need to consider case sensitivity here.

That said, I think the normalize logic here should be very simple: just turn expr IDs to 0.

Contributor Author:

I see your point and agree that the name is what matters in these Filter-like Expressions, but if we go this way then I think:

  • we also need to clear other properties of AttributeReferences, like qualifier
  • we need to either explicitly sort the partitionFilters and dataFilters expression lists (probably with .sortBy(_.hashCode())) to make sure they match with f.partitionFilters and f.dataFilters, or use Set(partitionFilters) == Set(f.partitionFilters), because we can't use ExpressionSet(partitionFilters) == ExpressionSet(f.partitionFilters) once we have removed all expr IDs
  • we need to reorder all descendants of each partitionFilters and dataFilters expression (with Canonicalize.expressionReorder()) to make sure that e.g. id = 1 matches 1 = id (and Canonicalize.ignoreTimeZone() also needs to be applied)

And just a side note: I think we could do most of the above at https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L120-L121 before withFilters(), and then FileScan.equals() would become very simple.

But I wonder whether all these changes are simpler than the current PR?
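The reordering point raised here (once expr IDs are gone, `id = 1` still has to match `1 = id`) is what `Canonicalize.expressionReorder` handles in Spark; a toy version with illustrative stand-in types shows the shape of the problem:

```scala
// Toy expression tree; Spark's Canonicalize.expressionReorder does this for real.
sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(v: Int) extends Expr
case class Eq(l: Expr, r: Expr) extends Expr

// Put the two sides of a commutative comparison in a deterministic order,
// so syntactically different but equivalent predicates become identical.
def reorder(e: Expr): Expr = e match {
  case Eq(l, r) if l.hashCode > r.hashCode => Eq(r, l)
  case other => other
}

val a = Eq(Col("id"), Lit(1)) // id = 1
val b = Eq(Lit(1), Col("id")) // 1 = id
```

`a` and `b` are structurally different, but `reorder(a) == reorder(b)`, so an order-insensitive equality check can recognize them as the same predicate.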

Contributor:

Good point. For simplicity maybe we should just assign fresh expr IDs like what this PR did, but we can remove the case sensitivity handling here.

Contributor Author:

Ok, removed in 30d2d8b

@SparkQA

SparkQA commented Mar 19, 2021

Test build #136253 has finished for PR 31848 at commit d782723.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…756-fix-filescan-equality-check

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40924/

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40924/

test(s"SPARK-33482: Test $name equals") {
  val partitioningAwareFileIndex = newPartitioningAwareFileIndex()

  val parquetScan = scanBuilder(
Contributor:

parquetScan -> fileScan?

Contributor Author:

Fixed in 0d38ac2

@SparkQA

SparkQA commented Mar 22, 2021

Test build #136340 has finished for PR 31848 at commit 30d2d8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40933/

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40933/

@SparkQA

SparkQA commented Mar 22, 2021

Test build #136349 has finished for PR 31848 at commit 0d38ac2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master/3.1/3.0!

@cloud-fan cloud-fan closed this in 93a5d34 Mar 23, 2021
cloud-fan pushed a commit that referenced this pull request Mar 23, 2021
### What changes were proposed in this pull request?

This bug was introduced by SPARK-30428 at Apache Spark 3.0.0.
This PR fixes `FileScan.equals()`.

### Why are the changes needed?
- Without this fix `FileScan.equals` doesn't take `fileIndex` and `readSchema` into account.
- Partition filters and data filters added to `FileScan` (in #27112 and #27157) caused the canonicalized forms of some `BatchScanExec` nodes not to match, which prevented some reuse possibilities.

### Does this PR introduce _any_ user-facing change?
Yes. Before this fix, incorrect reuse of `FileScan` (and thus `BatchScanExec`) could have happened, causing correctness issues.

### How was this patch tested?
Added new UTs.

Closes #31848 from peter-toth/SPARK-34756-fix-filescan-equality-check.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 93a5d34)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@peter-toth (Contributor, Author)

Thanks @cloud-fan and @dongjoon-hyun for the review.

cloud-fan pushed a commit that referenced this pull request Mar 23, 2021
@dongjoon-hyun (Member)

Thank you, @peter-toth and @cloud-fan .

@dongjoon-hyun (Member)

It seems this breaks branch-3.0.

[error] /Users/dongjoon/APACHE/spark-merge/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroScanSuite.scala:26: one more argument than can be applied to method apply: (sparkSession: org.apache.spark.sql.SparkSession, fileIndex: org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex, dataSchema: org.apache.spark.sql.types.StructType, readDataSchema: org.apache.spark.sql.types.StructType, readPartitionSchema: org.apache.spark.sql.types.StructType, options: org.apache.spark.sql.util.CaseInsensitiveStringMap, partitionFilters: Seq[org.apache.spark.sql.catalyst.expressions.Expression], dataFilters: Seq[org.apache.spark.sql.catalyst.expressions.Expression])org.apache.spark.sql.v2.avro.AvroScan in object AvroScan
[error]       (s, fi, ds, rds, rps, f, o, pf, df) => AvroScan(s, fi, ds, rds, rps, o, f, pf, df),
[error]                                                                                      ^
[error] one error found

I'll revert it first at branch-3.0. Please make a backporting PR to branch-3.0, @peter-toth .

peter-toth added a commit to peter-toth/spark that referenced this pull request Mar 24, 2021
@peter-toth (Contributor, Author)

> It seems this breaks branch-3.0. I'll revert it first at branch-3.0. Please make a backporting PR to branch-3.0, @peter-toth .

@dongjoon-hyun, I've opened a 3.0 backport PR here: #31952

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
cloud-fan pushed a commit that referenced this pull request Aug 29, 2022
… filter columns are not read

### What changes were proposed in this pull request?
Unfortunately the fix in #31848 was not correct in all cases. When the partition or data filter contains a column that is not in `readSchema()`, the filter normalization in `FileScan.equals()` doesn't work.

### Why are the changes needed?
To fix `FileScan.equals()` to fix reuse issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes #37693 from peter-toth/SPARK-40245-fix-filescan-equals.

Authored-by: Peter Toth <ptoth@cloudera.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Aug 29, 2022
… filter columns are not read

a0x8o added a commit to a0x8o/spark that referenced this pull request Dec 30, 2022
… filter columns are not read

a0x8o added a commit to a0x8o/spark that referenced this pull request Dec 30, 2022
… filter columns are not read
