Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-26551][SQL] Fix schema pruning error when selecting one complex field and having is not null predicate on another one #23474

Closed
wants to merge 4 commits into from

Conversation

Projects
None yet
5 participants
@viirya
Copy link
Contributor

commented Jan 6, 2019

What changes were proposed in this pull request?

Schema pruning has errors when selecting one complex field and having is not null predicate on another one:

val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
java.lang.IllegalArgumentException: middle does not exist. Available: last                                                                    
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)                                                         
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)                                                                              
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)                                                                     
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)                                
[info]   at scala.Option.map(Option.scala:163)                                                                                                         
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)                                           
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)                                                 
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(Parque
tSchemaPruning.scala:153)                                                                             

How was this patch tested?

Added tests.

@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Jan 6, 2019

@SparkQA

This comment has been minimized.

Copy link

commented Jan 6, 2019

Test build #100820 has finished for PR 23474 at commit 4f5a91a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Jan 6, 2019

retest this please.

@SparkQA

This comment has been minimized.

Copy link

commented Jan 6, 2019

Test build #100825 has finished for PR 23474 at commit 4f5a91a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Jan 6, 2019

retest this please...

@SparkQA

This comment has been minimized.

Copy link

commented Jan 6, 2019

Test build #100829 has finished for PR 23474 at commit 4f5a91a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@dongjoon-hyun

This comment has been minimized.

Copy link
Member

commented Jan 6, 2019

Thank you for pinging me, @viirya !

@SparkQA

This comment has been minimized.

Copy link

commented Jan 7, 2019

Test build #100861 has finished for PR 23474 at commit 6dbd753.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@HyukjinKwon

This comment has been minimized.

Copy link
Member

commented Jan 7, 2019

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Jan 7, 2019

Test build #100864 has finished for PR 23474 at commit 6dbd753.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Jan 7, 2019

retest this please.

@SparkQA

This comment has been minimized.

Copy link

commented Jan 7, 2019

Test build #100869 has finished for PR 23474 at commit 6dbd753.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@dbtsai

This comment has been minimized.

Copy link
Member

commented Jan 7, 2019

When we read name.first and check if name is not null, we are marking name as contentAccessed = false to avoid reading the entire name column.

Now, the issue is when we read name.first and check if name.middle is not null, we are still marking name.middle as contentAccessed = false resulting java.lang.IllegalArgumentException: middle does not exist..

To fix the root cause, and avoid the misunderstanding of the meaning of contentAccessed, we might mark contentAccessed = true in the second case; thus, this change here will not be required.

What do you think?

Thanks.

@viirya

This comment has been minimized.

Copy link
Contributor Author

commented Jan 8, 2019

I think It is harder to mark name.middle as contentAccessed = false. It is true we can check all field accesses and see if middle is not accessed by others. But I feel it is more difficult to do that and current fix is simpler.

@dbtsai

This comment has been minimized.

Copy link
Member

commented Jan 8, 2019

If it's really difficult to mark name.middle as contentAccessed = true in this case (which I feel is a less hacky solution), can we reformat the code with the following with documentation?

      !rootFields.exists { root =>
        root.field.name == opt.field.name && {
          // If the merged field type of root and opt field is different from opt field type,
          // we will keep it.
          // For example, when root field type is `struct<name:struct<last:string>>`,
          // and opt field type is `struct<name:struct<middle:string>>`, the merged field type will be
          // `struct<name:struct<last:string,middle:string>>`. Since the merged one contains more
          // nested fields than opt field type, we have to keep it.
          val rootFieldType = StructType(Array(root.field))
          val optFieldType = StructType(Array(opt.field))
          val merged = optFieldType.merge(rootFieldType)
          merged.sameType(optFieldType)
        }

Add @hvanhovell @gatorsmile for more input.

Thanks!

@dongjoon-hyun

This comment has been minimized.

Copy link
Member

commented Jan 8, 2019

+1 for @dbtsai 's refactored code.


optRootFields.filter { opt =>
!rootFields.exists(_.field.name == opt.field.name)
val optFieldType = StructType(Array(opt.field))

This comment has been minimized.

Copy link
@dbtsai

dbtsai Jan 10, 2019

Member

Can we move optFieldType right after val rootFieldType = StructType(Array(root.field))? Thanks!

This comment has been minimized.

Copy link
@viirya

viirya Jan 11, 2019

Author Contributor

Do you mean moving it inside the exists call?

This comment has been minimized.

Copy link
@viirya

viirya Jan 11, 2019

Author Contributor

I make it out of exist call so it can be reused, isn't? Moving it to after rootFieldType is for readability?

This comment has been minimized.

Copy link
@dbtsai

dbtsai Jan 11, 2019

Member

It's not very expensive, and we only need to compute it when root.field.name == opt.field.name. As a result, I feel moving it right after val rootFieldType will be more readable.

        root.field.name == opt.field.name && {
          val rootFieldType = StructType(Array(root.field))
          val optFieldType = StructType(Array(opt.field))
          val merged = optFieldType.merge(rootFieldType)
          merged.sameType(optFieldType)
        }

This comment has been minimized.

Copy link
@viirya

viirya Jan 11, 2019

Author Contributor

Ok. I see. Let me move it. Thanks.

@dbtsai

This comment has been minimized.

Copy link
Member

commented Jan 10, 2019

LGTM. Just one minor comment. Thanks!

@SparkQA

This comment has been minimized.

Copy link

commented Jan 10, 2019

Test build #101024 has finished for PR 23474 at commit ff1cc85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA

This comment has been minimized.

Copy link

commented Jan 11, 2019

Test build #101075 has finished for PR 23474 at commit ff2fa67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 11, 2019

[SPARK-26551][SQL] Fix schema pruning error when selecting one comple…
…x field and having is not null predicate on another one

## What changes were proposed in this pull request?

Schema pruning has errors when selecting one complex field and having is not null predicate on another one:

```scala
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
```

```
java.lang.IllegalArgumentException: middle does not exist. Available: last
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)
[info]   at scala.Option.map(Option.scala:163)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(Parque
tSchemaPruning.scala:153)
```

## How was this patch tested?

Added tests.

Closes #23474 from viirya/SPARK-26551.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
(cherry picked from commit 50ebf3a)
Signed-off-by: DB Tsai <d_tsai@apple.com>
@dbtsai

This comment has been minimized.

Copy link
Member

commented Jan 11, 2019

LGTM. Merged into master and 2.4 branch. Thanks!

@asfgit asfgit closed this in 50ebf3a Jan 11, 2019

stczwd added a commit to stczwd/spark that referenced this pull request Feb 18, 2019

[SPARK-26551][SQL] Fix schema pruning error when selecting one comple…
…x field and having is not null predicate on another one

## What changes were proposed in this pull request?

Schema pruning has errors when selecting one complex field and having is not null predicate on another one:

```scala
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
```

```
java.lang.IllegalArgumentException: middle does not exist. Available: last
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)
[info]   at scala.Option.map(Option.scala:163)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(Parque
tSchemaPruning.scala:153)
```

## How was this patch tested?

Added tests.

Closes apache#23474 from viirya/SPARK-26551.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>

wangyum added a commit to wangyum/spark that referenced this pull request Mar 14, 2019

[SPARK-26551][SQL] Fix schema pruning error when selecting one comple…
…x field and having is not null predicate on another one

## What changes were proposed in this pull request?

Schema pruning has errors when selecting one complex field and having is not null predicate on another one:

```scala
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
```

```
java.lang.IllegalArgumentException: middle does not exist. Available: last
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)
[info]   at scala.Option.map(Option.scala:163)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(Parque
tSchemaPruning.scala:153)
```

## How was this patch tested?

Added tests.

Closes apache#23474 from viirya/SPARK-26551.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.