
[SPARK-43522][SQL] Fix creating struct column name with index of array #41187

Closed

Conversation

Hisoka-X
Member

@Hisoka-X Hisoka-X commented May 16, 2023

What changes were proposed in this pull request?

Code that creates a struct column in a DataFrame and ran without problems in version 3.3.1 fails in version 3.4.0.

In 3.3.1

```scala
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, ",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, "=").getItem(0), split(x, "=").getItem(1))))
testDF.show()

+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+
```

In 3.4.0

```
org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve "struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only foldable `STRING` expressions are allowed to appear at odd position, but they are ["0", "1"].;
'Project [value#41, key_value#45, transform(key_value#45, lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
+- Project [value#41, split(value#41, ,, -1) AS key_value#45]
   +- LocalRelation [value#41]  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
....
```

The reason is that `CreateNamedStruct` uses the last expression of each value `Expression` as the column name, and checks that it is a `String`. But the last expression of an array element access is an `Integer` (the index), so the check fails. We can skip the match on `UnresolvedExtractValue` when the last expression is not a `String`; the column name then falls back to the default.
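The fallback described above can be sketched with a small, self-contained model. This is a hypothetical simplification for illustration only, not Spark's actual `CreateNamedStruct` code; the `ExtractKey` types and `fieldName` helper are made up here:

```scala
// Simplified model of the naming fallback: a string extraction key
// (e.g. struct field access like s.key) can serve as the column name,
// while a non-string key (e.g. an array index like arr[0]) falls back
// to the default "colN" name instead of failing the analyzer check.
object StructNaming {
  sealed trait ExtractKey
  case class StringKey(name: String) extends ExtractKey // e.g. s.key
  case class IntKey(index: Int) extends ExtractKey      // e.g. arr[0]

  // Derive a field name for the struct entry at the given position.
  def fieldName(key: ExtractKey, position: Int): String = key match {
    case StringKey(name) => name                   // foldable string: use it
    case _               => s"col${position + 1}"  // otherwise: default name
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(IntKey(0), IntKey(1), StringKey("key"))
    val names = keys.zipWithIndex.map { case (k, i) => fieldName(k, i) }
    println(names.mkString(",")) // col1,col2,key
  }
}
```

Under this model, array-index extractions like `split(x, "=").getItem(0)` simply get default names (`col1`, `col2`, ...) rather than tripping the foldable-`STRING` check.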

Why are the changes needed?

Fixes a bug where deriving a struct column's field name from an array element accessed by index fails analysis.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add new test

@github-actions github-actions bot added the SQL label May 16, 2023
@Hisoka-X
Member Author

cc @cloud-fan @sadikovi

@cloud-fan
Contributor

do you know which commit broke this?

@Hisoka-X
Member Author

> do you know which commit broke this?

#37965

@cloud-fan
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in f2a2917 May 18, 2023
cloud-fan pushed a commit that referenced this pull request May 18, 2023
### What changes were proposed in this pull request?

Code that creates a struct column in a DataFrame and ran without problems in version 3.3.1 fails in version 3.4.0.

In 3.3.1
```scala
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, ",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, "=").getItem(0), split(x, "=").getItem(1) ) ))
testDF.show()

+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+
```

In 3.4.0

```
org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve "struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only foldable `STRING` expressions are allowed to appear at odd position, but they are ["0", "1"].;
'Project [value#41, key_value#45, transform(key_value#45, lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
+- Project [value#41, split(value#41, ,, -1) AS key_value#45]
   +- LocalRelation [value#41]  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
....
```

The reason is that `CreateNamedStruct` uses the last expression of each value `Expression` as the column name, and checks that it is a `String`. But the last expression of an array element access is an `Integer` (the index), so the check fails. We can skip the match on `UnresolvedExtractValue` when the last expression is not a `String`; the column name then falls back to the default.

### Why are the changes needed?

Fixes a bug where deriving a struct column's field name from an array element accessed by index fails analysis.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add new test

Closes #41187 from Hisoka-X/SPARK-43522_struct_name_array.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f2a2917)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@Hisoka-X
Member Author

Thanks @cloud-fan @sadikovi

@Hisoka-X Hisoka-X deleted the SPARK-43522_struct_name_array branch May 18, 2023 12:00
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023