
[SPARK-43522][SQL] Fix creating struct column name with index of array #41187

Closed

Conversation

Hisoka-X
Member

@Hisoka-X Hisoka-X commented May 16, 2023

What changes were proposed in this pull request?

Code that creates a struct column in a DataFrame and ran without problems in version 3.3.1 fails in version 3.4.0.

In 3.3.1

```scala
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, ",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, "=").getItem(0), split(x, "=").getItem(1))))
testDF.show()

+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+
```

In 3.4.0

```
org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve "struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only foldable `STRING` expressions are allowed to appear at odd position, but they are ["0", "1"].;
'Project [value#41, key_value#45, transform(key_value#45, lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
+- Project [value#41, split(value#41, ,, -1) AS key_value#45]
   +- LocalRelation [value#41]  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
....
```

The reason is that `CreateNamedStruct` uses the last expression of each value `Expression` as the column name, and checks that it is a `String`. But the last expression of an array element access is an `Integer` (the index), so the check fails. We can skip the match on `UnresolvedExtractValue` when the last expression is not a `String`; the column name then falls back to the default.
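The fallback described above can be sketched with a small, self-contained model. This is a hypothetical simplification for illustration only, not Spark's actual `CreateNamedStruct` code; the `ExtractKey` types and `fieldName` helper are made up here:

```scala
// Simplified model of the naming fallback: a string extraction key
// (e.g. struct field access like s.key) can serve as the column name,
// while a non-string key (e.g. an array index like arr[0]) falls back
// to the default "colN" name instead of failing the analyzer check.
object StructNaming {
  sealed trait ExtractKey
  case class StringKey(name: String) extends ExtractKey // e.g. s.key
  case class IntKey(index: Int) extends ExtractKey      // e.g. arr[0]

  // Derive a field name for the struct entry at the given position.
  def fieldName(key: ExtractKey, position: Int): String = key match {
    case StringKey(name) => name                   // foldable string: use it
    case _               => s"col${position + 1}"  // otherwise: default name
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(IntKey(0), IntKey(1), StringKey("key"))
    val names = keys.zipWithIndex.map { case (k, i) => fieldName(k, i) }
    println(names.mkString(",")) // col1,col2,key
  }
}
```

Under this model, array-index extractions like `split(x, "=").getItem(0)` simply get default names (`col1`, `col2`, ...) rather than tripping the foldable-`STRING` check.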

Why are the changes needed?

Fixes a bug where deriving a struct column's field name from an array element accessed by index fails analysis.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add new test

@github-actions github-actions bot added the SQL label May 16, 2023
@Hisoka-X
Member Author

cc @cloud-fan @sadikovi

@cloud-fan
Contributor

do you know which commit broke this?

@Hisoka-X
Member Author

> do you know which commit broke this?

#37965

@cloud-fan
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in f2a2917 May 18, 2023
cloud-fan pushed a commit that referenced this pull request May 18, 2023
### What changes were proposed in this pull request?

Code that creates a struct column in a DataFrame and ran without problems in version 3.3.1 fails in version 3.4.0.

In 3.3.1
```scala
val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, ",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, "=").getItem(0), split(x, "=").getItem(1) ) ))
testDF.show()

+-----------+---------------+--------------------+
|      value|      key_value|           map_entry|
+-----------+---------------+--------------------+
|a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...|
+-----------+---------------+--------------------+
```

In 3.4.0

```
org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve "struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only foldable `STRING` expressions are allowed to appear at odd position, but they are ["0", "1"].;
'Project [value#41, key_value#45, transform(key_value#45, lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
+- Project [value#41, split(value#41, ,, -1) AS key_value#45]
   +- LocalRelation [value#41]  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
....
```

The reason is that `CreateNamedStruct` uses the last expression of each value `Expression` as the column name, and checks that it is a `String`. But the last expression of an array element access is an `Integer` (the index), so the check fails. We can skip the match on `UnresolvedExtractValue` when the last expression is not a `String`; the column name then falls back to the default.

### Why are the changes needed?

Fixes a bug where deriving a struct column's field name from an array element accessed by index fails analysis.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add new test

Closes #41187 from Hisoka-X/SPARK-43522_struct_name_array.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f2a2917)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@Hisoka-X
Member Author

Thanks @cloud-fan @sadikovi

@Hisoka-X Hisoka-X deleted the SPARK-43522_struct_name_array branch May 18, 2023 12:00
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023