[SPARK-48863][SQL] Fix ClassCastException when parsing JSON with "spark.sql.json.enablePartialResults" enabled

### What changes were proposed in this pull request?

This PR fixes a bug in a corner case of JSON parsing when `spark.sql.json.enablePartialResults` is enabled. When running the following query with the config set to true:
```
select from_json('{"a":"b","c":"d"}', 'array<struct<a:string, c:int>>')
```
the code would fail with
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4) (ip-10-110-51-101.us-west-2.compute.internal executor driver): java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class org.apache.spark.sql.catalyst.util.ArrayData (org.apache.spark.unsafe.types.UTF8String and org.apache.spark.sql.catalyst.util.ArrayData are in unnamed module of loader 'app')
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:53)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:53)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:172)
	at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:831)
	at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:893)
```
The patch fixes the issue by re-throwing PartialArrayDataResultException if parsing fails in this special case.

### Why are the changes needed?

Fixes the bug that would prevent users from reading objects as arrays as introduced in SPARK-19595. This is more of a special case but it works with the flag off so it would be good to fix it when the flag is on.

### Does this PR introduce _any_ user-facing change?

Yes, but it is a bug fix so it would not have worked without this patch overall. The parsing output will be different due to the partial results improvement: Previously, we would get `null` (the partial results are disabled). With this patch and partial results enabled, this will return `Array([b, null])`. This is not specific to this patch but rather to the partial results feature in general.

### How was this patch tested?

I added a unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47292 from sadikovi/SPARK-48863.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
bjornjorgensen · Jul 11, 2024 · 31d5ea1
1 parent b4e3c2a
commit 31d5ea1
Showing 2 changed files with 44 additions and 1 deletion.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -201,7 +201,18 @@ class JacksonParser(
         //
         val st = at.elementType.asInstanceOf[StructType]
         val fieldConverters = st.map(_.dataType).map(makeConverter).toArray
-        Some(InternalRow(new GenericArrayData(convertObject(parser, st, fieldConverters).toArray)))
+
+        val res = try {
+          convertObject(parser, st, fieldConverters)
+        } catch {
+          case err: PartialResultException =>
+            throw PartialArrayDataResultException(
+              new GenericArrayData(Seq(err.partialResult)),
+              err.cause
+            )
+        }
+
+        Some(InternalRow(new GenericArrayData(res.toArray)))
     }
   }
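
The re-throw in the hunk above can be illustrated with a minimal, self-contained sketch. It is not Spark code: the two exception classes are simplified stand-ins that only mirror the names used in the patch (they carry plain `Seq` values here rather than Spark's internal row and array types), and `convertObject`, `convertObjectAsArray`, and `parse` are hypothetical helpers written for this illustration. The point is the shape of the fix: a partial result for a single object is wrapped into a one-element array before it reaches a caller that expects array-shaped data.

```scala
// Simplified stand-ins: the names mirror the patch, but these definitions are
// assumptions for illustration and are not Spark's real classes.
final case class PartialResultException(partialResult: Seq[Any], cause: Throwable)
  extends Exception(cause)

final case class PartialArrayDataResultException(partialResult: Seq[Seq[Any]], cause: Throwable)
  extends Exception(cause)

object PartialResultSketch {
  // Hypothetical converter for the schema struct<a:string, c:int>.
  // If "c" is not an integer, it reports the fields parsed so far
  // as a partial row (with null for "c") instead of failing outright.
  def convertObject(fields: Map[String, String]): Seq[Any] = {
    val a = fields.getOrElse("a", null)
    try {
      Seq(a, fields("c").toInt)
    } catch {
      case e: NumberFormatException =>
        throw PartialResultException(Seq(a, null), e)
    }
  }

  // Mirrors the shape of the patched branch: a single object is read as a
  // one-element array, so a row-shaped partial result is re-thrown as an
  // array-shaped one.
  def convertObjectAsArray(fields: Map[String, String]): Seq[Seq[Any]] = {
    val row = try {
      convertObject(fields)
    } catch {
      case err: PartialResultException =>
        throw PartialArrayDataResultException(Seq(err.partialResult), err.cause)
    }
    Seq(row)
  }

  // Hypothetical caller that expects an array and falls back to the
  // partial array when one is reported.
  def parse(fields: Map[String, String]): Seq[Seq[Any]] =
    try convertObjectAsArray(fields)
    catch { case err: PartialArrayDataResultException => err.partialResult }
}
```

In this sketch, `PartialResultSketch.parse(Map("a" -> "b", "c" -> "1"))` yields `Seq(Seq("b", 1))`, while `PartialResultSketch.parse(Map("a" -> "b", "c" -> "d"))` yields `Seq(Seq("b", null))`. The caller always receives array-shaped data, which is the property the patch restores.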
 

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
@@ -1161,6 +1161,38 @@ class JsonFunctionsSuite extends QueryTest with SharedSparkSession {
     }
   }
 
+  test("SPARK-48863: parse object as an array with partial results enabled") {
+    val schema = StructType(StructField("a", StringType) :: StructField("c", IntegerType) :: Nil)
+
+    // Value can be parsed correctly and should return the same result with or without the flag.
+    Seq(false, true).foreach { enabled =>
+      withSQLConf(SQLConf.JSON_ENABLE_PARTIAL_RESULTS.key -> s"${enabled}") {
+        checkAnswer(
+          Seq("""{"a": "b", "c": 1}""").toDF("c0")
+            .select(from_json($"c0", ArrayType(schema))),
+          Row(Seq(Row("b", 1)))
+        )
+      }
+    }
+
+    // Value does not match the schema.
+    val df = Seq("""{"a": "b", "c": "1"}""").toDF("c0")
+
+    withSQLConf(SQLConf.JSON_ENABLE_PARTIAL_RESULTS.key -> "true") {
+      checkAnswer(
+        df.select(from_json($"c0", ArrayType(schema))),
+        Row(Seq(Row("b", null)))
+      )
+    }
+
+    withSQLConf(SQLConf.JSON_ENABLE_PARTIAL_RESULTS.key -> "false") {
+      checkAnswer(
+        df.select(from_json($"c0", ArrayType(schema))),
+        Row(null)
+      )
+    }
+  }
+
   test("SPARK-33270: infers schema for JSON field with spaces and pass them to from_json") {
     val in = Seq("""{"a b": 1}""").toDS()
     val out = in.select(from_json($"value", schema_of_json("""{"a b": 100}""")) as "parsed")