
[SPARK-35912][SQL] Fix nullability of spark.read.json #33212

Closed
wants to merge 14 commits into from

Conversation

@cfmcgrady (Contributor) commented Jul 5, 2021

What changes were proposed in this pull request?

Rework PR with suggestions.

This PR changes the behavior of spark.read.json when a field's nullability is false: the JacksonParser should fail and throw an exception instead of silently setting a null value.

Here is an example:

  val schema = StructType(Seq(StructField("value",
    StructType(Seq(
      StructField("x", IntegerType, nullable = true),
      StructField("y", IntegerType, nullable = false)
    ))
  )))
  schema.printTreeString()
  // root
  // |-- value: struct (nullable = true)
  // |    |-- x: integer (nullable = true)
  // |    |-- y: integer (nullable = false)
  
  val testDS = Seq("""{"value":{"x":1}}""").toDS
  val jsonDS = spark.read.schema(schema).json(testDS)

  jsonDS.show()

Output before this PR:

+---------+
|    value|
+---------+
|{1, null}|
+---------+

After this PR, the output differs depending on the parse mode (a small sketch of selecting the mode via the reader option follows the list):

  • PERMISSIVE mode

    Throws an IllegalSchemaArgumentException containing the message:

    Field c2 is not nullable but PERMISSIVE mode only works with nullable fields.

  • FAILFAST mode

    Throws an IllegalSchemaArgumentException containing the message:

    field y is not nullable but it's missing in one record.

  • DROPMALFORMED mode
+-----+
|value|
+-----+
+-----+
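
For reference, a minimal sketch of selecting the parse mode via the standard `mode` reader option (assuming the `schema` and `testDS` defined above); the output shown is the DROPMALFORMED result this PR proposes:

  val dropped = spark.read
    .schema(schema)
    .option("mode", "DROPMALFORMED") // or "PERMISSIVE" / "FAILFAST"
    .json(testDS)
  dropped.show()
  // +-----+
  // |value|
  // +-----+
  // +-----+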

Does this PR introduce any user-facing change?

Yes. When reading JSON with a missing (or null) value for a non-nullable field, Spark will fail and throw an exception in PERMISSIVE/FAILFAST mode, and will drop the record in DROPMALFORMED mode.

How was this patch tested?

New test.

@github-actions github-actions bot added the SQL label Jul 5, 2021
@AmplabJenkins

Can one of the admins verify this patch?

@cfmcgrady (Contributor, Author)

@cloud-fan (Contributor)

> should return a default value instead of null for the primitive type

I don't agree with this. It's a correctness bug if an integer value is null in the JSON file but Spark reads it as 0. We should fail at runtime. We can either do this null check in the jackson parser, or leverage the AssertNotNull expression.

@HyukjinKwon (Member)

Yeah, I think we should fail.

@@ -404,6 +407,19 @@ class JacksonParser(
}
}

// When the input schema is set to `nullable = false`, make sure the field is not null.
var index = 0
while (badRecordException.isEmpty && !skipRow && index < schema.length) {
Member

Can we fix parseJsonToken instead by passing nullable bool flag?

Contributor (Author)

Let me try.

Contributor (Author)

As the missing field won't call fieldConverter, it seems it's hard to do this?

while (!skipRow && nextUntil(parser, JsonToken.END_OBJECT)) {
  schema.getFieldIndex(parser.getCurrentName) match {
    case Some(index) =>
      try {
        row.update(index, fieldConverters(index).apply(parser))
        skipRow = structFilters.skipRow(row, index)
      } catch {
        case e: SparkUpgradeException => throw e
        case NonFatal(e) if isRoot =>
          badRecordException = badRecordException.orElse(Some(e))
          parser.skipChildren()
      }
    case None =>
      parser.skipChildren()
  }
}

Contributor

So there are 2 cases that can lead to null result:

  1. the JSON value is null
  2. the JSON field is missing
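
For illustration (assuming the value/x/y schema from the PR description), the two cases would look like:

  // case 1: the JSON value is explicitly null
  val valueIsNull  = """{"value":{"x":1,"y":null}}"""
  // case 2: the JSON field is missing entirely
  val fieldMissing = """{"value":{"x":1}}"""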

Contributor

and for malformed fields, we also set the value to null and put the JSON string to a special column.

Contributor (Author)

We may not return the JSON string; the reason is shown in the test.

@HyukjinKwon (Member)

cc @MaxGekk FYI

@viirya (Member) left a comment

+1 for failing the unexpected null case.

assert(exceptionMsg1.contains(
  "the null value found when parsing non-nullable field c2."))
val exceptionMsg2 = intercept[SparkException] {
  load("PERMISSIVE", schema, jsonString).collect
Contributor (Author)

@cloud-fan Since the field is non-nullable, do we still return Row(1, null) in PERMISSIVE mode? If yes, this may cause the struct-cast problem we talked about before: the field is non-nullable but row.isNullAt(index) is true.

Contributor

Yea, I think we can't apply permissive mode here. Failing looks fine to me.

Contributor

Actually, PERMISSIVE mode should not fail at runtime. Shall we fail at the very beginning if the mode is PERMISSIVE and schema contains non-nullable fields?

Contributor (Author)

Done, added a non-nullable check for PERMISSIVE mode when we initialize the FailureSafeParser.

s"the null value found when parsing non-nullable field ${schema(index).name}.")
}
if (!checkedIndexSet.contains(index)) {
skipRow = structFilters.skipRow(row, index)
Contributor

is this change related to the nullability issue?

Contributor (Author)

Yes. When the input JSON string has a field with a missing value, the JacksonParser should still be able to skip rows based on that null-valued field.

Contributor

This was not done before your PR? The missing field is always filled with null value, and we can use it to filter rows. This doesn't seem to be related to this PR.

Contributor (Author)

For example:

  val missingFieldInput = """{"c1":1}"""
  val schema = StructType(Seq(
    StructField("c1", IntegerType, nullable = true),
    StructField("c2", IntegerType, nullable = true)
  ))

  val filters = Seq(IsNotNull("c2"))

  val options = new JSONOptions(Map.empty[String, String], "GMT", "")
  val parser = new JacksonParser(schema, options, false, filters)
  val createParser = CreateJacksonParser.string _
  val result = parser.parse(missingFieldInput, createParser, UTF8String.fromString)
  println(result)

before this pr:

List([1,null])

after this pr:

List()

Please let me know if I've missed out anything.

Contributor

This is not a nullability issue, but a performance improvement. Spark will evaluate the filter again and drop the rows. Can we exclude this part from this PR? We need to backport the nullability fix but not the performance improvement.
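
(As an aside, a hedged illustration of that point: even when JacksonParser does not skip the row itself, an isNotNull filter applied on top of the read gives the same empty result because Spark re-evaluates the filter. The snippet assumes the two-column c1/c2 schema from the example above, an active SparkSession, and spark.implicits._ in scope.)

  import org.apache.spark.sql.functions.col

  val df = spark.read.schema(schema).json(Seq("""{"c1":1}""").toDS)
  df.filter(col("c2").isNotNull).show()
  // +---+---+
  // | c1| c2|
  // +---+---+
  // +---+---+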

Contributor (Author)

updated.

@maropu (Member)

maropu commented Jul 12, 2021

Could you update the title and description, too?

private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord)
private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord))
private val resultRow = new GenericInternalRow(schema.length)
private val nullResult = new GenericInternalRow(schema.length)

// As PERMISSIVE mode should not fail at runtime, fail up front if the mode is PERMISSIVE and
// the schema contains non-nullable fields.
private def disableNotNullableForPermissiveMode: Unit = {
Contributor

nit: this method has a side effect, so it's better to define it with parentheses.
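
For example, a minimal sketch of the parenthesized form being suggested; the body is an assumption pieced together from the PERMISSIVE error message used elsewhere in this PR (and only checks top-level fields), not the PR's exact implementation:

  private def disableNotNullableForPermissiveMode(): Unit = {
    if (mode == PermissiveMode) {
      schema.fields.filterNot(_.nullable).foreach { field =>
        throw new IllegalSchemaArgumentException(
          s"Field ${field.name} is not nullable but PERMISSIVE mode only works with nullable fields.")
      }
    }
  }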

load("PERMISSIVE", schema, jsonString).collect
}
val expectedMsg2 =
"Field c2 is not nullable but PERMISSIVE mode only works with nullable fields."
Contributor

Can we add a migration guide for it? This is a breaking change as PERMISSIVE is the default mode.

Also cc @HyukjinKwon for this breaking change.

Contributor

Another choice is to not respect the user-given schema and always turn it into a nullable schema, if the mode is PERMISSIVE.
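
(For reference, a rough sketch of what "turn it into a nullable schema" could mean; toNullable below is a hypothetical helper written only for illustration, not code from this PR or Spark's actual internal implementation.)

  import org.apache.spark.sql.types._

  // Recursively mark every struct field, array element and map value as nullable.
  def toNullable(dt: DataType): DataType = dt match {
    case st: StructType =>
      StructType(st.fields.map(f => f.copy(dataType = toNullable(f.dataType), nullable = true)))
    case ArrayType(elementType, _) =>
      ArrayType(toNullable(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(toNullable(keyType), toNullable(valueType), valueContainsNull = true)
    case other => other
  }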

@@ -22,6 +22,10 @@ license: |
* Table of contents
{:toc}

## Upgrading from Spark SQL 3.2 to 3.3

- In Spark 3.3, Spark will fail when parsing a JSON/CSV string in `PERMISSIVE` mode if the schema contains non-nullable fields. You can set the mode to `FAILFAST`/`DROPMALFORMED` if you want to read JSON/CSV with a schema that contains non-nullable fields.
Contributor (Author)

This change also affects parsing of CSV strings; I will add a unit test for CSV in a separate PR. cc @cloud-fan @HyukjinKwon

Comment on lines 431 to 441
  badRecordException.isEmpty && !skipRow && options.parseMode != PermissiveMode
if (checkNotNullable) {
  var index = 0
  while (index < schema.length) {
    if (!schema(index).nullable && row.isNullAt(index)) {
      throw new IllegalSchemaArgumentException(
        s"field ${schema(index).name} is not nullable but it's missing in one record.")
    }
    index += 1
  }
}
Member

Could we extract this part as a private method like this?

  private lazy val checkNotNullableInRow = if (options.parseMode != PermissiveMode) {
    (row: GenericInternalRow, schema: StructType, skipRow: Boolean,
        runtimeExceptionOption: Option[Throwable]) => {
      val checkNotNullable =
        runtimeExceptionOption.isEmpty && !skipRow && options.parseMode != PermissiveMode
      if (checkNotNullable) {
        ...
        }
      }
    }
  } else {
    (_: GenericInternalRow, _: StructType, _: Boolean, _: Option[Throwable]) => {}
  }

@cfmcgrady (Contributor, Author)

kindly ping @HyukjinKwon

@cfmcgrady (Contributor, Author)

Run ./dev/lint-python
starting python compilation test...
python compilation succeeded.

starting black test...
black checks passed.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks failed:
python/pyspark/mllib/tree.pyi:29: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/tree.pyi:38: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:34: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:42: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:48: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:54: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:76: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:124: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:165: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/clustering.pyi:45: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/clustering.pyi:72: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/classification.pyi:39: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/classification.pyi:52: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
Found 13 errors in 4 files (checked 314 source files)
Error: Process completed with exit code 1.

The check failure is not related to this PR.

@@ -405,10 +405,18 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
  case Some(index) =>
    try {
      row.update(index, fieldConverters(index).apply(parser))
      val fieldValue = fieldConverters(index).apply(parser)
Member

I... am not sure if we really need this complicated fix to address the nullability mismatch (which is rather a corner case), to be honest. I wonder if there's a simpler approach, e.g. simply warning on non-nullable columns?
Just to be clear, I don't mind if other committers prefer to fix it.

Contributor

That's another proposal I mentioned earlier: if the user-given schema is non-nullable, we just turn it into a nullable schema and don't fail.

Contributor (Author)

Thank you for your suggestions, I'll raise a new PR.

@cfmcgrady cfmcgrady closed this Jul 20, 2021
HyukjinKwon pushed a commit that referenced this pull request Jul 22, 2021
### What changes were proposed in this pull request?

Rework [PR](#33212) with suggestions.

This PR makes `spark.read.json()` behave the same as the Datasource API `spark.read.format("json").load("path")`: Spark should turn a non-nullable schema into a nullable one by default when using `spark.read.json()`.

Here is an example:

```scala
  val schema = StructType(Seq(StructField("value",
    StructType(Seq(
      StructField("x", IntegerType, nullable = false),
      StructField("y", IntegerType, nullable = false)
    )),
    nullable = true
  )))

  val testDS = Seq("""{"value":{"x":1}}""").toDS
  spark.read
    .schema(schema)
    .json(testDS)
    .printSchema()

  spark.read
    .schema(schema)
    .format("json")
    .load("/tmp/json/t1")
    .printSchema()
  // root
  //  |-- value: struct (nullable = true)
  //  |    |-- x: integer (nullable = true)
  //  |    |-- y: integer (nullable = true)
```

Before this pr:
```
// output of spark.read.json()
root
 |-- value: struct (nullable = true)
 |    |-- x: integer (nullable = false)
 |    |-- y: integer (nullable = false)
```

After this pr:
```
// output of spark.read.json()
root
 |-- value: struct (nullable = true)
 |    |-- x: integer (nullable = true)
 |    |-- y: integer (nullable = true)
```

- `spark.read.csv()` also has the same problem.
- The Datasource API `spark.read.format("json").load("path")` already does this logic when resolving the relation:

https://github.com/apache/spark/blob/c77acf0bbc25341de2636649fdd76f9bb4bdf4ed/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L415-L421

### Does this PR introduce _any_ user-facing change?

Yes, `spark.read.json()` and `spark.read.csv()` no longer respect the nullability of the user-given schema and always turn it into a nullable schema by default.

### How was this patch tested?

New test.

Closes #33436 from cfmcgrady/SPARK-35912-v3.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>