
[SPARK-35912][SQL] Fix nullability of spark.read.json #33212

Closed
wants to merge 14 commits into from

Conversation

@cfmcgrady (Contributor) commented Jul 5, 2021

What changes were proposed in this pull request?

Rework PR with suggestions.

This PR changes the behavior of spark.read.json when a field's nullability is false: the JacksonParser should fail and throw an exception instead of silently setting a null value.

Here is an example:

  val schema = StructType(Seq(StructField("value",
    StructType(Seq(
      StructField("x", IntegerType, nullable = true),
      StructField("y", IntegerType, nullable = false)
    ))
  )))
  schema.printTreeString()
  // root
  // |-- value: struct (nullable = true)
  // |    |-- x: integer (nullable = true)
  // |    |-- y: integer (nullable = false)
  
  val testDS = Seq("""{"value":{"x":1}}""").toDS
  val jsonDS = spark.read.schema(schema).json(testDS)

  jsonDS.show()

Output before this PR:

+---------+
|    value|
+---------+
|{1, null}|
+---------+

After this PR, the output differs depending on the parse mode (a small sketch of selecting the mode via the reader option follows the list):

  • PERMISSIVE mode

    Throws an IllegalSchemaArgumentException containing the message:

    Field c2 is not nullable but PERMISSIVE mode only works with nullable fields.

  • FAILFAST mode

    Throws an IllegalSchemaArgumentException containing the message:

    field y is not nullable but it's missing in one record.

  • DROPMALFORMED mode
+-----+
|value|
+-----+
+-----+
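
For reference, a minimal sketch of selecting the parse mode via the standard `mode` reader option (assuming the `schema` and `testDS` defined above); the output shown is the DROPMALFORMED result this PR proposes:

  val dropped = spark.read
    .schema(schema)
    .option("mode", "DROPMALFORMED") // or "PERMISSIVE" / "FAILFAST"
    .json(testDS)
  dropped.show()
  // +-----+
  // |value|
  // +-----+
  // +-----+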

Does this PR introduce any user-facing change?

Yes. When reading JSON with a missing (or null) value for a non-nullable field, Spark will fail and throw an exception in PERMISSIVE/FAILFAST mode, and will drop the record in DROPMALFORMED mode.

How was this patch tested?

New test.

@github-actions github-actions bot added the SQL label Jul 5, 2021
@AmplabJenkins

Can one of the admins verify this patch?

@cfmcgrady (Contributor, Author)

@cloud-fan (Contributor)

> should return a default value instead of null for the primitive type

I don't agree with this. It's a correctness bug if an integer value is null in the JSON file but Spark reads it as 0. We should fail at runtime. We can either do this null check in the jackson parser, or leverage the AssertNotNull expression.

@HyukjinKwon (Member)

Yeah, I think we should fail.

@@ -404,6 +407,19 @@ class JacksonParser(
}
}

// When the input schema is set to `nullable = false`, make sure the field is not null.
var index = 0
while (badRecordException.isEmpty && !skipRow && index < schema.length) {
Member

Can we fix parseJsonToken instead by passing nullable bool flag?

Contributor (Author)

Let me try.

Contributor (Author)

As the missing field won't call fieldConverter, it seems it's hard to do this?

while (!skipRow && nextUntil(parser, JsonToken.END_OBJECT)) {
  schema.getFieldIndex(parser.getCurrentName) match {
    case Some(index) =>
      try {
        row.update(index, fieldConverters(index).apply(parser))
        skipRow = structFilters.skipRow(row, index)
      } catch {
        case e: SparkUpgradeException => throw e
        case NonFatal(e) if isRoot =>
          badRecordException = badRecordException.orElse(Some(e))
          parser.skipChildren()
      }
    case None =>
      parser.skipChildren()
  }
}

Contributor

So there are 2 cases that can lead to null result:

  1. the JSON value is null
  2. the JSON field is missing
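
For illustration (assuming the value/x/y schema from the PR description), the two cases would look like:

  // case 1: the JSON value is explicitly null
  val valueIsNull  = """{"value":{"x":1,"y":null}}"""
  // case 2: the JSON field is missing entirely
  val fieldMissing = """{"value":{"x":1}}"""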

Contributor

and for malformed fields, we also set the value to null and put the JSON string to a special column.

Contributor (Author)

We may not return the JSON string; the reason is shown in the test.

@HyukjinKwon (Member)

cc @MaxGekk FYI

@viirya (Member) left a comment

+1 for failing the unexpected null case.

assert(exceptionMsg1.contains(
  "the null value found when parsing non-nullable field c2."))
val exceptionMsg2 = intercept[SparkException] {
  load("PERMISSIVE", schema, jsonString).collect
Contributor (Author)

@cloud-fan Since the field is non-nullable, do we still return Row(1, null) in PERMISSIVE mode? If yes, this may cause the struct-cast problem we talked about before: the field is non-nullable but row.isNullAt(index) is true.

Contributor

Yea, I think we can't apply permissive mode here. Failing looks fine to me.

Contributor

Actually, PERMISSIVE mode should not fail at runtime. Shall we fail at the very beginning if the mode is PERMISSIVE and schema contains non-nullable fields?

Contributor (Author)

Done, added a non-nullable check for PERMISSIVE mode when we initialize the FailureSafeParser.

s"the null value found when parsing non-nullable field ${schema(index).name}.")
}
if (!checkedIndexSet.contains(index)) {
skipRow = structFilters.skipRow(row, index)
Contributor

is this change related to the nullability issue?

Contributor (Author)

Yes. When the input JSON string has a field with a missing value, the JacksonParser should still be able to skip rows based on that null-valued field.

Contributor

This was not done before your PR? The missing field is always filled with null value, and we can use it to filter rows. This doesn't seem to be related to this PR.

Contributor (Author)

For example:

  val missingFieldInput = """{"c1":1}"""
  val schema = StructType(Seq(
    StructField("c1", IntegerType, nullable = true),
    StructField("c2", IntegerType, nullable = true)
  ))

  val filters = Seq(IsNotNull("c2"))

  val options = new JSONOptions(Map.empty[String, String], "GMT", "")
  val parser = new JacksonParser(schema, options, false, filters)
  val createParser = CreateJacksonParser.string _
  val result = parser.parse(missingFieldInput, createParser, UTF8String.fromString)
  println(result)

before this pr:

List([1,null])

after this pr:

List()

Please let me know if I've missed out anything.

Contributor

This is not a nullability issue, but a performance improvement. Spark will evaluate the filter again and drop the rows. Can we exclude this part from this PR? We need to backport the nullability fix but not the performance improvement.
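
(As an aside, a hedged illustration of that point: even when JacksonParser does not skip the row itself, an isNotNull filter applied on top of the read gives the same empty result because Spark re-evaluates the filter. The snippet assumes the two-column c1/c2 schema from the example above, an active SparkSession, and spark.implicits._ in scope.)

  import org.apache.spark.sql.functions.col

  val df = spark.read.schema(schema).json(Seq("""{"c1":1}""").toDS)
  df.filter(col("c2").isNotNull).show()
  // +---+---+
  // | c1| c2|
  // +---+---+
  // +---+---+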

Contributor (Author)

updated.

@maropu (Member)

maropu commented Jul 12, 2021

Could you update the title and description, too?

private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord)
private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord))
private val resultRow = new GenericInternalRow(schema.length)
private val nullResult = new GenericInternalRow(schema.length)

// As PERMISSIVE mode should not fail at runtime, fail up front if the mode is PERMISSIVE and
// the schema contains non-nullable fields.
private def disableNotNullableForPermissiveMode: Unit = {
Contributor

nit: this method has a side effect, so it's better to define it with parentheses.
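
For example, a minimal sketch of the parenthesized form being suggested; the body is an assumption pieced together from the PERMISSIVE error message used elsewhere in this PR (and only checks top-level fields), not the PR's exact implementation:

  private def disableNotNullableForPermissiveMode(): Unit = {
    if (mode == PermissiveMode) {
      schema.fields.filterNot(_.nullable).foreach { field =>
        throw new IllegalSchemaArgumentException(
          s"Field ${field.name} is not nullable but PERMISSIVE mode only works with nullable fields.")
      }
    }
  }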

load("PERMISSIVE", schema, jsonString).collect
}
val expectedMsg2 =
"Field c2 is not nullable but PERMISSIVE mode only works with nullable fields."
Contributor

Can we add a migration guide for it? This is a breaking change as PERMISSIVE is the default mode.

Also cc @HyukjinKwon for this breaking change.

Contributor

Another choice is to not respect the user-given schema and always turn it into a nullable schema, if the mode is PERMISSIVE.
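
(For reference, a rough sketch of what "turn it into a nullable schema" could mean; toNullable below is a hypothetical helper written only for illustration, not code from this PR or Spark's actual internal implementation.)

  import org.apache.spark.sql.types._

  // Recursively mark every struct field, array element and map value as nullable.
  def toNullable(dt: DataType): DataType = dt match {
    case st: StructType =>
      StructType(st.fields.map(f => f.copy(dataType = toNullable(f.dataType), nullable = true)))
    case ArrayType(elementType, _) =>
      ArrayType(toNullable(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(toNullable(keyType), toNullable(valueType), valueContainsNull = true)
    case other => other
  }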

@@ -22,6 +22,10 @@ license: |
* Table of contents
{:toc}

## Upgrading from Spark SQL 3.2 to 3.3

- In Spark 3.3, Spark will fail when parsing a JSON/CSV string in `PERMISSIVE` mode if the schema contains non-nullable fields. You can set the mode to `FAILFAST`/`DROPMALFORMED` if you want to read JSON/CSV with a schema that contains non-nullable fields.
Contributor (Author)

This change also affects parsing of CSV strings; I will add a unit test for CSV in a separate PR. cc @cloud-fan @HyukjinKwon

Comment on lines 431 to 441
  badRecordException.isEmpty && !skipRow && options.parseMode != PermissiveMode
if (checkNotNullable) {
  var index = 0
  while (index < schema.length) {
    if (!schema(index).nullable && row.isNullAt(index)) {
      throw new IllegalSchemaArgumentException(
        s"field ${schema(index).name} is not nullable but it's missing in one record.")
    }
    index += 1
  }
}
Member

Could we extract this part as a private method like this?

  private lazy val checkNotNullableInRow = if (options.parseMode != PermissiveMode) {
    (row: GenericInternalRow, schema: StructType, skipRow: Boolean,
        runtimeExceptionOption: Option[Throwable]) => {
      val checkNotNullable =
        runtimeExceptionOption.isEmpty && !skipRow && options.parseMode != PermissiveMode
      if (checkNotNullable) {
        ...
        }
      }
    }
  } else {
    (_: GenericInternalRow, _: StructType, _: Boolean, _: Option[Throwable]) => {}
  }

@cfmcgrady (Contributor, Author)

kindly ping @HyukjinKwon

@cfmcgrady (Contributor, Author)

Run ./dev/lint-python
starting python compilation test...
python compilation succeeded.

starting black test...
black checks passed.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks failed:
python/pyspark/mllib/tree.pyi:29: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/tree.pyi:38: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:34: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:42: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:48: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:54: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:76: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:124: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/feature.pyi:165: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/clustering.pyi:45: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/clustering.pyi:72: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/classification.pyi:39: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
python/pyspark/mllib/classification.pyi:52: error: Overloaded function signatures 1 and 2 overlap with incompatible return types
Found 13 errors in 4 files (checked 314 source files)
Error: Process completed with exit code 1.

The check failure is not related to this PR.

@@ -405,10 +405,18 @@ class JacksonParser(
schema.getFieldIndex(parser.getCurrentName) match {
  case Some(index) =>
    try {
      row.update(index, fieldConverters(index).apply(parser))
      val fieldValue = fieldConverters(index).apply(parser)
Member

I... am not sure if we really need this complicated fix to address the nullability mismatch (which is rather a corner case), to be honest. I wonder if there's a simpler approach, e.g. simply warning on non-nullable columns?
Just to be clear, I don't mind if other committers prefer to fix it.

Contributor

That's another proposal I mentioned earlier: if the user-given schema is non-nullable, we just turn it into a nullable schema and don't fail.

Contributor (Author)

Thank you for your suggestions, I'll raise a new PR.

@cfmcgrady cfmcgrady closed this Jul 20, 2021
HyukjinKwon pushed a commit that referenced this pull request Jul 22, 2021
### What changes were proposed in this pull request?

Rework [PR](#33212) with suggestions.

This PR makes `spark.read.json()` behave the same as the Datasource API `spark.read.format("json").load("path")`: Spark should turn a non-nullable schema into a nullable one by default when using `spark.read.json()`.

Here is an example:

```scala
  val schema = StructType(Seq(StructField("value",
    StructType(Seq(
      StructField("x", IntegerType, nullable = false),
      StructField("y", IntegerType, nullable = false)
    )),
    nullable = true
  )))

  val testDS = Seq("""{"value":{"x":1}}""").toDS
  spark.read
    .schema(schema)
    .json(testDS)
    .printSchema()

  spark.read
    .schema(schema)
    .format("json")
    .load("/tmp/json/t1")
    .printSchema()
  // root
  //  |-- value: struct (nullable = true)
  //  |    |-- x: integer (nullable = true)
  //  |    |-- y: integer (nullable = true)
```

Before this pr:
```
// output of spark.read.json()
root
 |-- value: struct (nullable = true)
 |    |-- x: integer (nullable = false)
 |    |-- y: integer (nullable = false)
```

After this pr:
```
// output of spark.read.json()
root
 |-- value: struct (nullable = true)
 |    |-- x: integer (nullable = true)
 |    |-- y: integer (nullable = true)
```

- `spark.read.csv()` also has the same problem.
- The Datasource API `spark.read.format("json").load("path")` already does this logic when resolving the relation:

https://github.com/apache/spark/blob/c77acf0bbc25341de2636649fdd76f9bb4bdf4ed/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L415-L421

### Does this PR introduce _any_ user-facing change?

Yes, `spark.read.json()` and `spark.read.csv()` no longer respect the nullability of the user-given schema and always turn it into a nullable schema by default.

### How was this patch tested?

New test.

Closes #33436 from cfmcgrady/SPARK-35912-v3.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>