Skip to content

[SPARK-57736][SQL] Fix NPE in CreateNamedStruct.dataType when a struct field name is null#56845

Closed
MaxGekk wants to merge 2 commits into
apache:masterfrom
MaxGekk:SPARK-57736-createnamedstruct-npe
Closed

[SPARK-57736][SQL] Fix NPE in CreateNamedStruct.dataType when a struct field name is null#56845
MaxGekk wants to merge 2 commits into
apache:masterfrom
MaxGekk:SPARK-57736-createnamedstruct-npe

Conversation

@MaxGekk

@MaxGekk MaxGekk commented Jun 28, 2026

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

CreateNamedStruct.dataType builds each field with StructField(name.toString, ...):

override lazy val dataType: StructType = {
  val fields = names.zip(valExprs).map {
    case (name, expr) =>
      ...
      StructField(name.toString, expr.dataType, expr.nullable, metadata)   // NPE if name == null
  }
  StructType(fields)
}

When a field name is null, name.toString throws a NullPointerException. This is reached eagerly while building a RowEncoder serializer (SerializerBuildHelper.createSerializerForObject -> CreateNamedStruct(...).dataType), so it crashes before any analysis runs. This PR makes the field name null-safe and preserves the null name:

StructField(if (name == null) null else name.toString, expr.dataType, expr.nullable, metadata)

Why are the changes needed?

A null field name is invalid input -- CreateNamedStruct.checkInputDataTypes already rejects it (names.contains(null) -> UNEXPECTED_NULL) -- but dataType dereferences name.toString before type checking, and the encoder calls dataType directly. Keeping it null-safe converts the hard NullPointerException into correct behavior, consistent with SPARK-57725 which made AttributeSeq tolerate null-named attributes.

Minimal reproduction:

import org.apache.spark.sql.catalyst.expressions.{CreateNamedStruct, Literal}
import org.apache.spark.sql.types.{IntegerType, StringType}

CreateNamedStruct(Seq(Literal.create(null, StringType), Literal(1))).dataType  // NPE before this fix

Note: this fixes the specific CreateNamedStruct.dataType NPE. The full createDataFrame(schemaWithNullFieldName) scenario hits additional, independent null-name sites further along (e.g. a StructField.name.equalsIgnoreCase schema comparison during resolution), which are separate pre-existing issues and out of scope here.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a regression test in ComplexTypeSuite asserting dataType no longer throws and preserves the null field name.

build/sbt 'catalyst/testOnly *ComplexTypeSuite'

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor

…t field name is null

### What changes were proposed in this pull request?
`CreateNamedStruct.dataType` builds each field with `StructField(name.toString, ...)`, which
throws a `NullPointerException` when a field name is null. This is reached eagerly while
building a `RowEncoder` serializer, so it crashes before any analysis runs. This PR makes the
field name null-safe and preserves the null name:
`StructField(if (name == null) null else name.toString, ...)`.

### Why are the changes needed?
A null field name is invalid input (`checkInputDataTypes` already flags it as `UNEXPECTED_NULL`),
but `dataType` dereferences `name.toString` before type checking, and the encoder calls
`dataType` directly. Keeping it null-safe converts the hard NPE into correct behavior, consistent
with SPARK-57725 which made `AttributeSeq` tolerate null-named attributes.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a regression test in `ComplexTypeSuite` asserting `dataType` no longer throws and preserves
the null field name.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor
@MaxGekk

MaxGekk commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

Could you review when you have a moment? cc @cloud-fan @dongjoon-hyun

Comment on lines +756 to +757
assert(dt.length == 1)
assert(dt.head.name == null)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the test a bit - the current regression test asserts dt.length == 1 and dt.head.name == null (core behavior covered) but does not assert checkInputDataTypes() still returns UNEXPECTED_NULL, nor exercise a null name mixed with valid named fields.

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @MaxGekk, I left one note but otherwise looks good!

@MaxGekk

MaxGekk commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

Thank you for the review, @uros-b! Addressed your comment: the regression test now also asserts checkInputDataTypes() returns UNEXPECTED_NULL and exercises a null field name mixed with valid named fields.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@MaxGekk

MaxGekk commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

Merging to master/4.x/4.2/4.1/4.0. Thank you, @uros-b and @dongjoon-hyun for review.

@MaxGekk MaxGekk closed this in 0525313 Jun 29, 2026
MaxGekk added a commit that referenced this pull request Jun 29, 2026
…t field name is null

### What changes were proposed in this pull request?

`CreateNamedStruct.dataType` builds each field with `StructField(name.toString, ...)`:

```scala
override lazy val dataType: StructType = {
  val fields = names.zip(valExprs).map {
    case (name, expr) =>
      ...
      StructField(name.toString, expr.dataType, expr.nullable, metadata)   // NPE if name == null
  }
  StructType(fields)
}
```

When a field name is `null`, `name.toString` throws a `NullPointerException`. This is reached eagerly while building a `RowEncoder` serializer (`SerializerBuildHelper.createSerializerForObject` -> `CreateNamedStruct(...).dataType`), so it crashes before any analysis runs. This PR makes the field name null-safe and preserves the null name:

```scala
StructField(if (name == null) null else name.toString, expr.dataType, expr.nullable, metadata)
```

### Why are the changes needed?

A null field name is invalid input -- `CreateNamedStruct.checkInputDataTypes` already rejects it (`names.contains(null)` -> `UNEXPECTED_NULL`) -- but `dataType` dereferences `name.toString` before type checking, and the encoder calls `dataType` directly. Keeping it null-safe converts the hard `NullPointerException` into correct behavior, consistent with SPARK-57725 which made `AttributeSeq` tolerate null-named attributes.

Minimal reproduction:

```scala
import org.apache.spark.sql.catalyst.expressions.{CreateNamedStruct, Literal}
import org.apache.spark.sql.types.{IntegerType, StringType}

CreateNamedStruct(Seq(Literal.create(null, StringType), Literal(1))).dataType  // NPE before this fix
```

Note: this fixes the specific `CreateNamedStruct.dataType` NPE. The full `createDataFrame(schemaWithNullFieldName)` scenario hits additional, independent null-name sites further along (e.g. a `StructField.name.equalsIgnoreCase` schema comparison during resolution), which are separate pre-existing issues and out of scope here.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a regression test in `ComplexTypeSuite` asserting `dataType` no longer throws and preserves the null field name.

```
build/sbt 'catalyst/testOnly *ComplexTypeSuite'
```

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor

Closes #56845 from MaxGekk/SPARK-57736-createnamedstruct-npe.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 0525313)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk added a commit that referenced this pull request Jun 29, 2026
…t field name is null

### What changes were proposed in this pull request?

`CreateNamedStruct.dataType` builds each field with `StructField(name.toString, ...)`:

```scala
override lazy val dataType: StructType = {
  val fields = names.zip(valExprs).map {
    case (name, expr) =>
      ...
      StructField(name.toString, expr.dataType, expr.nullable, metadata)   // NPE if name == null
  }
  StructType(fields)
}
```

When a field name is `null`, `name.toString` throws a `NullPointerException`. This is reached eagerly while building a `RowEncoder` serializer (`SerializerBuildHelper.createSerializerForObject` -> `CreateNamedStruct(...).dataType`), so it crashes before any analysis runs. This PR makes the field name null-safe and preserves the null name:

```scala
StructField(if (name == null) null else name.toString, expr.dataType, expr.nullable, metadata)
```

### Why are the changes needed?

A null field name is invalid input -- `CreateNamedStruct.checkInputDataTypes` already rejects it (`names.contains(null)` -> `UNEXPECTED_NULL`) -- but `dataType` dereferences `name.toString` before type checking, and the encoder calls `dataType` directly. Keeping it null-safe converts the hard `NullPointerException` into correct behavior, consistent with SPARK-57725 which made `AttributeSeq` tolerate null-named attributes.

Minimal reproduction:

```scala
import org.apache.spark.sql.catalyst.expressions.{CreateNamedStruct, Literal}
import org.apache.spark.sql.types.{IntegerType, StringType}

CreateNamedStruct(Seq(Literal.create(null, StringType), Literal(1))).dataType  // NPE before this fix
```

Note: this fixes the specific `CreateNamedStruct.dataType` NPE. The full `createDataFrame(schemaWithNullFieldName)` scenario hits additional, independent null-name sites further along (e.g. a `StructField.name.equalsIgnoreCase` schema comparison during resolution), which are separate pre-existing issues and out of scope here.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a regression test in `ComplexTypeSuite` asserting `dataType` no longer throws and preserves the null field name.

```
build/sbt 'catalyst/testOnly *ComplexTypeSuite'
```

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor

Closes #56845 from MaxGekk/SPARK-57736-createnamedstruct-npe.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 0525313)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk added a commit that referenced this pull request Jun 29, 2026
…t field name is null

### What changes were proposed in this pull request?

`CreateNamedStruct.dataType` builds each field with `StructField(name.toString, ...)`:

```scala
override lazy val dataType: StructType = {
  val fields = names.zip(valExprs).map {
    case (name, expr) =>
      ...
      StructField(name.toString, expr.dataType, expr.nullable, metadata)   // NPE if name == null
  }
  StructType(fields)
}
```

When a field name is `null`, `name.toString` throws a `NullPointerException`. This is reached eagerly while building a `RowEncoder` serializer (`SerializerBuildHelper.createSerializerForObject` -> `CreateNamedStruct(...).dataType`), so it crashes before any analysis runs. This PR makes the field name null-safe and preserves the null name:

```scala
StructField(if (name == null) null else name.toString, expr.dataType, expr.nullable, metadata)
```

### Why are the changes needed?

A null field name is invalid input -- `CreateNamedStruct.checkInputDataTypes` already rejects it (`names.contains(null)` -> `UNEXPECTED_NULL`) -- but `dataType` dereferences `name.toString` before type checking, and the encoder calls `dataType` directly. Keeping it null-safe converts the hard `NullPointerException` into correct behavior, consistent with SPARK-57725 which made `AttributeSeq` tolerate null-named attributes.

Minimal reproduction:

```scala
import org.apache.spark.sql.catalyst.expressions.{CreateNamedStruct, Literal}
import org.apache.spark.sql.types.{IntegerType, StringType}

CreateNamedStruct(Seq(Literal.create(null, StringType), Literal(1))).dataType  // NPE before this fix
```

Note: this fixes the specific `CreateNamedStruct.dataType` NPE. The full `createDataFrame(schemaWithNullFieldName)` scenario hits additional, independent null-name sites further along (e.g. a `StructField.name.equalsIgnoreCase` schema comparison during resolution), which are separate pre-existing issues and out of scope here.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a regression test in `ComplexTypeSuite` asserting `dataType` no longer throws and preserves the null field name.

```
build/sbt 'catalyst/testOnly *ComplexTypeSuite'
```

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor

Closes #56845 from MaxGekk/SPARK-57736-createnamedstruct-npe.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 0525313)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk added a commit that referenced this pull request Jun 29, 2026
…t field name is null

### What changes were proposed in this pull request?

`CreateNamedStruct.dataType` builds each field with `StructField(name.toString, ...)`:

```scala
override lazy val dataType: StructType = {
  val fields = names.zip(valExprs).map {
    case (name, expr) =>
      ...
      StructField(name.toString, expr.dataType, expr.nullable, metadata)   // NPE if name == null
  }
  StructType(fields)
}
```

When a field name is `null`, `name.toString` throws a `NullPointerException`. This is reached eagerly while building a `RowEncoder` serializer (`SerializerBuildHelper.createSerializerForObject` -> `CreateNamedStruct(...).dataType`), so it crashes before any analysis runs. This PR makes the field name null-safe and preserves the null name:

```scala
StructField(if (name == null) null else name.toString, expr.dataType, expr.nullable, metadata)
```

### Why are the changes needed?

A null field name is invalid input -- `CreateNamedStruct.checkInputDataTypes` already rejects it (`names.contains(null)` -> `UNEXPECTED_NULL`) -- but `dataType` dereferences `name.toString` before type checking, and the encoder calls `dataType` directly. Keeping it null-safe converts the hard `NullPointerException` into correct behavior, consistent with SPARK-57725 which made `AttributeSeq` tolerate null-named attributes.

Minimal reproduction:

```scala
import org.apache.spark.sql.catalyst.expressions.{CreateNamedStruct, Literal}
import org.apache.spark.sql.types.{IntegerType, StringType}

CreateNamedStruct(Seq(Literal.create(null, StringType), Literal(1))).dataType  // NPE before this fix
```

Note: this fixes the specific `CreateNamedStruct.dataType` NPE. The full `createDataFrame(schemaWithNullFieldName)` scenario hits additional, independent null-name sites further along (e.g. a `StructField.name.equalsIgnoreCase` schema comparison during resolution), which are separate pre-existing issues and out of scope here.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a regression test in `ComplexTypeSuite` asserting `dataType` no longer throws and preserves the null field name.

```
build/sbt 'catalyst/testOnly *ComplexTypeSuite'
```

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor

Closes #56845 from MaxGekk/SPARK-57736-createnamedstruct-npe.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 0525313)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants