Skip to content

Comments

[SPARK-42384][SQL] Check for null input in generated code for mask function#39945

Closed
bersprockets wants to merge 4 commits intoapache:masterfrom
bersprockets:mask_npe_issue
Closed

[SPARK-42384][SQL] Check for null input in generated code for mask function#39945
bersprockets wants to merge 4 commits intoapache:masterfrom
bersprockets:mask_npe_issue

Conversation

@bersprockets
Copy link
Contributor

What changes were proposed in this pull request?

When generating code for the mask function, call ctx.nullSafeExec to produce null safe code.

This change assumes that the mask function returns null only when the input is null (which appears to be the case, from reading the code of Mask.transformInput).

Why are the changes needed?

The following query fails with a NullPointerException:

create or replace temp view v1 as
select * from values
(null),
('AbCD123-@$#')
as data(col1);

cache table v1;

select mask(col1) from v1;

23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)

The generated code calls UnsafeWriter.write(0, value_0) regardless of whether Mask.transformInput returns null or not. The UnsafeWriter.write method for UTF8String does not expect a null pointer.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests.

@github-actions github-actions bot added the SQL label Feb 8, 2023
@LuciferYang
Copy link
Contributor

cc @gengliangwang FYI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, the new generated code will look like this (if the first child is null, then the result is null):

/* 031 */     boolean isNull_1 = i.isNullAt(0);
/* 032 */     UTF8String value_1 = isNull_1 ?
/* 033 */     null : (i.getUTF8String(0));
/* 034 */
/* 035 */
/* 036 */
/* 037 */
/* 038 */     boolean isNull_0 = isNull_1;
/* 039 */     UTF8String value_0 = null;
/* 040 */
/* 041 */     if (!isNull_1) {
/* 042 */       value_0 = org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, ((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) references[3] /* literal */));;
/* 043 */     }
/* 044 */     if (isNull_0) {
/* 045 */       mutableStateArray_0[0].setNullAt(0);
/* 046 */     } else {
/* 047 */       mutableStateArray_0[0].write(0, value_0);
/* 048 */     }
/* 049 */     return (mutableStateArray_0[0].getRow());

Versus the old generated code, which looks like this (call Mask.transformInput even when input is null, and call UnsafeWriter.write(0, value_0) even if value_0 is null):

/* 031 */     boolean isNull_1 = i.isNullAt(0);
/* 032 */     UTF8String value_1 = isNull_1 ?
/* 033 */     null : (i.getUTF8String(0));
/* 034 */
/* 035 */
/* 036 */
/* 037 */
/* 038 */     UTF8String value_0 = null;
/* 039 */     value_0 = org.apache.spark.sql.catalyst.expressions.Mask.transformInput(value_1, ((UTF8String) references[0] /* literal */), ((UTF8String) references[1] /* literal */), ((UTF8String) references[2] /* literal */), ((UTF8String) references[3] /* literal */));;
/* 040 */     if (false) {
/* 041 */       mutableStateArray_0[0].setNullAt(0);
/* 042 */     } else {
/* 043 */       mutableStateArray_0[0].write(0, value_0);
/* 044 */     }
/* 045 */     return (mutableStateArray_0[0].getRow());
/* 046 */   }

f(firstGen.value, secondGen.value, thirdGen.value, fourthGen.value, fifthGen.value)
ev.copy(
code = code"""
if (nullable) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bersprockets Have you tried using the NullIntolerant trait on the expression Mask instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gengliangwang With this function, only the first parameter affects whether NULL is returned. For the other parameters, NULL input has special meaning. An example from the doc:

      > SELECT _FUNC_('AbCD123-@$#', NULL, NULL, NULL, 'o');
        AbCD123oooo

Mixing in NullIntolerant causes the above query to return NULL:

spark-sql> SELECT MASK('AbCD123-@$#', NULL, NULL, NULL, 'o');
NULL
Time taken: 0.063 seconds, Fetched 1 row(s)

Copy link
Member

@gengliangwang gengliangwang Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for the explanation!

@gengliangwang
Copy link
Member

@bersprockets Thanks for the work. Merging to master/3.4

gengliangwang pushed a commit that referenced this pull request Feb 16, 2023
…nction

### What changes were proposed in this pull request?

When generating code for the mask function, call `ctx.nullSafeExec` to produce null safe code.

This change assumes that the mask function returns null only when the input is null (which appears to be the case, from reading the code of `Mask.transformInput`).

### Why are the changes needed?

The following query fails with a `NullPointerException`:
```
create or replace temp view v1 as
select * from values
(null),
('AbCD123-$#')
as data(col1);

cache table v1;

select mask(col1) from v1;

23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
```
The generated code calls `UnsafeWriter.write(0, value_0)` regardless of whether `Mask.transformInput` returns null or not. The `UnsafeWriter.write` method for `UTF8String` does not expect a null pointer.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes #39945 from bersprockets/mask_npe_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 7ff8ba2)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
@bersprockets bersprockets deleted the mask_npe_issue branch March 28, 2023 16:37
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…nction

### What changes were proposed in this pull request?

When generating code for the mask function, call `ctx.nullSafeExec` to produce null safe code.

This change assumes that the mask function returns null only when the input is null (which appears to be the case, from reading the code of `Mask.transformInput`).

### Why are the changes needed?

The following query fails with a `NullPointerException`:
```
create or replace temp view v1 as
select * from values
(null),
('AbCD123-$#')
as data(col1);

cache table v1;

select mask(col1) from v1;

23/02/07 16:36:06 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
```
The generated code calls `UnsafeWriter.write(0, value_0)` regardless of whether `Mask.transformInput` returns null or not. The `UnsafeWriter.write` method for `UTF8String` does not expect a null pointer.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes apache#39945 from bersprockets/mask_npe_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 7ff8ba2)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants