Describe the bug
TimedWithCodegenExec.doConsume drops the row: ExprCode parameter when calling
consume(ctx, input). This causes ctx.INPUT_ROW to be null for any downstream
CodegenFallback expression that interpolates INPUT_ROW into its generated code,
producing an NPE inside Block.code interpolation:
java.lang.NullPointerException: Cannot invoke "Object.getClass()" because "arg" is null
at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotInterpolateClassIntoCodeBlockError(QueryExecutionErrors.scala:426)
at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1(javaCode.scala:240)
...
at org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback.doGenCode(CodegenFallback.scala:56)
at org.apache.spark.sql.catalyst.expressions.JsonToStructs.doGenCode(jsonExpressions.scala:541)
The plugin works correctly for plans that contain only fully codegen'd expressions; it
breaks any plan that contains a CodegenFallback expression — most visibly from_json
(JsonToStructs), but the same applies to any expression that extends CodegenFallback
and uses ctx.INPUT_ROW in its generated code.
Root cause
In spark-plugin/plugin/src/main/scala/org/apache/spark/dataflint/TimedExec.scala
(lines 127-128 on main):
override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String =
consume(ctx, input)
This drops the third argument. The Spark CodegenSupport.consume signature is
final def consume(ctx: CodegenContext, outputVars: Seq[ExprCode], row: String = null): String
When TimedWithCodegenExec is inserted between operators, downstream CodegenSupport
nodes that consult ctx.INPUT_ROW see null instead of the underlying row variable.
CodegenFallback.doGenCode (Spark 3.5, CodegenFallback.scala:56) interpolates
val input = ctx.INPUT_ROW into a code"..." block:
ev.copy(code = code"""
| ...
| ((Expression) references[$idx]).eval($input);
| ...
""")
The Scala code macro at javaCode.scala:237-250 walks each interpolated arg and calls
arg.getClass to dispatch by type; a null arg here triggers
cannotInterpolateClassIntoCodeBlockError, which immediately NPEs on arg.getClass
before the type-error message can even be constructed.
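The root cause is easiest to see outside Spark: a pass-through wrapper drops one argument while delegating, and a downstream consumer that interpolates that value fails. A minimal self-contained sketch of the same failure mode — all names below are illustrative stand-ins, not real Spark APIs:

```python
# Minimal model of the bug, outside Spark. All names are illustrative
# stand-ins, not real Spark APIs.

class Ctx:
    def __init__(self):
        self.input_row = None  # stands in for ctx.INPUT_ROW

def consume(ctx, row=None):
    # stands in for CodegenSupport.consume: publishes the current row variable
    ctx.input_row = row

def buggy_do_consume(ctx, row):
    consume(ctx)  # drops `row`, like TimedWithCodegenExec.doConsume on main

def fixed_do_consume(ctx, row):
    consume(ctx, row)  # forwards `row`, like the proposed fix

def fallback_gen_code(ctx):
    # stands in for CodegenFallback.doGenCode: interpolates the row variable
    # into generated code; a missing row variable fails instead
    if ctx.input_row is None:
        raise TypeError("cannot interpolate null into code block")
    return f"((Expression) references[0]).eval({ctx.input_row});"

ctx = Ctx()
buggy_do_consume(ctx, "inputadapter_row")
try:
    fallback_gen_code(ctx)
except TypeError as e:
    print("buggy path:", e)

ctx = Ctx()
fixed_do_consume(ctx, "inputadapter_row")
print("fixed path:", fallback_gen_code(ctx))
```

The buggy wrapper makes the downstream "code generator" fail exactly where Spark's Block interpolation does; forwarding the argument restores the row variable and code generation succeeds.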
Environment
Spark version: 3.5 (3.5.6-amzn-1)
platform: EMR (7.12.0, Scala 2.12)
- DataFlint: spark_2.12-0.9.7 (also reproduces against earlier 0.9.x — the offending
  code in TimedExec.scala has not changed materially in recent versions)
- spark.plugins=io.dataflint.spark.SparkDataflintPlugin
- spark.dataflint.instrument.spark.enabled=true
- AQE on (default), Kryo serializer
To Reproduce
Steps to reproduce the behavior:
- Start a Spark 3.5.6 session on EMR 7.12 with the DataFlint plugin enabled:
  spark.plugins=io.dataflint.spark.SparkDataflintPlugin,
  spark.dataflint.instrument.spark.enabled=true.
- Run the following PySpark snippet (uses from_json, a CodegenFallback expression):
from pyspark.sql.functions import from_json, col, explode
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
schema = ArrayType(StructType([
    StructField("name", StringType(), True),
    StructField("kind", StringType(), True),
]))
df = spark.createDataFrame(
    [("k1", '[{"name":"a","kind":"x"}]'), ("k2", None)],
    "key STRING, payload STRING",
)
(df
    .filter(col("payload").isNotNull())
    .withColumn("parsed", from_json(col("payload"), schema))
    .filter(col("parsed").isNotNull())
    .select("key", explode("parsed").alias("d"))
    .filter(col("d.name").isNotNull())
    .count())
- Trigger the action (.count()).
- See the NPE in the stack trace above, raised from
  CodegenFallback.doGenCode → Block.code interpolation.
Expected behavior
The query runs to completion and returns the row count, the same as it does without
the DataFlint plugin or with whole-stage codegen disabled. TimedWithCodegenExec
should propagate the row parameter to consume, so downstream CodegenFallback
expressions see a valid ctx.INPUT_ROW:
override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String =
consume(ctx, input, if (row == null) null else row.value)
This matches how stock Spark operators that don't transform rows (e.g.
InputAdapter's consume path) propagate INPUT_ROW.
Screenshots
N/A — failure is a JVM stack trace (included above).
Additional context
- Workarounds: Setting
spark.dataflint.instrument.spark.enabled=false (plugin
still loaded) avoids the crash and is what we are deploying. Setting
spark.sql.codegen.wholeStage=false also avoids it but at a real performance cost.
- Impact: Any Spark workload using DataFlint instrumentation that contains a
CodegenFallback expression which references INPUT_ROW in its generated code.
The most common offender is from_json, but the same path is used by other
expressions in this family. Because the bug is plan-shape-dependent, it can lurk
in pipelines for a long time and then surface when a new column or preprocessor
introduces a CodegenFallback expression — exactly how we hit it.
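For reference, the deployed workaround expressed as spark-defaults.conf entries — a sketch assembled from the settings already quoted in this report, not a recommended permanent configuration:

```properties
# Keep the DataFlint plugin loaded, but disable its Spark instrumentation
spark.plugins                             io.dataflint.spark.SparkDataflintPlugin
spark.dataflint.instrument.spark.enabled  false

# Heavier alternative (avoid: disables whole-stage codegen globally)
# spark.sql.codegen.wholeStage            false
```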