[SPARK-35912][SQL] Fix cast struct contains null value to string/struct #33146
Can one of the admins verify this patch?
val javaType = JavaCode.javaType(ft)
code"""
   |${if (i != 0) code"""$buffer.append(",");""" else EmptyBlock}
   |if ($row.isNullAt($i)) {
When the actual value is null, for a primitive-type field row.isNullAt(i) returns true, but row.getXXX returns a default value.
For example:
val r = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(Array(1, null))
println(r.getInt(0)) // 1
println(r.getInt(1)) // 0
println(r.isNullAt(1)) // true
So we can't only check row.isNullAt(i) here; we need to follow the same logic as BoundReference.doGenCode() and add a nullable check.
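To make the point concrete without pulling in Spark, here is a minimal sketch of the guarded check. ToyRow and fieldToString are hypothetical stand-ins, not Spark classes; the only behaviour assumed is the one shown above, namely that a null slot reports isNullAt == true while getInt falls back to 0.

```scala
object NullCheckSketch {
  // Hypothetical stand-in for an InternalRow (not a Spark class).
  final case class ToyRow(values: Array[Any]) {
    def isNullAt(i: Int): Boolean = values(i) == null
    def getInt(i: Int): Int = values(i) match {
      case null   => 0 // primitive default, mirroring the behaviour described above
      case n: Int => n
    }
  }

  // Render field i for a struct-to-string cast.
  // The null branch is taken only when the field is declared nullable.
  def fieldToString(row: ToyRow, i: Int, nullable: Boolean): String =
    if (nullable && row.isNullAt(i)) "null" else row.getInt(i).toString

  def main(args: Array[String]): Unit = {
    val r = ToyRow(Array(1, null))
    println(fieldToString(r, 1, nullable = true))  // null
    println(fieldToString(r, 1, nullable = false)) // 0
  }
}
```

With the guard, the result depends on the declared nullability rather than on isNullAt alone.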
How is the cache issue related to the cast?
Hi, @HyukjinKwon
@@ -406,19 +406,21 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
 if (row.numFields > 0) {
 val st = fields.map(_.dataType)
 val toUTF8StringFuncs = st.map(castToString)
-if (row.isNullAt(0)) {
+if (fields(0).nullable && row.isNullAt(0)) {
If fields(0).nullable is false, how can row.isNullAt(0) be true?
(I have the same question)
If a user creates a DataFrame with spark.internalCreateDataFrame(), row.isNullAt() may be true even though the schema marks the field as non-nullable.
For instance:
val schema = StructType(Seq(
StructField("x",
StructType(Seq(
StructField("y", IntegerType, true),
StructField("z", IntegerType, false)
)))))
val rdd = spark.sparkContext.parallelize(Seq(InternalRow(InternalRow(1, null))))
val df = spark.internalCreateDataFrame(rdd, schema)
df.show
// current master branch output
// +---------+
// | x|
// +---------+
// |{1, null}|
// +---------+
Although spark.internalCreateDataFrame() is an API private to the sql package, spark.read.json() and spark.read.csv() call it without handling null values (see the example in the PR description).
Then we need to fix the nullability. There are many places in the Spark codebase that rely on nullability for optimizations. It's not possible to change all of them to stop trusting the nullability.
Can we fix spark.read.json() to set the nullability correctly?
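As a rough illustration of what "data that disagrees with the declared nullability" means, here is a self-contained sketch that detects a null stored in a non-nullable field, including inside nested structs. ToyType, ToyField, ToyStruct and violates are hypothetical names for this sketch only; it does not use Spark's StructType and is not how spark.read.json() works.

```scala
object NullabilitySketch {
  // Hypothetical toy schema model (not Spark's StructType/StructField).
  sealed trait ToyType
  case object ToyInt extends ToyType
  final case class ToyStruct(fields: Seq[ToyField]) extends ToyType
  final case class ToyField(name: String, dataType: ToyType, nullable: Boolean)

  // True if some null value sits in a field declared non-nullable.
  // Rows are modelled as Seq[Any]; nested structs are nested Seqs.
  def violates(row: Seq[Any], schema: ToyStruct): Boolean =
    row.zip(schema.fields).exists {
      case (null, f) => !f.nullable
      case (nested: Seq[_], ToyField(_, s: ToyStruct, _)) => violates(nested, s)
      case _ => false
    }

  def main(args: Array[String]): Unit = {
    // Same shape as the schema in the example above: x: struct<y: int, z: int not null>
    val inner = ToyStruct(Seq(
      ToyField("y", ToyInt, nullable = true),
      ToyField("z", ToyInt, nullable = false)))
    val schema = ToyStruct(Seq(ToyField("x", inner, nullable = true)))

    println(violates(Seq(Seq(1, null)), schema)) // true: z is null but non-nullable
    println(violates(Seq(Seq(null, 2)), schema)) // false: only the nullable y is null
  }
}
```

Whether a reader should reject such rows or relax the schema to nullable is a separate design choice that this sketch does not settle.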
Ok, let me try.
Hey, mind explaining why the cast path issue is related to being cached?
test("SPARK-35912: Cast struct contains the null value to string") {
  Seq(true, false).foreach {
    case nullable =>
nit:
Seq(true, false).foreach { nullable =>
}

test("SPARK-35912: Cast struct contains the null value to struct") {
  Seq(true, false).foreach { case nullable =>
nit: Seq(true, false).foreach { nullable =>
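For context on this nit, a small sketch showing that the two forms behave the same when the pattern is a plain variable, so the `case` keyword adds nothing here. The object and method names are made up for illustration.

```scala
object ForeachStyleSketch {
  // Form used in the test under review: a pattern-matching block with one catch-all case.
  def withCase(): Seq[String] = {
    val seen = Seq.newBuilder[String]
    Seq(true, false).foreach { case nullable => seen += s"nullable=$nullable" }
    seen.result()
  }

  // Suggested form: an ordinary lambda parameter.
  def withoutCase(): Seq[String] = {
    val seen = Seq.newBuilder[String]
    Seq(true, false).foreach { nullable => seen += s"nullable=$nullable" }
    seen.result()
  }

  def main(args: Array[String]): Unit = {
    println(withCase() == withoutCase()) // true
  }
}
```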
      checkEvaluation(ret, expected)
    }
  }

nit: please remove this blank.
Actually, the cached result is what we want. The issue is that
Shouldn't it fail instead of setting it as
Thanks for your suggestion, I'll try.
Created a new PR. BTW, shall we merge this PR? The cast issue may occur when the user creates a DataFrame from API
What changes were proposed in this pull request?
This PR fixes an issue where casting a struct that contains a null value to another type produces different results depending on whether codegen is enabled or disabled.
Here is an example:
Actually, the result should depend on the field's nullable setting. This bug also happens when we cast a struct to a struct.
Does this PR introduce any user-facing change?
No, only bug fix.
How was this patch tested?
New test.