[SPARK-25072][PySpark] Forbid extra value for custom Row #22140
xuanyuanking wants to merge 2 commits into apache:master from xuanyuanking:SPARK-25072
Conversation
Test build #94920 has finished for PR 22140 at commit
cc @HyukjinKwon
cc @BryanCutler as well, since we discussed an issue about this code path before.
Does it make any sense to have fewer values than fields? Maybe we should check that they are equal, wdyt @HyukjinKwon?
AFAIC, the fix should forbid illegal extra-value passing. If fewer values than fields are given, it should get a
gentle ping @HyukjinKwon @BryanCutler
BryanCutler left a comment:
Let's just leave the case of fewer values for another time, since you already have this fix. I do think you should move the check to `def __call__` in Row, just before `_create_row` is called. It is more user-facing that way.
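For context, the suggestion amounts to something like the following sketch (not the verbatim patch; the error-message wording is discussed further down):

```python
# Sketch only: this method lives inside class Row(tuple) in
# python/pyspark/sql/types.py, and _create_row is the module-level
# helper shown in the diff below.
def __call__(self, *args):
    """Create a new Row object, using this Row's values as the field names."""
    # Fail fast at the user-facing call site instead of deep inside _create_row.
    if len(args) > len(self):
        raise ValueError("Can not create %s by %s" % (self, args))
    return _create_row(self, args)
```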
python/pyspark/sql/tests.py (outdated):
```python
struct_field = StructField("a", IntegerType())
self.assertRaises(TypeError, struct_field.typeName)


def test_invalid_create_row(self):
```
python/pyspark/sql/types.py (outdated):
```python
def _create_row(fields, values):
    if len(values) > len(fields):
        raise ValueError("Can not create %s by %s" % (fields, values))
```
I'd like to improve this message a little, maybe `"Can not create Row with fields %s, expected %d values but got %s" % (fields, len(fields), values)`
Thanks, message improved and check moved to `__call__` in Row: eb3f506
python/pyspark/sql/tests.py (outdated):
```python
self.assertRaises(TypeError, struct_field.typeName)


def test_invalid_create_row(self):
    rowClass = Row("c1", "c2")
```
@BryanCutler, for #22140 (comment), yea, it makes less sense to me actually, but it at least seems to work for now:

```python
from pyspark.sql import Row

rowClass = Row("c1", "c2")
spark.createDataFrame([rowClass(1)]).show()
```

I think we should consider disallowing it in 3.0.0 given the test above.
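For reference, with the check in this PR applied, the opposite case of extra values fails fast at row construction; a rough sketch of the expected behavior (the exact message rendering may differ):

```python
from pyspark.sql import Row

rowClass = Row("c1", "c2")
rowClass(1, 2, 3)
# Raises ValueError with a message along the lines of:
#   Can not create Row with fields ..., expected 2 values but got (1, 2, 3)
```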
Test build #95756 has finished for PR 22140 at commit
Good point, I guess it only fails when you supply a schema.
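A hedged illustration of that point (hypothetical snippet assuming an active `spark` session; the exact error text depends on the Spark version):

```python
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType

rowClass = Row("c1", "c2")

# Without an explicit schema, the one-value row "works": schema inference
# simply pairs up the values it finds. With an explicit two-field schema,
# verification rejects the row because its length does not match the
# number of schema fields.
schema = StructType([
    StructField("c1", IntegerType()),
    StructField("c2", IntegerType()),
])
spark.createDataFrame([rowClass(1)], schema).show()  # raises an error
```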
## What changes were proposed in this pull request?
Add value length check in `_create_row`, forbid extra value for custom Row in PySpark.
## How was this patch tested?
New UT in pyspark-sql.
Closes #22140 from xuanyuanking/SPARK-25072.
Lead-authored-by: liyuanjian <liyuanjian@baidu.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
(cherry picked from commit c84bc40)
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
Merged to master, branch-2.4 and branch-2.3. Thanks @xuanyuanking!
Thanks @BryanCutler @HyukjinKwon!
@BryanCutler What is the reason to backport this PR? This sounds like a behavior change. @xuanyuanking Could you please update the document?
#22369. Thanks for the reminder, I'll pay attention to this in future work.
@gatorsmile it seemed like a straightforward bug to me. Rows with extra values lead to incorrect output and exceptions when used in `createDataFrame`. Maybe I was too hasty with backporting and this needed some discussion. Do you know of a use case that this change would break?
Yea, actually I wouldn't backport this to branch-2.3 at least, since the release is very close. Looks like a bug to me as well. One nitpick is the case with an RDD operation:

```python
>>> from pyspark.sql import Row
>>> row_class = Row("c1", "c2")
>>> row = row_class(1, 2, 3)
>>> spark.sparkContext.parallelize([row]).map(lambda r: r.c1).collect()
[1]
```

This is really unlikely and I even wonder if it makes any sense (also given the nature of the Python language itself), but there might still be such a case, although creating the namedtuple-like row with invalid arguments should itself be disallowed, as fixed here. Can we just take this out of branch-2.3?
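Why the RDD example above still returns a value: PySpark's Row is a namedtuple-like tuple subclass, and attribute access resolves a field name to a position, so trailing extra values are simply never read. A toy illustration (hypothetical `MiniRow`, not PySpark's actual class):

```python
class MiniRow(tuple):
    """A stripped-down stand-in for PySpark's namedtuple-like Row."""

    def __new__(cls, fields, values):
        row = super(MiniRow, cls).__new__(cls, values)
        row.__fields__ = fields
        return row

    def __getattr__(self, name):
        # "c1" resolves to index 0; values past len(__fields__) are ignored.
        try:
            return self[self.__fields__.index(name)]
        except ValueError:
            raise AttributeError(name)


row = MiniRow(("c1", "c2"), (1, 2, 3))
print(row.c1)  # 1 -- the stray third value is carried along but never read
```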
Thanks @HyukjinKwon, that is fine with me. What do you think @gatorsmile?
@BryanCutler @HyukjinKwon Thanks for your understanding. Normally, we are very conservative about introducing any potential behavior change into a released version. I just reverted it from branch-2.3. Thanks!
Yes, I know. At the time it seemed like failing fast rather than later, plus improving the error message, but best to be safe. Thanks!
We are very conservative when backporting PRs to released versions.
## What changes were proposed in this pull request?
Add value length check in `_create_row`, forbid extra value for custom Row in PySpark.
## How was this patch tested?
New UT in pyspark-sql.
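For completeness, a sketch of what the new unit test plausibly looks like, pieced together from the review fragments above (hypothetical suite name; the merged test may differ in details):

```python
import unittest

from pyspark.sql import Row


class RowConstructionTests(unittest.TestCase):  # hypothetical suite name
    def test_invalid_create_row(self):
        rowClass = Row("c1", "c2")
        # Extra values beyond the declared fields should be rejected.
        self.assertRaises(ValueError, lambda: rowClass(1, 2, 3))


if __name__ == "__main__":
    unittest.main()
```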