New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-16356][Follow-up][ML] Enforce ML test of exception for local/distributed Dataset. #15261
Conversation
cc @HyukjinKwon |
Oh, I see. Thanks for looking into this. +1 for this PR. |
So, this is roughly error from driver-side vs executor side. I see. |
Test build #65964 has finished for PR 15261 at commit
|
@HyukjinKwon The root cause of this is Spark supported creating local Dataset which may not trigger a Spark job. This satisfied the design of Dataset, and in most case they have same behavior for local and distributed Dataset except whether to wrapper the exception. We need to enforce the test in ML. If Spark SQL guys unified them in the future, we can make corresponding change. Thanks! |
@srowen Would you mind to have a look when you are available? Thanks. |
@@ -121,10 +119,17 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext | |||
|
|||
model.transform(densePoints1) // should work | |||
model.transform(sparsePoints1) // should work | |||
intercept[SparkException] { | |||
// If the data is local Dataset, it throws AssertionError directly. | |||
intercept[AssertionError] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe off-topic for this review, but is this an assertion error? bad input shouldn't cause an assertion to trip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is indeed a problem, let's try to find out all similar cases and resolved them in a separate PR. Thanks.
Merged into master, thanks for review. |
What changes were proposed in this pull request?
#14035 added
testImplicits
to ML unit tests and promotedtoDF()
, but left one minor issue atVectorIndexerSuite
. If we create the DataFrame bySeq(...).toDF()
, it will throw different error/exception compared withsc.parallelize(Seq(...)).toDF()
for one of the test cases.After in-depth study, I found it was caused by different behavior of local and distributed Dataset if the UDF failed at
assert
. If the data is local Dataset, it throwsAssertionError
directly; If the data is distributed Dataset, it throwsSparkException
which is the wrapper ofAssertionError
. I think we should enforce this test to cover both case.How was this patch tested?
Unit test.