Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16356][Follow-up][ML] Enforce ML test of exception for local/distributed Dataset. #15261

Closed
wants to merge 1 commit into from

Conversation

yanboliang
Copy link
Contributor

@yanboliang yanboliang commented Sep 27, 2016

What changes were proposed in this pull request?

#14035 added testImplicits to ML unit tests and promoted toDF(), but left one minor issue at VectorIndexerSuite. If we create the DataFrame by Seq(...).toDF(), it will throw different error/exception compared with sc.parallelize(Seq(...)).toDF() for one of the test cases.

After in-depth study, I found it was caused by different behavior of local and distributed Dataset if the UDF failed at assert. If the data is local Dataset, it throws AssertionError directly; If the data is distributed Dataset, it throws SparkException which is the wrapper of AssertionError. I think we should enforce this test to cover both case.

How was this patch tested?

Unit test.

@yanboliang
Copy link
Contributor Author

cc @HyukjinKwon

@HyukjinKwon
Copy link
Member

Oh, I see. Thanks for looking into this. +1 for this PR.

@HyukjinKwon
Copy link
Member

So, this is roughly error from driver-side vs executor side. I see.

@SparkQA
Copy link

SparkQA commented Sep 27, 2016

Test build #65964 has finished for PR 15261 at commit c24021e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

yanboliang commented Sep 27, 2016

@HyukjinKwon The root cause of this is Spark supported creating local Dataset which may not trigger a Spark job. This satisfied the design of Dataset, and in most case they have same behavior for local and distributed Dataset except whether to wrapper the exception. We need to enforce the test in ML. If Spark SQL guys unified them in the future, we can make corresponding change. Thanks!

@yanboliang
Copy link
Contributor Author

@srowen Would you mind to have a look when you are available? Thanks.

@@ -121,10 +119,17 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext

model.transform(densePoints1) // should work
model.transform(sparsePoints1) // should work
intercept[SparkException] {
// If the data is local Dataset, it throws AssertionError directly.
intercept[AssertionError] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe off-topic for this review, but is this an assertion error? bad input shouldn't cause an assertion to trip.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is indeed a problem, let's try to find out all similar cases and resolved them in a separate PR. Thanks.

@yanboliang
Copy link
Contributor Author

Merged into master, thanks for review.

@asfgit asfgit closed this in a19a1bb Sep 29, 2016
@yanboliang yanboliang deleted the spark-16356 branch September 29, 2016 07:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants