[SPARK-19382][ML]:Test sparse vectors in LinearSVCSuite #16784

wangmiao1981 · 2017-02-03T07:11:03Z

What changes were proposed in this pull request?

Add unit tests for testing SparseVector.

We can't add a test case that mixes DenseVector and SparseVector, as discussed in JIRA SPARK-19382.

    def merge(other: MultivariateOnlineSummarizer): this.type = {
      if (this.totalWeightSum != 0.0 && other.totalWeightSum != 0.0) {
        require(n == other.n, s"Dimensions mismatch when merging with another summarizer. " +
          s"Expecting $n but got ${other.n}.")
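To make the failure mode concrete, here is a minimal sketch in plain Scala of the same kind of dimension check. `TinySummarizer` and `SummarizerDemo` are hypothetical stand-ins invented for illustration; they are not Spark classes and track only the dimension and the sample count, assuming (as the snippet above suggests) that the dimension is fixed by the first sample.

```scala
import scala.util.Try

// Hypothetical stand-in for a summarizer: tracks only dimension and sample count.
final class TinySummarizer {
  private var n = 0       // feature dimension, fixed by the first sample
  private var count = 0L  // number of samples seen so far

  def add(sample: Array[Double]): this.type = {
    if (count == 0L) n = sample.length
    require(sample.length == n,
      s"Dimensions mismatch when adding new sample. Expecting $n but got ${sample.length}.")
    count += 1
    this
  }

  def merge(other: TinySummarizer): this.type = {
    if (count != 0L && other.count != 0L) {
      // Both sides have seen data, so their dimensions must agree.
      require(n == other.n, s"Dimensions mismatch when merging with another summarizer. " +
        s"Expecting $n but got ${other.n}.")
      count += other.count
    } else if (count == 0L && other.count != 0L) {
      // This side is empty: adopt the other side's state.
      n = other.n
      count = other.count
    }
    this
  }

  def size: Long = count
}

object SummarizerDemo {
  // One partition saw length-2 vectors, another saw vectors 10x longer.
  val narrow = new TinySummarizer().add(Array(1.0, 2.0))
  val wide = new TinySummarizer().add(Array.fill(20)(0.0))
  // Merging them trips the require and yields a Failure.
  val outcome = Try(narrow.merge(wide))
}
```

In this sketch the merge fails as soon as two non-empty summarizers disagree on length, which is why a dataset cannot combine short dense vectors with longer sparse ones.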

How was this patch tested?

Unit tests

SparkQA · 2017-02-03T07:13:37Z

Test build #72302 has started for PR 16784 at commit fc1f7d1.

wangmiao1981 · 2017-02-03T17:37:51Z

Jenkins, retest this please.

SparkQA · 2017-02-03T17:54:56Z

Test build #72312 has finished for PR 16784 at commit fc1f7d1.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2017-02-03T18:31:41Z

Jenkins, retest this please.

SparkQA · 2017-02-03T19:33:55Z

Test build #72314 has finished for PR 16784 at commit fc1f7d1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley

@wangmiao1981 Thanks for the patch! I don't think that we need to replicate all tests using dense and sparse vectors. The main concern is making sure that sparse and dense vectors produce the same behavior.

Is this multivariate online summarizer issue really a bug? Or is it from passing in sparse vectors which are 10x longer than the dense vectors?

I'd recommend:

1. Change the dense and sparse datasets so that they match:
   - create the sparse datasets
   - convert the sparse ones to dense using Vector.toDense
2. Create a test suite which trains on the 2 datasets and confirms that the models produced are exactly the same.
3. Revert the other test changes. Unless we can think of reasons why the aspect being tested will interact with vector sparsity, then we don't need to add tests.
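As a rough illustration of the first recommendation, the sketch below builds a sparse vector with `org.apache.spark.ml.linalg.Vectors.sparse` and converts it with `toDense`, so both representations hold the same values and have the same length. The object name `SparseDenseSketch`, the `margin` helper, and the sample values are assumptions made for this example; the model-training comparison itself is not shown.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

object SparseDenseSketch {
  // Sparse vector of length 4 with two stored entries (indices 1 and 3).
  val sparse: Vector = Vectors.sparse(4, Array(1, 3), Array(2.0, 4.0))

  // Same values and same length, stored densely.
  val dense: Vector = sparse.toDense

  // Hypothetical helper: margin w.x + b computed from the array view,
  // so the result does not depend on how x is stored.
  def margin(weights: Array[Double], intercept: Double, x: Vector): Double =
    x.toArray.zip(weights).map { case (xi, wi) => xi * wi }.sum + intercept
}
```

Because the two vectors agree element by element, any quantity derived from their values should match as well, which is the property the recommended test suite would check at the model level.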

jkbradley · 2017-02-27T18:38:07Z

mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala

This API is strange: the caller expects numFeatures = weights.size, but in fact numFeatures = 10 * weights.size if isDense=false. Please update it to construct a random dense or sparse vector first (both of length weights.size) and then compute y, to make the API more consistent.
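One way to read this suggestion is sketched below in plain Scala: draw a feature array of length `weights.length` first, then derive the label from it. The names `Point` and `SvmInputSketch.generate`, the uniform range, and the noise scale are assumptions for illustration and are not taken from the suite's actual helper.

```scala
import scala.util.Random

// Hypothetical row type; a real test suite would use Spark's own labeled row type.
final case class Point(label: Double, features: Array[Double])

object SvmInputSketch {
  def generate(intercept: Double, weights: Array[Double], nPoints: Int, seed: Int): Seq[Point] = {
    val rnd = new Random(seed)
    Seq.fill(nPoints) {
      // Features always have exactly weights.length entries, uniform in [-1, 1).
      val x = Array.fill(weights.length)(rnd.nextDouble() * 2.0 - 1.0)
      // Label comes from the sign of w.x + intercept plus a little Gaussian noise.
      val margin = x.zip(weights).map { case (xi, wi) => xi * wi }.sum +
        intercept + 0.01 * rnd.nextGaussian()
      Point(if (margin > 0) 1.0 else 0.0, x)
    }
  }
}
```

With this shape the number of features is always `weights.length`, and the choice of dense or sparse storage can be applied afterwards without changing the data.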

jkbradley · 2017-02-27T18:38:10Z

mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala

Move inside if-then to branch where it is used.

jkbradley · 2017-02-27T18:38:14Z

mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala

No need for this. Once the model has been fit, its training data is irrelevant.

wangmiao1981 · 2017-02-28T19:32:42Z

@jkbradley I simplified the tests and changed the data generation API to use the toSparse method, which eliminates the index variable.
"Is this multivariate online summarizer issue really a bug? Or is it from passing in sparse vectors which are 10x longer than the dense vectors?"
Based on my understanding, it is not a bug. It happens because the sparse vectors are longer than the dense vectors, which means the two can't be mixed in the same dataset.

SparkQA · 2017-02-28T20:20:42Z

Test build #73605 has finished for PR 16784 at commit 2422f08.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-28T20:25:14Z

Test build #73606 has finished for PR 16784 at commit 8ff2709.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley

Thanks for the simplification! A few comments.

jkbradley · 2017-03-01T00:03:13Z

mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala

    assert(model.transform(smallValidationDataset)
      .where("prediction=label").count() > nPoints * 0.8)
+    val sparseModel = svm.fit(smallSparseBinaryDataset)
+    assert(sparseModel.transform(smallSparseValidationDataset)


No need to do transform for this model; calling checkModels should suffice.

jkbradley · 2017-03-01T00:03:18Z

mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala

    assert(model.transform(smallValidationDataset)
      .where("prediction=label").count() > nPoints * 0.8)
+    val sparseModel = svm.fit(smallSparseBinaryDataset)
+    assert(sparseModel.transform(smallSparseValidationDataset)


jkbradley · 2017-03-01T00:03:23Z

mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala

    binaryDataset = generateSVMInput(1.0, Array[Double](1.0, 2.0, 3.0, 4.0), 10000, 42).toDF()
+
+    // Dataset for testing SparseVector
+    smallSparseBinaryDataset = generateSVMInput(A, Array[Double](B, C), nPoints, 42, false).toDF()


Why call generateSVMInput again? It seems brittle. I'd prefer to use a UDF to convert the vectors to sparse vectors here.
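A minimal sketch of the UDF approach, assuming the dataset has a vector column named "features": wrap `Vector.toSparse` in a Spark SQL UDF and overwrite the column. The object name `ToSparseSketch` and the method `withSparseFeatures` are hypothetical and are not the code that was eventually merged.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object ToSparseSketch {
  // UDF that re-encodes each feature vector in sparse form without changing its values.
  private val toSparse = udf((v: Vector) => v.toSparse: Vector)

  // Returns a copy of df whose "features" column (an assumed column name) is sparse.
  def withSparseFeatures(df: DataFrame): DataFrame =
    df.withColumn("features", toSparse(col("features")))
}
```

Deriving the sparse dataset from the dense one this way keeps labels and feature values identical by construction, instead of relying on a second generator call with matching arguments.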

SparkQA · 2017-03-01T02:05:45Z

Test build #73630 has finished for PR 16784 at commit 911f3ba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-01T02:07:49Z

Test build #73632 has finished for PR 16784 at commit 2a012fa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-03-06T18:31:15Z

Thanks for the updates. This LGTM pending the conflict resolution. Sorry for the delay!

wangmiao1981 · 2017-03-06T19:56:05Z

Resolved. Thanks!

SparkQA · 2017-03-06T20:48:05Z

Test build #74033 has finished for PR 16784 at commit 1f98622.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class FPGrowth @Since("2.2.0") (
class FPGrowthModelWriter(instance: FPGrowthModel) extends MLWriter
case class ResolveInlineTables(conf: CatalystConf) extends Rule[LogicalPlan]
abstract class CSVDataSource extends Serializable

jkbradley · 2017-03-06T21:08:18Z

LGTM
Thanks @wangmiao1981 !
Merging with master

jkbradley reviewed Feb 27, 2017

wangmiao1981 added 3 commits February 28, 2017 10:28

unit test backup

2b55a4c

add SparseVector test

b9280c1

address review comments

2422f08

wangmiao1981 force-pushed the bk branch from fc1f7d1 to 2422f08 Compare February 28, 2017 19:25

revert one test

8ff2709

jkbradley reviewed Mar 1, 2017

wangmiao1981 added 2 commits February 28, 2017 17:05

simplify data generation

911f3ba

remove blank lines

2a012fa

Merge branch 'master' into bk

1f98622

asfgit closed this in 9265436 Mar 6, 2017
