Skip to content

Conversation

wangmiao1981
Copy link
Contributor

What changes were proposed in this pull request?

Add unit tests for testing SparseVector.

We can't add mixed DenseVector and SparseVector test case, as discussed in JIRA 19382.

def merge(other: MultivariateOnlineSummarizer): this.type = {
if (this.totalWeightSum != 0.0 && other.totalWeightSum != 0.0) {
require(n == other.n, s"Dimensions mismatch when merging with another summarizer. " +
s"Expecting $n but got $
{other.n}

.")

How was this patch tested?

Unit tests

@SparkQA
Copy link

SparkQA commented Feb 3, 2017

Test build #72302 has started for PR 16784 at commit fc1f7d1.

@wangmiao1981
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Feb 3, 2017

Test build #72312 has finished for PR 16784 at commit fc1f7d1.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Feb 3, 2017

Test build #72314 has finished for PR 16784 at commit fc1f7d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangmiao1981 Thanks for the patch! I don't think that we need to replicate all tests using dense and sparse vectors. The main concern is making sure that sparse and dense vectors produce the same behavior.

Is this multivariate online summarizer issue really a bug? Or is it from passing in sparse vectors which are 10x longer than the dense vectors?

I'd recommend:

  • Change the dense and sparse datasets so that they match:
    • create the sparse datasets
    • convert the sparse ones to dense using Vector.toDense
  • Create a test suite which trains on the 2 datasets and confirms that the models produced are exactly the same.
  • Revert the other test changes. Unless we can think of reasons why the aspect being tested will interact with vector sparsity, then we don't need to add tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is strange, where the caller expects numFeatures = weights.size, but really numFeatures = 10 * weights.size if isDense=false. Please update it to construct a random dense or sparse vector first (both of length weights.size) and then compute y to make the API more consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move inside if-then to branch where it is used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this. Once the model has been fit, its training data is irrelevant.

@wangmiao1981
Copy link
Contributor Author

@jkbradley I simplified the tests and modified the data generation API by using toSparse method, which eliminates the index variable.
"Is this multivariate online summarizer issue really a bug? Or is it from passing in sparse vectors which are 10x longer than the dense vectors?"
It is not a bug based on my understanding. It is because the sparse size is larger than the dense vector size, which means we can't mix using them together.

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73605 has finished for PR 16784 at commit 2422f08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73606 has finished for PR 16784 at commit 8ff2709.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the simplification! A few comments.

assert(model.transform(smallValidationDataset)
.where("prediction=label").count() > nPoints * 0.8)
val sparseModel = svm.fit(smallSparseBinaryDataset)
assert(sparseModel.transform(smallSparseValidationDataset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to do transform for this model; calling checkModels should suffice.

assert(model.transform(smallValidationDataset)
.where("prediction=label").count() > nPoints * 0.8)
val sparseModel = svm.fit(smallSparseBinaryDataset)
assert(sparseModel.transform(smallSparseValidationDataset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

binaryDataset = generateSVMInput(1.0, Array[Double](1.0, 2.0, 3.0, 4.0), 10000, 42).toDF()

// Dataset for testing SparseVector
smallSparseBinaryDataset = generateSVMInput(A, Array[Double](B, C), nPoints, 42, false).toDF()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why call generateSVMInput again? It seems brittle. I'd prefer to use a UDF to convert the vectors to sparse vectors here.

@SparkQA
Copy link

SparkQA commented Mar 1, 2017

Test build #73630 has finished for PR 16784 at commit 911f3ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 1, 2017

Test build #73632 has finished for PR 16784 at commit 2a012fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Thanks for the updates. This LGTM pending the conflict resolution. Sorry for the delay!

@wangmiao1981
Copy link
Contributor Author

Resolved. Thanks!

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #74033 has finished for PR 16784 at commit 1f98622.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FPGrowth @Since(\"2.2.0\") (
  • class FPGrowthModelWriter(instance: FPGrowthModel) extends MLWriter
  • case class ResolveInlineTables(conf: CatalystConf) extends Rule[LogicalPlan]
  • abstract class CSVDataSource extends Serializable

@jkbradley
Copy link
Member

LGTM
Thanks @wangmiao1981 !
Merging with master

@asfgit asfgit closed this in 9265436 Mar 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants