[SPARK-19382][ML]:Test sparse vectors in LinearSVCSuite #16784
Test build #72302 has started for PR 16784 at commit
Jenkins, retest this please.
Test build #72312 has finished for PR 16784 at commit
Jenkins, retest this please.
Test build #72314 has finished for PR 16784 at commit
@wangmiao1981 Thanks for the patch! I don't think that we need to replicate all tests using dense and sparse vectors. The main concern is making sure that sparse and dense vectors produce the same behavior.
Is this multivariate online summarizer issue really a bug? Or is it from passing in sparse vectors which are 10x longer than the dense vectors?
I'd recommend:
- Change the dense and sparse datasets so that they match:
  - create the sparse datasets
  - convert the sparse ones to dense using Vector.toDense
- Create a test suite which trains on the 2 datasets and confirms that the models produced are exactly the same.
- Revert the other test changes. Unless we can think of reasons why the aspect being tested will interact with vector sparsity, we don't need to add tests.
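The first recommendation can be sketched at the vector level, assuming Spark's `org.apache.spark.ml.linalg` package is on the classpath. The object name and the feature values below are made up for illustration and are not taken from the patch; the point is only that deriving the dense data from the sparse data with `toDense` keeps the two datasets in lockstep.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Illustrative sketch only: names and values are hypothetical.
object MatchingDatasetsSketch {
  // Hypothetical sparse feature vectors of length 4 (size, indices, values).
  val sparseFeatures: Seq[Vector] = Seq(
    Vectors.sparse(4, Array(0, 2), Array(1.0, -0.5)),
    Vectors.sparse(4, Array(1), Array(2.0)),
    Vectors.sparse(4, Array(0, 3), Array(0.25, 3.0))
  )

  // Derive the dense dataset from the sparse one so both hold the same values.
  val denseFeatures: Seq[Vector] = sparseFeatures.map(_.toDense)

  def main(args: Array[String]): Unit = {
    sparseFeatures.zip(denseFeatures).foreach { case (s, d) =>
      // Same length and same values; only the storage format differs.
      assert(s.size == d.size)
      assert(s.toArray.sameElements(d.toArray))
    }
  }
}
```

Because the dense vectors are computed from the sparse ones, any difference between models trained on the two datasets would point at sparsity handling rather than at the data.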
This API is strange, where the caller expects numFeatures = weights.size, but really numFeatures = 10 * weights.size if isDense=false. Please update it to construct a random dense or sparse vector first (both of length weights.size) and then compute y to make the API more consistent.
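One way to read this request is sketched below. `generateInput` and its `isDense` parameter are hypothetical stand-ins, not the signature used in LinearSVCSuite, and the noise model is an assumption; the sketch only shows the ordering asked for: build a vector of length `weights.length` first, in either representation, and compute the label from that vector afterwards.

```scala
import scala.util.Random
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Illustrative sketch only: a hypothetical generator, not the suite's helper.
object SvmInputSketch {
  def generateInput(
      intercept: Double,
      weights: Array[Double],
      nPoints: Int,
      seed: Int,
      isDense: Boolean = true): Seq[(Double, Vector)] = {
    val rnd = new Random(seed)
    Seq.fill(nPoints) {
      // Draw the features first; the vector always has length weights.length.
      val dense = Vectors.dense(Array.fill(weights.length)(rnd.nextDouble() * 2.0 - 1.0))
      val x: Vector = if (isDense) dense else dense.toSparse
      // Then compute the label from the same vector, whatever its representation.
      val margin = x.toArray.zip(weights).map { case (xi, wi) => xi * wi }.sum +
        intercept + 0.01 * rnd.nextGaussian()
      (if (margin > 0) 1.0 else 0.0, x)
    }
  }
}
```

With this shape, `numFeatures` equals `weights.length` in both cases, and the same seed yields the same labels and feature values whether or not the vectors are stored sparsely.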
Move inside if-then to branch where it is used.
No need for this. Once the model has been fit, its training data is irrelevant.
@jkbradley I simplified the tests and modified the data generation API by using the toSparse method, which eliminates the index variable.
Test build #73605 has finished for PR 16784 at commit
Test build #73606 has finished for PR 16784 at commit
Thanks for the simplification! A few comments.
assert(model.transform(smallValidationDataset)
  .where("prediction=label").count() > nPoints * 0.8)
val sparseModel = svm.fit(smallSparseBinaryDataset)
assert(sparseModel.transform(smallSparseValidationDataset)
No need to do transform for this model; calling checkModels should suffice.
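A minimal sketch of what such a comparison could look like, assuming a Spark version that ships `LinearSVC` (2.2 or later) and a local `SparkSession`. The body of `checkModels` here is an assumption about the helper's shape, and the tiny dataset is invented for illustration; exact equality is what the review asks for, though it is not guaranteed to hold on every Spark version or cluster layout.

```scala
import org.apache.spark.ml.classification.{LinearSVC, LinearSVCModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Illustrative sketch only: helper body and data are assumptions.
object CheckModelsSketch {
  // Two models are treated as matching when their parameters are identical.
  def checkModels(m1: LinearSVCModel, m2: LinearSVCModel): Unit = {
    assert(m1.intercept == m2.intercept)
    assert(m1.coefficients.equals(m2.coefficients))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("check-models-sketch").getOrCreate()
    import spark.implicits._

    // The same four points, stored once as dense and once as sparse vectors.
    val points = Seq(
      (1.0, Vectors.dense(1.0, 0.5)),
      (1.0, Vectors.dense(0.8, 0.0)),
      (0.0, Vectors.dense(-1.0, 0.0)),
      (0.0, Vectors.dense(-0.6, -0.4))
    )
    val denseDf = points.toDF("label", "features")
    val sparseDf = points.map { case (y, x) => (y, x.toSparse: org.apache.spark.ml.linalg.Vector) }
      .toDF("label", "features")

    val svm = new LinearSVC().setMaxIter(10)
    // Comparing the fitted parameters makes a second transform() call unnecessary.
    checkModels(svm.fit(denseDf), svm.fit(sparseDf))

    spark.stop()
  }
}
```

Comparing intercept and coefficients directly checks the fitted model itself, so a separate accuracy assertion on the sparse model adds nothing once the dense model has been validated.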
assert(model.transform(smallValidationDataset)
  .where("prediction=label").count() > nPoints * 0.8)
val sparseModel = svm.fit(smallSparseBinaryDataset)
assert(sparseModel.transform(smallSparseValidationDataset)
same here
binaryDataset = generateSVMInput(1.0, Array[Double](1.0, 2.0, 3.0, 4.0), 10000, 42).toDF()

// Dataset for testing SparseVector
smallSparseBinaryDataset = generateSVMInput(A, Array[Double](B, C), nPoints, 42, false).toDF()
Why call generateSVMInput again? It seems brittle. I'd prefer to use a UDF to convert the vectors to sparse vectors here.
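The UDF approach suggested here might look roughly like the following sketch, assuming a local `SparkSession`. The DataFrame contents and the names `denseDf`, `sparseDf`, and `toSparse` are placeholders for illustration; in the suite the dense DataFrame would be the one already built by `generateSVMInput`.

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Illustrative sketch only: data and names are placeholders.
object ToSparseUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("to-sparse-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the dense dataset that is generated only once.
    val denseDf = Seq(
      (1.0, Vectors.dense(1.0, 0.0, 3.0)),
      (0.0, Vectors.dense(0.0, -2.0, 0.0))
    ).toDF("label", "features")

    // Convert each feature vector to its sparse representation.
    val toSparseFn: Vector => Vector = _.toSparse
    val toSparse = udf(toSparseFn)
    val sparseDf = denseDf.withColumn("features", toSparse(col("features")))

    // Every row now holds a SparseVector with the same values as before.
    val denseRows = denseDf.collect().map(_.getAs[Vector]("features"))
    val sparseRows = sparseDf.collect().map(_.getAs[Vector]("features"))
    assert(sparseRows.forall(_.isInstanceOf[SparseVector]))
    assert(denseRows.zip(sparseRows).forall { case (d, s) => d.toArray.sameElements(s.toArray) })

    spark.stop()
  }
}
```

Deriving the sparse DataFrame from the dense one avoids a second generator call, so the two datasets cannot drift apart if the generator's arguments change.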
Test build #73630 has finished for PR 16784 at commit
Test build #73632 has finished for PR 16784 at commit
Thanks for the updates. This LGTM pending the conflict resolution. Sorry for the delay!
Resolved. Thanks!
Test build #74033 has finished for PR 16784 at commit
LGTM
What changes were proposed in this pull request?
Add unit tests for testing SparseVector.
We can't add a test case that mixes DenseVector and SparseVector, as discussed in JIRA 19382.
def merge(other: MultivariateOnlineSummarizer): this.type = {
  if (this.totalWeightSum != 0.0 && other.totalWeightSum != 0.0) {
    require(n == other.n, s"Dimensions mismatch when merging with another summarizer. " +
      s"Expecting $n but got ${other.n}.")
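The effect of this check can be reproduced in isolation with the sketch below, assuming the `spark-mllib` artifact is on the classpath. The vector lengths 2 and 20 are chosen only to mirror the 10x length difference mentioned in the review; they are not taken from the patch.

```scala
import scala.util.Try
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// Illustrative sketch only: sizes and values are made up.
object MergeDimensionSketch {
  def main(args: Array[String]): Unit = {
    // One summarizer has seen a length-2 dense vector, the other a length-20 sparse vector.
    val denseSummary = new MultivariateOnlineSummarizer().add(Vectors.dense(1.0, 2.0))
    val sparseSummary = new MultivariateOnlineSummarizer()
      .add(Vectors.sparse(20, Array(3), Array(1.0)))

    // merge() hits the require(n == other.n, ...) shown above and throws.
    val merged = Try(denseSummary.merge(sparseSummary))
    assert(merged.isFailure)
    assert(merged.failed.get.isInstanceOf[IllegalArgumentException])

    // Summarizers built from vectors of equal length merge without error.
    val sameSize = new MultivariateOnlineSummarizer()
      .add(Vectors.sparse(2, Array(0), Array(3.0)))
    assert(Try(denseSummary.merge(sameSize)).isSuccess)
  }
}
```

The failure depends on the vectors' lengths rather than on their storage format, which is why the dense and sparse datasets need matching dimensions.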
How was this patch tested?
Unit tests