-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-11050][MLLIB] PySpark SparseVector can return wrong index in e… #9069
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…rror message For negative indices in the SparseVector, we update the index value. If we have an incorrect index at this point, the error message has the incorrect *updated* index instead of the original one. This change contains the fix for the same.
|
I've been seeing a lot of bugfix patches for the PySpark SparseVector recently. This suggests to me that we need to write significantly more tests for this component. |
|
@jkbradley for this. If there is a task for adding more tests, I can take that up as well. |
|
It'd be really fun to try using Hypothesis to write some property-based tests for this: https://hypothesis.readthedocs.org/en/master/ |
|
@JoshRosen I think it's that PySpark has less coverage than Scala, and our linear algebra code could use some more coverage too. And PySpark SparseVector is in the intersection of these problems. (Though 4 of those bug fix patches may have been patch + 3 backports.) Hypothesis looks cool...future work? : ) @bhargav There is not a specific task for adding more tests, but since you're interested, it'd be awesome if you could check through some of the PySpark Vector and Matrix APIs and unit tests and see if you can find missing coverage. Note also that the tests are split between doc tests (in each .py file) and unit tests (in tests.py files). If you find missing items, can you please make one or more JIRAs? If you're unsure about any, I'd recommend making a single JIRA and listing there; we can then create subtasks for each major issue. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to move this check to line 768, before index gets modified. That way, you don't have to even create a copy of index (though you'll need to adjust the check's index range).
|
Test build #1882 has finished for PR 9069 at commit
|
|
Updated the PR. Though I now see that the branch name is different from the JIRA issue number. |
|
@jkbradley Gentle ping. :) |
|
@bhargav Thanks for the update! LGTM pending tests |
|
Btw, it doesn't matter what you call your branch name; that's not a problem. |
|
Test build #1916 has finished for PR 9069 at commit
|
|
Merging with master |
…rror message
For negative indices in the SparseVector, we update the index value. If we have an incorrect index
at this point, the error message has the incorrect updated index instead of the original one. This
change contains the fix for the same.