[SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths #4183
|
Test build #26029 has started for PR 4183 at commit
|
|
Test build #26029 has finished for PR 4183 at commit
|
|
Test PASSed. |
|
I agree that vectors must have the same length and we should check it. It may not be necessary to change the implementation. I saw a couple of performance issues in your code, for example, unnecessary index lookups. I would suggest only adding the check in this PR. If you want to update the implementation, let's do it in another PR with a micro-benchmark. |
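For illustration, the size check suggested above could look roughly like the following. This is a minimal sketch on plain arrays, not the code merged in this PR; the object name `SqdistCheck` and the error message wording are assumptions made for the example.

```scala
// Hypothetical stand-in: plain dense arrays instead of MLlib vectors.
object SqdistCheck {
  /** Squared Euclidean distance that rejects inputs of different lengths. */
  def sqdist(v1: Array[Double], v2: Array[Double]): Double = {
    // Fail fast instead of silently returning a result for mismatched sizes.
    require(v1.length == v2.length,
      s"Vector dimensions do not match: v1.length=${v1.length}, v2.length=${v2.length}")
    var sum = 0.0
    var i = 0
    while (i < v1.length) {
      val d = v1(i) - v2(i)
      sum += d * d
      i += 1
    }
    sum
  }
}
```

With a check like this, a call with arrays of different lengths throws an `IllegalArgumentException` rather than producing a value that depends on which representation was used.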
|
Thanks for the reply, I'll limit the scope. Since the size equality constraint is pretty straightforward, it seems no additional unit test is required. The main purpose of the code refactoring was to unify the scattered logic.
|
|
Test build #26039 has started for PR 4183 at commit
|
|
Test build #26039 has finished for PR 4183 at commit
|
|
Test PASSed. |
|
LGTM. Merged into master. For the performance issue, let's open another PR for it. I can see at least the condition "v1.indices.length / v1.size < 0.5" is wrong, because the left-hand side returns an integer. Please let me know whether you want to continue working on this. Thanks! |
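To make the integer-division point concrete, here is a small sketch. The object and method names are hypothetical; `nnz` stands in for `v1.indices.length` and `size` for `v1.size`.

```scala
object SparsityRatio {
  // Int / Int truncates in Scala, so this is 0 whenever 0 <= nnz < size.
  def truncatedRatio(nnz: Int, size: Int): Int = nnz / size

  // Converting one operand to Double first gives the intended fraction.
  def ratio(nnz: Int, size: Int): Double = nnz.toDouble / size
}
```

For a vector with 3 stored entries out of 4, the truncated form yields 0, so a test of the form `truncatedRatio(3, 4) < 0.5` is true even though the vector is 75% populated. Assuming the number of stored entries never exceeds the size, the integer form is only nonzero for a fully populated vector.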
|
@mengxr Thanks for the review and the sharp eyes! Sure, I'll continue working on it. |
|
@hhbyyh Great. I created a JIRA for this issue and assigned it to you: https://issues.apache.org/jira/browse/SPARK-5419 Thanks!! |
JIRA issue: https://issues.apache.org/jira/browse/SPARK-5384
Currently, Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths; please refer to the JIRA for a sample.
PR scope:
Unify the sqdist logic for dense/sparse vectors and fix the inconsistency. Also remove the possible sparse-to-dense conversion in the original code.
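As a rough illustration of computing the distance without a sparse-to-dense conversion, the sketch below merges two sorted index arrays with two cursors. It is not the code from this PR: the `Sparse` case class is a hypothetical minimal stand-in for a sparse vector, and it assumes indices are strictly increasing.

```scala
object SparseSqdist {
  /** Hypothetical minimal sparse vector: strictly increasing indices with matching values. */
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

  /** Squared distance via a two-cursor merge; neither input is densified. */
  def sqdist(a: Sparse, b: Sparse): Double = {
    require(a.size == b.size,
      s"Vector dimensions do not match: a.size=${a.size}, b.size=${b.size}")
    val na = a.indices.length
    val nb = b.indices.length
    var sum = 0.0
    var i = 0
    var j = 0
    while (i < na || j < nb) {
      val d =
        if (j >= nb || (i < na && a.indices(i) < b.indices(j))) {
          // Index present only in a: the other coordinate is an implicit zero.
          val v = a.values(i); i += 1; v
        } else if (i >= na || b.indices(j) < a.indices(i)) {
          // Index present only in b (sign is irrelevant once squared).
          val v = b.values(j); j += 1; v
        } else {
          // Same index stored in both vectors.
          val v = a.values(i) - b.values(j); i += 1; j += 1; v
        }
      sum += d * d
    }
    sum
  }
}
```

The loop visits each stored entry once, so the cost is proportional to the number of stored entries rather than to the vector size.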
For reviewers:
Maybe we should first discuss what the correct behavior is.
I'll update the PR with more optimizations and additional unit tests afterwards. Thanks.