Skip to content

Conversation

@hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Jan 23, 2015

JIRA issue: https://issues.apache.org/jira/browse/SPARK-5384
Currently Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths, please refer to JIRA for sample

PR scope:
Unify the sqdist logic for dense/sparse vectors and fix the inconsistency, also remove the possible sparse to dense conversion in the original code.

For reviewers:
Maybe we should first discuss what's the correct behavior.

  1. Vectors for sqdist must have the same length, like in breeze?
  2. If they can have different lengths, what's the correct result for sqdist? (should the extra part get into calculation?)

I'll update PR with more optimization and additional ut afterwards. Thanks.

@SparkQA
Copy link

SparkQA commented Jan 23, 2015

Test build #26029 has started for PR 4183 at commit 54cbf97.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 23, 2015

Test build #26029 has finished for PR 4183 at commit 54cbf97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26029/
Test PASSed.

@mengxr
Copy link
Contributor

mengxr commented Jan 23, 2015

I agree that vectors must have the same length and we should check it. It may not be necessary to change the implementation. I saw couple performance issues in your code, for example, unnecessary index lookups. I would suggest only adding the check in this PR. If you want to update the implementation, let's do it in another PR with micro-benchmark.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Jan 24, 2015

Thank for the reply, I‘ll limit the scope. And since the size equality constraint will be pretty straight-forward, it seems no additional ut is required.

The main purpose for the code refactoring was to unify the scattered logic.
Performance wise, the only concern for the original code was the extra space consumption for,

  1. the invocation of SparseVector.toArray when v1.indices.length / v1.size >= 0.5,
  2. the zip action in (dense, dense) computation.
    I'll hold the refactoring until it's found necessary.

@SparkQA
Copy link

SparkQA commented Jan 24, 2015

Test build #26039 has started for PR 4183 at commit 1f17328.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 24, 2015

Test build #26039 has finished for PR 4183 at commit 1f17328.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26039/
Test PASSed.

@hhbyyh hhbyyh changed the title [SPARK-5384][mllib] Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths [SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths Jan 26, 2015
@asfgit asfgit closed this in 8125168 Jan 26, 2015
@mengxr
Copy link
Contributor

mengxr commented Jan 26, 2015

LGTM. Merged into master. For the performance issue, let's open another PR for it. I can see at least the condition "v1.indices.length / v1.size < 0.5" is wrong, because the left-hand side returns an integer. Please let me know whether you want to continue working on this. Thanks!

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Jan 26, 2015

@mengxr Thanks for reviewing and sharp eyes!

Sure, I'll continue to work on it.
Just to setup the expectation, as I've mentioned, performance wise, it seems to me that the improvement will be limited. The gain from the refactoring will mostly be code unification and performance improvement under some circumstances. If that's OK, I'll present the PR with some benchmark.

@hhbyyh hhbyyh deleted the fixDouble branch January 26, 2015 14:31
@mengxr
Copy link
Contributor

mengxr commented Jan 27, 2015

@hhbyyh Great. I created a JIRA for this issue and assigned it to you: https://issues.apache.org/jira/browse/SPARK-5419 Thanks!!

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Jan 27, 2015

Thanks @mengxr

Also I saw a PR from @viirya which seems is good enough for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants