[SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths #4183

hhbyyh · 2015-01-23T16:31:53Z

JIRA issue: https://issues.apache.org/jira/browse/SPARK-5384
Currently Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths, please refer to JIRA for sample

PR scope:
Unify the sqdist logic for dense/sparse vectors and fix the inconsistency, also remove the possible sparse to dense conversion in the original code.

For reviewers:
Maybe we should first discuss what's the correct behavior.

Vectors for sqdist must have the same length, like in breeze?
If they can have different lengths, what's the correct result for sqdist? (should the extra part get into calculation?)

I'll update PR with more optimization and additional ut afterwards. Thanks.

SparkQA · 2015-01-23T16:32:45Z

Test build #26029 has started for PR 4183 at commit 54cbf97.

This patch merges cleanly.

SparkQA · 2015-01-23T17:43:01Z

Test build #26029 has finished for PR 4183 at commit 54cbf97.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-23T17:43:04Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26029/
Test PASSed.

mengxr · 2015-01-23T22:18:48Z

I agree that vectors must have the same length and we should check it. It may not be necessary to change the implementation. I saw couple performance issues in your code, for example, unnecessary index lookups. I would suggest only adding the check in this PR. If you want to update the implementation, let's do it in another PR with micro-benchmark.

hhbyyh · 2015-01-24T05:55:31Z

Thank for the reply, I‘ll limit the scope. And since the size equality constraint will be pretty straight-forward, it seems no additional ut is required.

The main purpose for the code refactoring was to unify the scattered logic.
Performance wise, the only concern for the original code was the extra space consumption for,

the invocation of SparseVector.toArray when v1.indices.length / v1.size >= 0.5,
the zip action in (dense, dense) computation.
I'll hold the refactoring until it's found necessary.

SparkQA · 2015-01-24T05:57:43Z

Test build #26039 has started for PR 4183 at commit 1f17328.

This patch merges cleanly.

SparkQA · 2015-01-24T07:06:21Z

Test build #26039 has finished for PR 4183 at commit 1f17328.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-24T07:06:25Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26039/
Test PASSed.

mengxr · 2015-01-26T06:22:32Z

LGTM. Merged into master. For the performance issue, let's open another PR for it. I can see at least the condition "v1.indices.length / v1.size < 0.5" is wrong, because the left-hand side returns an integer. Please let me know whether you want to continue working on this. Thanks!

hhbyyh · 2015-01-26T08:23:36Z

@mengxr Thanks for reviewing and sharp eyes!

Sure, I'll continue to work on it.
Just to setup the expectation, as I've mentioned, performance wise, it seems to me that the improvement will be limited. The gain from the refactoring will mostly be code unification and performance improvement under some circumstances. If that's OK, I'll present the PR with some benchmark.

mengxr · 2015-01-27T02:20:29Z

@hhbyyh Great. I created a JIRA for this issue and assigned it to you: https://issues.apache.org/jira/browse/SPARK-5419 Thanks!!

hhbyyh · 2015-01-27T04:50:47Z

Thanks @mengxr

Also I saw a PR from @viirya which seems is good enough for now.

hhbyyh added 2 commits January 25, 2015 00:03

fix Vectors.sqdist inconsistence

54cbf97

limit PR scope to size constraints only

1f17328

hhbyyh changed the title ~~[SPARK-5384][mllib] Vectors.sqdist return inconsistent result for sparse/dense vectors when the vectors have different lengths~~ [SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths Jan 26, 2015

asfgit closed this in 8125168 Jan 26, 2015

hhbyyh deleted the fixDouble branch January 26, 2015 14:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths #4183

[SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths #4183

Uh oh!

hhbyyh commented Jan 23, 2015

Uh oh!

SparkQA commented Jan 23, 2015

Uh oh!

SparkQA commented Jan 23, 2015

Uh oh!

AmplabJenkins commented Jan 23, 2015

Uh oh!

mengxr commented Jan 23, 2015

Uh oh!

hhbyyh commented Jan 24, 2015

Uh oh!

SparkQA commented Jan 24, 2015

Uh oh!

SparkQA commented Jan 24, 2015

Uh oh!

AmplabJenkins commented Jan 24, 2015

Uh oh!

mengxr commented Jan 26, 2015

Uh oh!

hhbyyh commented Jan 26, 2015

Uh oh!

mengxr commented Jan 27, 2015

Uh oh!

hhbyyh commented Jan 27, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths #4183

[SPARK-5384][mllib] Vectors.sqdist returns inconsistent results for sparse/dense vectors when the vectors have different lengths #4183

Uh oh!

Conversation

hhbyyh commented Jan 23, 2015

Uh oh!

SparkQA commented Jan 23, 2015

Uh oh!

SparkQA commented Jan 23, 2015

Uh oh!

AmplabJenkins commented Jan 23, 2015

Uh oh!

mengxr commented Jan 23, 2015

Uh oh!

hhbyyh commented Jan 24, 2015

Uh oh!

SparkQA commented Jan 24, 2015

Uh oh!

SparkQA commented Jan 24, 2015

Uh oh!

AmplabJenkins commented Jan 24, 2015

Uh oh!

mengxr commented Jan 26, 2015

Uh oh!

hhbyyh commented Jan 26, 2015

Uh oh!

mengxr commented Jan 27, 2015

Uh oh!

hhbyyh commented Jan 27, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants