Conversation

@rekhajoshm
Contributor

PySpark SparseVector should have "Found duplicate indices" error message

rekhajoshm added 5 commits May 5, 2015 16:10
Pulling functionality from apache spark
pull latest from apache spark
Pulling functionality from apache spark
Pulling functionality from apache spark
@SparkQA

SparkQA commented Nov 6, 2015

Test build #45236 has started for PR 9525 at commit 0993a5b.

@yu-iskw
Contributor

yu-iskw commented Nov 6, 2015

I think it would be natural to fix the Scala error message and fix the condition in Python. The algorithm does not check whether the indexes have duplicate values, but whether the indexes are sorted.

@srowen What do you think?

Python

Replace >= with >.

if self.indices[i] > self.indices[i + 1]:
    raise TypeError("indices array must be sorted")

Scala

Change the message.

require(prev < i, s"indices array must be sorted: $i.")

@recurse-id

@yu-iskw The indices are sorted within the code base at:
https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L510

So the only way to actually get that error is not by the user supplying unsorted indices, but the user supplying duplicate indices.
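To illustrate the point, here is a minimal, hypothetical sketch (not the actual PySpark source; `make_sparse` is a made-up name) of constructor logic that sorts index/value pairs before validating. Once the pairs are sorted, unsorted user input is repaired automatically, so only duplicate indices can still fail the `>=` check:

```python
def make_sparse(size, indices, values):
    # Sort (index, value) pairs by index, mirroring what the
    # SparseVector constructor does before validation.
    pairs = sorted(zip(indices, values))
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    # After sorting, indices[i] > indices[i + 1] is impossible, so the
    # >= comparison can only fire on equality, i.e. duplicate indices.
    for i in range(len(indices) - 1):
        if indices[i] >= indices[i + 1]:
            raise TypeError(
                "Indices %s and %s are not strictly increasing"
                % (indices[i], indices[i + 1]))
    return size, indices, values
```

So `make_sparse(4, [2, 0, 1], ...)` succeeds after the implicit sort, while `make_sparse(4, [0, 1, 1], ...)` raises, matching the observation above.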

@yu-iskw
Contributor

yu-iskw commented Nov 6, 2015

@urvishparikh oh, got it. Thank you for letting me know; my mistake.
If the indexes are already sorted, the condition should be ==, right?

@recurse-id

@yu-iskw Yes, but for consistency's sake (and to ensure sortedness, which is crucial for SparseVector to function) I think the >= condition is fine.

@yu-iskw
Contributor

yu-iskw commented Nov 6, 2015

All right. We should focus on changing the error message in this issue. Thank you for making it clear.

LGTM
But Jenkins is still running.

@mengxr
Contributor

mengxr commented Nov 6, 2015

@urvishparikh People need to know the indices are ordered to understand the implementation here. It would be better if you put both the i-th index and the (i+1)-th index in the error message and say that the indices are not strictly increasing.
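A hedged sketch of what this suggestion could look like as a standalone check (the function name is illustrative; the merged patch may differ in details):

```python
def check_strictly_increasing(indices):
    """Raise TypeError if indices are not strictly increasing,
    reporting both offending neighbors as suggested above."""
    for i in range(len(indices) - 1):
        if indices[i] >= indices[i + 1]:
            raise TypeError(
                "Indices %s and %s are not strictly increasing"
                % (indices[i], indices[i + 1]))
```

Reporting both the i-th and (i+1)-th index lets the user see at a glance whether the problem is a duplicate (equal values) or an ordering bug.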

@recurse-id

Yes, I agree @mengxr, that much is definitely true. I just thought that an error message saying the user supplied duplicate indices is more direct. If a novice developer created a sparse vector and got back an error saying "the indices are not strictly increasing", he/she might instinctively drop duplicates and sort the indices before supplying them to the SparseVector constructor, which is not the behavior we want to induce (since the indices are sorted within the constructor).

@jkbradley
Member

@rekhajoshm +1 for the suggestion from @mengxr Could you please update this accordingly? Thanks! (Or please comment if you don't have time.)

@urvishparikh There is a bit of inconsistency / lack of clarity currently about which vector construction methods sort vs. not, especially comparing Scala vs. Python. Let's clean that up in separate issues.

@rekhajoshm
Contributor Author

Thanks @jkbradley, I might have missed it or thought it was still under discussion. Updated, thanks.

@SparkQA

SparkQA commented Jan 6, 2016

Test build #48811 has finished for PR 9525 at commit 922028b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2016

Test build #48829 has finished for PR 9525 at commit 6f0da9f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2016

Test build #48830 has finished for PR 9525 at commit 3ec5cff.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2016

Test build #48831 has finished for PR 9525 at commit e087c6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM. Merging with master. Thanks!

@asfgit asfgit closed this in 007da1a Jan 6, 2016
@rekhajoshm rekhajoshm deleted the SPARK-11531 branch June 21, 2018 06:13