Conversation

@rekhajoshm
Contributor

PySpark SparseVector should have "Found duplicate indices" error message

rekhajoshm added 5 commits May 5, 2015 16:10
Pulling functionality from apache spark
pull latest from apache spark
Pulling functionality from apache spark
Pulling functionality from apache spark
@SparkQA

SparkQA commented Nov 6, 2015

Test build #45236 has started for PR 9525 at commit 0993a5b.

@yu-iskw
Contributor

yu-iskw commented Nov 6, 2015

I think it would be natural to fix the Scala error message and fix the condition in Python. The algorithm does not check whether the indexes have duplicate values, but whether the indexes are sorted.

@srowen What do you think?

Python

Replace >= with >.

if self.indices[i] > self.indices[i + 1]:
    raise TypeError("indices array must be sorted")

Scala

Change the message.

require(prev < i, s"indices array must be sorted: $i.")

@recurse-id

@yu-iskw The indices are sorted within the code base at:
https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L510

So the only way to actually get that error is not by the user supplying unsorted indices, but the user supplying duplicate indices.
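To illustrate the point, here is a minimal, hypothetical sketch (not the actual PySpark source; `make_sparse` is a made-up name) of constructor logic that sorts index/value pairs before validating. Once the pairs are sorted, unsorted user input is repaired automatically, so only duplicate indices can still fail the `>=` check:

```python
def make_sparse(size, indices, values):
    # Sort (index, value) pairs by index, mirroring what the
    # SparseVector constructor does before validation.
    pairs = sorted(zip(indices, values))
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    # After sorting, indices[i] > indices[i + 1] is impossible, so the
    # >= comparison can only fire on equality, i.e. duplicate indices.
    for i in range(len(indices) - 1):
        if indices[i] >= indices[i + 1]:
            raise TypeError(
                "Indices %s and %s are not strictly increasing"
                % (indices[i], indices[i + 1]))
    return size, indices, values
```

So `make_sparse(4, [2, 0, 1], ...)` succeeds after the implicit sort, while `make_sparse(4, [0, 1, 1], ...)` raises, matching the observation above.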

@yu-iskw
Contributor

yu-iskw commented Nov 6, 2015

@urvishparikh oh, got it. Thank you for letting me know; my mistake.
If the indexes are already sorted, the condition should be ==, right?

@recurse-id

@yu-iskw Yes, but for consistency's sake (and to ensure sortedness, which is crucial for SparseVector to function) I think the >= condition is fine.

@yu-iskw
Contributor

yu-iskw commented Nov 6, 2015

All right. We should focus on changing the error message in this issue. Thank you for making it clear.

LGTM
But Jenkins is still running.

@mengxr
Contributor

mengxr commented Nov 6, 2015

@urvishparikh People need to know the indices are ordered to understand the implementation here. It would be better if you put both the i-th index and the (i+1)-th index in the error message and say that the indices are not strictly increasing.
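A hedged sketch of what this suggestion could look like as a standalone check (the function name is illustrative; the merged patch may differ in details):

```python
def check_strictly_increasing(indices):
    """Raise TypeError if indices are not strictly increasing,
    reporting both offending neighbors as suggested above."""
    for i in range(len(indices) - 1):
        if indices[i] >= indices[i + 1]:
            raise TypeError(
                "Indices %s and %s are not strictly increasing"
                % (indices[i], indices[i + 1]))
```

Reporting both the i-th and (i+1)-th index lets the user see at a glance whether the problem is a duplicate (equal values) or an ordering bug.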

@recurse-id

Yes, I agree @mengxr, that much is definitely true. I just thought that an error message saying the user supplied duplicate indices is more direct. If a novice developer created a sparse vector and got back an error saying "the indices are not strictly increasing", he/she might instinctively drop duplicates and sort the indices before supplying them to the SparseVector constructor, which is not the behavior we want to induce (since the indices are sorted within the constructor).

@jkbradley
Member

@rekhajoshm +1 for the suggestion from @mengxr Could you please update this accordingly? Thanks! (Or please comment if you don't have time.)

@urvishparikh There is a bit of inconsistency / lack of clarity currently about which vector construction methods sort vs. not, especially comparing Scala vs. Python. Let's clean that up in separate issues.

@rekhajoshm
Contributor Author

Thanks @jkbradley, I might have missed it or thought it was still under discussion. Updated, thanks.

@SparkQA

SparkQA commented Jan 6, 2016

Test build #48811 has finished for PR 9525 at commit 922028b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2016

Test build #48829 has finished for PR 9525 at commit 6f0da9f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2016

Test build #48830 has finished for PR 9525 at commit 3ec5cff.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 6, 2016

Test build #48831 has finished for PR 9525 at commit e087c6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM. Merging with master. Thanks!

@asfgit asfgit closed this in 007da1a Jan 6, 2016
@rekhajoshm rekhajoshm deleted the SPARK-11531 branch June 21, 2018 06:13