[SPARK-11531] [ML] : SparseVector error Msg #9525
Pulling functionality from apache spark
pull latest from apache spark
|
Test build #45236 has started for PR 9525 at commit |
|
I think it would be natural to fix the Scala error message and fix the condition in Python. The algorithm does not check whether the indices contain duplicate values; it checks whether they are sorted. @srowen What do you think? Python: replace the condition. Scala: change the message. |
|
@yu-iskw The indices are sorted within the code base at: So the only way to actually get that error is not by the user supplying unsorted indices, but the user supplying duplicate indices. |
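To illustrate the point above, here is a minimal pure-Python sketch, not the actual PySpark source. The helper name `check_sorted_indices` and the message text are assumptions made for illustration. It sorts the (index, value) pairs first and then applies the `>=` check, so unsorted input passes and only duplicate indices can trigger the error.

```python
def check_sorted_indices(pairs):
    """Hypothetical helper (not a PySpark API): sort (index, value) pairs,
    then reject any adjacent indices that are not strictly increasing."""
    # Sorting happens first, so the order in which the caller supplied
    # the pairs does not matter.
    pairs = sorted(pairs)
    indices = [i for i, _ in pairs]
    for k in range(len(indices) - 1):
        # After sorting, indices[k] > indices[k + 1] is impossible,
        # so this condition can only be true for equal (duplicate) indices.
        if indices[k] >= indices[k + 1]:
            raise ValueError("Found duplicate indices: %d" % indices[k])
    return indices


# Unsorted input is accepted because it is sorted internally.
print(check_sorted_indices([(3, 1.0), (1, 2.0)]))

# Duplicate indices are the only input that reaches the error branch.
try:
    check_sorted_indices([(1, 1.0), (1, 2.0)])
except ValueError as e:
    print(e)
```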
|
@urvishparikh oh, got it. Thank you for letting me know. It's my fault. |
|
@yu-iskw Yes, but for consistency's sake (and to ensure sortedness, which is crucial for SparseVector to function) I think the >= condition is fine. |
|
All right. We should focus on changing the error message in this issue. Thank you for making it clear. LGTM |
|
@urvishparikh People need to know the indices are ordered to understand the implementation here. It would be better if you put both the i-th index and the (i+1)-th index in the error message and say that the indices are not strictly increasing. |
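One possible shape for such a message, as a hedged pure-Python sketch: the function name `require_strictly_increasing` and the exact wording are hypothetical and are not taken from the merged change. The check reports both the i-th and the (i+1)-th index when the pair is not strictly increasing.

```python
def require_strictly_increasing(indices):
    """Hypothetical check (not a PySpark API): raise if any adjacent pair
    of indices is not strictly increasing, naming both offending values."""
    for k in range(len(indices) - 1):
        if indices[k] >= indices[k + 1]:
            # Include both neighbours so the reader can see which pair failed.
            raise ValueError(
                "Indices %d and %d are not strictly increasing"
                % (indices[k], indices[k + 1]))


# A strictly increasing sequence passes silently.
require_strictly_increasing([0, 2, 5])

# A duplicate pair is reported with both values.
try:
    require_strictly_increasing([1, 2, 2])
except ValueError as e:
    print(e)
```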
|
Yes, I agree @mengxr, that much is definitely true. I just thought that an error message conveying that the user supplied duplicate indices is much more direct. If a novice developer created a sparse vector and got back an error message saying "the indices are not strictly increasing", he/she might instinctively drop duplicates and sort the indices before supplying them to the SparseVector constructor. That is not the behavior we want to induce (since the indices are sorted within the constructor). |
|
@rekhajoshm +1 for the suggestion from @mengxr Could you please update this accordingly? Thanks! (Or please comment if you don't have time.) @urvishparikh There is a bit of inconsistency / lack of clarity currently about which vector construction methods sort vs. not, especially comparing Scala vs. Python. Let's clean that up in separate issues. |
|
Thanks @jkbradley, I might have missed it or thought it was still under discussion. Updated. Thanks. |
|
Test build #48811 has finished for PR 9525 at commit
|
|
Test build #48829 has finished for PR 9525 at commit
|
|
Test build #48830 has finished for PR 9525 at commit
|
|
Test build #48831 has finished for PR 9525 at commit
|
|
LGTM. Merging with master |
PySpark SparseVector should have "Found duplicate indices" error message