-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20214][ML] Make sure converted csc matrix has sorted indices #17532
Conversation
f88eb6b
to
10be94a
Compare
Test build #75528 has finished for PR 17532 at commit
|
Test build #75530 has finished for PR 17532 at commit
|
I don't think is directly related, but, I tried to make a similar change to the Spark Scala implementation of this type of conversion and couldn't make it work, just because it changed a bunch of little behaviors. (I could look up the JIRA.) This could be fine, being more narrow. |
python/pyspark/ml/linalg/__init__.py
Outdated
@@ -72,7 +72,9 @@ def _convert_to_vector(l): | |||
return DenseVector(l) | |||
elif _have_scipy and scipy.sparse.issparse(l): | |||
assert l.shape[1] == 1, "Expected column vector" | |||
csc = l.tocsc() | |||
# Make sure the converted csc_matrix has sorted indices. | |||
csc = l.tocsr().tocsc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just call tocsc. You can then check the attribute has_sorted_indices and call sort_indices() if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed.
Are you able to write a unit test which passes data through _convert_to_vector and fails before this fix? |
Btw, I'd really like to get this into 2.2, which will be cut soon. Let me know if you'd like me to take it over. Thanks! |
@jkbradley Thanks for comment. I will add the unit test now. |
@jkbradley An unit test is added. Please check this again. Thanks. |
@@ -853,6 +853,17 @@ def serialize(l): | |||
self.assertEqual(sv, serialize(lil.tocsr())) | |||
self.assertEqual(sv, serialize(lil.todok())) | |||
|
|||
def test_convert_to_vector(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have no Scipy-related tests in ml
submodule, so I don't add it. If we really need one, please let me know. I can add it quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it's Ok; I made a separate JIRA to port the tests there.
Test build #75560 has finished for PR 17532 at commit
|
LGTM |
## What changes were proposed in this pull request? `_convert_to_vector` converts a scipy sparse matrix to csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that: from scipy.sparse import lil_matrix lil = lil_matrix((4, 1)) lil[1, 0] = 1 lil[3, 0] = 2 _convert_to_vector(lil.todok()) File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector return SparseVector(l.shape[0], csc.indices, csc.data) File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__ % (self.indices[i], self.indices[i + 1])) TypeError: Indices 3 and 1 are not strictly increasing A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices: >>> from scipy.sparse import lil_matrix >>> lil = lil_matrix((4, 1)) >>> lil[1, 0] = 1 >>> lil[3, 0] = 2 >>> dok = lil.todok() >>> csc = dok.tocsc() >>> csc.has_sorted_indices 0 >>> csc.indices array([3, 1], dtype=int32) I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17532 from viirya/make-sure-sorted-indices. (cherry picked from commit 1220605) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
## What changes were proposed in this pull request? `_convert_to_vector` converts a scipy sparse matrix to csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that: from scipy.sparse import lil_matrix lil = lil_matrix((4, 1)) lil[1, 0] = 1 lil[3, 0] = 2 _convert_to_vector(lil.todok()) File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector return SparseVector(l.shape[0], csc.indices, csc.data) File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__ % (self.indices[i], self.indices[i + 1])) TypeError: Indices 3 and 1 are not strictly increasing A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices: >>> from scipy.sparse import lil_matrix >>> lil = lil_matrix((4, 1)) >>> lil[1, 0] = 1 >>> lil[3, 0] = 2 >>> dok = lil.todok() >>> csc = dok.tocsc() >>> csc.has_sorted_indices 0 >>> csc.indices array([3, 1], dtype=int32) I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17532 from viirya/make-sure-sorted-indices. (cherry picked from commit 1220605) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
What changes were proposed in this pull request?
_convert_to_vector
converts a scipy sparse matrix to csc matrix for initializingSparseVector
. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that:A simple test can confirm that
dok_matrix.tocsc()
won't guarantee sorted indices:I checked the source codes of scipy. The only way to guarantee it is
csc_matrix.tocsr()
andcsr_matrix.tocsc()
.How was this patch tested?
Existing tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.