Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20214][ML] Make sure converted csc matrix has sorted indices #17532

Closed
wants to merge 2 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Apr 5, 2017

What changes were proposed in this pull request?

_convert_to_vector converts a scipy sparse matrix to csc matrix for initializing SparseVector. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that:

from scipy.sparse import lil_matrix
lil = lil_matrix((4, 1))
lil[1, 0] = 1
lil[3, 0] = 2
_convert_to_vector(lil.todok())

File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
  return SparseVector(l.shape[0], csc.indices, csc.data)
File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
  % (self.indices[i], self.indices[i + 1]))
TypeError: Indices 3 and 1 are not strictly increasing

A simple test can confirm that dok_matrix.tocsc() won't guarantee sorted indices:

>>> from scipy.sparse import lil_matrix
>>> lil = lil_matrix((4, 1))
>>> lil[1, 0] = 1
>>> lil[3, 0] = 2
>>> dok = lil.todok()
>>> csc = dok.tocsc()
>>> csc.has_sorted_indices
0
>>> csc.indices
array([3, 1], dtype=int32)

I checked the source codes of scipy. The only way to guarantee it is csc_matrix.tocsr() and csr_matrix.tocsc().

How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@viirya
Copy link
Member Author

viirya commented Apr 5, 2017

cc @jkbradley @MLnick @holdenk

@SparkQA
Copy link

SparkQA commented Apr 5, 2017

Test build #75528 has finished for PR 17532 at commit f88eb6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 5, 2017

Test build #75530 has finished for PR 17532 at commit 10be94a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Apr 5, 2017

I don't think is directly related, but, I tried to make a similar change to the Spark Scala implementation of this type of conversion and couldn't make it work, just because it changed a bunch of little behaviors. (I could look up the JIRA.) This could be fine, being more narrow.

@@ -72,7 +72,9 @@ def _convert_to_vector(l):
return DenseVector(l)
elif _have_scipy and scipy.sparse.issparse(l):
assert l.shape[1] == 1, "Expected column vector"
csc = l.tocsc()
# Make sure the converted csc_matrix has sorted indices.
csc = l.tocsr().tocsc()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just call tocsc. You can then check the attribute has_sorted_indices and call sort_indices() if needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed.

@jkbradley
Copy link
Member

Are you able to write a unit test which passes data through _convert_to_vector and fails before this fix?

@jkbradley
Copy link
Member

Btw, I'd really like to get this into 2.2, which will be cut soon. Let me know if you'd like me to take it over. Thanks!

@viirya
Copy link
Member Author

viirya commented Apr 5, 2017

@jkbradley Thanks for comment. I will add the unit test now.

@viirya
Copy link
Member Author

viirya commented Apr 5, 2017

@jkbradley An unit test is added. Please check this again. Thanks.

@@ -853,6 +853,17 @@ def serialize(l):
self.assertEqual(sv, serialize(lil.tocsr()))
self.assertEqual(sv, serialize(lil.todok()))

def test_convert_to_vector(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have no Scipy-related tests in ml submodule, so I don't add it. If we really need one, please let me know. I can add it quickly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it's Ok; I made a separate JIRA to port the tests there.

@SparkQA
Copy link

SparkQA commented Apr 6, 2017

Test build #75560 has finished for PR 17532 at commit 2612d66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
Merging with master and branch-2.1, branch-2.0
Thanks a lot!

asfgit pushed a commit that referenced this pull request Apr 6, 2017
## What changes were proposed in this pull request?

`_convert_to_vector` converts a scipy sparse matrix to csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that:

    from scipy.sparse import lil_matrix
    lil = lil_matrix((4, 1))
    lil[1, 0] = 1
    lil[3, 0] = 2
    _convert_to_vector(lil.todok())

    File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
      return SparseVector(l.shape[0], csc.indices, csc.data)
    File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
      % (self.indices[i], self.indices[i + 1]))
    TypeError: Indices 3 and 1 are not strictly increasing

A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:

    >>> from scipy.sparse import lil_matrix
    >>> lil = lil_matrix((4, 1))
    >>> lil[1, 0] = 1
    >>> lil[3, 0] = 2
    >>> dok = lil.todok()
    >>> csc = dok.tocsc()
    >>> csc.has_sorted_indices
    0
    >>> csc.indices
    array([3, 1], dtype=int32)

I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.

## How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17532 from viirya/make-sure-sorted-indices.

(cherry picked from commit 1220605)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit asfgit closed this in 1220605 Apr 6, 2017
asfgit pushed a commit that referenced this pull request Apr 6, 2017
## What changes were proposed in this pull request?

`_convert_to_vector` converts a scipy sparse matrix to csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that:

    from scipy.sparse import lil_matrix
    lil = lil_matrix((4, 1))
    lil[1, 0] = 1
    lil[3, 0] = 2
    _convert_to_vector(lil.todok())

    File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
      return SparseVector(l.shape[0], csc.indices, csc.data)
    File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
      % (self.indices[i], self.indices[i + 1]))
    TypeError: Indices 3 and 1 are not strictly increasing

A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:

    >>> from scipy.sparse import lil_matrix
    >>> lil = lil_matrix((4, 1))
    >>> lil[1, 0] = 1
    >>> lil[3, 0] = 2
    >>> dok = lil.todok()
    >>> csc = dok.tocsc()
    >>> csc.has_sorted_indices
    0
    >>> csc.indices
    array([3, 1], dtype=int32)

I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.

## How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17532 from viirya/make-sure-sorted-indices.

(cherry picked from commit 1220605)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@viirya viirya deleted the make-sure-sorted-indices branch December 27, 2023 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants