[SPARK-17950] [Python] Match SparseVector behavior with DenseVector #15496
Conversation
What changes were proposed in this pull request? Simply added the `__getattr__` that DenseVector has to SparseVector, but calling `self.toArray()` instead of storing a vector all the time in `self.array`. This allows NumPy functions to be used on the values of a SparseVector in the same direct way that users interact with DenseVectors. How was this patch tested? Manual testing on a local machine.
Whoops, looks like this code is fine and I just had a bug in my local build. Reopening. |
As I say on the JIRA, if I understand this correctly, this turns O(n) operations into O(n^2) etc. I don't think that actually helps anything. |
It is possible that I am missing something, or that I have unintentionally obfuscated this pull request, so I will try summarizing my understanding/purpose and see if it sheds any light. DenseVector allows calls to NumPy directly (i.e. `DenseVector.mean()`) and always stores the array values in the object attribute `DenseVector.array`; this allows a lot of neat NumPy functions to be run on the array values without any trouble. SparseVector works differently: it never stores the full set of values as a full array. Instead, it uses a trick which only searches the non-zero index/value pairs when a specific entry is asked for. The solution proposed can, in effect, be thought of as a purely syntactic shortening from `SparseVector.toArray().mean()` to simply `SparseVector.mean()`. Thus, this should not introduce any increased complexity compared to how things are now. The current status of this object is confusing in that the intuitive call `SparseVector.mean()` just throws an "AttributeError: 'SparseVector' object has no attribute 'mean'". As mildly hinted at on JIRA, there are even better implementations which could follow this one, for example replacing the direct calls to NumPy by manually providing the same functions with reduced complexity. |
@itg-abby I see what you're going for, but I don't think it's a great idea in general. For many operations on sparse vectors, we should not materialize the sparse vector as dense. I'm not sure if it's reasonable to use SciPy, but a better solution to me would be:

```python
def __getattr__(self, item):
    csr = scipy.sparse.csr_matrix((self.values, self.indices, [0, 2]))
    return getattr(csr, item)
```

As you say, we can alternatively write our own. But the current patch would materialize potentially enormous arrays unnecessarily. Thoughts? Update: Since we don't require users to have SciPy, we can use a flag. cc @holdenk |
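For readers unfamiliar with the triple form used in the suggestion above: `(data, indices, indptr)` builds a CSR matrix without ever materializing a dense array, and the `[0, 2]` is the indptr saying that row 0 owns both stored entries. A standalone sketch with made-up values (not from the PR):

```python
import numpy as np
import scipy.sparse

# Stand-ins for a SparseVector's fields: size 4, nonzeros at positions 1 and 3.
values = np.array([1.0, 5.5])
indices = np.array([1, 3])
size = 4

# CSR triple form (data, indices, indptr): indptr [0, len(values)] says row 0
# spans every stored entry, so this is a 1 x size matrix with no dense copy.
csr = scipy.sparse.csr_matrix((values, indices, [0, len(values)]), shape=(1, size))

print(csr.sum())      # 6.5
print(csr.toarray())  # dense equivalent, for illustration only
```

Passing `shape=` avoids depending on the largest stored index to infer the vector's true length.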
@sethah Actually, there is already a flag "_have_scipy" so I have added the check for that in order to create the type when it is available. Otherwise, users just get the standard SparseVector back. The PR has been updated. |
```python
            csr = scipy.sparse.csr_matrix((self.values, self.indices, [0, 2]))
            return getattr(csr, item)
        else:
            return self
```
This doesn't look like the correct behaviour. Can you test this on a system without SciPy installed? (Probably easiest with a simple virtualenv.)
Systems without SciPy return the following:

```python
>>> from pyspark.mllib.linalg import SparseVector
>>> a = SparseVector(4, {1: 1.0, 3: 5.5})
>>> a.sum()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'SparseVector' object is not callable
```

This has been corrected in the latest commit.
`__getattr__` is essentially the catch-all for any attribute lookup that fails, so the standard behavior should be to raise an AttributeError, not to return the object (thank you for catching this!). The new code gives the following error when SciPy is not available:

```python
>>> a.sum()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sobh/spark/python/pyspark/mllib/linalg/__init__.py", line 802, in __getattr__
    raise AttributeError("'{0}' object has no attribute '{1}'.".format(self.__class__, item))
AttributeError: '<class 'pyspark.mllib.linalg.SparseVector'>' object has no attribute 'sum'.
```
Systems with SciPy give the expected result:

```python
>>> from pyspark.mllib.linalg import SparseVector
>>> a = SparseVector(4, {1: 1.0, 3: 5.5})
>>> a.sum()
6.5
```
Thanks for working on this and taking the time to update to avoid the array copy :) There don't appear to be any tests for the new functionality, which is generally a requirement, and tests could also help us see how we would expect people to use this. Also, if we make this change here we would probably want to do something similar in the `ml` linalg package. |
…seVector ## What changes were proposed in this pull request? Simply added the `__getattr__` to SparseVector that DenseVector has, but calling to a SciPy sparse representation instead of storing a vector all the time in `self.array`. This allows functions to be used on the values of an entire SparseVector in the same direct way that users interact with DenseVectors, i.e. you can simply call SparseVector.mean() to average the values in the entire vector. Note: the functions do have a slight bit of variance due to calling SciPy and not NumPy. However, the majority of useful functions (sums, means, max, etc.) are available in both packages anyway. ## How was this patch tested? Manual testing on local machine. Passed ./python/run-tests. No UI changes.
I have applied the code change to both ML and MLlib now, and I added some simple tests to check that the SciPy sparse functions behave correctly. (Only MLlib has tests for SciPy functions, so I only added test cases there.) Additionally, I updated the implementation with a wrapper function which 1) allows functions with inputs to work correctly, and 2) seamlessly allows SciPy functions which generate a SciPy matrix output to be automatically returned as a SparseVector object. Example use case:
|
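The example attached to the comment above did not survive extraction, but a sketch of the wrapper idea with a toy class (hypothetical names, assuming SciPy is installed) might look like:

```python
import numpy as np
import scipy.sparse


class MiniSparseVector:
    """Toy stand-in for SparseVector showing the wrapper idea: proxy
    unknown attributes to SciPy, pass arguments through, and convert
    sparse results back to the vector type."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = np.array(indices)
        self.values = np.array(values, dtype=np.float64)

    def __repr__(self):
        return "MiniSparseVector(%d, %s, %s)" % (
            self.size, self.indices.tolist(), self.values.tolist())

    def __getattr__(self, item):
        def wrapper(*args, **kwargs):
            # Build a 1 x size CSR view over the existing arrays (no dense copy).
            csr = scipy.sparse.csr_matrix(
                (self.values, self.indices, [0, len(self.values)]),
                shape=(1, self.size))
            result = getattr(csr, item)(*args, **kwargs)
            if scipy.sparse.issparse(result):
                # SciPy handed back a sparse matrix: re-wrap it.
                result = result.tocsr()
                return MiniSparseVector(result.shape[1], result.indices, result.data)
            return result
        return wrapper


a = MiniSparseVector(4, [1, 3], [1.0, 5.5])
print(a.mean())         # 1.625, i.e. 6.5 / 4 with the implicit zeros counted
print(a.multiply(2.0))  # a new sparse vector with the stored values doubled
```

Note the wrapper assumes every proxied attribute is callable; plain data attributes like `csr.nnz` would need separate handling in a real implementation.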
Sorry, I forgot to ping anyone after the last update. |
Thanks for working on this some more :) I'd be curious to know whether constructing the CSR is expensive and should be memoized or not (could maybe do some simple timing tests to quickly verify).
python/pyspark/ml/linalg/__init__.py
Outdated
```python
                return SparseVector(result.shape[1], result.indices, result.data)
            return result
        else:
            raise AttributeError("'{0}' object has no attribute '{1}'.".format(self.__class__, item))
```
This seems like it might be kind of a confusing way to communicate that the user doesn't have scipy installed
python/pyspark/ml/linalg/__init__.py
Outdated
```python
    def __getattr__(self, item):
        def wrapper(*args, **kwargs):
            if _have_scip:
                csr = scipy.sparse.csr_matrix((\
```
Do we need the `\`s?
```python
@@ -861,6 +861,19 @@ def test_dot(self):
        dv = DenseVector(array([1., 2., 3., 4.]))
        self.assertEqual(10.0, dv.dot(lil))

    def test_assorted_functs(self):
```
It would be good to have the same tests for ml as well.
@holdenk 2 million x 2 million elements. Snippet used for benchmarking:

I expect any changing of the SparseVector structure will take place through the PySpark object class, so CSR will definitely outperform the LIL and DOK matrix types for function execution as well. From the SciPy documentation: advantages of the CSR format are efficient arithmetic operations (CSR + CSR, CSR * CSR, etc.), efficient row slicing, and fast matrix-vector products; disadvantages are slow column slicing operations (consider CSC) and expensive changes to the sparsity structure (consider LIL or DOK). Other items resolved:
|
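The benchmark snippet itself was not preserved in this thread; a rough stand-in (sizes invented, far smaller than the 2M x 2M case mentioned) comparing a CSR reduction against the same data held as LIL might look like:

```python
import timeit

import numpy as np
import scipy.sparse

# Invented sizes: a 1 x 200k row vector with 2k nonzeros.
n = 200_000
rng = np.random.default_rng(0)
indices = np.sort(rng.choice(n, size=2_000, replace=False))
values = rng.random(2_000)

csr = scipy.sparse.csr_matrix((values, indices, [0, len(values)]), shape=(1, n))
lil = csr.tolil()

# Time the same reduction on both formats.
t_csr = timeit.timeit(csr.sum, number=1_000)
t_lil = timeit.timeit(lil.sum, number=1_000)
print("csr.sum: %.4fs   lil.sum: %.4fs" % (t_csr, t_lil))
```

Absolute numbers depend on the machine and SciPy version; the point is only that arithmetic-style operations favor CSR, as the SciPy documentation quoted above states.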
@holdenk , are we there yet :P? |
Ah sorry, I've been travelling a lot this month - let me take a look once I'm back from Strata singapore. I don't think we are going to get this in for 2.1 (sorry). |
python/pyspark/ml/linalg/__init__.py
Outdated
```python
            return result
        else:
            raise AttributeError(
                "'{0}' object has no attribute '{1}' or SciPy not installed.".format(self.__class__, item))
```
This error message would probably be better off just saying SciPy is not installed, since it's on the else branch of `if _have_scipy`, unless I'm missing something?
There are 3 cases that can reach `__getattr__` and produce an error:
- Calling a SciPy function while not having SciPy installed.
- Calling a function that is in neither SparseVector nor SciPy, when SciPy is installed; SciPy provides the attribute error on its own.
- Calling a function that is in neither SparseVector nor SciPy, without SciPy installed.
The message accounts for cases 1 and 3: if the user tries to call a function which does not exist in SparseVector or SciPy while not having SciPy installed, they are warned that it might not exist at all, as well as told that installing SciPy is a possible solution.
Ok, so maybe we can improve the error message to something like "'{0}' object has no attribute '{1}' and SciPy is not installed to proxy request to SparseVector" (or similar). Saying it's X or Y is confusing, since this error message only happens in the event SciPy is not installed. What do you think?
That sounds great, I'll add it in right now.
ping @holdenk, can you take a look at this please? Sorry for leaving this off for so long. |
Sure - I'll try and take a look tomorrow, just driving back from Monterey tonight.
|
Thanks for working on this, I've still got a few questions / concerns.
The primary one is that copying the values is going to be O(|M|), which might be a surprise (since `np.append` results in a copy, not an in-place operation, according to the docs).
Is there a reason why we copy the arrays rather than just passing in the ones we already have?
The other is that reading the code it's a bit difficult to understand what's going on (like why we are appending these values that then later get ignored).
I think if we could avoid the copy cost, and maybe clarify/simplify some of the code, this is getting close. I know it can be frustrating to work on a simple feature for such a long time, so thanks for sticking with it and let me know if I can help somehow :)
python/pyspark/ml/linalg/__init__.py
Outdated
```python
    def __getattr__(self, item):
        def wrapper(*args, **kwargs):
            if _have_scipy:
                csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
```
Super minor nit, but using named parameters might make this a bit easier when skimming the code (same for mllib).
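The named-parameter nit presumably refers to spelling out `shape=` rather than letting SciPy infer the matrix dimensions from the largest stored index. A sketch of the difference, with made-up values:

```python
import numpy as np
import scipy.sparse

values = np.array([1.0, 5.5])
indices = np.array([1, 3])
size = 4

# Shape inferred from max(indices) + 1: gives (1, 4) here only because
# the last column happens to hold a nonzero.
inferred = scipy.sparse.csr_matrix((values, indices, [0, len(values)]))

# Explicit shape= keyword: correct regardless of where the nonzeros
# fall, and easier to skim.
explicit = scipy.sparse.csr_matrix((values, indices, [0, len(values)]),
                                   shape=(1, size))

print(inferred.shape, explicit.shape)  # (1, 4) (1, 4)
```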
```python
@@ -705,6 +705,23 @@ def __eq__(self, other):
            return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array)
        return False

    def __getattr__(self, item):
        def wrapper(*args, **kwargs):
```
Would be good to have a comment here explaining its purpose.
python/pyspark/ml/linalg/__init__.py
Outdated
```python
    def __getattr__(self, item):
        def wrapper(*args, **kwargs):
            if _have_scipy:
                csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
```
More concretely though, why are we padding the data/values with a 0?
python/pyspark/ml/linalg/__init__.py
Outdated
```python
        def wrapper(*args, **kwargs):
            if _have_scipy:
                csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
                                               np.append(self.indices, self.size-1),
```
So this "works", in that the padded entry is skipped by the indptr range we supply below, but it makes the code confusing to read. Why do we need it, and maybe we could add a comment explaining why?
@holdenk The appends were also ultimately unnecessary, and have been removed in favor of a much simpler and more efficient call to make the CSR matrix! |
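The fix being described can be illustrated with toy arrays (names assumed): the earlier revisions padded a trailing zero at column `size - 1` so that the inferred width reached the vector's true length, at the cost of `np.append` copying both arrays; passing `shape=` makes both the padding and the copies unnecessary.

```python
import numpy as np
import scipy.sparse

values = np.array([1.0, 5.5])
indices = np.array([1, 3])
size = 4

# Earlier, padded construction: np.append copies each array, and the
# extra entry exists only so the inferred width reaches `size`; the
# indptr [0, len(values)] then skips it.
padded = scipy.sparse.csr_matrix((np.append(values, 0),
                                  np.append(indices, size - 1),
                                  [0, len(values)]))

# Direct construction: no copies, and the width is stated explicitly.
direct = scipy.sparse.csr_matrix((values, indices, [0, len(values)]),
                                 shape=(1, size))

print(padded.shape, direct.shape)  # both (1, 4)
print(direct.sum())                # 6.5
```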
ping @holdenk , please |
ping @holdenk , revisiting this PR for the memories today. Wondering whether it was shipworthy in the end, thanks! |
Thanks! I'm glad you got rid of the append (and sorry this slipped my radar). I'll add this to my queue and try to take a more thorough look this week. I'm sorry it's taken so long. |
@holdenk any comments on this PR? |
Oh, I'm sorry I haven't had a chance. I'll try and take a look soon. |
Sorry for letting this slide for so long. This looks really close; now that we don't have the append, I don't have the concerns about the copy any more. Can you update this to master so we can make sure it passes the new style guides? Would be nice to get this in for Spark 3 for sure :) |
Jenkins Ok to test |
While you update to master, I might include in the docstring that the similar functionality in DenseVector is done with manual delegation in `__getattr__`. |
Gentle ping, are you still interested in this? |
Extend docstring for SparseVector SciPy functions
Hi, can you give a reference to the new style guides? I'm not sure if anything major needs changing. In the meantime I resolved the one conflict at the bottom of ml/tests.py and extended the docstring to include your comment. |
Can one of the admins verify this patch? |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Simply added the `__getattr__` that DenseVector has to SparseVector, but delegating to a SciPy sparse representation instead of storing a vector all the time in `self.array`. This allows functions to be used on the values of an entire SparseVector in the same direct way that users interact with DenseVectors, i.e. you can simply call `SparseVector.mean()` to average the values in the entire vector.
Note: the functions do have a slight bit of variance due to calling SciPy and not NumPy. However, the majority of useful functions (sums, means, max, etc.) are available in both packages anyway.
How was this patch tested?
Manual testing on local machine.
Passed ./python/run-tests
No UI changes.