[SPARK-17950] [Python] Match SparseVector behavior with DenseVector #15496
Conversation
What changes were proposed in this pull request? Simply added the `__getattr__` that DenseVector has to SparseVector, but calling `self.toArray()` instead of storing a vector all the time in `self.array`. This allows NumPy functions to be used on the values of a SparseVector in the same direct way that users interact with DenseVectors. How was this patch tested? Manual testing on a local machine.
Whoops, looks like this code is fine and I just had a bug in my local build. Reopening. |
As I say on the JIRA, if I understand this correctly, this turns O(n) operations into O(n^2) etc. I don't think that actually helps anything. |
It is possible that I am missing something, or that I have unintentionally obfuscated this pull request, so I will try summarizing my understanding/purpose and see if it sheds any light. DenseVector allows calls to NumPy directly (i.e. `DenseVector.mean()`) and always stores the array values in the object attribute `DenseVector.array`; this allows a lot of neat NumPy functions to be run on the array values without any trouble. SparseVector works differently: it never stores the full set of values as a full array. Instead, it uses a trick which only searches the non-zero index/value pairs when a specific entry is asked for. The solution proposed can, in effect, be thought of as a purely syntactic shortening from `SparseVector.toArray().mean()` to simply `SparseVector.mean()`. Thus, this should not introduce any increased complexity compared to how things are now. The current status of this object is confusing in that the intuitive call `SparseVector.mean()` just throws an "AttributeError: 'SparseVector' object has no attribute 'mean'". As mildly hinted at on JIRA, there are even better implementations which could follow this one, for example replacing the direct calls to NumPy by manually providing the same functions with reduced complexity. |
@itg-abby I see what you're going for, but I don't think it's a great idea in general. For many operations on sparse vectors, we should not materialize the sparse vector as dense. I'm not sure if it's reasonable to use SciPy, but a better solution to me would be:

```python
def __getattr__(self, item):
    csr = scipy.sparse.csr_matrix((self.values, self.indices, [0, 2]))
    return getattr(csr, item)
```

As you say, we can alternatively write our own. But the current patch would materialize potentially enormous arrays unnecessarily. Thoughts? Update: Since we don't require users to have SciPy, we can use a flag. cc @holdenk |
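For readers unfamiliar with the triple form used in the suggestion above: `(data, indices, indptr)` builds a CSR matrix without ever materializing a dense array, and the `[0, 2]` is the indptr saying that row 0 owns both stored entries. A standalone sketch with made-up values (not from the PR):

```python
import numpy as np
import scipy.sparse

# Stand-ins for a SparseVector's fields: size 4, nonzeros at positions 1 and 3.
values = np.array([1.0, 5.5])
indices = np.array([1, 3])
size = 4

# CSR triple form (data, indices, indptr): indptr [0, len(values)] says row 0
# spans every stored entry, so this is a 1 x size matrix with no dense copy.
csr = scipy.sparse.csr_matrix((values, indices, [0, len(values)]), shape=(1, size))

print(csr.sum())      # 6.5
print(csr.toarray())  # dense equivalent, for illustration only
```

Passing `shape=` avoids depending on the largest stored index to infer the vector's true length.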
@sethah Actually, there is already a flag "_have_scipy" so I have added the check for that in order to create the type when it is available. Otherwise, users just get the standard SparseVector back. The PR has been updated. |
```python
            csr = scipy.sparse.csr_matrix((self.values, self.indices, [0, 2]))
            return getattr(csr, item)
        else:
            return self
```
This doesn't look like the correct behaviour. Can you test this on a system without SciPy installed? (Probably easiest with a simple virtualenv.)
Systems without SciPy return the following:

```python
>>> from pyspark.mllib.linalg import SparseVector
>>> a = SparseVector(4, {1: 1.0, 3: 5.5})
>>> a.sum()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'SparseVector' object is not callable
```

This has been corrected in the latest commit.
`__getattr__` is essentially the catch-all for any attribute lookup that fails, so the standard behavior should be to raise an AttributeError, not to return the object (thank you for catching this!). The new code gives the following error when SciPy is not available:

```python
>>> a.sum()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sobh/spark/python/pyspark/mllib/linalg/__init__.py", line 802, in __getattr__
    raise AttributeError("'{0}' object has no attribute '{1}'.".format(self.__class__, item))
AttributeError: '<class 'pyspark.mllib.linalg.SparseVector'>' object has no attribute 'sum'.
```
Systems with SciPy give the expected result:

```python
>>> from pyspark.mllib.linalg import SparseVector
>>> a = SparseVector(4, {1: 1.0, 3: 5.5})
>>> a.sum()
6.5
```
Thanks for working on this and taking the time to update to avoid the array copy :) There don't appear to be any tests for the new functionality, which is generally a requirement, and tests could also help us see how we would expect people to use this. Also, if we make this change here we would probably want to do something similar in the `ml` linalg package. |
…seVector ## What changes were proposed in this pull request? Simply added the `__getattr__` to SparseVector that DenseVector has, but calling to a SciPy sparse representation instead of storing a vector all the time in `self.array`. This allows functions to be used on the values of an entire SparseVector in the same direct way that users interact with DenseVectors, i.e. you can simply call SparseVector.mean() to average the values in the entire vector. Note: the functions do have a slight bit of variance due to calling SciPy and not NumPy. However, the majority of useful functions (sums, means, max, etc.) are available in both packages anyway. ## How was this patch tested? Manual testing on local machine. Passed ./python/run-tests. No UI changes.
I have applied the code change to both ML and MLlib now, and I added some simple tests to check that the SciPy sparse functions behave correctly. (Only MLlib has tests for SciPy functions, so I only added test cases there.) Additionally, I updated the implementation with a wrapper function which 1) allows functions with inputs to work correctly, and 2) seamlessly allows SciPy functions which generate a SciPy matrix output to be automatically returned as a SparseVector object. Example use case:
|
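The example attached to the comment above did not survive extraction, but a sketch of the wrapper idea with a toy class (hypothetical names, assuming SciPy is installed) might look like:

```python
import numpy as np
import scipy.sparse


class MiniSparseVector:
    """Toy stand-in for SparseVector showing the wrapper idea: proxy
    unknown attributes to SciPy, pass arguments through, and convert
    sparse results back to the vector type."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = np.array(indices)
        self.values = np.array(values, dtype=np.float64)

    def __repr__(self):
        return "MiniSparseVector(%d, %s, %s)" % (
            self.size, self.indices.tolist(), self.values.tolist())

    def __getattr__(self, item):
        def wrapper(*args, **kwargs):
            # Build a 1 x size CSR view over the existing arrays (no dense copy).
            csr = scipy.sparse.csr_matrix(
                (self.values, self.indices, [0, len(self.values)]),
                shape=(1, self.size))
            result = getattr(csr, item)(*args, **kwargs)
            if scipy.sparse.issparse(result):
                # SciPy handed back a sparse matrix: re-wrap it.
                result = result.tocsr()
                return MiniSparseVector(result.shape[1], result.indices, result.data)
            return result
        return wrapper


a = MiniSparseVector(4, [1, 3], [1.0, 5.5])
print(a.mean())         # 1.625, i.e. 6.5 / 4 with the implicit zeros counted
print(a.multiply(2.0))  # a new sparse vector with the stored values doubled
```

Note the wrapper assumes every proxied attribute is callable; plain data attributes like `csr.nnz` would need separate handling in a real implementation.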
Sorry, I forgot to ping anyone after the last update. |
Thanks for working on this some more :) I'd be curious to know whether constructing the CSR is expensive and should be memoized or not (could maybe do some simple timing tests to quickly verify).
python/pyspark/ml/linalg/__init__.py
Outdated
```python
                return SparseVector(result.shape[1], result.indices, result.data)
            return result
        else:
            raise AttributeError("'{0}' object has no attribute '{1}'.".format(self.__class__, item))
```
This seems like it might be kind of a confusing way to communicate that the user doesn't have scipy installed
python/pyspark/ml/linalg/__init__.py
Outdated
```python
    def __getattr__(self, item):
        def wrapper(*args, **kwargs):
            if _have_scip:
                csr = scipy.sparse.csr_matrix((\
```
Do we need the `\`s?
```python
@@ -861,6 +861,19 @@ def test_dot(self):
        dv = DenseVector(array([1., 2., 3., 4.]))
        self.assertEqual(10.0, dv.dot(lil))

    def test_assorted_functs(self):
```
It would be good to have the same tests for ml as well.
@holdenk 2 million x 2 million elements. Snippet used for benchmarking:

I expect any changing of the SparseVector structure will take place through the PySpark object class, so CSR will definitely outperform the LIL and DOK matrix types for function execution as well. From the SciPy documentation: advantages of the CSR format are efficient arithmetic operations (CSR + CSR, CSR * CSR, etc.), efficient row slicing, and fast matrix-vector products; disadvantages are slow column slicing operations (consider CSC) and expensive changes to the sparsity structure (consider LIL or DOK). Other items resolved:
|
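The benchmark snippet itself was not preserved in this thread; a rough stand-in (sizes invented, far smaller than the 2M x 2M case mentioned) comparing a CSR reduction against the same data held as LIL might look like:

```python
import timeit

import numpy as np
import scipy.sparse

# Invented sizes: a 1 x 200k row vector with 2k nonzeros.
n = 200_000
rng = np.random.default_rng(0)
indices = np.sort(rng.choice(n, size=2_000, replace=False))
values = rng.random(2_000)

csr = scipy.sparse.csr_matrix((values, indices, [0, len(values)]), shape=(1, n))
lil = csr.tolil()

# Time the same reduction on both formats.
t_csr = timeit.timeit(csr.sum, number=1_000)
t_lil = timeit.timeit(lil.sum, number=1_000)
print("csr.sum: %.4fs   lil.sum: %.4fs" % (t_csr, t_lil))
```

Absolute numbers depend on the machine and SciPy version; the point is only that arithmetic-style operations favor CSR, as the SciPy documentation quoted above states.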
@holdenk , are we there yet :P? |
Ah sorry, I've been travelling a lot this month - let me take a look once I'm back from Strata singapore. I don't think we are going to get this in for 2.1 (sorry). |
python/pyspark/ml/linalg/__init__.py
Outdated
```python
            return result
        else:
            raise AttributeError(
                "'{0}' object has no attribute '{1}' or SciPy not installed.".format(self.__class__, item))
```
This error message would probably be better off just saying SciPy is not installed, since it's on the else branch of `if _have_scipy`, unless I'm missing something?
There are 3 cases that can reach `__getattr__` and produce an error:
- Calling a SciPy function while not having SciPy installed.
- Calling a function that is in neither SparseVector nor SciPy, when SciPy is installed; SciPy provides the attribute error on its own.
- Calling a function that is in neither SparseVector nor SciPy, without SciPy installed.
The message accounts for cases 1 and 3: if the user tries to call a function which does not exist in SparseVector or SciPy while not having SciPy installed, they are warned that it might not exist at all, as well as told that installing SciPy is a possible solution.
Ok, so maybe we can improve the error message to something like "'{0}' object has no attribute '{1}' and SciPy is not installed to proxy request to SparseVector" (or similar). Saying it's X or Y is confusing, since this error message only happens in the event SciPy is not installed. What do you think?
That sounds great, I'll add it in right now.
ping @holdenk, can you take a look at this please? Sorry for leaving this off for so long. |
Sure - I'll try and take a look tomorrow, just driving back from Monterey tonight.
|
Thanks for working on this, I've still got a few questions / concerns.
The primary one is that copying the values is going to be O(|M|), which might be a surprise (since `np.append` results in a copy, not an in-place operation, according to the docs).
Is there a reason why we copy the arrays rather than just passing in the ones we already have?
The other is that reading the code it's a bit difficult to understand what's going on (like why we are appending these values that then later get ignored).
I think if we could avoid the copy cost, and maybe clarify/simplify some of the code, this is getting close. I know it can be frustrating to work on a simple feature for such a long time, so thanks for sticking with it and let me know if I can help somehow :)
python/pyspark/ml/linalg/__init__.py
Outdated
```python
    def __getattr__(self, item):
        def wrapper(*args, **kwargs):
            if _have_scipy:
                csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
```
Super minor nit, but using named parameters might make this a bit easier when skimming the code (same for mllib).
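The named-parameter nit presumably refers to spelling out `shape=` rather than letting SciPy infer the matrix dimensions from the largest stored index. A sketch of the difference, with made-up values:

```python
import numpy as np
import scipy.sparse

values = np.array([1.0, 5.5])
indices = np.array([1, 3])
size = 4

# Shape inferred from max(indices) + 1: gives (1, 4) here only because
# the last column happens to hold a nonzero.
inferred = scipy.sparse.csr_matrix((values, indices, [0, len(values)]))

# Explicit shape= keyword: correct regardless of where the nonzeros
# fall, and easier to skim.
explicit = scipy.sparse.csr_matrix((values, indices, [0, len(values)]),
                                   shape=(1, size))

print(inferred.shape, explicit.shape)  # (1, 4) (1, 4)
```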
```python
@@ -705,6 +705,23 @@ def __eq__(self, other):
            return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array)
        return False

    def __getattr__(self, item):
        def wrapper(*args, **kwargs):
```
Would be good to have a comment here explaining its purpose.
python/pyspark/ml/linalg/__init__.py
Outdated
```python
    def __getattr__(self, item):
        def wrapper(*args, **kwargs):
            if _have_scipy:
                csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
```
More concretely though, why are we padding the data/values with a 0?
python/pyspark/ml/linalg/__init__.py
Outdated
```python
        def wrapper(*args, **kwargs):
            if _have_scipy:
                csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
                                               np.append(self.indices, self.size-1),
```
So this "works", in that the padded entry is skipped by the indptr range we supply below, but it makes the code confusing to read. Why do we need it, and maybe we could add a comment explaining why?
@holdenk The appends were also ultimately unnecessary, and have been removed in favor of a much simpler and more efficient call to make the CSR matrix! |
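The fix being described can be illustrated with toy arrays (names assumed): the earlier revisions padded a trailing zero at column `size - 1` so that the inferred width reached the vector's true length, at the cost of `np.append` copying both arrays; passing `shape=` makes both the padding and the copies unnecessary.

```python
import numpy as np
import scipy.sparse

values = np.array([1.0, 5.5])
indices = np.array([1, 3])
size = 4

# Earlier, padded construction: np.append copies each array, and the
# extra entry exists only so the inferred width reaches `size`; the
# indptr [0, len(values)] then skips it.
padded = scipy.sparse.csr_matrix((np.append(values, 0),
                                  np.append(indices, size - 1),
                                  [0, len(values)]))

# Direct construction: no copies, and the width is stated explicitly.
direct = scipy.sparse.csr_matrix((values, indices, [0, len(values)]),
                                 shape=(1, size))

print(padded.shape, direct.shape)  # both (1, 4)
print(direct.sum())                # 6.5
```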
ping @holdenk , please |
ping @holdenk , revisiting this PR for the memories today. Wondering whether it was shipworthy in the end, thanks! |
Thanks! I'm glad you got rid of the append (and sorry this slipped my radar). I'll add this to my queue and try to take a more thorough look this week. I'm sorry it's taken so long. |
@holdenk any comments on this PR? |
Oh, I'm sorry I haven't had a chance. I'll try and take a look soon. |
Sorry for letting this slide for so long. This looks really close; now that we don't have the append, I don't have the concerns about the copy any more. Can you update this to master so we can make sure it passes the new style guides? Would be nice to get this in for Spark 3 for sure :) |
Jenkins Ok to test |
While you update to master, I might include in the docstring that the similar functionality in DenseVector is done with manual delegation in `__getattr__`. |
Gentle ping, are you still interested in this? |
Extend docstring for SparseVector SciPy functions
Hi, can you give a reference to the new style guides? I'm not sure if anything major needs changing. In the meantime I resolved the one conflict at the bottom of ml/tests.py and extended the docstring to include your comment. |
Can one of the admins verify this patch? |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Simply added the `__getattr__` that DenseVector has to SparseVector, but delegating to a SciPy sparse representation instead of storing a vector all the time in `self.array`. This allows functions to be used on the values of an entire SparseVector in the same direct way that users interact with DenseVectors, i.e. you can simply call `SparseVector.mean()` to average the values in the entire vector.
Note: the functions do have a slight bit of variance due to calling SciPy and not NumPy. However, the majority of useful functions (sums, means, max, etc.) are available in both packages anyway.
How was this patch tested?
Manual testing on local machine.
Passed ./python/run-tests
No UI changes.