-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays #14725
Conversation
…ava_array-CountVectorizer
@yanboliang, @jkbradley what do you think of this proposal? |
Maybe it would be useful to see in the PR where this is going to simplify the current PySpark ML code? (Need not be every model right away since if we decide to make design changes that could involve a large rewrite). |
Test build #64111 has finished for PR 14725 at commit
|
Test build #64304 has finished for PR 14725 at commit
|
Thanks @holdenk. I have an example usage for primitive arrays in the PR description, let me know if that is not clear enough to show how this change useful. I also added usage for OneVsRestModel that uses this to build an array of Classification models and gets rid of the need for a Java/Python friendly constructor. |
Test build #64305 has finished for PR 14725 at commit
|
Test build #64306 has finished for PR 14725 at commit
|
ping @yanboliang, mind taking a look? I'd like to have this to create a CountVectorizerModel from a vocabulary list from SPARK-15009, thanks! |
ping @jkbradley @yanboliang |
…array-CountVectorizer-SPARK-17161
ping @jkbradley @yanboliang @MLnick . This seems to have gone stale, but I think it would be great to get in to add things like CountVectorizer constructor from vocab list to PySpark. If any of you can take a look, that would be much appreciated! |
Test build #70846 has finished for PR 14725 at commit
|
Test build #70885 has finished for PR 14725 at commit
|
Jenkins, retest this please |
Test build #70887 has finished for PR 14725 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea in principal, but I'm not so sure about the inference option - it seems like we might just be better off specifying our own types.
That being said, I think the base helper function for copying the Python stuff into the array could be useful and help reduce the amount of python explicit contructors we have to build. But we should really get a Python committers opinion on this.
@staticmethod | ||
def _new_java_primitive_array(pylist): | ||
if not pylist: | ||
raise ValueError("Unable to convert an empty list to Java array") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I understand we can't look at the types on an empty list, but using this seems like a potential headache that might not be worth it - in at least some of the places where we take an array we will want to take an empty array.
Also I forsee this possibly leading to all kinds of fun problems with numeric types :(
What about if we kept _new_java_array
and just required in our code we specify the type? It should still be simpler when porting new algorithms to Python but give us more flexibility.
cc @davies perhaps - what are your thoughts? |
Thanks @holdenk for taking a look! Yeah, I think you're right about the issues trying to infer a type. It would be nice if there was some easy way to specify a primitive type since that would cover most of the cases in mllib, but it's not that big of deal. If anything it's probably just a little obscure to a dev who's not familiar with py4j in Spark that for a sting the java class should be |
Maybe it could be cleared up a bit with a good docstring? Although if the result is too confusing to be used then it's probably not worth doing. |
Sure, I can add a better docstring. This is just for developers and doesn't have to be used, but it can be used to avoid creating more Java-friendly functions only because they have arrays - which happens a lot in mllib. Anything to help limit public APIs is very useful, imho. |
I don't think the wrappers are public APIs per-se, but I agree reducing the amount of boilerplate scala code required to expose the ML stuff is good if we can make it robust :) |
…array-CountVectorizer-SPARK-17161
Hey @holdenk , I removed the attempt at type inference so now must be specified explicitly and added a docstring showing common examples. Please take another look when you can, thanks! |
This looks good, how about also adding a test for an empty array given that it was a consideration in the earlier iteration (not anymore but good to have as a test incase we forget or someone else comes back to this later)? |
ya good idea, I'll add that |
Test build #72051 has finished for PR 14725 at commit
|
Test build #72056 has finished for PR 14725 at commit
|
@@ -16,6 +16,10 @@ | |||
# | |||
|
|||
from abc import ABCMeta, abstractmethod | |||
import sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these imports needed anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good catch thanks! I wasn't using the def for basestring
but I still am iterating the array/list with xrange
. I was thinking it would be possible for someone to use a large list, like when using a vocabulary. I suppose I could use enumerate
if you think that's better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this look ok now @holdenk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks fine, we use xrange similarly elsewhere.
Test build #72092 has finished for PR 14725 at commit
|
LGTM - thanks for doing this - please ping me on the follow up PRs with the |
Thanks @holdenk! I updated the description. I'll follow up with |
Thanks @holdenk! Would you mind assigning the JIRA to me? https://issues.apache.org/jira/browse/SPARK-17161 |
I'd love to - however my JIRA account still doesn't have those permissions. My plan was to bug people in person next week to get that sorted out and go back and update the JIRAs. |
Ah, I see.. No worries then, I though maybe you had forgotten. |
For future reference: Merged into master |
…ction to create Py4J JavaArrays ## What changes were proposed in this pull request? Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor. The function takes a Java class as input that is used by Py4J to create the Java array of the given class. As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed. ## How was this patch tested? Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model. Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
What changes were proposed in this pull request?
Adding convenience function to Python
JavaWrapper
so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a ScalaArray
as input so that it is not necessary to have a Java/Python friendly constructor. The function takes a Java class as input that is used by Py4J to create the Java array of the given class. As an example,OneVsRest
has been updated to use this and the alternate constructor is removed.How was this patch tested?
Added unit tests for the new convenience function and updated
OneVsRest
doctests which use this to persist the model.