
[SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays #14725

Conversation

@BryanCutler (Member) commented Aug 19, 2016

What changes were proposed in this pull request?

This adds a convenience function to the Python JavaWrapper that makes it easy to create a Py4J JavaArray compatible with existing class constructors that take a Scala Array as input, so a separate Java/Python-friendly constructor is no longer necessary. The function takes a Java class as input, which Py4J uses to create a Java array of that class. As an example, OneVsRest has been updated to use it, and its alternate constructor has been removed.
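
A minimal sketch of the idea, assuming an active SparkContext (the helper name, placement, and signature below are illustrative, not necessarily the code in this PR):

```python
from pyspark import SparkContext


def new_java_array(pylist, java_class):
    """Sketch: build a Py4J JavaArray of `java_class` from a Python list."""
    sc = SparkContext._active_spark_context
    # Py4J's gateway can allocate a typed Java array for the given class...
    java_array = sc._gateway.new_array(java_class, len(pylist))
    # ...which is then filled element by element from the Python list.
    for i in range(len(pylist)):
        java_array[i] = pylist[i]
    return java_array


# Example: an Array[String] argument for a Scala constructor.
# strings = new_java_array(["a", "b", "c"], sc._gateway.jvm.java.lang.String)
```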

How was this patch tested?

Added unit tests for the new convenience function and updated OneVsRest doctests which use this to persist the model.

@BryanCutler (Member Author):

@yanboliang, @jkbradley what do you think of this proposal?

@holdenk (Contributor) commented Aug 19, 2016

Maybe it would be useful to show in the PR where this is going to simplify the current PySpark ML code? (It needn't be every model right away, since if we decide to make design changes, that could involve a large rewrite.)

@SparkQA commented Aug 19, 2016

Test build #64111 has finished for PR 14725 at commit f9672bf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2016

Test build #64304 has finished for PR 14725 at commit b9da983.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member Author):

Thanks @holdenk. I have an example usage for primitive arrays in the PR description; let me know if that is not clear enough to show how this change is useful. I also added a usage in OneVsRestModel that uses it to build an array of classification models and gets rid of the need for a Java/Python-friendly constructor.
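
A hedged sketch of what that OneVsRestModel usage might look like (the names, import paths, and call site below are assumptions for illustration, not the PR's verbatim diff):

```python
from pyspark import SparkContext
from pyspark.ml.wrapper import JavaWrapper


def _models_to_java_array(models):
    # Hypothetical helper for illustration: convert a list of fitted per-class
    # PySpark models into a Java Array[ClassificationModel] so it can be passed
    # to a Scala constructor without adding a Java-friendly overload.
    sc = SparkContext._active_spark_context
    java_models = [model._to_java() for model in models]
    return JavaWrapper._new_java_array(
        java_models,
        sc._gateway.jvm.org.apache.spark.ml.classification.ClassificationModel)
```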

@SparkQA commented Aug 23, 2016

Test build #64305 has finished for PR 14725 at commit 32fd5bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2016

Test build #64306 has finished for PR 14725 at commit 4957e0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member Author):

ping @yanboliang, mind taking a look? I'd like to have this so I can create a CountVectorizerModel from a vocabulary list for SPARK-15009, thanks!

@BryanCutler (Member Author):

ping @jkbradley @yanboliang

@BryanCutler (Member Author):

ping @jkbradley @yanboliang @MLnick. This seems to have gone stale, but I think it would be great to get this in so we can add things like a CountVectorizer constructor from a vocabulary list to PySpark. If any of you can take a look, that would be much appreciated!

@SparkQA commented Jan 4, 2017

Test build #70846 has finished for PR 14725 at commit ba63530.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 4, 2017

Test build #70885 has finished for PR 14725 at commit 834c641.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member Author):

Jenkins, retest this please

@SparkQA commented Jan 4, 2017

Test build #70887 has finished for PR 14725 at commit 834c641.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) left a review comment:

I like the idea in principle, but I'm not so sure about the inference option - it seems like we might just be better off specifying our own types.

That being said, I think the base helper function for copying the Python values into the array could be useful and help reduce the number of explicit Python constructors we have to build. But we should really get a Python committer's opinion on this.

Reviewed diff excerpt:

    @staticmethod
    def _new_java_primitive_array(pylist):
        if not pylist:
            raise ValueError("Unable to convert an empty list to Java array")
Review comment (Contributor):

So I understand we can't look at the types of an empty list, but using this seems like a potential headache that might not be worth it - in at least some of the places where we take an array, we will want to accept an empty array.

Also, I foresee this possibly leading to all kinds of fun problems with numeric types :(

What if we kept _new_java_array and just required that our code specify the type? It should still be simpler when porting new algorithms to Python, but give us more flexibility.

@holdenk (Contributor) commented Jan 5, 2017

cc @davies perhaps - what are your thoughts?

@BryanCutler (Member Author):

Thanks @holdenk for taking a look! Yeah, I think you're right about the issues with trying to infer a type. It would be nice if there were some easy way to specify a primitive type, since that would cover most of the cases in mllib, but it's not that big of a deal. If anything, it's probably just a little obscure to a dev who's not familiar with Py4J in Spark that, for a string, the Java class should be sc._gateway.jvm.java.lang.String, for example.
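
For example, a hedged usage sketch (assuming an active SparkContext `sc` and the `_new_java_array` name discussed in this thread):

```python
from pyspark.ml.wrapper import JavaWrapper

# The element class is looked up through the Py4J gateway rather than given
# as a Python type: Python str maps to java.lang.String here.
String = sc._gateway.jvm.java.lang.String
java_strings = JavaWrapper._new_java_array(["a", "b", "c"], String)
```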

@holdenk (Contributor) commented Jan 5, 2017

Maybe it could be cleared up a bit with a good docstring? Although if the result is too confusing to be used then it's probably not worth doing.

@BryanCutler (Member Author):

Sure, I can add a better docstring. This is just for developers and doesn't have to be used, but it can be used to avoid creating more Java-friendly functions solely because they take arrays - which happens a lot in mllib. Anything that helps limit public APIs is very useful, imho.

@holdenk (Contributor) commented Jan 10, 2017

I don't think the wrappers are public APIs per se, but I agree that reducing the amount of boilerplate Scala code required to expose the ML stuff is good if we can make it robust :)

@BryanCutler (Member Author):

Hey @holdenk, I removed the attempt at type inference, so now the type must be specified explicitly, and I added a docstring showing common examples. Please take another look when you can, thanks!

@holdenk (Contributor) commented Jan 26, 2017

This looks good. How about also adding a test for an empty array, given that it was a consideration in the earlier iteration (not anymore, but good to have as a test in case we forget or someone else comes back to this later)?

@BryanCutler (Member Author):

ya good idea, I'll add that

@SparkQA commented Jan 27, 2017

Test build #72051 has finished for PR 14725 at commit 65dcfb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 27, 2017

Test build #72056 has finished for PR 14725 at commit 869981f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Reviewed diff excerpt:

    @@ -16,6 +16,10 @@
    #

    from abc import ABCMeta, abstractmethod
    import sys
Review comment (Contributor):

are these imports needed anymore?

@BryanCutler (Member Author):

Oh, good catch, thanks! I wasn't using the definition for basestring anymore, but I am still iterating over the array/list with xrange. I was thinking it could be used with a large list, like when passing a vocabulary. I suppose I could use enumerate if you think that's better?
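
For reference, a common Python 2/3 pattern that the `import sys` could support (an assumption for illustration, not a quote from the patch):

```python
import sys

# Python 3 has no xrange, so alias it to range; on Python 2, xrange avoids
# materializing a full index list for large inputs such as a vocabulary.
if sys.version >= "3":
    xrange = range


def copy_into(java_array, pylist):
    # Illustrative copy loop: enumerate(pylist) would be an equivalent
    # alternative to indexing with xrange.
    for i in xrange(len(pylist)):
        java_array[i] = pylist[i]
    return java_array
```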

@BryanCutler (Member Author):

Does this look ok now @holdenk?

Reply (Contributor):

That looks fine, we use xrange similarly elsewhere.

@SparkQA commented Jan 28, 2017

Test build #72092 has finished for PR 14725 at commit 8b401ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Jan 31, 2017

LGTM - thanks for doing this - please ping me on the follow-up PRs with the CountVectorizerModel. Before merging, would you mind updating the PR description of how this was tested to remove the note about the CountVectorizerModel, since it isn't included in this PR?

@BryanCutler changed the title from "[SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays" to "[SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays" on Jan 31, 2017
@BryanCutler (Member Author):

Thanks @holdenk! I updated the description. I'll follow up with CountVectorizerModel after this is merged and ping you.

@asfgit closed this in 57d70d2 on Jan 31, 2017
@BryanCutler (Member Author):

Thanks @holdenk! Would you mind assigning the JIRA to me? https://issues.apache.org/jira/browse/SPARK-17161

@holdenk (Contributor) commented Feb 1, 2017

I'd love to - however my JIRA account still doesn't have those permissions. My plan was to bug people in person next week to get that sorted out and go back and update the JIRAs.

@BryanCutler (Member Author):

Ah, I see. No worries then, I thought maybe you had forgotten.

@holdenk (Contributor) commented Feb 2, 2017

For future reference: Merged into master

cmonkey pushed a commit to cmonkey/spark that referenced this pull request on Feb 15, 2017: [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays

## What changes were proposed in this pull request?

Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor.  The function takes a Java class as input that is used by Py4J to create the Java array of the given class.  As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed.

## How was this patch tested?

Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
@BryanCutler deleted the pyspark-new_java_array-CountVectorizer-SPARK-17161 branch on November 19, 2018 05:46