[SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays #14725

BryanCutler · 2016-08-19T23:33:38Z

What changes were proposed in this pull request?

Adds a convenience function to the Python JavaWrapper that makes it easy to create a Py4J JavaArray compatible with existing class constructors that take a Scala Array as input, so a separate Java/Python-friendly constructor is no longer necessary. The function takes a Java class as input, which Py4J uses to create a Java array of that class. As an example, OneVsRest has been updated to use this, and its alternate constructor has been removed.

How was this patch tested?

Added unit tests for the new convenience function and updated OneVsRest doctests which use this to persist the model.

BryanCutler · 2016-08-19T23:35:02Z

@yanboliang, @jkbradley what do you think of this proposal?

holdenk · 2016-08-19T23:51:12Z

Maybe it would be useful to see in the PR where this is going to simplify the current PySpark ML code? (It need not be every model right away, since a later design change could then involve a large rewrite.)

SparkQA · 2016-08-19T23:57:32Z

Test build #64111 has finished for PR 14725 at commit f9672bf.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-23T19:13:52Z

Test build #64304 has finished for PR 14725 at commit b9da983.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2016-08-23T19:15:30Z

Thanks @holdenk. I have an example usage for primitive arrays in the PR description; let me know if that is not clear enough to show how this change is useful. I also added a usage for OneVsRestModel that uses this to build an array of classification models and gets rid of the need for a Java/Python-friendly constructor.

SparkQA · 2016-08-23T19:45:31Z

Test build #64305 has finished for PR 14725 at commit 32fd5bd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-23T20:31:28Z

Test build #64306 has finished for PR 14725 at commit 4957e0c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2016-09-14T00:01:47Z

ping @yanboliang, mind taking a look? I'd like to have this to create a CountVectorizerModel from a vocabulary list from SPARK-15009, thanks!

BryanCutler · 2016-09-28T17:12:30Z

ping @jkbradley @yanboliang

BryanCutler · 2017-01-04T01:07:29Z

ping @jkbradley @yanboliang @MLnick. This seems to have gone stale, but I think it would be great to get this in so we can add things like a CountVectorizer constructor from a vocab list to PySpark. If any of you can take a look, that would be much appreciated!

SparkQA · 2017-01-04T01:19:36Z

Test build #70846 has finished for PR 14725 at commit ba63530.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-04T18:05:24Z

Test build #70885 has finished for PR 14725 at commit 834c641.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2017-01-04T18:11:01Z

Jenkins, retest this please

SparkQA · 2017-01-04T20:41:28Z

Test build #70887 has finished for PR 14725 at commit 834c641.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

I like the idea in principle, but I'm not so sure about the inference option - it seems like we might just be better off specifying our own types.

That being said, I think the base helper function for copying the Python stuff into the array could be useful and help reduce the number of explicit Python constructors we have to build. But we should really get a Python committer's opinion on this.

holdenk · 2017-01-04T23:59:15Z

python/pyspark/ml/wrapper.py

+    @staticmethod
+    def _new_java_primitive_array(pylist):
+        if not pylist:
+            raise ValueError("Unable to convert an empty list to Java array")


So I understand we can't look at the types on an empty list, but using this seems like a potential headache that might not be worth it - in at least some of the places where we take an array we will want to take an empty array.

Also I foresee this possibly leading to all kinds of fun problems with numeric types :(

What if we kept _new_java_array and just required that our code specify the type? It should still be simpler when porting new algorithms to Python but give us more flexibility.
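To illustrate the two concerns raised here, the following is a hypothetical first-element inference helper. It is not the code from this patch: `infer_java_type_name` and its type table are assumptions made for the sketch. It shows that an empty list gives nothing to inspect, and that a list of mixed numerics yields a guess that does not fit every element.

```python
def infer_java_type_name(pylist):
    """Guess a Java element type from the first item (hypothetical helper)."""
    if not pylist:
        # Nothing to inspect, so an empty list cannot be converted.
        raise ValueError("Unable to convert an empty list to Java array")
    first = pylist[0]
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(first, bool):
        return "boolean"
    if isinstance(first, int):
        return "int"
    if isinstance(first, float):
        return "double"
    if isinstance(first, str):
        return "java.lang.String"
    raise TypeError("No Java type known for %r" % type(first))


# Mixed numerics: the guess is "int", but 2.5 does not fit an int array.
mixed_guess = infer_java_type_name([1, 2.5])

# Empty input: no guess is possible at all.
try:
    infer_java_type_name([])
    empty_failed = False
except ValueError:
    empty_failed = True
```

Requiring the caller to name the element type avoids both cases, at the cost of one extra argument.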

holdenk · 2017-01-05T00:02:27Z

cc @davies perhaps - what are your thoughts?

BryanCutler · 2017-01-05T05:55:50Z

Thanks @holdenk for taking a look! Yeah, I think you're right about the issues with trying to infer a type. It would be nice if there were some easy way to specify a primitive type, since that would cover most of the cases in mllib, but it's not that big of a deal. If anything, it's probably just a little obscure to a dev who's not familiar with py4j in Spark that, for a string, the Java class should be sc._gateway.jvm.java.lang.String, for example.
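As a hedged illustration of the point about `sc._gateway.jvm.java.lang.String`: Py4J exposes JVM classes as nested attributes on a JVM view, so a dotted class name corresponds to a chain of attribute lookups. `resolve_java_class` and `fake_jvm` below are hypothetical names for this sketch and are not part of PySpark or Py4J; `fake_jvm` stands in for `sc._gateway.jvm` so the block runs without a JVM.

```python
from types import SimpleNamespace


def resolve_java_class(jvm, dotted_name):
    """Walk attribute by attribute from a JVM view to a class object."""
    target = jvm
    for part in dotted_name.split("."):
        target = getattr(target, part)
    return target


# Hypothetical stand-in for sc._gateway.jvm so the sketch runs without a JVM.
fake_string_class = object()
fake_jvm = SimpleNamespace(
    java=SimpleNamespace(lang=SimpleNamespace(String=fake_string_class))
)

# Equivalent to writing fake_jvm.java.lang.String directly.
resolved = resolve_java_class(fake_jvm, "java.lang.String")
```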

holdenk · 2017-01-05T07:43:50Z

Maybe it could be cleared up a bit with a good docstring? Although if the result is too confusing to be used then it's probably not worth doing.

BryanCutler · 2017-01-10T04:30:04Z

Sure, I can add a better docstring. This is just for developers and doesn't have to be used, but it can be used to avoid creating more Java-friendly functions only because they have arrays - which happens a lot in mllib. Anything to help limit public APIs is very useful, imho.

holdenk · 2017-01-10T05:16:35Z

I don't think the wrappers are public APIs per se, but I agree reducing the amount of boilerplate Scala code required to expose the ML stuff is good if we can make it robust :)

BryanCutler · 2017-01-26T23:04:11Z

Hey @holdenk, I removed the attempt at type inference, so the type must now be specified explicitly, and added a docstring showing common examples. Please take another look when you can, thanks!

holdenk · 2017-01-26T23:13:54Z

This looks good. How about also adding a test for an empty array, given that it was a consideration in the earlier iteration (not anymore, but good to have as a test in case we forget or someone else comes back to this later)?

BryanCutler · 2017-01-26T23:15:16Z

ya good idea, I'll add that

SparkQA · 2017-01-27T01:19:58Z

Test build #72051 has finished for PR 14725 at commit 65dcfb6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-27T01:59:15Z

Test build #72056 has finished for PR 14725 at commit 869981f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2017-01-27T08:09:37Z

python/pyspark/ml/wrapper.py

@@ -16,6 +16,10 @@
 #

 from abc import ABCMeta, abstractmethod
+import sys


Are these imports needed anymore?

BryanCutler · 2017-01-27

Oh, good catch, thanks! I wasn't using the def for basestring, but I am still iterating the array/list with xrange. I was thinking it would be possible for someone to use a large list, like when using a vocabulary. I suppose I could use enumerate if you think that's better?

BryanCutler · 2017-01-31

Does this look ok now @holdenk?

holdenk · 2017-01-31

That looks fine, we use xrange similarly elsewhere.
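For readers following the xrange question, here is a small sketch of the two iteration styles mentioned in this thread. The version check is an assumption about how a Python 3 compatibility shim might look, and `fill_by_index` and `fill_by_enumerate` are hypothetical names; plain lists stand in for the Java array.

```python
import sys

# Python 3 has no xrange; its range is already lazy. This shim is an
# assumption about how the compatibility fix might look.
if sys.version_info[0] >= 3:
    xrange = range


def fill_by_index(target, pylist):
    """Copy with an index loop over xrange, as discussed in the thread."""
    for i in xrange(len(pylist)):
        target[i] = pylist[i]
    return target


def fill_by_enumerate(target, pylist):
    """Same copy using enumerate, which also avoids building an index list."""
    for i, item in enumerate(pylist):
        target[i] = item
    return target


vocab = ["spark", "py4j", "array"]
by_index = fill_by_index([None] * len(vocab), vocab)
by_enum = fill_by_enumerate([None] * len(vocab), vocab)
```

Both loops are lazy over the indices, so either should behave similarly for a large vocabulary list.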

SparkQA · 2017-01-28T02:34:59Z

Test build #72092 has finished for PR 14725 at commit 8b401ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2017-01-31T22:00:52Z

LGTM - thanks for doing this - please ping me on the follow-up PRs with the CountVectorizerModel. Before merging, would you mind updating the "How was this patch tested?" section of the PR description to remove the note about the CountVectorizerModel, since it isn't included in this PR?

BryanCutler · 2017-01-31T23:18:00Z

Thanks @holdenk! I updated the description. I'll follow up with CountVectorizerModel after this is merged and ping you.

BryanCutler · 2017-02-01T17:33:03Z

Thanks @holdenk! Would you mind assigning the JIRA to me? https://issues.apache.org/jira/browse/SPARK-17161

holdenk · 2017-02-01T17:42:25Z

I'd love to - however my JIRA account still doesn't have those permissions. My plan was to bug people in person next week to get that sorted out and go back and update the JIRAs.

BryanCutler · 2017-02-01T17:45:58Z

Ah, I see. No worries then, I thought maybe you had forgotten.

holdenk · 2017-02-02T08:41:14Z

For future reference: Merged into master

…ction to create Py4J JavaArrays

## What changes were proposed in this pull request?

Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor. The function takes a Java class as input that is used by Py4J to create the Java array of the given class. As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed.

## How was this patch tested?

Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.

BryanCutler added 5 commits July 14, 2016 10:16

testing out _new_java_array

2a8de60

Merge remote-tracking branch 'upstream/master' into wip-pyspark-new_java_array-CountVectorizer

97bff07

undo changes to CountVectorizerModel used for testing

4766cdc

added convienience functions to JavaWrapper to create py4j JavaArray

1c0ddb9

fixed style checks and tests

f9672bf

BryanCutler added 2 commits August 23, 2016 12:10

added python3 compatibility for xrange and basestring

bbf7f58

added usage example for OneVsRestModel constructor

b9da983

BryanCutler added 2 commits August 23, 2016 12:19

fixed python style error

32fd5bd

can remove OneVsRest Java/Python friendly constructor now

4957e0c

Merge remote-tracking branch 'upstream/master' into pyspark-new_java_array-CountVectorizer-SPARK-17161

ba63530

added MiMa exclusion for OneVsRest py-friendly constructor

834c641

holdenk reviewed Jan 5, 2017

BryanCutler added 2 commits January 26, 2017 13:50

Merge remote-tracking branch 'upstream/master' into pyspark-new_java_array-CountVectorizer-SPARK-17161

9321af8

removed function that tried to infer type, better to just specify explicitly

65dcfb6

added test for empty lists and little cleanup

869981f

holdenk reviewed Jan 27, 2017

removed unused def

8b401ec

BryanCutler changed the title ~~[SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays~~ [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays Jan 31, 2017

asfgit closed this in 57d70d2 Jan 31, 2017

BryanCutler deleted the pyspark-new_java_array-CountVectorizer-SPARK-17161 branch November 19, 2018 05:46
