[SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer #10085

holdenk · 2015-12-02T01:22:30Z

Add Python API for ml.feature.QuantileDiscretizer.

One open question: should we reuse the Java model, create a new model, or use a different wrapper around the Java model?
cc @brkyvz & @mengxr

SparkQA · 2015-12-02T01:57:41Z

Test build #47026 has finished for PR 10085 at commit 2540101.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):

SparkQA · 2015-12-02T02:44:34Z

Test build #47030 has finished for PR 10085 at commit 1145ec4.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):

SparkQA · 2015-12-02T03:13:48Z

Test build #47031 has finished for PR 10085 at commit 2afd197.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):

holdenk · 2015-12-02T14:05:11Z

cc @yanboliang who filed the JIRA for this.

jkbradley · 2015-12-07T20:51:32Z

Can you please only link to the specific JIRA, not the umbrella?

yinxusen · 2015-12-08T03:29:37Z

Hi @holdenk, I think this PR duplicates mine: #10007

yinxusen · 2015-12-08T03:40:39Z

OK, I had not realized that there is an umbrella JIRA for this. I'll close mine.

yinxusen · 2015-12-08T04:43:21Z

@holdenk Regarding your question: I first tried to modify the interface of Bucketizer, making it a JavaModel rather than a JavaTransformer. But I finally decided not to touch the Bucketizer, and added an inner class, QuantileDiscretizerModel, to get the splits.

~~But I recommend testing the getSplits of the Bucketizer generated by the QuantileDiscretizer, since I got a serialization error and added a getJavaSplits to avoid it. JIRA issue here.~~

yanboliang · 2015-12-09T09:47:04Z

I vote for making Bucketizer a Model rather than a Transformer, which is consistent with the Scala code. @yinxusen Could you let us know why you gave up on this proposal?

yinxusen · 2015-12-09T10:08:45Z

@yanboliang I am OK with changing Bucketizer to a JavaModel. At the time I just did not want to change that piece of code. That's also why I closed my PR: I think @holdenk's implementation is better. ~~But like I said, be careful with getSplits. :)~~ Crossed out, since this implementation avoids the serialization problem.
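For readers following the design question above, here is a minimal plain-Python sketch of the "estimator returns a Bucketizer as its model" shape that the thread converges on. It is not the PySpark code from this PR: the two classes are stand-ins that only borrow the names `Bucketizer`, `QuantileDiscretizer`, `getSplits` and `numBuckets` from the discussion, and the rank-based choice of split points is an assumption made for illustration (Spark may pick different splits for the same data).

```python
import bisect
from typing import List, Sequence


class Bucketizer:
    """Plain-Python stand-in for a fitted model that maps values to bucket indices."""

    def __init__(self, splits: Sequence[float]) -> None:
        self._splits = list(splits)

    def getSplits(self) -> List[float]:
        # Return a copy so callers cannot mutate the model's state.
        return list(self._splits)

    def transform(self, values: Sequence[float]) -> List[float]:
        # Bucket i covers the half-open range [splits[i], splits[i + 1]);
        # the index is clamped so +inf still falls in the last bucket.
        last = len(self._splits) - 2
        return [
            float(min(bisect.bisect_right(self._splits, v) - 1, last))
            for v in values
        ]


class QuantileDiscretizer:
    """Stand-in estimator: fit() learns the splits and hands back a Bucketizer."""

    def __init__(self, numBuckets: int = 2) -> None:
        if numBuckets < 2:
            raise ValueError("numBuckets must be >= 2")
        self.numBuckets = numBuckets

    def fit(self, values: Sequence[float]) -> Bucketizer:
        ordered = sorted(values)
        if not ordered:
            raise ValueError("cannot fit on empty data")
        n = len(ordered)
        # Assumption of this sketch: exact cut points at evenly spaced ranks.
        # Duplicate cut points are dropped, so fewer buckets may result.
        interior = sorted(
            {ordered[(i * n) // self.numBuckets] for i in range(1, self.numBuckets)}
        )
        return Bucketizer([float("-inf")] + interior + [float("inf")])


data = [0.1, 0.4, 1.2, 1.5]
model = QuantileDiscretizer(numBuckets=2).fit(data)
splits = model.getSplits()
buckets = model.transform(data)
```

The alternative mentioned earlier in the thread, a separate `QuantileDiscretizerModel` that wraps a Bucketizer, would change only the return type of `fit`; the trade-off discussed here is whether Bucketizer itself should be treated as a model.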

holdenk · 2015-12-09T21:08:21Z

OK, just to make sure: do you see any issues with the current approach for getSplits? It's tested a bit in the doctests, but if there is a potential issue I can add some more tests.
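As background on testing float-valued splits in doctests: doctest compares printed text, so a raw float can fail on representation noise alone. The sketch below shows one common way around that, rounding and formatting to a fixed number of digits before printing. The helper name `format_split` is hypothetical and is not part of this PR or of PySpark.

```python
import doctest


def format_split(value: float) -> str:
    """Format a split to one decimal place so doctest output stays stable.

    >>> 0.1 + 0.2 == 0.3
    False
    >>> format_split(0.1 + 0.2)
    '0.3'
    """
    # Round first, then format, so the printed text has a fixed width.
    return "%2.1f" % round(value, 1)


if __name__ == "__main__":
    # Runs the examples in the docstring above when executed as a script.
    doctest.testmod()
```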

yinxusen · 2015-12-10T00:06:39Z

@holdenk No more issues with getSplits. It looks good.

holdenk · 2015-12-10T00:29:39Z

@yinxusen thanks :)

holdenk · 2015-12-14T18:59:25Z

cc @yanboliang if you have a chance to take a look

holdenk · 2015-12-30T18:38:13Z

re-ping @yanboliang or @jkbradley if you've got the time to look at this (already been reviewed a bit).

SparkQA · 2015-12-30T19:02:23Z

Test build #48495 has finished for PR 10085 at commit 601a9ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-11T22:24:29Z

Test build #49177 has finished for PR 10085 at commit 798798c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-01-13T23:42:16Z

re-ping @jkbradley ?

jkbradley · 2016-01-14T01:20:38Z

I'll try to check this soon, but have some others first. It will be great if someone else can review this PR in the meantime. @yinxusen Would you have time? Thanks!

yinxusen · 2016-01-14T03:21:01Z

@jkbradley I'll help review this.

yinxusen · 2016-01-15T08:28:20Z

python/pyspark/ml/feature.py

+    # a placeholder to make it appear in the generated doc
+    numBuckets = Param(Params._dummy(), "numBuckets",
+                       "Maximum number of buckets (quantiles, or " +
+                       "categories) into which data points are grouped. Must be >= 2.")


Should we add a default of 2 here?

yinxusen · 2016-01-15T08:29:18Z

python/pyspark/ml/feature.py

+    >>> bucketed[0].buckets
+    0.0
+
+    .. versionadded:: 1.6.0


change it to 2.0.0.

yinxusen · 2016-01-15T08:49:13Z

@jkbradley LGTM except for the version labels.

holdenk · 2016-01-16T17:41:18Z

@yinxusen /@jkbradley updated the version added tag to 2.0.0 :)

SparkQA · 2016-01-16T18:06:35Z

Test build #49534 has finished for PR 10085 at commit 5e18778.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2016-01-18T05:29:07Z

LGTM as well! Thanks.

jkbradley · 2016-01-19T18:16:28Z

python/pyspark/ml/feature.py

@@ -135,9 +135,9 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol):
              "specified will be treated as errors.")

    @keyword_only
-    def __init__(self, splits=None, inputCol=None, outputCol=None):
+    def __init__(self, splits=None, inputCol=None, outputCol=None, _java_model=None):


Why is _java_model needed? It does not seem to be used.

holdenk · 2016-01-19

Oh yeah, I think the original plan was to avoid the overhead of creating a new object and sending the params back to the JVM when `_java_model` is supplied, since we would already have a transformer. I'll remove this.

jkbradley · 2016-01-19T18:16:44Z

Those are the only issues I see. Thanks everyone for reviewing & @holdenk for the PR!

SparkQA · 2016-01-19T19:42:52Z

Test build #49694 has finished for PR 10085 at commit f21ebef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-19T23:29:55Z

Test build #49719 has finished for PR 10085 at commit f9e3086.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-20T00:10:56Z

Test build #49726 has finished for PR 10085 at commit 194ec6d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-01-21T17:26:16Z

Think I addressed all of @jkbradley's comments

SparkQA · 2016-01-25T23:27:15Z

Test build #50015 has finished for PR 10085 at commit 463aa37.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-01-25T23:36:23Z

seems unrelated, jenkins retest this please.

SparkQA · 2016-01-25T23:59:06Z

Test build #50037 has finished for PR 10085 at commit 463aa37.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-01-26T06:37:00Z

LGTM
Merging with master
Thanks for the PR!

holdenk added 4 commits December 1, 2015 14:03

Start working towards implementing python interface for quantilediscr…

dbabade

…ectizer. One question (for review) is do we want to change the bucketizer as I've done or create a different wrapper? I think this way is better but it does introduce an extra param so no sure

Ok remove _java_model before setting the params since it isn't really…

1cacd76

… a param, print out the splits from the trained bucketizer

And make sure the generated model works

cfb255f

pep8 style fix

2540101

Floating point funtimes with doctest

1145ec4

chop of extra

2afd197

holdenk changed the title ~~[SPARK-11937][SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer~~ [SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer Dec 8, 2015

Merge branch 'master' into SPARK-11937-SPARK-11922-Python-API-for-ml.…

601a9ea

…feature.QuantileDiscretizer

Merge branch 'master' into SPARK-11937-SPARK-11922-Python-API-for-ml.…

798798c

…feature.QuantileDiscretizer

yinxusen reviewed Jan 15, 2016

yinxusen mentioned this pull request Jan 15, 2016

[SPARK-11923][ML] Python API for ml.feature.ChiSqSelector #10186

Closed

holdenk added 3 commits January 16, 2016 09:37

Merge branch 'master' into SPARK-11937-SPARK-11922-Python-API-for-ml.…

b44c74d

…feature.QuantileDiscretizer

Add default 2 to the parameter name

27a4098

Updated since tags to 2.0.0 since we didn't make the 1.6 cut

5e18778

jkbradley reviewed Jan 19, 2016

CR feedback

f21ebef

holdenk added 2 commits January 19, 2016 13:08

Merge branch 'master' into SPARK-11937-SPARK-11922-Python-API-for-ml.…

d90339a

…feature.QuantileDiscretizer

Round to one digit

f9e3086

consistent formatting for the printed split

194ec6d

Merge branch 'master' into SPARK-11937-SPARK-11922-Python-API-for-ml.…

463aa37

…feature.QuantileDiscretizer

asfgit closed this in b66afde Jan 26, 2016
