Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer #10085

Conversation

holdenk
Copy link
Contributor

@holdenk holdenk commented Dec 2, 2015

Add Python API for ml.feature.QuantileDiscretizer.

One open question: Do we want to do this stuff to re-use the java model, create a new model, or use a different wrapper around the java model.
cc @brkyvz & @mengxr

…ectizer. One question (for review) is do we want to change the bucketizer as I've done or create a different wrapper? I think this way is better but it does introduce an extra param so no sure
… a param, print out the splits from the trained bucketizer
@SparkQA
Copy link

SparkQA commented Dec 2, 2015

Test build #47026 has finished for PR 10085 at commit 2540101.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):\n

@SparkQA
Copy link

SparkQA commented Dec 2, 2015

Test build #47030 has finished for PR 10085 at commit 1145ec4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):\n

@SparkQA
Copy link

SparkQA commented Dec 2, 2015

Test build #47031 has finished for PR 10085 at commit 2afd197.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol):\n

@holdenk
Copy link
Contributor Author

holdenk commented Dec 2, 2015

cc @yanboliang who filed the JIRA for this.

@jkbradley
Copy link
Member

Can you please only link to the specific JIRA, not the umbrella?

@yinxusen
Copy link
Contributor

yinxusen commented Dec 8, 2015

Hi @holdenk, I think the PR is duplicated with mine: #10007

@yinxusen
Copy link
Contributor

yinxusen commented Dec 8, 2015

OK, I was not realized that there is an umbrella JIRA for this. I'll close mine.

@yinxusen
Copy link
Contributor

yinxusen commented Dec 8, 2015

@holdenk For your questions, I first tried to modify the interface of Bucketizer, making it to a JavaModel other than a JavaTransfomer. But I finally decided not to touch the Bucketizer, and added a inner class of QuantileDiscretizerModel to get the splits.

But I recommend to test the getSplits of Bucketizer that generating from the QuantileDiscretizer, since I got a serialization error, and I added a getJavaSplits to avoide it. JIRA issue here.

@holdenk holdenk changed the title [SPARK-11937][SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer [SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer Dec 8, 2015
@yanboliang
Copy link
Contributor

I vote for making Bucketizer to a Model rather than Transformer which is consistent with Scala code. @yinxusen Could you let us know the reason that you give up this proposal?

@yinxusen
Copy link
Contributor

yinxusen commented Dec 9, 2015

@yanboliang I am OK to change Bucketizer to a JavaModel. At that time I just do not want to change that piece of code. That's also why I closed my PR because I think @holdenk's implementation is better. But like what I said, be careful with getSplits. :) Cross it since this implementation avoid the serialization problem.

@holdenk
Copy link
Contributor Author

holdenk commented Dec 9, 2015

Ok - just to make sure do you see any issues with the current approach for getSplits? Its tested a bit in the doctests but if there is a potential issue I can add some more tests.

@yinxusen
Copy link
Contributor

@holdenk No more issue in getSplits. It looks good.

@holdenk
Copy link
Contributor Author

holdenk commented Dec 10, 2015

@yinxusen thanks :)

@holdenk
Copy link
Contributor Author

holdenk commented Dec 14, 2015

cc @yanboliang if you have a chance to take a look

@holdenk
Copy link
Contributor Author

holdenk commented Dec 30, 2015

re-ping @yanboliang or @jkbradley if you've got the time to look at this (already been reviewed a bit).

@SparkQA
Copy link

SparkQA commented Dec 30, 2015

Test build #48495 has finished for PR 10085 at commit 601a9ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2016

Test build #49177 has finished for PR 10085 at commit 798798c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor Author

holdenk commented Jan 13, 2016

re-ping @jkbradley ?

@jkbradley
Copy link
Member

I'll try to check this soon, but have some others first. It will be great if someone else can review this PR in the meantime. @yinxusen Would you have time? Thanks!

@yinxusen
Copy link
Contributor

@jkbradley I'll help you reviewing this.

# a placeholder to make it appear in the generated doc
numBuckets = Param(Params._dummy(), "numBuckets",
"Maximum number of buckets (quantiles, or " +
"categories) into which data points are grouped. Must be >= 2.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a default 2 here?

>>> bucketed[0].buckets
0.0

.. versionadded:: 1.6.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change it to 2.0.0.

@yinxusen
Copy link
Contributor

@jkbradley LGTM except for the version labels.

@holdenk
Copy link
Contributor Author

holdenk commented Jan 16, 2016

@yinxusen /@jkbradley updated the version added tag to 2.0.0 :)

@SparkQA
Copy link

SparkQA commented Jan 16, 2016

Test build #49534 has finished for PR 10085 at commit 5e18778.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Copy link
Member

dbtsai commented Jan 18, 2016

LGTM as well! Thanks.

@@ -135,9 +135,9 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol):
"specified will be treated as errors.")

@keyword_only
def __init__(self, splits=None, inputCol=None, outputCol=None):
def __init__(self, splits=None, inputCol=None, outputCol=None, _java_model=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is _java_model needed? It does not seem to be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yah, I think the original plan was to avoid the overhead of object creation and sending the params back to the JVM if it is supplied since we already had a transformer. I'll remove this.

@jkbradley
Copy link
Member

Those are the only issues I see. Thanks everyone for reviewing & @holdenk for the PR!

@SparkQA
Copy link

SparkQA commented Jan 19, 2016

Test build #49694 has finished for PR 10085 at commit f21ebef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 19, 2016

Test build #49719 has finished for PR 10085 at commit f9e3086.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 20, 2016

Test build #49726 has finished for PR 10085 at commit 194ec6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor Author

holdenk commented Jan 21, 2016

Think I addressed all of @jkbradley's comments

@SparkQA
Copy link

SparkQA commented Jan 25, 2016

Test build #50015 has finished for PR 10085 at commit 463aa37.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor Author

holdenk commented Jan 25, 2016

seems unrelated, jenkins retest this please.

@SparkQA
Copy link

SparkQA commented Jan 25, 2016

Test build #50037 has finished for PR 10085 at commit 463aa37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
Merging with master
Thanks for the PR!

@asfgit asfgit closed this in b66afde Jan 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants