[SPARK-10931][PYSPARK][ML] PySpark ML Models should contain Param values #14653

Closed
wants to merge 1 commit into from

@evanyc15 evanyc15 commented Aug 15, 2016

What changes were proposed in this pull request?

Changed PySpark models to include the Param values.
Refer to the closed PR #10270 for additional information.

How was this patch tested?

Tested using Python doctests

Changesets:

Estimator UID is being copied correctly to the Transformer model objects and params now, working on Doctests

Changed the way parameters are copied from the Estimator to Transformer

Checkpoint, switching back to inheritance method

Working on DocTests

Implemented Doctests for Recommendation, Clustering, Classification (except RandomForestClassifier), Evaluation, Tuning, Regression (except RandomRegression)

Ready for Code Review

Code Review changeset #1

@evanyc15
Author

CC @MLnick

@@ -59,6 +59,16 @@ class LogisticRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti
... Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF()
>>> lr = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight")
>>> model = lr.fit(df)
>>> emap = lr.extractParamMap()
Contributor

style:
emap -> estimator_paramMap
mmap -> model_paramMap
?

@MechCoder
Contributor

Should we start having

PredictorParams -> (HasLabelCol, HasFeaturesCol, HasPredictionCol)
ClassifierParams -> (HasRawPredictionCol)

as done in the Scala side?
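
For illustration, a minimal sketch of what such grouping might look like in Python. The `Has*` classes here are simplified stand-ins written for this example (not the real PySpark shared-param mixins), and `PredictorParams` / `ClassifierParams` are hypothetical Python counterparts named after the Scala traits.

```python
# Simplified stand-ins for the shared param mixins (not the real PySpark classes).
class HasLabelCol:
    labelCol = "labelCol"


class HasFeaturesCol:
    featuresCol = "featuresCol"


class HasPredictionCol:
    predictionCol = "predictionCol"


class HasRawPredictionCol:
    rawPredictionCol = "rawPredictionCol"


# Hypothetical grouped mixins, named after the Scala traits.
class PredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol):
    pass


class ClassifierParams(PredictorParams, HasRawPredictionCol):
    pass


def shared_param_names(cls):
    """Collect the param names contributed by every Has* mixin in the MRO."""
    names = set()
    for base in cls.__mro__:
        if base.__name__.startswith("Has"):
            # Each stand-in mixin defines exactly one public string attribute.
            names.update(v for k, v in vars(base).items() if not k.startswith("_"))
    return names
```

With this layout, a classifier only needs to list `ClassifierParams` instead of repeating four separate mixins.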

@@ -243,7 +240,7 @@ def __init__(self, java_model=None):
"""
Initialize this instance with a Java model object.
Subclasses should call this constructor, initialize params,
-and then call _transfer_params_from_java.
+and then call _transformer_params.
Contributor

Not sure you intended this change.

@evanyc15
Author

@MechCoder I made the changes for emap -> estimator_paramMap, mmap -> model_paramMap, and (param, value) -> param, value.

@@ -336,6 +336,11 @@ def hasParam(self, paramName):
return isinstance(p, Param)
else:
raise TypeError("hasParam(): paramName must be a string")
try:
Member

I don't think this code is reachable, is this necessary?
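
The reachability concern can be seen in a reduced sketch. The `Param` and `Holder` classes below are stand-ins made up for this example; only the if/else shape is taken from the diff context above. Both branches leave the function (one returns, one raises), so a statement placed after them never runs.

```python
class Param:
    """Stand-in for a param descriptor; only its type matters here."""

    def __init__(self, name):
        self.name = name


class Holder:
    """Hypothetical object exposing one Param attribute and one plain attribute."""

    def __init__(self):
        self.maxIter = Param("maxIter")
        self.notAParam = 42

    def hasParam(self, paramName):
        if isinstance(paramName, str):
            p = getattr(self, paramName, None)
            return isinstance(p, Param)  # exits for every string input
        else:
            raise TypeError("hasParam(): paramName must be a string")  # exits otherwise
        # Anything added at this point is unreachable: both branches above
        # have already returned or raised.
```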

@BryanCutler
Member

Hey @evanyc15 , this is looking pretty good. I had a couple initial comments, but I'll have to look at it more in depth since there's a lot of changes. Mind resolving the conflicts first?

evanyc15 force-pushed the SPARK-10931-pyspark-mllib branch 2 times, most recently from 8d7aedb to e12cbd7 on September 13, 2016 22:20

@BryanCutler
Member

BryanCutler left a comment

@evanyc15 it looks good in general.

I think the doctests you have here would be better suited as unit tests because this isn't an area normal users would care about, and unit tests could reduce some code duplication.

This PR might be easier for others to review if you pick a single estimator/model like LogisticRegression and demonstrate this change once before changing it everywhere.

>>> estimator_paramMap = lr.extractParamMap()
>>> model_paramMap = model.extractParamMap()
>>> all([estimator_paramMap[getattr(lr, param.name)] == value
... for param, value in model_paramMap.items()])
Member

I think this comparison should be the other way around. Here you check that each param in the model is in the estimator, but it should be checking that each param in estimator made it to the model.
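
A small sketch of why the direction matters, using plain dicts keyed by param name as toy stand-ins (real param maps are keyed by `Param` objects, and the helper name `params_missing_from_model` is made up for this example). Iterating over the model's params cannot notice a param that never reached the model; iterating over the estimator's params can.

```python
def params_missing_from_model(estimator_map, model_map):
    """Return estimator param names that are absent from the model or differ in value."""
    problems = []
    for name, value in estimator_map.items():
        if name not in model_map or model_map[name] != value:
            problems.append(name)
    return sorted(problems)


# Toy param maps; weightCol was set on the estimator but not copied to the model.
estimator_map = {"maxIter": 5, "regParam": 0.01, "weightCol": "weight"}
model_map = {"maxIter": 5, "regParam": 0.01}

# Model -> estimator direction passes even though a param was dropped.
model_subset_ok = all(estimator_map.get(k) == v for k, v in model_map.items())

# Estimator -> model direction reports the dropped param.
missing = params_missing_from_model(estimator_map, model_map)
```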

>>> all([param.parent == model.uid for param in model_paramMap])
True
>>> [param.name for param in model.params] # doctest: +NORMALIZE_WHITESPACE
['elasticNetParam', 'featuresCol', 'fitIntercept', 'labelCol', 'maxIter',
Member

I don't think we need this test; it's too brittle and doesn't add much beyond the tests above.

Member

+1 for moving these kinds of tests to unit tests. Here, they make the documentation example confusing.

class LogisticRegressionModel(JavaModel, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
HasRegParam, HasTol, HasProbabilityCol, HasRawPredictionCol,
HasElasticNetParam, HasFitIntercept, HasStandardization,
HasThresholds, JavaMLWritable, JavaMLReadable):
Member

I noticed this is missing a couple shared params like HasWeightCol

@@ -748,8 +785,9 @@ def _create_model(self, java_model):
return RandomForestClassificationModel(java_model)


-class RandomForestClassificationModel(TreeEnsembleModel, JavaClassificationModel, JavaMLWritable,
-JavaMLReadable):
+class RandomForestClassificationModel(TreeEnsembleModel, HasFeaturesCol, HasLabelCol,
Member

I think this needs JavaClassificationModel

@@ -900,8 +947,8 @@ def getLossType(self):
return self.getOrDefault(self.lossType)


-class GBTClassificationModel(TreeEnsembleModel, JavaPredictionModel, JavaMLWritable,
-JavaMLReadable):
+class GBTClassificationModel(TreeEnsembleModel, HasFeaturesCol, HasLabelCol, HasPredictionCol,
Member

I think this needs JavaPredictionModel

@@ -560,6 +586,7 @@ class TreeRegressorParams(Params):
"""

supportedImpurities = ["variance"]
# a placeholder to make it appear in the generated doc
Member

Is this necessary? Maybe just an empty line will do.

Abstraction for Decision Tree models.
class DecisionTreeModel(JavaModel, JavaPredictionModel,
HasFeaturesCol, HasLabelCol, HasPredictionCol):
"""Abstraction for Decision Tree models.
Member

newline after quotes

JavaMLReadable):
class RandomForestRegressionModel(TreeEnsembleModel, JavaPredictionModel, HasFeaturesCol,
HasLabelCol, HasPredictionCol,
JavaMLWritable, JavaMLReadable):
Member

nit: you could probably merge these 2 lines

@@ -1116,7 +1183,7 @@ class AFTSurvivalRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredi
| 1.0| [1.0]| 1.0| 1.0|
| 0.0|(1,[],[])| 0.0| 1.0|
+-----+---------+------+----------+
...
<BLANKLINE>
Member

not sure why this is needed

@@ -200,7 +197,6 @@ def _create_model(self, java_model):
def _fit_java(self, dataset):
"""
Fits a Java model to the input dataset.

Member

These blank lines should stay. Even though these are private functions, I believe that's how it's usually done for Sphinx.

@BryanCutler
Member

@evanyc15 it looks good in general.

I think the doctests you have here would be better suited as unit tests because it's not an area normal users would care about, and it could reduce code duplication.

It might also be easier for others to review this PR if you picked a single estimator/model and demonstrated this change once to get feedback before applying it everywhere.

@BryanCutler
Member

It might also be good to discuss grouping the param mixins, similar to how it's done in Scala, so that both the estimator and model can inherit from a single common trait. This way you could be sure they will contain the same shared params.
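
As a rough sketch of that idea: if the estimator and the model both inherit from one common params class, their param sets match by construction. Everything below is a toy written for this example; `Params` is a stand-in base class and the `Toy*` classes are hypothetical, not the PySpark ones.

```python
class Params:
    """Tiny stand-in base class: params are declared as class-level names."""

    param_names = ()

    @classmethod
    def all_param_names(cls):
        """Gather param names declared anywhere in the class hierarchy."""
        names = []
        for base in reversed(cls.__mro__):
            for name in vars(base).get("param_names", ()):
                if name not in names:
                    names.append(name)
        return sorted(names)


class ToyLogisticRegressionParams(Params):
    # Hypothetical common mixin holding everything both sides should expose.
    param_names = ("featuresCol", "labelCol", "predictionCol",
                   "maxIter", "regParam", "weightCol")


class ToyLogisticRegression(ToyLogisticRegressionParams):
    """Estimator side: inherits the shared params."""


class ToyLogisticRegressionModel(ToyLogisticRegressionParams):
    """Model side: inherits the same shared params, so none can be forgotten."""
```

Adding a param to the common class then updates the estimator and the model together, instead of relying on two hand-maintained mixin lists staying in sync.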

@jkbradley
Member

ok to test

Sorry for the delay on this, but it'd be great to fix now!

@evanyc15
Author

evanyc15 commented Oct 4, 2016

Sounds good Joseph. I'll resolve the conflicts.

evanyc15 force-pushed the SPARK-10931-pyspark-mllib branch 2 times, most recently from fc11247 to e706c7e on October 4, 2016 21:33

@evanyc15
Author

evanyc15 commented Oct 4, 2016

@jkbradley Hey Joseph,

I've resolved the merge conflicts. Can you please test?

@holdenk
Contributor

holdenk commented Oct 7, 2016

Huh I'm not sure why jenkins isn't picking this up - @jkbradley or @davidnavas can you tell jenkins this is ok to test again?

@davidnavas
Contributor

@holdenk happy to help if I can, is there something a mere mortal like myself can accomplish? [dunno how to poke jenkins hereabouts]

@SparkQA

SparkQA commented Oct 12, 2016

Test build #3328 has finished for PR 14653 at commit e706c7e.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class LogisticRegressionModel(JavaModel, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class DecisionTreeClassificationModel(DecisionTreeModel, JavaClassificationModel, HasFeaturesCol,
    • class RandomForestClassificationModel(TreeEnsembleModel, HasFeaturesCol, HasLabelCol,
    • class GBTClassificationModel(TreeEnsembleModel, HasFeaturesCol, HasLabelCol, HasPredictionCol,
    • class NaiveBayesModel(JavaModel, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasProbabilityCol,
    • class MultilayerPerceptronClassificationModel(JavaModel, HasFeaturesCol, HasLabelCol,
    • class GaussianMixtureModel(JavaModel, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasTol, HasSeed,
    • class KMeansModel(JavaModel, JavaMLWritable, JavaMLReadable, HasFeaturesCol,
    • class BisectingKMeansModel(JavaModel, HasFeaturesCol, HasPredictionCol, HasMaxIter,
    • class LDAModel(JavaModel, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval):
    • class CountVectorizerModel(JavaModel, HasInputCol, HasOutputCol,
    • class IDFModel(JavaModel, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class MaxAbsScalerModel(JavaModel, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class MinMaxScalerModel(JavaModel, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class StandardScalerModel(JavaModel, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class StringIndexerModel(JavaModel, HasInputCol, HasOutputCol, HasHandleInvalid,
    • class VectorIndexerModel(JavaModel, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class Word2VecModel(JavaModel, HasStepSize, HasMaxIter, HasSeed, HasInputCol,
    • class PCAModel(JavaModel, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):
    • class RFormulaModel(JavaModel, HasFeaturesCol, HasLabelCol, JavaMLReadable, JavaMLWritable):
    • class ChiSqSelectorModel(JavaModel, HasFeaturesCol, HasOutputCol, HasLabelCol,
    • class ALS(JavaEstimator, HasCheckpointInterval, HasMaxIter, HasPredictionCol,
    • class ALSModel(JavaModel, HasPredictionCol, JavaMLWritable, JavaMLReadable):
    • class LinearRegressionModel(JavaModel, JavaPredictionModel, HasFeaturesCol, HasLabelCol,
    • class IsotonicRegressionModel(JavaModel, JavaMLWritable, JavaMLReadable,
    • class DecisionTreeModel(JavaModel, JavaPredictionModel,
    • class RandomForestRegressionModel(TreeEnsembleModel, JavaPredictionModel, HasFeaturesCol,
    • class GBTRegressionModel(TreeEnsembleModel, JavaPredictionModel,
    • class AFTSurvivalRegressionModel(JavaModel, HasFeaturesCol, HasLabelCol,
    • class GeneralizedLinearRegressionModel(JavaModel, JavaPredictionModel, HasLabelCol, HasFeaturesCol,

@holdenk
Contributor

holdenk commented Oct 12, 2016

Looks like Jenkins has picked it up. Maybe @evanyc15 can merge in master (or rebase on master) so Jenkins re-runs and verifies the tests?

@evanyc15
Author

@holdenk I just rebased and pushed again. Hopefully Jenkins passes this time.

@evanyc15
Author

@davidnavas
Hi David,
I have rebased and pushed again. Could you tell Jenkins to re-test the PR?
Thank you

@davidnavas
Contributor

Sadly, I don't have that superpower :( Leastwise not that I know. Perhaps Holden's appeal to @jkbradley was what worked last time?

@MLnick
Contributor

MLnick commented Oct 14, 2016

jenkins test this please

@SparkQA

SparkQA commented Oct 14, 2016

Test build #66953 has finished for PR 14653 at commit eb7ca31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the same public classes (experimental) as listed in the report for Test build #3328 above.

@evanyc15
Author

@MLnick @jkbradley Do you mind merging the PR? Thank you

@jkbradley
Member

@evanyc15 OK back for real now...sorry for the delay. @BryanCutler has a lot of good comments. Could you please address them? Regarding splitting this up into multiple PRs, I strongly +1 that in general, though I'm OK if you want to do this as a batch.

I'll test this PR out now...

@jkbradley
Member

jkbradley commented Oct 25, 2016

Having some trouble b/c the doc build was apparently broken 19 days ago. Looking into a fix now.

~ Doc build is being fixed...

@jkbradley
Member

One good set of unit tests might emulate ParamsSuite.checkParams from Scala. That tests several things which should be uniform across all Params subclasses.

@evanyc15
Author

Hey @jkbradley, the checkParams method already exists on the Python side. It's defined in the DefaultValuesTests class in tests.py and is called by test_java_params. I'm removing the param testing from the Python doctests now and will implement the unit test in one of the classes first. Once approved, I will implement it in the remaining classes.
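
A rough sketch of what a uniform param check in a unit test could look like. The three invariants below are assumptions chosen for illustration; this is not a port of the Scala `ParamsSuite.checkParams` or of the existing Python helper, and `Param`, `ToyModel`, `check_params`, and `run_checks` are all stand-ins invented for this example.

```python
import unittest


class Param:
    """Stand-in param: a name plus the uid of the object that owns it."""

    def __init__(self, parent_uid, name):
        self.parent = parent_uid
        self.name = name


class ToyModel:
    """Hypothetical Params-like object used only to exercise the checks."""

    def __init__(self, uid, names):
        self.uid = uid
        self.params = [Param(uid, n) for n in sorted(names)]

    def getParam(self, name):
        return next(p for p in self.params if p.name == name)


def check_params(test_case, obj):
    """Assert a few invariants that should hold for any Params-like object."""
    names = [p.name for p in obj.params]
    test_case.assertEqual(names, sorted(names))      # params are ordered by name
    for p in obj.params:
        test_case.assertEqual(p.parent, obj.uid)     # each param is owned by obj
        test_case.assertIs(obj.getParam(p.name), p)  # lookup by name round-trips


class ParamInvariantsTest(unittest.TestCase):
    def test_toy_model(self):
        check_params(self, ToyModel("ToyModel_1", ["maxIter", "featuresCol"]))


def run_checks():
    """Run the test case programmatically and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParamInvariantsTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

Because the helper takes the object as an argument, the same checks can be reused for each estimator and model rather than repeated in every doctest.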

Copied parameters over from Estimator to Transformer

@AmplabJenkins

Can one of the admins verify this patch?

@BryanCutler
Member

@evanyc15 would you mind closing this PR? Thanks!
