Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5956][MLLIB] Pipeline components should be copyable. #5820

Closed
wants to merge 19 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented Apr 30, 2015

This PR added copy(extra: ParamMap): Params to Params, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement fit and transform without extra param values given the default implementation of fit(dataset, extra):

def fit(dataset: DataFrame, extra: ParamMap): Model = {
  copy(extra).fit(dataset)
}

Inside fit and transform, since only the embedded values are used, I added $ as an alias for getOrDefault to make the code easier to read. For example, in LinearRegression.fit we have:

val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam

Meta-algorithm like Pipeline implements its own copy(extra). So the fitted pipeline model stored all copied stages (no matter whether it is a transformer or a model).

Other changes:

  • Params$.inheritValues is moved to Params!.copyValues and returns the target instance.
  • fittingParamMap was removed because the parent carries this information.
  • validate was renamed to validateParams to be more precise.

TODOs:

  • add tests for newly added methods
  • update documentation

@jkbradley @dbtsai

@SparkQA
Copy link

SparkQA commented Apr 30, 2015

Test build #31451 has finished for PR 5820 at commit b642872.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Evaluator extends Params
    • abstract class PipelineStage extends Params with Logging
    • class BinaryClassificationEvaluator extends Evaluator with HasRawPredictionCol with HasLabelCol
  • This patch does not change any dependencies.

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31537 has finished for PR 5820 at commit 819dd2d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Evaluator extends Params
    • abstract class PipelineStage extends Params with Logging
    • class BinaryClassificationEvaluator extends Evaluator with HasRawPredictionCol with HasLabelCol

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31543 has finished for PR 5820 at commit 282a1a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Evaluator extends Params
    • abstract class PipelineStage extends Params with Logging
    • class BinaryClassificationEvaluator extends Evaluator with HasRawPredictionCol with HasLabelCol

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31600 has finished for PR 5820 at commit 93e7924.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Evaluator extends Params
    • abstract class PipelineStage extends Params with Logging
    • class BinaryClassificationEvaluator extends Evaluator with HasRawPredictionCol with HasLabelCol

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31594 has finished for PR 5820 at commit 463ecae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Evaluator extends Params
    • abstract class PipelineStage extends Params with Logging
    • class BinaryClassificationEvaluator extends Evaluator with HasRawPredictionCol with HasLabelCol

@dbtsai
Copy link
Member

dbtsai commented May 1, 2015

I like this approach, and it cleaned up and lowed the maintenance burden a lot.

@@ -24,7 +24,7 @@ import breeze.optimize.{LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN}
import breeze.optimize.{CachedDiffFunction, DiffFunction}

import org.apache.spark.annotation.AlphaComponent
import org.apache.spark.ml.param.{Params, ParamMap}
import org.apache.spark.ml.param.{ParamMap, Params}
import org.apache.spark.ml.param.shared.{HasTol, HasElasticNetParam, HasMaxIter, HasRegParam}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Params is no longer used.

@@ -34,13 +34,16 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage with Params {
* Fits a single model to the input data with optional parameters.
*
* @param dataset input dataset
* @param paramPairs Optional list of param pairs.
* @param firstParamPair the first param pair, overwrites embedded params
* @param otherParamPairs other param pairs.
* These values override any specified in this Estimator's embedded ParamMap.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@SparkQA
Copy link

SparkQA commented May 4, 2015

Test build #31734 has finished for PR 5820 at commit 7bef88d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class Evaluator extends Params
    • abstract class PipelineStage extends Params with Logging
    • class BinaryClassificationEvaluator extends Evaluator with HasRawPredictionCol with HasLabelCol

assert(!schema.fieldNames.contains(predictionColName),
protected def validateAndTransformSchema(schema: StructType): StructType = {
require(schema($(userCol)).dataType == IntegerType)
require(schema($(itemCol)).dataType== IntegerType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space before ==

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space before ==

@dbtsai
Copy link
Member

dbtsai commented May 4, 2015

LGTM

asfgit pushed a commit that referenced this pull request May 4, 2015
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:

~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
  copy(extra).fit(dataset)
}
~~~

Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:

~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~

Meta-algorithm like `Pipeline` implements its own `copy(extra)`. So the fitted pipeline model stored all copied stages (no matter whether it is a transformer or a model).

Other changes:
* `Params$.inheritValues` is moved to `Params!.copyValues` and returns the target instance.
* `fittingParamMap` was removed because the `parent` carries this information.
* `validate` was renamed to `validateParams` to be more precise.

TODOs:
* [x] add tests for newly added methods
* [ ] update documentation

jkbradley dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #5820 from mengxr/SPARK-5956 and squashes the following commits:

7bef88d [Xiangrui Meng] address comments
05229c3 [Xiangrui Meng] assert -> assertEquals
b2927b1 [Xiangrui Meng] organize imports
f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
93e7924 [Xiangrui Meng] add tests for hasParam & copy
463ecae [Xiangrui Meng] merge master
2b954c3 [Xiangrui Meng] update Binarizer
465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
282a1a8 [Xiangrui Meng] fix test
819dd2d [Xiangrui Meng] merge master
b642872 [Xiangrui Meng] example code runs
5a67779 [Xiangrui Meng] examples compile
c76b4d1 [Xiangrui Meng] fix all unit tests
0f4fd64 [Xiangrui Meng] fix some tests
9286a22 [Xiangrui Meng] copyValues to trained models
53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
d882afc [Xiangrui Meng] test compile
f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components

(cherry picked from commit e0833c5)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in e0833c5 May 4, 2015
@mengxr
Copy link
Contributor Author

mengxr commented May 4, 2015

Merged into master and branch-1.4. I will address comments in a follow-up PR. Thanks for reviewing!


override def validate(paramMap: ParamMap): Unit = {
/** @group getParam */
def getStages: Array[PipelineStage] = $(stages).clone()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth putting a comment here about why stages need to be cloned.

@jkbradley
Copy link
Member

@mengxr Good point about it being unclear which Estimator produced the Model. Not including fittingParamMap sounds fine.

@jkbradley
Copy link
Member

LGTM too. My comments were all small items.

@jkbradley
Copy link
Member

A few more to-do items

  • CrossValidator should make a copy of its underlying Estimator (but not Evaluator, I guess). Same for CrossValidatorModel and its underlying Model. CV also needs to make sure its estimatorParamMaps are mapped from the old estimator to the copy (with a different UID).
    • Or should it copy the Evaluator? thinking about shared params
  • After copying a Params instance, all of the parameters are explicitly set. (I.e., default values are copied to explicitly set values.) That's a little odd.

// Precedence of ParamMaps: paramMap > this.paramMap > fittingParamMap
val map = fittingParamMap ++ extractParamMap(paramMap)
stages.foldLeft(schema)((cur, transformer) => transformer.transformSchema(cur, map))
override def copy(extra: ParamMap): PipelineModel = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use "extra" and copy its stages

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:

~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
  copy(extra).fit(dataset)
}
~~~

Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:

~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~

Meta-algorithm like `Pipeline` implements its own `copy(extra)`. So the fitted pipeline model stored all copied stages (no matter whether it is a transformer or a model).

Other changes:
* `Params$.inheritValues` is moved to `Params!.copyValues` and returns the target instance.
* `fittingParamMap` was removed because the `parent` carries this information.
* `validate` was renamed to `validateParams` to be more precise.

TODOs:
* [x] add tests for newly added methods
* [ ] update documentation

jkbradley dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#5820 from mengxr/SPARK-5956 and squashes the following commits:

7bef88d [Xiangrui Meng] address comments
05229c3 [Xiangrui Meng] assert -> assertEquals
b2927b1 [Xiangrui Meng] organize imports
f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
93e7924 [Xiangrui Meng] add tests for hasParam & copy
463ecae [Xiangrui Meng] merge master
2b954c3 [Xiangrui Meng] update Binarizer
465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
282a1a8 [Xiangrui Meng] fix test
819dd2d [Xiangrui Meng] merge master
b642872 [Xiangrui Meng] example code runs
5a67779 [Xiangrui Meng] examples compile
c76b4d1 [Xiangrui Meng] fix all unit tests
0f4fd64 [Xiangrui Meng] fix some tests
9286a22 [Xiangrui Meng] copyValues to trained models
53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
d882afc [Xiangrui Meng] test compile
f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
asfgit pushed a commit that referenced this pull request Jun 2, 2015
… Updating ML Doc "Estimator, Transformer, and Param" examples.

Updating ML Doc's *"Estimator, Transformer, and Param"* example to use `model.extractParamMap` instead of `model.fittingParamMap`, which no longer exists.

mengxr, I believe this addresses (part of) the *update documentation* TODO list item from [PR 5820](#5820).

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6514 from dusenberrymw/Fix_ML_Doc_Estimator_Transformer_Param_Example and squashes the following commits:

6366e1f [Mike Dusenberry] Updating instances of model.extractParamMap to model.parent.extractParamMap, since the Params of the parent Estimator could possibly differ from thos of the Model.
d850e0e [Mike Dusenberry] Removing all references to "fittingParamMap" throughout Spark, since it has been removed.
0480304 [Mike Dusenberry] Updating the ML Doc "Estimator, Transformer, and Param" Java example to use model.extractParamMap() instead of model.fittingParamMap(), which no longer exists.
7d34939 [Mike Dusenberry] Updating ML Doc "Estimator, Transformer, and Param" example to use model.extractParamMap instead of model.fittingParamMap, which no longer exists.

(cherry picked from commit ad06727)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
asfgit pushed a commit that referenced this pull request Jun 2, 2015
… Updating ML Doc "Estimator, Transformer, and Param" examples.

Updating ML Doc's *"Estimator, Transformer, and Param"* example to use `model.extractParamMap` instead of `model.fittingParamMap`, which no longer exists.

mengxr, I believe this addresses (part of) the *update documentation* TODO list item from [PR 5820](#5820).

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6514 from dusenberrymw/Fix_ML_Doc_Estimator_Transformer_Param_Example and squashes the following commits:

6366e1f [Mike Dusenberry] Updating instances of model.extractParamMap to model.parent.extractParamMap, since the Params of the parent Estimator could possibly differ from thos of the Model.
d850e0e [Mike Dusenberry] Removing all references to "fittingParamMap" throughout Spark, since it has been removed.
0480304 [Mike Dusenberry] Updating the ML Doc "Estimator, Transformer, and Param" Java example to use model.extractParamMap() instead of model.fittingParamMap(), which no longer exists.
7d34939 [Mike Dusenberry] Updating ML Doc "Estimator, Transformer, and Param" example to use model.extractParamMap instead of model.fittingParamMap, which no longer exists.
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:

~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
  copy(extra).fit(dataset)
}
~~~

Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:

~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~

Meta-algorithm like `Pipeline` implements its own `copy(extra)`. So the fitted pipeline model stored all copied stages (no matter whether it is a transformer or a model).

Other changes:
* `Params$.inheritValues` is moved to `Params!.copyValues` and returns the target instance.
* `fittingParamMap` was removed because the `parent` carries this information.
* `validate` was renamed to `validateParams` to be more precise.

TODOs:
* [x] add tests for newly added methods
* [ ] update documentation

jkbradley dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#5820 from mengxr/SPARK-5956 and squashes the following commits:

7bef88d [Xiangrui Meng] address comments
05229c3 [Xiangrui Meng] assert -> assertEquals
b2927b1 [Xiangrui Meng] organize imports
f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
93e7924 [Xiangrui Meng] add tests for hasParam & copy
463ecae [Xiangrui Meng] merge master
2b954c3 [Xiangrui Meng] update Binarizer
465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
282a1a8 [Xiangrui Meng] fix test
819dd2d [Xiangrui Meng] merge master
b642872 [Xiangrui Meng] example code runs
5a67779 [Xiangrui Meng] examples compile
c76b4d1 [Xiangrui Meng] fix all unit tests
0f4fd64 [Xiangrui Meng] fix some tests
9286a22 [Xiangrui Meng] copyValues to trained models
53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
d882afc [Xiangrui Meng] test compile
f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
… Updating ML Doc "Estimator, Transformer, and Param" examples.

Updating ML Doc's *"Estimator, Transformer, and Param"* example to use `model.extractParamMap` instead of `model.fittingParamMap`, which no longer exists.

mengxr, I believe this addresses (part of) the *update documentation* TODO list item from [PR 5820](apache#5820).

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes apache#6514 from dusenberrymw/Fix_ML_Doc_Estimator_Transformer_Param_Example and squashes the following commits:

6366e1f [Mike Dusenberry] Updating instances of model.extractParamMap to model.parent.extractParamMap, since the Params of the parent Estimator could possibly differ from thos of the Model.
d850e0e [Mike Dusenberry] Removing all references to "fittingParamMap" throughout Spark, since it has been removed.
0480304 [Mike Dusenberry] Updating the ML Doc "Estimator, Transformer, and Param" Java example to use model.extractParamMap() instead of model.fittingParamMap(), which no longer exists.
7d34939 [Mike Dusenberry] Updating ML Doc "Estimator, Transformer, and Param" example to use model.extractParamMap instead of model.fittingParamMap, which no longer exists.
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:

~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
  copy(extra).fit(dataset)
}
~~~

Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:

~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~

Meta-algorithm like `Pipeline` implements its own `copy(extra)`. So the fitted pipeline model stored all copied stages (no matter whether it is a transformer or a model).

Other changes:
* `Params$.inheritValues` is moved to `Params!.copyValues` and returns the target instance.
* `fittingParamMap` was removed because the `parent` carries this information.
* `validate` was renamed to `validateParams` to be more precise.

TODOs:
* [x] add tests for newly added methods
* [ ] update documentation

jkbradley dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#5820 from mengxr/SPARK-5956 and squashes the following commits:

7bef88d [Xiangrui Meng] address comments
05229c3 [Xiangrui Meng] assert -> assertEquals
b2927b1 [Xiangrui Meng] organize imports
f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
93e7924 [Xiangrui Meng] add tests for hasParam & copy
463ecae [Xiangrui Meng] merge master
2b954c3 [Xiangrui Meng] update Binarizer
465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
282a1a8 [Xiangrui Meng] fix test
819dd2d [Xiangrui Meng] merge master
b642872 [Xiangrui Meng] example code runs
5a67779 [Xiangrui Meng] examples compile
c76b4d1 [Xiangrui Meng] fix all unit tests
0f4fd64 [Xiangrui Meng] fix some tests
9286a22 [Xiangrui Meng] copyValues to trained models
53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
d882afc [Xiangrui Meng] test compile
f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
… Updating ML Doc "Estimator, Transformer, and Param" examples.

Updating ML Doc's *"Estimator, Transformer, and Param"* example to use `model.extractParamMap` instead of `model.fittingParamMap`, which no longer exists.

mengxr, I believe this addresses (part of) the *update documentation* TODO list item from [PR 5820](apache#5820).

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes apache#6514 from dusenberrymw/Fix_ML_Doc_Estimator_Transformer_Param_Example and squashes the following commits:

6366e1f [Mike Dusenberry] Updating instances of model.extractParamMap to model.parent.extractParamMap, since the Params of the parent Estimator could possibly differ from thos of the Model.
d850e0e [Mike Dusenberry] Removing all references to "fittingParamMap" throughout Spark, since it has been removed.
0480304 [Mike Dusenberry] Updating the ML Doc "Estimator, Transformer, and Param" Java example to use model.extractParamMap() instead of model.fittingParamMap(), which no longer exists.
7d34939 [Mike Dusenberry] Updating ML Doc "Estimator, Transformer, and Param" example to use model.extractParamMap instead of model.fittingParamMap, which no longer exists.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants