
[SPARK-10026] [ML] [PySpark] Implement some common Params for regression in PySpark #8508

Closed · wants to merge 4 commits

Conversation

yanboliang (Contributor)

LinearRegression and LogisticRegression lack some Params on the Python side, and some Params are not shared classes, which means we have to rewrite them for each class. These Params are listed here:

HasElasticNetParam 
HasFitIntercept
HasStandardization
HasThresholds

Here we implement them as shared params on the Python side and make the LinearRegression/LogisticRegression parameters consistent with the Scala ones.
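For readers unfamiliar with the shared-Param pattern this PR extends, here is a minimal self-contained sketch (not the actual pyspark.ml source; the tiny `Param`/`Params` classes below are stand-ins for the real ones) of the idea: each mixin owns exactly one `Param` plus its getter/setter, and an estimator picks up every parameter it supports simply by inheriting the mixins.

```python
class Param:
    """A named parameter with a description, owned by a Params instance."""
    def __init__(self, parent, name, doc):
        self.parent = parent
        self.name = name
        self.doc = doc


class Params:
    """Base class holding the param -> value map."""
    def __init__(self):
        self._paramMap = {}


class HasFitIntercept(Params):
    """Mixin providing the shared 'fitIntercept' Param."""
    def __init__(self):
        super().__init__()
        self.fitIntercept = Param(self, "fitIntercept",
                                  "whether to fit an intercept term")
        self._paramMap[self.fitIntercept] = True  # default

    def setFitIntercept(self, value):
        self._paramMap[self.fitIntercept] = value
        return self  # return self to allow chained setters

    def getFitIntercept(self):
        return self._paramMap[self.fitIntercept]


class HasElasticNetParam(Params):
    """Mixin providing the shared 'elasticNetParam' Param (0 = L2, 1 = L1)."""
    def __init__(self):
        super().__init__()
        self.elasticNetParam = Param(self, "elasticNetParam",
                                     "the ElasticNet mixing parameter, in [0, 1]")
        self._paramMap[self.elasticNetParam] = 0.0

    def getElasticNetParam(self):
        return self._paramMap[self.elasticNetParam]


class LinearRegression(HasFitIntercept, HasElasticNetParam):
    """An estimator gains shared Params by inheriting the mixins."""
    pass


lr = LinearRegression().setFitIntercept(False)
print(lr.getFitIntercept(), lr.getElasticNetParam())  # False 0.0
```

Once a Param lives in a shared mixin like this, LinearRegression and LogisticRegression no longer need their own copies of the definition, which is the duplication the PR description complains about.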

@SparkQA commented Aug 28, 2015

Test build #41744 has finished for PR 8508 at commit 730b0a7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • ("thresholds", "Thresholds in multi-class classification to adjust the probability of " +
    • class HasElasticNetParam(Params):
    • class HasFitIntercept(Params):
    • class HasStandardization(Params):
    • class HasThresholds(Params):
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.")
    • self.thresholds = Param(self, "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.")

@SparkQA commented Aug 28, 2015

Test build #41745 has finished for PR 8508 at commit d44ac06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • ("thresholds", "Thresholds in multi-class classification to adjust the probability of " +
    • class HasElasticNetParam(Params):
    • class HasFitIntercept(Params):
    • class HasStandardization(Params):
    • class HasThresholds(Params):
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.")
    • self.thresholds = Param(self, "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.")

" to adjust the probability of predicting each class." +
" Array must have length equal to the number of classes, with values >= 0." +
" The class with largest value p/t is predicted, where p is the original" +
" probability of that class and t is the class' threshold.")
threshold = Param(Params._dummy(), "threshold",
Contributor
Perhaps we should also extract a HasThreshold mixin for binary classifier thresholds

Contributor Author

threshold is a deprecated parameter; it has been replaced by thresholds. LogisticRegression still keeps threshold only for backward compatibility, so I think we don't need to extract HasThreshold as a shared Param. @jkbradley

Contributor

My understanding is that the HasThresholds trait mixed into ml.LogisticRegression is actually an artifact of a transitive dependency through ProbabilisticClassifier. We don't actually support multi-class classification in ml.LogisticRegression at the moment, and quite a bit of work went into making the API less confusing.

After multi-class is supported I think it makes sense to use HasThresholds, but for the time being I would prefer we only use HasThreshold in the Python API.
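The two parameters are closely related in the binary case. A quick self-contained check (hypothetical helper names, not pyspark.ml code) that the multi-class `thresholds` rule quoted in the Param doc above, "predict the class with the largest p/t", reduces to the binary `threshold` rule "predict class 1 iff p1 > t" when thresholds is set to [1 - t, t]:

```python
def predict_with_thresholds(probs, thresholds):
    """Multi-class rule: predict the class maximizing p/t."""
    return max(range(len(probs)), key=lambda i: probs[i] / thresholds[i])


def predict_with_threshold(p1, threshold):
    """Binary rule: predict class 1 iff its probability exceeds the threshold."""
    return 1 if p1 > threshold else 0


t = 0.3
for p1 in [0.1, 0.25, 0.3001, 0.5, 0.9]:
    multi = predict_with_thresholds([1 - p1, p1], [1 - t, t])
    binary = predict_with_threshold(p1, t)
    assert multi == binary, (p1, multi, binary)
print("thresholds=[1-t, t] matches the binary threshold rule")
```

The algebra behind the assertion: with thresholds [1-t, t], class 1 wins when p1/t > (1-p1)/(1-t), which rearranges to p1 > t.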

@feynmanliang (Contributor)

LGTM 👍, some minor formatting comments and a suggestion.

"""
setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
maxIter=100, regParam=0.1, elasticNetParam=0.0, tol=1e-6, fitIntercept=True, \
threshold=0.5, thresholds=None, \
probabilityCol="probability", rawPredictionCol="rawPrediction")
threshold=0.5, thresholds=None, probabilityCol="probability",
Contributor

add \ at the end of the line, matching the other continuation lines
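A minimal illustration (not code from the PR) of why the trailing backslash matters here: in a Python string or docstring, a \ at the end of a line removes the newline, so a wrapped setParams(...) example renders as one logical line in the generated docs; without it, a literal newline stays in the text.

```python
# Line continuation inside a string literal: the backslash swallows the newline.
with_backslash = "setParams(threshold=0.5, thresholds=None, \
probabilityCol='probability')"

# Without the backslash, the newline survives inside the string.
without_backslash = """setParams(threshold=0.5, thresholds=None,
probabilityCol='probability')"""

print("\n" in with_backslash)     # False
print("\n" in without_backslash)  # True
```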

@SparkQA commented Sep 11, 2015

Test build #42319 has finished for PR 8508 at commit 962692b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    • case class ExecutorLostFailure(execId: String, isNormalExit: Boolean = false)
    • class CoGroupedRDD[K: ClassTag](
    • class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
    • class ExecutorLossReason(val message: String) extends Serializable
    • case class ExecutorExited(exitCode: Int, isNormalExit: Boolean, reason: String)
    • case class RemoveExecutor(executorId: String, reason: ExecutorLossReason)
    • case class GetExecutorLossReason(executorId: String) extends CoarseGrainedClusterMessage
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid):
    • ("thresholds", "Thresholds in multi-class classification to adjust the probability of " +
    • class HasHandleInvalid(Params):
    • class HasElasticNetParam(Params):
    • class HasFitIntercept(Params):
    • class HasStandardization(Params):
    • class HasThresholds(Params):
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.")
    • self.thresholds = Param(self, "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.")
    • case class ConvertToSafeNode(conf: SQLConf, child: LocalNode) extends UnaryLocalNode(conf)
    • case class ConvertToUnsafeNode(conf: SQLConf, child: LocalNode) extends UnaryLocalNode(conf)
    • case class FilterNode(conf: SQLConf, condition: Expression, child: LocalNode)
    • case class HashJoinNode(
    • case class LimitNode(conf: SQLConf, limit: Int, child: LocalNode) extends UnaryLocalNode(conf)
    • abstract class LocalNode(conf: SQLConf) extends TreeNode[LocalNode] with Logging
    • abstract class LeafLocalNode(conf: SQLConf) extends LocalNode(conf)
    • abstract class UnaryLocalNode(conf: SQLConf) extends LocalNode(conf)
    • abstract class BinaryLocalNode(conf: SQLConf) extends LocalNode(conf)
    • case class ProjectNode(conf: SQLConf, projectList: Seq[NamedExpression], child: LocalNode)
    • case class SeqScanNode(conf: SQLConf, output: Seq[Attribute], data: Seq[InternalRow])
    • case class UnionNode(conf: SQLConf, children: Seq[LocalNode]) extends LocalNode(conf)

@mengxr (Contributor) commented Sep 11, 2015

Merged into master. Thanks!

@asfgit asfgit closed this in b656e61 Sep 11, 2015
@yanboliang yanboliang deleted the spark-10026 branch May 5, 2016 07:34