[SPARK-12811] [ML] Estimator for Generalized Linear Models(GLMs) #11136

yanboliang · 2016-02-09T15:42:47Z

Estimator for Generalized Linear Models (GLMs), solved by iteratively reweighted least squares (IRLS).
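For readers unfamiliar with IRLS, here is a minimal sketch of the idea, not the code in this PR. It assumes a Poisson family with a log link, a single feature and no intercept, so each weighted least-squares step collapses to a scalar ratio. The names `IrlsSketch` and `fitPoisson` are hypothetical.

```scala
// Hypothetical illustration only; not taken from GeneralizedLinearRegression.scala.
object IrlsSketch {
  /** Fits y ~ Poisson(exp(beta * x)) by IRLS and returns the estimated beta. */
  def fitPoisson(
      xs: Array[Double],
      ys: Array[Double],
      maxIter: Int = 25,
      tol: Double = 1e-10): Double = {
    var beta = 0.0
    var iter = 0
    var converged = false
    while (iter < maxIter && !converged) {
      var num = 0.0
      var den = 0.0
      var i = 0
      while (i < xs.length) {
        val eta = beta * xs(i)            // linear predictor
        val mu = math.exp(eta)            // inverse of the log link
        val z = eta + (ys(i) - mu) / mu   // working response: eta + (y - mu) * g'(mu)
        val w = mu                        // working weight: 1 / (Var(mu) * g'(mu)^2)
        num += w * xs(i) * z
        den += w * xs(i) * xs(i)
        i += 1
      }
      val next = num / den                // scalar weighted least-squares solution
      converged = math.abs(next - beta) < tol
      beta = next
      iter += 1
    }
    beta
  }
}
```

With several features, the scalar ratio would become a weighted least-squares solve at each iteration.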

yanboliang · 2016-02-09T16:13:53Z

mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala

+        Instance(label, weight, features)
+    }
+
+    if ($(family) == "gaussian" && $(link) == "identity") {


For the Gaussian family with the identity link, we train the model using only WeightedLeastSquares.

yanboliang · 2016-02-09T16:23:29Z

mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala

+  override def deriv(mu: Double): Double = 1.0 / (mu * (1.0 - mu))
+
+  override def unlink(eta: Double): Double = 1.0 / (1.0 + math.exp(-1.0 * eta))
+}


The link functions need further refinement to guarantee that the endogenous variable does not contain invalid values.

I think the restriction on endogenous variable should go into the Family class since it is truly the distribution on Y that restricts the values. This is how R does it.

Yes, I have added the restriction to Family. We use the clean function to trim invalid values.
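A rough sketch of how a logit link and a family-level guard could fit together, assuming the binomial mean must stay strictly inside (0, 1). The `deriv` and `unlink` bodies mirror the diff above; the object name `LogitSketch`, the `link` body, the `clean` body and the `Epsilon` value are assumptions for illustration.

```scala
// Hypothetical illustration; only deriv and unlink are copied from the diff.
object LogitSketch {
  // Assumed tolerance; the PR's actual bound is not shown in this thread.
  val Epsilon: Double = 1e-12

  def link(mu: Double): Double = math.log(mu / (1.0 - mu))

  def deriv(mu: Double): Double = 1.0 / (mu * (1.0 - mu))

  def unlink(eta: Double): Double = 1.0 / (1.0 + math.exp(-1.0 * eta))

  // Family-level guard: pull mu back inside (0, 1) so link and deriv stay finite.
  def clean(mu: Double): Double =
    if (mu < Epsilon) Epsilon
    else if (mu > 1.0 - Epsilon) 1.0 - Epsilon
    else mu
}
```

Keeping `clean` on the family rather than the link matches the point made above: the distribution of Y determines which means are valid.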

SparkQA · 2016-02-09T16:41:16Z

Test build #50978 has finished for PR 11136 at commit 5af604e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)

SparkQA · 2016-02-09T17:17:02Z

Test build #50980 has finished for PR 11136 at commit 3082686.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-02-10T00:49:16Z

mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala

+ * The default link for the Gamma family is the inverse link.
+ * @param link a link function instance
+ */
+private[ml] class Gamma(link: Link = Log) extends Family(link) {


Typo in the default link. Log => Inverse
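To make the suggested fix concrete, a simplified sketch with the default changed to the inverse link. The `Link` trait here is a hypothetical stand-in, not the PR's class hierarchy, and `Gamma` is reduced to a bare constructor.

```scala
// Hypothetical, simplified stand-ins for the PR's Link and Family classes.
object FamilySketch {
  sealed trait Link {
    def link(mu: Double): Double
    def unlink(eta: Double): Double
  }

  case object Log extends Link {
    def link(mu: Double): Double = math.log(mu)
    def unlink(eta: Double): Double = math.exp(eta)
  }

  case object Inverse extends Link {
    def link(mu: Double): Double = 1.0 / mu
    def unlink(eta: Double): Double = 1.0 / eta
  }

  // The default now agrees with the ScalaDoc: Gamma uses the inverse link.
  class Gamma(val link: Link = Inverse)
}
```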

yanboliang · 2016-02-11T15:13:00Z

Jenkins, test this please.

SparkQA · 2016-02-11T15:32:12Z

Test build #51101 has started for PR 11136 at commit 97c3f6a.

shaneknapp · 2016-02-11T16:08:03Z

jenkins, test this please

SparkQA · 2016-02-11T17:08:04Z

Test build #51109 has finished for PR 11136 at commit 97c3f6a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-14T11:05:10Z

Test build #51264 has finished for PR 11136 at commit cc10147.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-15T11:30:25Z

Test build #51306 has finished for PR 11136 at commit 4a27970.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-02-22T07:21:43Z

mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala

+
+  def variance(mu: Double): Double = 1.0
+
+  override def clean(mu: Double): Double = {


Here we constrain mu to the valid range using a different method from R. In R, if mu or eta is invalid, it shrinks the coefficients until validmu and valideta pass. I think that would slow convergence. I look forward to hearing others' thoughts.
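The two strategies can be contrasted in a small sketch, assuming a family whose mean must be positive. Both functions and the object name `MuGuardSketch` are hypothetical; the step-halving loop follows the behaviour described above for R rather than any code in this PR.

```scala
// Hypothetical illustration of the two approaches discussed in this comment.
object MuGuardSketch {
  // Clamping: force an out-of-range mean up to an assumed lower bound.
  def clamp(mu: Double, lower: Double = 1e-16): Double = math.max(mu, lower)

  // Step halving: move the proposed value halfway back toward the previous
  // one until the validity check passes or the halving budget runs out.
  def stepHalve(
      previous: Double,
      proposed: Double,
      isValid: Double => Boolean,
      maxHalvings: Int = 50): Double = {
    var current = proposed
    var k = 0
    while (!isValid(current) && k < maxHalvings) {
      current = (current + previous) / 2.0
      k += 1
    }
    current
  }
}
```

Clamping costs one comparison per value, whereas step halving may need several passes to re-evaluate eta and mu, which is the convergence concern raised above.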

mengxr · 2016-02-22T23:54:58Z

I'm making a pass.

mengxr · 2016-02-23T01:09:25Z

mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala

+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",


Include supported options and the default value in the param doc (and the ScalaDoc).

Shall we make "gaussian" the default?
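One possible shape for this, sketched in plain Scala without the Spark `Param` API. The list of families is inferred from names mentioned elsewhere in this thread and the "gaussian" default is only the suggestion made here, so treat both as assumptions; `FamilyParamSketch` and its members are hypothetical.

```scala
// Hypothetical sketch; the real change would live in the Param doc string
// and validator of GeneralizedLinearRegression.
object FamilyParamSketch {
  // Assumed set of supported families.
  val SupportedFamilies: Seq[String] = Seq("gaussian", "binomial", "poisson", "gamma")

  // Assumed default, as proposed in this comment.
  val DefaultFamily: String = "gaussian"

  val doc: String =
    "The name of the family, which describes the error distribution used in the model. " +
      s"Supported options: ${SupportedFamilies.mkString(", ")}. Default: $DefaultFamily."

  // Rejects unsupported names with a message that lists the valid ones.
  def validate(name: String): String = {
    require(
      SupportedFamilies.contains(name),
      s"Unsupported family '$name'. Supported options: ${SupportedFamilies.mkString(", ")}.")
    name
  }
}
```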

mengxr · 2016-02-25T08:00:57Z

Only some minor comments on the implementation. I will make a pass on the tests tomorrow. @dbtsai It would be great if you can make a pass too.

SparkQA · 2016-02-25T11:38:53Z

Test build #51963 has finished for PR 11136 at commit 2ebcef7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2016-02-25T18:35:16Z

mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala

+   * In order to take the normal equation approach efficiently, [[WeightedLeastSquares]]
+   * only supports the number of features is no more than 4096.
+   */
+  val MaxNumFeatures: Int = 4096


For constants, do we have a naming convention? Like MAX_NUM_FEATURES?

This is not specified in the Spark Code Style guide, and the Scala code style guide recommends MaxNumFeatures. But I do like MAX_NUM_FEATURES better.

OK, I will update it to MAX_NUM_FEATURES after collecting other comments. Thanks!

dbtsai · 2016-02-26T09:40:37Z

Gonna do another detailed pass of the code tomorrow.

SparkQA · 2016-02-26T10:35:22Z

Test build #52044 has finished for PR 11136 at commit c05a948.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-03-01T01:09:02Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

+      val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
+
+      val testData =
+        generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed)


It would be good to say addIntercept = true instead of just true.
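A tiny sketch of the difference, using a hypothetical helper in place of the test-data generator from the diff. Only the parameter name `addIntercept` comes from this comment; everything else is made up for illustration.

```scala
// Hypothetical helper; it only echoes its arguments so the two call styles can be compared.
object NamedArgSketch {
  def describe(nPoints: Int, addIntercept: Boolean, seed: Int): String =
    s"nPoints=$nPoints addIntercept=$addIntercept seed=$seed"

  // Positional call: the reader has to look up what `true` means.
  val positional: String = describe(100, true, 42)

  // Named call: the intent of the boolean is visible at the call site.
  val named: String = describe(100, addIntercept = true, 42)
}
```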

mengxr · 2016-03-01T01:10:44Z

I made one pass on the tests, only some minor comments.

SparkQA · 2016-03-01T03:32:16Z

Test build #52211 has finished for PR 11136 at commit 314b562.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T03:47:47Z

Test build #52214 has finished for PR 11136 at commit 314b562.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T04:07:59Z

Test build #52215 has finished for PR 11136 at commit 314b562.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T04:37:39Z

Test build #52216 has finished for PR 11136 at commit 31a912c.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-03-01T05:36:48Z

mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala

+
+    val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
+    val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, col($(featuresCol)))
+      .map { case Row(label: Double, weight: Double, features: Vector) =>


.rdd.map instead of .map. This is caused by recent DataFrame API changes.

SparkQA · 2016-03-01T08:13:08Z

Test build #52227 has finished for PR 11136 at commit 007a4ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-03-01T16:48:24Z

LGTM. Merged into master. Thanks! I created SPARK-13597 for the Python API.

Estimator for Generalized Linear Models(GLMs) which will be solved by IRLS.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#11136 from yanboliang/spark-12811.

Initial version of Generalized Linear Regression

a37e285

yanboliang reviewed Feb 9, 2016

fix setParent

3082686

yanboliang force-pushed the spark-12811 branch from 5af604e to 3082686 on February 9, 2016 16:19

yanboliang reviewed Feb 9, 2016

sethah reviewed Feb 10, 2016

yanboliang added 2 commits February 11, 2016 22:30

clear doc of numFeatures <= 4096, and fix default link

378ad6c

fix validateParams

97c3f6a

add constriction to mu for family

cc10147

Add clean for Gaussian, Poisson and Gamma

4a27970

yanboliang reviewed Feb 22, 2016

mengxr reviewed Feb 23, 2016

Address comments

2ebcef7

dbtsai reviewed Feb 25, 2016

Rename MaxNumFeatures to MAX_NUM_FEATURES

c05a948

mengxr reviewed Mar 1, 2016

Fix test issues

314b562

Better error message

31a912c

mengxr reviewed Mar 1, 2016

Use .rdd.map instead of .map

007a4ec

asfgit closed this in 5ed48dd Mar 1, 2016

yanboliang deleted the spark-12811 branch March 2, 2016 02:40
