[SPARK-19400][ML] Allow GLM to handle intercept only model #16740

actuaryzhang · 2017-01-30T08:12:57Z

What changes were proposed in this pull request?

Intercept-only GLM is failing for non-Gaussian family because of reducing an empty array in IWLS. The following code val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y)) fails in the intercept-only model because oldCoefficients is empty. This PR fixes this issue.

@yanboliang @srowen @imatiach-msft @zhengruifeng

How was this patch tested?

New test for intercept only model.

actuaryzhang · 2017-01-30T08:16:19Z

The following is a simple example to illustrate the issue.

val dataset = Seq(
          (1.0, 1.0, 2.0, 0.0, 5.0),
          (0.5, 2.0, 1.0, 1.0, 2.0),
          (1.0, 3.0, 0.5, 2.0, 1.0),
          (2.0, 4.0, 1.5, 3.0, 3.0)
        ).toDF("y", "w", "off", "x1", "x2")

val formula = new RFormula().setFormula("y ~ 1")
val output = formula.fit(dataset).transform(dataset)
val glr = new GeneralizedLinearRegression().setFamily("poisson")
val model = glr.fit(output)

The above prints out the following error message:

java.lang.UnsupportedOperationException: empty.reduceLeft
  at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
  at scala.collection.mutable.ArrayOps$ofDouble.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayOps.scala:270)
  at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)

SparkQA · 2017-01-30T09:20:25Z

Test build #72148 has finished for PR 16740 at commit 69d9631.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2017-01-30T10:42:42Z

mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala

-        math.max(math.abs(x), math.abs(y))
+      val oldCoefficients = oldModel.coefficients.toArray :+ oldModel.intercept
+      val coefficients = model.coefficients.toArray :+ model.intercept
+      val maxTol = oldCoefficients.zip(coefficients).map(x => x._1 - x._2).reduce {


What about simply oldCoefficients.zip(coefficients).map(x => math.abs(x._1 - x._2)).max ? (I think that exists, but if not you get the idea.)

Is there some way to verify that this change does not impact execution time? The BLAS.axpy operation on coefficients/oldcoefficients means that only oldcoefficients needs to be converted to an array, and the BLAS.axpy operation is probably faster than the map with subtraction. The WeightedLeastSquares().fit() probably takes most of the time though, so it might not matter anyway.

@imatiach-msft The performance change is minor since the coefficients are not high-dimensional and these are simple operations and don't consume much time than the actual model fitting.

imatiach-msft · 2017-01-30T16:58:32Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

+
+    val expected = Seq(3.091, 3.178)
+
+    import GeneralizedLinearRegression._


it seems weird to me that several tests do this, import GeneralizedLinearRegression._, and we don't just have one at the top, but it looks like it is unrelated to your changes.

imatiach-msft · 2017-01-30T17:02:30Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

+      val familyLink = FamilyAndLink(trainer)
+      model.transform(dataset).select("features", "prediction", "linkPrediction").collect()
+        .foreach {
+          case Row(features: DenseVector, prediction1: Double, linkPrediction1: Double) =>


maybe rename these to expectedPrediction and expectedLinkPrediction. Having numbers at the end of variable names is confusing.

imatiach-msft · 2017-01-30T17:05:51Z

Added a few nit-pick comments, otherwise the changes LGTM!

actuaryzhang · 2017-01-30T17:19:12Z

@srowen Thanks much for the suggestion. Included the simplification. Please let me know if there is anything else needed for this PR. Thanks!

actuaryzhang · 2017-01-30T17:21:16Z

@imatiach-msft Thanks for the comments! This test is based on existing tests in GLM. I can try to improve the style and streamline all tests in another PR but it will be weird to just change this test.

SparkQA · 2017-01-30T18:25:20Z

Test build #72161 has finished for PR 16740 at commit 6e0b6bf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2017-01-31T01:32:40Z

Since we already compute the number of features in the train method, why don't we just check if numFeatures == 0 and then just compute the intercept as the link of the weighted average of the labels and return it. Then we don't have to run IRLS which doesn't seem to converge in a single iteration anyway. We have typically logged a warning in cases like these, here we might say that the model will be equivalent to the null model.

sethah · 2017-01-31T00:55:42Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

+      val model = trainer.fit(dataset)
+      val actual = model.intercept
+      assert(actual ~== expected(idx) absTol 1E-3, "Model mismatch: intercept only GLM with " +
+        s"useWeight = $useWeight.")


assert(model.coefficients === new DenseVector(Array.empty[Double]))

sethah · 2017-01-31T01:04:24Z

mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala

-      val oldCoefficients = oldModel.coefficients
-      val coefficients = model.coefficients
-      BLAS.axpy(-1.0, coefficients, oldCoefficients)
-      val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) =>


You can just change reduce to foldLeft here.

actuaryzhang · 2017-01-31T04:50:30Z

@sethah Thanks for your review. Yes, using foldLeft would be the simplest fix. I have included both your suggested changes in the new commit.

Yes, we could handle the special case of the intercept only model outside IWLS. But I think it would be better to use the current fix: a) it allows IWLS to handle edge case, b) the fix is general and less likely to be affected by future structure changes of GLM (e.g., allowing offset). Does this make sense?

SparkQA · 2017-01-31T05:47:51Z

Test build #72185 has finished for PR 16740 at commit 0b3c085.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2017-01-31T20:20:23Z

Allowing offset will only require a small change to the intercept calculation, won't it?

 val agg = data.agg(sum(w * (col("label") - col("offset"))), sum(w)).first()
link.link(agg.getDouble(0) / agg.getDouble(1))

I'm still in favor of fixing the IRLS bug, but we should be able to return an analytic result without too much trouble, unless I'm missing something.

actuaryzhang · 2017-01-31T22:10:28Z

@sethah Yes, we can directly compute the intercept easily and I was just using offset as an example. I'm concerned that such special handling may not integrate well with other features or future changes. For example, we will need to compute the standard error analytically as well. The point is that every time there is new feature, one would have to modify the intercept calculation part to handle it. This does not seem efficient. Thoughts?

sethah · 2017-02-01T01:53:08Z

I don't really expect that we'll be changing things so often that this becomes a hassle. I think there is value in getting known results - in the current test the IRLS solver takes 3 iterations to converge on four data points, and it's not clear to me how this will extend to other links, families, larger datasets.

Regardless, I think we should test more familes and links than just poisson. And instead of using R results we can just compute the result analytically. What do you think?

actuaryzhang · 2017-02-01T03:39:31Z

@sethah Thanks for your input. I can add more tests, but I don't think they are adding much since the algorithm is already tested in other tests.

I must emphasize that the analytical approach does not integrate well with the summary method. One has to derive the general formula for the standard error of the intercept, and then change the code substantially to make it work with summary. This would create lots of repeated code that is unnecessary. BTW, R fits the intercept only model also using IWLS with multiple iterations. It is just weird to have a special implementation in this case which is much harder to implement (standard error and summary) and does not integrate with the current setup.

Ping @srowen @yanboliang for comments. Thanks.

sethah · 2017-02-01T05:07:40Z

I agree having a special case is unsatisfying from an engineering perspective. In Spark it's a bit different than R since every iteration of IRLS will launch a Spark job, making a pass over the data, so the cost of the extra iterations is much higher. We have special-cased other algorithms for this reason.

It's entirely possible I'm missing something since I do not know the GLM code quite so well, and I did not thoroughly check it, but this code seemed to do the trick:

if (numFeatures == 0 && getFitIntercept) {
      val agg = dataset.agg(sum(w * col(getLabelCol)), sum(w)).first()
      val mu = agg.getDouble(0) / agg.getDouble(1)
      val diagInvAtA = (familyAndLink.family.variance(mu) * familyAndLink.link.deriv(mu)) / agg.getDouble(0)
      val model = copyValues(new GeneralizedLinearRegressionModel(uid, Vectors.zeros(0),
        familyAndLink.link.link(mu)).setParent(this))
      val trainingSummary = new GeneralizedLinearRegressionTrainingSummary(dataset, model,
        Array(diagInvAtA), 1, getSolver)
      return model.setSummary(Some(trainingSummary))
    }

The best answer here may depend on the use cases - do we expect users to be training "intercept-only" models often? If yes, then the savings on the iteration time may be worth it. If not, it is a clunky solution. We can see what others think.

Also, I got some strange failures when training with no features and fitIntercept == false. We should just throw an error in this case and add a test for it.

actuaryzhang · 2017-02-01T05:26:52Z

@sethah Thanks for the clarification and providing an implementation. So, the pros is some speed improvement and the cons is the increased complexity (now we have three case: one for intercept only, one for Gaussian with identity and one for all the others). Let's see get other committers' opinions.

Yes, I will throw an error for the special case of no intercept and no features.

actuaryzhang · 2017-02-02T23:59:49Z

@sethah Your formula for offset does not seem to be a general solution, and I'm not sure if there exists an analytical formula, in particular when the link function is not identity or log. In GLM, the normal equation for the coefficient is: sum_i (y - mu) * w = 0. When there is offset, mu = link.inv(intcpt + offset). Take the case where link.inv is inverse as an example, then we have sum_i (y - inverse(intcpt + offset)) * w = 0. This is nonlinear and has to be solved iteratively, right? The following is a simple R example to show that the formula you provided does not recover the coefficient.

set.seed(11)
off <- rlnorm(200)
a <- 0.52
mu <- 1/(a + off)
y <- rgamma(200, 1, scale = mu)
f <- glm(y~1, offset = off, family = Gamma())
coef(f)
1/mean(y - off)

sethah · 2017-02-03T01:40:35Z

Ok, yeah, let's go with this fix now then - seems both R and statsmodels fit to compute the null model. Thanks for following up on that!

actuaryzhang · 2017-02-03T02:38:33Z

@srowen would you please take a look and merge this if all is good? Thanks.

sethah · 2017-02-03T02:50:16Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

+        .foreach {
+          case Row(features: DenseVector, prediction1: Double, linkPrediction1: Double) =>
+            val eta = BLAS.dot(features, model.coefficients) + model.intercept
+            val prediction2 = familyLink.fitted(eta)


I don't think we need to test this. This is essentially checking the correctness of the prediction mechanism, regardless of the "intercept-only" part. The prediction mechanism is tested elsewhere.

@sethah Agree. Removed this. Thanks!

That was fast! :)

imatiach-msft · 2017-02-03T04:49:40Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

+
+    // throw exception for empty model
+    val trainer = new GeneralizedLinearRegression().setFitIntercept(false)
+    intercept[SparkException] {


thank you for adding the test, could you also please wrap it in withClue to verify the message contents, eg:
withClue("Specified model is empty with neither intercept nor feature") {
intercept[SparkException] {
trainer.fit(dataset)
}
}

SparkQA · 2017-02-03T05:49:21Z

Test build #72298 has finished for PR 16740 at commit 931f7ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-03T05:59:14Z

Test build #72299 has finished for PR 16740 at commit b57af08.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-02-03T06:14:04Z

@seth @imatiach-msft Let me know if there is any other changes needed. Thanks much for your review!

sethah · 2017-02-03T19:37:23Z

mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala

@@ -335,6 +335,11 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
      throw new SparkException(msg)
    }

+    if (numFeatures == 0 && !$(fitIntercept)) {
+      val msg = "Specified model is empty with neither intercept nor feature."
+      throw new SparkException(msg)


I think require, which throws IllegalArgumentException is more appropriate here

sethah · 2017-02-03T19:43:20Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

+
+    var idx = 0
+    for (useWeight <- Seq(false, true)) {
+      val trainer = new GeneralizedLinearRegression().setFamily("poisson")


I'm still in favor of checking other families. We had been discussion using a formula which ended up working with some families but not others - testing all the families would have caught this.

sethah · 2017-02-03T20:10:09Z

mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala

@@ -89,7 +89,7 @@ private[ml] class IterativelyReweightedLeastSquares(
      val oldCoefficients = oldModel.coefficients
      val coefficients = model.coefficients
      BLAS.axpy(-1.0, coefficients, oldCoefficients)
-      val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) =>
+      val maxTolOfCoefficients = oldCoefficients.toArray.foldLeft(0.0) { (x, y) =>


nit: this could be

val maxTol = oldCoefficients.foldLeft(math.abs(oldModel.intercept - model.intercept)) { (x, y) => math.max(math.abs(x), math.abs(y)) }

Will not change this. I think the current one is more clear.

actuaryzhang · 2017-02-04T06:33:48Z

@sethah Thanks for the review. Made changes you suggested (except for the nit part). I added more tests although I don't think they are really necessary. The analytical approach is taking a different path from IRWLS, so I agree if we use it then we should thoroughly test it. But the current fix is just allowing the existing algorithm to work in a special case, which is well tested in the more general cases. Anyway, hope we can close this PR now.

SparkQA · 2017-02-04T07:37:30Z

Test build #72360 has finished for PR 16740 at commit 95b7a10.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-02-06T22:43:33Z

@sethah @imatiach-msft Could you take another look and let me know if there are any additional changes needed on this PR? Thanks!

imatiach-msft · 2017-02-06T23:26:23Z

@actuaryzhang the changes look good to me. I had some nit-picks which you marked as won't fix, and I'm ok with that. Thank you for fixing this issue! Maybe a committer can review this - @jkbradley @srowen @yanboliang thanks!

sethah · 2017-02-06T23:38:03Z

Regarding the tests - I don't think the tests should change depending on the implementation. I don't think it's valid to say that we don't need to test this thoroughly because we know that it's just calling IRLS under the hood - especially since someone in the future might come along and change it to, say an analytical approach. Then maybe the non-rigorous tests still pass even though the implementation is wrong. And since we are already aware of a reasonable situation where some families will pass and some will not, I think it's a good idea to test them all. Sorry, not trying to be a pain :)

This LGTM otherwise. Thanks!

actuaryzhang · 2017-02-07T05:04:16Z

@sethah Thanks for the comments.

OK, added more tests to cover all families. It's not possible to test all family and link combination if that's what you mean: the tweedie family supports a family of links. Now, the link tested includes identity, log, inverse, logit and mu^0.4. This should be enough to prevent any non-general changes to accidentally pass the tests.

SparkQA · 2017-02-07T07:23:27Z

Test build #72487 has finished for PR 16740 at commit 37c41aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-02-07T17:23:35Z

Can one of the admins merge this PR since we have two approvals now? Thanks.

@srowen @jkbradley @felixcheung @yanboliang

jkbradley · 2017-02-08T02:02:34Z

mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala

@@ -335,6 +335,9 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
      throw new SparkException(msg)
    }

+    require(numFeatures > 0 || $(fitIntercept),
+      "Specified model is empty with neither intercept nor feature.")


The message is a bit cryptic. How about "GeneralizedLinearRegression was given data with 0 features, and with Param fitIntercept set to false. To fit a model with 0 features, fitIntercept must be set to true."

Other than this, your PR looks good to me. Thanks!

SparkQA · 2017-02-08T06:31:45Z

Test build #72559 has finished for PR 16740 at commit 5a0ff27.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

actuaryzhang · 2017-02-08T07:25:41Z

@jkbradley Thanks much for the review and suggestion. I updated the error message. Please let me know if there's anything else needed for this PR. Thanks.

jkbradley · 2017-02-08T18:41:22Z

LGTM
Merging with master
Thank you + @sethah for reviewing!

## What changes were proposed in this pull request? Intercept-only GLM is failing for non-Gaussian family because of reducing an empty array in IWLS. The following code `val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y))` fails in the intercept-only model because `oldCoefficients` is empty. This PR fixes this issue. yanboliang srowen imatiach-msft zhengruifeng ## How was this patch tested? New test for intercept only model. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes apache#16740 from actuaryzhang/interceptOnly.

fix error in reducing empty array

69d9631

actuaryzhang changed the title ~~[SPARK-19400] Allow GLM to handle intercept only model~~ [SPARK-19400][ML] Allow GLM to handle intercept only model Jan 30, 2017

srowen reviewed Jan 30, 2017

View reviewed changes

imatiach-msft reviewed Jan 30, 2017

View reviewed changes

simplify fix

6e0b6bf

imatiach-msft approved these changes Jan 30, 2017

View reviewed changes

sethah reviewed Jan 31, 2017

View reviewed changes

simplify fix

0b3c085

throw exception for when there is no intercept or feature

cc46d51

sethah reviewed Feb 3, 2017

View reviewed changes

simplify test

3a0a2af

add test for empty model

931f7ec

imatiach-msft reviewed Feb 3, 2017

View reviewed changes

add clue

b57af08

sethah reviewed Feb 3, 2017

View reviewed changes

add more tests

95b7a10

add more tests

37c41aa

sethah approved these changes Feb 7, 2017

View reviewed changes

jkbradley reviewed Feb 8, 2017

View reviewed changes

update error msg

5a0ff27

asfgit closed this in 1aeb9f6 Feb 8, 2017

actuaryzhang mentioned this pull request Feb 10, 2017

[SPARK-18710][ML] Add offset in GLM #16699

Closed

actuaryzhang deleted the interceptOnly branch May 9, 2017 05:13


		val expected = Seq(3.091, 3.178)

		import GeneralizedLinearRegression._

[SPARK-19400][ML] Allow GLM to handle intercept only model #16740

[SPARK-19400][ML] Allow GLM to handle intercept only model #16740

Conversation

actuaryzhang commented Jan 30, 2017

What changes were proposed in this pull request?

How was this patch tested?

actuaryzhang commented Jan 30, 2017

SparkQA commented Jan 30, 2017

Choose a reason for hiding this comment

imatiach-msft Jan 30, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

imatiach-msft commented Jan 30, 2017

actuaryzhang commented Jan 30, 2017

actuaryzhang commented Jan 30, 2017

SparkQA commented Jan 30, 2017

sethah commented Jan 31, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

actuaryzhang commented Jan 31, 2017

SparkQA commented Jan 31, 2017

sethah commented Jan 31, 2017 • edited Loading

actuaryzhang commented Jan 31, 2017 • edited Loading

sethah commented Feb 1, 2017

actuaryzhang commented Feb 1, 2017 • edited Loading

sethah commented Feb 1, 2017

actuaryzhang commented Feb 1, 2017

actuaryzhang commented Feb 2, 2017 • edited Loading

sethah commented Feb 3, 2017

actuaryzhang commented Feb 3, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SparkQA commented Feb 3, 2017

SparkQA commented Feb 3, 2017

actuaryzhang commented Feb 3, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

actuaryzhang commented Feb 4, 2017 • edited Loading

SparkQA commented Feb 4, 2017

actuaryzhang commented Feb 6, 2017

imatiach-msft commented Feb 6, 2017

sethah commented Feb 6, 2017 • edited Loading

actuaryzhang commented Feb 7, 2017

SparkQA commented Feb 7, 2017

actuaryzhang commented Feb 7, 2017

Choose a reason for hiding this comment

SparkQA commented Feb 8, 2017

actuaryzhang commented Feb 8, 2017

jkbradley commented Feb 8, 2017

imatiach-msft Jan 30, 2017 •

edited

Loading

sethah commented Jan 31, 2017 •

edited

Loading

actuaryzhang commented Jan 31, 2017 •

edited

Loading

actuaryzhang commented Feb 1, 2017 •

edited

Loading

actuaryzhang commented Feb 2, 2017 •

edited

Loading

actuaryzhang commented Feb 4, 2017 •

edited

Loading

sethah commented Feb 6, 2017 •

edited

Loading