Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19400][ML] Allow GLM to handle intercept only model #16740

Closed
wants to merge 10 commits into from

Conversation

actuaryzhang
Copy link
Contributor

What changes were proposed in this pull request?

Intercept-only GLM is failing for non-Gaussian family because of reducing an empty array in IWLS. The following code val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y)) fails in the intercept-only model because oldCoefficients is empty. This PR fixes this issue.

@yanboliang @srowen @imatiach-msft @zhengruifeng

How was this patch tested?

New test for intercept only model.

@actuaryzhang
Copy link
Contributor Author

The following is a simple example to illustrate the issue.

val dataset = Seq(
          (1.0, 1.0, 2.0, 0.0, 5.0),
          (0.5, 2.0, 1.0, 1.0, 2.0),
          (1.0, 3.0, 0.5, 2.0, 1.0),
          (2.0, 4.0, 1.5, 3.0, 3.0)
        ).toDF("y", "w", "off", "x1", "x2")

val formula = new RFormula().setFormula("y ~ 1")
val output = formula.fit(dataset).transform(dataset)
val glr = new GeneralizedLinearRegression().setFamily("poisson")
val model = glr.fit(output)

The above prints out the following error message:

java.lang.UnsupportedOperationException: empty.reduceLeft
  at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
  at scala.collection.mutable.ArrayOps$ofDouble.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayOps.scala:270)
  at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)

@actuaryzhang actuaryzhang changed the title [SPARK-19400] Allow GLM to handle intercept only model [SPARK-19400][ML] Allow GLM to handle intercept only model Jan 30, 2017
@SparkQA
Copy link

SparkQA commented Jan 30, 2017

Test build #72148 has finished for PR 16740 at commit 69d9631.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

math.max(math.abs(x), math.abs(y))
val oldCoefficients = oldModel.coefficients.toArray :+ oldModel.intercept
val coefficients = model.coefficients.toArray :+ model.intercept
val maxTol = oldCoefficients.zip(coefficients).map(x => x._1 - x._2).reduce {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about simply oldCoefficients.zip(coefficients).map(x => math.abs(x._1 - x._2)).max ? (I think that exists, but if not you get the idea.)

Copy link
Contributor

@imatiach-msft imatiach-msft Jan 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some way to verify that this change does not impact execution time? The BLAS.axpy operation on coefficients/oldcoefficients means that only oldcoefficients needs to be converted to an array, and the BLAS.axpy operation is probably faster than the map with subtraction. The WeightedLeastSquares().fit() probably takes most of the time though, so it might not matter anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imatiach-msft The performance change is minor since the coefficients are not high-dimensional and these are simple operations and don't consume much time than the actual model fitting.


val expected = Seq(3.091, 3.178)

import GeneralizedLinearRegression._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems weird to me that several tests do this, import GeneralizedLinearRegression._, and we don't just have one at the top, but it looks like it is unrelated to your changes.

val familyLink = FamilyAndLink(trainer)
model.transform(dataset).select("features", "prediction", "linkPrediction").collect()
.foreach {
case Row(features: DenseVector, prediction1: Double, linkPrediction1: Double) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename these to expectedPrediction and expectedLinkPrediction. Having numbers at the end of variable names is confusing.

@imatiach-msft
Copy link
Contributor

Added a few nit-pick comments, otherwise the changes LGTM!

@actuaryzhang
Copy link
Contributor Author

@srowen Thanks much for the suggestion. Included the simplification. Please let me know if there is anything else needed for this PR. Thanks!

@actuaryzhang
Copy link
Contributor Author

@imatiach-msft Thanks for the comments! This test is based on existing tests in GLM. I can try to improve the style and streamline all tests in another PR but it will be weird to just change this test.

@SparkQA
Copy link

SparkQA commented Jan 30, 2017

Test build #72161 has finished for PR 16740 at commit 6e0b6bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor

sethah commented Jan 31, 2017

Since we already compute the number of features in the train method, why don't we just check if numFeatures == 0 and then just compute the intercept as the link of the weighted average of the labels and return it. Then we don't have to run IRLS which doesn't seem to converge in a single iteration anyway. We have typically logged a warning in cases like these, here we might say that the model will be equivalent to the null model.

val model = trainer.fit(dataset)
val actual = model.intercept
assert(actual ~== expected(idx) absTol 1E-3, "Model mismatch: intercept only GLM with " +
s"useWeight = $useWeight.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert(model.coefficients === new DenseVector(Array.empty[Double]))

val oldCoefficients = oldModel.coefficients
val coefficients = model.coefficients
BLAS.axpy(-1.0, coefficients, oldCoefficients)
val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just change reduce to foldLeft here.

@actuaryzhang
Copy link
Contributor Author

@sethah Thanks for your review. Yes, using foldLeft would be the simplest fix. I have included both your suggested changes in the new commit.

Yes, we could handle the special case of the intercept only model outside IWLS. But I think it would be better to use the current fix: a) it allows IWLS to handle edge case, b) the fix is general and less likely to be affected by future structure changes of GLM (e.g., allowing offset). Does this make sense?

@SparkQA
Copy link

SparkQA commented Jan 31, 2017

Test build #72185 has finished for PR 16740 at commit 0b3c085.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor

sethah commented Jan 31, 2017

Allowing offset will only require a small change to the intercept calculation, won't it?

 val agg = data.agg(sum(w * (col("label") - col("offset"))), sum(w)).first()
link.link(agg.getDouble(0) / agg.getDouble(1))

I'm still in favor of fixing the IRLS bug, but we should be able to return an analytic result without too much trouble, unless I'm missing something.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented Jan 31, 2017

@sethah Yes, we can directly compute the intercept easily and I was just using offset as an example. I'm concerned that such special handling may not integrate well with other features or future changes. For example, we will need to compute the standard error analytically as well. The point is that every time there is new feature, one would have to modify the intercept calculation part to handle it. This does not seem efficient. Thoughts?

@sethah
Copy link
Contributor

sethah commented Feb 1, 2017

I don't really expect that we'll be changing things so often that this becomes a hassle. I think there is value in getting known results - in the current test the IRLS solver takes 3 iterations to converge on four data points, and it's not clear to me how this will extend to other links, families, larger datasets.

Regardless, I think we should test more familes and links than just poisson. And instead of using R results we can just compute the result analytically. What do you think?

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented Feb 1, 2017

@sethah Thanks for your input. I can add more tests, but I don't think they are adding much since the algorithm is already tested in other tests.

I must emphasize that the analytical approach does not integrate well with the summary method. One has to derive the general formula for the standard error of the intercept, and then change the code substantially to make it work with summary. This would create lots of repeated code that is unnecessary. BTW, R fits the intercept only model also using IWLS with multiple iterations. It is just weird to have a special implementation in this case which is much harder to implement (standard error and summary) and does not integrate with the current setup.

Ping @srowen @yanboliang for comments. Thanks.

@sethah
Copy link
Contributor

sethah commented Feb 1, 2017

I agree having a special case is unsatisfying from an engineering perspective. In Spark it's a bit different than R since every iteration of IRLS will launch a Spark job, making a pass over the data, so the cost of the extra iterations is much higher. We have special-cased other algorithms for this reason.

It's entirely possible I'm missing something since I do not know the GLM code quite so well, and I did not thoroughly check it, but this code seemed to do the trick:

if (numFeatures == 0 && getFitIntercept) {
      val agg = dataset.agg(sum(w * col(getLabelCol)), sum(w)).first()
      val mu = agg.getDouble(0) / agg.getDouble(1)
      val diagInvAtA = (familyAndLink.family.variance(mu) * familyAndLink.link.deriv(mu)) / agg.getDouble(0)
      val model = copyValues(new GeneralizedLinearRegressionModel(uid, Vectors.zeros(0),
        familyAndLink.link.link(mu)).setParent(this))
      val trainingSummary = new GeneralizedLinearRegressionTrainingSummary(dataset, model,
        Array(diagInvAtA), 1, getSolver)
      return model.setSummary(Some(trainingSummary))
    }

The best answer here may depend on the use cases - do we expect users to be training "intercept-only" models often? If yes, then the savings on the iteration time may be worth it. If not, it is a clunky solution. We can see what others think.

Also, I got some strange failures when training with no features and fitIntercept == false. We should just throw an error in this case and add a test for it.

@actuaryzhang
Copy link
Contributor Author

@sethah Thanks for the clarification and providing an implementation. So, the pros is some speed improvement and the cons is the increased complexity (now we have three case: one for intercept only, one for Gaussian with identity and one for all the others). Let's see get other committers' opinions.

Yes, I will throw an error for the special case of no intercept and no features.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented Feb 2, 2017

@sethah Your formula for offset does not seem to be a general solution, and I'm not sure if there exists an analytical formula, in particular when the link function is not identity or log. In GLM, the normal equation for the coefficient is: sum_i (y - mu) * w = 0. When there is offset, mu = link.inv(intcpt + offset). Take the case where link.inv is inverse as an example, then we have sum_i (y - inverse(intcpt + offset)) * w = 0. This is nonlinear and has to be solved iteratively, right? The following is a simple R example to show that the formula you provided does not recover the coefficient.

set.seed(11)
off <- rlnorm(200)
a <- 0.52
mu <- 1/(a + off)
y <- rgamma(200, 1, scale = mu)
f <- glm(y~1, offset = off, family = Gamma())
coef(f)
1/mean(y - off)

@sethah
Copy link
Contributor

sethah commented Feb 3, 2017

Ok, yeah, let's go with this fix now then - seems both R and statsmodels fit to compute the null model. Thanks for following up on that!

@actuaryzhang
Copy link
Contributor Author

@srowen would you please take a look and merge this if all is good? Thanks.

.foreach {
case Row(features: DenseVector, prediction1: Double, linkPrediction1: Double) =>
val eta = BLAS.dot(features, model.coefficients) + model.intercept
val prediction2 = familyLink.fitted(eta)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to test this. This is essentially checking the correctness of the prediction mechanism, regardless of the "intercept-only" part. The prediction mechanism is tested elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sethah Agree. Removed this. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was fast! :)


// throw exception for empty model
val trainer = new GeneralizedLinearRegression().setFitIntercept(false)
intercept[SparkException] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for adding the test, could you also please wrap it in withClue to verify the message contents, eg:
withClue("Specified model is empty with neither intercept nor feature") {
intercept[SparkException] {
trainer.fit(dataset)
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@SparkQA
Copy link

SparkQA commented Feb 3, 2017

Test build #72298 has finished for PR 16740 at commit 931f7ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 3, 2017

Test build #72299 has finished for PR 16740 at commit b57af08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

@seth @imatiach-msft Let me know if there is any other changes needed. Thanks much for your review!

@@ -335,6 +335,11 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
throw new SparkException(msg)
}

if (numFeatures == 0 && !$(fitIntercept)) {
val msg = "Specified model is empty with neither intercept nor feature."
throw new SparkException(msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think require, which throws IllegalArgumentException is more appropriate here


var idx = 0
for (useWeight <- Seq(false, true)) {
val trainer = new GeneralizedLinearRegression().setFamily("poisson")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still in favor of checking other families. We had been discussion using a formula which ended up working with some families but not others - testing all the families would have caught this.

@@ -89,7 +89,7 @@ private[ml] class IterativelyReweightedLeastSquares(
val oldCoefficients = oldModel.coefficients
val coefficients = model.coefficients
BLAS.axpy(-1.0, coefficients, oldCoefficients)
val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) =>
val maxTolOfCoefficients = oldCoefficients.toArray.foldLeft(0.0) { (x, y) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this could be

val maxTol = oldCoefficients.foldLeft(math.abs(oldModel.intercept - model.intercept)) { (x, y) => 
  math.max(math.abs(x), math.abs(y))
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will not change this. I think the current one is more clear.

@actuaryzhang
Copy link
Contributor Author

actuaryzhang commented Feb 4, 2017

@sethah Thanks for the review. Made changes you suggested (except for the nit part). I added more tests although I don't think they are really necessary. The analytical approach is taking a different path from IRWLS, so I agree if we use it then we should thoroughly test it. But the current fix is just allowing the existing algorithm to work in a special case, which is well tested in the more general cases. Anyway, hope we can close this PR now.

@SparkQA
Copy link

SparkQA commented Feb 4, 2017

Test build #72360 has finished for PR 16740 at commit 95b7a10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

@sethah @imatiach-msft Could you take another look and let me know if there are any additional changes needed on this PR? Thanks!

@imatiach-msft
Copy link
Contributor

@actuaryzhang the changes look good to me. I had some nit-picks which you marked as won't fix, and I'm ok with that. Thank you for fixing this issue! Maybe a committer can review this - @jkbradley @srowen @yanboliang thanks!

@sethah
Copy link
Contributor

sethah commented Feb 6, 2017

Regarding the tests - I don't think the tests should change depending on the implementation. I don't think it's valid to say that we don't need to test this thoroughly because we know that it's just calling IRLS under the hood - especially since someone in the future might come along and change it to, say an analytical approach. Then maybe the non-rigorous tests still pass even though the implementation is wrong. And since we are already aware of a reasonable situation where some families will pass and some will not, I think it's a good idea to test them all. Sorry, not trying to be a pain :)

This LGTM otherwise. Thanks!

@actuaryzhang
Copy link
Contributor Author

@sethah Thanks for the comments.

OK, added more tests to cover all families. It's not possible to test all family and link combination if that's what you mean: the tweedie family supports a family of links. Now, the link tested includes identity, log, inverse, logit and mu^0.4. This should be enough to prevent any non-general changes to accidentally pass the tests.

@SparkQA
Copy link

SparkQA commented Feb 7, 2017

Test build #72487 has finished for PR 16740 at commit 37c41aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

Can one of the admins merge this PR since we have two approvals now? Thanks.

@srowen @jkbradley @felixcheung @yanboliang

@@ -335,6 +335,9 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
throw new SparkException(msg)
}

require(numFeatures > 0 || $(fitIntercept),
"Specified model is empty with neither intercept nor feature.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message is a bit cryptic. How about "GeneralizedLinearRegression was given data with 0 features, and with Param fitIntercept set to false. To fit a model with 0 features, fitIntercept must be set to true."

Other than this, your PR looks good to me. Thanks!

@SparkQA
Copy link

SparkQA commented Feb 8, 2017

Test build #72559 has finished for PR 16740 at commit 5a0ff27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Copy link
Contributor Author

@jkbradley Thanks much for the review and suggestion. I updated the error message. Please let me know if there's anything else needed for this PR. Thanks.

@jkbradley
Copy link
Member

LGTM
Merging with master
Thank you + @sethah for reviewing!

@asfgit asfgit closed this in 1aeb9f6 Feb 8, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?
Intercept-only GLM is failing for non-Gaussian family because of reducing an empty array in IWLS. The following code `val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y))` fails in the intercept-only model because `oldCoefficients` is empty. This PR fixes this issue.

yanboliang srowen imatiach-msft zhengruifeng

## How was this patch tested?
New test for intercept only model.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes apache#16740 from actuaryzhang/interceptOnly.
@actuaryzhang actuaryzhang deleted the interceptOnly branch May 9, 2017 05:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants