-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19400][ML] Allow GLM to handle intercept only model #16740
Conversation
The following is a simple example to illustrate the issue.
The above prints out the following error message:
|
Test build #72148 has finished for PR 16740 at commit
|
math.max(math.abs(x), math.abs(y)) | ||
val oldCoefficients = oldModel.coefficients.toArray :+ oldModel.intercept | ||
val coefficients = model.coefficients.toArray :+ model.intercept | ||
val maxTol = oldCoefficients.zip(coefficients).map(x => x._1 - x._2).reduce { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about simply oldCoefficients.zip(coefficients).map(x => math.abs(x._1 - x._2)).max
? (I think that exists, but if not you get the idea.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some way to verify that this change does not impact execution time? The BLAS.axpy operation on coefficients/oldcoefficients means that only oldcoefficients needs to be converted to an array, and the BLAS.axpy operation is probably faster than the map with subtraction. The WeightedLeastSquares().fit() probably takes most of the time though, so it might not matter anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imatiach-msft The performance change is minor since the coefficients are not high-dimensional and these are simple operations and don't consume much time than the actual model fitting.
|
||
val expected = Seq(3.091, 3.178) | ||
|
||
import GeneralizedLinearRegression._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems weird to me that several tests do this, import GeneralizedLinearRegression._, and we don't just have one at the top, but it looks like it is unrelated to your changes.
val familyLink = FamilyAndLink(trainer) | ||
model.transform(dataset).select("features", "prediction", "linkPrediction").collect() | ||
.foreach { | ||
case Row(features: DenseVector, prediction1: Double, linkPrediction1: Double) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe rename these to expectedPrediction and expectedLinkPrediction. Having numbers at the end of variable names is confusing.
Added a few nit-pick comments, otherwise the changes LGTM! |
@srowen Thanks much for the suggestion. Included the simplification. Please let me know if there is anything else needed for this PR. Thanks! |
@imatiach-msft Thanks for the comments! This test is based on existing tests in GLM. I can try to improve the style and streamline all tests in another PR but it will be weird to just change this test. |
Test build #72161 has finished for PR 16740 at commit
|
Since we already compute the number of features in the train method, why don't we just check if |
val model = trainer.fit(dataset) | ||
val actual = model.intercept | ||
assert(actual ~== expected(idx) absTol 1E-3, "Model mismatch: intercept only GLM with " + | ||
s"useWeight = $useWeight.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert(model.coefficients === new DenseVector(Array.empty[Double]))
val oldCoefficients = oldModel.coefficients | ||
val coefficients = model.coefficients | ||
BLAS.axpy(-1.0, coefficients, oldCoefficients) | ||
val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just change reduce
to foldLeft
here.
@sethah Thanks for your review. Yes, using Yes, we could handle the special case of the intercept only model outside IWLS. But I think it would be better to use the current fix: a) it allows IWLS to handle edge case, b) the fix is general and less likely to be affected by future structure changes of GLM (e.g., allowing offset). Does this make sense? |
Test build #72185 has finished for PR 16740 at commit
|
Allowing offset will only require a small change to the intercept calculation, won't it? val agg = data.agg(sum(w * (col("label") - col("offset"))), sum(w)).first()
link.link(agg.getDouble(0) / agg.getDouble(1)) I'm still in favor of fixing the IRLS bug, but we should be able to return an analytic result without too much trouble, unless I'm missing something. |
@sethah Yes, we can directly compute the intercept easily and I was just using offset as an example. I'm concerned that such special handling may not integrate well with other features or future changes. For example, we will need to compute the standard error analytically as well. The point is that every time there is new feature, one would have to modify the intercept calculation part to handle it. This does not seem efficient. Thoughts? |
I don't really expect that we'll be changing things so often that this becomes a hassle. I think there is value in getting known results - in the current test the IRLS solver takes 3 iterations to converge on four data points, and it's not clear to me how this will extend to other links, families, larger datasets. Regardless, I think we should test more familes and links than just poisson. And instead of using R results we can just compute the result analytically. What do you think? |
@sethah Thanks for your input. I can add more tests, but I don't think they are adding much since the algorithm is already tested in other tests. I must emphasize that the analytical approach does not integrate well with the summary method. One has to derive the general formula for the standard error of the intercept, and then change the code substantially to make it work with summary. This would create lots of repeated code that is unnecessary. BTW, R fits the intercept only model also using IWLS with multiple iterations. It is just weird to have a special implementation in this case which is much harder to implement (standard error and summary) and does not integrate with the current setup. Ping @srowen @yanboliang for comments. Thanks. |
I agree having a special case is unsatisfying from an engineering perspective. In Spark it's a bit different than R since every iteration of IRLS will launch a Spark job, making a pass over the data, so the cost of the extra iterations is much higher. We have special-cased other algorithms for this reason. It's entirely possible I'm missing something since I do not know the GLM code quite so well, and I did not thoroughly check it, but this code seemed to do the trick: if (numFeatures == 0 && getFitIntercept) {
val agg = dataset.agg(sum(w * col(getLabelCol)), sum(w)).first()
val mu = agg.getDouble(0) / agg.getDouble(1)
val diagInvAtA = (familyAndLink.family.variance(mu) * familyAndLink.link.deriv(mu)) / agg.getDouble(0)
val model = copyValues(new GeneralizedLinearRegressionModel(uid, Vectors.zeros(0),
familyAndLink.link.link(mu)).setParent(this))
val trainingSummary = new GeneralizedLinearRegressionTrainingSummary(dataset, model,
Array(diagInvAtA), 1, getSolver)
return model.setSummary(Some(trainingSummary))
} The best answer here may depend on the use cases - do we expect users to be training "intercept-only" models often? If yes, then the savings on the iteration time may be worth it. If not, it is a clunky solution. We can see what others think. Also, I got some strange failures when training with no features and |
@sethah Thanks for the clarification and providing an implementation. So, the pros is some speed improvement and the cons is the increased complexity (now we have three case: one for intercept only, one for Gaussian with identity and one for all the others). Let's see get other committers' opinions. Yes, I will throw an error for the special case of no intercept and no features. |
@sethah Your formula for offset does not seem to be a general solution, and I'm not sure if there exists an analytical formula, in particular when the link function is not identity or log. In GLM, the normal equation for the coefficient is: sum_i (y - mu) * w = 0. When there is offset, mu = link.inv(intcpt + offset). Take the case where link.inv is inverse as an example, then we have sum_i (y - inverse(intcpt + offset)) * w = 0. This is nonlinear and has to be solved iteratively, right? The following is a simple R example to show that the formula you provided does not recover the coefficient.
|
Ok, yeah, let's go with this fix now then - seems both R and statsmodels fit to compute the null model. Thanks for following up on that! |
@srowen would you please take a look and merge this if all is good? Thanks. |
.foreach { | ||
case Row(features: DenseVector, prediction1: Double, linkPrediction1: Double) => | ||
val eta = BLAS.dot(features, model.coefficients) + model.intercept | ||
val prediction2 = familyLink.fitted(eta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to test this. This is essentially checking the correctness of the prediction mechanism, regardless of the "intercept-only" part. The prediction mechanism is tested elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethah Agree. Removed this. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was fast! :)
|
||
// throw exception for empty model | ||
val trainer = new GeneralizedLinearRegression().setFitIntercept(false) | ||
intercept[SparkException] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for adding the test, could you also please wrap it in withClue to verify the message contents, eg:
withClue("Specified model is empty with neither intercept nor feature") {
intercept[SparkException] {
trainer.fit(dataset)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Test build #72298 has finished for PR 16740 at commit
|
Test build #72299 has finished for PR 16740 at commit
|
@seth @imatiach-msft Let me know if there is any other changes needed. Thanks much for your review! |
@@ -335,6 +335,11 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val | |||
throw new SparkException(msg) | |||
} | |||
|
|||
if (numFeatures == 0 && !$(fitIntercept)) { | |||
val msg = "Specified model is empty with neither intercept nor feature." | |||
throw new SparkException(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think require
, which throws IllegalArgumentException
is more appropriate here
|
||
var idx = 0 | ||
for (useWeight <- Seq(false, true)) { | ||
val trainer = new GeneralizedLinearRegression().setFamily("poisson") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still in favor of checking other families. We had been discussion using a formula which ended up working with some families but not others - testing all the families would have caught this.
@@ -89,7 +89,7 @@ private[ml] class IterativelyReweightedLeastSquares( | |||
val oldCoefficients = oldModel.coefficients | |||
val coefficients = model.coefficients | |||
BLAS.axpy(-1.0, coefficients, oldCoefficients) | |||
val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => | |||
val maxTolOfCoefficients = oldCoefficients.toArray.foldLeft(0.0) { (x, y) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this could be
val maxTol = oldCoefficients.foldLeft(math.abs(oldModel.intercept - model.intercept)) { (x, y) =>
math.max(math.abs(x), math.abs(y))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will not change this. I think the current one is more clear.
@sethah Thanks for the review. Made changes you suggested (except for the nit part). I added more tests although I don't think they are really necessary. The analytical approach is taking a different path from IRWLS, so I agree if we use it then we should thoroughly test it. But the current fix is just allowing the existing algorithm to work in a special case, which is well tested in the more general cases. Anyway, hope we can close this PR now. |
Test build #72360 has finished for PR 16740 at commit
|
@sethah @imatiach-msft Could you take another look and let me know if there are any additional changes needed on this PR? Thanks! |
@actuaryzhang the changes look good to me. I had some nit-picks which you marked as won't fix, and I'm ok with that. Thank you for fixing this issue! Maybe a committer can review this - @jkbradley @srowen @yanboliang thanks! |
Regarding the tests - I don't think the tests should change depending on the implementation. I don't think it's valid to say that we don't need to test this thoroughly because we know that it's just calling IRLS under the hood - especially since someone in the future might come along and change it to, say an analytical approach. Then maybe the non-rigorous tests still pass even though the implementation is wrong. And since we are already aware of a reasonable situation where some families will pass and some will not, I think it's a good idea to test them all. Sorry, not trying to be a pain :) This LGTM otherwise. Thanks! |
@sethah Thanks for the comments. OK, added more tests to cover all families. It's not possible to test all family and link combination if that's what you mean: the tweedie family supports a family of links. Now, the link tested includes identity, log, inverse, logit and mu^0.4. This should be enough to prevent any non-general changes to accidentally pass the tests. |
Test build #72487 has finished for PR 16740 at commit
|
Can one of the admins merge this PR since we have two approvals now? Thanks. |
@@ -335,6 +335,9 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val | |||
throw new SparkException(msg) | |||
} | |||
|
|||
require(numFeatures > 0 || $(fitIntercept), | |||
"Specified model is empty with neither intercept nor feature.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message is a bit cryptic. How about "GeneralizedLinearRegression was given data with 0 features, and with Param fitIntercept set to false. To fit a model with 0 features, fitIntercept must be set to true."
Other than this, your PR looks good to me. Thanks!
Test build #72559 has finished for PR 16740 at commit
|
@jkbradley Thanks much for the review and suggestion. I updated the error message. Please let me know if there's anything else needed for this PR. Thanks. |
LGTM |
## What changes were proposed in this pull request? Intercept-only GLM is failing for non-Gaussian family because of reducing an empty array in IWLS. The following code `val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y))` fails in the intercept-only model because `oldCoefficients` is empty. This PR fixes this issue. yanboliang srowen imatiach-msft zhengruifeng ## How was this patch tested? New test for intercept only model. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes apache#16740 from actuaryzhang/interceptOnly.
What changes were proposed in this pull request?
Intercept-only GLM is failing for non-Gaussian family because of reducing an empty array in IWLS. The following code
val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y))
fails in the intercept-only model becauseoldCoefficients
is empty. This PR fixes this issue.@yanboliang @srowen @imatiach-msft @zhengruifeng
How was this patch tested?
New test for intercept only model.