Skip to content

Commit

Permalink
[Spark-24024][ML] Fix poisson deviance calculations in GLM to handle …
Browse files Browse the repository at this point in the history
…y = 0

## What changes were proposed in this pull request?

It is reported by Spark users that the deviance calculation for poisson regression does not handle y = 0. Thus, the correct model summary cannot be obtained. The user has confirmed the the issue is in
```
override def deviance(y: Double, mu: Double, weight: Double): Double =
{ 2.0 * weight * (y * math.log(y / mu) - (y - mu)) }
when y = 0.
```

The user also mentioned there are many other places he believe we should check the same thing. However, no other changes are needed, including Gamma distribution.

## How was this patch tested?
Add a comparison with R deviance calculation to the existing unit test.

Author: Teng Peng <josephtengpeng@gmail.com>

Closes #21125 from tengpeng/Spark24024GLM.
  • Loading branch information
tengpeng authored and dbtsai committed Apr 23, 2018
1 parent afbdf42 commit 293a0f2
Show file tree
Hide file tree
Showing 2 changed files with 15 additions and 5 deletions.
Expand Up @@ -471,6 +471,10 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

private[regression] val epsilon: Double = 1E-16

private[regression] def ylogy(y: Double, mu: Double): Double = {
if (y == 0) 0.0 else y * math.log(y / mu)
}

/**
* Wrapper of family and link combination used in the model.
*/
Expand Down Expand Up @@ -725,10 +729,6 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine

override def variance(mu: Double): Double = mu * (1.0 - mu)

private def ylogy(y: Double, mu: Double): Double = {
if (y == 0) 0.0 else y * math.log(y / mu)
}

override def deviance(y: Double, mu: Double, weight: Double): Double = {
2.0 * weight * (ylogy(y, mu) + ylogy(1.0 - y, 1.0 - mu))
}
Expand Down Expand Up @@ -783,7 +783,7 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
override def variance(mu: Double): Double = mu

override def deviance(y: Double, mu: Double, weight: Double): Double = {
2.0 * weight * (y * math.log(y / mu) - (y - mu))
2.0 * weight * (ylogy(y, mu) - (y - mu))
}

override def aic(
Expand Down
Expand Up @@ -493,11 +493,20 @@ class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest
}
[1] -0.0457441 -0.6833928
[1] 1.8121235 -0.1747493 -0.5815417
R code for deivance calculation:
data = cbind(y=c(0,1,0,0,0,1), x1=c(18, 12, 15, 13, 15, 16), x2=c(1,0,0,2,1,1))
summary(glm(y~x1+x2, family=poisson, data=data.frame(data)))$deviance
[1] 3.70055
summary(glm(y~x1+x2-1, family=poisson, data=data.frame(data)))$deviance
[1] 3.809296
*/
val expected = Seq(
Vectors.dense(0.0, -0.0457441, -0.6833928),
Vectors.dense(1.8121235, -0.1747493, -0.5815417))

val residualDeviancesR = Array(3.809296, 3.70055)

import GeneralizedLinearRegression._

var idx = 0
Expand All @@ -510,6 +519,7 @@ class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest
val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
assert(actual ~= expected(idx) absTol 1e-4, "Model mismatch: GLM with poisson family, " +
s"$link link and fitIntercept = $fitIntercept (with zero values).")
assert(model.summary.deviance ~== residualDeviancesR(idx) absTol 1E-3)
idx += 1
}
}
Expand Down

0 comments on commit 293a0f2

Please sign in to comment.