[Spark-24024][ML] Fix poisson deviance calculations in GLM to handle y = 0 #21125

tengpeng · 2018-04-23T02:32:25Z

What changes were proposed in this pull request?

Spark users have reported that the deviance calculation for Poisson regression does not handle y = 0, so the correct model summary cannot be obtained. The user has confirmed that the issue is in

override def deviance(y: Double, mu: Double, weight: Double): Double = {
  2.0 * weight * (y * math.log(y / mu) - (y - mu))
}

when y = 0.

The user also mentioned that there are many other places where he believes the same check is needed. However, no other changes are required, including for the Gamma distribution.
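To make the failure mode concrete: on the JVM, `math.log(0.0)` is `-Infinity` and `0.0 * -Infinity` is `NaN`, so the original formula returns `NaN` for any observation with y = 0. The sketch below is a minimal standalone illustration, not the patched Spark source. The object name and the methods `naiveDeviance` and `fixedDeviance` are hypothetical; `ylogy` mirrors the helper shown in the review diff.

```scala
object PoissonDevianceSketch {
  // Original formula from the PR description. At y == 0 the term
  // y * math.log(y / mu) becomes 0.0 * -Infinity, which is NaN.
  def naiveDeviance(y: Double, mu: Double, weight: Double): Double =
    2.0 * weight * (y * math.log(y / mu) - (y - mu))

  // Helper as it appears in the review diff: use the limit value 0
  // for y * log(y / mu) when y == 0.
  def ylogy(y: Double, mu: Double): Double =
    if (y == 0) 0.0 else y * math.log(y / mu)

  // Hypothetical patched version: same formula, routed through ylogy.
  def fixedDeviance(y: Double, mu: Double, weight: Double): Double =
    2.0 * weight * (ylogy(y, mu) - (y - mu))
}
```

With y = 0 the guarded version reduces to `2.0 * weight * mu`, and for y > 0 the two versions agree.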

How was this patch tested?

Add a comparison with R deviance calculation to the existing unit test.

dbtsai · 2018-04-23T05:33:23Z

ok to test

dbtsai

Only a couple of small comments, and we're ready to merge once they're resolved.

Thanks.

DB Tsai | Siri Open Source Technologies |  Apple, Inc

dbtsai · 2018-04-23T05:53:12Z

mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala

+    private def ylogy(y: Double, mu: Double): Double = {
+      if (y == 0) 0.0 else y * math.log(y / mu)
+    }
+


There is another ylogy implementation in Binomial. Can you move this code to object GeneralizedLinearRegression and make it private to this package?

Thanks so much for the quick review. I have moved the ylogy implementation to object GeneralizedLinearRegression. One quick question: I am not sure I fully understand why this is the right place for ylogy. Thanks!

Any suggestions for avoiding the duplicated code? Let's follow up on this later if you have an idea.
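One way to read the request in this thread is to define the helper once in an enclosing object so that both families call it. The sketch below is a hedged, standalone illustration: `GlmSketch`, `PoissonSketch`, and `BinomialSketch` are hypothetical names, the binomial formula is the standard binomial unit deviance rather than a copy of Spark's code, and a `private` member of an enclosing object stands in for the package-level visibility the reviewer asked for.

```scala
object GlmSketch {
  // Shared helper, defined once: y * log(y / mu), with the y == 0 case
  // set to its limit value 0. In Spark this would be package-private
  // (assumption based on the review comment); here plain `private`
  // is enough because the nested objects can see it.
  private def ylogy(y: Double, mu: Double): Double =
    if (y == 0) 0.0 else y * math.log(y / mu)

  object PoissonSketch {
    def deviance(y: Double, mu: Double, weight: Double): Double =
      2.0 * weight * (ylogy(y, mu) - (y - mu))
  }

  object BinomialSketch {
    // Standard binomial unit deviance; both terms need the zero guard,
    // since y == 0 or y == 1 makes one of them 0 * log(0).
    def deviance(y: Double, mu: Double, weight: Double): Double =
      2.0 * weight * (ylogy(y, mu) + ylogy(1.0 - y, 1.0 - mu))
  }
}
```

Keeping a single definition means a future change to the zero handling applies to every family that uses it.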

dbtsai · 2018-04-23T05:56:11Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

@@ -495,8 +495,8 @@ class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest
       [1] 1.8121235  -0.1747493  -0.5815417


Can you update the R script that generates the deviance?

Updated. The updated script is sufficient to calculate deviance on its own.

dbtsai · 2018-04-23T06:01:42Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

-      Vectors.dense(0.0, -0.0457441, -0.6833928),
-      Vectors.dense(1.8121235, -0.1747493, -0.5815417))
+      Vectors.dense(0.0, -0.0457441, -0.6833928, 3.8093),
+      Vectors.dense(1.8121235, -0.1747493, -0.5815417, 3.7006))



Adding them to expected is not consistent with the rest of the test code.

How about

val residualDeviancesR = Array(3.8093, 3.7006)

Modified. Thanks!

dbtsai · 2018-04-23T06:01:59Z

mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala

@@ -507,7 +507,8 @@ class GeneralizedLinearRegressionSuite extends MLTest with DefaultReadWriteTest
      val trainer = new GeneralizedLinearRegression().setFamily("poisson").setLink(link)
        .setFitIntercept(fitIntercept).setLinkPredictionCol("linkPrediction")
      val model = trainer.fit(dataset)
-      val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
+      val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1),
+        model.summary.deviance)
      assert(actual ~= expected(idx) absTol 1e-4, "Model mismatch: GLM with poisson family, " +
        s"$link link and fitIntercept = $fitIntercept (with zero values).")


assert(model.summary.deviance ~== residualDeviancesR(idx) absTol 1E-3)
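The pattern suggested in this thread keeps the R reference deviances in their own array and checks them with their own tolerance, instead of appending them to the coefficient vectors. A rough standalone sketch of that pattern follows. `DevianceCheckSketch`, `withinAbsTol`, and `checkDeviance` are hypothetical names; `withinAbsTol` is only a stand-in for the suite's `~== ... absTol` operator, and the fitted value passed in is supplied by the caller rather than taken from a real model.

```scala
object DevianceCheckSketch {
  // Hypothetical stand-in for the test suite's absolute-tolerance comparison.
  def withinAbsTol(actual: Double, expected: Double, tol: Double): Boolean =
    math.abs(actual - expected) <= tol

  // Reference deviances from the R script, as quoted in the review comment.
  val residualDeviancesR: Array[Double] = Array(3.8093, 3.7006)

  // Compare one fitted deviance with the R value for test case `idx`,
  // using the 1e-3 tolerance from the suggested assertion.
  def checkDeviance(fitted: Double, idx: Int): Boolean =
    withinAbsTol(fitted, residualDeviancesR(idx), 1e-3)
}
```

A separate check lets the deviance use a looser tolerance (1E-3) than the coefficient comparison (1e-4) without changing the existing `expected` vectors.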

SparkQA · 2018-04-23T06:45:24Z

Test build #89699 has finished for PR 21125 at commit 3c6a4da.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-04-23T14:10:02Z

Test build #89723 has finished for PR 21125 at commit da53b1a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2018-04-23T17:59:28Z

LGTM, merged into master. Thanks.

DB Tsai | Siri Open Source Technologies |  Apple, Inc

fix deviance calculation when y = 0

3c6a4da

dbtsai reviewed Apr 23, 2018

Address comments

da53b1a

tengpeng changed the title ~~[Spark-24024] Fix poisson deviance calculations in GLM to handle y = 0~~ [Spark-24024][ML] Fix poisson deviance calculations in GLM to handle y = 0 Apr 23, 2018

srowen approved these changes Apr 23, 2018

asfgit closed this in 293a0f2 Apr 23, 2018
