[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero #18896

WeichenXu123 · 2017-08-09T17:19:27Z

What changes were proposed in this pull request?

fix bug of MLOR do not work correctly when featureStd contains zero

We can reproduce the bug through such dataset (features including zero variance), will generate wrong result (all coefficients becomes 0)

    val multinomialDatasetWithZeroVar = {
      val nPoints = 100
      val coefficients = Array(
        -0.57997, 0.912083, -0.371077,
        -0.16624, -0.84355, -0.048509)

      val xMean = Array(5.843, 3.0)
      val xVariance = Array(0.6856, 0.0)  // including zero variance

      val testData = generateMultinomialLogisticInput(
        coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)

      val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
      df.cache()
      df
    }

How was this patch tested?

testcase added.

SparkQA · 2017-08-09T18:23:41Z

Test build #80462 has finished for PR 18896 at commit c415dde.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-15T11:40:54Z

Test build #80680 has finished for PR 18896 at commit 9339145.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MrBago

The only thing I would change is the name of the new test you added. I would add "multinomial logistic regression with zero var" or something similar to the test name.

MrBago · 2017-08-15T20:57:52Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

@@ -1392,6 +1415,61 @@ class LogisticRegressionSuite
    assert(model2.interceptVector.toArray.sum ~== 0.0 absTol eps)
  }

+  test("test SPARK-21681") {


I would include a description of the test in addition to the ticket #.

MrBago · 2017-08-15T20:59:42Z

@jkbradley please take a look when you get a chance.

jkbradley · 2017-08-16T01:20:39Z

LGTM except for making the test's title more descriptive. Thanks!

SparkQA · 2017-08-16T03:39:44Z

Test build #80708 has finished for PR 18896 at commit 2eda876.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-08-16T07:35:12Z

Fix seems ok - but just wondering why the test for zero std dev in LogisticAggregatorSuite didn't pick this up? cc @sethah

spark/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala

Line 232 in cf29828

test("check with zero standard deviation") {

MLnick · 2017-08-16T09:30:21Z

The model is always trained in the scaled space regardless of the standardization param.

e.g. in the test case doing:

    val model = mlr.fit(multinomialDatasetWithZeroVar)

    println(model.interceptVector)
    println(model.coefficientMatrix)

    val model2 = mlr.setStandardization(false).fit(multinomialDatasetWithZeroVar)

    println(model2.interceptVector)
    println(model2.coefficientMatrix)

Gives you

[0.2659042827402973,0.5361544814545004,-0.8020587641947977]
0.18818263258382864    0.0  
-0.024143986869901088  0.0  
-0.16403864571392754   0.0  
[0.2659042827402915,0.5361544814544981,-0.8020587641947895]
0.18818263258382964    0.0  
-0.024143986869900692  0.0  
-0.16403864571392893   0.0

WeichenXu123 · 2017-08-16T14:49:28Z

@MLnick Yes it is always trained in scaled space.
The testcase seems should pick up this. I will check it later.

WeichenXu123 · 2017-08-16T16:53:55Z

@MLnick I debug the testcase your mentioned.
The reason is, the zero var cause the computation generate Infinite and NaN so the result is unexpectable, in this case, it happened to generate 0.0 at gradient(0), so it escape the testcase.

jkbradley · 2017-08-18T00:33:45Z

How about updating the LogisticAggregatorSuite so it catches this error:

diff --git a/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
index 2b29c67..a2924ad 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
@@ -239,7 +239,14 @@ class LogisticAggregatorSuite extends SparkFunSuite with MLlibTestSparkContext {
       Vectors.dense(coefArray ++ interceptArray), fitIntercept = true, isMultinomial = true)
     instances.foreach(aggConstantFeature.add)
     // constant features should not affect gradient
-    assert(aggConstantFeature.gradient(0) === 0.0)
+    def validateGradient(grad: Vector): Unit = {
+      assert(grad(0) === 0.0)
+      grad.toArray.foreach { gradientValue =>
+        assert(!gradientValue.isNaN &&
+          gradientValue > Double.NegativeInfinity && gradientValue < Double.PositiveInfinity)
+      }
+    }
+    validateGradient(aggConstantFeature.gradient)
 
     val binaryCoefArray = Array(1.0, 2.0)
     val intercept = 1.0
@@ -248,6 +255,6 @@ class LogisticAggregatorSuite extends SparkFunSuite with MLlibTestSparkContext {
       isMultinomial = false)
     instances.foreach(aggConstantFeatureBinary.add)
     // constant features should not affect gradient
-    assert(aggConstantFeatureBinary.gradient(0) === 0.0)
+    validateGradient(aggConstantFeatureBinary.gradient)
   }
 }

WeichenXu123 · 2017-08-18T00:50:15Z

@jkkbradley OK. So I can remove the test I added ?

SparkQA · 2017-08-18T04:07:11Z

Test build #80820 has finished for PR 18896 at commit 5e73f63.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2017-08-19T00:21:13Z

mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala

-    assert(aggConstantFeature.gradient(0) === 0.0)
+    def validateGradient(grad: Vector): Unit = {
+      assert(grad(0) === 0.0)
+      grad.toArray.foreach { gradientValue =>


The problem with this test was that it checked that part of the gradient was zero, but didn't check that the rest of the gradient was correct. Here, you're checking that the rest of the gradient isn't nan or infinite, but not that it's actually correct. A more appropriate test, IMO, is to also run an aggregator over the same instances with the constant feature filtered out, then check that the portion of the gradients they share are the same. e.g.

val aggConstantFeature = getNewAggregator(instancesConstantFeature, Vectors.dense(coefArray ++ interceptArray), fitIntercept = true, isMultinomial = true) val filteredInstances = instancesConstantFeature.map { case Instance(l, w, f) => Instance(l, w, Vectors.dense(f.toArray.tail)) } val aggMultinomial = getNewAggregator(filteredInstances, Vectors.dense(coefArray.slice(3, 6) ++ interceptArray), fitIntercept = true, isMultinomial = true) filteredInstances.foreach(aggMultinomial.add) instancesConstantFeature.foreach(aggConstantFeature.add) // constant features should not affect gradient assert(aggConstantFeature.gradient.toArray.take(numClasses) === Array.fill(numClasses)(0.0)) assert(aggMultinomial.gradient.toArray === aggConstantFeature.gradient.toArray.slice(3, 9))

Just to note, this code is just for an example, not meant to be copy and pasted.

sethah · 2017-08-19T00:21:50Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

+    */
+
+    val coefficientsR = new DenseMatrix(3, 2, Array(
+      0.1881871, -0.0,


Why -0.0?

sethah · 2017-08-19T00:22:31Z

Thanks for catching this @WeichenXu123! I just added a note about the intent of test.

SparkQA · 2017-08-21T14:31:42Z

Test build #80927 has finished for PR 18896 at commit 1f4ba14.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-08-21T19:55:16Z

Thanks @WeichenXu123 and @sethah !

@WeichenXu123 I'd leave the MLOR test since it's cheap and has a clear purpose, even if it overlaps a little.

LGTM
@sethah any more comments before I merge it?

jkbradley · 2017-08-22T23:54:18Z

Merging with master
Thanks @WeichenXu123 @MLnick @sethah !

jkbradley · 2017-08-22T23:57:35Z

@WeichenXu123 would you mind sending a backport PR for 2.2?

WeichenXu123 · 2017-08-22T23:59:41Z

@jkbradley OK. (Can this directly merged to 2.2 ?)

…td contains zero (backport PR for 2.2) ## What changes were proposed in this pull request? This is backport PR of #18896 fix bug of MLOR do not work correctly when featureStd contains zero We can reproduce the bug through such dataset (features including zero variance), will generate wrong result (all coefficients becomes 0) ``` val multinomialDatasetWithZeroVar = { val nPoints = 100 val coefficients = Array( -0.57997, 0.912083, -0.371077, -0.16624, -0.84355, -0.048509) val xMean = Array(5.843, 3.0) val xVariance = Array(0.6856, 0.0) // including zero variance val testData = generateMultinomialLogisticInput( coefficients, xMean, xVariance, addIntercept = true, nPoints, seed) val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0)) df.cache() df } ``` ## How was this patch tested? testcase added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #19026 from WeichenXu123/fix_mlor_zero_var_bug_2_2.

…td contains zero (backport PR for 2.2) ## What changes were proposed in this pull request? This is backport PR of apache#18896 fix bug of MLOR do not work correctly when featureStd contains zero We can reproduce the bug through such dataset (features including zero variance), will generate wrong result (all coefficients becomes 0) ``` val multinomialDatasetWithZeroVar = { val nPoints = 100 val coefficients = Array( -0.57997, 0.912083, -0.371077, -0.16624, -0.84355, -0.048509) val xMean = Array(5.843, 3.0) val xVariance = Array(0.6856, 0.0) // including zero variance val testData = generateMultinomialLogisticInput( coefficients, xMean, xVariance, addIntercept = true, nPoints, seed) val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0)) df.cache() df } ``` ## How was this patch tested? testcase added. Author: WeichenXu <WeichenXu123@outlook.com> Closes apache#19026 from WeichenXu123/fix_mlor_zero_var_bug_2_2.

WeichenXu123 mentioned this pull request Aug 9, 2017

[WIP][SPARK-17134][ML] Use level 2 BLAS operations in LogisticAggregator #17894

Closed

MrBago approved these changes Aug 15, 2017

View reviewed changes

sethah reviewed Aug 19, 2017

View reviewed changes

WeichenXu123 added 5 commits August 21, 2017 19:52

init pr

3c535e4

add test

48ea363

update testcase description

8515b20

update logisticAggregatorSuite

0f28e5e

update testcase

1f4ba14

WeichenXu123 force-pushed the fix_mlor_stdvalue_zero_bug branch from 5e73f63 to 1f4ba14 Compare August 21, 2017 13:15

sethah mentioned this pull request Aug 21, 2017

[SPARK-21108] [ML] convert LinearSVC to aggregator framework #18315

Closed

asfgit closed this in d56c262 Aug 22, 2017

WeichenXu123 deleted the fix_mlor_stdvalue_zero_bug branch August 22, 2017 23:59

WeichenXu123 mentioned this pull request Aug 23, 2017

[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero (backport PR for 2.2) #19026

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero #18896

[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero #18896

WeichenXu123 commented Aug 9, 2017 •

edited

Loading

SparkQA commented Aug 9, 2017

SparkQA commented Aug 15, 2017

MrBago left a comment

MrBago Aug 15, 2017

MrBago commented Aug 15, 2017

jkbradley commented Aug 16, 2017

SparkQA commented Aug 16, 2017

MLnick commented Aug 16, 2017

MLnick commented Aug 16, 2017

WeichenXu123 commented Aug 16, 2017 •

edited

Loading

WeichenXu123 commented Aug 16, 2017

jkbradley commented Aug 18, 2017

WeichenXu123 commented Aug 18, 2017

SparkQA commented Aug 18, 2017

sethah Aug 19, 2017

sethah Aug 19, 2017

sethah commented Aug 19, 2017

SparkQA commented Aug 21, 2017

jkbradley commented Aug 21, 2017

jkbradley commented Aug 22, 2017 •

edited

Loading

jkbradley commented Aug 22, 2017

WeichenXu123 commented Aug 22, 2017

[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero #18896

[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero #18896

Conversation

WeichenXu123 commented Aug 9, 2017 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

SparkQA commented Aug 9, 2017

SparkQA commented Aug 15, 2017

MrBago left a comment

Choose a reason for hiding this comment

MrBago Aug 15, 2017

Choose a reason for hiding this comment

MrBago commented Aug 15, 2017

jkbradley commented Aug 16, 2017

SparkQA commented Aug 16, 2017

MLnick commented Aug 16, 2017

MLnick commented Aug 16, 2017

WeichenXu123 commented Aug 16, 2017 • edited Loading

WeichenXu123 commented Aug 16, 2017

jkbradley commented Aug 18, 2017

WeichenXu123 commented Aug 18, 2017

SparkQA commented Aug 18, 2017

sethah Aug 19, 2017

Choose a reason for hiding this comment

sethah Aug 19, 2017

Choose a reason for hiding this comment

sethah commented Aug 19, 2017

SparkQA commented Aug 21, 2017

jkbradley commented Aug 21, 2017

jkbradley commented Aug 22, 2017 • edited Loading

jkbradley commented Aug 22, 2017

WeichenXu123 commented Aug 22, 2017

WeichenXu123 commented Aug 9, 2017 •

edited

Loading

WeichenXu123 commented Aug 16, 2017 •

edited

Loading

jkbradley commented Aug 22, 2017 •

edited

Loading