
[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero #18896

Closed

Conversation

WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Aug 9, 2017

What changes were proposed in this pull request?

Fix a bug where multinomial logistic regression (MLOR) does not work correctly when featureStd contains zero.

We can reproduce the bug with a dataset whose features include a zero-variance column; training on it generates a wrong result (all coefficients become 0):

    val multinomialDatasetWithZeroVar = {
      val nPoints = 100
      val coefficients = Array(
        -0.57997, 0.912083, -0.371077,
        -0.16624, -0.84355, -0.048509)

      val xMean = Array(5.843, 3.0)
      val xVariance = Array(0.6856, 0.0)  // including zero variance

      val testData = generateMultinomialLogisticInput(
        coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)

      val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
      df.cache()
      df
    }
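For context, the failure comes from standardization: the aggregator divides by each feature's standard deviation, and a zero std turns that division into Infinity or NaN, which then poisons the whole gradient. A minimal sketch of the unguarded vs. guarded scaling (illustrative names only, not the actual `LogisticRegression` internals):

```scala
// Sketch of the failure mode with a zero-variance feature.
// `featuresStd` and the scaling helpers are illustrative, not Spark code.
object ZeroStdSketch {
  val featuresStd: Array[Double] = Array(0.828, 0.0) // second feature is constant

  // Unguarded: dividing by a zero std yields Infinity (or NaN for 0.0 / 0.0).
  def scaleUnguarded(value: Double, i: Int): Double = value / featuresStd(i)

  // Guarded: skip constant features entirely, which is what the fix does.
  def scaleGuarded(value: Double, i: Int): Double =
    if (featuresStd(i) != 0.0) value / featuresStd(i) else 0.0

  def main(args: Array[String]): Unit = {
    assert(scaleUnguarded(3.0, 1).isInfinity) // Infinity contaminates every dot product
    assert(scaleUnguarded(0.0, 1).isNaN)      // 0.0 / 0.0 is NaN
    assert(scaleGuarded(3.0, 1) == 0.0)       // constant feature contributes nothing
    println("guarded scaling keeps zero-variance features out of the gradient")
  }
}
```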

How was this patch tested?

A test case was added.

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80462 has finished for PR 18896 at commit c415dde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80680 has finished for PR 18896 at commit 9339145.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@MrBago MrBago left a comment

The only thing I would change is the name of the new test you added. I would add "multinomial logistic regression with zero var" or something similar to the test name.

@@ -1392,6 +1415,61 @@ class LogisticRegressionSuite
assert(model2.interceptVector.toArray.sum ~== 0.0 absTol eps)
}

test("test SPARK-21681") {
Contributor

I would include a description of the test in addition to the ticket #.

@MrBago
Contributor

MrBago commented Aug 15, 2017

@jkbradley please take a look when you get a chance.

@jkbradley
Member

LGTM except for making the test's title more descriptive. Thanks!

@SparkQA

SparkQA commented Aug 16, 2017

Test build #80708 has finished for PR 18896 at commit 2eda876.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Aug 16, 2017

Fix seems ok - but just wondering why the test for zero std dev in LogisticAggregatorSuite didn't pick this up? cc @sethah

@MLnick
Contributor

MLnick commented Aug 16, 2017

The model is always trained in the scaled space regardless of the standardization param.

e.g. in the test case doing:

    val model = mlr.fit(multinomialDatasetWithZeroVar)

    println(model.interceptVector)
    println(model.coefficientMatrix)

    val model2 = mlr.setStandardization(false).fit(multinomialDatasetWithZeroVar)

    println(model2.interceptVector)
    println(model2.coefficientMatrix)

Gives you

[0.2659042827402973,0.5361544814545004,-0.8020587641947977]
0.18818263258382864    0.0  
-0.024143986869901088  0.0  
-0.16403864571392754   0.0  
[0.2659042827402915,0.5361544814544981,-0.8020587641947895]
0.18818263258382964    0.0  
-0.024143986869900692  0.0  
-0.16403864571392893   0.0  

@WeichenXu123
Contributor Author

WeichenXu123 commented Aug 16, 2017

@MLnick Yes, it is always trained in the scaled space.
The test case should have caught this; I will check it later.

@WeichenXu123
Contributor Author

@MLnick I debugged the test case you mentioned.
The reason is that the zero variance causes the computation to produce Infinity and NaN, making the result unpredictable; in this case it happened to produce 0.0 at gradient(0), so it slipped past the test case.
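To spell out why `gradient(0) === 0.0` could pass by accident: a few IEEE-754 identities (plain Scala, independent of the Spark internals) show how a zero std poisons some gradient slots with Infinity or NaN while others can still land on exactly 0.0:

```scala
// Plain-Scala illustration of the IEEE-754 behavior behind the flaky check.
val zeroStd = 0.0
assert((1.0 / zeroStd).isPosInfinity)  // nonzero / 0.0 -> Infinity
assert((0.0 / zeroStd).isNaN)          // 0.0 / 0.0 -> NaN
assert((0.0 * (1.0 / zeroStd)).isNaN)  // 0.0 * Infinity -> NaN
// A slot built only from finite zeros still comes out as exactly 0.0,
// which is how the old assertion on gradient(0) could pass by accident:
assert(0.0 * 5.0 == 0.0)
```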

@jkbradley
Member

How about updating the LogisticAggregatorSuite so it catches this error:

diff --git a/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
index 2b29c67..a2924ad 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala
@@ -239,7 +239,14 @@ class LogisticAggregatorSuite extends SparkFunSuite with MLlibTestSparkContext {
       Vectors.dense(coefArray ++ interceptArray), fitIntercept = true, isMultinomial = true)
     instances.foreach(aggConstantFeature.add)
     // constant features should not affect gradient
-    assert(aggConstantFeature.gradient(0) === 0.0)
+    def validateGradient(grad: Vector): Unit = {
+      assert(grad(0) === 0.0)
+      grad.toArray.foreach { gradientValue =>
+        assert(!gradientValue.isNaN &&
+          gradientValue > Double.NegativeInfinity && gradientValue < Double.PositiveInfinity)
+      }
+    }
+    validateGradient(aggConstantFeature.gradient)
 
     val binaryCoefArray = Array(1.0, 2.0)
     val intercept = 1.0
@@ -248,6 +255,6 @@ class LogisticAggregatorSuite extends SparkFunSuite with MLlibTestSparkContext {
       isMultinomial = false)
     instances.foreach(aggConstantFeatureBinary.add)
     // constant features should not affect gradient
-    assert(aggConstantFeatureBinary.gradient(0) === 0.0)
+    validateGradient(aggConstantFeatureBinary.gradient)
   }
 }

@WeichenXu123
Contributor Author

@jkbradley OK. So can I remove the test I added?

@SparkQA

SparkQA commented Aug 18, 2017

Test build #80820 has finished for PR 18896 at commit 5e73f63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(aggConstantFeature.gradient(0) === 0.0)
def validateGradient(grad: Vector): Unit = {
assert(grad(0) === 0.0)
grad.toArray.foreach { gradientValue =>
Contributor

The problem with this test was that it checked that part of the gradient was zero, but didn't check that the rest of the gradient was correct. Here, you're checking that the rest of the gradient isn't NaN or infinite, but not that it's actually correct. A more appropriate test, IMO, is to also run an aggregator over the same instances with the constant feature filtered out, then check that the portions of the gradients they share are the same, e.g.:

    val aggConstantFeature = getNewAggregator(instancesConstantFeature,
      Vectors.dense(coefArray ++ interceptArray), fitIntercept = true, isMultinomial = true)
    val filteredInstances = instancesConstantFeature.map { case Instance(l, w, f) =>
      Instance(l, w, Vectors.dense(f.toArray.tail))
    }
    val aggMultinomial = getNewAggregator(filteredInstances,
      Vectors.dense(coefArray.slice(3, 6) ++ interceptArray), fitIntercept = true,
      isMultinomial = true)
    filteredInstances.foreach(aggMultinomial.add)
    instancesConstantFeature.foreach(aggConstantFeature.add)

    // constant features should not affect gradient
    assert(aggConstantFeature.gradient.toArray.take(numClasses) === Array.fill(numClasses)(0.0))
    assert(aggMultinomial.gradient.toArray === aggConstantFeature.gradient.toArray.slice(3, 9))

Just to note, this code is just for an example, not meant to be copy and pasted.
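The pattern described above can be shown end-to-end on a toy model. The sketch below uses a plain least-squares gradient rather than Spark's `LogisticAggregator`, so all names are illustrative; the point is only the test structure: compute the gradient with the constant feature present, recompute it with that feature filtered out, and check that the shared components match.

```scala
// Toy end-to-end version of the suggested test pattern. The model is plain
// least squares, not Spark's LogisticAggregator; names are illustrative only.
case class Instance(label: Double, features: Array[Double])

def gradient(coef: Array[Double], data: Seq[Instance]): Array[Double] = {
  val g = new Array[Double](coef.length)
  for (Instance(y, x) <- data) {
    val err = x.zip(coef).map { case (xi, ci) => xi * ci }.sum - y
    for (i <- g.indices) g(i) += err * x(i)
  }
  g
}

val data = Seq(
  Instance(1.0, Array(0.0, 2.0)),  // first feature is constant (all zeros),
  Instance(2.0, Array(0.0, -1.0))) // i.e. it has zero variance
val coef = Array(0.5, 1.5)

val full = gradient(coef, data)
val filtered = gradient(coef.tail, data.map(i => i.copy(features = i.features.tail)))

assert(full(0) == 0.0)                   // constant feature: zero gradient slot
assert(full.tail.sameElements(filtered)) // shared components agree exactly
```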

*/

val coefficientsR = new DenseMatrix(3, 2, Array(
0.1881871, -0.0,
Contributor

Why -0.0?

@sethah
Contributor

sethah commented Aug 19, 2017

Thanks for catching this @WeichenXu123! I just added a note about the intent of the test.

@SparkQA

SparkQA commented Aug 21, 2017

Test build #80927 has finished for PR 18896 at commit 1f4ba14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Thanks @WeichenXu123 and @sethah !

@WeichenXu123 I'd leave the MLOR test since it's cheap and has a clear purpose, even if it overlaps a little.

LGTM
@sethah any more comments before I merge it?

@jkbradley
Member

jkbradley commented Aug 22, 2017

Merging with master
Thanks @WeichenXu123 @MLnick @sethah !

@jkbradley
Member

@WeichenXu123 would you mind sending a backport PR for 2.2?

@asfgit asfgit closed this in d56c262 Aug 22, 2017
@WeichenXu123
Contributor Author

@jkbradley OK. (Can this be merged directly to 2.2?)

@WeichenXu123 WeichenXu123 deleted the fix_mlor_stdvalue_zero_bug branch August 22, 2017 23:59
asfgit pushed a commit that referenced this pull request Aug 24, 2017
…td contains zero (backport PR for 2.2)

## What changes were proposed in this pull request?

This is backport PR of #18896

Fix a bug where multinomial logistic regression (MLOR) does not work correctly when featureStd contains zero.

We can reproduce the bug with a dataset whose features include a zero-variance column; training on it generates a wrong result (all coefficients become 0):
```
    val multinomialDatasetWithZeroVar = {
      val nPoints = 100
      val coefficients = Array(
        -0.57997, 0.912083, -0.371077,
        -0.16624, -0.84355, -0.048509)

      val xMean = Array(5.843, 3.0)
      val xVariance = Array(0.6856, 0.0)  // including zero variance

      val testData = generateMultinomialLogisticInput(
        coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)

      val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
      df.cache()
      df
    }
```
## How was this patch tested?

A test case was added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #19026 from WeichenXu123/fix_mlor_zero_var_bug_2_2.
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…td contains zero (backport PR for 2.2)

(Same commit message as the 2.2 backport above.)