[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero #18896
Conversation
Test build #80462 has finished for PR 18896 at commit
Test build #80680 has finished for PR 18896 at commit
The only thing I would change is the name of the new test you added. I would add "multinomial logistic regression with zero var" or something similar to the test name.
```
@@ -1392,6 +1415,61 @@ class LogisticRegressionSuite
    assert(model2.interceptVector.toArray.sum ~== 0.0 absTol eps)
  }

  test("test SPARK-21681") {
```
I would include a description of the test in addition to the ticket #.
@jkbradley please take a look when you get a chance.

LGTM except for making the test's title more descriptive. Thanks!
Test build #80708 has finished for PR 18896 at commit
Fix seems ok - but just wondering why the existing test for zero std dev in `spark/mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregatorSuite.scala` (line 232, as of cf29828) didn't catch this?
The model is always trained in the scaled space regardless of the standardization setting, e.g. in the test case doing:

```scala
val model = mlr.fit(multinomialDatasetWithZeroVar)
println(model.interceptVector)
println(model.coefficientMatrix)
val model2 = mlr.setStandardization(false).fit(multinomialDatasetWithZeroVar)
println(model2.interceptVector)
println(model2.coefficientMatrix)
```

Gives you:
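To illustrate the failure mode being discussed, here is a minimal sketch in plain Scala (the names are illustrative, not Spark's internals): when a feature's standard deviation is zero, naive scaling divides by zero and the resulting infinities corrupt the whole gradient, so zero-variance features have to be skipped rather than scaled.

```scala
val featuresStd = Array(0.828, 0.0) // second feature has zero variance
val rawFeatures = Array(5.9, 3.0)

// Naive scaling divides by zero: 3.0 / 0.0 = Infinity, which then
// propagates NaN/Infinity through the loss and gradient computation.
val naive = rawFeatures.zip(featuresStd).map { case (v, std) => v / std }

// Guarded scaling treats zero-variance features as contributing nothing,
// so the coefficients of the remaining features are still computed correctly.
val guarded = rawFeatures.zip(featuresStd).map { case (v, std) =>
  if (std != 0.0) v / std else 0.0
}
```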
@MLnick Yes, it is always trained in the scaled space.

@MLnick I debugged the test case you mentioned.
How about updating the LogisticAggregatorSuite so it catches this error:
@jkbradley OK. So can I remove the test I added?
Test build #80820 has finished for PR 18896 at commit
```scala
assert(aggConstantFeature.gradient(0) === 0.0)
def validateGradient(grad: Vector): Unit = {
  assert(grad(0) === 0.0)
  grad.toArray.foreach { gradientValue =>
```
The problem with this test was that it checked that part of the gradient was zero, but didn't check that the rest of the gradient was correct. Here, you're checking that the rest of the gradient isn't nan or infinite, but not that it's actually correct. A more appropriate test, IMO, is to also run an aggregator over the same instances with the constant feature filtered out, then check that the portion of the gradients they share are the same. e.g.
```scala
val aggConstantFeature = getNewAggregator(instancesConstantFeature,
  Vectors.dense(coefArray ++ interceptArray), fitIntercept = true, isMultinomial = true)
val filteredInstances = instancesConstantFeature.map { case Instance(l, w, f) =>
  Instance(l, w, Vectors.dense(f.toArray.tail))
}
val aggMultinomial = getNewAggregator(filteredInstances,
  Vectors.dense(coefArray.slice(3, 6) ++ interceptArray), fitIntercept = true,
  isMultinomial = true)
filteredInstances.foreach(aggMultinomial.add)
instancesConstantFeature.foreach(aggConstantFeature.add)
// constant features should not affect gradient
assert(aggConstantFeature.gradient.toArray.take(numClasses) === Array.fill(numClasses)(0.0))
assert(aggMultinomial.gradient.toArray === aggConstantFeature.gradient.toArray.slice(3, 9))
```
Just to note, this code is just for an example, not meant to be copy and pasted.
```scala
  */

val coefficientsR = new DenseMatrix(3, 2, Array(
  0.1881871, -0.0,
```
Why `-0.0`?
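For context on why a `-0.0` in expected coefficients is worth questioning but usually harmless: `-0.0` and `0.0` are distinct IEEE 754 values that compare equal numerically. A quick sketch in plain Scala (not part of the PR):

```scala
// -0.0 and 0.0 compare equal under ==, so tolerant assertions pass either way,
// but their bit patterns differ (the sign bit is set in -0.0).
println(-0.0 == 0.0) // true
println(java.lang.Double.doubleToRawLongBits(-0.0) ==
  java.lang.Double.doubleToRawLongBits(0.0)) // false: sign bit differs
```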
Thanks for catching this @WeichenXu123! I just added a note about the intent of the test.
Force-pushed from 5e73f63 to 1f4ba14
Test build #80927 has finished for PR 18896 at commit
Thanks @WeichenXu123 and @sethah ! @WeichenXu123 I'd leave the MLOR test since it's cheap and has a clear purpose, even if it overlaps a little. LGTM
Merging with master |
@WeichenXu123 would you mind sending a backport PR for 2.2? |
@jkbradley OK. (Can this be directly merged to 2.2?)
…td contains zero (backport PR for 2.2)

## What changes were proposed in this pull request?

This is a backport PR of #18896: fix a bug where MLOR does not work correctly when featureStd contains zero.

We can reproduce the bug with such a dataset (features including zero variance); it will generate a wrong result (all coefficients become 0):

```scala
val multinomialDatasetWithZeroVar = {
  val nPoints = 100
  val coefficients = Array(
    -0.57997, 0.912083, -0.371077,
    -0.16624, -0.84355, -0.048509)
  val xMean = Array(5.843, 3.0)
  val xVariance = Array(0.6856, 0.0) // including zero variance
  val testData = generateMultinomialLogisticInput(
    coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
  val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
  df.cache()
  df
}
```

## How was this patch tested?

Test case added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #19026 from WeichenXu123/fix_mlor_zero_var_bug_2_2.
What changes were proposed in this pull request?

Fix a bug where MLOR does not work correctly when featureStd contains zero.

We can reproduce the bug with a dataset whose features include zero variance; it generates a wrong result (all coefficients become 0).
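The reproduction dataset, as given in the backport commit message (the helper `generateMultinomialLogisticInput`, the `sc` context, and `seed` come from the Spark test harness, not from this snippet):

```scala
val multinomialDatasetWithZeroVar = {
  val nPoints = 100
  val coefficients = Array(
    -0.57997, 0.912083, -0.371077,
    -0.16624, -0.84355, -0.048509)
  val xMean = Array(5.843, 3.0)
  val xVariance = Array(0.6856, 0.0) // second feature has zero variance
  val testData = generateMultinomialLogisticInput(
    coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
  val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
  df.cache()
  df
}
```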
How was this patch tested?

Test case added.