
[Spark-12732][ML] bug fix in linear regression train #10702

Closed
iyounus wants to merge 8 commits

Conversation

@iyounus (Contributor) commented on Jan 11, 2016

Fixed the bug in linear regression train for the case when the target variable is constant. The two cases, fitIntercept=true and fitIntercept=false, should be treated differently.
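
A condensed sketch of the intended handling, for illustration only (the names mirror the snippets reviewed below; this is not the merged code):

def constantLabelYStd(rawYStd: Double, yMean: Double, fitIntercept: Boolean): Option[Double] = {
  if (rawYStd > 0.0) {
    Some(rawYStd) // ordinary case: scale the label by its true standard deviation
  } else if (fitIntercept) {
    // Constant label with an intercept: the optimum is zero coefficients with
    // intercept = yMean, so no training is needed and no label scale is required.
    None
  } else {
    // Constant label through the origin: train on an effectively unscaled label
    // by falling back to |yMean|, or to 1.0 when the label is all zeros.
    Some(if (yMean != 0.0) math.abs(yMean) else 1.0)
  }
}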

@@ -219,33 +219,41 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
     }

     val yMean = ySummarizer.mean(0)
-    val yStd = math.sqrt(ySummarizer.variance(0))
+    var yStd = math.sqrt(ySummarizer.variance(0))
A Member commented:

Can you put all this logic up here?

val rawYStd = math.sqrt(ySummarizer.variance(0))
val yStd = if (rawYStd > 0.0 || fitIntercept) {
  rawYStd
} else {
  logWarning(...)
  1.0
}

A Member commented:

Also, when rawYStd == 0.0, standardization == true and regParam != 0.0, the problem will be ill-defined. We may want to throw an exception.

@srowen (Member) commented:

@iyounus what do you think of these comments?

@iyounus (Contributor, author) replied:

@srowen I agree with these comments. I'm working on WeightedLeastSquares (#10274), and there are some commonalities between these two issues. Once I've completed all the tests for WeightedLeastSquares, I'll complete this issue.

Some of the tests for LinearRegression use both the normal and l-bfgs solvers. It would be nice if the WeightedLeastSquares issue were merged first so that I can write the tests in a similar way.

@SparkQA commented on Jan 11, 2016

Test build #49163 has finished for PR 10702 at commit 944be71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented on Jan 20, 2016

@dbtsai Do you have time to make a pass?

@iyounus (Contributor, author) commented on Jan 20, 2016

@mengxr I haven't implemented the changes suggested by @dbtsai and @srowen yet. I think the solution I originally proposed for this issue may not be very suitable. I'll make some updates to this PR either today or tomorrow.

@dbtsai (Member) commented on Jan 20, 2016

@mengxr This PR is also on my radar. I'm working on another PR now; once @iyounus is ready, I will work on this.

@SparkQA commented on Jan 20, 2016

Test build #49816 has finished for PR 10702 at commit 23ce5f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…0, standardization=true and fitIntercept=false. added test for this case. Modified existing test for constant label.
@iyounus (Contributor, author) commented on Jan 21, 2016

I've added an exception for the case when the label is constant, standardization == true, and regParam != 0.0. I've also added a test for this case.

I cannot test the case standardizeLabel = false with non-zero regParam against the analytic normal-equation solution, because the normal equation gives me quite different results.

Even for the case when regParam = 0 and fitIntercept = false, the results from the normal equation are slightly different from glmnet and Spark. Please see my detailed comment on this comparison at #10274.

@SparkQA commented on Jan 21, 2016

Test build #49889 has finished for PR 10702 at commit 5803bd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented on Jan 27, 2016

@iyounus For standardizeLabel = false/true with non-zero regParam, let's throw the exception. I explained the mismatch against the analytic normal equation in the other PR.

Thanks.


test("regularized linear regression through origin with constant label") {
// The problem is ill-defined if fitIntercept=false, regParam is non-zero and \
// standardization=true. An exception is thrown in this case.
A Member commented:

When standardization=false, the problem is still ill-defined since GLMNET always standardizes the labels. That's why you see it in the analytical solution. Let's throw exception when fitIntercept=false and regParam != 0.0.
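
A minimal sketch of the guard this implies (names follow the discussion above; the message wording is an assumption, not the merged text):

// Sketch only: reject regularized regression through the origin when the
// label is constant, regardless of the standardization flag.
def checkConstantLabel(rawYStd: Double, fitIntercept: Boolean, regParam: Double): Unit = {
  if (rawYStd == 0.0 && !fitIntercept) {
    require(regParam == 0.0,
      "The standard deviation of the label is zero; regularized regression " +
        "through the origin is ill-defined.")
  }
}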

…se. Also, if yMean==0 and yStd==0, no training is done.
@SparkQA commented on Jan 31, 2016

Test build #50450 has finished for PR 10702 at commit e83b822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iyounus (Contributor, author) commented on Jan 31, 2016

I've completed this PR. I think all the tests are there. Here, I'm going to document a couple of minor issues just for future reference.

Issue 1
For the case when yStd = 0 and fitIntercept = false, we have four possibilities (regParam: zero/non-zero and standardization: true/false). Using WeightedLeastSquares (normal solver), I get the following results:

// data used for the following results
val df = sc.parallelize(Seq(
  (17.0, Vectors.dense(0.0, 5.0)),
  (17.0, Vectors.dense(1.0, 7.0)),
  (17.0, Vectors.dense(2.0, 11.0)),
  (17.0, Vectors.dense(3.0, 13.0))
), 2).toDF("label", "features")

Coefficients obtained from WeightedLeastSquares:
(1) reg: 0.0, standardization: false
--------> 0.0 [-9.508474576271158,3.457627118644062]

(2) reg: 0.0, standardization: true
--------> 0.0 [-9.508474576271158,3.457627118644062]

(3) reg: 0.1, standardization: false
--------> 0.0 [-7.134240246406588,3.010780287474336]

(4) reg: 0.1, standardization: true
--------> 0.0 [-5.730337078651679,2.7219101123595495]

This is with L2 regularization, ignoring standardization of the label in case (4). For case (4) we throw an error because the problem is ill-defined, so the user never sees these results.

For case (3), even though standardization is false, the label is still standardized, because standardizeLabel is hardwired to true when WeightedLeastSquares is called from the LinearRegression class. Therefore an error is thrown in this case too, which in my opinion is not the right thing to do, because an analytical solution does exist for this case.
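
For reference, a sketch of how these numbers can be reproduced against the df defined above (standard LinearRegression setters; treat the output as illustrative, since it depends on the Spark version):

import org.apache.spark.ml.regression.LinearRegression

// Case (3): regParam = 0.1, standardization = false, fitIntercept = false,
// solved via the normal equation (WeightedLeastSquares).
val lr = new LinearRegression()
  .setFitIntercept(false)
  .setRegParam(0.1)
  .setStandardization(false)
  .setSolver("normal")
val model = lr.fit(df)
println((model.intercept, model.coefficients))

Switching to .setSolver("l-bfgs") produces the corresponding numbers listed under Issue 2 below.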

Issue 2
Again, for the case when yStd = 0 and fitIntercept = false, I get the following results using the l-bfgs solver:

(1) reg: 0.0, standardization: false
--------> 0.0 [-9.508474576271176,3.4576271186440652]

(2) reg: 0.0, standardization: true
--------> 0.0 [-9.508474576271176,3.4576271186440652]

(3) reg: 0.1, standardization: false
--------> 0.0 [-9.327614273741196,3.423618722197146]

(4) reg: 0.1, standardization: true
--------> 0.0 [-9.08129403505256,3.374915377479131]

Here, results (1) and (2) are identical to what we get from WeightedLeastSquares as expected. Case (4) is ill-defined and we throw an error.

Now, for case (3), the numerical values differ from WeightedLeastSquares. This is because we scale the label by yStd, which falls back to |yMean| here. Otherwise, the values obtained from l-bfgs are identical to WeightedLeastSquares. Note that the user will not see these values, because an error is thrown for this case instead.

Issue 3
The normal equation with regularization (ridge regression) gives significantly different results compared to case (3) above. Here is my R code with results:

ridge_regression <- function(A, b, lambda, intercept=TRUE){
    if (intercept) {
        # prepend a column of ones and leave the intercept unpenalized
        A = cbind(rep(1.0, length(b)), A)
        I = diag(ncol(A))
        I[1,1] = 0.0
    } else {
        I = diag(ncol(A))
    }
    # solve (A^T A + lambda*I) w = A^T b via the Cholesky factorization
    R = chol( t(A) %*% A + lambda*I )
    z = solve(t(R), t(A) %*% b)
    w = solve(R, z)
    return(w)
}
A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
b <- c(17, 17, 17, 17)
df <- as.data.frame(cbind(A, b))

ridge_regression(A, b, 0.1, intercept = FALSE)

[1,] -8.783272
[2,]  3.321237

In my opinion, when standardization=false, the results from the normal solver should match this. Even though the user doesn't see this case, it gives me less confidence in the implementation of the normal equation, because it doesn't match this simple case. I also wrote about this at #10274.

}

// if y is constant (rawYStd is zero), then y cannot be scaled. In this case,
// setting yStd to |yMean| (or to 1.0 when yMean is also zero) ensures that y
// is effectively not scaled in the l-bfgs algorithm.
val yStd = if (rawYStd > 0) rawYStd else if (yMean != 0.0) math.abs(yMean) else 1.0
A Member commented:

val yStd = if (rawYStd > 0) rawYStd else math.abs(yMean)

since you already check the condition before.

// zero coefficient; as a result, training is not needed.
// Also, if yMean==0 and rawYStd==0, all the coefficients are zero regardless of
// the fitIntercept
logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
A Member commented:

Maybe you want to update the warning message for the second situation as well.

@dbtsai (Member) commented on Jan 31, 2016

LGTM except minor comments. Thanks.

@dbtsai (Member) commented on Jan 31, 2016

Commenting on your issues.

Issue 1:
With WeightedLeastSquares, we have the option to standardize the label and the features separately. As a result, if the label is not standardized, the problem can be solved even when yStd == 0.

As a result, in your case (4), where the label is not standardized but the features are, the problem is now defined, so the users should get the result.

For case (3), can you elaborate on why an analytical solution exists even though the label is standardized?

Issue 2:

In my opinion, even case (1) and case (2) are ill-defined, since GLMNET standardizes the label by default and will not return any result at all. It just happens that, without regularization, the solution is the same with or without label standardization, so we treat them as if the label were not standardized. This can explain your case (3).

Issue 3:

I think this is because your normal equation solver doesn't standardize the label, so the discrepancies occur.

@SparkQA commented on Jan 31, 2016

Test build #50454 has finished for PR 10702 at commit 0b16353.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented on Jan 31, 2016

Test build #50455 has finished for PR 10702 at commit c0744d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iyounus (Contributor, author) commented on Feb 1, 2016

For case (3), I'm assuming that the label and features are not standardized, so in that case the solution exists. Here is my perspective on this.

The normal equation X^T X β = X^T y has a unique solution if the matrix X is full rank. If it's not full rank, X^T X becomes singular and hence not invertible. With regularization, the equation (X^T X + λI) β = X^T y still has a unique solution as long as the matrix (X^T X + λI) is invertible. This is true even if the elements of y are constant. For the sample data I'm using above, I can solve this equation by hand and obtain β (with and without intercept). Any software must reproduce these results. Note that standardization is a completely independent operation; it's not part of the normal equation. So, if the user asks for no standardization, the software should produce the analytical solution.
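
To make the hand computation concrete, here is a small Breeze sketch (not part of the PR) that solves the regularized normal equation directly for the sample data above, with fitIntercept = false and lambda = 0.1:

import breeze.linalg.{DenseMatrix, DenseVector, diag}

val X = DenseMatrix((0.0, 5.0), (1.0, 7.0), (2.0, 11.0), (3.0, 13.0))
val y = DenseVector(17.0, 17.0, 17.0, 17.0)
// For this data, X^T X = [[14, 68], [68, 364]] and X^T y = [102, 612].
val lambda = 0.1
// Solve (X^T X + lambda * I) beta = X^T y directly.
val beta = (X.t * X + diag(DenseVector(lambda, lambda))) \ (X.t * y)
println(beta) // approx. DenseVector(-8.783272, 3.321237), matching the R result above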

@SparkQA commented on Feb 2, 2016

Test build #50512 has finished for PR 10702 at commit 2480dc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -74,7 +74,8 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
/**
* Set the regularization parameter.
* Default is 0.0.
A Member commented:

All the indentations you just added are off.

@dbtsai (Member) commented on Feb 2, 2016

For case (3), I agree with your argument completely. Can you try your normal-equation solution with L2 regularization and without any standardization (on data with nonzero yStd) and see if the result matches GLMNET? If I remember correctly, it will not match, since GLMNET internally standardizes labels even when one has standardization = false. If this is true, then when yStd = 0 it doesn't make sense to have a different rule; after all, GLMNET will just return an error in this case.

@@ -398,7 +422,8 @@ class LinearRegressionModel private[ml] (

/**
* Evaluates the model on a testset.
* @param dataset Test dataset to evaluate model on.
*
A Member commented:

ditto

@iyounus (Contributor, author) replied:

I really don't know how this has happened. :)
I've fixed these. All the indentation should be in order now.

@iyounus (Contributor, author) commented on Feb 2, 2016

GLMNET sets all coefficients to zero if yStd=0 and fitIntercept=false, regardless of standardization or regularization. That's why I cannot compare my normal equation with GLMNET.

@SparkQA commented on Feb 2, 2016

Test build #50575 has finished for PR 10702 at commit fd7eb99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented on Feb 2, 2016

I meant comparing the result with your solution when yStd != 0 and regParam != 0. I suspect that you will get a different result, since GLMNET forces standardization of the labels even when standardization is off.

@iyounus (Contributor, author) commented on Feb 3, 2016

For yStd != 0 and regParam != 0, my solution doesn't match GLMNET. I showed this comparison at #10274.

@dbtsai (Member) commented on Feb 3, 2016

Yes, that's what I meant. Without standardizing the labels there is no way to match glmnet, but this makes the problem ill-defined when yStd == 0.

@dbtsai (Member) commented on Feb 3, 2016

LGTM. Merged into master. Thanks.

@asfgit closed this in 0557146 on Feb 3, 2016