
[Spark-12732][ML] bug fix in linear regression train #10702

Closed
iyounus wants to merge 8 commits

Conversation

@iyounus (Contributor) commented on Jan 11, 2016

Fixed the bug in linear regression train for the case when the target variable is constant. The two cases, fitIntercept=true and fitIntercept=false, should be treated differently.
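
A condensed sketch of the intended handling, for illustration only (the names mirror the snippets reviewed below; this is not the merged code):

def constantLabelYStd(rawYStd: Double, yMean: Double, fitIntercept: Boolean): Option[Double] = {
  if (rawYStd > 0.0) {
    Some(rawYStd) // ordinary case: scale the label by its true standard deviation
  } else if (fitIntercept) {
    // Constant label with an intercept: the optimum is zero coefficients with
    // intercept = yMean, so no training is needed and no label scale is required.
    None
  } else {
    // Constant label through the origin: train on an effectively unscaled label
    // by falling back to |yMean|, or to 1.0 when the label is all zeros.
    Some(if (yMean != 0.0) math.abs(yMean) else 1.0)
  }
}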

@@ -219,33 +219,41 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
     }

     val yMean = ySummarizer.mean(0)
-    val yStd = math.sqrt(ySummarizer.variance(0))
+    var yStd = math.sqrt(ySummarizer.variance(0))
A Member commented:

Can you put all this logic up here?

val rawYStd = math.sqrt(ySummarizer.variance(0))
val yStd = if (rawYStd > 0.0 || fitIntercept) {
  rawYStd
} else {
  logWarning(...)
  1.0
}

A Member commented:

Also, when rawYStd == 0.0, standardization == true and regParam != 0.0, the problem will be ill-defined. We may want to throw an exception.

@srowen (Member) commented:

@iyounus what do you think of these comments?

@iyounus (Contributor, author) replied:

@srowen I agree with these comments. I'm working on WeightedLeastSquares (#10274), and there are some commonalities between these two issues. Once I've completed all the tests for WeightedLeastSquares, I'll complete this issue.

Some of the tests for LinearRegression use both the normal and l-bfgs solvers. It would be nice if the WeightedLeastSquares issue were merged first so that I can write the tests in a similar way.

@SparkQA commented on Jan 11, 2016

Test build #49163 has finished for PR 10702 at commit 944be71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented on Jan 20, 2016

@dbtsai Do you have time to make a pass?

@iyounus (Contributor, author) commented on Jan 20, 2016

@mengxr I haven't implemented the changes suggested by @dbtsai and @srowen yet. I think the solution I originally proposed for this issue may not be very suitable. I'll make some updates to this PR either today or tomorrow.

@dbtsai (Member) commented on Jan 20, 2016

@mengxr This PR is also on my radar. I'm working on another PR now; once @iyounus is ready, I will work on this.

@SparkQA commented on Jan 20, 2016

Test build #49816 has finished for PR 10702 at commit 23ce5f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…0, standardization=true and fitIntercept=false. added test for this case. Modified existing test for constant label.
@iyounus (Contributor, author) commented on Jan 21, 2016

I've added an exception for the case when the label is constant, standardization == true, and regParam != 0.0. I've also added a test for this case.

I cannot test the case standardizeLabel = false with non-zero regParam against the analytic normal-equation solution, because the normal equation gives me quite different results.

Even for the case when regParam = 0 and fitIntercept = false, the results from the normal equation are slightly different from glmnet and Spark. Please see my detailed comment on this comparison at #10274.

@SparkQA commented on Jan 21, 2016

Test build #49889 has finished for PR 10702 at commit 5803bd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented on Jan 27, 2016

@iyounus For standardizeLabel = false/true with non-zero regParam, let's throw the exception. I explained the mismatch against the analytic normal equation in the other PR.

Thanks.


test("regularized linear regression through origin with constant label") {
// The problem is ill-defined if fitIntercept=false, regParam is non-zero and \
// standardization=true. An exception is thrown in this case.
A Member commented:

When standardization=false, the problem is still ill-defined since GLMNET always standardizes the labels. That's why you see it in the analytical solution. Let's throw exception when fitIntercept=false and regParam != 0.0.
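
A minimal sketch of the guard this implies (names follow the discussion above; the message wording is an assumption, not the merged text):

// Sketch only: reject regularized regression through the origin when the
// label is constant, regardless of the standardization flag.
def checkConstantLabel(rawYStd: Double, fitIntercept: Boolean, regParam: Double): Unit = {
  if (rawYStd == 0.0 && !fitIntercept) {
    require(regParam == 0.0,
      "The standard deviation of the label is zero; regularized regression " +
        "through the origin is ill-defined.")
  }
}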

…se. Also, if yMean==0 and yStd==0, no training is done.
@SparkQA commented on Jan 31, 2016

Test build #50450 has finished for PR 10702 at commit e83b822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iyounus (Contributor, author) commented on Jan 31, 2016

I've completed this PR. I think all the tests are there. Here, I'm going to document a couple of minor issues just for future reference.

Issue 1
For the case when yStd = 0 and fitIntercept = false, we have four possibilities (regParam: zero/non-zero and standardization: true/false). Using WeightedLeastSquares (normal solver), I get the following results:

// data used for the following results
val df = sc.parallelize(Seq(
  (17.0, Vectors.dense(0.0, 5.0)),
  (17.0, Vectors.dense(1.0, 7.0)),
  (17.0, Vectors.dense(2.0, 11.0)),
  (17.0, Vectors.dense(3.0, 13.0))
), 2).toDF("label", "features")

Coefficients obtained from WeightedLeastSquares:
(1) reg: 0.0, standardization: false
--------> 0.0 [-9.508474576271158,3.457627118644062]

(2) reg: 0.0, standardization: true
--------> 0.0 [-9.508474576271158,3.457627118644062]

(3) reg: 0.1, standardization: false
--------> 0.0 [-7.134240246406588,3.010780287474336]

(4) reg: 0.1, standardization: true
--------> 0.0 [-5.730337078651679,2.7219101123595495]

This is with L2 regularization, ignoring standardization of the label in case (4). For case (4) we throw an error because the problem is ill-defined, so the user never sees these results.

For case (3), even though standardization is false, the label is still standardized, because standardizeLabel is hardwired to true when WeightedLeastSquares is called from the LinearRegression class. Therefore an error is thrown in this case too, which in my opinion is not the right thing to do, because an analytical solution does exist for this case.
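
For reference, a sketch of how these numbers can be reproduced against the df defined above (standard LinearRegression setters; treat the output as illustrative, since it depends on the Spark version):

import org.apache.spark.ml.regression.LinearRegression

// Case (3): regParam = 0.1, standardization = false, fitIntercept = false,
// solved via the normal equation (WeightedLeastSquares).
val lr = new LinearRegression()
  .setFitIntercept(false)
  .setRegParam(0.1)
  .setStandardization(false)
  .setSolver("normal")
val model = lr.fit(df)
println((model.intercept, model.coefficients))

Switching to .setSolver("l-bfgs") produces the corresponding numbers listed under Issue 2 below.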

Issue 2
Again, for the case when yStd = 0 and fitIntercept = false, I get the following results using the l-bfgs solver:

(1) reg: 0.0, standardization: false
--------> 0.0 [-9.508474576271176,3.4576271186440652]

(2) reg: 0.0, standardization: true
--------> 0.0 [-9.508474576271176,3.4576271186440652]

(3) reg: 0.1, standardization: false
--------> 0.0 [-9.327614273741196,3.423618722197146]

(4) reg: 0.1, standardization: true
--------> 0.0 [-9.08129403505256,3.374915377479131]

Here, results (1) and (2) are identical to what we get from WeightedLeastSquares as expected. Case (4) is ill-defined and we throw an error.

Now, for case (3), the numerical values differ from WeightedLeastSquares. This is because we scale the label by yStd, which falls back to |yMean| here. Otherwise, the values obtained from l-bfgs are identical to WeightedLeastSquares. Note that the user will not see these values, because an error is thrown for this case instead.

Issue 3
The normal equation with regularization (ridge regression) gives significantly different results compared to case (3) above. Here is my R code with results:

ridge_regression <- function(A, b, lambda, intercept=TRUE){
    if (intercept) {
        # prepend a column of ones and leave the intercept unpenalized
        A = cbind(rep(1.0, length(b)), A)
        I = diag(ncol(A))
        I[1,1] = 0.0
    } else {
        I = diag(ncol(A))
    }
    # solve (A^T A + lambda*I) w = A^T b via the Cholesky factorization
    R = chol( t(A) %*% A + lambda*I )
    z = solve(t(R), t(A) %*% b)
    w = solve(R, z)
    return(w)
}
A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
b <- c(17, 17, 17, 17)
df <- as.data.frame(cbind(A, b))

ridge_regression(A, b, 0.1, intercept = FALSE)

[1,] -8.783272
[2,]  3.321237

In my opinion, when standardization=false, the results from the normal solver should match this. Even though the user doesn't see this case, it gives me less confidence in the implementation of the normal equation, because it doesn't match this simple case. I also wrote about this at #10274.

}

// if y is constant (rawYStd is zero), then y cannot be scaled. In this case,
// setting yStd to |yMean| (or to 1.0 when yMean is also zero) ensures that y
// is effectively not scaled in the l-bfgs algorithm.
val yStd = if (rawYStd > 0) rawYStd else if (yMean != 0.0) math.abs(yMean) else 1.0
A Member commented:

val yStd = if (rawYStd > 0) rawYStd else math.abs(yMean)

since you already check the condition before.

// zero coefficient; as a result, training is not needed.
// Also, if yMean==0 and rawYStd==0, all the coefficients are zero regardless of
// the fitIntercept
logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
A Member commented:

Maybe you want to update the warning message for the second situation as well.

@dbtsai (Member) commented on Jan 31, 2016

LGTM except minor comments. Thanks.

@dbtsai (Member) commented on Jan 31, 2016

Commenting on your issues.

Issue 1:
With WeightedLeastSquares, we have the option to standardize the label and the features separately. As a result, if the label is not standardized, the problem can be solved even when yStd == 0.

As a result, in your case (4), where the label is not standardized but the features are, the problem is now defined, so the users should get the result.

For case (3), can you elaborate on why an analytical solution exists even though the label is standardized?

Issue 2:

In my opinion, even case (1) and case (2) are ill-defined, since GLMNET standardizes the label by default and will not return any result at all. It just happens that, without regularization, the solution is the same with or without label standardization, so we treat them as if the label were not standardized. This can explain your case (3).

Issue 3:

I think this is because your normal equation solver doesn't standardize the label, so the discrepancies occur.

@SparkQA commented on Jan 31, 2016

Test build #50454 has finished for PR 10702 at commit 0b16353.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented on Jan 31, 2016

Test build #50455 has finished for PR 10702 at commit c0744d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iyounus (Contributor, author) commented on Feb 1, 2016

For case (3), I'm assuming that the label and features are not standardized, so in that case the solution exists. Here is my perspective on this.

The normal equation X^T X β = X^T y has a unique solution if the matrix X is full rank. If it's not full rank, X^T X becomes singular and hence not invertible. With regularization, the equation (X^T X + λI) β = X^T y still has a unique solution as long as the matrix (X^T X + λI) is invertible. This is true even if the elements of y are constant. For the sample data I'm using above, I can solve this equation by hand and obtain β (with and without intercept). Any software must reproduce these results. Note that standardization is a completely independent operation; it's not part of the normal equation. So, if the user asks for no standardization, the software should produce the analytical solution.
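
To make the hand computation concrete, here is a small Breeze sketch (not part of the PR) that solves the regularized normal equation directly for the sample data above, with fitIntercept = false and lambda = 0.1:

import breeze.linalg.{DenseMatrix, DenseVector, diag}

val X = DenseMatrix((0.0, 5.0), (1.0, 7.0), (2.0, 11.0), (3.0, 13.0))
val y = DenseVector(17.0, 17.0, 17.0, 17.0)
// For this data, X^T X = [[14, 68], [68, 364]] and X^T y = [102, 612].
val lambda = 0.1
// Solve (X^T X + lambda * I) beta = X^T y directly.
val beta = (X.t * X + diag(DenseVector(lambda, lambda))) \ (X.t * y)
println(beta) // approx. DenseVector(-8.783272, 3.321237), matching the R result above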

@SparkQA commented on Feb 2, 2016

Test build #50512 has finished for PR 10702 at commit 2480dc1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -74,7 +74,8 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
/**
* Set the regularization parameter.
* Default is 0.0.
A Member commented:

All the indentations you just added are off.

@dbtsai (Member) commented on Feb 2, 2016

For case (3), I agree with your argument completely. Can you try your normal-equation solution with L2 regularization and without any standardization (on data with nonzero yStd) and see if the result matches GLMNET? If I remember correctly, it will not match, since GLMNET internally standardizes labels even when one has standardization = false. If this is true, then when yStd = 0 it doesn't make sense to have a different rule; after all, GLMNET will just return an error in this case.

@@ -398,7 +422,8 @@ class LinearRegressionModel private[ml] (

/**
* Evaluates the model on a testset.
* @param dataset Test dataset to evaluate model on.
*
A Member commented:

ditto

@iyounus (Contributor, author) replied:

I really don't know how this has happened. :)
I've fixed these. All the indentation should be in order now.

@iyounus (Contributor, author) commented on Feb 2, 2016

GLMNET sets all coefficients to zero if yStd=0 and fitIntercept=false, regardless of standardization or regularization. That's why I cannot compare my normal equation with GLMNET.

@SparkQA commented on Feb 2, 2016

Test build #50575 has finished for PR 10702 at commit fd7eb99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented on Feb 2, 2016

I meant comparing the result with your solution when yStd != 0 and regParam != 0. I suspect that you will get a different result, since GLMNET forces standardization of the labels even when standardization is off.

@iyounus (Contributor, author) commented on Feb 3, 2016

For yStd != 0 and regParam != 0, my solution doesn't match GLMNET. I showed this comparison at #10274.

@dbtsai (Member) commented on Feb 3, 2016

Yes, that's what I meant. Without standardizing the labels there is no way to match glmnet, but this makes the problem ill-defined when yStd == 0.

@dbtsai (Member) commented on Feb 3, 2016

LGTM. Merged into master. Thanks.

@asfgit closed this in 0557146 on Feb 3, 2016