
[SPARK-7780][MLLIB] Intercept in LogisticRegressionWithLBFGS should not be regularized #6386

Conversation

@holdenk (Contributor) commented May 24, 2015

The intercept in logistic regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through the Updater, and the Updater penalizes all components without excluding the intercept, which results in poor training accuracy when regularization is used.
The new implementation in the ML framework handles this properly, so we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API.
Note that both of them perform feature scaling to improve convergence; the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.
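To illustrate the issue, here is a minimal plain-Scala sketch (not MLlib's actual Updater API; the name `l2PenaltyGradient` is hypothetical). An L2 penalty applied uniformly also shrinks the intercept, while the corrected behavior leaves the intercept out of the penalty:

```scala
// Hypothetical sketch: the weight vector is laid out as
// (w_0, ..., w_{d-1}, intercept), with the intercept in the last slot.
object L2PenaltySketch {
  // Gradient contribution of the L2 penalty lambda/2 * ||w||^2.
  // MLlib's Updater effectively penalizes every component, including the
  // intercept; the ml implementation excludes it.
  def l2PenaltyGradient(weights: Array[Double], lambda: Double,
                        excludeIntercept: Boolean): Array[Double] =
    weights.zipWithIndex.map { case (w, i) =>
      val isIntercept = i == weights.length - 1
      if (isIntercept && excludeIntercept) 0.0 else lambda * w
    }
}
```

Note that with lambda = 0 both variants contribute a zero penalty gradient, which is why the two implementations converge to the same solution in that case.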

@SparkQA commented May 24, 2015

Test build #33430 has finished for PR 6386 at commit a619d42.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 24, 2015

Test build #33431 has finished for PR 6386 at commit 4febcc3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

trainOnInstances(instances, handlePersistence)
}

protected[spark] def trainOnInstances(instances: RDD[(Double, Vector)],
Member:

Why don't we create the DataFrame in LogisticRegressionWithLBFGS, so that there are no code changes in the ml package?

Contributor (author):

That's an option. I figured it would be better not to round-trip it through a DataFrame, since we would have to create a SQLContext, and the ml implementation just rips the LabeledPoints back out of the DataFrame as soon as it's passed in. But I can do it this way if that would be better :)

Member:

We may just pass an RDD[LabeledPoint] from mllib to ml, relying on the implicit conversion. In ALS, mllib calls the new implementation in ml the same way.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala

Contributor (author):

I don't think that's what's going on under the hood in ALS.scala: it's passing in an RDD of NewALS.Rating case classes, which is the datatype that NewALS's train function works on (although I'm never completely sure with implicits). If I simply try to pass in the RDD of LabeledPoints, it's a compile error (I could be missing one of the implicit imports, but I'm not sure which one).

…om tests require that feature scaling is turned on to use ml implementation.
@holdenk holdenk changed the title [Spark-7780][WIP] Intercept in logisticregressionwith lbfgs should not be regularized [Spark-7780][MLLIB] Intercept in logisticregressionwith lbfgs should not be regularized May 24, 2015
override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
// ml's logistic regression only supports binary classification currently.
if (numOfLinearPredictor == 1 && useFeatureScaling) {
def runWithMlLogisitcRegression(elasticNetParam: Double) = {
Member:

How about the following? Then we don't have to change the other side. Thanks.

    val sqlContext = new SQLContext(input.context)
    import sqlContext.implicits._
    val lor = new org.apache.spark.ml.classification.LogisticRegression()
    lor.fit(input.toDF())

Contributor (author):

We can do that; the only downside is that on the other side it's ripped back out right away. This would also lose the initial weights, but I could either modify the signature on the other side to take initial weights or require that the initial weights be zero (which do you think is better)?

Member:

When people train LoR/LiR with multiple regularization lambdas for cross-validation, the training algorithm starts from the largest lambda and returns the model. That model is then used as the initial condition for the second-largest lambda, and the process is repeated until all the lambdas are trained. By using the previous model as the initial weights, the convergence rate will be way faster. http://www.jstatsoft.org/v33/i01/paper

As a result, in order to do so, we need the ability to specify initial weights. Feel free to add a private API to set the weights. If the dimension of the weights differs from the data, we can use the default as the initial condition.

PS: once this private API is added, we can hook it up with the CrossValidation API to train multiple lambdas efficiently. Currently, with multiple lambdas, we train from scratch without using the information from previous results. There's no JIRA for this yet; you can open one if you are interested.
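The warm-start idea above can be sketched with a toy one-dimensional ridge problem (plain Scala; all names hypothetical, not the PR's code): solve for the largest lambda first, then seed each subsequent solve with the previous solution.

```scala
// Hypothetical sketch of a regularization path with warm starts.
object WarmStartPath {
  // Toy 1-D ridge problem: minimize (w - 1)^2 + lambda * w^2 by gradient
  // descent from a given starting point; returns (solution, iterations used).
  // The closed-form optimum is w* = 1 / (1 + lambda).
  def solve(lambda: Double, init: Double, tol: Double = 1e-10): (Double, Int) = {
    var w = init
    var iters = 0
    var grad = 2.0 * (w - 1.0) + 2.0 * lambda * w
    while (math.abs(grad) > tol && iters < 100000) {
      w -= 0.1 * grad
      iters += 1
      grad = 2.0 * (w - 1.0) + 2.0 * lambda * w
    }
    (w, iters)
  }

  // Train from the largest lambda down, reusing each solution as the
  // initial condition for the next (smaller) lambda.
  def path(lambdas: Seq[Double]): Seq[Double] = {
    var init = 0.0
    lambdas.sorted(Ordering[Double].reverse).map { lambda =>
      val (w, _) = solve(lambda, init)
      init = w // warm start for the next, smaller lambda
      w
    }
  }
}
```

Because consecutive lambdas have nearby optima, each warm-started solve needs no more iterations than a cold start from zero.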

Member:

I don't see where the passed elasticNetParam is used?

Contributor (author):

@dbtsai that sounds fun; I've added a JIRA to track doing that. For the first part (i.e., for now) I just have it defined on LogisticRegression, but I could move it to params (currently no vector param exists, but I can add one).
@viirya good point, I've added the set call.

Member:

I think the initial weights can be part of params, but we can do that in the next iteration when we work on the CrossValidation stuff. This regularization path idea will be the same for Linear Regression with elastic net.

@SparkQA commented May 24, 2015

Test build #33454 has finished for PR 6386 at commit e8e03a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 25, 2015

Test build #33462 has finished for PR 6386 at commit 38a024b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 25, 2015

Test build #33479 has finished for PR 6386 at commit 478b8c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Extract columns from data. If dataset is persisted, do not persist oldDataset.
private var optInitialWeights: Option[Vector] = None
/** @group setParam */
def setInitialWeights(value: Vector): this.type = {
Member:

Let's keep it private for now, since for the multinomial case the initial weights will be a matrix. Let's discuss the proper API using the params framework later.
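A sketch of how such a guarded initial-weights hook could look (plain Scala, with Array[Double] standing in for mllib's Vector; in the PR itself the setter would be private, but it is left public here so the sketch runs stand-alone):

```scala
// Hypothetical sketch: store optional user-supplied initial weights and fall
// back to the zero vector when their dimension doesn't match the data.
class InitialWeightsSketch(numFeatures: Int) {
  private var optInitialWeights: Option[Array[Double]] = None

  def setInitialWeights(value: Array[Double]): this.type = {
    optInitialWeights = Some(value)
    this
  }

  // Use the supplied weights only when the dimension matches; otherwise use
  // the default (all-zeros) initial condition, as suggested in the review.
  def resolvedInitialWeights: Array[Double] =
    optInitialWeights.filter(_.length == numFeatures)
      .getOrElse(Array.fill(numFeatures)(0.0))
}
```

The dimension check means mismatched user input silently degrades to the default rather than failing mid-training.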

…with the weights when they are user supplied, validate that the user-supplied weights are reasonable.
@SparkQA commented May 26, 2015

Test build #33495 has finished for PR 6386 at commit 08589f5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 26, 2015

Test build #33498 has finished for PR 6386 at commit f40c401.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Jun 8, 2015

So I tried to run this over the weekend with a scaling factor of 0.1, and all of the tests (on both the old and the new branches) failed with OOM. I've decreased the scaling factor and will re-run this.

@dbtsai (Member) commented Jun 8, 2015

Which scaling factor of 0.1? Thanks.

@SparkQA commented Jun 8, 2015

Test build #34466 has finished for PR 6386 at commit 3ac02d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Jun 9, 2015

In config.py. I also ran into some failures because the auto-build of the mllib tests in spark-perf doesn't seem to pass down the version number. I'm re-doing this with the correct mllib tests built and a scale factor of 0.01.

@dbtsai (Member) commented Jun 9, 2015

Oh, got you.

@holdenk (Contributor, Author) commented Jun 9, 2015

OK, looks like it ran fine; I'll do another run against master. There's a bunch of FAILs, but looking at the spark-perf issues it seems those failures are expected anyway (and tracing through config.py is not a lot of fun).

@dbtsai (Member) commented Jun 9, 2015

Haha. Or, more simply, can you run LogisticRegressionWithLBFGS with and without the patch and post the run times? Thanks.

BTW, you can reuse the code for generating the synthetic dataset in spark-perf or in the Spark mllib tests.

@holdenk (Contributor, Author) commented Jun 16, 2015

So, from running with a slightly larger scaling factor than I used initially, it seems all the approaches are pretty similar in terms of run time.

origin:
glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=6 --random-seed=5 --num-examples=50000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=lbfgs --intercept=0.0 --epsilon=0.1 --loss=l2
Training time: 4.5985, 0.175, 4.259, 4.452, 4.579
Test time: 0.1255, 0.007, 0.117, 0.125, 0.117
Training Set Metric: 33.2738700623, 0.133, 33.0214743257, 33.4757304367, 33.2006471123
Test Set Metric: 33.1470085163, 0.274, 32.8030691787, 33.372759872, 33.0449774996
glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=6 --random-seed=5 --num-examples=50000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=lbfgs --intercept=0.0 --epsilon=0.1 --loss=elastic-net
Training time: 4.5985, 0.175, 4.259, 4.452, 4.579
Test time: 0.1255, 0.007, 0.117, 0.125, 0.117
Training Set Metric: 33.2738700623, 0.133, 33.0214743257, 33.4757304367, 33.2006471123
Test Set Metric: 33.1470085163, 0.274, 32.8030691787, 33.372759872, 33.0449774996

Current PR (round-trip through DataFrames):

glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=6 --random-seed=5 --num-examples=50000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=lbfgs --intercept=0.0 --epsilon=0.1 --loss=l2
Training time: 4.382, 0.486, 3.937, 5.056, 3.937
Test time: 0.1255, 0.012, 0.114, 0.124, 0.115
Training Set Metric: 33.2738700623, 0.133, 33.0214743257, 33.4757304367, 33.2006471123
Test Set Metric: 33.1470085163, 0.274, 32.8030691787, 33.372759872, 33.0449774996
glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=6 --random-seed=5 --num-examples=50000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=lbfgs --intercept=0.0 --epsilon=0.1 --loss=elastic-net
Training time: 4.382, 0.486, 3.937, 5.056, 3.937
Test time: 0.1255, 0.012, 0.114, 0.124, 0.115
Training Set Metric: 33.2738700623, 0.133, 33.0214743257, 33.4757304367, 33.2006471123
Test Set Metric: 33.1470085163, 0.274, 32.8030691787, 33.372759872, 33.0449774996

Current PR without the round-trip through DataFrames:

glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=6 --random-seed=5 --num-examples=50000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=lbfgs --intercept=0.0 --epsilon=0.1 --loss=l2
Training time: 4.3305, 0.374, 4.049, 5.034, 4.049
Test time: 0.1225, 0.011, 0.119, 0.149, 0.119
Training Set Metric: 33.2738700623, 0.133, 33.0214743257, 33.4757304367, 33.2006471123
Test Set Metric: 33.1470085163, 0.274, 32.8030691787, 33.372759872, 33.0449774996
glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=6 --random-seed=5 --num-examples=50000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=lbfgs --intercept=0.0 --epsilon=0.1 --loss=elastic-net
Training time: 4.3305, 0.374, 4.049, 5.034, 4.049
Test time: 0.1225, 0.011, 0.119, 0.149, 0.119
Training Set Metric: 33.2738700623, 0.133, 33.0214743257, 33.4757304367, 33.2006471123
Test Set Metric: 33.1470085163, 0.274, 32.8030691787, 33.372759872, 33.0449774996

@SparkQA commented Jul 9, 2015

Test build #36966 has finished for PR 6386 at commit d1ce12b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Aug 6, 2015

jenkins, retest this please.

@SparkQA commented Aug 6, 2015

Test build #40090 has finished for PR 6386 at commit d1ce12b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 28, 2015

Test build #41765 has finished for PR 6386 at commit 8ca0fa9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 1, 2015

Test build #43164 has finished for PR 6386 at commit 6f66f2c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class UnaryMathExpression(val f: Double => Double, name: String)
    • case class Ceil(child: Expression) extends UnaryMathExpression(math.ceil, "CEIL")
    • case class Floor(child: Expression) extends UnaryMathExpression(math.floor, "FLOOR")

@SparkQA commented Oct 2, 2015

Test build #43167 has finished for PR 6386 at commit 0cedd50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Nov 8, 2015

Hey @dbtsai, I know this one has been around for a while and is certainly not going to make 1.6, but since people are starting to talk about 1.7, do you think you might have a chance to review it?

@dbtsai (Member) commented Nov 23, 2015

@mengxr what do you think? Should we fix the intercept issue in the old mllib version of LoR, or just deprecate it and educate users to use the new ml version?

@holdenk (Contributor, Author) commented Dec 1, 2015

I'd obviously like to get this one in, if @mengxr or @jkbradley agree that it's worth merging a fix for.

@rxin (Contributor) commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

@asfgit closed this in 7b4452b Dec 31, 2015
@jkbradley (Member):

@dbtsai @holdenk I don't think we currently have the bandwidth to review this, but if @dbtsai does, then it'd be nice to merge. But I don't think it's critical since we're moving quickly towards spark.ml being the primary API.

ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 27, 2016
…not be regularized

The intercept in logistic regression represents a prior on the categories and should not be regularized. In MLlib, regularization is handled through the Updater, and the Updater penalizes all components without excluding the intercept, which results in poor training accuracy when regularization is used.
The new implementation in the ML framework handles this properly, so we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API.
Note that both of them perform feature scaling to improve convergence; the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.

Previously partially reviewed at apache#6386 (comment); re-opening for dbtsai to review.

Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>

Closes apache#10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.