[SPARK-7780][MLLIB] Intercept in LogisticRegressionWithLBFGS should not be regularized #6386
Conversation
…n 2 we use the legacy implementation. Also allow pass through of initialWeights
…ctors instead of keeping track of class variable, pass through persistence information
Test build #33430 has finished for PR 6386 at commit
Test build #33431 has finished for PR 6386 at commit
```scala
  trainOnInstances(instances, handlePersistence)
}

protected[spark] def trainOnInstances(instances: RDD[(Double, Vector)],
```
Why don't we create a `DataFrame` in `LogisticRegressionWithLBFGS`, so there are no code changes in the `ml` package?
That's an option. I figured it would be better not to round-trip it through a `DataFrame`, since we would have to create a `SQLContext`, and the `ml` implementation just rips the `LabeledPoint`s out of the `DataFrame` as soon as it's passed in. But I can do it this way if that would be better :)
We may just pass an `RDD[LabeledPoint]` from `mllib` to `ml`, relying on the implicit conversion. In `ALS`, `mllib` calls the new implementation in `ml` the same way.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
I don't think that's what's going on under the hood in `ALS.scala` - it's passing in an RDD of `NewALS.Rating` case classes, which is the datatype that `NewALS`'s train function works on (although I'm never completely sure with implicits). If I simply try to pass the RDD of LabeledPoints it's a compile error (although I could be missing one of the implicit imports, but I'm not sure which one).
…om tests require that feature scaling is turned on to use ml implementation.
```scala
override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
  // ml's logistic regression only supports binary classification currently.
  if (numOfLinearPredictor == 1 && useFeatureScaling) {
    def runWithMlLogisitcRegression(elasticNetParam: Double) = {
```
How about this? Then we don't have to change the other side. Thanks.

```scala
val sqlContext = new SQLContext(input.context)
import sqlContext.implicits._
val lor = new org.apache.spark.ml.classification.LogisticRegression()
lor.fit(input.toDF())
```
We can do that; the only downside is that on the other side it's ripped back out right away. This would also lose the initial weights, but I could either modify the signature on the other side to take initial weights, or require that the initial weights are zero (which do you think is better)?
When people train LoR/LiR with multiple regularization lambdas for cross-validation, the training algorithm will start from the largest lambda and return the model. That model will be used as the initial condition for the second-largest lambda, and the process repeats until all the lambdas are trained. By using the previous model as the initial weights, the convergence will be way faster. http://www.jstatsoft.org/v33/i01/paper
As a result, in order to do so, we need the ability to specify initial weights. Feel free to add a private API to set the weights. If the dimension of the weights differs from the data, then we can use the default as the initial condition.
PS, once this private API is added, we can hook it up with the CrossValidation API to train multiple lambdas efficiently. Currently, with multiple lambdas, we train from scratch without using information from previous results. No JIRA for now; you can open one if you are interested in this.
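The regularization-path idea described above can be sketched outside of Spark. The toy one-feature gradient-descent solver and the `RegPath` object below are illustrative assumptions, not Spark's LBFGS implementation; the point is only that each lambda's fit starts from the previous solution.

```scala
// Toy regularization path: train with the largest lambda first and warm-start
// each subsequent fit from the previous solution.
object RegPath {
  // Gradient descent on L2-regularized logistic loss over (feature, label) pairs.
  def fit(data: Seq[(Double, Double)], lambda: Double, init: Double,
          steps: Int = 200, lr: Double = 0.1): Double = {
    var w = init
    for (_ <- 0 until steps) {
      // Average logistic-loss gradient plus the gradient of the L2 penalty.
      val grad = data.map { case (x, y) =>
        val p = 1.0 / (1.0 + math.exp(-w * x))
        (p - y) * x
      }.sum / data.size + lambda * w
      w -= lr * grad
    }
    w
  }

  // Descend the lambda grid, reusing each solution as the next initial weight.
  def path(data: Seq[(Double, Double)], lambdas: Seq[Double]): Seq[Double] = {
    var w = 0.0
    lambdas.sorted.reverse.map { lam =>
      w = fit(data, lam, init = w)
      w
    }
  }
}
```

With a descending grid, each fit starts near its own optimum, so far fewer iterations per lambda are needed than when training each one from scratch.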
I don't see where the passed `elasticNetParam` is used?
I think initial weights can be part of params, but we can do it in the next iteration when we work on the CrossValidation stuff. This regularization-path idea will be the same for Linear Regression with elastic net.
Test build #33454 has finished for PR 6386 at commit
Test build #33462 has finished for PR 6386 at commit
Test build #33479 has finished for PR 6386 at commit
```scala
// Extract columns from data. If dataset is persisted, do not persist oldDataset.

private var optInitialWeights: Option[Vector] = None

/** @group setParam */
def setInitialWeights(value: Vector): this.type = {
```
Let's have it private for now, since for the multinomial case the initial weights will be a matrix. Let's discuss the proper API using the `params` framework later.
…with the weights when they are user supplied, validate that the user-supplied weights are reasonable.
Test build #33495 has finished for PR 6386 at commit
Test build #33498 has finished for PR 6386 at commit
So I tried to run this over the weekend with a scaling factor of 0.1 and all of the tests (both on the old and the new branches) failed with OOM. I've decreased the scaling factor and I'll re-run this.
Which scaling factor of 0.1? Thanks.
Test build #34466 has finished for PR 6386 at commit
In config.py. I also ran into some failures because the auto build of the mllib tests in spark-perf doesn't seem to pass down the version number. I'm re-doing this with the correct mllib tests built and a scale factor of 0.01.
Oh, got you.
Ok, looks like it ran ok; I'll do another run against master. There's a bunch of FAILs, but looking at the spark-perf issues it seems like there are expected failures anyway (and tracing through config.py is not a lot of fun).
haha. or more simply, can you run LogisticRegressionWithLBFGS with/without the patch and post the run time? Thanks. btw, you can reuse the code for generating the synthetic dataset in spark-perf or in the spark mllib test.
So from running with a slightly larger scaling factor than I was using initially, it seems all the approaches are pretty similar in terms of time usage.

origin:

current pr (rt through dataframes):
glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=6 --random-seed=5 --num-examples=50000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=lbfgs --intercept=0.0 --epsilon=0.1 --loss=l2

pr without the rt through data frames:
glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=6 --random-seed=5 --num-examples=50000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=lbfgs --intercept=0.0 --epsilon=0.1 --loss=l2
Test build #36966 has finished for PR 6386 at commit
jenkins, retest this please.
Test build #40090 has finished for PR 6386 at commit
Test build #41765 has finished for PR 6386 at commit
Test build #43164 has finished for PR 6386 at commit
Test build #43167 has finished for PR 6386 at commit
Hey @dbtsai, I know this one has been around for a while and is certainly not going to make 1.6, but maybe, since people are starting to talk about 1.7, do you think you might have a chance to review it?
@mengxr what do you think? Should we fix the intercept issue in the old mllib version of LoR, or just deprecate it and educate users to use the new ml version?
I'd obviously like to get this one in if @mengxr or @jkbradley agree that it's worth merging a fix for.
…withLBFGS-should-not-be-regularized
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
…not be regularized

The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through `Updater`, and the `Updater` penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the implementation in ML from MLlib since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.

Previously partially reviewed at apache#6386 (comment); re-opening for dbtsai to review.

Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>

Closes apache#10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through `Updater`, and the `Updater` penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the implementation in ML from MLlib since the majority of users are still using the MLlib API.
Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.
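The difference can be shown with a minimal sketch. The `Penalty` object below and the convention of storing the intercept as the last coefficient are assumptions for illustration, not Spark's internal layout:

```scala
// Minimal sketch of the regularization difference described in this PR.
// The intercept is assumed to be stored as the last coefficient.
object Penalty {
  // Old Updater-style behavior: the L2 penalty covers every component,
  // intercept included.
  def l2All(coefficients: Array[Double], lambda: Double): Double =
    0.5 * lambda * coefficients.map(w => w * w).sum

  // ml-style behavior: the final entry (the intercept here) is excluded,
  // so only the feature weights are shrunk toward zero.
  def l2NoIntercept(coefficients: Array[Double], lambda: Double): Double =
    0.5 * lambda * coefficients.dropRight(1).map(w => w * w).sum
}
```

When lambda is zero, both penalties vanish, which is why the two implementations converge to the same solution in that case.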