[SPARK-7780][MLLIB] Intercept in LogisticRegressionWithLBFGS should not be regularized #6386
Conversation
…n 2 we use the legacy implementation. Also allow pass through of initialWeights
…ctors instead of keeping track of class variable, pass through persistence information
Test build #33430 has finished for PR 6386 at commit
Test build #33431 has finished for PR 6386 at commit
```scala
  trainOnInstances(instances, handlePersistence)
}

protected[spark] def trainOnInstances(instances: RDD[(Double, Vector)],
```
Why don't we create a `DataFrame` in `LogisticRegressionWithLBFGS`, so there are no code changes in the `ml` package?
That's an option. I figured it would be better not to round-trip it through a `DataFrame`, since we would have to create a `SQLContext`, and the `ml` implementation just rips the `LabeledPoint`s out of the `DataFrame` as soon as it's passed in. But I can do it this way if that would be better :)
We may just pass an `RDD[LabeledPoint]` from `mllib` to `ml`, relying on the implicit conversion. In `ALS`, `mllib` calls the new implementation in `ml` the same way.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
I don't think that's what's going on under the hood in `ALS.scala` - it's passing in an RDD of `NewALS.Rating` case classes, which is the datatype that `NewALS`'s train function works on (although I'm never completely sure with implicits). If I simply try to pass the RDD of LabeledPoints it's a compile error (although I could be missing one of the implicit imports, but I'm not sure which one).
…om tests require that feature scaling is turned on to use ml implementation.
```scala
override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
  // ml's logistic regression only supports binary classification currently.
  if (numOfLinearPredictor == 1 && useFeatureScaling) {
    def runWithMlLogisitcRegression(elasticNetParam: Double) = {
```
How about this? Then we don't have to change the other side. Thanks.

```scala
val sqlContext = new SQLContext(input.context)
import sqlContext.implicits._
val lor = new org.apache.spark.ml.classification.LogisticRegression()
lor.fit(input.toDF())
```
We can do that; the only downside is that on the other side it's ripped back out right away. This would also lose the initial weights, but I could either modify the signature on the other side to take initial weights, or require that the initial weights are zero (which do you think is better)?
When people train LoR/LiR with multiple regularization lambdas for cross-validation, the training algorithm will start from the largest lambda and return the model. That model will be used as the initial condition for the second-largest lambda, and the process repeats until all the lambdas are trained. By using the previous model as the initial weights, the convergence will be way faster. http://www.jstatsoft.org/v33/i01/paper
As a result, in order to do so, we need the ability to specify initial weights. Feel free to add a private API to set the weights. If the dimension of the weights differs from the data, then we can use the default as the initial condition.
PS, once this private API is added, we can hook it up with the CrossValidation API to train multiple lambdas efficiently. Currently, with multiple lambdas, we train from scratch without using information from previous results. No JIRA for now; you can open one if you are interested in this.
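The regularization-path idea described above can be sketched outside of Spark. The toy one-feature gradient-descent solver and the `RegPath` object below are illustrative assumptions, not Spark's LBFGS implementation; the point is only that each lambda's fit starts from the previous solution.

```scala
// Toy regularization path: train with the largest lambda first and warm-start
// each subsequent fit from the previous solution.
object RegPath {
  // Gradient descent on L2-regularized logistic loss over (feature, label) pairs.
  def fit(data: Seq[(Double, Double)], lambda: Double, init: Double,
          steps: Int = 200, lr: Double = 0.1): Double = {
    var w = init
    for (_ <- 0 until steps) {
      // Average logistic-loss gradient plus the gradient of the L2 penalty.
      val grad = data.map { case (x, y) =>
        val p = 1.0 / (1.0 + math.exp(-w * x))
        (p - y) * x
      }.sum / data.size + lambda * w
      w -= lr * grad
    }
    w
  }

  // Descend the lambda grid, reusing each solution as the next initial weight.
  def path(data: Seq[(Double, Double)], lambdas: Seq[Double]): Seq[Double] = {
    var w = 0.0
    lambdas.sorted.reverse.map { lam =>
      w = fit(data, lam, init = w)
      w
    }
  }
}
```

With a descending grid, each fit starts near its own optimum, so far fewer iterations per lambda are needed than when training each one from scratch.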
I don't see where the passed `elasticNetParam` is used?
I think initial weights can be part of params, but we can do it in the next iteration when we work on the CrossValidation stuff. This regularization-path idea will be the same for Linear Regression with elastic net.
Test build #33454 has finished for PR 6386 at commit
Test build #33462 has finished for PR 6386 at commit
Test build #33479 has finished for PR 6386 at commit
```scala
// Extract columns from data. If dataset is persisted, do not persist oldDataset.

private var optInitialWeights: Option[Vector] = None

/** @group setParam */
def setInitialWeights(value: Vector): this.type = {
```
Let's have it private for now, since for the multinomial case the initial weights will be a matrix. Let's discuss the proper API using the `params` framework later.
…with the weights when they are user supplied, validate that the user-supplied weights are reasonable.
Test build #33495 has finished for PR 6386 at commit
Test build #33498 has finished for PR 6386 at commit
So I tried to run this over the weekend with a scaling factor of 0.1 and all of the tests (both on the old and the new branches) failed with OOM. I've decreased the scaling factor and I'll re-run this.
Which scaling factor of 0.1? Thanks.
Test build #34466 has finished for PR 6386 at commit
In config.py. I also ran into some failures because the auto build of the mllib tests in spark-perf doesn't seem to pass down the version number. I'm re-doing this with the correct mllib tests built and a scale factor of 0.01.
Oh, got you.
Ok, looks like it ran ok; I'll do another run against master. There's a bunch of FAILs, but looking at the spark-perf issues it seems like there are expected failures anyway (and tracing through config.py is not a lot of fun).
haha. or more simply, can you run LogisticRegressionWithLBFGS with/without the patch and post the run time? Thanks. btw, you can reuse the code for generating the synthetic dataset in spark-perf or in the spark mllib test.
So from running with a slightly larger scaling factor than I was using initially, it seems all the approaches are pretty similar in terms of time usage.

origin:

current pr (rt through dataframes):
glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=6 --random-seed=5 --num-examples=50000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=lbfgs --intercept=0.0 --epsilon=0.1 --loss=l2

pr without the rt through data frames:
glm-regression, glm-regression --num-trials=10 --inter-trial-wait=3 --num-partitions=6 --random-seed=5 --num-examples=50000 --num-features=10000 --num-iterations=20 --step-size=0.001 --reg-type=l2 --reg-param=0.1 --optimizer=lbfgs --intercept=0.0 --epsilon=0.1 --loss=l2
Test build #36966 has finished for PR 6386 at commit
jenkins, retest this please.
Test build #40090 has finished for PR 6386 at commit
Test build #41765 has finished for PR 6386 at commit
Test build #43164 has finished for PR 6386 at commit
Test build #43167 has finished for PR 6386 at commit
Hey @dbtsai, I know this one has been around for a while and is certainly not going to make 1.6, but maybe, since people are starting to talk about 1.7, do you think you might have a chance to review it?
@mengxr what do you think? Should we fix the intercept issue in the old mllib version of LoR, or just deprecate it and educate users to use the new ml version?
I'd obviously like to get this one in if @mengxr or @jkbradley agree that it's worth merging a fix for.
…withLBFGS-should-not-be-regularized
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
…not be regularized

The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through `Updater`, and the `Updater` penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the implementation in ML from MLlib since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.

Previously partially reviewed at apache#6386 (comment); re-opening for dbtsai to review.

Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>

Closes apache#10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through `Updater`, and the `Updater` penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the implementation in ML from MLlib since the majority of users are still using the MLlib API.
Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.
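The difference can be shown with a minimal sketch. The `Penalty` object below and the convention of storing the intercept as the last coefficient are assumptions for illustration, not Spark's internal layout:

```scala
// Minimal sketch of the regularization difference described in this PR.
// The intercept is assumed to be stored as the last coefficient.
object Penalty {
  // Old Updater-style behavior: the L2 penalty covers every component,
  // intercept included.
  def l2All(coefficients: Array[Double], lambda: Double): Double =
    0.5 * lambda * coefficients.map(w => w * w).sum

  // ml-style behavior: the final entry (the intercept here) is excluded,
  // so only the feature weights are shrunk toward zero.
  def l2NoIntercept(coefficients: Array[Double], lambda: Double): Double =
    0.5 * lambda * coefficients.dropRight(1).map(w => w * w).sum
}
```

When lambda is zero, both penalties vanish, which is why the two implementations converge to the same solution in that case.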