[SPARK-8601][ML] Add an option to disable standardization for linear regression #7037

holdenk · 2015-06-26T09:40:56Z

All compressed sensing applications, and some regression use cases, get better results with feature scaling turned off. However, if we implement this naively by training on the dataset without any standardization, convergence will be slow. Instead, we can still standardize the training dataset but penalize each component differently, which gives effectively the same objective function but a better-conditioned numerical problem. As a result, columns with high variance are penalized less, and vice versa. Without this option, all features are standardized and are therefore penalized equally.
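To make the reweighting idea concrete, here is a minimal sketch in plain Python. It is not the code from this patch. It assumes an L2 penalty only, two features, no intercept, and scaling by the standard deviation without centering; the helper `ridge_2d` and the toy data are hypothetical. Under those assumptions, fitting on scaled features with a per-component penalty of `lam / sigma_j**2` and then dividing each coefficient by `sigma_j` should reproduce the fit on the raw features with a uniform penalty `lam`.

```python
from statistics import pstdev


def ridge_2d(rows, y, penalties):
    """Closed-form minimizer of
    (1/2n) * sum((y_i - x_i . w)^2) + (1/2) * sum(p_j * w_j^2)
    for exactly two features and no intercept (hypothetical helper)."""
    n = len(rows)
    a11 = sum(r[0] * r[0] for r in rows) / n + penalties[0]
    a12 = sum(r[0] * r[1] for r in rows) / n
    a22 = sum(r[1] * r[1] for r in rows) / n + penalties[1]
    b1 = sum(r[0] * t for r, t in zip(rows, y)) / n
    b2 = sum(r[1] * t for r, t in zip(rows, y)) / n
    det = a11 * a22 - a12 * a12
    # Solve the 2x2 normal equations with Cramer's rule.
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


# Toy data: the second column has a much larger variance than the first.
rows = [(1.0, 100.0), (2.0, 250.0), (3.0, 180.0), (4.0, 420.0), (5.0, 390.0)]
y = [3.1, 6.9, 6.2, 12.4, 12.1]
lam = 0.5

sigma = [pstdev(col) for col in zip(*rows)]
scaled = [[r[j] / sigma[j] for j in range(2)] for r in rows]

# (a) Fit directly on the raw features with a uniform penalty.
direct = ridge_2d(rows, y, [lam, lam])

# (b) Fit on scaled features, penalizing high-variance columns less,
#     then map the coefficients back to the original scale.
w_hat = ridge_2d(scaled, y, [lam / s ** 2 for s in sigma])
recovered = [w_hat[j] / sigma[j] for j in range(2)]

# (c) Fit on scaled features with a uniform penalty (standardization on).
w_std = ridge_2d(scaled, y, [lam, lam])
naive = [w_std[j] / sigma[j] for j in range(2)]

print("raw features, uniform penalty:      ", direct)
print("scaled features, reweighted penalty:", recovered)
print("scaled features, uniform penalty:   ", naive)
```

Fits (a) and (b) minimize the same objective, so they should agree up to floating-point error, while (c) solves a different problem. The practical reason to prefer (b) over (a) is that an iterative optimizer sees better-conditioned features; the closed-form solve here does not show that speed difference.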

In R, there is an option for this:

> standardize
> Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".

SparkQA · 2015-06-26T09:45:11Z

Test build #35853 has finished for PR 7037 at commit e47c574.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-26T20:28:14Z

Test build #35879 has finished for PR 7037 at commit e54a8a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2015-06-30T07:17:28Z

mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

NAVER - http://www.naver.com/

Your mail to sujkh@naver.com, <Re: [spark] [SPARK-8601][ML] Add an option to disable standardization for linear regression (#7037)>, could not be delivered for the following reason.

The recipient has blocked your mail.

dbtsai · 2015-06-30T07:25:17Z

The tests don't cover all the cases, including with and without intercept. Also, for regParam = 0, the fits with and without standardization should converge to the same solution.
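A rough sketch of the kind of check being asked for, in plain Python rather than the Scala test suite. It assumes a single feature, no regularization (the regParam = 0 case), and scaling by the standard deviation only; `fit_1d` and `fit_with_scaling` are hypothetical helpers, not Spark APIs.

```python
from statistics import fmean, pstdev


def fit_1d(x, y, fit_intercept):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    if fit_intercept:
        mx, my = fmean(x), fmean(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        slope = sxy / sxx
        return slope, my - slope * mx
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return slope, 0.0


def fit_with_scaling(x, y, fit_intercept):
    """Fit on the feature divided by its standard deviation, then map the
    slope back to the original scale."""
    sigma = pstdev(x)
    slope, intercept = fit_1d([a / sigma for a in x], y, fit_intercept)
    return slope / sigma, intercept


x = [10.0, 20.0, 35.0, 50.0, 80.0]
y = [1.2, 1.9, 3.4, 5.1, 7.8]

# Cover both intercept settings; with no penalty the two fits should match.
results = {}
for fit_intercept in (True, False):
    raw = fit_1d(x, y, fit_intercept)
    scaled = fit_with_scaling(x, y, fit_intercept)
    results[fit_intercept] = (raw, scaled)
    print(f"fit_intercept={fit_intercept}: raw={raw}, scaled={scaled}")
```

With no penalty term, rescaling a feature only rescales its coefficient, so both intercept settings should give matching results. A fuller test would repeat the comparison for nonzero regularization, where the two settings are expected to differ.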

SparkQA · 2015-06-30T19:05:36Z

Test build #36173 has finished for PR 7037 at commit 99ce053.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-30T19:54:40Z

Test build #36177 has finished for PR 7037 at commit b83a41e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StreamingLinearAlgorithm(object):
- class StreamingLinearRegressionWithSGD(StreamingLinearAlgorithm):
- class SpecificOrdering extends $
- class SpecificProjection extends $
- final class SpecificRow extends $

holdenk · 2015-07-01T21:24:15Z

@dbtsai I've extended the test coverage.

dbtsai · 2015-07-01T21:32:51Z

@holdenk Cool. I'll work on this tonight. Thanks.

SparkQA · 2015-07-09T22:43:10Z

Test build #36975 has finished for PR 7037 at commit b83a41e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-10T00:47:35Z

Test build #36983 has finished for PR 7037 at commit eebe10a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class KafkaRDD(RDD):
- class KafkaDStream(DStream):
- class KafkaTransformedDStream(TransformedDStream):
- class GenericInternalRowWithSchema(values: Array[Any], override val schema: StructType)
- case class StreamInputInfo(

holdenk · 2015-07-10T01:06:12Z

jenkins, retest this please

SparkQA · 2015-07-10T01:46:27Z

Test build #36985 has finished for PR 7037 at commit eebe10a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class GenericInternalRowWithSchema(values: Array[Any], override val schema: StructType)

holdenk · 2015-07-17T03:49:28Z

jenkins, retest this please

SparkQA · 2015-07-17T04:31:11Z

Test build #29 has finished for PR 7037 at commit eebe10a.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-17T04:38:30Z

Test build #37572 has finished for PR 7037 at commit eebe10a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2015-07-21T06:51:17Z

@holdenk can you merge master? Thanks.

SparkQA · 2015-07-21T08:04:23Z

Test build #37926 has finished for PR 7037 at commit 6b1dc09.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk added 3 commits June 26, 2015 01:23

Add the param to the linearregression impl

00a1dc5

Add standardization param for linear regression

55d3a66

Add support for L2 without standardization.

e47c574

Fix long line

e54a8a9

holdenk changed the title ~~[SPARK-8601][ML][WIP] Add an option to disable standardization for linear regression~~ [SPARK-8601][ML] Add an option to disable standardization for linear regression Jun 29, 2015

dbtsai reviewed Jun 30, 2015

holdenk added 2 commits June 30, 2015 11:16

merge in master

99ce053

Remove extra line

0c334a2

Expand the tests and make them similar to the other PR also providing…

b83a41e

… an option to disable standardization (but for LoR).

holdenk added 2 commits July 9, 2015 17:04

merge

3f92935

Use same comparision operator throughout the test

eebe10a

Merge in master

332f140

Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-S…

6b1dc09

…park-8601-in-Linear_regression

holdenk closed this Aug 12, 2015
