Conversation

@holdenk
Contributor

@holdenk holdenk commented Jun 26, 2015

All compressed sensing applications, and some regression use cases, give better results with feature scaling turned off. However, if we implement this naively by training on the dataset without any standardization, the rate of convergence will be poor. Instead, we can still standardize the training dataset but penalize each component differently, which yields effectively the same objective function with a better-conditioned numerical problem. As a result, columns with high variance are penalized less, and vice versa. Without this option, all features are standardized and therefore penalized equally.
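The equivalence described above can be checked numerically. The following is a minimal sketch, not the code in this patch: it assumes an L2 penalty, no intercept, scaling by the population standard deviation only, and a made-up toy dataset. The helper names `column_std` and `objective` are hypothetical.

```python
import math

def column_std(rows, j):
    """Population standard deviation of column j (an assumption of this sketch)."""
    n = len(rows)
    mean = sum(r[j] for r in rows) / n
    return math.sqrt(sum((r[j] - mean) ** 2 for r in rows) / n)

def objective(rows, labels, weights, penalties, reg_param):
    """Squared-error loss plus an L2 penalty with one factor per component."""
    n = len(rows)
    loss = sum(
        (y - sum(w * x for w, x in zip(weights, r))) ** 2
        for r, y in zip(rows, labels)
    ) / (2.0 * n)
    reg = 0.5 * reg_param * sum(p * w * w for p, w in zip(penalties, weights))
    return loss + reg

# Toy data: the second column has a much larger variance than the first.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 500.0]]
labels = [1.5, 3.5, 3.0, 6.0]
reg_param = 0.1
weights = [0.7, 0.004]  # arbitrary weights on the original scale

stds = [column_std(rows, j) for j in range(2)]

# Standardized problem: scale each column by 1/std, each weight by std,
# and each penalty factor by 1/std^2.
scaled_rows = [[x / s for x, s in zip(r, stds)] for r in rows]
scaled_weights = [w * s for w, s in zip(weights, stds)]
scaled_penalties = [1.0 / (s * s) for s in stds]

# Uniform penalty on the original data vs. reweighted penalty on scaled data.
original = objective(rows, labels, weights, [1.0, 1.0], reg_param)
rescaled = objective(scaled_rows, labels, scaled_weights, scaled_penalties, reg_param)

print(original, rescaled)
print(scaled_penalties)
```

The two objective values agree up to floating-point error, and the penalty factor for the high-variance column is the smaller one, so the optimizer can work on well-scaled features while minimizing the unstandardized objective.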

In R, there is an option for this.
standardize
Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35853 has finished for PR 7037 at commit e47c574.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 26, 2015

Test build #35879 has finished for PR 7037 at commit e54a8a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk changed the title [SPARK-8601][ML][WIP] Add an option to disable standardization for linear regression [SPARK-8601][ML] Add an option to disable standardization for linear regression Jun 29, 2015
Member

extra line

NAVER - http://www.naver.com/

Your mail <Re: [spark] [SPARK-8601][ML] Add an option to disable standardization for linear regression (#7037)> sent to sujkh@naver.com could not be delivered for the following reason:

The recipient has blocked your mail.


@dbtsai
Member

dbtsai commented Jun 30, 2015

The tests don't cover all the cases, including with and without intercept. Also, for regParam = 0, the fits with and without standardization should converge to the same solution.
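The regParam = 0 expectation can be illustrated with a one-feature closed form. This is a minimal sketch under stated assumptions (no intercept, L2 penalty, population standard deviation, made-up data); `fit_1d` and `fit_with_standardization` are hypothetical helpers, not functions from this patch or from Spark.

```python
import math

def fit_1d(xs, ys, reg_param):
    """Closed-form minimizer of (1/2n) * sum((y - w*x)^2) + (reg_param/2) * w^2."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * reg_param)

def fit_with_standardization(xs, ys, reg_param):
    """Fit on x / std with a uniform penalty, then map the weight back to the original scale."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    w_scaled = fit_1d([x / std for x in xs], ys, reg_param)
    return w_scaled / std

xs = [10.0, 20.0, 30.0, 40.0]
ys = [1.1, 1.9, 3.2, 3.9]

for reg_param in (0.0, 0.5):
    plain = fit_1d(xs, ys, reg_param)
    standardized = fit_with_standardization(xs, ys, reg_param)
    print(reg_param, plain, standardized)
```

With no regularization the two weights match up to floating-point error, because rescaling a feature only rescales its weight. With a nonzero penalty they differ, which is why the tests need separate expected values for each standardization setting.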

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36173 has finished for PR 7037 at commit 99ce053.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

… an option to disable standardization (but for LoR).
@SparkQA

SparkQA commented Jun 30, 2015

Test build #36177 has finished for PR 7037 at commit b83a41e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingLinearAlgorithm(object):
    • class StreamingLinearRegressionWithSGD(StreamingLinearAlgorithm):
    • class SpecificOrdering extends $
    • class SpecificProjection extends $
    • final class SpecificRow extends $

@holdenk
Contributor Author

holdenk commented Jul 1, 2015

@dbtsai I've extended the test coverage.

@dbtsai
Member

dbtsai commented Jul 1, 2015

@holdenk Cool. I'll work on this tonight. Thanks.

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36975 has finished for PR 7037 at commit b83a41e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 10, 2015

Test build #36983 has finished for PR 7037 at commit eebe10a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KafkaRDD(RDD):
    • class KafkaDStream(DStream):
    • class KafkaTransformedDStream(TransformedDStream):
    • class GenericInternalRowWithSchema(values: Array[Any], override val schema: StructType)
    • case class StreamInputInfo(

@holdenk
Contributor Author

holdenk commented Jul 10, 2015

jenkins, retest this please

@SparkQA

SparkQA commented Jul 10, 2015

Test build #36985 has finished for PR 7037 at commit eebe10a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class GenericInternalRowWithSchema(values: Array[Any], override val schema: StructType)

@holdenk
Contributor Author

holdenk commented Jul 17, 2015

jenkins, retest this please

@SparkQA

SparkQA commented Jul 17, 2015

Test build #29 has finished for PR 7037 at commit eebe10a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37572 has finished for PR 7037 at commit eebe10a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Jul 21, 2015

@holdenk can you merge master? Thanks.

@SparkQA

SparkQA commented Jul 21, 2015

Test build #37926 has finished for PR 7037 at commit 6b1dc09.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk holdenk closed this Aug 12, 2015
4 participants