
Add normalizeByCol method to mllib.util.MLUtils. #1698

Closed

Conversation

@andy327 commented Jul 31, 2014

Adds the ability to compute the mean and standard deviation of each vector (LabeledPoint) component and to normalize each vector in the RDD, using only RDD transformations. The result is an RDD of Vectors in which each column has a mean of zero and a standard deviation of one.
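A minimal sketch of the lazy approach the description outlines, using plain `Array[Double]` rows instead of the PR's LabeledPoint vectors; the method name comes from the PR title, but the body and the population-variance formula are illustrative, not the PR's actual code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x

// Sketch only: accumulate per-column (count, sum, sum of squares) with
// reduceByKey, then pair the one-element statistics RDD with every row
// via cartesian -- no action runs until the result RDD is used.
def normalizeByCol(data: RDD[Array[Double]]): RDD[Array[Double]] = {
  val stats = data
    .map(v => (1, (1L, v, v.map(x => x * x))))
    .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) =>
      (n1 + n2,
       s1.zip(s2).map { case (a, b) => a + b },
       q1.zip(q2).map { case (a, b) => a + b })
    }
    .values // a one-element RDD of (count, sums, sums of squares)

  data.cartesian(stats).map { case (v, (n, sums, sqSums)) =>
    v.zipWithIndex.map { case (x, j) =>
      val mean = sums(j) / n
      // population std dev; clamp to guard against floating-point negatives
      val std = math.sqrt(math.max(0.0, sqSums(j) / n - mean * mean))
      if (std > 0) (x - mean) / std else 0.0
    }
  }
}
```

Note that this naive sum-of-squares variance is exactly the numerical-accuracy point raised about OnlineSummarizer later in the thread.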

@AmplabJenkins commented:

Can one of the admins verify this patch?

@andy327 (Author) commented Jul 31, 2014

See Jira issue: https://issues.apache.org/jira/browse/SPARK-2776

@mengxr (Contributor) commented Jul 31, 2014

@andy327 This is covered in @dbtsai's PR: #1207 , which is in review.

@andy327 (Author) commented Jul 31, 2014

I see that #1207 covers re-scaling in mllib.util.FeatureScaling, but from what I can tell it calls RowMatrix.computeColumnSummaryStatistics, which triggers a Spark job, so the scaling is not a lazy transformation. Would there be a benefit to implementing feature scaling without calling any RDD actions?
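For context, the eagerness concern looks roughly like this; `computeColumnSummaryStatistics` is a real RowMatrix method and it runs a Spark job at call time (the helper around it is hypothetical):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// computeColumnSummaryStatistics() aggregates over the RDD immediately,
// so column means and variances are materialized here, at call time,
// rather than when the scaled output RDD is first used.
def columnStats(vectors: RDD[Vector]): (Vector, Vector) = {
  val summary = new RowMatrix(vectors).computeColumnSummaryStatistics() // action
  (summary.mean, summary.variance)
}
```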

@mengxr (Contributor) commented Aug 1, 2014

Your implementation calls reduceByKey and cartesian. Those are not cheap operations. map(x => (1, x)).reduceByKey is the same as reduce, except that it reduces to some executor instead of the driver. cartesian then plays the role of broadcast, but broadcast is more efficient with TorrentBroadcast. You can compare the performance and see the difference. OnlineSummarizer also uses a more accurate approach to computing the variance.
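Schematically, the two patterns being compared look like this (a hypothetical element-wise `merge` over per-column sums stands in for the real statistics combiner):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// Hypothetical combiner standing in for the real statistics merge.
def merge(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (x, y) => x + y }

// Lazy pattern: reduceByKey leaves the reduced value on some executor,
// and cartesian with the one-element RDD ships it to every partition.
def lazyPattern(data: RDD[Array[Double]]): RDD[(Array[Double], Array[Double])] = {
  val stats = data.map(x => (1, x)).reduceByKey(merge).values
  data.cartesian(stats)
}

// Eager pattern: reduce is an action that brings the result to the driver,
// and TorrentBroadcast redistributes it more efficiently than cartesian does.
def eagerPattern(sc: SparkContext,
                 data: RDD[Array[Double]]): RDD[(Array[Double], Array[Double])] = {
  val stats = sc.broadcast(data.reduce(merge))
  data.map(v => (v, stats.value))
}
```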

@koertkuipers (Contributor) commented:

reduceByKey being the same as reduce, and cartesian being the same as broadcast, is the whole point; the difference is that reduceByKey and cartesian are evaluated lazily.

Eager evaluation is often unexpected to the end user and can lead to duplicate calculations, since the user does not anticipate them and so does not guard against them with rdd.cache calls.

@mengxr (Contributor) commented Aug 1, 2014

They are not the same. We use treeReduce to avoid having all executors send data to the driver, which is not available with reduceByKey. Broadcast is also different from cartesian. This solution cannot avoid duplicate computation over rdd either: when the computation is triggered, we still need to visit rdd twice. The one difference is when someone calls normalizeByCol but never uses the normalized rdd.
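A sketch of the treeReduce + broadcast pattern described here (treeReduce lived in org.apache.spark.mllib.rdd.RDDFunctions at the time and later moved to the core RDD API; the rescaling step is a placeholder):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// treeReduce aggregates in rounds on the executors, so the driver receives
// only a few partial results instead of one per partition. Either way, the
// input RDD is visited twice once computation runs: once for the statistics
// and once for the rescaling.
def normalizeEager(sc: SparkContext, data: RDD[Array[Double]]): RDD[Array[Double]] = {
  val sums = sc.broadcast(
    data.treeReduce((a, b) => a.zip(b).map { case (x, y) => x + y })) // pass 1 (action)
  data.map(v => v.zip(sums.value).map { case (x, s) => x / s })       // pass 2 (lazy)
}
```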

@koertkuipers (Contributor) commented:

Why do you use treeReduce + broadcast? The data per partition is small, no? Only a few aggregates per partition.

I think we calculate three numbers per column in the vectors, so for vectors of size 100 we only need to send 300 values back per partition.

Also, reduceByKey is guaranteed to only send data within the cluster, not to the driver (which might not be on the cluster). Seems like a win to me?

@mengxr (Contributor) commented Aug 1, 2014

What if you have 10M columns? I agree that not sending data to the driver is good practice. But the current reduceByKey and cartesian operations are not optimized for very big data. Please test it on a cluster with many partitions and you should see the bottleneck.

@koertkuipers (Contributor) commented:

I can see your point about 10M columns.

It would be really nice if we had a lazy and efficient allReduce(RDD[T], (T, T) => T): RDD[T].

An RDD transform that is not lazy, leading to multiple Spark actions the user did not explicitly start, is tricky to me. It's already difficult enough to get the cache and unpersist logic correct without unexpected actions.
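A lazy allReduce along those lines could be as small as this (hypothetical helper, not an existing Spark API):

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// Reduce to a one-element RDD without triggering an action; callers can
// then cartesian or zip it against the input, keeping the pipeline lazy.
def allReduce[T: ClassTag](rdd: RDD[T], f: (T, T) => T): RDD[T] =
  rdd.map(x => (1, x)).reduceByKey(f).values

// Usage: pair every element with the global reduction, still lazily.
def withTotal[T: ClassTag](rdd: RDD[T], f: (T, T) => T): RDD[(T, T)] =
  rdd.cartesian(allReduce(rdd, f))
```

The caveat, per the discussion above, is that reduceByKey's shuffle machinery is heavier than a tuned tree aggregation when the per-column state is large.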


@mengxr (Contributor) commented Aug 1, 2014

Yes, I tried to implement AllReduce without having the driver in the middle in #506, but it introduced complex dependencies, so I fell back to the treeReduce + torrent broadcast approach. I hope this can be improved, maybe in v1.2.

@mengxr (Contributor) commented Aug 30, 2014

@andy327 Do you mind closing this PR for now? I'm definitely sold on the idea of freeing up the master, but the current set of core APIs doesn't provide an easy and efficient way to do it. We can revisit this and other implementations once we have the right set of tools. Thanks @andy327 and @koertkuipers for the PR and the discussion!

@asfgit closed this in 9b8c228 on Aug 31, 2014.