
Add normalizeByCol method to mllib.util.MLUtils. #1698

Closed

Conversation

@andy327 commented Jul 31, 2014

Adds the ability to compute the mean and standard deviation of each vector (LabeledPoint) component and to normalize each vector in the RDD, using only RDD transformations. The result is an RDD of Vectors in which each column has a mean of zero and a standard deviation of one.
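A minimal sketch of the lazy approach the description outlines, using plain `Array[Double]` rows instead of the PR's LabeledPoint vectors; the method name comes from the PR title, but the body and the population-variance formula are illustrative, not the PR's actual code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x

// Sketch only: accumulate per-column (count, sum, sum of squares) with
// reduceByKey, then pair the one-element statistics RDD with every row
// via cartesian -- no action runs until the result RDD is used.
def normalizeByCol(data: RDD[Array[Double]]): RDD[Array[Double]] = {
  val stats = data
    .map(v => (1, (1L, v, v.map(x => x * x))))
    .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) =>
      (n1 + n2,
       s1.zip(s2).map { case (a, b) => a + b },
       q1.zip(q2).map { case (a, b) => a + b })
    }
    .values // a one-element RDD of (count, sums, sums of squares)

  data.cartesian(stats).map { case (v, (n, sums, sqSums)) =>
    v.zipWithIndex.map { case (x, j) =>
      val mean = sums(j) / n
      // population std dev; clamp to guard against floating-point negatives
      val std = math.sqrt(math.max(0.0, sqSums(j) / n - mean * mean))
      if (std > 0) (x - mean) / std else 0.0
    }
  }
}
```

Note that this naive sum-of-squares variance is exactly the numerical-accuracy point raised about OnlineSummarizer later in the thread.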

@AmplabJenkins commented:

Can one of the admins verify this patch?

@andy327 (Author) commented Jul 31, 2014

See Jira issue: https://issues.apache.org/jira/browse/SPARK-2776

@mengxr (Contributor) commented Jul 31, 2014

@andy327 This is covered in @dbtsai's PR: #1207 , which is in review.

@andy327 (Author) commented Jul 31, 2014

I see that #1207 covers re-scaling in mllib.util.FeatureScaling, but from what I can tell it calls RowMatrix.computeColumnSummaryStatistics, which triggers a Spark job, so the scaling is not a lazy transformation. Would there be a benefit to implementing feature scaling without calling any RDD actions?
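For context, the eagerness concern looks roughly like this; `computeColumnSummaryStatistics` is a real RowMatrix method and it runs a Spark job at call time (the helper around it is hypothetical):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// computeColumnSummaryStatistics() aggregates over the RDD immediately,
// so column means and variances are materialized here, at call time,
// rather than when the scaled output RDD is first used.
def columnStats(vectors: RDD[Vector]): (Vector, Vector) = {
  val summary = new RowMatrix(vectors).computeColumnSummaryStatistics() // action
  (summary.mean, summary.variance)
}
```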

@mengxr (Contributor) commented Aug 1, 2014

Your implementation calls reduceByKey and cartesian. Those are not cheap operations. map(x => (1, x)).reduceByKey is the same as reduce, except that it reduces to some executor instead of the driver. cartesian then plays the role of broadcast, but broadcast is more efficient with TorrentBroadcast. You can compare the performance and see the difference. OnlineSummarizer also uses a more accurate approach to computing the variance.
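Schematically, the two patterns being compared look like this (a hypothetical element-wise `merge` over per-column sums stands in for the real statistics combiner):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// Hypothetical combiner standing in for the real statistics merge.
def merge(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (x, y) => x + y }

// Lazy pattern: reduceByKey leaves the reduced value on some executor,
// and cartesian with the one-element RDD ships it to every partition.
def lazyPattern(data: RDD[Array[Double]]): RDD[(Array[Double], Array[Double])] = {
  val stats = data.map(x => (1, x)).reduceByKey(merge).values
  data.cartesian(stats)
}

// Eager pattern: reduce is an action that brings the result to the driver,
// and TorrentBroadcast redistributes it more efficiently than cartesian does.
def eagerPattern(sc: SparkContext,
                 data: RDD[Array[Double]]): RDD[(Array[Double], Array[Double])] = {
  val stats = sc.broadcast(data.reduce(merge))
  data.map(v => (v, stats.value))
}
```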

@koertkuipers (Contributor) commented:

reduceByKey being the same as reduce, and cartesian being the same as broadcast, is the whole point; the difference is that reduceByKey and cartesian are evaluated lazily.

Eager evaluation is often unexpected to the end user and can lead to duplicate calculations, since the user does not anticipate them and so does not guard against them with rdd.cache calls.

@mengxr (Contributor) commented Aug 1, 2014

They are not the same. We use treeReduce to avoid having all executors send data to the driver, which is not available with reduceByKey. Broadcast is also different from cartesian. This solution cannot avoid duplicate computation over rdd either: when the computation is triggered, we still need to visit rdd twice. The one difference is when someone calls normalizeByCol but never uses the normalized rdd.
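A sketch of the treeReduce + broadcast pattern described here (treeReduce lived in org.apache.spark.mllib.rdd.RDDFunctions at the time and later moved to the core RDD API; the rescaling step is a placeholder):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// treeReduce aggregates in rounds on the executors, so the driver receives
// only a few partial results instead of one per partition. Either way, the
// input RDD is visited twice once computation runs: once for the statistics
// and once for the rescaling.
def normalizeEager(sc: SparkContext, data: RDD[Array[Double]]): RDD[Array[Double]] = {
  val sums = sc.broadcast(
    data.treeReduce((a, b) => a.zip(b).map { case (x, y) => x + y })) // pass 1 (action)
  data.map(v => v.zip(sums.value).map { case (x, s) => x / s })       // pass 2 (lazy)
}
```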

@koertkuipers (Contributor) commented:

Why do you use treeReduce + broadcast? The data per partition is small, no? Only a few aggregates per partition.

I think we calculate three numbers per column in the vectors, so for vectors of size 100 we only need to send 300 values back per partition.

Also, reduceByKey is guaranteed to only send data within the cluster, not to the driver (which might not be on the cluster). Seems like a win to me?

@mengxr (Contributor) commented Aug 1, 2014

What if you have 10M columns? I agree that not sending data to the driver is good practice. But the current reduceByKey and cartesian operations are not optimized for very big data. Please test it on a cluster with many partitions and you should see the bottleneck.

@koertkuipers (Contributor) commented:

I can see your point about 10M columns.

It would be really nice if we had a lazy and efficient allReduce(RDD[T], (T, T) => T): RDD[T].

An RDD transform that is not lazy, leading to multiple Spark actions the user did not explicitly start, is tricky to me. It's already difficult enough to get the cache and unpersist logic correct without unexpected actions.
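A lazy allReduce along those lines could be as small as this (hypothetical helper, not an existing Spark API):

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// Reduce to a one-element RDD without triggering an action; callers can
// then cartesian or zip it against the input, keeping the pipeline lazy.
def allReduce[T: ClassTag](rdd: RDD[T], f: (T, T) => T): RDD[T] =
  rdd.map(x => (1, x)).reduceByKey(f).values

// Usage: pair every element with the global reduction, still lazily.
def withTotal[T: ClassTag](rdd: RDD[T], f: (T, T) => T): RDD[(T, T)] =
  rdd.cartesian(allReduce(rdd, f))
```

The caveat, per the discussion above, is that reduceByKey's shuffle machinery is heavier than a tuned tree aggregation when the per-column state is large.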


@mengxr (Contributor) commented Aug 1, 2014

Yes, I tried to implement AllReduce without having the driver in the middle in #506, but it introduced complex dependencies, so I fell back to the treeReduce + torrent broadcast approach. I hope this can be improved, maybe in v1.2.

@mengxr (Contributor) commented Aug 30, 2014

@andy327 Do you mind closing this PR for now? I'm definitely sold on the idea of freeing up the master, but the current set of core APIs doesn't provide an easy and efficient way to do it. We can revisit this and other implementations once we have the right set of tools. Thanks @andy327 and @koertkuipers for the PR and the discussion!

@asfgit closed this in 9b8c228 on Aug 31, 2014.