Add normalizeByCol method to mllib.util.MLUtils. #1698
Conversation
Can one of the admins verify this patch?
See Jira issue: https://issues.apache.org/jira/browse/SPARK-2776
I see that #1207 covers re-scaling in mllib.util.FeatureScaling, but from what I can tell, it calls RowMatrix.computeColumnSummaryStatistics, making it not a lazy transformation. Would there be a benefit to implementing feature scaling without calling any RDD actions?
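For context, here is a minimal sketch of the eager path referred to above; the helper name `eagerColumnSummary` is made up for illustration. `RowMatrix.computeColumnSummaryStatistics()` launches a Spark job as soon as it is called and returns the per-column summary to the driver, which is why it is not a lazy transformation.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
import org.apache.spark.rdd.RDD

// Illustrative only: the eager path. computeColumnSummaryStatistics() triggers a
// Spark job immediately and ships the per-column summary back to the driver.
def eagerColumnSummary(data: RDD[Vector]): MultivariateStatisticalSummary = {
  val mat = new RowMatrix(data)
  mat.computeColumnSummaryStatistics() // .mean and .variance give per-column statistics
}
```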
Your implementation calls …
reduceByKey being the same as reduce, and cartesian being the same as broadcast, is the whole point; the difference is that reduceByKey and cartesian are evaluated lazily. Eager evaluation is often unexpected to the end user and can lead to duplicate calculations, since the user does not anticipate them and deal with them using rdd.cache calls.
They are not the same. We use treeReduce to avoid having all executors send data to the driver, which is not something reduceByKey provides. Broadcast is also different from cartesian. This solution cannot avoid having duplicate calculations to …
Why do you use treeReduce + broadcast? The data per partition is small, no? Only a few aggregates per partition; I think we calculate 3 numbers per column in the vectors, so for vectors of size 100 we only need to send 300 values back per partition. Also, reduceByKey is guaranteed to only send data within the cluster, not to the driver (which might not be on the cluster). Seems like a win to me?
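As an illustration of the point about three running aggregates per column, here is a minimal sketch; the function name and the (count, sum, sumSq) encoding are assumptions for the example, not code from the PR. Because reduceByKey combines on the map side, each partition only contributes one triple per column to the shuffle, and the whole thing stays a lazy transformation.

```scala
import org.apache.spark.SparkContext._ // pair-RDD functions (reduceByKey) on pre-1.3 Spark
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical sketch: keep (count, sum, sumSq) per column and combine them with
// reduceByKey, so the statistics never leave the cluster and nothing runs eagerly.
def columnStats(data: RDD[Vector]): RDD[(Int, (Long, Double, Double))] = {
  data.flatMap { v =>
    v.toArray.zipWithIndex.map { case (x, j) => (j, (1L, x, x * x)) }
  }.reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) =>
    (n1 + n2, s1 + s2, q1 + q2) // three running aggregates per column
  }
}
```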
What if you have 10M columns? I agree that not sending data to the driver is a good practice. But the current operations …
I can see your point about 10M columns. It would be really nice if we had a lazy and efficient allReduce(RDD[T], (T, … An RDD transform not being lazy leads to multiple Spark actions that the …
Yes, I tried to implement AllReduce without having the driver in the middle in #506, but it introduced complex dependencies. So I fell back to the treeReduce + torrent broadcast approach. I hope this can be improved, maybe in v1.2.
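For readers following along, a rough sketch of the treeReduce/treeAggregate + broadcast pattern being described; it assumes MultivariateOnlineSummarizer for the per-column summary and is an illustration, not the actual MLlib implementation (in older Spark versions treeAggregate was provided by mllib's RDDFunctions rather than by RDD itself).

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

// Rough sketch of the eager pattern: treeAggregate is an action, so the column
// summary is computed right away, broadcast to the executors, and applied in a map.
def standardizeEager(data: RDD[Vector]): RDD[Vector] = {
  val summary = data.treeAggregate(new MultivariateOnlineSummarizer)(
    (agg, v) => agg.add(v), // fold each vector into a per-partition summarizer
    (a, b) => a.merge(b))   // merge partition summaries up a tree, not all at the driver
  val mean = summary.mean.toArray
  val std = summary.variance.toArray.map(math.sqrt)
  val bcMean = data.context.broadcast(mean)
  val bcStd = data.context.broadcast(std)
  data.map { v =>
    val values = v.toArray.zipWithIndex.map { case (x, j) =>
      if (bcStd.value(j) > 0) (x - bcMean.value(j)) / bcStd.value(j) else 0.0
    }
    Vectors.dense(values)
  }
}
```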
@andy327 Do you mind closing this PR for now? I'm definitely buying the idea of freeing up the master, but the current set of Core APIs doesn't provide an easy and efficient way to do it. We could revisit this and other implementations once we have the right set of tools. Thanks @andy327 @koertkuipers for the PR and the discussion!
Adds the ability to compute the mean and standard deviation of each vector (LabeledPoint) component and normalize each vector in the RDD, using only RDD transformations. The result is an RDD of Vectors where each column has a mean of zero and a standard deviation of one.
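A hedged sketch of what such a transformation-only normalizeByCol could look like, combining reduceByKey for the column statistics with cartesian to join them back onto the data, as discussed above. This is an illustration under those assumptions, not the PR's actual implementation; for simplicity it operates on RDD[Vector] rather than LabeledPoint.

```scala
import org.apache.spark.SparkContext._ // pair-RDD functions for pre-1.3 Spark
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Illustrative sketch only: column-wise standardization built from transformations alone.
// Nothing runs until the caller triggers an action on the returned RDD.
def normalizeByColLazy(data: RDD[Vector]): RDD[Vector] = {
  // Per-column (count, sum, sumSq), gathered into a single-element RDD holding a lookup map.
  val stats: RDD[Map[Int, (Long, Double, Double)]] = data
    .flatMap(v => v.toArray.zipWithIndex.map { case (x, j) => (j, (1L, x, x * x)) })
    .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) => (n1 + n2, s1 + s2, q1 + q2) }
    .coalesce(1)
    .mapPartitions(iter => Iterator.single(iter.toMap))

  // cartesian with a one-element RDD pairs every vector with the stats map, still lazily.
  data.cartesian(stats).map { case (v, s) =>
    val values = v.toArray.zipWithIndex.map { case (x, j) =>
      val (n, sum, sumSq) = s(j)
      val mean = sum / n
      val std = math.sqrt(sumSq / n - mean * mean)
      if (std > 0) (x - mean) / std else 0.0
    }
    Vectors.dense(values)
  }
}
```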