[SPARK-18946][ML] sliceAggregate which is a new aggregate operator for high-dimensional data #17000
Conversation
Just to be clear - this is essentially just splitting an array up into smaller chunks so that overall communication is more efficient? It would be good to look at why Spark is not doing a good job with one big array. Is the bottleneck really the executor communication (shuffle part)? Or is it collecting the big array back at the end of tree aggregation (i.e. this patch sort of allows more concurrency in the …)?
cc @dbtsai @sethah @yanboliang, who were looking at linear model scalability recently.
Hi @ZunwenYou, thanks for sharing the implementation with us.
Hi, @MLnick
Hi, @hhbyyh In our experiment, the class MultivariateOnlineSummarizer contains 8 arrays; if the dimension reaches 20 million, the memory of a MultivariateOnlineSummarizer instance is 1280 MB (8 bytes × 20M × 8). The experiment configuration is as follows (RDD and aggregate parameters):
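As a back-of-the-envelope check of that estimate (a sketch; it simply assumes 8 dense `Array[Double]` buffers, each of length equal to the feature dimension):

```scala
// Rough footprint of one MultivariateOnlineSummarizer instance,
// assuming 8 dense Array[Double] buffers of length `dimension`.
val dimension = 20000000L     // 20 million features
val bytesPerDouble = 8L       // a Double occupies 8 bytes
val numBuffers = 8L           // the 8 internal arrays mentioned above
val totalBytes = dimension * bytesPerDouble * numBuffers
println(s"~${totalBytes / 1000000} MB per summarizer instance")  // ~1280 MB
```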
Is the speedup coming mostly from the …? See https://issues.apache.org/jira/browse/SPARK-19634, which is for porting this operation to use a DataFrame UDAF and computing only the required metrics (instead of forcing all of them to be computed, as is done currently). I wonder how that will compare?
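For context, the DataFrame-side interface that SPARK-19634 eventually introduced lets callers request only the metrics they need. A rough sketch, assuming a DataFrame `df` with a vector-typed `features` column and the `ml.stat.Summarizer` API that later landed:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Compute only the required metrics instead of building the full
// MultivariateOnlineSummarizer state for every statistic.
val summaryDF = df.select(
  Summarizer.metrics("mean", "variance")
    .summary(col("features"))
    .as("summary"))

val Row(mean: Vector, variance: Vector) =
  summaryDF.select("summary.mean", "summary.variance").first()
```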
Hi, @MLnick This is a good improvement for …
@ZunwenYou yes, I understand that the … So my point would probably be to try to see how much benefit accrues from (a) using the UDAF mechanism and (b) not computing unnecessary things. Then we can compare to the benefit here and decide.
I'm not totally certain there will be some huge benefit from porting the vector summary to the UDAF framework. But there are API-level benefits to doing so. Perhaps there is a way to incorporate the …
cc @yanboliang - it seems actually similar in effect to the VF-LBFGS work with RDD-based coefficients?
ping @yanboliang, please have a look at this improvement.
@MLnick It looks like VF-LBFGS targets a different scenario. In VF algorithms the vectors are too large to store in driver memory, so we slice the vectors across different machines (stored as `RDD[Vector]`, using the partition ID as the slice key, with one RDD holding only one vector). Also, about VF-LBFGS: in the training dataset, each instance's feature vector is high-dimensional but very sparse; the features data in VF-LBFGS will be transformed into …
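To illustrate the difference, here is a hypothetical sketch (not the actual VF-LBFGS code; `DistributedVector` and `dot` are made-up names) of how a sliced coefficient vector can be combined without ever materializing the full vector on the driver:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// One logical vector, stored as one slice per key; the whole vector
// never lives on a single machine.
type DistributedVector = RDD[(Int, Vector)]   // key = slice / partition id

// Dot product between two distributed vectors: matching slices are
// combined where they live, and only per-slice scalars reach the driver.
def dot(x: DistributedVector, y: DistributedVector): Double =
  x.join(y).map { case (_, (xs, ys)) =>
    var s = 0.0
    var i = 0
    while (i < xs.size) { s += xs(i) * ys(i); i += 1 }
    s
  }.sum()
```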
Can one of the admins verify this patch? |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
In many machine learning cases, the driver has to aggregate high-dimensional vectors/arrays from executors.
treeAggregate is a good solution for aggregating vectors to the driver, and you can increase the depth of the tree when the data is large.
However, treeAggregate can still fail when the number of RDD partitions and the dimension of the vector grow large.
We propose a new RDD operator, named sliceAggregate, which splits the vector into n slices, each assigned a key (from 0 to n-1). The RDD[(key, slice)] is then reduced to n combined slices using the reduceByKey operator.
Finally, the driver collects and composes the n slices to obtain the result; see the sketch below.
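A minimal sketch of the idea, assuming the aggregate value is an `Array[Double]` combined by element-wise addition (the name `sliceAggregateArray` and its signature are hypothetical, for illustration only; the actual patch is more general):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch of the sliceAggregate idea for an Array[Double] aggregate
// that is combined by element-wise addition.
def sliceAggregateArray[T: ClassTag](
    data: RDD[T],
    dim: Int,
    seqOp: (Array[Double], T) => Array[Double],
    numSlices: Int): Array[Double] = {

  val sliceLen = (dim + numSlices - 1) / numSlices

  val slices: RDD[(Int, Array[Double])] = data
    .mapPartitions { iter =>
      // 1. Aggregate the full-length array locally within the partition.
      val local = iter.foldLeft(new Array[Double](dim))(seqOp)
      // 2. Split the local result into numSlices keyed chunks.
      (0 until numSlices).iterator.map { k =>
        val from = k * sliceLen
        val until = math.min(from + sliceLen, dim)
        (k, local.slice(from, until))
      }
    }
    // 3. Each slice is combined independently, so no single task has to
    //    merge two full-dimension arrays during the shuffle.
    .reduceByKey { (a, b) =>
      var i = 0
      while (i < a.length) { a(i) += b(i); i += 1 }
      a
    }

  // 4. The driver collects the n slices and composes the final result.
  val result = new Array[Double](dim)
  slices.collect().foreach { case (k, arr) =>
    System.arraycopy(arr, 0, result, k * sliceLen, arr.length)
  }
  result
}
```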
I ran an experiment that calculates statistics of the features.
The number of samples is 1000 and the feature dimension ranges from 10k to 20M; the comparison of time cost between treeAggregate and sliceAggregate is shown below. When the feature dimension reaches 20 million, treeAggregate fails.
The table of time costs (ms) for sliceAggregate vs. treeAggregate.
The code related to this experiment is here.
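For comparison, the treeAggregate baseline presumably follows the standard MLlib colStats pattern (a sketch, not the exact benchmark code), where every merge handles summarizer state of the full feature dimension:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

// treeAggregate baseline: every seqOp/combOp touches summarizer state
// of the full feature dimension, which is what becomes expensive
// (or fails) at tens of millions of features.
def colStatsBaseline(data: RDD[Vector], depth: Int = 2): MultivariateOnlineSummarizer =
  data.treeAggregate(new MultivariateOnlineSummarizer)(
    (summary, v) => summary.add(v),
    (s1, s2) => s1.merge(s2),
    depth)
```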
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-18946