[SPARK-19634][ML] Multivariate summarizer - dataframes API #17419
What changes were proposed in this pull request?
This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of
Furthermore, it uses the BLAS API to the extent possible, so that the given code should be efficient for the dense case.
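To make the idea of a summarizer with selectable metrics concrete, here is a minimal sketch in plain Python. It is not the code in this patch and does not use Spark. The class name `OnlineSummary` and its methods are hypothetical. It assumes Welford's running update for each record and the pairwise combination formula of Chan et al. for merging partial results from two partitions.

```python
from typing import Dict, List, Sequence


class OnlineSummary:
    """Running per-feature statistics (hypothetical helper, not Spark's class)."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.count = 0
        self.mean: List[float] = [0.0] * dim
        self.m2: List[float] = [0.0] * dim  # sum of squared deviations from the mean

    def add(self, vec: Sequence[float]) -> "OnlineSummary":
        """Fold one record into the summary (Welford's update)."""
        if len(vec) != self.dim:
            raise ValueError(f"expected {self.dim} values, got {len(vec)}")
        self.count += 1
        for i, x in enumerate(vec):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])
        return self

    def merge(self, other: "OnlineSummary") -> "OnlineSummary":
        """Combine two partial summaries, e.g. one per partition."""
        total = self.count + other.count
        if total == 0:
            return self
        for i in range(self.dim):
            delta = other.mean[i] - self.mean[i]
            self.m2[i] += other.m2[i] + delta * delta * self.count * other.count / total
            self.mean[i] += delta * other.count / total
        self.count = total
        return self

    def variance(self) -> List[float]:
        """Sample variance (divides by n - 1); zeros when fewer than two records."""
        if self.count < 2:
            return [0.0] * self.dim
        return [m / (self.count - 1) for m in self.m2]

    def metrics(self, *names: str) -> Dict[str, object]:
        """Return only the requested metrics."""
        available = {
            "count": lambda: self.count,
            "mean": lambda: list(self.mean),
            "variance": self.variance,
        }
        return {name: available[name]() for name in names}


# Two "partitions" summarized independently, then merged.
part_a = OnlineSummary(dim=2).add([1.0, 10.0]).add([2.0, 20.0])
part_b = OnlineSummary(dim=2).add([3.0, 30.0]).add([4.0, 40.0])
result = part_a.merge(part_b).metrics("mean", "variance")
```

Because every partial summary keeps only a count, a mean, and a sum of squared deviations per feature, merging is cheap and the order in which partitions are combined does not matter up to floating-point rounding.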
How was this patch tested?
This patch includes most of the tests of the RDD-based. It compares results against the existing
This patch also includes some documentation for some low-level constructs such as
I have not run tests against the existing implementation. However, this patch uses the recommended low-level SQL APIs, so it should be interesting to compare both implementations in that respect.
Thanks to @hvanhovell and Cheng Liang for suggestions on SparkSQL.
I have added a small perf test to find the performance bottlenecks. Note that this test works on the worst case (vectors of size 1) from the perspective of overhead. Here are the numbers I currently get. I will profile the code to see if there are some obvious targets for optimization:
I looked a bit deeper into the performance aspect. Here are some quick insights:
That benchmark focuses on the overhead of catalyst. I will do another benchmark with dense vectors to see how it fares in practice with more real data.
As the dataframe version is much slower than RDD version (currently test against vector of size 1)
@WeichenXu123 and I did some profiling using
@cloud-fan, any ideas about improving
…ash aggregate

## What changes were proposed in this pull request?

In #18483, we fixed the data copy bug when saving into `InternalRow`, and removed all workarounds for this bug in the aggregate code path. However, the object hash aggregate was missed, this PR fixes it.

This patch is also a requirement for #17419, which shows that DataFrame version is slower than RDD version because of this issue.

## How was this patch tested?

existing tests

Author: Wenchen Fan <email@example.com>

Closes #18712 from cloud-fan/minor.
## What changes were proposed in this pull request?

This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.

## How was this patch tested?

Test cases added.

## Performance

Resolves several performance issues in apache#17419; further optimization is pending on the SQL team's work. One of the SQL-layer performance issues related to this feature has been resolved in apache#18712, thanks to liancheng and cloud-fan.

### Performance data

(Tested on my laptop with 2 partitions; tries = 20, warm-up = 10.)

The unit of the test results is records per millisecond (higher is better).

Vector size/records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
----|------|----|---|----|----
Dataframe | 15149 | 7441 | 2118 | 224 | 21
RDD from Dataframe | 4992 | 4440 | 2328 | 320 | 33
raw RDD | 53931 | 20683 | 3966 | 528 | 53

Author: WeichenXu <WeichenXu123@outlook.com>

Closes apache#18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
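For readers who want to reproduce throughput figures of this kind, the following is a rough sketch of a warm-up-then-measure harness in plain Python. It is not the benchmark used for the numbers above. The function name `records_per_ms` and the choice to average over all timed tries are assumptions, since the commit message does not say how the tries were aggregated.

```python
import time
from typing import Callable


def records_per_ms(job: Callable[[], object], num_records: int,
                   warm_up: int = 10, tries: int = 20) -> float:
    """Average throughput of `job` in records per millisecond.

    `job` is assumed to process exactly `num_records` records per call.
    """
    # Untimed warm-up runs, so one-off start-up costs are not measured.
    for _ in range(warm_up):
        job()

    # Timed runs; throughput is averaged over all tries (an assumption).
    start = time.perf_counter()
    for _ in range(tries):
        job()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return num_records * tries / elapsed_ms


# Stand-in workload: sum 100,000 vectors of size 1, the worst case for
# per-record overhead.
records = [[float(i)] for i in range(100_000)]
throughput = records_per_ms(lambda: sum(v[0] for v in records), len(records))
```

Comparing the same harness across vector sizes shows how much of the cost is fixed per-record overhead and how much scales with the vector length.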