
SPARK-1786: Edge Partition Serialization #724

Closed
wants to merge 3 commits from jegonzal/edge_partition_serialization

Conversation

jegonzal
Contributor

This appears to address the issue with edge partition serialization. The fix seems to be just registering the PrimitiveKeyOpenHashMap with Kryo. However, I noticed that we appear to have forked that code in GraphX while retaining the same name (which is confusing), so I also renamed our local copy to GraphXPrimitiveKeyOpenHashMap. We should consider dropping that fork and using the one in Spark if possible.
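For illustration, the kind of registration being described would look roughly like this; the class and package names follow the PR discussion, but the surrounding registrator body is a sketch, not the actual patch:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Sketch: register GraphX's internal map type so Kryo can serialize
// EdgePartition instances that contain it. The registrator shape is
// illustrative; only the registered class name comes from the PR.
class GraphKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(
      classOf[org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap[_, _]])
  }
}
```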

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14875/

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14877/

@mateiz
Contributor

mateiz commented May 11, 2014

@jegonzal should this be added in 1.0?

@jegonzal
Contributor Author

I would like to get it into 1.0 if possible. Otherwise, we could run into issues if the user persists graphs to disk or straggler mitigation is used. @ankurdave do you see any issues with trying to get this into 1.0?

@mateiz
Contributor

mateiz commented May 12, 2014

Alright, sounds good. @ankurdave or @rxin can you take a quick look?

@ankurdave
Contributor

This looks good to me. Re-enabling Kryo reference tracking will have a performance penalty, but we can easily fix that after the release.
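For context, reference tracking is controlled through a Spark configuration property; a hedged sketch of what toggling it looks like (the property name is `spark.kryo.referenceTracking` in Spark 1.x-era configuration):

```scala
import org.apache.spark.SparkConf

// Illustrative only: re-enabling Kryo reference tracking trades some
// serialization speed for the ability to handle shared or cyclic
// object references during (de)serialization.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.referenceTracking", "true")
```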

@mateiz
Contributor

mateiz commented May 12, 2014

Alright, then I'll merge this as is. You guys should add some docs in both the GraphX programming guide and GraphXKryoSerializer to mention that it's recommended to turn off reference tracking. Just send a separate PR for that. (Doc changes can also go in after 1.0 is officially cut, we can update the website).

@rxin
Contributor

rxin commented May 12, 2014

By the way, as far as I can tell, Kryo reference tracking should always be disabled in the Spark REPL. Should we just do that in the future?


@asfgit asfgit closed this in a6b02fb May 12, 2014
asfgit pushed a commit that referenced this pull request May 12, 2014
This appears to address the issue with edge partition serialization.  The solution appears to be just registering the `PrimitiveKeyOpenHashMap`.  However I noticed that we appear to have forked that code in GraphX but retained the same name (which is confusing).  I also renamed our local copy to `GraphXPrimitiveKeyOpenHashMap`.  We should consider dropping that and using the one in Spark if possible.

Author: Ankur Dave <ankurdave@gmail.com>
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>

Closes #724 from jegonzal/edge_partition_serialization and squashes the following commits:

b0a525a [Ankur Dave] Disable reference tracking to fix serialization test
bb7f548 [Ankur Dave] Add failing test for EdgePartition Kryo serialization
67dac22 [Joseph E. Gonzalez] Making EdgePartition serializable.

(cherry picked from commit a6b02fb)
Signed-off-by: Matei Zaharia <matei@databricks.com>
@rxin
Contributor

rxin commented May 12, 2014

Alternatively, we could find a way to work around that in the REPL so it can safely be turned on.


@mateiz
Contributor

mateiz commented May 12, 2014

I think we can warn if it's on or something. I wouldn't add code to disable it since we might be able to fix it to work there too.

@jegonzal
Contributor Author

My only concern is that I would prefer that things work slowly rather than fail. With reference tracking disabled, it is not possible to serialize user-defined types from the spark-shell.

A second concern is that it will be difficult for the user to enable reference tracking if we disable it in the GraphX Kryo registrar.

@pwendell
Contributor

This was merged without ever passing Jenkins. I've reverted it because it's causing all other PRs to break. We need to add an exclude to the binary-compatibility check.

@pwendell
Contributor

To fix this we can just add the org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap class here:
https://github.com/apache/spark/blob/master/project/MimaBuild.scala#L77

Joey - mind re-opening this?
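The fix being described would look roughly like the following entry in project/MimaBuild.scala; the helper name `excludeSparkClass` is assumed from that file's conventions at the time and may differ from the actual code:

```scala
// Sketch of a MiMa binary-compatibility exclude for the renamed class.
// Added alongside the other excludes in project/MimaBuild.scala; the
// helper name is an assumption based on that file's style.
excludeSparkClass("graphx.util.collection.PrimitiveKeyOpenHashMap")
```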

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
maropu pushed a commit that referenced this pull request Mar 20, 2021
### What changes were proposed in this pull request?

Added optimizer rule `RemoveRedundantAggregates`. It removes redundant aggregates from a query plan. A redundant aggregate is an aggregate whose only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.

The affected part of the query plan for TPCDS q87:

Before:
```
== Physical Plan ==
*(26) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#785]
   +- *(25) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
         +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
            +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
               +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                  +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                     +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#724]
                        +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                           +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti
                              :- ...
```

After:
```
== Physical Plan ==
*(26) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#751]
   +- *(25) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
         +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#694]
            +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
               +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti
                  :- ...
```
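An illustrative query shape (not from the PR) where a nested distinct aggregate is redundant, because the parent aggregate already collapses duplicate values; the DataFrame and column names here are hypothetical:

```scala
// Sketch against a hypothetical DataFrame `sales`. The second .distinct()
// produces an Aggregate whose only purpose is deduplication, but its parent
// aggregate ignores duplicates anyway, so RemoveRedundantAggregates can
// collapse the stacked Aggregate nodes into one.
val counted = sales
  .select("c_last_name", "c_first_name", "d_date")
  .distinct() // kept: establishes the distinct key set
  .distinct() // redundant: parent already deduplicated these keys
  .count()
```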

### Why are the changes needed?

Performance improvements: a few TPCDS queries contain these kinds of duplicate aggregates.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Benchmarks (sf=5):

OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Linux 5.8.13-arch1-1
Intel(R) Core(TM) i5-6500 CPU  3.20GHz

| Query | Before  | After | Speedup |
| ------| ------- | ------| ------- |
| q14a | 44s | 44s | 1x |
| q14b | 41s | 41s | 1x |
| q38  | 6.5s | 5.9s | 1.1x |
| q87  | 7.2s | 6.8s | 1.1x |
| q14a-v2.7 | 55s | 53s | 1x |

Closes #30018 from tanelk/SPARK-33122.

Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
Agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Jan 16, 2024
* KE-43300 use crontab expression to periodicGC in KE

* KE-43300 fix review