
SPARK-1786: Edge Partition Serialization #724

Closed
wants to merge 3 commits from jegonzal/edge_partition_serialization

Conversation

jegonzal
Contributor

This appears to address the issue with edge partition serialization. The fix seems to be just registering the PrimitiveKeyOpenHashMap with Kryo. However, I noticed that we appear to have forked that code in GraphX while retaining the same name (which is confusing), so I also renamed our local copy to GraphXPrimitiveKeyOpenHashMap. We should consider dropping that fork and using the one in Spark if possible.
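For illustration, the kind of registration being described would look roughly like this; the class and package names follow the PR discussion, but the surrounding registrator body is a sketch, not the actual patch:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Sketch: register GraphX's internal map type so Kryo can serialize
// EdgePartition instances that contain it. The registrator shape is
// illustrative; only the registered class name comes from the PR.
class GraphKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(
      classOf[org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap[_, _]])
  }
}
```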

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14875/

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14877/

@mateiz
Contributor

mateiz commented May 11, 2014

@jegonzal should this be added in 1.0?

@jegonzal
Contributor Author

I would like to get it into 1.0 if possible. Otherwise, we could run into issues if the user persists graphs to disk or straggler mitigation is used. @ankurdave do you see any issues with trying to get this into 1.0?

@mateiz
Contributor

mateiz commented May 12, 2014

Alright, sounds good. @ankurdave or @rxin can you take a quick look?

@ankurdave
Contributor

This looks good to me. Re-enabling Kryo reference tracking will have a performance penalty, but we can easily fix that after the release.
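For context, reference tracking is controlled through a Spark configuration property; a hedged sketch of what toggling it looks like (the property name is `spark.kryo.referenceTracking` in Spark 1.x-era configuration):

```scala
import org.apache.spark.SparkConf

// Illustrative only: re-enabling Kryo reference tracking trades some
// serialization speed for the ability to handle shared or cyclic
// object references during (de)serialization.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.referenceTracking", "true")
```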

@mateiz
Contributor

mateiz commented May 12, 2014

Alright, then I'll merge this as is. You guys should add some docs in both the GraphX programming guide and GraphXKryoSerializer to mention that it's recommended to turn off reference tracking. Just send a separate PR for that. (Doc changes can also go in after 1.0 is officially cut, we can update the website).

@rxin
Contributor

rxin commented May 12, 2014

By the way, as far as I can tell, Kryo reference tracking should always be disabled in the Spark REPL. Should we just do that in the future?


@asfgit asfgit closed this in a6b02fb May 12, 2014
asfgit pushed a commit that referenced this pull request May 12, 2014
This appears to address the issue with edge partition serialization.  The solution appears to be just registering the `PrimitiveKeyOpenHashMap`.  However I noticed that we appear to have forked that code in GraphX but retained the same name (which is confusing).  I also renamed our local copy to `GraphXPrimitiveKeyOpenHashMap`.  We should consider dropping that and using the one in Spark if possible.

Author: Ankur Dave <ankurdave@gmail.com>
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>

Closes #724 from jegonzal/edge_partition_serialization and squashes the following commits:

b0a525a [Ankur Dave] Disable reference tracking to fix serialization test
bb7f548 [Ankur Dave] Add failing test for EdgePartition Kryo serialization
67dac22 [Joseph E. Gonzalez] Making EdgePartition serializable.

(cherry picked from commit a6b02fb)
Signed-off-by: Matei Zaharia <matei@databricks.com>
@rxin
Contributor

rxin commented May 12, 2014

Alternatively, we could find a way to work around that in the REPL so it can safely be turned on.


@mateiz
Contributor

mateiz commented May 12, 2014

I think we can warn if it's on or something. I wouldn't add code to disable it since we might be able to fix it to work there too.

@jegonzal
Contributor Author

My only concern is that I would prefer that things work slowly rather than fail. With reference tracking disabled, it is not possible to serialize user-defined types from the spark-shell.

A second concern is that it will be difficult for the user to enable reference tracking if we disable it in the GraphX Kryo registrar.

@pwendell
Contributor

This was merged without ever passing Jenkins. I've reverted it because it's causing all other PRs to break. We need to add an exclude to the binary-compatibility check.

@pwendell
Contributor

To fix this we can just add the org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap class here:
https://github.com/apache/spark/blob/master/project/MimaBuild.scala#L77

Joey - mind re-opening this?
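The fix being described would look roughly like the following entry in project/MimaBuild.scala; the helper name `excludeSparkClass` is assumed from that file's conventions at the time and may differ from the actual code:

```scala
// Sketch of a MiMa binary-compatibility exclude for the renamed class.
// Added alongside the other excludes in project/MimaBuild.scala; the
// helper name is an assumption based on that file's style.
excludeSparkClass("graphx.util.collection.PrimitiveKeyOpenHashMap")
```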

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
maropu pushed a commit that referenced this pull request Mar 20, 2021
### What changes were proposed in this pull request?

Added optimizer rule `RemoveRedundantAggregates`. It removes redundant aggregates from a query plan. A redundant aggregate is an aggregate whose only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.

The affected part of the query plan for TPCDS q87:

Before:
```
== Physical Plan ==
*(26) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#785]
   +- *(25) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
         +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
            +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
               +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                  +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                     +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#724]
                        +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
                           +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti
                              :- ...
```

After:
```
== Physical Plan ==
*(26) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition, true, [id=#751]
   +- *(25) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
         +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#694]
            +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[])
               +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti
                  :- ...
```
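An illustrative query shape (not from the PR) where a nested distinct aggregate is redundant, because the parent aggregate already collapses duplicate values; the DataFrame and column names here are hypothetical:

```scala
// Sketch against a hypothetical DataFrame `sales`. The second .distinct()
// produces an Aggregate whose only purpose is deduplication, but its parent
// aggregate ignores duplicates anyway, so RemoveRedundantAggregates can
// collapse the stacked Aggregate nodes into one.
val counted = sales
  .select("c_last_name", "c_first_name", "d_date")
  .distinct() // kept: establishes the distinct key set
  .distinct() // redundant: parent already deduplicated these keys
  .count()
```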

### Why are the changes needed?

Performance improvements: a few TPCDS queries contain these kinds of duplicate aggregates.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Benchmarks (sf=5):

OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Linux 5.8.13-arch1-1
Intel(R) Core(TM) i5-6500 CPU  3.20GHz

| Query | Before  | After | Speedup |
| ------| ------- | ------| ------- |
| q14a | 44s | 44s | 1x |
| q14b | 41s | 41s | 1x |
| q38  | 6.5s | 5.9s | 1.1x |
| q87  | 7.2s | 6.8s | 1.1x |
| q14a-v2.7 | 55s | 53s | 1x |

Closes #30018 from tanelk/SPARK-33122.

Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
Agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Jan 16, 2024
* KE-43300 use crontab expression to periodicGC in KE

* KE-43300 fix review