-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3263][GraphX] Fix changes made to GraphGenerator.logNormalGraph in PR #720 #2168
Conversation
… change in commit 894ecde
…ating graphs randomly using optional seed.
… seeding of RNGs. Add user-defined optional seed for deterministic behavior
QA tests have started for PR 2168 at commit
|
QA tests have finished for PR 2168 at commit
|
@rnowling Thanks for this change! I merged #720 hastily, so it's good to fix any problems it introduced. A few comments:
@jegonzal Could you take a look at the formula change? |
…was used to control number of edges to generate
…omly generate a seed
… on RDD before printing
QA tests have started for PR 2168 at commit
|
QA tests have started for PR 2168 at commit
|
thanks @ankurdave . I'd also like to recommend @srowen as a reviewer since he made a few contributions to the GraphGenerator in the past. I've made the following changes based on your feedback:
I also found and fixed the following bugs:
and I started unit tests for GraphGenerators -- only covers the logNormalGraph, generateRandomEdges, and sampleLogNormal functions for now but it's a start. I understand the concern about seeding and benchmarks. However, I also think the use of optional parameters for seeds can be subtle and problematic. Most random functions (RNGs, etc.) generate a random seed when one is not given, so users expect that. I'm often tripped up when the default is a fixed seed. Since users may use logNormalGraph for statistical testing and other work, I would highly prefer that the default is a random seed. As for SynthBenchmark, I followed the same principle of least surprise. When no seed is given, one is randomly generated. If you feel the seed should be hard-coded in the benchmark by default, however, I'd be happy to change that. At this stage, I don't want to introduce backwards-incompatible API changes unless we know the new API will be there for some time. When a new API is introduced, the seed parameter should be required (rather than optional) to remove the ambiguity and subtly of the issue. |
* @param mu | ||
* @param sigma | ||
* @return | ||
* If the seed is set to 0 (default), the seeds are set deterministically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rnowling Oh I don't think I know much of anything about GraphX -- the one PR I submitted was a minor math issue. I looked anyway and the only comment I have is that I don't see handling for the case of seed == 0 in the code, but the scaladoc suggests this does something special.
Thanks, @srowen ! I forgot to update the docs as I was making changes. :( Fixed now! |
QA tests have finished for PR 2168 at commit
|
QA tests have finished for PR 2168 at commit
|
QA tests have started for PR 2168 at commit
|
QA tests have started for PR 2168 at commit
|
QA tests have finished for PR 2168 at commit
|
QA tests have finished for PR 2168 at commit
|
QA tests have started for PR 2168 at commit
|
QA tests have finished for PR 2168 at commit
|
The code changes look good to me (and were badly needed). Thanks for fixing it! |
I fixed comparison of Edge objects. I hadn't been using testOnly properly before so I didn't catch this. |
Thanks, @jegonzal ! |
QA tests have started for PR 2168 at commit
|
QA tests have finished for PR 2168 at commit
|
These test failures are occurring in PySpark and seem unrelated. |
Jenkins, retest this please. |
QA tests have started for PR 2168 at commit
|
QA tests have finished for PR 2168 at commit
|
It seems 3 tests failed but they were unrelated to my changes. The Are other PRs having similar issues? On Friday, August 29, 2014, Apache Spark QA notifications@github.com
em rnowling@gmail.com |
Might as well try again: Jenkins, retest this please. |
Thanks @ankurdave Tests didn't seem to run, though... |
Jenkins, retest this please. |
QA tests have started for PR 2168 at commit
|
QA tests have finished for PR 2168 at commit
|
Hooray! :) Thanks @ankurdave and @JoshRosen |
Great! Merging into master. |
…h in PR apache#720 PR apache#720 made multiple changes to GraphGenerator.logNormalGraph including: * Replacing the call to functions for generating random vertices and edges with in-line implementations with different equations. Based on reading the Pregel paper, I believe the in-line functions are incorrect. * Hard-coding of RNG seeds so that method now generates the same graph for a given number of vertices, edges, mu, and sigma -- user is not able to override seed or specify that seed should be randomly generated. * Backwards-incompatible change to logNormalGraph signature with introduction of new required parameter. * Failed to update scala docs and programming guide for API changes * Added a Synthetic Benchmark in the examples. This PR: * Removes the in-line calls and calls original vertex / edge generation functions again * Adds an optional seed parameter for deterministic behavior (when desired) * Keeps the number of partitions parameter that was added. * Keeps compatibility with the synthetic benchmark example * Maintains backwards-compatible API Author: RJ Nowling <rnowling@gmail.com> Author: Ankur Dave <ankurdave@gmail.com> Closes apache#2168 from rnowling/graphgenrand and squashes the following commits: f1cd79f [Ankur Dave] Style fixes e11918e [RJ Nowling] Fix bad comparisons in unit tests 785ac70 [RJ Nowling] Fix style error c70868d [RJ Nowling] Fix logNormalGraph scala doc for seed 41fd1f8 [RJ Nowling] Fix logNormalGraph scala doc for seed 799f002 [RJ Nowling] Added test for different seeds for sampleLogNormal 43949ad [RJ Nowling] Added test for different seeds for generateRandomEdges 2faf75f [RJ Nowling] Added unit test for logNormalGraph 82f2239 [RJ Nowling] Add unit test for sampleLogNormal b99cba9 [RJ Nowling] Make sampleLogNormal private to Spark (vs private) for unit testing 6803da1 [RJ Nowling] Add GraphGeneratorsSuite with test for generateRandomEdges 1c8fc44 [RJ Nowling] Connected components part of SynthBenchmark was failing to call count on RDD before printing dfbb6dd [RJ Nowling] Fix parameter name in SynthBenchmark docs b5eeb80 [RJ Nowling] Add optional seed parameter to SynthBenchmark and set default to randomly generate a seed 1ff8d30 [RJ Nowling] Fix bug in generateRandomEdges where numVertices instead of numEdges was used to control number of edges to generate 98bb73c [RJ Nowling] Add documentation for logNormalGraph parameters d40141a [RJ Nowling] Fix style error 684804d [RJ Nowling] revert PR apache#720 which introduce errors in logNormalGraph and messed up seeding of RNGs. Add user-defined optional seed for deterministic behavior c183136 [RJ Nowling] Fix to deterministic GraphGenerators.logNormalGraph that allows generating graphs randomly using optional seed. 015010c [RJ Nowling] Fixed GraphGenerator logNormalGraph API to make backward-incompatible change in commit 894ecde
PR #720 made multiple changes to GraphGenerator.logNormalGraph including:
This PR: