Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3263][GraphX] Fix changes made to GraphGenerator.logNormalGraph in PR #720 #2168

Closed
wants to merge 19 commits into from
Closed

Conversation

rnowling
Copy link
Contributor

PR #720 made multiple changes to GraphGenerator.logNormalGraph including:

  • Replacing the call to functions for generating random vertices and edges with in-line implementations with different equations. Based on reading the Pregel paper, I believe the in-line functions are incorrect.
  • Hard-coding of RNG seeds so that method now generates the same graph for a given number of vertices, edges, mu, and sigma -- user is not able to override seed or specify that seed should be randomly generated.
  • Backwards-incompatible change to logNormalGraph signature with introduction of new required parameter.
  • Failed to update scala docs and programming guide for API changes
  • Added a Synthetic Benchmark in the examples.

This PR:

  • Removes the in-line calls and calls original vertex / edge generation functions again
  • Adds an optional seed parameter for deterministic behavior (when desired)
  • Keeps the number of partitions parameter that was added.
  • Keeps compatibility with the synthetic benchmark example
  • Maintains backwards-compatible API

@rnowling rnowling changed the title [SPARK-3263][GraphX] Fix changed [SPARK-3263][GraphX] Fix changes made to GraphGenerator.logNormalGraph in PR #720 Aug 27, 2014
@SparkQA
Copy link

SparkQA commented Aug 27, 2014

QA tests have started for PR 2168 at commit 684804d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 27, 2014

QA tests have finished for PR 2168 at commit 684804d.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ankurdave
Copy link
Contributor

@rnowling Thanks for this change! I merged #720 hastily, so it's good to fix any problems it introduced. A few comments:

  1. SynthBenchmark is most useful if it's deterministic, so what do you think about making seed default to 0 instead of -1, or changing SynthBenchmark to set the seed to 0? That way the generated graph should by default be identical to what it was before, aside from the log-normal formula change.
  2. It would be good to document the parameters to logNormalGraph briefly.
  3. Eventually we should probably switch from Scala optional parameters to the binary-compatible (but much more verbose) builder pattern as in MLlib. For now I think optional parameters are fine, though.

@jegonzal Could you take a look at the formula change?

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have started for PR 2168 at commit dfbb6dd.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have started for PR 2168 at commit 1c8fc44.

  • This patch merges cleanly.

@rnowling
Copy link
Contributor Author

thanks @ankurdave . I'd also like to recommend @srowen as a reviewer since he made a few contributions to the GraphGenerator in the past.

I've made the following changes based on your feedback:

  • Added documentation for logNormalGraph parameters
  • Add optional seed parameter to SynthBenchmark with a default of generating random seeds (more below)

I also found and fixed the following bugs:

  • Fixed a bug in generateRandomEdges where the number of edges produced was not the out-degree but the total number of vertices -- now produces the correct number of edges.
  • Updated SynthBenchmark docs to correct nverts parameter name
  • Fixed failure to call count() before printing result of connected components in SynthBenchmark

and I started unit tests for GraphGenerators -- only covers the logNormalGraph, generateRandomEdges, and sampleLogNormal functions for now but it's a start.

I understand the concern about seeding and benchmarks. However, I also think the use of optional parameters for seeds can be subtle and problematic. Most random functions (RNGs, etc.) generate a random seed when one is not given, so users expect that. I'm often tripped up when the default is a fixed seed. Since users may use logNormalGraph for statistical testing and other work, I would highly prefer that the default is a random seed.

As for SynthBenchmark, I followed the same principle of least surprise. When no seed is given, one is randomly generated. If you feel the seed should be hard-coded in the benchmark by default, however, I'd be happy to change that.

At this stage, I don't want to introduce backwards-incompatible API changes unless we know the new API will be there for some time. When a new API is introduced, the seed parameter should be required (rather than optional) to remove the ambiguity and subtly of the issue.

* @param mu
* @param sigma
* @return
* If the seed is set to 0 (default), the seeds are set deterministically
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rnowling Oh I don't think I know much of anything about GraphX -- the one PR I submitted was a minor math issue. I looked anyway and the only comment I have is that I don't see handling for the case of seed == 0 in the code, but the scaladoc suggests this does something special.

@rnowling
Copy link
Contributor Author

Thanks, @srowen ! I forgot to update the docs as I was making changes. :( Fixed now!

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2168 at commit dfbb6dd.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2168 at commit 1c8fc44.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have started for PR 2168 at commit 799f002.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have started for PR 2168 at commit 43949ad.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2168 at commit c70868d.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2168 at commit 799f002.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have started for PR 2168 at commit 785ac70.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2168 at commit 785ac70.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jegonzal
Copy link
Contributor

The code changes look good to me (and were badly needed). Thanks for fixing it!

@rnowling
Copy link
Contributor Author

I fixed comparison of Edge objects. I hadn't been using testOnly properly before so I didn't catch this.

@rnowling
Copy link
Contributor Author

Thanks, @jegonzal !

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have started for PR 2168 at commit e11918e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2168 at commit e11918e.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rnowling
Copy link
Contributor Author

These test failures are occurring in PySpark and seem unrelated.

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 29, 2014

QA tests have started for PR 2168 at commit e11918e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 29, 2014

QA tests have finished for PR 2168 at commit e11918e.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rnowling
Copy link
Contributor Author

It seems 3 tests failed but they were unrelated to my changes. The
GraphGeneratorsSuite I added passed without issue.

Are other PRs having similar issues?

On Friday, August 29, 2014, Apache Spark QA notifications@github.com
wrote:

QA tests have finished
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19450/consoleFull
for PR 2168 at commit e11918e
e11918e
.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


Reply to this email directly or view it on GitHub
#2168 (comment).

em rnowling@gmail.com
c 954.496.2314

@ankurdave
Copy link
Contributor

Might as well try again: Jenkins, retest this please.

@rnowling
Copy link
Contributor Author

rnowling commented Sep 3, 2014

Thanks @ankurdave Tests didn't seem to run, though...

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Sep 3, 2014

QA tests have started for PR 2168 at commit e11918e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2168 at commit e11918e.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rnowling
Copy link
Contributor Author

rnowling commented Sep 3, 2014

Hooray! :) Thanks @ankurdave and @JoshRosen

@ankurdave
Copy link
Contributor

Great! Merging into master.

@asfgit asfgit closed this in e5d3768 Sep 3, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…h in PR apache#720

PR apache#720 made multiple changes to GraphGenerator.logNormalGraph including:

* Replacing the call to functions for generating random vertices and edges with in-line implementations with different equations. Based on reading the Pregel paper, I believe the in-line functions are incorrect.
* Hard-coding of RNG seeds so that method now generates the same graph for a given number of vertices, edges, mu, and sigma -- user is not able to override seed or specify that seed should be randomly generated.
* Backwards-incompatible change to logNormalGraph signature with introduction of new required parameter.
* Failed to update scala docs and programming guide for API changes
* Added a Synthetic Benchmark in the examples.

This PR:
* Removes the in-line calls and calls original vertex / edge generation functions again
* Adds an optional seed parameter for deterministic behavior (when desired)
* Keeps the number of partitions parameter that was added.
* Keeps compatibility with the synthetic benchmark example
* Maintains backwards-compatible API

Author: RJ Nowling <rnowling@gmail.com>
Author: Ankur Dave <ankurdave@gmail.com>

Closes apache#2168 from rnowling/graphgenrand and squashes the following commits:

f1cd79f [Ankur Dave] Style fixes
e11918e [RJ Nowling] Fix bad comparisons in unit tests
785ac70 [RJ Nowling] Fix style error
c70868d [RJ Nowling] Fix logNormalGraph scala doc for seed
41fd1f8 [RJ Nowling] Fix logNormalGraph scala doc for seed
799f002 [RJ Nowling] Added test for different seeds for sampleLogNormal
43949ad [RJ Nowling] Added test for different seeds for generateRandomEdges
2faf75f [RJ Nowling] Added unit test for logNormalGraph
82f2239 [RJ Nowling] Add unit test for sampleLogNormal
b99cba9 [RJ Nowling] Make sampleLogNormal private to Spark (vs private) for unit testing
6803da1 [RJ Nowling] Add GraphGeneratorsSuite with test for generateRandomEdges
1c8fc44 [RJ Nowling] Connected components part of SynthBenchmark was failing to call count on RDD before printing
dfbb6dd [RJ Nowling] Fix parameter name in SynthBenchmark docs
b5eeb80 [RJ Nowling] Add optional seed parameter to SynthBenchmark and set default to randomly generate a seed
1ff8d30 [RJ Nowling] Fix bug in generateRandomEdges where numVertices instead of numEdges was used to control number of edges to generate
98bb73c [RJ Nowling] Add documentation for logNormalGraph parameters
d40141a [RJ Nowling] Fix style error
684804d [RJ Nowling] revert PR apache#720 which introduce errors in logNormalGraph and messed up seeding of RNGs.  Add user-defined optional seed for deterministic behavior
c183136 [RJ Nowling] Fix to deterministic GraphGenerators.logNormalGraph that allows generating graphs randomly using optional seed.
015010c [RJ Nowling] Fixed GraphGenerator logNormalGraph API to make backward-incompatible change in commit 894ecde
@rnowling rnowling deleted the graphgenrand branch September 8, 2014 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants