[FLINK-3768] [gelly] Clustering Coefficient #1896

Closed

greghogan
Contributor

Provides an algorithm for local clustering coefficient and dependent functions for degree annotation, algorithm caching, and graph translation.

I worked to improve the performance of TriangleEnumerator. Perhaps the API has changed since Edge.reverse() is not in-place and the edges were not being sorted by degree. The JoinHint is also important so that the Triads are not spilled to disk.
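The `reverse()` pitfall mentioned above can be sketched roughly as follows. This is only an illustration: `SimpleEdge`, `orientBuggy`, and `orientFixed` are hypothetical names that do not come from Gelly, and the edges are oriented by vertex ID here rather than by degree to keep the example small. The point is that when `reverse()` returns a new object, discarding the return value leaves the edge unchanged.

```java
public class ReverseSketch {

    /** Hypothetical stand-in for an edge type whose reverse() is NOT in-place. */
    public static final class SimpleEdge {
        public final long source;
        public final long target;

        public SimpleEdge(long source, long target) {
            this.source = source;
            this.target = target;
        }

        /** Returns a new edge with the endpoints swapped; this edge is unchanged. */
        public SimpleEdge reverse() {
            return new SimpleEdge(target, source);
        }
    }

    /** Buggy: the reversed edge is discarded, so the input comes back as-is. */
    public static SimpleEdge orientBuggy(SimpleEdge e) {
        if (e.source > e.target) {
            e.reverse(); // return value ignored, nothing changes
        }
        return e;
    }

    /** Fixed: use the edge returned by reverse(). */
    public static SimpleEdge orientFixed(SimpleEdge e) {
        return e.source > e.target ? e.reverse() : e;
    }
}
```

With an edge from 5 to 2, `orientBuggy` returns the edge still pointing from 5 to 2, while `orientFixed` returns an edge from 2 to 5.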

On an AWS ec2.4xlarge (16 vcores, 30 GiB) I am seeing timings of 5s, 29s, and 183s for TriangleListing. With TriangleEnumerator the timings are 7s, 45s, and 281s. Without the JoinHint the latter two TriangleEnumerator timings are 58s and 347s.

Scale  ChecksumHashCode    Count
16     0x0000d9086985f4ce  15616010
18     0x0010eeb32a441365  82781436
20     0x014a9434bb57ddef  423780284

The command I had used to run the tests:

./bin/flink run -class org.apache.flink.graph.examples.TriangleListing ~/flink-gelly-examples_2.10-1.1-SNAPSHOT.jar --clip_and_flip false --output print --output hash --scale 16 --listing

@vasia
Contributor

vasia commented Apr 15, 2016

Hi @greghogan,
thank you for the PR. This is a big addition! Are all the changes related to the clustering coefficient JIRA? Could it maybe be split into smaller PRs in order to make reviewing easier?
For big changes like this, it is usually a good idea to create a design document to show the high-level approach and functionality to be added and discuss it with the community. Do you think it would make sense to do this?
Thank you!

@greghogan
Contributor Author

It's certainly nicer to review small PRs, but I also like to present the big picture and context in which the features will be used. What if I leave this here, break out the dependencies into new PRs, and then rebase this PR down if those features are accepted? I'm not quite sure what a design document would cover. There's nothing here as sophisticated as SG or GSA.


import java.text.NumberFormat;

public class TriangleListing {
Contributor

Could you add JavaDoc to ALL new Java or Scala classes added to the source? An explanation of why each class exists and how it interacts with other code is what we are looking for. Thanks.

@vasia
Contributor

vasia commented Apr 16, 2016

Ideally, each PR should make a single change and reference a single JIRA issue. If you could break this into smaller changes, that'd be great. Otherwise, it would really help if you could give a more detailed description of changes and additions. Thanks!

@greghogan greghogan force-pushed the 3768_clustering_coefficient branch 3 times, most recently from 9489061 to 306390e Compare May 10, 2016 15:15
@greghogan
Contributor Author

I have rebased this PR against the newly committed dependent features so it should be good for discussion.

@hsaputra
Contributor

hsaputra commented May 10, 2016

Please address my comments about adding Javadoc header info to ALL new Java classes added to the source repo. I understand some of the classes are just examples, but it would be good to also have Javadoc explaining why the examples are added.
Thx much!


import java.text.NumberFormat;

public class LocalClusteringCoefficient {
Contributor

I know this is just an example, but it would be nice to add Javadoc giving a high-level explanation of why this class exists.

Contributor Author

Done.

@greghogan greghogan force-pushed the 3768_clustering_coefficient branch 2 times, most recently from 8a58b41 to 3c31601 Compare May 10, 2016 17:23
@hsaputra
Contributor

Thanks for the docs update.

@greghogan
Contributor Author

If there are no further review comments I'll look at merging this today.

@greghogan greghogan force-pushed the 3768_clustering_coefficient branch from 3c31601 to 1fdb056 Compare May 16, 2016 13:25
@vasia
Contributor

vasia commented May 16, 2016

Hi @greghogan,
I haven't had time to review this PR. If this is blocking you or is urgent for some reason, please go ahead. Otherwise, I could take a look later this week.

The local clustering coefficient measures the connectedness of each
vertex's neighborhood. Scores range from 0.0 (no edges between
neighbors) to 1.0 (neighborhood is a clique).
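
The score described in the commit message above can be illustrated with a small single-machine sketch. This is not the Gelly implementation from this PR. It assumes a simple undirected graph (no self-loops or duplicate edges) held in an in-memory adjacency map, the class and method names are hypothetical, and returning 0.0 for vertices with fewer than two neighbors is a convention chosen here, not something the PR states.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LocalClusteringSketch {

    /** Adds an undirected edge (u, v) to the adjacency map. */
    public static void addEdge(Map<Integer, Set<Integer>> adj, int u, int v) {
        adj.computeIfAbsent(u, k -> new HashSet<>()).add(v);
        adj.computeIfAbsent(v, k -> new HashSet<>()).add(u);
    }

    /**
     * Local clustering coefficient of one vertex: the number of edges
     * between its neighbors divided by the number of neighbor pairs,
     * d * (d - 1) / 2. Assumption: vertices with d < 2 score 0.0.
     */
    public static double score(Map<Integer, Set<Integer>> adj, int vertex) {
        Set<Integer> neighbors = adj.getOrDefault(vertex, Collections.<Integer>emptySet());
        int d = neighbors.size();
        if (d < 2) {
            return 0.0;
        }
        long links = 0;
        for (int a : neighbors) {
            for (int b : neighbors) {
                // Count each unordered neighbor pair once (a < b).
                if (a < b && adj.getOrDefault(a, Collections.<Integer>emptySet()).contains(b)) {
                    links++;
                }
            }
        }
        return links / (d * (d - 1) / 2.0);
    }
}
```

For a triangle on vertices 1, 2, 3 with an extra edge from 3 to 4, vertex 1 scores 1.0 (its two neighbors are connected), vertex 3 scores 1/3 (one of its three neighbor pairs is connected), and vertex 4 scores 0.0.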
@greghogan greghogan force-pushed the 3768_clustering_coefficient branch 2 times, most recently from a76bb32 to 453e642 Compare May 16, 2016 17:47
@greghogan greghogan force-pushed the 3768_clustering_coefficient branch from 453e642 to dd42d7b Compare May 16, 2016 17:52
@asfgit asfgit closed this in c71675f May 16, 2016
mbode pushed a commit to mbode/flink that referenced this pull request May 27, 2016
The local clustering coefficient measures the connectedness of each
vertex's neighborhood. Scores range from 0.0 (no edges between
neighbors) to 1.0 (neighborhood is a clique).

This closes apache#1896
godfreyhe pushed a commit that referenced this pull request Apr 1, 2022
4 participants