
[FLINK-2634] [gelly] Added a vertex-centric Triangle Count Lib Method #1105

Closed
wants to merge 1 commit

Conversation

andralungu
Contributor

This PR adds a vertex-centric version of the TriangleCount algorithm as discussed in #1054.

The main difference is that the reduceOn* calls have been replaced with groupReduceOn* calls.

Depending on the input graph, one may need to use the GAS version (skewed networks) or the vertex-centric version.

* Tuple1 which contains a single integer representing the number of triangles.
*/
public class TriangleCount implements
GraphAlgorithm<Long, NullValue, NullValue, DataSet<Tuple1<Integer>>> {
Contributor

Why not return Integer instead of Tuple1<Integer>?

Contributor

Why is the algorithm typed to Long keys? Shouldn't it work with any key type?
Couldn't we also allow any vertex and edge value type and replace them internally with NullValue?

Contributor Author

For the Tuple1 comment: I wanted to be consistent with the other examples. If you run it on a cluster, you'd like the result written to a file. You could print it with regular Java functions, but writeAsCsv and print were more appealing (to me at least) :)

For the key type: At some point, you check that your id is lower than the neighbors' ids... How would that comparison be made if you don't know the type? If, let's say, you received a String, you'd need to do Long.parseLong... If you know a way, I'd gladly change it, but I don't think this is the only library method that has the key type problem... Even ConnectedComponents uses Long. Maybe we can open a JIRA for that...

Contributor

The key must implement the org.apache.flink.types.Key interface. Otherwise, you could not group, join, or sort on this type. Key implements Comparable, so you can simply do key1.compareTo(key2).
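
For illustration, a minimal sketch of such a generic comparison (the class and helper names below are hypothetical, not from the PR):

```java
// Illustrative only: comparing generic vertex ids via Comparable instead of
// assuming Long. The helper below is hypothetical, not part of the PR.
public class KeyComparisonSketch {

    // True if vertexId is strictly lower than neighborId, for any comparable key.
    static <K extends Comparable<K>> boolean hasLowerId(K vertexId, K neighborId) {
        return vertexId.compareTo(neighborId) < 0;
    }

    public static void main(String[] args) {
        System.out.println(hasLowerId(1L, 2L));   // true
        System.out.println(hasLowerId("b", "a")); // false, Strings work as well
    }
}
```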

Contributor

DataSet<Integer>.writeAsText() prints the result of Integer.toString(). print() also works for Integer.
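
A minimal sketch of that, assuming a placeholder value and output path:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class IntegerSinkSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder for the triangle count produced by the algorithm.
        DataSet<Integer> count = env.fromElements(42);

        // Each Integer is written via Integer.toString(); no Tuple1 wrapper needed.
        count.writeAsText("/tmp/triangle-count");
        env.execute("write triangle count");
    }
}
```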

@fhueske
Contributor

fhueske commented Sep 11, 2015

I did not go through the code in detail. Maybe I am missing something, but isn't this algorithm doing pretty much the same thing as the example we already have in the DataSet API (Java + Scala)? The existing example (EnumTrianglesOpt) uses 3 reduces and 1 join to enumerate all triangles. We would need one more reduce to compute the number of triangles, which makes 4 reduces and 1 join.
In your code you are using 5 reduces and 2 joins, if I counted correctly.

Are those two algorithms doing different things or could you basically port the existing code to Gelly?
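
Roughly, the extra counting step mentioned above could look like the following sketch; the triangles input and its Tuple3<Long, Long, Long> element type are assumptions for illustration, not the actual EnumTrianglesOpt output:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple3;

public class TriangleCountingSketch {

    // Turns an enumerated-triangles DataSet into a single count with one more reduce.
    static DataSet<Integer> countTriangles(DataSet<Tuple3<Long, Long, Long>> triangles) {
        return triangles
            .map(new MapFunction<Tuple3<Long, Long, Long>, Integer>() {
                @Override
                public Integer map(Tuple3<Long, Long, Long> triangle) {
                    return 1; // one per enumerated triangle
                }
            })
            .reduce(new ReduceFunction<Integer>() {
                @Override
                public Integer reduce(Integer left, Integer right) {
                    return left + right; // sum into a total count
                }
            });
    }
}
```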

@andralungu
Contributor Author

The only argument I have for this particular code (as opposed to the one in the DataSet API) is that this one simulates what Pregel is doing... But since it's a library method, it should be used as is, without caring about the model. IMO, the discussion in #1054 asked for this particular version, but if there is a consensus I can change it to use the more efficient one... or even close the PR and point people to the DataSet API...

@fhueske
Contributor

fhueske commented Sep 11, 2015

I think having TriangleCounting (and TriangleEnumeration as well) would be very good additions for Gelly. As such, they should be implemented as efficiently as possible. As I said, I did not look into the code in detail, so I do not know exactly what is happening inside, but the DataSet example is pretty straightforward and quite efficient AFAIK.

@andralungu
Contributor Author

If it's okay with you, I'd like to see what @vasia has to say about adding the Triangle Count example from the DataSet API + a reduce as a library method. IMO, it's a better addition, but for some reason, we preferred the Pregel/GSA implementations at a certain point (it was because in my thesis, I take Pregel as a baseline).

Also the generic K instead of Long makes perfect sense. However, if we decide to change it, I'll have to open a JIRA to revise the entire set of library methods because apart from PageRank, I think they all restrict themselves to one common key type. I would be kind of scared of type erasure in the K implements Key case :-S

@fhueske
Contributor

fhueske commented Sep 13, 2015

Sure, always good to get more input :-)
Btw. the DataSet implementation is a Flink port of a MapReduce algorithm described in this article: http://lintool.github.io/UMD-courses/bigdata-2015-Spring/content/Cohen_2009.pdf

@vasia
Contributor

vasia commented Sep 14, 2015

Hi,

I agree with @fhueske that we should consider porting the existing DataSet example to the Gelly library. The algorithm is a bit different and it's not very clear (at least to me) which one would be the most efficient. The DataSet implementation applies the optimization that grouping happens on vertices with low degrees. The one we currently have in Gelly (and this one) applies the optimization that the vertex with the lowest ID will be the one detecting the triangle.

@andralungu, we have chosen these implementations for your thesis in order to test a specific method and prove a point. This doesn't necessarily mean that this is the most efficient way to implement triangle counting. Ideally, we should analyze the complexity of each implementation and run experiments to determine which one is best to add to the library.

Also, we should definitely support any Key type and value types and we should update existing library methods that assume a particular type for no good reason.
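
As a hedged illustration of the degree-based optimization mentioned above (plain Java, not the DataSet example's actual code; the helper name and types are hypothetical):

```java
import java.util.Map;

public class DegreeOrderingSketch {

    // Hypothetical helper: reorients an undirected edge (u, v) so that it points
    // from the lower-degree endpoint to the higher-degree one (ties broken by id).
    // Building triads only at the edge sources then concentrates the work on
    // low-degree vertices, which is the skew-friendly idea of the DataSet example.
    static long[] orientByDegree(long u, long v, Map<Long, Integer> degree) {
        int du = degree.getOrDefault(u, 0);
        int dv = degree.getOrDefault(v, 0);
        boolean uFirst = du < dv || (du == dv && u < v);
        return uFirst ? new long[]{u, v} : new long[]{v, u};
    }
}
```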

@fhueske
Contributor

fhueske commented Sep 14, 2015

As I said, I did not look at the actual implementation of the PR besides counting expensive operators.
Since we have implementations for both, we could simply run a few benchmarks to get an idea of the trade-offs. Of course these will be just a few points in the whole space but maybe better than nothing...

@andralungu
Contributor Author

Hey you two :)

I suggest that we add this one.
Let the JIRA that @vasia opened handle the keys.
Then add the DataSet example (another JIRA) and a JIRA for benchmarks.

I think that would more or less make everyone happy.

@fhueske
Contributor

fhueske commented Sep 14, 2015

Fair enough, we can stick with your implementation. I was only guessing about the performance implications. I did not check the code in detail though.

@vasia
Contributor

vasia commented Sep 14, 2015

Sounds good to me! A benchmark would be great, since I can also only guess about the performance ;)

@andralungu
Contributor Author

I've created the two JIRA issues: FLINK-2714 and FLINK-2715.

@vasia, if there are no other objections, can you have another look at this and merge it?

* This algorithm operates in three phases. First, vertices select neighbors with id greater than theirs
* and send messages to them. Each received message is then propagated to neighbors with higher id.
* Finally, if a node encounters the target id in the list of received messages, it increments the number
* of triangles found.
Contributor

Can we add a link to the paper (or some other resource) where this algorithm is described?
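
For readers without a reference at hand, here is a plain-Java sketch of the id-ordering idea described above; it is only an illustrative analogue, not the PR's vertex-centric message-passing code:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

public class TriangleCountIdeaSketch {

    // Counts triangles in an undirected graph given as vertex -> neighbor set.
    // A triangle (a, b, c) with a < b < c is counted exactly once, from its
    // smallest-id vertex a: a looks only at higher-id neighbors b (phase 1),
    // follows b only to higher-id neighbors c (phase 2), and checks whether
    // the closing edge a-c exists (phase 3).
    static long count(Map<Long, Set<Long>> adjacency) {
        long triangles = 0;
        for (Map.Entry<Long, Set<Long>> entry : adjacency.entrySet()) {
            long a = entry.getKey();
            for (long b : entry.getValue()) {
                if (b <= a) {
                    continue;
                }
                for (long c : adjacency.getOrDefault(b, Collections.<Long>emptySet())) {
                    if (c <= b) {
                        continue;
                    }
                    if (entry.getValue().contains(c)) {
                        triangles++;
                    }
                }
            }
        }
        return triangles;
    }
}
```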

@vasia
Contributor

vasia commented Sep 20, 2015

Hey @andralungu,

I left some minor comments. Apart from those, I think we should also address the previous comments on (a) returning an Integer instead of a Tuple1, (b) generic vertex keys, and (c) allowing any vertex and edge value type and replacing them internally with NullValue.
I know that other library methods also have these problems (and there's a JIRA), but this PR is open, so we can at least fix them for this one.

@andralungu
Contributor Author

Hi,

I agree with the issues raised in this discussion. They most definitely should be addressed, but unfortunately, I cannot do it. I am handling many bureaucratic tasks these days and the next months will be a bit hectic.

This being said, can someone please pick the issue up? I've never had to do this before, so I don't know how it works. Do I close the PR or leave it open and hope that someone will continue on it?

Thanks!

@vasia
Contributor

vasia commented Sep 23, 2015

Hi @andralungu,

if you're not planning to continue on this PR, then I believe you should close it. However, you might want to keep your local branch, in case someone else wants to pick up where you left off.

This is not an urgent issue, it's a nice-to-have feature, so you can come back to it later :)
If you think that it might take too much time, then it might also be a good idea to un-assign the JIRA.

Thanks!
