New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-1514][Gelly] Add a Gather-Sum-Apply iteration method #408
Conversation
public Tuple2<Long, Double> gather(Tuple3<Vertex<Long, Double>, | ||
Edge<Long, Long>, Vertex<Long, Double>> triplet) { | ||
|
||
if (triplet.f2.getValue() == 0.0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally this is not okay, but we never change the initial 0.0 value.
Should we still make an epsilon check here, even though it would result in (a bit) less readable code?
Hi @balidani! Thanks a lot for this PR! Gather-Sum-Apply will be an awesome addition to Gelly ^^
Let me know if you have any doubts! Thanks again ☀️ |
Hi @vasia! Thanks for the review! I started making the changes, but I got stuck. I get an exception about Here is the full stack trace: https://gist.github.com/balidani/e6f716d3562d2aa131fb The GatherUdf is at the end. I didn't bother fully converting the Sum and Apply UDFs, because I already get an exception for gather. What exactly is Cheers! |
I had an offline discussion with @balidani and we figured that the exception was actually coming from the test, not the GatherSumApplyIteration. So, no issue here. Regarding the |
/** | ||
* This is an implementation of the Single Source Shortest Paths algorithm, using a gather-sum-apply iteration | ||
*/ | ||
public class GSASingleSourceShortestPathsExample implements ProgramDescription, GraphAlgorithm<Long, Double, Double> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's not a library method, there's no need to implement GraphAlgorithm or have a run() method. You can move the body to main().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I was blindly copying this part and didn't realize it's not really needed :)
This looks great already! We just need a few more changes:
Of course the wrappers inside |
Graph<Long, Double, Double> graph = Graph.fromDataSet(vertices, edges, env); | ||
|
||
// The path from src to trg through edge e costs src + e | ||
// If the target's distance is 0 (the target is the actual source), return 0 instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't the case anymore, you can delete this comment :)
Thanks for the revision @balidani! I added some minor inline comments. |
} | ||
}); | ||
} else { | ||
return ExampleUtils.getLongDoubleVertexData(env); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a cosmetic bug, but I think it could be useful to mention it...
Could you, instead of ExampleUtils, take the data from a separate class? For SSSP for example, you have SingleSourceShortestPathsData in example/utils. This would be more legible(i.e. it would look nicer than getLongLongVertexData :) ).
Also @vasia , I see nobody else is using that method in ExampleUtils so maybe after this we can get rid of it :D
Not sure why the last Travis check failed because the pom.xml seems to be the correct one. But apart from that, this looks very nice and compliant with the vertex centric approach. Apart from that. Good to merge from my side. And as future work, once I finish the vertexCentricIteration extensions which will allow the user to choose the messaging direction, we should exend GSA to do the same :D |
In the core API, we have started to define all functions (map, join, ... - here it would be SumFunction, ApplyFunction) as single-abstract-method interfaces. That allows to use Java 8 lambas (Scala may pick this up as well). For each function, there is a |
</dependency> | ||
</dependencies> | ||
</profile> | ||
<profile> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why we need the hadoop2 profile here. It doesn't do anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the pom, we've actually merged this change as part of #429. We added the guava dependency after this discussion in the dev mailing list. If there's a better way to deal with this, please let me know! The hadoop2 profile seems to be there by mistake, I'll fix that :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
I've rebased #454 onto your changes in master and adopted it to our new approach of handling dependencies in Flink. There, I've added guava again as a regular dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, thanks @rmetzger!
I am just dropping some more thoughts here ;-) The interface seems a bit confusing. Right now, one has to first create the iteration from the graph and then tell the graph to run it: GatherSumApplyIteration iteration = graph.createGatherSumApplyIteration(gather, sum, apply);
Graph result = graph.runGatherSumApplyIteration(iteration); What was the reason to not go simply for Graph result = graph.runGatherSumApplyIteration(gather, sum, apply); If the reason was additional parameterization of the iteration, would it be nicer do go for the following? GatherSumApplyIteration iteration = graph.createGatherSumApplyIteration(gather, sum, apply);
iteration.configureInSomeWay(...);
Graph result = iteration.result(); |
Regarding the rest of the comments:
|
The vertex-centric iterations use semantic annotations a lot to save sorting/partitioning where possible. This results in a plan for the vertex-centric iterations that partitions one and sorts one, despite the fact that it uses wither two coGroups or 2 joins plus a reduce. I think this would help big time here as well. In general, plan optimization may be something that Gelly could take a look at. |
Ona more thought: The triplet re-building in every iteration seems quite expensive. Does it make sense to have the triplets (or quadruplets, with the vertex value) as the data in the solution set? |
|
Hi! I pushed some changes. @vasia told me that she and @StephanEwen even decided that iterating on triplets is unfeasible, so we should operate on Regarding the other suggestions, @vasia and I think it would be a good idea to push them in a separate PR, because they affect both GSA and Vertex centric iterations. Regarding the cosmetic changes that @andralungu suggested: I changed SSSP to use Cheers! |
Hi @balidani, thanks a lot for the changes! I'll try to run some tests on a cluster soon and test the new version. |
@vasia was completely right about the GGC algorithm, I misunderstood what it was supposed to do. I implemented the correct version, but it turns out, after reading the GAS paper that this algorithm has problems when executed synchronously (e.g., values can oscillate between 0 and 1). We had a discussion and decided to add a Connected Components example instead, as well as a PageRank example, which I will implement shortly. |
Hi @balidani :) This PR has not been updated in a while now. Are you facing any issues? In that case I would like to help. Otherwise, if you're busy maybe it makes sense to pass the remainder of the implementation to someone else? |
Hi @andralungu! I just talked to Vasia, and I'll do a rebase today. |
Okay, I updated the PR and Travis passed. |
Thanks @balidani! I'll try to review asap :-) |
@@ -57,4 +57,35 @@ under the License. | |||
<version>${guava.version}</version> | |||
</dependency> | |||
</dependencies> | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @balidani. I think there's no need for these profiles anymore.
Thanks @balidani for the update! I left some inline comments which should be fairly easy to address. |
Hi @vasia! I fixed the problems, sorry about that. I had to resolve some merge conflicts and it appears that I did not notice some things. Next time I'll check the diff before I commit :) |
Hi @balidani! I pulled your changes and made some improvements, mainly to the examples, so that they are consistent with the rest of Gelly's examples. You can see my changes in this branch. I've been running some experiments on a small cluster and so far everything runs smoothly. GSA SSSP actually seems to be faster than vertex-centric SSSP for both input datasets I'm using 😄 I will run a few more experiments in the next days and if I find no problems and there are no objections, I will merge by the end of the week. |
I noticed that there is no GSAPageRank example in the latest version of the code. |
Hey Andra, there are several issues we'll need to fix/add after this is merged. A Pagerank example is one of them :) I have gathered and will open all related JIRAs once this is merged. Thanks for volunteering to take care of one already :)) |
…e the output vertex This closes apache#408
…e the output vertex This closes apache#408
Hi!
I implemented the Gather-Sum-Apply iteration method, it is much nicer now than the previous (
flink-graph
) version. I also added a SSSP test case to the already existing graph coloring one.