[FLINK-2254] Add BipartiateGraph class #2564

mushketyk · 2016-09-28T21:59:50Z

This PR implements the BipartiteGraph class with its support classes and lays the foundation for future work on bipartite graph support. I didn't add documentation because I would like to make sure that this approach is in line with what the community has in mind regarding bipartite graph support. If this PR is good, I'll continue with documentation and other related tasks.

General
- The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
- The pull request addresses only one issue
- Each commit in the PR has a meaningful commit message (including the JIRA id)
Documentation
- Documentation has been added for new functionality
- Old documentation affected by the pull request has been updated
- JavaDoc for public methods has been added
Tests & Build
- Functionality added by the pull request is covered by tests
- mvn clean verify has been executed successfully locally or a Travis build has passed

vasia

Hi @mushketyk,
thank you for the PR!
The class, types, and dataset look OK, but I think we should look into the projection methods more carefully. A projection transformation is an expensive operation and might increase the number of graph edges by an order of magnitude. In its naive form, a top-(bottom-)projection can be generated if every bottom-(top-)vertex creates an edge for each pair of its neighbors. That's an operation quadratic in the vertex degree. However, we might be able to re-use some of the optimizations that @greghogan has implemented in the Jaccard coefficient algorithm, since the main computation is the same: finding common neighbors. @greghogan what do you think?
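To make the quadratic growth concrete, here is a minimal plain-Java sketch (no Flink dependency) that counts how many edges a naive top projection would emit. The class and method names are hypothetical, and each `int[]` is assumed to hold `{topId, bottomId}`; this is an illustration of the cost argument above, not the code in this PR.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ProjectionCostSketch {

    /**
     * Counts the edges emitted by a naive top projection: a bottom vertex
     * with d top neighbors contributes d * (d - 1) / 2 unordered pairs.
     * A pair that shares several bottom vertices is counted once per bottom vertex.
     */
    static long naiveTopProjectionEdgeCount(List<int[]> bipartiteEdges) {
        // Degree of every bottom vertex (index 1 of each edge).
        Map<Integer, Long> bottomDegree = new HashMap<>();
        for (int[] edge : bipartiteEdges) {
            bottomDegree.merge(edge[1], 1L, Long::sum);
        }
        long pairs = 0;
        for (long d : bottomDegree.values()) {
            pairs += d * (d - 1) / 2;
        }
        return pairs;
    }
}
```

For example, a single bottom vertex connected to 10 top vertices has 10 bipartite edges but yields 45 projected edges.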

vasia · 2016-09-29T19:39:38Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+
+/**
+ *
+ * Bipartite graph is a graph that contains two sets of vertices: top vertices and bottom vertices. Edges can only exist


I would rephrase that to "... a graph whose vertices can be divided into two disjoint sets"

vasia · 2016-09-29T19:40:29Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+ * between a pair of vertices from different vertices sets. E.g. there can be no vertices between a pair
+ * of top vertices.
+ *
+ * <p>Bipartite graph is useful to represent graphs with two sets of objects, like researchers and their publications,


Bipartite graphs are useful...

vasia · 2016-09-29T19:42:29Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+
+	/**
+	 * Convert a bipartite into a graph that contains only top vertices. An edge between two vertices in the new
+	 * graph will exist only if the original bipartite graph contains a bottom vertex they both connected to.


they *are both connected to

vasia · 2016-09-29T19:42:58Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+	 * Convert a bipartite into a graph that contains only top vertices. An edge between two vertices in the new
+	 * graph will exist only if the original bipartite graph contains a bottom vertex they both connected to.
+	 *
+	 * <p>Caller should provide a function that will create an edge between two top vertices. This function will receive


I'm not sure whether this is a good idea. Why leave this to the user?

I thought that different datasets would require different algorithms to consolidate a number of connections into a single edge value. Hence the callback function. But I think Greg's idea is better.

vasia · 2016-09-29T19:43:15Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/Edge.java

-		this.f0 = src;
-		this.f1 = trg;
-		this.f2 = val;
+	public Edge(K source, K target, V value) {


Why did you change these?

To make them consistent with naming style in other classes.
Do you suggest reverting this?

greghogan

Yes, near the top of the list is FLINK-1267 to add a GroupedCross operator. It is nice to have this as an additional use case.

What if instead of passing in a user function we simply return a flattened tuple to the user? The user can then apply a (chainable) MapFunction to interpret the data as desired.

We could have multiple project methods. The full set is a Tuple8 (three labels, three vertex values, and two edge values). There could also be a variant without joining on the vertex sets that would return a Tuple5 (three labels, two edge values) and perhaps another variant to return a Graph with null edge values (NullValue).

I'd also look to add a middle variant as Tuple7 which only joins on one vertex set.

greghogan · 2016-09-29T20:19:16Z

...st-utils-parent/flink-test-utils/src/main/java/org/apache/flink/test/util/TestBaseUtils.java

@@ -480,7 +480,8 @@ protected static File asFile(String path) {
 			}
 		}

-		assertEquals("Wrong number of elements result", expectedStrings.length, resultStrings.length);
+		assertEquals(String.format("Wrong number of elements result. Expected: %s. Result: %s.", Arrays.toString(expectedStrings), Arrays.toString(resultStrings)),


Doesn't IntelliJ offer to view the different results?

The issue here is that it compares lengths of objects and therefore JUnit only prints compared numbers (say 2 and 0) and not content of arrays.

The array contents are compared in the assertions that follow the test for length.

What if we moved String.format to its own line, included in the string both the array lengths and contents, and added a comment to describe why we are also printing the full arrays?

Also, should the arrays be printed on new lines such that they would line up until the diverging element? We'll need to move the sorting of the arrays before the length check.

greghogan · 2016-09-29T20:27:26Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+								Edge<KT, NEV>> edgeFactory) {
+
+		DataSet<Edge<KT, NEV>> newEdges = edges.join(edges)
+			.where(new TopProjectionKeySelector<KT, KB, EV>())


Using the field index should be faster than a key selector, and allows the optimizer to reuse sorted fields.

mushketyk · 2016-09-29T22:16:09Z

Hi @greghogan,

I like your ideas about providing different API for projections. This should be better than my approach.
@vasia What do you think about this?

vasia · 2016-09-30T07:40:26Z

Providing a flattened tuple is certainly better than having the user implement the reduce, but I think we should provide separate methods for the default and custom operations. A projection is a very well-defined operation: create a graph where there is an edge between any pair of vertices with a common neighbor in the bipartite graph. If the user wants to apply mappers or other transformations on the vertices and edges, they can do so afterwards, using the graph methods. The problem is that with a projection, some information is lost, e.g. the edge values. For these cases, we can provide a custom projection method where we give the labels in a flattened tuple as @greghogan suggested, but I'm afraid the API will look ugly with a Tuple8 there. Another, maybe friendlier solution, would be attaching the labels on the projection graph edges. What do you think?

mushketyk · 2016-09-30T10:28:52Z

Tuple8 does not seem too friendly to me either. What do you mean by "attaching the labels"? Is it something similar to what we do with the Edge/Vertex classes, i.e. inheriting from Tuple and providing getters and setters to access the values in it? Or is there some other way to attach labels to tuples?

vasia · 2016-09-30T11:38:24Z

What I meant is simply creating an edge with a Tuple2 label containing the labels of the two edges in the bipartite graph. Makes sense?

greghogan · 2016-09-30T11:43:30Z

Agreed, I would amend my earlier suggestion to say we only need to start with two projection methods (for each of top and bottom), something like
public Graph<TK, TVV, Tuple2<EV, EV>> topProjectionSimple() {
and
public Graph<TK, TVV, TopProjection<TVV, BK, BVV, EV>> topProjection() {

TopProjection (we can find better names) would be a Tuple6 with POJO accessors as with Vertex, Edge, etc.
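As a rough illustration of the "simple" variant, the following plain-Java sketch builds a top projection whose edges carry the pair of original edge values (the role of `Tuple2<EV, EV>` above). `BEdge`, `PEdge`, and `projectTopSimple` are hypothetical stand-ins rather than Gelly classes, and emitting each unordered pair exactly once is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SimpleProjectionSketch {

    /** Bipartite edge: top id, bottom id, edge value. */
    static final class BEdge {
        final String top;
        final String bottom;
        final double value;

        BEdge(String top, String bottom, double value) {
            this.top = top;
            this.bottom = bottom;
            this.value = value;
        }
    }

    /** Projected edge: two top ids plus the values of the two bipartite edges that linked them. */
    static final class PEdge {
        final String source;
        final String target;
        final double sourceValue;
        final double targetValue;

        PEdge(String source, String target, double sourceValue, double targetValue) {
            this.source = source;
            this.target = target;
            this.sourceValue = sourceValue;
            this.targetValue = targetValue;
        }
    }

    static List<PEdge> projectTopSimple(List<BEdge> edges) {
        // Group the bipartite edges by their bottom vertex.
        Map<String, List<BEdge>> byBottom = new LinkedHashMap<>();
        for (BEdge e : edges) {
            byBottom.computeIfAbsent(e.bottom, k -> new ArrayList<>()).add(e);
        }
        // Emit one projected edge per unordered pair of top vertices in a group.
        List<PEdge> projected = new ArrayList<>();
        for (List<BEdge> group : byBottom.values()) {
            for (int i = 0; i < group.size(); i++) {
                for (int j = i + 1; j < group.size(); j++) {
                    BEdge a = group.get(i);
                    BEdge b = group.get(j);
                    projected.add(new PEdge(a.top, b.top, a.value, b.value));
                }
            }
        }
        return projected;
    }
}
```

The "full" variant would additionally carry the bottom vertex id and the vertex values, which is what the proposed Tuple6-based class is for.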

mushketyk · 2016-09-30T12:10:14Z

Makes sense to me. I'll implement this over the weekend.

mushketyk · 2016-10-04T09:04:55Z

@greghogan @vasia I've updated the code according to your suggestions.
The only thing I did differently: the more complete version of the bottom/top projection returns a Tuple4 that contains the vertex key, the vertex value, and the values of two vertices. I assumed that getting the values of the two other vertices would require two additional joins, which would make the method much slower, while a user can do this with the result of the method if needed.

greghogan · 2016-10-04T14:10:18Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+ * @param <BV> the bottom vertices value type
+ * @param <EV> the edge value type
+ */
+public class BipartiteGraph<TK, BK, TV, BV, EV> {


Would the generic parameters be easier to read as public class BipartiteGraph<KT, KB, VVT, VVB, EV> {?

Sure, I'll rename them.

Let's check with @vasia first. She may prefer the current type parameter names or have a better suggestion.

greghogan · 2016-10-04T14:25:39Z

The advantage to joining on vertex values before the grouped cross is that the number of projected vertices is quadratic in the vertex degree. The projected graphs will usually be much larger than the bipartite graph.

mushketyk · 2016-10-04T14:52:19Z

Hi @greghogan. Thank you for the clarification. I'll update the code accordingly. Do you have any other comments regarding the PR?

greghogan

I added a few more comments. Let's discuss with @vasia before reworking too much code.

greghogan · 2016-10-04T15:00:26Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteEdge.java

+
+	private static final long serialVersionUID = 1L;
+
+	public BipartiteEdge(){}


Extra space for () {}.

Good catch.

greghogan · 2016-10-04T15:10:43Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+			.map(new MapFunction<Tuple2<BipartiteEdge<TK, BK, EV>, BipartiteEdge<TK, BK, EV>>, Edge<TK, Tuple2<EV, EV>>>() {
+				@Override
+				public Edge<TK, Tuple2<EV, EV>> map(Tuple2<BipartiteEdge<TK, BK, EV>, BipartiteEdge<TK, BK, EV>> value) throws Exception {
+					return new Edge<>(


The Edge and nested Tuple2 can be reused.

I don't think I understand what you mean. Could you elaborate please?

We don't need to create new objects for each call to map. The Edge and Tuple2 can be fields on the class. For examples look in DegreeAnnotationFunctions.java.

Thank you. Got it.
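The object-reuse pattern mentioned above can be sketched in plain Java as follows. `Pair`, `AllocatingMapper`, and `ReusingMapper` are hypothetical names, and `java.util.function.Function` stands in for Flink's `MapFunction`; the point is only to contrast allocating per call with mutating one instance held in a field.

```java
import java.util.function.Function;

final class ReuseSketch {

    /** Mutable output record, standing in for a mutable tuple type. */
    static final class Pair {
        int left;
        int right;
    }

    /** Allocates a fresh Pair on every call. */
    static final class AllocatingMapper implements Function<int[], Pair> {
        @Override
        public Pair apply(int[] in) {
            Pair out = new Pair();
            out.left = in[0];
            out.right = in[1];
            return out;
        }
    }

    /** Keeps one Pair as a field and overwrites it on every call. */
    static final class ReusingMapper implements Function<int[], Pair> {
        private final Pair reused = new Pair();

        @Override
        public Pair apply(int[] in) {
            reused.left = in[0];
            reused.right = in[1];
            return reused;
        }
    }
}
```

The reusing variant avoids one allocation per record, but it is only safe when the consumer serializes or copies each returned object before the next call, because every call overwrites the same instance.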

greghogan · 2016-10-04T15:14:07Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/Projection.java

+		return this.f1;
+	}
+
+	public EV getEdgeValue1() {


Can we now call this the "source" value (and Value2 the "target" value)?

Sure. Good point.

greghogan · 2016-10-04T15:17:04Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+	 * @return top projection of the bipartite graph where every edge contains a tuple with values of two edges that
+	 * connect top vertices in the original graph
+	 */
+	public Graph<TK, TV, Tuple2<EV, EV>> simpleTopProjection() {


Would it be preferable for IDE command completion to call this method projectTopSimple (and then have projectTopFull / projectBottomSimple / projectBottomFull)?

Good point. Will update.

greghogan · 2016-10-04T15:18:42Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+	 * @return top projection of the bipartite graph where every edge contains a tuple with values of two edges that
+	 * connect top vertices in the original graph
+	 */
+	public Graph<TK, TV, Tuple2<EV, EV>> simpleTopProjection() {


Also, we discussed having only a Tuple2 of edge values, but I'm double-checking that we don't also want to include the (here: bottom) vertex ID in the new edge value.

Sorry, I didn't get your point. Could you please elaborate on this?

vasia · 2016-10-06T13:34:14Z

Thanks for the update @mushketyk and for the review @greghogan. I agree with your suggestions. For the type parameters I would go for <KT, KB, VVT, VVB, EV>. Let me know if there's any other issue you'd like my opinion on.

mushketyk · 2016-11-06T21:39:00Z

@vasia @greghogan I've updated the PR. Could you please give it another look?

mushketyk · 2016-11-09T19:06:34Z

New gelly tests failed with errors like:

Caused by: java.io.IOException: Insufficient number of network buffers: required 32, but only 3 available. The total number of network buffers is currently set to 2048. You can increase this number by setting the configuration key 'taskmanager.network.numberOfBuffers'.

Do you know what is causing this error? Should I update the code somehow?

greghogan · 2016-11-09T19:23:31Z

Try switching to ExecutionEnvironment.createCollectionsEnvironment().

mushketyk · 2016-11-10T20:33:15Z

@greghogan
Thank you for the suggestion.
The build is passing now.

vasia · 2016-11-21T11:31:17Z

Thanks @mushketyk. @greghogan are you shepherding this PR or shall I?

mushketyk · 2016-11-24T15:10:44Z

@vasia I don't think anybody is shepherding this PR :)

greghogan

@mushketyk, thanks for your contribution! In addition to some comments, if you could rebase the PR we may avoid later difficulties. I don't expect any conflicts since you are mostly adding new code.

greghogan · 2016-12-05T18:07:13Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteEdge.java

+import org.apache.flink.api.java.tuple.Tuple3;
+
+/**
+ *


Empty line.

greghogan · 2016-12-05T18:08:38Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteEdge.java

+
+/**
+ *
+ * A BipartiteEdge represents a link between a top and bottom vertices


"between a top" -> "between top", or similar.

greghogan · 2016-12-05T18:14:52Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteEdge.java

+/**
+ *
+ * A BipartiteEdge represents a link between a top and bottom vertices
+ * in a {@link BipartiteGraph}. It is similar to the {@link Edge} class


"It is a generalized form of {@link Edge} where the source and target vertex IDs can be of different types.", or similar?

greghogan · 2016-12-05T18:18:03Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteEdge.java

+		return this.f0;
+	}
+
+	public void setTopId(KT i) {


Parameter name "i" -> "topId"? Also, below for "i" -> "bottomId" and "newValue" -> "value"?

greghogan · 2016-12-05T18:18:15Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+import org.apache.flink.api.java.tuple.Tuple2;
+
+/**
+ *


Empty line.

greghogan · 2016-12-05T19:27:50Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/BipartiteGraph.java

+				}
+			})
+			// Join with top vertices to preserve top vertices values
+			.join(topVertices)


Is it not more efficient to do the Cartesian join last if we assume that the |vertices| << |bipartite edges| << |simple edges|? The code can be reused between the full projection functions: first join the bipartite edge with top vertex, then join the result with the bottom vertex (perhaps using the Projection class below with NullValue where appropriate).

greghogan · 2016-12-05T19:29:44Z

flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/Projection.java

+ * @param <VV> the value type of vertices of an opposite set
+ * @param <EV> the edge value type
+ */
+public class Projection<VK, VV, EV, VVC> extends Tuple6<VK, VV, EV, EV, VVC, VVC> {


Missing comment for documenting VVC. Should EV be placed before VVC? And before VK and VV?

greghogan · 2016-12-05T19:30:44Z

flink-libraries/flink-gelly/src/test/java/org/apache/flink/graph/BipartiteEdgeTest.java

+	@Test
+	public void testSetBottomId() {
+		edge.setBottomId(100);
+		assertEquals(Integer.valueOf(100), edge.getBottomId());


Does auto-boxing not work here?

It works, but the compiler can't decide between assertEquals(Object, Object) and assertEquals(long, long).
Anyway I replaced it with:

assertEquals(100, (long) edge.getBottomId());
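The ambiguity described above can be reproduced without JUnit. In this sketch, `check` is a hypothetical pair of overloads that mirrors the two `assertEquals` signatures and returns the name of the overload the compiler selected.

```java
final class OverloadSketch {

    static String check(Object expected, Object actual) {
        return "Object,Object";
    }

    static String check(long expected, long actual) {
        return "long,long";
    }

    /** Both arguments are boxed, so only the Object overload applies without unboxing. */
    static String boxedBoth() {
        Integer actual = 100;
        return check(Integer.valueOf(100), actual);
    }

    /** Both arguments are primitive, so only the long overload applies without boxing. */
    static String primitiveBoth() {
        Integer actual = 100;
        return check(100, (long) actual);
    }

    // check(100, actual) with an int literal and an Integer does not compile:
    // the Object overload needs boxing, the long overload needs unboxing,
    // and neither overload is more specific than the other.
}
```

Either workaround compiles; they differ only in which overload ends up performing the comparison.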

greghogan · 2016-12-05T19:33:16Z

flink-libraries/flink-gelly/src/test/java/org/apache/flink/graph/ProjectionTest.java

+
+import static org.junit.Assert.assertEquals;
+
+public class ProjectionTest {


Is this test class necessary?

I know that testing getters/setters is considered a waste of CPU cycles, but since the getters/setters are not auto-generated here and need to access the proper tuple fields, I decided to add the tests.

mushketyk · 2016-12-05T20:10:59Z

Hi @greghogan , thank you for your review.
I'll try to fix them in the next couple of days.

Best regards,
Ivan.

mushketyk · 2016-12-07T22:27:19Z

Hi @greghogan,

I've updated the PR according to your review and rebased it on top of the master branch.

The only thing that I didn't change is the message in the assertEquals you pointed to since it is not very helpful to receive an error message like: "Wrong number of elements result. Expected 4, actual 3." I think it is much more helpful for the debugging purposes to see contents of the arrays to figure out why their lengths are different.

greghogan · 2016-12-08T19:37:58Z

@mushketyk @vasia, thoughts on package naming? Should we create a new org.apache.flink.bigraph package? Another option would be org.apache.flink.graph.bidirectional which would suggest future package names like org.apache.flink.graph.multi and org.apache.flink.graph.temporal.

vasia · 2016-12-08T19:45:49Z

I would go for org.apache.flink.graph.bipartite. I think that bidirectional simply suggests that each edge exists in both directions.

greghogan · 2016-12-08T19:56:12Z

Yes, you are right, bipartite.

mushketyk · 2016-12-08T21:19:13Z

@vasia, @greghogan I've created a new package, moved the new classes there, and updated the PR according to your latest comments.

Best regards,
Ivan.

greghogan · 2016-12-08T21:45:35Z

...st-utils-parent/flink-test-utils/src/main/java/org/apache/flink/test/util/TestBaseUtils.java

-
-		assertEquals("Wrong number of elements result", expectedStrings.length, resultStrings.length);
+
+		//


Looks like you were intending to add a comment here. The if (sort) ... block will need to go before msg is created. Should we add newlines to the message string, something like "Expected %d elements but received %d.\n expected array: %s\n received array: %s"? That should line up.

greghogan · 2016-12-08T21:46:06Z

...st-utils-parent/flink-test-utils/src/main/java/org/apache/flink/test/util/TestBaseUtils.java


 		if (sort) {
 			Arrays.sort(expectedStrings);
 			Arrays.sort(resultStrings);
 		}

 		for (int i = 0; i < expectedStrings.length; i++) {
-			assertEquals(expectedStrings[i], resultStrings[i]);
+			assertEquals(msg, expectedStrings[i], resultStrings[i]);


Do we need to include msg here?

I think it will give more context just as in the comparing lengths case.
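The shape of the check being discussed can be sketched in plain Java as below. `compare` is a hypothetical helper, not the actual `TestBaseUtils` method, and it throws `AssertionError` directly instead of calling JUnit so that the block stays self-contained. Sorting happens before the message is built, so both arrays line up until the first diverging element.

```java
import java.util.Arrays;

final class CompareSketch {

    static void compare(String[] expected, String[] actual, boolean sort) {
        // Work on copies so the caller's arrays are left untouched.
        String[] exp = expected.clone();
        String[] act = actual.clone();

        // Sort first, so the printed arrays are aligned element by element.
        if (sort) {
            Arrays.sort(exp);
            Arrays.sort(act);
        }

        // Include lengths and full contents: the lengths alone rarely explain a failure.
        String msg = String.format(
            "Expected %d elements but received %d.\n expected array: %s\n received array: %s",
            exp.length, act.length, Arrays.toString(exp), Arrays.toString(act));

        if (exp.length != act.length) {
            throw new AssertionError(msg);
        }
        for (int i = 0; i < exp.length; i++) {
            if (!exp[i].equals(act[i])) {
                throw new AssertionError(msg);
            }
        }
    }
}
```

Reusing the same message in the element-wise check gives a failing comparison the same context as a length mismatch.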

mushketyk · 2016-12-08T22:10:24Z

Hi @greghogan , I've fixed the PR according to your review.

vasia · 2016-12-09T20:23:33Z

Thank you both for your work @mushketyk and @greghogan!
Please, keep in mind that we should always add documentation for every new feature; especially a big one such as supporting a new graph type. We've added the checklist template for each new PR so that we don't forget about it :)
Can you please open a JIRA to track that docs for bipartite graphs are missing? Thank you!

greghogan · 2016-12-09T21:59:32Z

Thanks for the reminder @vasia. The separate JIRA sub-task does allow for a discussion of how best to document the full set of proposed bipartite functionality.

mushketyk · 2016-12-09T22:22:38Z

Hi @vasia , thank you for merging my PR.
Thank you for the reminder about the documentation. I've created the JIRA for it: https://issues.apache.org/jira/browse/FLINK-5311

This closes apache#2564

vasia reviewed Sep 29, 2016

greghogan reviewed Sep 29, 2016

greghogan reviewed Oct 4, 2016

mushketyk force-pushed the bipartite-graph branch from f627fc4 to acadfa4 Compare November 6, 2016 21:38

greghogan reviewed Dec 5, 2016

mushketyk added 2 commits December 7, 2016 22:21

[FLINK-2254] Add BipartiateGraph class

812fc37

[FLINK-2254] Implement simple and full projection methods

7fcb456

mushketyk force-pushed the bipartite-graph branch from ed1b46a to 7fcb456 Compare December 7, 2016 22:22

[FLINK-2254] Created new "graph.bipartite" package

298a52a

greghogan reviewed Dec 8, 2016

[FLINK-2254] Fixed according to review

79e24a3

asfgit closed this in 365cd98 Dec 9, 2016

static-max pushed a commit to static-max/flink that referenced this pull request Dec 13, 2016

[FLINK-4646] [gelly] Add BipartiateGraph

998aa1f

This closes apache#2564

joseprupi pushed a commit to joseprupi/flink that referenced this pull request Feb 12, 2017

[FLINK-4646] [gelly] Add BipartiateGraph

fbee36d

This closes apache#2564

rmetzger added the component=Library/GraphProcessing(Gelly) label Mar 14, 2019

