Graph: #105

orhankislal · 2017-02-23T18:53:48Z

Create generic graph validation and help message to standardize
future graph algorithm development.
Expand the design document with more detail on the graph
representation as well as the SSSP implementation.

Closes #105

iyerr3 · 2017-02-27T22:08:15Z

doc/design/modules/graph.tex

-% \section{Graph Representation} \label{sec:graph:rep}
+\section{Graph Framework} \label{sec:graph:fw}
+
+MADlib graph representation depends on two structures, a \emph{vertex} table and an \emph{edge} table. The vertex table has to have a column of vertex ids. The edge table has to have 2 columns: source vertex id, destination vertex id. For most algorithms an edge weight column is required as well. The representation assumes a directed graph, an edge from $x$ to $y$ does \emph{not} guarantee the existence of an edge from $y$ to $x$. Both of the tables may have additional columns as required. Multi-edges (multiple edges from a vertex to the same destination) and loops (edge from a vertex to itself) are allowed. For ideal performance, vertex and edge tables should be distributed on vertex id and source id respectively. This representation does not impose any ordering of vertices or edges. An example graph is given in Figure~\ref{sssp:example} and its representative tables are given in Table~\ref{sssp:rep}.


Using an example considerably helps in understanding the algorithms.

Please reformat to width of 80 characters.

The note about performance is algorithm-specific and would not necessarily generalize to all graph algorithms. I suggest not talking about it here. Maybe make it a note within each algorithm description.
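
For illustration, the representation described in the hunk above can be sketched in plain Python. This is only a sketch under stated assumptions: the lists `vertex` and `edge` stand in for the two tables, and the helper names `out_edges` and `dangling_edges` are hypothetical, not part of MADlib.

```python
# Hypothetical in-memory stand-ins for the vertex and edge tables.
# vertex rows: (id,); edge rows: (src, dest, weight).
vertex = [(0,), (1,), (2,), (3,)]
edge = [
    (0, 1, 1.0),
    (0, 2, 4.0),
    (1, 2, 2.0),
    (1, 2, 5.0),  # multi-edge: a second edge from 1 to 2
    (3, 3, 0.5),  # loop: an edge from 3 to itself
]


def out_edges(edge_rows, src):
    """Return (dest, weight) pairs for edges leaving src.

    The graph is directed, so an edge from x to y says nothing
    about an edge from y to x.
    """
    return [(d, w) for s, d, w in edge_rows if s == src]


def dangling_edges(vertex_rows, edge_rows):
    """Return edges whose source or destination is not a known vertex id."""
    ids = {row[0] for row in vertex_rows}
    return [e for e in edge_rows if e[0] not in ids or e[1] not in ids]
```

Neither helper relies on the order of the rows, which matches the statement that the representation imposes no ordering on vertices or edges.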

iyerr3 · 2017-02-27T22:20:29Z

doc/design/modules/graph.tex

+\end{lstlisting}
+\end{algorithm}
+
+We begin our analysis of Find Updates function from its innermost subquery. This subquery (lines 11-16) takes a set of vertices (in the table $old\_update$) and finds the reachable vertices. In case a vertex is reachable by multiple vertices, only the path that has the minimum cost is considered. This means the input vertices need the value of their path as well. In our example, both $v_1$ and $v_2$ can reach $v_3$. In this case, we would have to use $v_2$ -> $v_3$ edge since that gives the lowest possible path value. Please note that we are aggregating the rows using the $min$ operator for each destination vertex and we are unable to return the source vertex at the same time. This means we know the value of $v_3$ should be $2$ but we cannot know its parent ($v_2$) at the same time. To solve this limitation, we combine the result with $edge$ and $old\_update$ tables (lines 7-10) and get the rows that has the same minimum value. At this point, we would have to tackle the problem of tie-breaking. Vertex $v_5$ has two paths leading into: <5,2,1> and <5,2,2>. The inner subquery will return <5,2> and it will match both of these edges. However, it is redundant to keep both of them in the update list as that would require updating the same vertex multiple times in a given iteration. By using the $DISTINCT$ clause at line 2, we allow the underlying system to accept only a single one of them. Finally, we want to make sure these updates are actually leading us to shortest paths. Line 21 ensures that the values stored in the $out\_table$ does not increase and the solution does not regress throughout the iterations.


A couple of suggestions on simplifying:

Use meaningful names for the query aliases and refer to those in the text explanation.

Make the explanation a list that gives (shorter) explanation of each set of lines.

Again, hard-wrapping to 80 characters would help.
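
As an aid to reading the paragraph under review, here is a rough Python sketch of the three steps it describes: the minimum cost per destination, the join back to recover the parent, and the check that an update actually improves the stored value. It is not the SQL from the patch. The function name `find_updates`, the dictionary layouts, and the sample graph are assumptions, and ties are broken by the smallest parent id for determinism, whereas the DISTINCT clause in the patch lets the database pick any one match.

```python
import math


def find_updates(edges, old_update, out_table):
    """Sketch of one Find Updates step.

    edges:      list of (src, dest, weight)
    old_update: {vertex: cost} for vertices updated in the last iteration
    out_table:  {vertex: (cost, parent)} best known results so far
    Returns {dest: (cost, parent)} for destinations that improve.
    """
    # Step 1: minimum cost per destination (the innermost subquery).
    # The aggregate alone cannot say which source produced the minimum.
    best = {}
    for src, dest, w in edges:
        if src in old_update:
            cost = old_update[src] + w
            if cost < best.get(dest, math.inf):
                best[dest] = cost

    # Step 2: join back to the edges to recover a parent that achieves
    # the minimum, keeping one row per destination on ties.
    updates = {}
    for src, dest, w in edges:
        if src in old_update and old_update[src] + w == best[dest]:
            if dest not in updates or src < updates[dest][1]:
                updates[dest] = (best[dest], src)

    # Step 3: keep only updates that lower the stored value, so the
    # solution never regresses between iterations.
    return {
        v: cp
        for v, cp in updates.items()
        if cp[0] < out_table.get(v, (math.inf, None))[0]
    }


# Hypothetical sample data, not the figure from the design document.
edges = [(0, 1, 1), (0, 2, 1), (1, 3, 2), (2, 3, 1), (1, 5, 1), (2, 5, 1)]
old_update = {1: 1, 2: 1}
out_table = {0: (0, 0), 1: (1, 0), 2: (1, 0)}
result = find_updates(edges, old_update, out_table)
```

In this sample, vertex 3 is reachable from both 1 and 2 and takes 2 as its parent because that path is cheaper, while vertex 5 has two equal-cost parents and keeps only one of them.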

orhankislal · 2017-02-28T22:35:55Z

Thanks for the comments @iyerr3. I tried to reorganize the algorithm explanation a bit; please let me know what you think.

njayaram2 · 2017-03-06T20:08:50Z

src/ports/postgres/modules/graph/graph_utils.py_in

+    			-- named arguments of the form "name=value".
+    {other_text}
+    out_table     TEXT  -- Name of the table to store the result of SSSP.
+);


Having a mandatory param such as out_table after {other_text} might not work well if we have optional params. It seems fine in sssp since other_text is essentially the starting vertex id (a mandatory param), but that might not be true for other modules. I suggest we move other_text to after out_table.

An example of a graph-based algorithm that uses optional params is PageRank. We will have optional params such as max_iter and threshold that will be listed after out_table.

I was trying to avoid changing the SSSP notation but I guess it is inevitable. Do you think separating other_text into two (mandatory_params and optional_params) could work for future graph algorithms?

Yes, that might work better. We can have other_mandatory_params and optional_params, before and after out_table respectively. We may have to follow this rule for other graph modules: out_table must be our last mandatory param, to maintain some consistency.

But this might not be in line with our existing modules. For example, I checked elastic_net and the output table is one of the mandatory params specified early on. There are several algorithm-specific mandatory params following the output table name.
We should also put a comment in the code specifying the reason, else it will look confusing.

I think we might want to move out_table to the other parameters as well. For some functions like graph diameter, we don't have to create an output table. That would allow PageRank to place its optional parameters after out_table.
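
One way to picture the ordering discussed above is a small Python sketch that renders a usage message from two parameter lists, mandatory first and optional last. This is not the code in graph_utils.py_in: `build_usage`, `example_graph_algo`, and the parameter descriptions are hypothetical, and only `out_table`, `max_iter`, and `threshold` are names taken from this thread.

```python
def build_usage(func_name, mandatory, optional=()):
    """Render a usage message with mandatory params before optional ones.

    Each param is a (name, sql_type, description) tuple. Every row ends
    with a comma except the last, as in a SQL call signature.
    """
    params = list(mandatory) + list(optional)
    rows = []
    for i, (name, sql_type, desc) in enumerate(params):
        sep = "," if i < len(params) - 1 else " "
        rows.append("    {0} {1}{2} -- {3}".format(name, sql_type, sep, desc))
    return "SELECT {0}(\n{1}\n);".format(func_name, "\n".join(rows))


# Hypothetical parameter lists for an algorithm with optional params.
mandatory = [
    ("vertex_table", "TEXT", "Name of the table that contains the vertex data."),
    ("edge_table", "TEXT", "Name of the table that contains the edge data."),
    ("out_table", "TEXT", "Name of the table to store the result."),
]
optional = [
    ("max_iter", "INTEGER", "Maximum number of iterations."),
    ("threshold", "FLOAT8", "Convergence threshold."),
]
usage = build_usage("example_graph_algo", mandatory, optional)
```

Because each algorithm supplies its own lists, a function that needs no output table could simply leave `out_table` out of its mandatory list.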

orhankislal added 2 commits February 23, 2017 10:46

Graph:

90ce8f1

- Create generic graph validation and help message to standardize future graph algorithm development. - Expand the design document with more detail on the graph representation as well as the SSSP implementation. Closes apache#105

Fix vertex table typo.

4e20b7a

iyerr3 reviewed Feb 28, 2017


Graph: Update the design doc for clarity.

9df97c9

njayaram2 reviewed Mar 6, 2017


Graph: Update the generic help message.

f0a16bf

asfgit closed this in 01586c0 Mar 13, 2017

orhankislal deleted the graph/fw_take1 branch June 28, 2017 01:17