Graph: #105
- Create generic graph validation and help message to standardize future graph algorithm development.
- Expand the design document with more detail on the graph representation as well as the SSSP implementation.

Closes apache#105
doc/design/modules/graph.tex
% \section{Graph Representation} \label{sec:graph:rep}
\section{Graph Framework} \label{sec:graph:fw}

MADlib graph representation depends on two structures, a \emph{vertex} table and an \emph{edge} table. The vertex table has to have a column of vertex ids. The edge table has to have 2 columns: source vertex id and destination vertex id. For most algorithms an edge weight column is required as well. The representation assumes a directed graph: an edge from $x$ to $y$ does \emph{not} guarantee the existence of an edge from $y$ to $x$. Both of the tables may have additional columns as required. Multi-edges (multiple edges from a vertex to the same destination) and loops (an edge from a vertex to itself) are allowed. For ideal performance, vertex and edge tables should be distributed on vertex id and source id respectively. This representation does not impose any ordering of vertices or edges. An example graph is given in Figure~\ref{sssp:example} and its representative tables are given in Table~\ref{sssp:rep}.
Using an example considerably helps in understanding the algorithms.
- Please reformat to width of 80 characters.
- The note about performance is algorithm-specific and would not necessarily generalize to all graphs. I suggest not talking about it here. Maybe make it a note within each algorithm description.
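The vertex and edge tables described in the quoted paragraph can be sketched with a tiny in-memory example. This is a minimal illustration using Python's `sqlite3`, not MADlib itself; the table names, the column names (`id`, `src`, `dest`, `weight`) and the sample rows are all assumptions made for the sketch.

```python
import sqlite3

# In-memory database standing in for the real backend (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertex (id INTEGER);                       -- one column of vertex ids
    CREATE TABLE edge (src INTEGER, dest INTEGER, weight REAL);
""")

conn.executemany("INSERT INTO vertex VALUES (?)", [(i,) for i in range(3)])
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [
        (0, 1, 1.0),
        (0, 1, 2.5),  # multi-edge: a second edge from 0 to 1
        (1, 2, 1.0),
        (2, 2, 0.5),  # loop: an edge from 2 to itself
    ],
)

COUNT_EDGES = "SELECT COUNT(*) FROM edge WHERE src = ? AND dest = ?"
# The graph is directed: edges 0 -> 1 exist, but no edge 1 -> 0 does.
forward_edges = conn.execute(COUNT_EDGES, (0, 1)).fetchone()[0]
reverse_edges = conn.execute(COUNT_EDGES, (1, 0)).fetchone()[0]
loop_edges = conn.execute(COUNT_EDGES, (2, 2)).fetchone()[0]
```

Neither table needs any particular row order, and both could carry extra columns without affecting these queries.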
doc/design/modules/graph.tex
\end{lstlisting}
\end{algorithm}

We begin our analysis of the Find Updates function from its innermost subquery. This subquery (lines 11-16) takes a set of vertices (in the table $old\_update$) and finds the reachable vertices. In case a vertex is reachable by multiple vertices, only the path that has the minimum cost is considered. This means the input vertices need the value of their path as well. In our example, both $v_1$ and $v_2$ can reach $v_3$. In this case, we would have to use the $v_2$ -> $v_3$ edge since that gives the lowest possible path value. Please note that we are aggregating the rows using the $min$ operator for each destination vertex and we are unable to return the source vertex at the same time. This means we know the value of $v_3$ should be $2$ but we cannot know its parent ($v_2$) at the same time. To solve this limitation, we combine the result with the $edge$ and $old\_update$ tables (lines 7-10) and get the rows that have the same minimum value. At this point, we would have to tackle the problem of tie-breaking. Vertex $v_5$ has two paths leading into it: <5,2,1> and <5,2,2>. The inner subquery will return <5,2> and it will match both of these edges. However, it is redundant to keep both of them in the update list as that would require updating the same vertex multiple times in a given iteration. By using the $DISTINCT$ clause at line 2, we allow the underlying system to accept only a single one of them. Finally, we want to make sure these updates are actually leading us to shortest paths. Line 21 ensures that the values stored in the $out\_table$ do not increase and the solution does not regress throughout the iterations.
Couple of suggestions on simplifying:
- Use meaningful names for the query aliases and refer to those in the text explanation.
- Make the explanation a list that gives (shorter) explanation of each set of lines.
Again, hard-wrapping to 80 characters would help.
Thanks for the comments @iyerr3. I tried to reorganize the algorithm explanation a bit, please let me know what you think.
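The steps of the Find Updates explanation (minimum per destination, join back to recover the parent, tie-breaking, and the no-regression check) can be sketched as one query. This is a rough illustration in Python's `sqlite3`, not the query from the design document: the column names, the sample edges, and the `UNREACHED` sentinel are assumptions, and because SQLite has no `DISTINCT ON`, ties are broken here with `MIN(e.src)` and `GROUP BY` instead of the `DISTINCT` clause the text refers to.

```python
import sqlite3

UNREACHED = 1e18  # assumed finite stand-in for an "infinite" distance

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edge (src INTEGER, dest INTEGER, weight REAL);
    CREATE TABLE old_update (id INTEGER, val REAL, parent INTEGER);
    CREATE TABLE out_table (id INTEGER, val REAL, parent INTEGER);
""")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
    (0, 1, 1.0), (0, 2, 1.0),
    (1, 3, 3.0), (2, 3, 1.0),   # v3 is cheaper through v2
    (1, 5, 1.0), (2, 5, 1.0),   # v5 ties between v1 and v2
    (1, 2, 5.0),                # would worsen v2, must be rejected
])
# Vertices updated in the previous iteration, with their path values.
conn.executemany("INSERT INTO old_update VALUES (?, ?, ?)",
                 [(1, 1.0, 0), (2, 1.0, 0)])
conn.executemany("INSERT INTO out_table VALUES (?, ?, ?)", [
    (0, 0.0, 0), (1, 1.0, 0), (2, 1.0, 0),
    (3, UNREACHED, None), (5, UNREACHED, None),
])

FIND_UPDATES = """
SELECT best.id, best.val, MIN(e.src) AS parent   -- tie-break: smallest parent id
FROM (
    -- Innermost step: cheapest candidate value per destination vertex.
    SELECT e.dest AS id, MIN(u.val + e.weight) AS val
    FROM old_update AS u JOIN edge AS e ON e.src = u.id
    GROUP BY e.dest
) AS best
-- Join back to find which source vertex achieves that minimum.
JOIN edge AS e ON e.dest = best.id
JOIN old_update AS u ON u.id = e.src AND u.val + e.weight = best.val
JOIN out_table AS o ON o.id = best.id
WHERE best.val < o.val                            -- never regress a stored value
GROUP BY best.id, best.val
ORDER BY best.id
"""
updates = conn.execute(FIND_UPDATES).fetchall()
```

With this sample data, `updates` holds one row for vertex 3 (value 2, parent 2) and one row for vertex 5 (value 2, parent 1); the candidate for vertex 2 is dropped because its stored value is already lower.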
-- named arguments of the form "name=value".
{other_text}
out_table TEXT -- Name of the table to store the result of SSSP.
);
Having a mandatory param such as `output_table` after `{other_text}` might not work well if we have optional params. It seems fine in sssp since `other_text` is essentially the starting vertex id (a mandatory param), but that might not be true for other modules. I suggest we move `other_text` to after `out_table`.
An example of a graph-based algorithm that uses optional params is PageRank. We will have optional params such as `max_iter` and `threshold` that will be listed after `out_table`.
I was trying to avoid changing the SSSP notation but I guess it is inevitable. Do you think separating `other_text` into two (`mandatory_params` and `optional_params`) could work for future graph algorithms?
Yes, that might work better. We can have `other_mandatory_params` and `optional_params`, before and after `out_table` respectively. We may have to follow this rule for other graph modules: `out_table` must be our last mandatory param, to maintain some consistency.
But this might not be in line with our existing modules. For example, I checked elastic_net and the output table is one of the mandatory params specified early on. There are several algorithm specific mandatory params following the output table name.
We should also put a comment in the code specifying the reason, else it will look confusing.
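The ordering rule proposed here (module-specific mandatory params, then `out_table`, then optional params) could look roughly like the following. This is only a sketch of the proposal: `build_signature`, the `COMMON_PARAMS` list, the `graph_sssp` and `pagerank` function names, `source_vertex`, and all parameter types are hypothetical and not taken from the PR.

```python
# Hypothetical parameters shared by every graph module (assumed names and types).
COMMON_PARAMS = [
    "vertex_table TEXT",
    "vertex_id TEXT",
    "edge_table TEXT",
    "edge_args TEXT",
]


def build_signature(func_name, other_mandatory_params=(), optional_params=()):
    """Assemble a usage string following the ordering rule from the discussion."""
    params = list(COMMON_PARAMS)
    params.extend(other_mandatory_params)  # module-specific mandatory params
    # out_table is kept as the last mandatory param so that optional
    # params can always follow it, whatever the module.
    params.append("out_table TEXT")
    params.extend(optional_params)
    body = ",\n".join("    " + p for p in params)
    return "SELECT {0}(\n{1}\n);".format(func_name, body)


sssp_sig = build_signature("graph_sssp", ["source_vertex INTEGER"])
pagerank_sig = build_signature(
    "pagerank", optional_params=["max_iter INTEGER", "threshold FLOAT8"]
)
```

Keeping the rule in one helper would also give a natural place for the code comment explaining why `out_table` sits where it does.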
I think we might want to move `out_table` to the other parameters as well. For some functions like graph diameter, we don't have to create an output table. That will allow the pagerank to place its optional parameters after the `out_table`.