Graph: #105

Closed
wants to merge 4 commits into from


orhankislal
Contributor

  • Create generic graph validation and help message to standardize
    future graph algorithm development.
  • Expand the design document with more detail on the graph
    representation as well as the SSSP implementation.

Closes #105

% \section{Graph Representation} \label{sec:graph:rep}
\section{Graph Framework} \label{sec:graph:fw}

The MADlib graph representation depends on two structures: a \emph{vertex} table and an \emph{edge} table. The vertex table must have a column of vertex ids. The edge table must have two columns: a source vertex id and a destination vertex id. Most algorithms also require an edge weight column. The representation assumes a directed graph: an edge from $x$ to $y$ does \emph{not} guarantee the existence of an edge from $y$ to $x$. Both tables may have additional columns as required. Multi-edges (multiple edges from a vertex to the same destination) and loops (an edge from a vertex to itself) are allowed. For ideal performance, the vertex and edge tables should be distributed on vertex id and source id, respectively. This representation does not impose any ordering on vertices or edges. An example graph is given in Figure~\ref{sssp:example} and its representative tables are given in Table~\ref{sssp:rep}.
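To make the two-table representation concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in database. The graph data, the column names (`id`, `src`, `dest`, `weight`), and the helper functions are assumptions made for illustration; they are not the graph from the design document's figure and not MADlib code.

```python
import sqlite3

# Hypothetical graph for illustration only (not the figure's graph).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertex (id INTEGER)")
conn.execute("CREATE TABLE edge (src INTEGER, dest INTEGER, weight REAL)")

conn.executemany("INSERT INTO vertex VALUES (?)", [(0,), (1,), (2,)])
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [
        (0, 1, 1.0),
        (0, 1, 4.0),  # multi-edge: a second edge from 0 to 1
        (1, 2, 2.0),
        (2, 2, 0.5),  # loop: an edge from 2 to itself
    ],
)


def has_edge(conn, src, dest):
    """Return True if at least one directed edge src -> dest exists."""
    row = conn.execute(
        "SELECT COUNT(*) FROM edge WHERE src = ? AND dest = ?", (src, dest)
    ).fetchone()
    return row[0] > 0


def edge_multiplicity(conn, src, dest):
    """Return how many parallel edges go from src to dest."""
    row = conn.execute(
        "SELECT COUNT(*) FROM edge WHERE src = ? AND dest = ?", (src, dest)
    ).fetchone()
    return row[0]
```

Because the graph is directed, `has_edge(conn, 0, 1)` holds here while `has_edge(conn, 1, 0)` does not, and the multi-edge shows up as two rows with the same source and destination.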
Contributor

Using an example considerably helps in understanding the algorithms.

  1. Please reformat to width of 80 characters.
  2. The note about performance is algorithm-specific and would not necessarily generalize for all graphs. I suggest not talking about it here. Maybe make it a note within each algorithm description.

\end{lstlisting}
\end{algorithm}

We begin our analysis of the Find Updates function with its innermost subquery. This subquery (lines 11--16) takes a set of vertices (in the table $old\_update$) and finds the vertices reachable from them. If a vertex is reachable from multiple vertices, only the path with the minimum cost is considered. This means the input vertices must carry the value of their path as well. In our example, both $v_1$ and $v_2$ can reach $v_3$. In this case, we have to use the $v_2 \rightarrow v_3$ edge since it gives the lowest possible path value. Note that we aggregate the rows using the $min$ operator for each destination vertex and cannot return the source vertex at the same time. This means we know the value of $v_3$ should be $2$ but we cannot know its parent ($v_2$) at the same time. To work around this limitation, we combine the result with the $edge$ and $old\_update$ tables (lines 7--10) and keep the rows that have the same minimum value. At this point, we have to tackle the problem of tie-breaking. Vertex $v_5$ has two paths leading into it: $\langle 5,2,1 \rangle$ and $\langle 5,2,2 \rangle$. The inner subquery will return $\langle 5,2 \rangle$, which matches both of these edges. However, it is redundant to keep both of them in the update list, as that would require updating the same vertex multiple times in a given iteration. By using the $DISTINCT$ clause at line 2, we allow the underlying system to accept only one of them. Finally, we want to make sure these updates are actually leading us to shortest paths. Line 21 ensures that the values stored in $out\_table$ do not increase and the solution does not regress throughout the iterations.
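The steps described above can be sketched in plain Python. This is an illustrative in-memory analogue, not the SQL from the listing: the function name `find_updates`, the data layout, and the sample numbers are assumptions, and the tuples are read as (vertex, value, parent). Where the SQL lets the database keep an arbitrary row among ties, this sketch keeps the smallest parent id so that the result is deterministic, and it uses a strict comparison in the last step so that equal-cost rows are not re-emitted.

```python
import math


def find_updates(edges, old_update, out_table):
    """One update step: returns {dest: (value, parent)}.

    edges      -- list of (src, dest, weight)
    old_update -- {vertex: path value} for vertices updated last iteration
    out_table  -- {vertex: (value, parent)} best results found so far
    """
    # Step 1: min aggregation per destination (parent is not known yet).
    best = {}
    for src, dest, weight in edges:
        if src in old_update:
            candidate = old_update[src] + weight
            if candidate < best.get(dest, math.inf):
                best[dest] = candidate

    # Step 2: combine with the edges again to recover a parent that
    # achieves the minimum; ties are broken by smallest parent id
    # (an assumption standing in for the DISTINCT clause).
    updates = {}
    for src, dest, weight in edges:
        if src in old_update and old_update[src] + weight == best[dest]:
            if dest not in updates or src < updates[dest][1]:
                updates[dest] = (best[dest], src)

    # Step 3: keep only updates that improve on the stored value.
    return {
        dest: (value, parent)
        for dest, (value, parent) in updates.items()
        if value < out_table.get(dest, (math.inf, None))[0]
    }


# Hypothetical data: vertices 1 and 2 both reach 3 and 5.
edges = [(1, 3, 3.0), (2, 3, 1.0), (1, 5, 1.0), (2, 5, 1.0)]
old_update = {1: 1.0, 2: 1.0}
out_table = {0: (0.0, 0), 1: (1.0, 0), 2: (1.0, 0)}
result = find_updates(edges, old_update, out_table)
```

With this sample data, vertex 3 takes its value through vertex 2, and the two equal-cost candidates for vertex 5 collapse into a single update.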
Contributor

Couple of suggestions on simplifying:

  1. Use meaningful names for the query aliases and refer to those in the text explanation.
  2. Make the explanation a list that gives (shorter) explanation of each set of lines.

Again, hard-wrapping to 80 characters would help.

@orhankislal
Contributor Author

Thanks for the comments @iyerr3. I tried to reorganize the algorithm explanation a bit, please let me know what you think.

-- named arguments of the form "name=value".
{other_text}
out_table TEXT -- Name of the table to store the result of SSSP.
);
Contributor

Having a mandatory param such as output_table after {other_text} might not work well if we have optional params. It seems fine in sssp since other_text is essentially the starting vertex id (a mandatory param), but that might not be true for other modules. I suggest we move other_text to after out_table.

An example of a graph-based algorithm that uses optional params is PageRank. We will have optional params such as max_iter and threshold that will be listed after out_table.

Contributor Author

I was trying to avoid changing the SSSP notation but I guess it is inevitable. Do you think separating other_text into two (mandatory_params and optional_params) could work for future graph algorithms?

Contributor

Yes, that might work better. We can have other_mandatory_params and optional_params, before and after out_table respectively. We may have to follow this rule for other graph modules: out_table must be our last mandatory param, to maintain some consistency.

But this might not be in line with our existing modules. For example, I checked elastic_net and the output table is one of the mandatory params specified early on. There are several algorithm specific mandatory params following the output table name.
We should also put a comment in the code specifying the reason, else it will look confusing.
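The ordering proposed in this thread could look roughly like the sketch below. The helper `build_usage`, the `vertex_table`, `edge_table`, and `source_vertex` entries, the type names, and the function names passed in are all hypothetical and are not taken from the MADlib code; only `out_table`, `max_iter`, and `threshold` are named in the discussion.

```python
def build_usage(func_name, other_mandatory_params=(), optional_params=()):
    """Render a usage string with out_table as the last mandatory param.

    Each param is a (declaration, comment) pair. Keeping out_table last
    among the mandatory params leaves room for optional params after it.
    """
    params = [
        ("vertex_table TEXT", "Name of the vertex table."),
        ("edge_table TEXT", "Name of the edge table."),
    ]
    params += list(other_mandatory_params)
    params.append(("out_table TEXT", "Name of the table to store the result."))
    params += list(optional_params)

    rendered = []
    for i, (decl, comment) in enumerate(params):
        sep = "," if i < len(params) - 1 else ""
        rendered.append("    {0}{1} -- {2}".format(decl, sep, comment))
    return "SELECT {0}(\n{1}\n);".format(func_name, "\n".join(rendered))


# Hypothetical SSSP-style call: one extra mandatory param, no optional ones.
sssp_usage = build_usage(
    "sssp_sketch",
    other_mandatory_params=[("source_vertex INTEGER", "Starting vertex id.")],
)

# Hypothetical PageRank-style call: optional params follow out_table.
pagerank_usage = build_usage(
    "pagerank_sketch",
    optional_params=[
        ("max_iter INTEGER", "Maximum number of iterations."),
        ("threshold FLOAT8", "Convergence threshold."),
    ],
)
```

Printing `sssp_usage` and `pagerank_usage` side by side shows the same template serving both a module with only mandatory parameters and one with trailing optional parameters.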

Contributor Author

I think we might want to move out_table to the other parameters as well. For some functions like graph diameter, we don't have to create an output table. That will allow the pagerank to place its optional parameters after the out_table.

@asfgit asfgit closed this in 01586c0 Mar 13, 2017
@orhankislal orhankislal deleted the graph/fw_take1 branch June 28, 2017 01:17