Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feautre: Add grouping support for PageRank #112

Closed
wants to merge 3 commits into from

Conversation

njayaram2
Copy link
Contributor

MADLIB-1082

  • Add grouping support for pagerank, which will compute a PageRank
    probability distribution for the graph represented by each group.
  • Add convergence test, so that PageRank computation terminates
    if the pagerank value of no node changes beyond a threshold across
    two consecutive iterations (or max_iters number of iterations are
    done, whichever happens first). In case of grouping, the algorithm
    terminates only after all groups have converged.
  • Create a summary table apart from the output table that records
    the number of iterations required for convergence. Iterations
    required for convergence of each group is recorded when grouping
    is used. This implementation also ensures that we don't compute
    PageRank for groups that have already converged.

MADLIB-1082

- Add grouping support for pagerank, which will compute a PageRank
probability distribution for the graph represented by each group.
- Add convergence test, so that PageRank computation terminates
if the pagerank value of no node changes beyond a threshold across
two consecutive iterations (or max_iters number of iterations are
done, whichever happens first). In case of grouping, the algorithm
terminates only after all groups have converged.
- Create a summary table apart from the output table that records
the number of iterations required for convergence. Iterations
required for convergence of each group is recorded when grouping
is used. This implementation also ensures that we don't compute
PageRank for groups that have already converged.
@asfbot
Copy link

asfbot commented Apr 6, 2017

Can one of the admins verify this patch?

- This commits adds a section for PageRank in design docs.
- This also contains minor code changes. It also includes changes
for not doing any convergence test if threshold=0.
Copy link
Contributor

@orhankislal orhankislal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should wrap the code to 80 characters per line.

{vertices_per_group_inner_join}
{ignore_group_clause}
GROUP BY {grouping_cols_select} {edge_temp_table}.{dest}
""".format(grouping_cols_select=edge_grouping_cols_select+', '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to move these grouping_cols if checks out of the loop. You can initialize them to empty strings (similar to lines 122 - 127) and overwrite them when you have grouping (similar to lines 169 - 184). I am sure a C compiler can take care of it automatically but I am not sure about python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea, will change this.

def update_result_tables(temp_summary_table, i, summary_table, out_table,
res_table, grouping_cols_list, cur_unconv, message_unconv=None):
"""
This function updates the summary and output tables only for thouse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thouse -> those

I think this explanation is somewhat unclear as an introduction of the function. You might want to move the If check explanation to the code itself.

WHERE {vertex_id} NOT IN (
SELECT {grouping_cols_select} {cur}.{vertex_id}, {random_jump_prob} AS pagerank
FROM {cur} {vpg_from_clause}
WHERE {vpg_where_clause} {vertex_id} NOT IN (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The select clause gives you every vertex that exists in the message table regardless of its group. Assume message has 1,2,3 for group 'a' and 1,2,3,4 for group 'b'. This check will remove 4 for group 'a' as well. You can use NOT EXISTS clause to check for multiple columns simultaneously.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query does takes care of the scenario you mentioned. The {message_grp_where} clause inside NOT IN checks if cur.grouping_cols=message.grouping_cols. I also tried an example to confirm it does work as expected. So leaving this as is.

plpy.execute("SET client_min_messages TO %s" % old_msg_level)

def update_result_tables(temp_summary_table, i, summary_table, out_table,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the temp_summary_table an input to the function? It is only used here and in the drop table line which is unnecessary since you already drop it at line 494. If we want to avoid calling the unique_string function multiple times, we might as well create the table at the beginning and reuse the table using TRUNCATE clauses instead of dropping and recreating it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using truncate instead of drop and create.

if message_unconv is None:
plpy.execute("""
DROP TABLE IF EXISTS {temp_summary_table};
CREATE TABLE {temp_summary_table} AS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can overwrite the temp_summary_table variable instead of creating a new table (as long as we make sure not to drop it at the end).

Copy link
Contributor

@orhankislal orhankislal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@asfgit asfgit closed this in c694893 Apr 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants