# Homework 2, Advanced Part: Iterative Spark Computations
## Due October 15, 2018 by 10pm
### Worth 20 points in total

Building upon your experiences with graph data, we will now use Spark to compute PageRank.  Following our discussion of graphs, we will use an edge list, which is a variation of the adjacency list.

# Step 5: PageRank

Many of you may already know PageRank computation by its reputation:  it is used to measure the importance of a Web page.  (Contrary to popular belief, PageRank is named after Larry Page, not Web pages…)  PageRank is actually a tweaked version of a network centrality measure called *eigenvector centrality*.  One way to implement PageRank is as an iterative computation.  We take each graph node $x$ and in iteration 0 assign it a corresponding PageRank $p_x$:

$p_x^{(0)} = 1 / N$

where $N$ is the total number of nodes.

Now in each iteration $i$ we recompute:

$p_x^{(i)} = \alpha \cdot \sum_{j \in B(x)} \frac{1}{N_j} \, p_j^{(i-1)} + \beta$

![Graph](graph.png)

where $B(x)$ is the set of nodes linking to node $x$, and $N_j$ is the outdegree of each such node $j$.  Typically, repeating the PageRank computation for a number of iterations (15 or so) results in convergence within an acceptable tolerance.  For this assignment we’ll assume $\beta = 0.15$ and $\alpha = 0.85$ (anecdotally these are the most common values used in practice).

*Example*. In the figure above, nodes $j_1$ and $j_2$ form the back-link set $B(x)$ for node $x$.  $N_{j_1}$ is 3 and $N_{j_2}$ is 2.  Thus in each iteration $i$, we recompute the PageRank score for $x$ by adding a third of the PageRank score of $j_1$ and half of the PageRank score of $j_2$ (both from the previous iteration $i-1$), then multiplying the sum by $\alpha$ and adding $\beta$.
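
To make the arithmetic concrete, here is a minimal sketch of one update for node $x$ in plain Python. The out-degrees, $\alpha$, and $\beta$ come from the text; the previous-iteration scores for $j_1$ and $j_2$ are made-up values chosen only for illustration.

```python
# One PageRank update for node x, following the example in the text.
ALPHA = 0.85
BETA = 0.15

# Out-degrees N_j of the nodes that link to x (from the example).
out_degree = {"j1": 3, "j2": 2}

# Hypothetical scores from iteration i-1 (illustrative values only).
prev_rank = {"j1": 0.3, "j2": 0.2}

# Each back-link node j contributes (1 / N_j) * p_j.
contributions = sum(prev_rank[j] / out_degree[j] for j in out_degree)

new_rank_x = ALPHA * contributions + BETA
print(round(new_rank_x, 4))  # 0.32
```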

*Hint*.  Build some “helper” DataFrames.  We suggest at least 2 DataFrames, where the first is used to build the second, and the second is used in your solution:
1. a DataFrame with each `from_node` and the proportion of weight it transfers to each outgoing edge.  For instance, if the `from_node` is node $j$ then the proportion of weight should be $1/N_j$.
2. a DataFrame, again with the `from_node`, each node it transfers weight to, and the proportion of weight computed in (1).  For instance, if the `from_node` is $j$ and the `to_node` is $x$, then the tuple should be $(j, x, 1/N_j)$.
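
As a rough illustration of what the two helper tables would contain, the sketch below builds them as plain Python structures rather than Spark DataFrames. The edge list is hypothetical, and the names `weight_per_edge` and `weighted_edges` are placeholders, not part of the assignment.

```python
from collections import Counter

# Hypothetical edge list of (from_node, to_node) pairs, for illustration only.
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]

# Helper 1: each from_node and the share of weight it sends along each edge.
out_degree = Counter(src for src, _ in edges)
weight_per_edge = {src: 1 / n for src, n in out_degree.items()}

# Helper 2: one (from_node, to_node, weight) tuple per edge.
weighted_edges = [(src, dst, weight_per_edge[src]) for src, dst in edges]

for row in weighted_edges:
    print(row)
```

In Spark, the first helper would roughly correspond to a `groupBy('from_node')` with a count, and the second to a join of the edge list with the first helper on `from_node`.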

*Submission*. See the external document for submission information.  Remember to first do the basic part of **Homework 2**.

## 5.1  Initialization and Marshalling

### 5.1.1 Spark setup and data load

Initialize PySpark as in the basic Homework 2.  Load `pr_graph.txt` as a text file with a single column.

In [95]:
import numpy as np
import pandas as pd
import networkx as nx

# TODO: Connect to Spark session as in Homework 2.
# Then load the file

# Worth 5 points

# YOUR CODE HERE
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('Graphs-HW2-adv').getOrCreate()

pr_sdf = spark.read.text('pr_graph.txt')

In [96]:
if pr_sdf.count() != 19:
    raise ValueError('Unexpected graph size')
    

### 5.1.2 Wrangling the Graph Data

There are 3 columns in the file:
* `from_node`
* `to_node`
* `reserved`

You can ignore `reserved`.  Use Spark's *split()* function to update `pr_sdf` to have two columns, `from_node` and `to_node`.  Make each an integer.

The split function in Spark works similarly to the one in Python.  It can be called directly from Spark SQL (`select split(x, ' ') …`) or by `import`ing `pyspark.sql.functions` and referring to the function in Python.

You may need to cast your columns since they start off as strings.  In Python, you can call `my_sdf.column.cast('type')` to convert data types.  In SQL it’s `SELECT CAST(my_sdf.column AS type)`.
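
For comparison, here is the plain-Python analogue of the split-then-cast step. The sample line is hypothetical; it only assumes the three space-separated fields described above.

```python
# A hypothetical line in the same shape as the file: from_node, to_node, reserved.
line = "1 2 0"

fields = line.split(" ")     # every field starts out as a string
from_node = int(fields[0])   # cast the first field to an integer
to_node = int(fields[1])     # cast the second field; the third is ignored

print(from_node, to_node)
```

The Spark version follows the same pattern: split the single text column, pick items 0 and 1, and cast each to an integer type.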

In [97]:
# TODO: Convert pr_sdf into (from_node, to_node) with integer fields
# Worth 5 points

# YOUR CODE HERE
split_col = F.split(pr_sdf['value'], ' ')
pr_sdf = pr_sdf.withColumn('from_node', split_col.getItem(0).cast('int'))
pr_sdf = pr_sdf.withColumn('to_node', split_col.getItem(1).cast('int'))
pr_sdf = pr_sdf[['from_node','to_node']]

In [98]:
results = pr_sdf.take(20)

if 'from_node' not in pr_sdf.columns:
    raise KeyError('Unexpected column names')
if 'to_node' not in pr_sdf.columns:
    raise KeyError('Unexpected column names')


## 5.2 Basic PageRank

Write the function `pagerank(G, num_iter)`, which takes a DataFrame `G` holding your graph's edges and runs for `num_iter` iterations.  It should return a DataFrame with columns (`node_id`, `pagerank`).

Initialize your PageRank values for each node in the “base case”.  Then, in each iteration, use the helper DataFrames to compute PageRank scores for each node in the next iteration.
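
The base case and the per-iteration update can be sketched without Spark as follows. This is a plain-Python illustration of the recurrence, not the required DataFrame solution; the function name `pagerank_local` and the three-node cycle are assumptions made for the example.

```python
from collections import Counter

def pagerank_local(edges, num_iter, alpha=0.85, beta=0.15):
    """Plain-Python PageRank over a list of (from_node, to_node) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = Counter(src for src, _ in edges)

    # Base case: every node starts at 1 / N.
    rank = {n: 1 / len(nodes) for n in nodes}

    for _ in range(num_iter):
        # Sum (1 / N_j) * p_j over the back-links of each node.
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            incoming[dst] += rank[src] / out_degree[src]
        rank = {n: alpha * incoming[n] + beta for n in nodes}
    return rank

# Hypothetical 3-node cycle: 1 -> 2 -> 3 -> 1.
ranks = pagerank_local([(1, 2), (2, 3), (3, 1)], num_iter=2)
print(ranks)
```

Because the cycle is symmetric, all three nodes receive the same score after every iteration, which makes it a convenient sanity check.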

You will likely find it easier to express some of the computations in Spark SQL.  If you prefer to use a DataFrame's `select` method, you may find it useful to wrap a Python function with `F.udf` so that it can be called on each row of the DataFrame.  You can create a function that returns a double as follows:

```
my_fn = F.udf(lambda x: f(x), DoubleType())
```

Then you can call it like:
```
my_sdf.select(my_fn(my_arg).alias('col_name'))
```

In [101]:
# TODO: write the function
# Worth 10 points
def pagerank(G, num_iter):
    # YOUR CODE HERE
    alpha = 0.85
    beta = 0.15
    G.createOrReplaceTempView('edges')

    # Count the distinct nodes that appear in either column.
    # The rest of the function assumes node IDs are exactly 1..num_nodes.
    all_nodes = G.select('from_node').union(G.select('to_node'))
    num_nodes = all_nodes.distinct().count()

    schema = StructType([StructField('node_id', StringType(), False),
                         StructField('pagerank', FloatType(), False)])

    def to_sdf(values):
        # Build the (node_id, pagerank) Spark DataFrame from an array of scores.
        pdf = pd.DataFrame({'node_id': [str(n) for n in range(1, num_nodes + 1)],
                            'pagerank': values})
        return spark.createDataFrame(pdf, schema=schema)

    # Base case: every node starts with PageRank 1/N.
    temp_pr_sdf = to_sdf(np.repeat(1 / num_nodes, num_nodes))

    # Weight 1/N_j that each from_node j passes along every outgoing edge,
    # keyed by node ID so that lookups do not depend on row order.
    weight_rows = spark.sql(
        'SELECT from_node, 1/count(*) AS weight FROM edges GROUP BY from_node').collect()
    weights = {row['from_node']: row['weight'] for row in weight_rows}

    # Back-link set B(x) for every node x, collected once instead of per iteration.
    back_links = {}
    for row in G.collect():
        back_links.setdefault(row['to_node'], []).append(row['from_node'])

    # run for num_iter iterations
    for i in range(num_iter):
        # Fetch the previous iteration's scores once, keyed by node ID.
        prev = {int(r['node_id']): r['pagerank'] for r in temp_pr_sdf.collect()}

        temp_pr = np.zeros(num_nodes)
        for node in range(1, num_nodes + 1):
            # Sum (1/N_j * PR_j) over all nodes j that link to this node.
            total = sum(weights[j] * prev[j] for j in back_links.get(node, []))
            temp_pr[node - 1] = alpha * total + beta

        # update PR sdf
        temp_pr_sdf = to_sdf(temp_pr)

    return temp_pr_sdf


In [100]:
pagerank(pr_sdf, 5).orderBy("pagerank").show()


+-------+----------+
|node_id|  pagerank|
+-------+----------+
|      4|  0.360847|
|      1|0.44950113|
|      6|0.48955232|
|      3| 0.6585104|
|      2| 0.6985616|
|      5|0.81856054|
|      7| 0.8622351|
+-------+----------+



## 5.3 Removal of Self-Loops

The existing graph has a few self-loops.  Let's see what happens if you remove them.  For this one, take `pr_sdf` and remove all self-edges, creating `pr_no_loops_sdf`.  Run `pagerank(pr_no_loops_sdf, 5)`, sort in decreasing order by pagerank, and put the results in a list `pageranks`.
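
One reason the scores change is that dropping a self-loop lowers the out-degree of the affected node, so each of its remaining edges carries a larger share of its weight. The small sketch below shows this on a hypothetical edge list in plain Python; it is not the Spark solution.

```python
from collections import Counter

# Hypothetical edges; (2, 2) is a self-loop.
edges = [(1, 2), (2, 2), (2, 3), (3, 1)]
no_loops = [(src, dst) for src, dst in edges if src != dst]

# Node 2 has out-degree 2 with the self-loop and 1 without it, so the share
# it passes to node 3 rises from 1/2 to 1.
before = Counter(src for src, _ in edges)
after = Counter(src for src, _ in no_loops)
print(before[2], after[2])  # 2 1
```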

In [102]:
# TODO: create pr_no_loops_sdf and feed it into pagerank.  
# The final result should be an ordered list of Rows (nodes and pageranks) called pageranks.

# YOUR CODE HERE
# remove self-referencing edges
pr_sdf.createOrReplaceTempView('pr')
pr_no_loops_sdf = spark.sql('SELECT * FROM pr WHERE from_node != to_node')

# run pagerank on resulting sdf
pageranks_sdf = pagerank(pr_no_loops_sdf, 5).orderBy("pagerank", ascending=False)

# convert output sdf to a list
pageranks = pageranks_sdf.collect()

pageranks

[Row(node_id='2', pagerank=0.8792241811752319),
 Row(node_id='5', pagerank=0.7750296592712402),
 Row(node_id='7', pagerank=0.7306856513023376),
 Row(node_id='6', pagerank=0.5711990594863892),
 Row(node_id='1', pagerank=0.5176590085029602),
 Row(node_id='3', pagerank=0.46857768297195435),
 Row(node_id='4', pagerank=0.39539289474487305)]

In [103]:
if len(pageranks) != 7:
    raise ValueError('Should have 7 nodes!')
    