<br><br><br>
<span style="color:red;font-size:60px">Graph Algorithms</span>
<br><br>


<li>breadth-first search</li>
<li>connected components</li>
<li>label propagation</li>
<li>shortest path</li>
<li>PageRank</li>
<li>triangle count</li>
<li>aggregateMessages</li>


<br><br><br>
<span style="color:green;font-size:xx-large">Bulk Synchronous Parallel Model</span>
<br><br>

<li>Both GraphX and GraphFrames use the "Bulk Synchronous Parallel" model of processing</li>
<li>A BSP computation runs as a sequence of supersteps, each consisting of three phases:</li>
<ul>
    <li>Do local computation concurrently for each vertex (or set of vertices)</li>
    <li>Communicate results from one process to another directly (communication and message passing)</li>
    <li>Synchronize activities using <span style="color:red">barrier synchronization</span> (identify barrier tasks that must complete before subsequent processing is possible)</li>
</ul>
<br><br>
<li>Comparison with MapReduce</li>
<ul>
    <li>in-memory state persistence between iterations</li>
    <li>synchronization is restricted to state updates (reduced communication)</li>
    <li>many iterations are possible (good for graphs where iterations may be a factor of the number of vertices) but each iteration is less intensive (since it only deals with updates)</li>
    <li>message passing (between processors) rather than routing through a master node</li>
    <li>the computations within a superstep are independent of each other (barrier synchronization ensures this)</li>
</ul>
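
The three phases of a superstep can be sketched in plain Scala, without Spark. This is only an illustration of the model, not how GraphX or GraphFrames implement it: it finds connected components on this notebook's example graph (edges treated as undirected) by repeatedly passing the smallest label seen so far. The names `bspComponents`, `inbox`, and `neighbours` are made up for this sketch.

```scala
// Example graph from this notebook, weights dropped.
val edges = Seq((2, 1), (1, 2), (2, 4), (3, 2), (3, 6), (4, 1), (5, 2), (5, 3), (5, 6),
                (7, 8), (7, 9), (8, 10), (9, 10))
val vertices = (1 to 10).toSet

// Undirected adjacency: a vertex's neighbours are the other ends of its edges.
val neighbours: Map[Int, Set[Int]] = vertices.map { v =>
  v -> edges.collect {
    case (a, b) if a == v => b
    case (a, b) if b == v => a
  }.toSet
}.toMap

// Runs supersteps until no label changes; returns the labels and the superstep count.
def bspComponents(): (Map[Int, Int], Int) = {
  var labels: Map[Int, Int] = vertices.map(v => v -> v).toMap // every vertex starts with its own id
  var changed = true
  var supersteps = 0
  while (changed) {
    // Communication: every vertex sends its current label to each neighbour.
    val inbox: Map[Int, Seq[Int]] =
      vertices.toSeq
        .flatMap(v => neighbours(v).toSeq.map(n => (n, labels(v))))
        .groupBy(_._1)
        .map { case (to, msgs) => to -> msgs.map(_._2) }
    // Barrier: the inbox is complete before any vertex is updated.
    // Local computation: each vertex independently keeps the smallest label it knows of.
    val next = labels.map { case (v, own) => v -> (own +: inbox.getOrElse(v, Seq.empty)).min }
    changed = next != labels
    labels = next
    supersteps += 1
  }
  (labels, supersteps)
}

val (componentOf, steps) = bspComponents()
println(componentOf.toSeq.sortBy(_._1))
```

Because every update reads only the labels from the previous superstep, the per-vertex work inside a superstep can run in any order, or in parallel.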

In [None]:
%%init_spark
launcher.packages= ["graphframes:graphframes:0.8.2-spark3.2-s_2.12"]

In [None]:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._

<img src="social_graph2.png">

In [None]:
val vertexArray = Array(
  (1, "Alice", 28),
  (2, "Bob", 27),
  (3, "Charlie", 65),
  (4, "David", 42),
  (5, "Ed", 55),
  (6, "Fran", 50),
    (7, "Qing",27),
    (8, "Sarika",78),
    (9, "Olafson",17),
    (10, "Birgit",33)
)

val edgeArray = Array(
  (2, 1, 7),
  (1, 2, 13),
  (2, 4, 2),
  (3, 2, 4),
  (3, 6, 3),
  (4, 1, 1),
  (5, 2, 2),
  (5, 3, 8),
  (5, 6, 3),
    (7, 8, 14),
    (7, 9, 2),
    (8, 10, 8),
    (9, 10, 6)
)

val vertex_df = spark.createDataFrame(vertexArray).toDF("id","name","age")
val edge_df = spark.createDataFrame(edgeArray).toDF("src","dst","attr")

val g = GraphFrame(vertex_df, edge_df)

<br><br><br>
<span style="color:green;font-size:xx-large">Breadth-first search</span>
<li>Finds the shortest path between two nodes</li>
<li>The start and end nodes are selected with expressions over node attributes (for example, name or age)</li>
<li>An expression can match several nodes, so we can also find shortest paths from multiple nodes to multiple other nodes</li>
<li>bfs measures path length by the number of edges and does not use edge weights</li>
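
To make "path length by number of edges" concrete, here is a small breadth-first search in plain Scala over this notebook's edge list (ids only, weights ignored). It is a sketch for intuition, not the GraphFrames implementation; `bfsPath` and `out` are hypothetical names.

```scala
// Directed edges of the example graph (src, dst); weights are deliberately left out.
val edges = Seq((2, 1), (1, 2), (2, 4), (3, 2), (3, 6), (4, 1), (5, 2), (5, 3), (5, 6),
                (7, 8), (7, 9), (8, 10), (9, 10))

// Outgoing adjacency list.
val out: Map[Int, Seq[Int]] =
  edges.groupBy(_._1).map { case (s, es) => s -> es.map(_._2) }

// Returns one path with the fewest edges from `from` to `to`, or None if `to` is unreachable.
def bfsPath(from: Int, to: Int): Option[List[Int]] = {
  // Each queue entry is a path stored in reverse (most recent vertex first).
  @annotation.tailrec
  def loop(queue: Vector[List[Int]], seen: Set[Int]): Option[List[Int]] =
    if (queue.isEmpty) None
    else {
      val path = queue.head
      val v = path.head
      if (v == to) Some(path.reverse)
      else {
        val next = out.getOrElse(v, Seq.empty).filterNot(seen).distinct
        loop(queue.tail ++ next.map(_ :: path), seen ++ next)
      }
    }
  loop(Vector(List(from)), Set(from))
}

println(bfsPath(5, 1)) // Ed (5) to Alice (1)
println(bfsPath(1, 7)) // Alice (1) to Qing (7)
```

Ed reaches Alice in two edges (via Bob), while Alice cannot reach Qing at all because the two groups of vertices are not connected.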

In [None]:
g.bfs.fromExpr("name='Ed'").toExpr("name='Alice'").run().show

In [None]:
g.bfs.fromExpr("age>45").toExpr("age<45").run().show

In [None]:
g.bfs.fromExpr("name='Ed' or name='Bob'").toExpr("name='Alice'").run().show

<br><br><br>
<span style="color:green;font-size:xx-large">Connected components</span>

<li>GraphFrames connected components function requires a <span style="color:red">checkpoint</span> directory</li>
<li>The algorithm returns a component number for each vertex</li>
<li>The number of distinct component numbers is the number of components of the graph</li>
<li><span style="color:red">connectedComponents</span> returns the list of nodes along with the component number of each node</li>
<li><span style="color:red">stronglyConnectedComponents</span> returns the list of nodes along with the component number of each node's <a href="https://en.wikipedia.org/wiki/Strongly_connected_component">strongly connected component</a></li>
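
The difference between the two notions shows up clearly on the example graph. The following plain-Scala sketch (brute force, fine for ten vertices, not what GraphFrames does internally) groups vertices that can reach each other along edge direction; `reachable` and `strong` are names invented here.

```scala
val edges = Seq((2, 1), (1, 2), (2, 4), (3, 2), (3, 6), (4, 1), (5, 2), (5, 3), (5, 6),
                (7, 8), (7, 9), (8, 10), (9, 10))
val vertices = (1 to 10).toSet

val out: Map[Int, Set[Int]] =
  edges.groupBy(_._1).map { case (s, es) => s -> es.map(_._2).toSet }

// All vertices reachable from `start` by following edge direction (including `start`).
def reachable(start: Int): Set[Int] = {
  @annotation.tailrec
  def loop(frontier: Set[Int], seen: Set[Int]): Set[Int] =
    if (frontier.isEmpty) seen
    else {
      val next = frontier.flatMap(v => out.getOrElse(v, Set.empty[Int])) -- seen
      loop(next, seen ++ next)
    }
  loop(Set(start), Set(start))
}

val reach: Map[Int, Set[Int]] = vertices.map(v => v -> reachable(v)).toMap

// Two vertices share a strongly connected component when each can reach the other.
val strong: Set[Set[Int]] =
  vertices.map(v => reach(v).filter(u => reach(u).contains(v)))

strong.toSeq.sortBy(_.min).foreach(c => println(c.toSeq.sorted.mkString("{", ", ", "}")))
```

Ignoring direction, the graph has two connected components. Respecting direction, only Alice, Bob, and David (1, 2, 4) form a cycle, so every other vertex ends up in a strongly connected component of its own.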

In [None]:
sc.setCheckpointDir("checkpoint")
val cc = g.connectedComponents.run()
cc.show

In [None]:
val result = g.stronglyConnectedComponents.maxIter(10).run()
result.select("id", "component").orderBy("component").show()

<br><br><br>
<span style="color:green;font-size:xx-large">Label propagation</span>
<li>The label propagation algorithm is a clustering algorithm</li>
<li>Finds "similar" nodes in the graph</li>
<li>the algorithm is iterative and each iteration is cheap, but convergence is not guaranteed, so we cap the number of iterations</li>
<li>Roughly:</li>
<ul>
    <li>assign initial labels to vertices (depending on the variant, this could be to all nodes or just a few; GraphFrames starts every vertex with its own unique label)</li>
    <li>update labels based on the frequency of labels in adjacent nodes</li>
    <li>repeat updates</li>
    <li>stop after n iterations</li>
    <li>label propagation is done on the canonical undirected graph</li>
</ul>
<li>Label propagation is used to group nodes in very large graphs - mostly because an exhaustive grouping is computationally infeasible</li>    
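
The rough steps above can be sketched in plain Scala. This is a simplified synchronous variant written for illustration: every vertex starts with its own id as label, and ties between equally frequent labels go to the smallest label. That tie-breaking rule is an assumption of this sketch, so the labels need not match what GraphFrames returns. `labelPropagation` and `neighbours` are hypothetical names.

```scala
val edges = Seq((2, 1), (1, 2), (2, 4), (3, 2), (3, 6), (4, 1), (5, 2), (5, 3), (5, 6),
                (7, 8), (7, 9), (8, 10), (9, 10))
val vertices = (1 to 10).toSet

// Undirected neighbourhoods (each neighbour counted once).
val neighbours: Map[Int, Set[Int]] = vertices.map { v =>
  v -> edges.collect {
    case (a, b) if a == v => b
    case (a, b) if b == v => a
  }.toSet
}.toMap

def labelPropagation(maxIter: Int): Map[Int, Int] = {
  val initial: Map[Int, Int] = vertices.map(v => v -> v).toMap // own id as starting label
  (1 to maxIter).foldLeft(initial) { (labels, _) =>
    labels.map { case (v, own) =>
      // Count how often each label occurs among the neighbours (using last round's labels).
      val counts = neighbours(v).toSeq.map(labels).groupBy(identity).map { case (l, ls) => l -> ls.size }
      // Adopt the most frequent label; ties go to the smallest label; isolated vertices keep their own.
      val best =
        if (counts.isEmpty) own
        else counts.toSeq.sortBy { case (l, n) => (-n, l) }.head._1
      v -> best
    }
  }
}

println(labelPropagation(3).toSeq.sortBy(_._1))
```

Labels only travel along edges, so a vertex can never pick up a label that originated in a different connected component.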

In [None]:

val result = g.labelPropagation.maxIter(3).run()
result.select("id", "label").show()


<br><br><br>
<span style="color:green;font-size:xx-large">Shortest path</span>
<li>Compute the shortest path length (the number of edges; edge weights are not used) from each vertex to a set of "landmark" vertices</li>
<li>Unfortunately, the shortest path algorithm needs vertex ids to be strings, so we'll convert them to strings!</li>
<li>In the example below, we compute the shortest path from every vertex to vertices 3, 6, and 10</li>
<li>For example, if a company has several factories and several distribution warehouses, it might want to find the shortest path from each factory to each warehouse</li>
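
As a cross-check on what to expect for landmarks 3, 6, and 10, the hop counts can be worked out in plain Scala. The sketch assumes that paths follow edge direction and that every edge counts as one hop; it walks backwards from each landmark. `hopsTo` and `incoming` are names made up here.

```scala
val edges = Seq((2, 1), (1, 2), (2, 4), (3, 2), (3, 6), (4, 1), (5, 2), (5, 3), (5, 6),
                (7, 8), (7, 9), (8, 10), (9, 10))

// Reverse adjacency: for each vertex, the vertices with an edge pointing at it.
val incoming: Map[Int, Seq[Int]] =
  edges.groupBy(_._2).map { case (d, es) => d -> es.map(_._1) }

// Hop distance to `landmark` for every vertex that can reach it along edge direction.
def hopsTo(landmark: Int): Map[Int, Int] = {
  @annotation.tailrec
  def loop(frontier: Set[Int], dist: Map[Int, Int], d: Int): Map[Int, Int] =
    if (frontier.isEmpty) dist
    else {
      // Predecessors of the frontier that have not been assigned a distance yet.
      val next = frontier.flatMap(v => incoming.getOrElse(v, Seq.empty)).filterNot(dist.contains)
      loop(next, dist ++ next.map(_ -> (d + 1)), d + 1)
    }
  loop(Set(landmark), Map(landmark -> 0), 0)
}

val landmarks = Seq(3, 6, 10)
val distances: Map[Int, Map[Int, Int]] = landmarks.map(l => l -> hopsTo(l)).toMap
landmarks.foreach(l => println(s"to $l: ${distances(l).toSeq.sortBy(_._1)}"))
```

Vertices that cannot reach a landmark (for example Alice and landmark 3) simply have no entry for it.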

In [None]:
val vertexArray = Array(
  (1, "Alice", 28),
  (2, "Bob", 27),
  (3, "Charlie", 65),
  (4, "David", 42),
  (5, "Ed", 55),
  (6, "Fran", 50),
    (7, "Qing",27),
    (8, "Sarika",78),
    (9, "Olafson",17),
    (10, "Birgit",33)
).map(l=>(l._1.toString,l._2,l._3))

val edgeArray = Array(
  (2, 1, 7),
  (1, 2, 13),
  (2, 4, 2),
  (3, 2, 4),
  (3, 6, 3),
  (4, 1, 1),
  (5, 2, 2),
  (5, 3, 8),
  (5, 6, 3),
    (7, 8, 14),
    (7, 9, 2),
    (8, 10, 8),
    (9, 10, 6)
).map(l=>(l._1.toString,l._2.toString,l._3))

val vertex_df = spark.createDataFrame(vertexArray).toDF("id","name","age")
val edge_df = spark.createDataFrame(edgeArray).toDF("src","dst","attr")

val g = GraphFrame(vertex_df, edge_df)

In [None]:
g.shortestPaths.landmarks(Seq("3","6","10")).run().show

<br><br><br>
<span style="color:green;font-size:xx-large">PageRank</span>
<br><br>
<li>An implementation of Google's page ranking algorithm</li>
<li>Web pages = nodes; links = edges</li>
<li>See <a href="https://en.wikipedia.org/wiki/PageRank">wikipedia</a> for details but the rough idea is:</li>
<ul>
    <li>the rank of a page is higher if it has more incoming links</li>
    <li>the rank of a page is higher if the pages that link to it have higher ranks</li>
</ul>
<li>pageRank takes three parameters (set either tol or maxIter, not both)</li>
<ul>
    <li><span style="color:red">resetProbability</span>: random-walk reset probability (the probability that the surfer jumps to a random page in the network rather than following a link)</li>
    <li><span style="color:red">tol</span>: the algorithm stops once it has converged to within the tol level</li>
    <li><span style="color:red">maxIter</span>:  stop after the specified number of iterations</li>
</ul>
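
The rough idea and the roles of resetProbability and maxIter can be seen in a textbook-style iteration written in plain Scala. This is a simplified sketch, not the GraphX/GraphFrames implementation: it does no normalization and gives no special treatment to vertices without outgoing links, so the numbers will differ from the notebook's output. `pageRank` and `outDegree` are names chosen for this sketch.

```scala
val edges = Seq((2, 1), (1, 2), (2, 4), (3, 2), (3, 6), (4, 1), (5, 2), (5, 3), (5, 6),
                (7, 8), (7, 9), (8, 10), (9, 10))
val vertices = (1 to 10).toSeq

val outDegree: Map[Int, Int] = edges.groupBy(_._1).map { case (s, es) => s -> es.size }

def pageRank(resetProbability: Double, maxIter: Int): Map[Int, Double] = {
  val initial: Map[Int, Double] = vertices.map(v => v -> 1.0).toMap
  (1 to maxIter).foldLeft(initial) { (rank, _) =>
    // Each vertex splits its current rank evenly across its outgoing edges.
    val received: Map[Int, Double] = edges
      .map { case (s, d) => (d, rank(s) / outDegree(s)) }
      .groupBy(_._1)
      .map { case (d, cs) => d -> cs.map(_._2).sum }
    // New rank: the reset share plus the damped share received over incoming edges.
    vertices.map { v =>
      v -> (resetProbability + (1 - resetProbability) * received.getOrElse(v, 0.0))
    }.toMap
  }
}

val ranks = pageRank(resetProbability = 0.15, maxIter = 10)
ranks.toSeq.sortBy(-_._2).foreach { case (v, r) => println(f"$v%2d  $r%.4f") }
```

Ed (5) and Qing (7) have no incoming edges, so in this sketch their rank is just the reset share, while Alice and Bob collect rank from the vertices linking to them.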

In [None]:
val results = g.pageRank.resetProbability(0.15).maxIter(10).run()
results.vertices.show

<span style="color:blue;font-size:large">Parallel personalized PageRank</span>
<li>Specify a list of source vertices; a personalized PageRank is computed for each of them in parallel</li>

In [None]:
val results = g.parallelPersonalizedPageRank.resetProbability(0.01).maxIter(100).sourceIds(Array("1","2")).run()
results.vertices.show(false)

<br><br><br>
<span style="color:green;font-size:xx-large">Triangle count</span>
<li>the number of triangles that each vertex belongs to</li>
<li>for example, Bob belongs to two triangles: (Bob, David, Alice) and (Bob, Charlie, Ed)</li>
<li>triangles assume an undirected graph</li>
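
The counts are easy to verify by hand or with a few lines of plain Scala. The sketch below treats the example graph as undirected and counts, for each vertex, the pairs of its neighbours that are themselves adjacent. It is meant as a check on the definition, not as the GraphFrames algorithm; `triangles` and `neighbours` are hypothetical names.

```scala
val edges = Seq((2, 1), (1, 2), (2, 4), (3, 2), (3, 6), (4, 1), (5, 2), (5, 3), (5, 6),
                (7, 8), (7, 9), (8, 10), (9, 10))
val vertices = (1 to 10).toSet

// Undirected neighbourhoods: direction and duplicate edges are ignored.
val neighbours: Map[Int, Set[Int]] = vertices.map { v =>
  v -> edges.collect {
    case (a, b) if a == v => b
    case (a, b) if b == v => a
  }.toSet
}.toMap

// A triangle through v is a pair of v's neighbours that are adjacent to each other.
def triangles(v: Int): Int =
  neighbours(v).toSeq.combinations(2).count { case Seq(a, b) => neighbours(a).contains(b) }

vertices.toSeq.sorted.foreach(v => println(s"$v: ${triangles(v)}"))
```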

In [None]:
val results = g.triangleCount.run()
results.show

<br><br>
<span style="color:green;font-size:xx-large">Clustering coefficient using triangle count</span>
<li>clustering coefficient = number of triangles a vertex belongs to divided by the number of possible triangles; a vertex with k adjacent vertices has k(k-1)/2 possible triangles (one per pair of neighbours)</li>
<li>for example, Alice belongs to 1 triangle (Alice, Bob, David) and, since she has only two adjacent vertices, the number of possible triangles is also 1. Alice's clustering coefficient is 1/1 = 1.0</li>
<li>Bob belongs to two triangles. The possible triangles are: 
    <ul>
        <li>(Bob, David, Alice)</li>
        <li>(Bob, David, Charlie)</li>
        <li>(Bob, Alice, Charlie)</li>
        <li>(Bob, David, Ed)</li>
        <li>(Bob, Alice, Ed)</li>
        <li>(Bob, Charlie, Ed)</li>
    </ul>
</li>
<li>thus, Bob's clustering coefficient is 2/6 ≈ 0.33</li>


In [None]:
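//Copied from previous notebook
//Collapses the directed edges between two vertices into one canonical edge (src >= dst), so a-b and b-a count once
def make_undirected_graph(g: GraphFrame) = {
    val u_edge_df = g.find("(a)-[]->(b)")
        .select(col("a.id").as("src"),col("b.id").as("dst"))
        .withColumn("swap",when(col("src")<col("dst"),col("dst"))) //non-null only when src and dst need swapping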
        .withColumn("dst",
                    when(col("swap").isNotNull,col("src"))
                    .otherwise(col("dst")))
        .withColumn("src",
                    when(col("swap").isNotNull,col("swap"))
                   .otherwise(col("src")))
        .drop(col("swap"))
        .distinct
    val u_vertices_df = g.vertices
    val u_g = GraphFrame(u_vertices_df,u_edge_df)    
    u_g
}

In [None]:
val triangles = g.triangleCount.run().withColumnRenamed("id","t_id") //Get the number of triangles each vertex belongs to
val degrees = make_undirected_graph(g).degrees //Get the number of adjacent vertices for each vertex
val possible = degrees.withColumn("possible",col("degree")*(col("degree")-1)/lit(2)) //Calculate possible triangles
val joined = triangles.select("t_id","count").join(possible,triangles("t_id")===possible("id")) 
val coeff = joined.withColumn("coeff",col("count")/col("possible"))
coeff.select("id","coeff").show

<br><br><br>
<span style="color:green;font-size:xx-large">aggregateMessages</span>

<li>Neighborhood aggregation function through messaging</li>
<li>Messages</li>
<ul>
    <li>source data (e.g., AggregateMessages.src("age"))</li>
    <li>destination data (e.g., AggregateMessages.dst("age"))</li>
    <li>edge data (e.g., AggregateMessages.edge("attr"))</li> 
    <li>A message is sent along each edge, either from the src to the dst (sendToDst) or from the dst to the src (sendToSrc)</li>
    <li>An aggregation function (agg) then combines the messages each vertex receives</li>
</ul>

In [None]:
import org.apache.spark.sql.functions
import org.graphframes.lib.AggregateMessages

<span style="color:blue;font-size:large">Calculate total incoming likes for every node</span>
<li>For example, Bob has 3 incoming edges with attr values 13, 4, and 2, so his total is 19</li>


In [None]:
g.aggregateMessages
    .sendToDst(AggregateMessages.edge("attr")) //Send the edge attr value to the destination
    .agg(sum(AggregateMessages.msg).as("alllikes")).show //Aggregate messages by summing them up

<span style="color:blue;font-size:large">Try this: Calculate total outgoing likes for each person</span>


In [None]:
g.aggregateMessages
    .sendToSrc(AggregateMessages.edge("attr")) //Send the edge attr value to the source
    .agg(sum(AggregateMessages.msg).as("alllikes")) //Aggregate messages by summing them up
    .show

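<span style="color:blue;font-size:large">For each person, who likes them the most and how much?</span>
<li>This is going to be embarrassingly complicated! The cells below find the largest incoming value for each person, then the person with the largest value overall</li>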
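
For comparison, here is what the question asks for when written against the raw edge list in plain Scala, without aggregateMessages. It is a sketch for checking results on this small graph, not a replacement for the distributed version; `topFan` and `names` are names invented here, and when two people tie, the one whose edge appears first in the list is reported.

```scala
// (src, dst, attr): src likes dst `attr` times.
val edges = Seq((2, 1, 7), (1, 2, 13), (2, 4, 2), (3, 2, 4), (3, 6, 3), (4, 1, 1),
                (5, 2, 2), (5, 3, 8), (5, 6, 3), (7, 8, 14), (7, 9, 2), (8, 10, 8), (9, 10, 6))
val names = Map(1 -> "Alice", 2 -> "Bob", 3 -> "Charlie", 4 -> "David", 5 -> "Ed",
                6 -> "Fran", 7 -> "Qing", 8 -> "Sarika", 9 -> "Olafson", 10 -> "Birgit")

// For each destination, keep the incoming edge with the largest attr: dst -> (src, attr).
val topFan: Map[Int, (Int, Int)] =
  edges.groupBy(_._2).map { case (dst, in) =>
    val (src, _, attr) = in.maxBy(_._3)
    (dst, (src, attr))
  }

topFan.toSeq.sortBy(_._1).foreach { case (dst, (src, attr)) =>
  println(s"${names(dst)} is liked most by ${names(src)} ($attr)")
}
```

People with no incoming edges (Ed and Qing) do not appear in the result.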

In [None]:
val max_df = g.aggregateMessages
    .sendToDst(AggregateMessages.edge("attr"))
    .agg(max(AggregateMessages.msg).as("maxval")) //Name the column directly instead of renaming the generated "max(MSG)"

In [None]:
max_df.printSchema

In [None]:
max_df.createOrReplaceTempView("max_db")

In [None]:
spark.sql("select max(maxval) from max_db").show

In [None]:
spark.sql("select id, maxval from max_db where maxval = (select max(maxval) from max_db)").show

