<br><br><br>
<span style="color:red;font-size:60px">GraphX</span>
<br><br>

<li>GraphX provides an RDD-level implementation of graphs</li>
<li>Many of the GraphFrames graph algorithms are implemented by delegating to GraphX</li>
<li>Building custom algorithms is easier with the two GraphX building blocks below, while GraphFrames, since they work at the DataFrame level, provide a higher-level interface</li>
<ul>
    <li><span style="color:red">aggregateMessages</span>: A message passing primitive: a message is sent along each edge and the messages arriving at each vertex are merged into a single value</li>
    <li><span style="color:red">pregel</span>: An implementation of Google's Pregel model, a building block for iterative, parallel graph algorithms</li>
</ul>
<li>A GraphFrames graph can be converted to a GraphX graph; you can then implement an algorithm in GraphX and convert the result back into a DataFrame or a GraphFrames graph</li>

In [1]:
%%init_spark
launcher.packages= ["graphframes:graphframes:0.8.2-spark3.2-s_2.12"]

In [2]:
//GraphFrame imports
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._


//GraphX imports
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD



Intitializing Scala interpreter ...

Spark Web UI available at http://dyn-160-39-133-139.dyn.columbia.edu:4040
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1666042981864)
SparkSession available as 'spark'


import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD


<br><br>
<span style="color:blue;font-size:large">Creating a GraphFrame graph</span>

In [7]:
val vertexArray = Array(
  (1, "Alice", 28),
  (2, "Bob", 27),
  (3, "Charlie", 65),
  (4, "David", 42),
  (5, "Ed", 55),
  (6, "Fran", 50),
    (7, "Qing",27),
    (8, "Sarika",78),
    (9, "Olafson",17),
    (10, "Birgit",33)
)

val edgeArray = Array(
  (2, 1, 7),
  (1, 2, 13),
  (2, 4, 2),
  (3, 2, 4),
  (3, 6, 3),
  (4, 1, 1),
  (5, 2, 2),
  (5, 3, 8),
  (5, 6, 3),
    (7, 8, 14),
    (7, 9, 2),
    (8, 10, 8),
    (9, 10, 6)
)

val vertex_df = spark.createDataFrame(vertexArray).toDF("id","name","age")
val edge_df = spark.createDataFrame(edgeArray).toDF("src","dst","attr")

val g = GraphFrame(vertex_df, edge_df)

vertexArray: Array[(Int, String, Int)] = Array((1,Alice,28), (2,Bob,27), (3,Charlie,65), (4,David,42), (5,Ed,55), (6,Fran,50), (7,Qing,27), (8,Sarika,78), (9,Olafson,17), (10,Birgit,33))
edgeArray: Array[(Int, Int, Int)] = Array((2,1,7), (1,2,13), (2,4,2), (3,2,4), (3,6,3), (4,1,1), (5,2,2), (5,3,8), (5,6,3), (7,8,14), (7,9,2), (8,10,8), (9,10,6))
vertex_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
edge_df: org.apache.spark.sql.DataFrame = [src: int, dst: int ... 1 more field]
g: org.graphframes.GraphFrame = GraphFrame(v:[id: int, name: string ... 1 more field], e:[src: int, dst: int ... 1 more field])


In [8]:
g.filterEdges("attr>2").filterVertices("age < 50").edges.show


+---+---+----+
|src|dst|attr|
+---+---+----+
|  2|  1|   7|
|  1|  2|  13|
|  9| 10|   6|
+---+---+----+



<br><br><br>
<span style="color:blue;font-size:large">Creating a GraphX graph</span>
<br><br>
<li>Vertex ids in GraphX must be of type Long (convert ids to Long)</li>
<li>Vertex attributes must be a single object (convert attributes to a tuple)</li>
<li>Edge objects must be of type org.apache.spark.graphx.Edge (convert the edgeArray tuples into Edge objects while also converting vertex ids to Long)</li>

In [10]:
val vertexArray = Array(
  (1, "Alice", 28),
  (2, "Bob", 27),
  (3, "Charlie", 65),
  (4, "David", 42),
  (5, "Ed", 55),
  (6, "Fran", 50),
    (7, "Qing",27),
    (8, "Sarika",78),
    (9, "Olafson",17),
    (10, "Birgit",33)
)

val edgeArray = Array(
  (2, 1, 7),
  (1, 2, 13),
  (2, 4, 2),
  (3, 2, 4),
  (3, 6, 3),
  (4, 1, 1),
  (5, 2, 2),
  (5, 3, 8),
  (5, 6, 3),
    (7, 8, 14),
    (7, 9, 2),
    (8, 10, 8),
    (9, 10, 6)
)
val vertexArrayX = vertexArray.map(r => (r._1.toLong,(r._2,r._3)))
val edgeArrayX = edgeArray.map(r => Edge(r._1.toLong,r._2.toLong,r._3))

val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArrayX)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArrayX)

val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

vertexArray: Array[(Int, String, Int)] = Array((1,Alice,28), (2,Bob,27), (3,Charlie,65), (4,David,42), (5,Ed,55), (6,Fran,50), (7,Qing,27), (8,Sarika,78), (9,Olafson,17), (10,Birgit,33))
edgeArray: Array[(Int, Int, Int)] = Array((2,1,7), (1,2,13), (2,4,2), (3,2,4), (3,6,3), (4,1,1), (5,2,2), (5,3,8), (5,6,3), (7,8,14), (7,9,2), (8,10,8), (9,10,6))
vertexArrayX: Array[(Long, (String, Int))] = Array((1,(Alice,28)), (2,(Bob,27)), (3,(Charlie,65)), (4,(David,42)), (5,(Ed,55)), (6,(Fran,50)), (7,(Qing,27)), (8,(Sarika,78)), (9,(Olafson,17)), (10,(Birgit,33)))
edgeArrayX: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(2,1,7), Edge(1,2,13), Edge(2,4,2), Edge(3,2,4), Edge(3,6,3), Edge(4,1,1), Edge(5,2,2), Edge(5,3,8), Edge(5,6,3), Edge(7,8,14), Edge(7,9,2), Edge(8,10,8), Edge(9,10,6))
ve...


In [11]:
graph

res4: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@61c0ca35


<br><br><br>
<span style="color:blue;font-size:large">Convert from GraphFrame to GraphX</span>
<br><br>
<li>Method 1: Call the function toGraphX</li>

In [14]:
val gx: Graph[Row, Row] = g.toGraphX

gx: org.apache.spark.graphx.Graph[org.apache.spark.sql.Row,org.apache.spark.sql.Row] = org.apache.spark.graphx.impl.GraphImpl@1c11965b


In [15]:
gx.vertices.collect

res6: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.sql.Row)] = Array((10,[10,Birgit,33]), (1,[1,Alice,28]), (2,[2,Bob,27]), (3,[3,Charlie,65]), (4,[4,David,42]), (5,[5,Ed,55]), (6,[6,Fran,50]), (7,[7,Qing,27]), (8,[8,Sarika,78]), (9,[9,Olafson,17]))


<br><br><br>
<span style="color:blue;font-size:large">Convert from GraphFrame to GraphX</span>
<br><br>
<li>Method 2: By unpacking row objects and then creating a GraphX object</li>

In [16]:
g.vertices.rdd.map(r => (r(0).toString.toLong,(r(1).toString,r(2).toString.toInt)))

res7: org.apache.spark.rdd.RDD[(Long, (String, Int))] = MapPartitionsRDD[79] at map at <console>:39


In [17]:
g.vertices.rdd.first()(0).toString.toLong

res8: Long = 1


In [18]:
(g.vertices.rdd.first()(1).toString,g.vertices.rdd.first()(2).toString.toInt)

res9: (String, Int) = (Alice,28)


In [19]:
val v = g.vertices.rdd.map(r => (r(0).toString.toLong,(r(1).toString,r(2).toString.toInt)))
val e = g.edges.rdd.map(r => Edge(r(0).toString.toLong,r(1).toString.toLong,r(2).toString.toInt))
val gx: Graph[(String, Int), Int] = Graph(v, e)

v: org.apache.spark.rdd.RDD[(Long, (String, Int))] = MapPartitionsRDD[80] at map at <console>:37
e: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Int]] = MapPartitionsRDD[86] at map at <console>:38
gx: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@539facf8


<br><br><br>
<span style="color:blue;font-size:large">Convert from GraphX to GraphFrame</span>
<br><br>
<li>Method 1: Call the function GraphFrame.fromGraphX</li>

In [20]:
import org.apache.spark.graphx.Graph
import org.apache.spark.sql.Row
val g2: GraphFrame = GraphFrame.fromGraphX(graph)

import org.apache.spark.graphx.Graph
import org.apache.spark.sql.Row
g2: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, attr: struct<_1: string, _2: int>], e:[src: bigint, dst: bigint ... 1 more field])


In [21]:
g2.vertices.printSchema

root
 |-- id: long (nullable = false)
 |-- attr: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: integer (nullable = false)



<br><br><br>
<span style="color:blue;font-size:large">Convert from GraphX to GraphFrame</span>
<br><br>
<li>Method 2: With schema, by unpacking the vertex and edge tuples into DataFrames with named columns</li>

In [22]:
gx.vertices.collect

res11: Array[(org.apache.spark.graphx.VertexId, (String, Int))] = Array((10,(Birgit,33)), (1,(Alice,28)), (2,(Bob,27)), (3,(Charlie,65)), (4,(David,42)), (5,(Ed,55)), (6,(Fran,50)), (7,(Qing,27)), (8,(Sarika,78)), (9,(Olafson,17)))


In [23]:
gx.vertices.map(a => (a._1.toInt,a._2._1,a._2._2))

res12: org.apache.spark.rdd.RDD[(Int, String, Int)] = MapPartitionsRDD[99] at map at <console>:41


In [24]:
val v1 = gx.vertices.map(a => (a._1.toInt,a._2._1,a._2._2)).toDF("id","name","age")
val e1 = gx.edges.map(e => (e.srcId.toInt,e.dstId.toInt,e.attr)).toDF("src","dst","attr")
val g = GraphFrame(v1, e1)


v1: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
e1: org.apache.spark.sql.DataFrame = [src: int, dst: int ... 1 more field]
g: org.graphframes.GraphFrame = GraphFrame(v:[id: int, name: string ... 1 more field], e:[src: int, dst: int ... 1 more field])


<br><br><br>
<span style="color:green;font-size:xx-large">Algorithm building blocks</span>
<br><br>

<br><br><br>
<span style="color:green;font-size:xx-large">aggregateMessages</span>
<br><br>


<span style="color:blue;font-size:large">Calculate the total incoming “likes” on each vertex</span>

In [25]:
val total_incoming_likes = gx.aggregateMessages[Int](ec => ec.sendToDst(ec.attr),(x,y) => x+y)
total_incoming_likes.collect

total_incoming_likes: org.apache.spark.graphx.VertexRDD[Int] = VertexRDDImpl[109] at RDD at VertexRDD.scala:57
res13: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((10,14), (1,8), (2,19), (3,8), (4,2), (6,6), (8,14), (9,2))
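
What aggregateMessages computes can be pictured without Spark: every edge emits (vertex id, message) pairs, and messages addressed to the same vertex are merged pairwise. The following is a minimal plain-Scala sketch of that idea, not the GraphX implementation. `AggregateSketch`, `aggregate`, `incoming`, and `outgoing` are illustrative names, and the edge list is copied from edgeArray.

```scala
// Plain-Scala picture of "send a message per edge, merge per vertex".
// Illustrative only: GraphX does this on distributed RDDs.
object AggregateSketch {
  // (src, dst, attr) triples copied from edgeArray
  val edges: Seq[(Long, Long, Int)] = Seq(
    (2L, 1L, 7), (1L, 2L, 13), (2L, 4L, 2), (3L, 2L, 4), (3L, 6L, 3),
    (4L, 1L, 1), (5L, 2L, 2), (5L, 3L, 8), (5L, 6L, 3),
    (7L, 8L, 14), (7L, 9L, 2), (8L, 10L, 8), (9L, 10L, 6))

  // sendMsg: what an edge emits, as (target vertex id, message) pairs
  // mergeMsg: how two messages for the same vertex are combined
  def aggregate[M](sendMsg: ((Long, Long, Int)) => Seq[(Long, M)],
                   mergeMsg: (M, M) => M): Map[Long, M] =
    edges.flatMap(sendMsg)
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }

  // Counterpart of ec.sendToDst(ec.attr) with (x, y) => x + y
  val incoming: Map[Long, Int] = aggregate[Int](e => Seq(e._2 -> e._3), _ + _)

  // Counterpart of ec.sendToSrc(ec.attr) with (x, y) => x + y
  val outgoing: Map[Long, Int] = aggregate[Int](e => Seq(e._1 -> e._3), _ + _)
}
```

As in the GraphX result, vertices that receive no message (here vertices 5 and 7 for incoming likes) do not appear in the output at all.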


<span style="color:blue;font-size:large">try this: total outgoing likes for each person</span> 

In [26]:
val total_outgoing_likes = gx.aggregateMessages[Int](ec => ec.sendToSrc(ec.attr),(x,y) => x+y)
total_outgoing_likes.collect

total_outgoing_likes: org.apache.spark.graphx.VertexRDD[Int] = VertexRDDImpl[113] at RDD at VertexRDD.scala:57
res14: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((1,13), (2,9), (3,7), (4,1), (5,13), (7,16), (8,8), (9,6))


<span style="color:blue;font-size:large">for each person, who likes them the most and how much?</span>

In [27]:
gx.aggregateMessages[(String,Int)](ec => ec.sendToDst((ec.srcAttr._1,ec.attr)),(x,y) => if (x._2 > y._2) x else y).foreach(println)

(8,(Qing,14))
(9,(Qing,2))
(1,(Bob,7))
(2,(Alice,13))
(4,(Bob,2))
(3,(Ed,8))
(10,(Sarika,8))
(6,(Ed,3))



<span style="color:blue;font-size:large">try this: return the age of the oldest person who likes each user</span>

In [32]:
gx.aggregateMessages[Int](ec => ec.sendToDst((ec.srcAttr._2)),(x,y)=>Math.max(x,y)).collect

res20: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((10,78), (1,42), (2,65), (3,55), (4,27), (6,65), (8,27), (9,27))


<br><br><br>
<span style="color:green;font-size:xx-large">Pregel</span>
<br><br>

<li>Pregel works by sending messages along the edges of the graph in parallel</li>
<li>The messages are then used to compute the "state" of each vertex</li>
<li>Roughly:</li>
<ul>
    <li>Pregel runs in multiple iterations known as supersteps</li>
    <li>At each superstep, vertices send messages to adjacent vertices</li>
    <li>At each superstep, vertices update their state by processing the messages received in the previous superstep</li>
    <li>The algorithm terminates when it converges (no more messages are sent) or after a fixed number of supersteps</li>
</ul>
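
The superstep loop can be sketched in plain Scala, without Spark, to make the send / merge / update cycle concrete. This is a simplified illustration under stated assumptions, not how GraphX implements pregel: `PregelSketch` and `shortestDistances` are hypothetical names, the default cap of 20 supersteps is arbitrary, and the edge list is copied from edgeArray with the third field treated as an edge weight.

```scala
// Plain-Scala sketch of Pregel-style supersteps for shortest distances.
// Illustrative only: everything runs sequentially on local collections.
object PregelSketch {
  // (src, dst, weight) triples copied from edgeArray
  val edges: Seq[(Long, Long, Int)] = Seq(
    (2L, 1L, 7), (1L, 2L, 13), (2L, 4L, 2), (3L, 2L, 4), (3L, 6L, 3),
    (4L, 1L, 1), (5L, 2L, 2), (5L, 3L, 8), (5L, 6L, 3),
    (7L, 8L, 14), (7L, 9L, 2), (8L, 10L, 8), (9L, 10L, 6))

  def shortestDistances(source: Long, maxSupersteps: Int = 20): Map[Long, Double] = {
    val ids = edges.flatMap { case (s, d, _) => Seq(s, d) }.distinct
    // Initial state: 0 at the source, infinity everywhere else
    var state: Map[Long, Double] =
      ids.map(id => id -> (if (id == source) 0.0 else Double.PositiveInfinity)).toMap
    var step = 0
    var converged = false
    while (!converged && step < maxSupersteps) {
      // Send: an edge offers its destination a candidate distance
      // only when that candidate improves on the destination's state
      val sent: Seq[(Long, Double)] = edges.collect {
        case (s, d, w) if state(s) + w < state(d) => d -> (state(s) + w)
      }
      // Merge: keep the smallest candidate per destination
      val inbox: Map[Long, Double] =
        sent.groupBy(_._1).map { case (id, msgs) => id -> msgs.map(_._2).min }
      // Update: vertices that received a message recompute their state
      state = state ++ inbox.map { case (id, m) => id -> math.min(state(id), m) }
      // Converged when a superstep produces no messages
      converged = inbox.isEmpty
      step += 1
    }
    state
  }
}
```

Calling `PregelSketch.shortestDistances(3L)` yields 4.0 for vertex 2, 6.0 for vertex 4, and 7.0 for vertex 1, while vertices that cannot be reached from vertex 3 stay at infinity.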
    

<span style="color:blue;font-size:large">Calculate the shortest path from a given vertex to every other vertex</span>

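<li>pregel is run on a copy of the graph</li>
<li>Since pregel processes messages at vertices, the vertex attribute of the copy holds the statistic being calculated (here, the shortest distance found so far)</li>
<li>Edge attributes are copied from the original graph</li>
<li>pregel takes three function arguments:
    <ul>
        <li>a vertex program: updates a vertex's attribute from the merged incoming message</li>
        <li>a send message program: runs on an edge triplet and decides which messages, if any, to send</li>
        <li>a merge message program: combines two messages addressed to the same vertex into one</li>
    </ul>
</li>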
        

In [34]:
val sourceId: VertexId = 3
val initialGraph = gx.mapVertices((id, _) =>
    if (id == sourceId) 0.0 else Double.PositiveInfinity)

sourceId: org.apache.spark.graphx.VertexId = 3
initialGraph: org.apache.spark.graphx.Graph[Double,Int] = org.apache.spark.graphx.impl.GraphImpl@45e5edef


In [35]:
initialGraph.edges.collect

res21: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(2,1,7), Edge(1,2,13), Edge(2,4,2), Edge(3,2,4), Edge(3,6,3), Edge(4,1,1), Edge(5,2,2), Edge(5,3,8), Edge(5,6,3), Edge(7,8,14), Edge(7,9,2), Edge(8,10,8), Edge(9,10,6))


In [36]:
initialGraph.vertices.collect

res22: Array[(org.apache.spark.graphx.VertexId, Double)] = Array((10,Infinity), (1,Infinity), (2,Infinity), (3,0.0), (4,Infinity), (5,Infinity), (6,Infinity), (7,Infinity), (8,Infinity), (9,Infinity))


<span style="color:blue;font-size:large">Vertex program</span>
<li>At each vertex, when a new distance arrives, replace the current shortest distance by the lesser of the current distance and the new distance</li>

In [37]:
val vertex_program = (id: VertexId, dist: Double, newDist: Double) => math.min(dist, newDist)

vertex_program: (org.apache.spark.graphx.VertexId, Double, Double) => Double = $Lambda$5618/0x0000000801e3a440@78ef7ba9


<span style="color:blue;font-size:large">The message</span>


<li>For each triplet, add the edge attribute (the edge's weight) to the current shortest distance at the source</li>
<li>If this sum is less than the current shortest distance at the destination, send a message to the destination with this sum</li>
<li>Otherwise, don't send a message</li>



<span style="color:blue;font-size:large">Triplets</span>
<li>A triplet is a combination of (source vertex data, destination vertex data, edge data)</li>


In [39]:
gx.triplets.collect()

res24: Array[org.apache.spark.graphx.EdgeTriplet[(String, Int),Int]] = Array(((2,(Bob,27)),(1,(Alice,28)),7), ((1,(Alice,28)),(2,(Bob,27)),13), ((2,(Bob,27)),(4,(David,42)),2), ((3,(Charlie,65)),(2,(Bob,27)),4), ((3,(Charlie,65)),(6,(Fran,50)),3), ((4,(David,42)),(1,(Alice,28)),1), ((5,(Ed,55)),(2,(Bob,27)),2), ((5,(Ed,55)),(3,(Charlie,65)),8), ((5,(Ed,55)),(6,(Fran,50)),3), ((7,(Qing,27)),(8,(Sarika,78)),14), ((7,(Qing,27)),(9,(Olafson,17)),2), ((8,(Sarika,78)),(10,(Birgit,33)),8), ((9,(Olafson,17)),(10,(Birgit,33)),6))


In [40]:
initialGraph.triplets.collect()

res25: Array[org.apache.spark.graphx.EdgeTriplet[Double,Int]] = Array(((2,Infinity),(1,Infinity),7), ((1,Infinity),(2,Infinity),13), ((2,Infinity),(4,Infinity),2), ((3,0.0),(2,Infinity),4), ((3,0.0),(6,Infinity),3), ((4,Infinity),(1,Infinity),1), ((5,Infinity),(2,Infinity),2), ((5,Infinity),(3,0.0),8), ((5,Infinity),(6,Infinity),3), ((7,Infinity),(8,Infinity),14), ((7,Infinity),(9,Infinity),2), ((8,Infinity),(10,Infinity),8), ((9,Infinity),(10,Infinity),6))


In [41]:
val sendMsg = (triplet: EdgeTriplet[Double,Int]) => { 
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  }

sendMsg: org.apache.spark.graphx.EdgeTriplet[Double,Int] => Iterator[(org.apache.spark.graphx.VertexId, Double)] = $Lambda$5656/0x0000000801e5e040@2f907156


<span style="color:blue;font-size:large">merge messages</span>
<li>When multiple messages arrive at a vertex, keep the one with the smallest distance</li>

In [42]:
val mrgMsg = (a: Double, b: Double) => math.min(a, b)

mrgMsg: (Double, Double) => Double = $Lambda$5657/0x0000000801e5f840@3da78d85


<span style="color:blue;font-size:large">Run pregel</span>
<li>pregel(configs)(funcs)</li>
<li>configs = (initial message, maximum number of iterations, edge direction in which to send messages); the last two are optional</li>
<li>funcs = (vertex_program, sendMsg, mrgMsg)</li>

In [43]:
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  vertex_program,
    sendMsg,
    mrgMsg)



sssp: org.apache.spark.graphx.Graph[Double,Int] = org.apache.spark.graphx.impl.GraphImpl@2a3af5a


In [44]:
println(sssp.vertices.collect.mkString("\n"))

(10,Infinity)
(1,7.0)
(2,4.0)
(3,0.0)
(4,6.0)
(5,Infinity)
(6,3.0)
(7,Infinity)
(8,Infinity)
(9,Infinity)
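
The pregel call above passes only the initial message and leaves the other two config arguments at their defaults. Below is a minimal sketch with them spelled out. It assumes the `initialGraph`, `vertex_program`, `sendMsg`, and `mrgMsg` values from the preceding cells are still in scope; `ssspBounded` is a hypothetical name and the cap of 10 supersteps is an arbitrary choice for this small graph.

```scala
import org.apache.spark.graphx.{EdgeDirection, Graph}

// Same computation as sssp, with the optional config arguments written out.
// maxIterations = 10 is an arbitrary cap chosen for illustration.
// EdgeDirection.Out runs sendMsg only on the out-edges of vertices that
// received a message in the previous superstep.
val ssspBounded: Graph[Double, Int] = initialGraph.pregel(
  Double.PositiveInfinity,
  maxIterations = 10,
  activeDirection = EdgeDirection.Out)(
  vertex_program,
  sendMsg,
  mrgMsg)

// Sort by vertex id so the printed order is stable
println(ssspBounded.vertices.collect.sortBy(_._1).mkString("\n"))
```

Restricting the active direction to out-edges fits this problem because distances only propagate from an edge's source to its destination.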


<span style="color:blue;font-size:large">Putting it all together</span>

In [45]:
val sourceId: VertexId = 3
val initialGraph = graph.mapVertices((id, _) =>
    if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  vertex_program,
    sendMsg,
    mrgMsg)
println(sssp.vertices.collect.mkString("\n"))

(10,Infinity)
(1,7.0)
(2,4.0)
(3,0.0)
(4,6.0)
(5,Infinity)
(6,3.0)
(7,Infinity)
(8,Infinity)
(9,Infinity)


sourceId: org.apache.spark.graphx.VertexId = 3
initialGraph: org.apache.spark.graphx.Graph[Double,Int] = org.apache.spark.graphx.impl.GraphImpl@490cb996
sssp: org.apache.spark.graphx.Graph[Double,Int] = org.apache.spark.graphx.impl.GraphImpl@59d776d1




<span style="color:blue;font-size:large">Walk through</span>

<img src="initialGraph.png"><br>
<img src="pass1.png"><br>
<img src="pass2.png"><br>
<img src="pass3.png">

<br><br><br>
<span style="color:red;font-size:50px">Partitioning</span>
<br><br>

<li>Partitioning graph data is complicated because vertices share edges</li>
<li>Graph partitioning can only be done using GraphX. GraphFrames does not provide partitioning support</li>
<li>Partitioning strategies:</li>
<ul>
    <li><span style="color:red">Vertex cut</span>: The graph is partitioned by edges. A vertex can end up in multiple partitions if one of its edges is in one partition and another in a different partition</li>
    <li><span style="color:red">Edge cut</span>: The graph is partitioned by vertices. Edges are split into two if their vertices end up in different partitions and a "ghost" of the missing vertex is added to the partition</li>
</ul>

<li>GraphX uses <span style="color:blue">vertex cut</span> partitioning strategies. The graph is partitioned by edges and vertices can span multiple partitions</li>
<ul>
    <li><span style="color:blue">EdgePartition1D</span>: Edges are partitioned by hashing the srcId of the edge. All edges out of a vertex end up in the same partition</li>
    <li><span style="color:blue">EdgePartition2D</span>: Edges are assigned using a 2D partitioning of the sparse edge adjacency matrix, which bounds the number of partitions a vertex can appear in by 2 * sqrt(numParts)</li>
    <li><span style="color:blue">RandomVertexCut</span>: Edges are distributed across partitions by hashing the (srcId, dstId) pair of each edge. Any vertex can exist in multiple partitions but, because of the hashing, multiple same-direction edges between two vertices end up in the same partition</li>
    <li><span style="color:blue">CanonicalRandomVertexCut</span>: Same as RandomVertexCut, except that the (srcId, dstId) pair is put into a canonical order <span style="color:red">before</span> hashing. All edges between two vertices, regardless of direction, end up in the same partition</li>
</ul>
<li>The choice of partitioning strategy depends on what you want to do with the data</li>

<li>A graph contains multiple edges in both directions between pairs of vertices. If you want to calculate the total outflow from a vertex, which strategy would you use?</li>
<li>A graph contains multiple edges in both directions between pairs of vertices. If you want to calculate the total traffic that goes from vertex i to vertex j, which strategy would you use?</li>
<li>A graph contains multiple edges in both directions between pairs of vertices. If you want to calculate the total flow (sum of bi-directional flows) between two vertices, which strategy would you use?</li>



In [None]:
graph.partitionBy(PartitionStrategy.RandomVertexCut)
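
The co-location properties of the hash-based strategies can be illustrated with simplified stand-ins written in plain Scala. These are sketches under stated assumptions, not the GraphX code: the real strategies mix the hash differently, so actual partition numbers will not match. `PartitionSketch`, `bySource`, `byPair`, and `byCanonicalPair` are hypothetical names.

```scala
// Simplified stand-ins for three GraphX partition strategies.
// Only which ids feed the hash is modeled; partition numbers are illustrative.
object PartitionSketch {
  // EdgePartition1D idea: the partition depends on the source id only
  def bySource(src: Long, dst: Long, numParts: Int): Int =
    Math.floorMod(src.hashCode, numParts)

  // RandomVertexCut idea: hash the ordered pair (src, dst)
  def byPair(src: Long, dst: Long, numParts: Int): Int =
    Math.floorMod((src, dst).hashCode, numParts)

  // CanonicalRandomVertexCut idea: put the smaller id first, then hash
  def byCanonicalPair(src: Long, dst: Long, numParts: Int): Int =
    if (src < dst) byPair(src, dst, numParts) else byPair(dst, src, numParts)
}
```

With these definitions, `bySource(5L, 2L, 4)` and `bySource(5L, 6L, 4)` always agree, and `byCanonicalPair(1L, 2L, 4)` always equals `byCanonicalPair(2L, 1L, 4)`, whereas `byPair(1L, 2L, 4)` and `byPair(2L, 1L, 4)` are free to differ.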

<span style="color:blue;font-size:large">GraphFrames partitioning</span>
<li>Convert the graph into a GraphX graph, partition, convert back into GraphFrames</li>
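
A minimal sketch of that round trip, assuming the typed GraphX graph `graph` (a `Graph[(String, Int), Int]`) built earlier is still in scope. `partitionedGx` and `partitionedGf` are hypothetical names, and EdgePartition2D is picked only as an example. Whether the resulting DataFrames keep the same physical layout through later DataFrame operations is not guaranteed, so treat the partitioning as applying to the GraphX stage.

```scala
import org.apache.spark.graphx.{Graph, PartitionStrategy}
import org.graphframes.GraphFrame

// Repartition the edges of the GraphX graph with the chosen strategy
val partitionedGx: Graph[(String, Int), Int] =
  graph.partitionBy(PartitionStrategy.EdgePartition2D)

// Run any GraphX computation on partitionedGx here, then convert back
val partitionedGf: GraphFrame = GraphFrame.fromGraphX(partitionedGx)

// The vertex attribute comes back as a struct column named attr
partitionedGf.vertices.printSchema
```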