# Graph:

## Summary List of Operators

```scala
class Graph[VD, ED] {
  // Information about the Graph ===================================================================
  val numEdges: Long
  val numVertices: Long
  val inDegrees: VertexRDD[Int]
  val outDegrees: VertexRDD[Int]
  val degrees: VertexRDD[Int]
  
  // Views of the graph as collections =============================================================
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
  val triplets: RDD[EdgeTriplet[VD, ED]]
  
  // Functions for caching graphs ==================================================================
  def persist(newLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
  def cache(): Graph[VD, ED]
  def unpersistVertices(blocking: Boolean = true): Graph[VD, ED]
  
  // Change the partitioning heuristic  ============================================================
  def partitionBy(partitionStrategy: PartitionStrategy): Graph[VD, ED]
  
  // Transform vertex and edge attributes ==========================================================
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapEdges[ED2](map: (PartitionID, Iterator[Edge[ED]]) => Iterator[ED2]): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: (PartitionID, Iterator[EdgeTriplet[VD, ED]]) => Iterator[ED2])
    : Graph[VD, ED2]
    
  // Modify the graph structure ====================================================================
  def reverse: Graph[VD, ED]
  def subgraph(
      epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
      vpred: (VertexId, VD) => Boolean = ((v, d) => true))
    : Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]
  
  // Join RDDs with the graph ======================================================================
  def joinVertices[U](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD): Graph[VD, ED]
  def outerJoinVertices[U, VD2](other: RDD[(VertexId, U)])
      (mapFunc: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
    
  // Aggregate information about adjacent triplets =================================================
  def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexId]]
  def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexId, VD)]]
  def aggregateMessages[Msg: ClassTag](
      sendMsg: EdgeContext[VD, ED, Msg] => Unit,
      mergeMsg: (Msg, Msg) => Msg,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[Msg]
    
  // Iterative graph-parallel computation ==========================================================
  def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED]
    
  // Basic graph algorithms ========================================================================
  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
  def connectedComponents(): Graph[VertexId, ED]
  def triangleCount(): Graph[Int, ED]
  def stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]
}
```

## VertexRDD:
`VertexRDD[VD]` extends `RDD[(VertexId, VD)]`, so each element pairs two things:

* `VertexId` is the ID of the vertex, and `VD` is the type of the vertex attribute.
* Class `VertexRDD` defines methods such as `mapVertexPartitions`, `mapValues`, and `filter`.

## EdgeRDD:
`EdgeRDD[ED]` extends `RDD[Edge[ED]]`. Each `Edge[ED]` element carries three fields:
* `srcId`: the ID of the source vertex.
* `dstId`: the ID of the destination (target) vertex.
* `attr` (of type `ED`): the attribute associated with the edge.

The constructor arguments `sc` and `deps` are the `SparkContext` and the RDD dependencies handed to the `RDD` base class; they are not vertex IDs.

## EdgeTriplet:
* `EdgeTriplet[VD, ED]` extends `Edge[ED]`, so it inherits `srcId`, `dstId`, and `attr`.
* It adds `srcAttr` (the source vertex attribute) and `dstAttr` (the destination vertex attribute), for five basic members in total.

Logically, an EdgeTriplet is an edge joined with its two endpoint vertices, so it carries both vertex and edge information. This is especially useful when a computation needs the attributes of an edge together with those of the vertices it connects.
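
The join can be modeled with plain Scala collections. The following is a minimal sketch, not GraphX code: `SimpleEdge`, `SimpleTriplet`, and `toTriplets` are hypothetical names introduced only for illustration, and the sample data is made up.

```scala
// Plain-Scala model of how a triplet combines an edge with its endpoint attributes.
// SimpleEdge, SimpleTriplet and toTriplets are illustrative names, not GraphX classes.
case class SimpleEdge[ED](srcId: Long, dstId: Long, attr: ED)
case class SimpleTriplet[VD, ED](srcId: Long, srcAttr: VD, dstId: Long, dstAttr: VD, attr: ED)

def toTriplets[VD, ED](
    vertices: Map[Long, VD],
    edges: Seq[SimpleEdge[ED]]): Seq[SimpleTriplet[VD, ED]] =
  edges.map { e =>
    // Look up both endpoints; this assumes every edge endpoint has a vertex entry.
    SimpleTriplet(e.srcId, vertices(e.srcId), e.dstId, vertices(e.dstId), e.attr)
  }

val sampleVertices = Map((1L, "Alice"), (2L, "Bob"))
val sampleEdges = Seq(SimpleEdge(1L, 2L, "follows"))
val sampleTriplets = toTriplets(sampleVertices, sampleEdges)

sampleTriplets.foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))
```

Each resulting element exposes the same five members described above, which is why a single pass over triplets is enough when both vertex and edge attributes are needed.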

# Hands On

## Step 1: Initialize the Spark Session

In [1]:
import $exclude.`org.slf4j:slf4j-log4j12`, $ivy.`org.slf4j:slf4j-nop:1.7.21` // for cleaner logs
import $profile.`hadoop-2.6`
import $ivy.`org.apache.spark::spark-sql:2.1.0` // adjust spark version - spark >= 2.0
import $ivy.`org.apache.spark::spark-graphx:2.1.0` // adjust spark version - spark >= 2.0  // // http://blog.csdn.net/liuxuejiang158blog/article/details/37874557
import $ivy.`org.apache.hadoop:hadoop-aws:2.6.4`
import $ivy.`org.jupyter-scala::spark:0.4.2` // for JupyterSparkSession (SparkSession aware of the jupyter-scala kernel)

[32mimport [39m[36m$exclude.$                        , $ivy.$                            // for cleaner logs
[39m
[32mimport [39m[36m$profile.$           
[39m
[32mimport [39m[36m$ivy.$                                   // adjust spark version - spark >= 2.0
[39m
[32mimport [39m[36m$ivy.$                                      // adjust spark version - spark >= 2.0  // // http://blog.csdn.net/liuxuejiang158blog/article/details/37874557
[39m
[32mimport [39m[36m$ivy.$                                   
[39m
[32mimport [39m[36m$ivy.$                                // for JupyterSparkSession (SparkSession aware of the jupyter-scala kernel)[39m

In [5]:
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import jupyter.spark.session._

[32mimport [39m[36morg.apache.spark._
[39m
[32mimport [39m[36morg.apache.spark.sql._
[39m
[32mimport [39m[36morg.apache.spark.graphx._
[39m
[32mimport [39m[36morg.apache.spark.rdd.RDD
[39m
[32mimport [39m[36mjupyter.spark.session._[39m

In [6]:
val sparkSession = JupyterSparkSession.builder() // important - call this rather than SparkSession.builder()
                                      .jupyter() // this method must be called straightaway after builder()
                                      // .yarn("/etc/hadoop/conf") // optional, for Spark on YARN - argument is the Hadoop conf directory
                                      // .emr("2.6.4") // on AWS ElasticMapReduce, this adds aws-related to the spark jar list
                                      .master("local") // change to "yarn-client" on YARN
                                      // .config("spark.executor.instances", "10")
                                      // .config("spark.executor.memory", "3g")
                                      // .config("spark.hadoop.fs.s3a.access.key", awsCredentials._1)
                                      // .config("spark.hadoop.fs.s3a.secret.key", awsCredentials._2)
                                      .appName("jupyter")
                                      .getOrCreate()

val sc = sparkSession.sparkContext

sparkSession: SparkSession = org.apache.spark.sql.SparkSession@326adedb
sc: SparkContext = org.apache.spark.SparkContext@42545e00

## Step 2: Prepare the Raw Data

In [7]:
// Define the vertices and edges; both are Arrays of tuples (edges use the Edge case class)
// The vertex attribute type is VD: (String, String)
val vertexArray = Array(
  (1L, ("RC", "Supervisor")),
  (2L, ("TH", "Data Analyst")),
  (3L, ("Roger", "Data Engineer")),
  (4L, ("Miles", "Data Analyst")),
  (5L, ("Amber", "Data Analyst")),
  (6L, ("Bgg", "Data Analyst")),
  (7L, ("Alex", "Data Engineer")),
  (8L, ("Vickie", "Data Engineer")),
  (9L, ("Cathay", "Company")),
  (10L, ("Python", "Programming")),
  (11L, ("Scala", "Programming")),
  (11L, ("Scala", "Programming")),
  (12L, ("Java", "Programming")),
  (12L, ("Java", "Programming"))
)

// The edge attribute type is ED: String
val edgeArray = Array(
  Edge(5L, 1L, "follower"),
  Edge(5L, 2L, "follower"),
  Edge(5L, 3L, "junior"),
  Edge(5L, 4L, "junior"),
  Edge(5L, 6L, "colleague"),
  Edge(5L, 7L, "colleague"),
  Edge(5L, 8L, "colleague"),
  Edge(1L, 9L, "worked on"),
  Edge(2L, 9L, "worked on"),
  Edge(3L, 9L, "worked on"),
  Edge(4L, 9L, "worked on"),
  Edge(5L, 9L, "worked on"),
  Edge(6L, 9L, "worked on"),
  Edge(7L, 9L, "worked on"),
  Edge(8L, 9L, "worked on"),
  Edge(5L, 10L, "learning"),
  Edge(5L, 11L, "learning")
)

// Build vertexRDD and edgeRDD
val vertexRDD: RDD[(Long, (String, String))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[String]] = sc.parallelize(edgeArray)

// Build the graph Graph[VD, ED]
val graph: Graph[(String, String), String] = Graph(vertexRDD, edgeRDD)

vertexArray: Array[(Long, (String, String))] = Array(
  (1L, ("RC", "Supervisor")),
  (2L, ("TH", "Data Analyst")),
  (3L, ("Roger", "Data Engineer")),
  (4L, ("Miles", "Data Analyst")),
  (5L, ("Amber", "Data Analyst")),
  (6L, ("Bgg", "Data Analyst")),
  (7L, ("Alex", "Data Engineer")),
  (8L, ("Vickie", "Data Engineer")),
  (9L, ("Cathay", "Company")),
  (10L, ("Python", "Programming")),
  (11L, ("Scala", "Programming")),
...
edgeArray: Array[Edge[String]] = Array(
  Edge(5L, 1L, "follower"),
  Ed

In [11]:
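println(s"Number of records in vertexArray: ${vertexArray.size}")
println(s"Number of records in edgeArray: ${edgeArray.size}")
println
println("GraphX deduplicates vertexRDD:")
println(s"Number of vertices in graph (numVertices): ${graph.numVertices}")
println(s"Number of vertices in graph (vertices.count): ${graph.vertices.count}")
println(s"Number of edges in graph: ${graph.numEdges}")

Number of records in vertexArray: 14
Number of records in edgeArray: 17

GraphX deduplicates vertexRDD:
Number of vertices in graph (numVertices): 12
Number of vertices in graph (vertices.count): 12
Number of edges in graph: 17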
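
The drop from 14 input records to 12 vertices can be reproduced without Spark by keeping one attribute per vertex ID. This is a minimal sketch on an abbreviated, hypothetical input (`rawVertices` and `dedupedVertices` are names made up here). Which duplicate GraphX keeps is not something this walkthrough establishes; the duplicates here are identical, so the choice does not matter.

```scala
// Five records, but only three distinct vertex IDs (11 and 12 each appear twice).
val rawVertices: Seq[(Long, (String, String))] = Seq(
  (10L, ("Python", "Programming")),
  (11L, ("Scala", "Programming")),
  (11L, ("Scala", "Programming")),
  (12L, ("Java", "Programming")),
  (12L, ("Java", "Programming"))
)

// toMap keeps a single entry per key, which models "one vertex per ID".
val dedupedVertices: Map[Long, (String, String)] = rawVertices.toMap

println(s"raw records: ${rawVertices.size}, distinct vertices: ${dedupedVertices.size}")
```

Edges are not affected by this step, which matches the edge count staying at 17 above.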


## Step 3: Decompose the Graph into Its Vertices and Edges via graph.vertices and graph.edges

### (1). Basic usage

In [13]:
graph.vertices.collect.foreach(println(_))

(4,(Miles,Data Analyst))
(11,(Scala,Programming))
(1,(RC,Supervisor))
(6,(Bgg,Data Analyst))
(3,(Roger,Data Engineer))
(7,(Alex,Data Engineer))
(9,(Cathay,Company))
(8,(Vickie,Data Engineer))
(12,(Java,Programming))
(10,(Python,Programming))
(5,(Amber,Data Analyst))
(2,(TH,Data Analyst))


In [14]:
graph.edges.collect.foreach(println(_))

Edge(1,9,worked on)
Edge(2,9,worked on)
Edge(3,9,worked on)
Edge(4,9,worked on)
Edge(5,1,follower)
Edge(5,2,follower)
Edge(5,3,junior)
Edge(5,4,junior)
Edge(5,6,colleague)
Edge(5,7,colleague)
Edge(5,8,colleague)
Edge(5,9,worked on)
Edge(5,10,learning)
Edge(5,11,learning)
Edge(6,9,worked on)
Edge(7,9,worked on)
Edge(8,9,worked on)


In [15]:
graph.triplets.collect.foreach(println(_))

((1,(RC,Supervisor)),(9,(Cathay,Company)),worked on)
((2,(TH,Data Analyst)),(9,(Cathay,Company)),worked on)
((3,(Roger,Data Engineer)),(9,(Cathay,Company)),worked on)
((4,(Miles,Data Analyst)),(9,(Cathay,Company)),worked on)
((5,(Amber,Data Analyst)),(1,(RC,Supervisor)),follower)
((5,(Amber,Data Analyst)),(2,(TH,Data Analyst)),follower)
((5,(Amber,Data Analyst)),(3,(Roger,Data Engineer)),junior)
((5,(Amber,Data Analyst)),(4,(Miles,Data Analyst)),junior)
((5,(Amber,Data Analyst)),(6,(Bgg,Data Analyst)),colleague)
((5,(Amber,Data Analyst)),(7,(Alex,Data Engineer)),colleague)
((5,(Amber,Data Analyst)),(8,(Vickie,Data Engineer)),colleague)
((5,(Amber,Data Analyst)),(9,(Cathay,Company)),worked on)
((5,(Amber,Data Analyst)),(10,(Python,Programming)),learning)
((5,(Amber,Data Analyst)),(11,(Scala,Programming)),learning)
((6,(Bgg,Data Analyst)),(9,(Cathay,Company)),worked on)
((7,(Alex,Data Engineer)),(9,(Cathay,Company)),worked on)
((8,(Vickie,Data Engineer)),(9,(Cathay,Company)),worked on)


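### (2). Pattern matching with `case` (Case Class)

* graph.vertices returns a VertexRDD[(String, String)], which extends RDD[(VertexId, (String, String))]; the tuple can be destructured with a Scala `case` pattern.
* graph.edges returns an EdgeRDD containing Edge[String] objects; because Edge is a case class, it can be destructured with a `case` pattern as well.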

In [17]:
println("Vertices in the graph whose title is 'Data Analyst':")

graph.vertices.filter { 
  case (id, (name, title)) => title == "Data Analyst"
}.collect.foreach {
  case (id, (name, title)) => println(s"$name is a $title.")
}

graph.vertices.filter { 
  case (id, (name, title)) => title == "Data Analyst"
}.count  // number of matches

Vertices in the graph whose title is 'Data Analyst':
Miles is a Data Analyst.
Bgg is a Data Analyst.
Amber is a Data Analyst.
TH is a Data Analyst.


res16_2: Long = 4L

In [18]:
println("[Method1] Edges in the graph whose attribute is 'worked on':")

graph.edges.filter{ 
  case Edge(src, dst, relation) => relation == "worked on"
}.collect.foreach{
  case Edge(src, dst, relation) => println(s"${src} to ${dst} att: ${relation}")
}

[Method1] Edges in the graph whose attribute is 'worked on':
1 to 9 att: worked on
2 to 9 att: worked on
3 to 9 att: worked on
4 to 9 att: worked on
5 to 9 att: worked on
6 to 9 att: worked on
7 to 9 att: worked on
8 to 9 att: worked on


In [20]:
println("[Method2] Edges in the graph whose attribute is 'worked on':")

// Edge is a case class: srcId and dstId hold the IDs of the source and destination vertices,
// and the attr member stores the edge attribute.
graph.edges.filter(e => e.attr == "worked on").collect.foreach{
    e => println(s"${e.srcId} to ${e.dstId} attr: ${e.attr}")
}

graph.edges.filter(e => e.attr == "worked on").count // number of matches

[Method2] Edges in the graph whose attribute is 'worked on':
1 to 9 attr: worked on
2 to 9 attr: worked on
3 to 9 attr: worked on
4 to 9 attr: worked on
5 to 9 attr: worked on
6 to 9 attr: worked on
7 to 9 attr: worked on
8 to 9 attr: worked on


res19_2: Long = 8L

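### (3). The EdgeTriplet class extends Edge and adds the srcAttr and dstAttr members, which hold the source and destination vertex attributes

An EdgeTriplet is the result of joining vertices with edges, so a triplet carries the source vertex's ID and attribute, the destination vertex's ID and attribute, and the edge's attribute.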

### (3.1)

In [24]:
// graph.triplets.collect.foreach(println(_))

In [26]:
// Tuple elements can be accessed with _1, _2, _3
val facts: RDD[String] =
  graph.triplets.map(
    triplet =>
        // each triplet has the shape ((srcId, srcAttr), (dstId, dstAttr), attr)
        if (triplet.attr == "worked on" || triplet.attr == "learning") { 
            triplet.srcAttr._1 + " is " + triplet.attr + " " + triplet.dstAttr._1
        } else{
            triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
        }    
  )

facts.collect.foreach(println(_))
println
// equivalent to
facts.collect.foreach(e => println(e))

RC is worked on Cathay
TH is worked on Cathay
Roger is worked on Cathay
Miles is worked on Cathay
Amber is the follower of RC
Amber is the follower of TH
Amber is the junior of Roger
Amber is the junior of Miles
Amber is the colleague of Bgg
Amber is the colleague of Alex
Amber is the colleague of Vickie
Amber is worked on Cathay
Amber is learning Python
Amber is learning Scala
Bgg is worked on Cathay
Alex is worked on Cathay
Vickie is worked on Cathay

RC is worked on Cathay
TH is worked on Cathay
Roger is worked on Cathay
Miles is worked on Cathay
Amber is the follower of RC
Amber is the follower of TH
Amber is the junior of Roger
Amber is the junior of Miles
Amber is the colleague of Bgg
Amber is the colleague of Alex
Amber is the colleague of Vickie
Amber is worked on Cathay
Amber is learning Python
Amber is learning Scala
Bgg is worked on Cathay
Alex is worked on Cathay
Vickie is worked on Cathay


facts: RDD[String] = MapPartitionsRDD[43] at map at cmd25.sc:2

### (3.2) Using a for loop

In [27]:
// each triplet has the shape ((srcId, srcAttr), (dstId, dstAttr), attr)
for (triplet <- graph.triplets.filter(t => t.attr == "worked on").collect) {
  println(s"${triplet.srcAttr._1} is ${triplet.attr} ${triplet.dstAttr._1}")
}

RC is worked on Cathay
TH is worked on Cathay
Roger is worked on Cathay
Miles is worked on Cathay
Amber is worked on Cathay
Bgg is worked on Cathay
Alex is worked on Cathay
Vickie is worked on Cathay


Note:

In [28]:
// In Scala, <- (and its Unicode equivalent ←, \u2190) is a reserved token. In a for expression it
// introduces a generator that binds each element of the collection on its right-hand side.
for(x <- 1 to 5)  println(x)

1
2
3
4
5


## Step 4: Graph Operators

### (1). Degrees

In [29]:
println("Out-degrees (outDegrees): ")
graph.outDegrees.collect.foreach(println(_))

println("In-degrees (inDegrees): ")
graph.inDegrees.collect.foreach(println(_))

println("Degrees (degrees): ")
graph.degrees.collect.foreach(println(_))

Out-degrees (outDegrees): 
(4,1)
(1,1)
(6,1)
(3,1)
(7,1)
(8,1)
(5,10)
(2,1)
In-degrees (inDegrees): 
(4,1)
(11,1)
(1,1)
(6,1)
(3,1)
(7,1)
(9,8)
(8,1)
(10,1)
(2,1)
Degrees (degrees): 
(4,2)
(11,1)
(1,2)
(6,2)
(3,2)
(7,2)
(9,8)
(8,2)
(10,1)
(5,10)
(2,2)


In [30]:
println("Vertices with the largest outDegree, inDegree, and degree:")

// Keep whichever (vertexId, degree) pair has the larger degree
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

println("max of outDegrees:" + graph.outDegrees.reduce(max))
println("max of inDegrees:" + graph.inDegrees.reduce(max))
println("max of Degrees:" + graph.degrees.reduce(max))

Vertices with the largest outDegree, inDegree, and degree:
max of outDegrees:(5,10)
max of inDegrees:(9,8)
max of Degrees:(5,10)


defined [32mfunction[39m [36mmax[39m

### (2). Property Operators
These are similar to the RDD `map` operator, as shown below:
```scala
class Graph[VD, ED] {
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}
```

Each of these operators yields a new graph whose vertex or edge properties have been transformed by the user-defined `map` function.
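
The contract of `mapVertices` can be modeled with a plain Scala `Map`: the function receives the ID and the old attribute and returns only the new attribute, while the ID is carried over unchanged. This is a minimal sketch, not GraphX code; `mapVerticesModel`, `staff`, and `upperTitles` are hypothetical names used for illustration.

```scala
// Plain-Scala model of mapVertices: IDs are preserved, only attributes change.
def mapVerticesModel[VD, VD2](vertices: Map[Long, VD])(f: (Long, VD) => VD2): Map[Long, VD2] =
  vertices.map { case (id, attr) => (id, f(id, attr)) }

val staff: Map[Long, (String, String)] =
  Map((1L, ("RC", "Supervisor")), (5L, ("Amber", "Data Analyst")))

// The function returns just the new attribute; it does not need to return the ID.
val upperTitles = mapVerticesModel(staff) {
  case (_, (name, title)) => (name, title.toUpperCase)
}

upperTitles.foreach(println)
```

If the function returned `(id, (name, title))` instead, the ID would simply become part of the new attribute, giving values of type `(Long, (String, String))`.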

### (2.1) mapVertices

In [43]:
graph.mapVertices{ 
  case (id, (name, title)) => (id, (name, title*2)) 
}

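println("Simple vertex transformation:")

// Note: the function returns (id, (name, title * 2)), so the ID becomes part of the new
// vertex attribute; that is why v._2._1 below prints the ID.
graph.mapVertices{ 
  case (id, (name, title)) => (id, (name, title*2)) 
}.vertices.collect.foreach(v => println(s"${v._2._1} is ${v._2._2}"))

Simple vertex transformation: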
4 is (Miles,Data AnalystData Analyst)
11 is (Scala,ProgrammingProgramming)
1 is (RC,SupervisorSupervisor)
6 is (Bgg,Data AnalystData Analyst)
3 is (Roger,Data EngineerData Engineer)
7 is (Alex,Data EngineerData Engineer)
9 is (Cathay,CompanyCompany)
8 is (Vickie,Data EngineerData Engineer)
12 is (Java,ProgrammingProgramming)
10 is (Python,ProgrammingProgramming)
5 is (Amber,Data AnalystData Analyst)
2 is (TH,Data AnalystData Analyst)


res42_0: Graph[(VertexId, (String, String)), String] = org.apache.spark.graphx.impl.GraphImpl@60f1fc96

In [46]:
println("A more complex vertex transformation: derive the annual salary from the vertex's title:")
graph.mapVertices{ 
  case (id, (name, title)) => if (title == "Supervisor") (id, (name, title, 5000000)) 
                              else if (title == "Data Engineer" || title == "Data Analyst") (id, (name, title, 2000000))  
                              else (id, (name, title, 0))
}.vertices.collect.foreach(v => println(s"Vertex ID No.${v._2._1} is ${v._2._2}"))

A more complex vertex transformation: derive the annual salary from the vertex's title:
Vertex ID No.4 is (Miles,Data Analyst,2000000)
Vertex ID No.11 is (Scala,Programming,0)
Vertex ID No.1 is (RC,Supervisor,5000000)
Vertex ID No.6 is (Bgg,Data Analyst,2000000)
Vertex ID No.3 is (Roger,Data Engineer,2000000)
Vertex ID No.7 is (Alex,Data Engineer,2000000)
Vertex ID No.9 is (Cathay,Company,0)
Vertex ID No.8 is (Vickie,Data Engineer,2000000)
Vertex ID No.12 is (Java,Programming,0)
Vertex ID No.10 is (Python,Programming,0)
Vertex ID No.5 is (Amber,Data Analyst,2000000)
Vertex ID No.2 is (TH,Data Analyst,2000000)


In [47]:
println("A more complex vertex transformation: derive the annual salary from the vertex's title (alternative using pattern matching):")
graph.mapVertices{ 
  case (id, (name, title)) => title match {
                                    case "Supervisor" => (id, (name, title, 5000000))
                                    case "Data Engineer" | "Data Analyst" => (id, (name, title, 2000000))
                                    case _ => (id, (name, title, 0)) 
                                }
    }.vertices.collect.foreach(v => println(s"Vertex ID No.${v._2._1} is ${v._2._2}"))

A more complex vertex transformation: derive the annual salary from the vertex's title (alternative using pattern matching):
Vertex ID No.4 is (Miles,Data Analyst,2000000)
Vertex ID No.11 is (Scala,Programming,0)
Vertex ID No.1 is (RC,Supervisor,5000000)
Vertex ID No.6 is (Bgg,Data Analyst,2000000)
Vertex ID No.3 is (Roger,Data Engineer,2000000)
Vertex ID No.7 is (Alex,Data Engineer,2000000)
Vertex ID No.9 is (Cathay,Company,0)
Vertex ID No.8 is (Vickie,Data Engineer,2000000)
Vertex ID No.12 is (Java,Programming,0)
Vertex ID No.10 is (Python,Programming,0)
Vertex ID No.5 is (Amber,Data Analyst,2000000)
Vertex ID No.2 is (TH,Data Analyst,2000000)


### (2.2) mapEdges

In [48]:
println("Simple edge transformation: upper-case each edge attribute:")
graph.mapEdges(e => e.attr.toUpperCase)
  .edges.collect.foreach(e => println(s"${e.srcId} to ${e.dstId} attr: ${e.attr}"))


Simple edge transformation: upper-case each edge attribute:
1 to 9 attr: WORKED ON
2 to 9 attr: WORKED ON
3 to 9 attr: WORKED ON
4 to 9 attr: WORKED ON
5 to 1 attr: FOLLOWER
5 to 2 attr: FOLLOWER
5 to 3 attr: JUNIOR
5 to 4 attr: JUNIOR
5 to 6 attr: COLLEAGUE
5 to 7 attr: COLLEAGUE
5 to 8 attr: COLLEAGUE
5 to 9 attr: WORKED ON
5 to 10 attr: LEARNING
5 to 11 attr: LEARNING
6 to 9 attr: WORKED ON
7 to 9 attr: WORKED ON
8 to 9 attr: WORKED ON


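### (2.3) Additional notes

In [31]:
// Note that these operators do not change the graph structure. A key feature is that they reuse the structural indices of the original graph.
// The two snippets below are logically equivalent, but the first one does not preserve the structural indices, so it cannot benefit from GraphX optimizations.

// Method1: map
// This version does not preserve the structural indices, so it cannot benefit from GraphX optimizations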
val newVertices = graph.vertices.map { 
  case (id, (name, title)) => if (title == "Supervisor") (id, (name, title, 5000000)) 
                              else if (title == "Data Engineer" || title == "Data Analyst") (id, (name, title, 2000000))  
                              else (id, (name, title, 0))  
}
val newGraph1 = Graph(newVertices, graph.edges)

newGraph1.vertices.collect.foreach(v => println(s"${v._2._1} is ${v._2._2}, and his/her salary is ${v._2._3}"))

Miles is Data Analyst, and his/her salary is 2000000
Scala is Programming, and his/her salary is 0
RC is Supervisor, and his/her salary is 5000000
Bgg is Data Analyst, and his/her salary is 2000000
Roger is Data Engineer, and his/her salary is 2000000
Alex is Data Engineer, and his/her salary is 2000000
Cathay is Company, and his/her salary is 0
Vickie is Data Engineer, and his/her salary is 2000000
Java is Programming, and his/her salary is 0
Python is Programming, and his/her salary is 0
Amber is Data Analyst, and his/her salary is 2000000
TH is Data Analyst, and his/her salary is 2000000


newVertices: RDD[(VertexId, (String, String, Int))] = MapPartitionsRDD[57] at map at cmd30.sc:1
newGraph1: Graph[(String, String, Int), String] = org.apache.spark.graphx.impl.GraphImpl@7961f66a

In [42]:
// Method2:
// Alternatively, use mapVertices, which preserves the indices.

// val newGraph2 = graph.mapVertices{
//                                   case (id, (name, title)) => if (title == "Supervisor") (name, title, 5000000)
//                                                               else if (title == "Data Engineer" || title == "Data Analyst") (name, title, 2000000) 
//                                                               else (name, title, 0)
//                                   }

val newGraph = graph.mapVertices{ 
  case (id, (name, title)) => title match {
                                            case "Supervisor" => (name, title, 5000000) //(id, (name, title, 5000000))
                                            case "Data Engineer" | "Data Analyst" => (name, title, 2000000) //(id, (name, title, 2000000))
                                            case _ => (name, title, 0) //(id, (name, title, 0)) 
                                           }
}

newGraph.vertices.collect.foreach(v => println(s"${v._2._1} is ${v._2._2}, and his/her salary is ${v._2._3}"))
println
newGraph.vertices.collect.foreach(println(_))

Miles is Data Analyst, and his/her salary is 2000000
Scala is Programming, and his/her salary is 0
RC is Supervisor, and his/her salary is 5000000
Bgg is Data Analyst, and his/her salary is 2000000
Roger is Data Engineer, and his/her salary is 2000000
Alex is Data Engineer, and his/her salary is 2000000
Cathay is Company, and his/her salary is 0
Vickie is Data Engineer, and his/her salary is 2000000
Java is Programming, and his/her salary is 0
Python is Programming, and his/her salary is 0
Amber is Data Analyst, and his/her salary is 2000000
TH is Data Analyst, and his/her salary is 2000000

(4,(Miles,Data Analyst,2000000))
(11,(Scala,Programming,0))
(1,(RC,Supervisor,5000000))
(6,(Bgg,Data Analyst,2000000))
(3,(Roger,Data Engineer,2000000))
(7,(Alex,Data Engineer,2000000))
(9,(Cathay,Company,0))
(8,(Vickie,Data Engineer,2000000))
(12,(Java,Programming,0))
(10,(Python,Programming,0))
(5,(Amber,Data Analyst,2000000))
(2,(TH,Data Analyst,2000000))


newGraph: Graph[(String, String, Int), String] = org.apache.spark.graphx.impl.GraphImpl@6af26eaf

### (3). Structural Operators

The basic structural operators are listed below:
```scala
class Graph[VD, ED] {
  def reverse: Graph[VD, ED]
  def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
}
```
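* `reverse`: returns a new graph with all edge directions reversed. This can be useful, for example, when computing the inverse PageRank. Because the operation does not modify vertex or edge properties and does not change the number of edges, it can be implemented efficiently without moving or copying data.

* `subgraph`: takes a vertex predicate and an edge predicate and returns the graph containing only the vertices that satisfy the vertex predicate, together with the edges that satisfy the edge predicate and connect vertices that satisfy the vertex predicate. It is useful in many situations for restricting the graph to the vertices and edges of interest or for removing broken links.

* `groupEdges`: merges parallel edges (duplicate edges between the same pair of vertices). In many applications, parallel edges can be merged (their weights combined) into a single edge, which reduces the size of the graph.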
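
The semantics of these three operators can be modeled on plain Scala collections. The following is a minimal sketch under stated assumptions, not GraphX code: `E`, `titles`, `links`, and the derived values are hypothetical names, the sample data is abbreviated, and the duplicated edge is added on purpose to give the `groupEdges` model something to merge.

```scala
// Illustrative edge type; it only mirrors the srcId/dstId/attr fields of a GraphX Edge.
case class E(srcId: Long, dstId: Long, attr: String)

val titles: Map[Long, String] = Map(
  (2L, "Data Analyst"), (3L, "Data Engineer"), (5L, "Data Analyst"), (9L, "Company"))
val links: Seq[E] = Seq(
  E(5L, 2L, "follower"), E(5L, 3L, "junior"),
  E(5L, 9L, "worked on"), E(5L, 9L, "worked on")) // last edge duplicated on purpose

// reverse: swap source and destination of every edge; vertices are untouched.
val reversed: Seq[E] = links.map(e => E(e.dstId, e.srcId, e.attr))

// subgraph with only a vertex predicate: keep the matching vertices, and keep an
// edge only when both of its endpoints are among the kept vertices.
val keptVertices: Map[Long, String] = titles.filter { case (_, title) => title == "Data Analyst" }
val keptEdges: Seq[E] =
  links.filter(e => keptVertices.contains(e.srcId) && keptVertices.contains(e.dstId))

// groupEdges: merge edges that share the same (source, destination) pair.
val grouped: Seq[E] = links.groupBy(e => (e.srcId, e.dstId)).map {
  case ((src, dst), parallel) => E(src, dst, parallel.map(_.attr).reduce((a, b) => a + "+" + b))
}.toSeq

println(keptEdges)
println(grouped.sortBy(_.dstId))
```

In GraphX itself, the API documentation states that `groupEdges` gives correct results only after the graph has been repartitioned with `partitionBy`, since parallel edges must be co-located; the model above sidesteps that by grouping in memory.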

### (3.1)

In [50]:
println("Subgraph of vertices whose title is 'Data Analyst':")
// Note: only the vertex predicate is supplied here. A predicate that is not supplied defaults to true, i.e. it imposes no restriction.
val subGraph = graph.subgraph(vpred = (id, vd) => vd._2 == "Data Analyst")
subGraph.triplets.collect.foreach(println(_))
println
println("All vertices in the subgraph:")
subGraph.vertices.collect.foreach(v => println(s"${v._2._1} is ${v._2._2}"))
println
println("All edges in the subgraph:")
subGraph.edges.collect.foreach(e => println(s"${e.srcId} to ${e.dstId} attr: ${e.attr}"))

Subgraph of vertices whose title is 'Data Analyst':
((5,(Amber,Data Analyst)),(2,(TH,Data Analyst)),follower)
((5,(Amber,Data Analyst)),(4,(Miles,Data Analyst)),junior)
((5,(Amber,Data Analyst)),(6,(Bgg,Data Analyst)),colleague)

All vertices in the subgraph:
Miles is Data Analyst
Bgg is Data Analyst
Amber is Data Analyst
TH is Data Analyst

All edges in the subgraph:
5 to 2 attr: follower
5 to 4 attr: junior
5 to 6 attr: colleague


subGraph: Graph[(String, String), String] = org.apache.spark.graphx.impl.GraphImpl@54535be4

In [51]:
println("Subgraph of edges whose attribute is 'worked on':")
val subGraph2 = graph.subgraph(epred = e => e.attr == "worked on")  // e.srcId != e.dstId
subGraph2.triplets.collect.foreach(println(_))
println
println("All vertices in the subgraph:")
subGraph2.vertices.collect.foreach(v => println(s"${v._2._1} is ${v._2._2}"))
println
println("All edges in the subgraph:")
subGraph2.edges.collect.foreach(e => println(s"${e.srcId} to ${e.dstId} attr: ${e.attr}"))

Subgraph of edges whose attribute is 'worked on':
((1,(RC,Supervisor)),(9,(Cathay,Company)),worked on)
((2,(TH,Data Analyst)),(9,(Cathay,Company)),worked on)
((3,(Roger,Data Engineer)),(9,(Cathay,Company)),worked on)
((4,(Miles,Data Analyst)),(9,(Cathay,Company)),worked on)
((5,(Amber,Data Analyst)),(9,(Cathay,Company)),worked on)
((6,(Bgg,Data Analyst)),(9,(Cathay,Company)),worked on)
((7,(Alex,Data Engineer)),(9,(Cathay,Company)),worked on)
((8,(Vickie,Data Engineer)),(9,(Cathay,Company)),worked on)

All vertices in the subgraph:
Miles is Data Analyst
Scala is Programming
RC is Supervisor
Bgg is Data Analyst
Roger is Data Engineer
Alex is Data Engineer
Cathay is Company
Vickie is Data Engineer
Java is Programming
Python is Programming
Amber is Data Analyst
TH is Data Analyst

All edges in the subgraph:
1 to 9 attr: worked on
2 to 9 attr: worked on
3 to 9 attr: worked on
4 to 9 attr: worked on
5 to 9 attr: worked on
6 to 9 attr: worked on
7 to 9 attr: worked on
8 to 9 attr: worked on


subGraph2: Graph[(String, String), String] = org.apache.spark.graphx.impl.GraphImpl@65c50106

### (3.2)
The following example shows how to remove broken links:

In [52]:
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
                       (4L, ("peter", "student"))))

// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
                       Edge(4L, 0L, "student"),   Edge(5L, 0L, "colleague")))

// Define a default user in case there are relationships that refer to a missing user
val defaultUser = ("John Doe", "Missing")

// Build the initial Graph
val demoGraph = Graph(users, relationships, defaultUser)

users: RDD[(VertexId, (String, String))] = ParallelCollectionRDD[150] at parallelize at cmd51.sc:2
relationships: RDD[Edge[String]] = ParallelCollectionRDD[151] at parallelize at cmd51.sc:8
defaultUser: (String, String) = ("John Doe", "Missing")
demoGraph: Graph[(String, String), String] = org.apache.spark.graphx.impl.GraphImpl@1a96f8ba

In [53]:
// Notice that there is a user 0 (for which we have no information) connected to users 4 (peter) and 5 (franklin).
demoGraph.vertices.collect.foreach(println(_))
println()
demoGraph.triplets.map(
  triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
).collect.foreach(println(_))

(4,(peter,student))
(0,(John Doe,Missing))
(3,(rxin,student))
(7,(jgonzal,postdoc))
(5,(franklin,prof))
(2,(istoica,prof))

istoica is the colleague of franklin
rxin is the collab of jgonzal
peter is the student of John Doe
franklin is the colleague of John Doe
franklin is the advisor of rxin
franklin is the pi of jgonzal


In [55]:
// Remove missing vertices as well as the edges connected to them
val validGraph = demoGraph.subgraph(vpred = (id, attr) => attr._2 != "Missing")

// The valid subgraph will disconnect users 4 and 5 by removing user 0
validGraph.vertices.collect.foreach(println(_))
println()
validGraph.triplets.map(
  triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
).collect.foreach(println(_))

(4,(peter,student))
(3,(rxin,student))
(7,(jgonzal,postdoc))
(5,(franklin,prof))
(2,(istoica,prof))

istoica is the colleague of franklin
rxin is the collab of jgonzal
franklin is the advisor of rxin
franklin is the pi of jgonzal


validGraph: Graph[(String, String), String] = org.apache.spark.graphx.impl.GraphImpl@bfc634a

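### (4). Join Operators

In many cases it is necessary to join data from external collections with a graph. For example, we might want to merge extra user properties into an existing graph, or pull information from one graph into another.

These tasks can be accomplished with the join operators.

The key join operators are listed below.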
```scala
class Graph[VD, ED] {
  def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD)
    : Graph[VD, ED]
  def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])(map: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
}
```

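* `joinVertices`: joins the vertices with the input RDD and returns a new graph whose vertex properties are obtained by applying the user-defined `map` function to each joined vertex. Vertices without a matching value in the RDD keep their original value.

* `outerJoinVertices`: the more general operator behaves like `joinVertices`, except that the user-defined `map` function is applied to all vertices and may change the vertex property type. Because not every vertex has a matching value in the input RDD, the `map` function takes an `Option` parameter.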
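
The difference between the two operators can be modeled with plain Scala `Map`s. This is a minimal sketch, not GraphX code: `joinVerticesModel`, `outerJoinVerticesModel`, `names`, and `outDeg` are hypothetical names, and the sample values are chosen for illustration.

```scala
// joinVertices model: apply f only where the table has a match; otherwise keep the old value.
// The result type must stay VD because unmatched vertices keep their original attribute.
def joinVerticesModel[VD, U](vertices: Map[Long, VD], table: Map[Long, U])(
    f: (Long, VD, U) => VD): Map[Long, VD] =
  vertices.map { case (id, attr) =>
    (id, table.get(id).map(u => f(id, attr, u)).getOrElse(attr))
  }

// outerJoinVertices model: apply f to every vertex, passing Option[U]; the type may change.
def outerJoinVerticesModel[VD, U, VD2](vertices: Map[Long, VD], table: Map[Long, U])(
    f: (Long, VD, Option[U]) => VD2): Map[Long, VD2] =
  vertices.map { case (id, attr) => (id, f(id, attr, table.get(id))) }

val names: Map[Long, String] = Map((5L, "Amber"), (12L, "Java"))
val outDeg: Map[Long, Int] = Map((5L, 10)) // vertex 12 has no outgoing edges, so no entry

val joined = joinVerticesModel(names, outDeg)((_, name, d) => s"$name($d)")
val outer = outerJoinVerticesModel(names, outDeg)((_, name, d) => (name, d.getOrElse(0)))

println(joined)
println(outer)
```

The `getOrElse(0)` in the outer-join function plays the same role as in the cell that follows, where vertices missing from `inDegrees` or `outDegrees` receive a degree of 0.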


In [62]:
// newGraph.vertices.collect.foreach(println(_))

In [63]:
case class User(name: String, title: String, salary: Int, inDeg: Int, outDeg: Int)

// Create a new graph whose vertex type is User, converting from the existing graph newGraph
val initialUserGraph: Graph[User, String] = newGraph.mapVertices { case (id, (name, title, salary)) => User(name, title, salary, 0, 0)}

// Inspect the result
// initialUserGraph.vertices.collect.foreach(println(_))
// println
// initialUserGraph.edges.collect.foreach(println(_))

// Join initialUserGraph with the inDegrees and outDegrees RDDs to fill in the inDeg and outDeg values
val userGraph = initialUserGraph.outerJoinVertices(initialUserGraph.inDegrees) {
  case (id, user, inDegOpt) => User(user.name, user.title, user.salary, inDegOpt.getOrElse(0), user.outDeg) // at this point user.outDeg is always 0
}.outerJoinVertices(initialUserGraph.outDegrees) {
  case (id, user, outDegOpt) => User(user.name, user.title, user.salary, user.inDeg, outDegOpt.getOrElse(0))
}

defined class User
initialUserGraph: Graph[wrapper.wrapper.User, String] = org.apache.spark.graphx.impl.GraphImpl@6301c3d1
userGraph: Graph[wrapper.wrapper.User, String] = org.apache.spark.graphx.impl.GraphImpl@47cb0b5c

In [63]:
// Both graphs yield the same degree information
// initialUserGraph.inDegrees.collect.foreach(println(_)) == graph.inDegrees.collect.foreach(println(_))
// println
// initialUserGraph.outDegrees.collect.foreach(println(_)) == graph.outDegrees.collect.foreach(println(_))

In [64]:
// Inspect the result
userGraph.vertices.collect.foreach(println(_))
println
println("Graph properties:")
userGraph.vertices.collect.foreach(v => println(s"${v._2.name} inDeg: ${v._2.inDeg}  outDeg: ${v._2.outDeg}"))

(4,User(Miles,Data Analyst,2000000,1,1))
(11,User(Scala,Programming,0,1,0))
(1,User(RC,Supervisor,5000000,1,1))
(6,User(Bgg,Data Analyst,2000000,1,1))
(3,User(Roger,Data Engineer,2000000,1,1))
(7,User(Alex,Data Engineer,2000000,1,1))
(9,User(Cathay,Company,0,8,0))
(8,User(Vickie,Data Engineer,2000000,1,1))
(12,User(Java,Programming,0,0,0))
(10,User(Python,Programming,0,1,0))
(5,User(Amber,Data Analyst,2000000,0,10))
(2,User(TH,Data Analyst,2000000,1,1))

Graph properties:
Miles inDeg: 1  outDeg: 1
Scala inDeg: 1  outDeg: 0
RC inDeg: 1  outDeg: 1
Bgg inDeg: 1  outDeg: 1
Roger inDeg: 1  outDeg: 1
Alex inDeg: 1  outDeg: 1
Cathay inDeg: 8  outDeg: 0
Vickie inDeg: 1  outDeg: 1
Java inDeg: 0  outDeg: 0
Python inDeg: 1  outDeg: 0
Amber inDeg: 0  outDeg: 10
TH inDeg: 1  outDeg: 1


In [65]:
println("Vertices whose in-degree equals their out-degree:")
userGraph.vertices.filter {
  case (id, user) => user.inDeg == user.outDeg
}.collect.foreach {
  case (id, user) => println(user.name)
}

Vertices whose in-degree equals their out-degree:
Miles
RC
Bgg
Roger
Alex
Vickie
Java
TH


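### (5). Neighborhood Aggregation

Neighborhood aggregation gathers information from around each vertex. For example, we may want to know the number of followers each user has, or the average annual salary. Many iterative graph algorithms (such as PageRank, Shortest Path, and Connected Components) repeatedly aggregate properties of neighboring vertices (for example, the current PageRank value, the shortest path to the source, and the smallest reachable vertex id).

To improve performance, the primary aggregation operator was changed from `graph.mapReduceTriplets` to the new `graph.aggregateMessages`.

### Aggregate Messages (aggregateMessages)

The core aggregation operation in GraphX is `aggregateMessages`. This operator applies a user-defined `sendMsg` function to each edge triplet in the graph, and then uses the `mergeMsg` function to aggregate those messages at their destination vertex.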

```scala
class Graph[VD, ED] {
  def aggregateMessages[Msg: ClassTag](
      sendMsg: EdgeContext[VD, ED, Msg] => Unit,
      mergeMsg: (Msg, Msg) => Msg,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[Msg]
}
```


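* The user-defined `sendMsg` function takes an `EdgeContext`, which exposes the source and destination attributes along with the edge attribute, and provides functions for sending messages to the source and destination vertices (`sendToSrc` and `sendToDst`). Think of `sendMsg` as the map function in map-reduce.

* The user-defined `mergeMsg` function takes two messages destined for the same vertex and yields a single message. Think of `mergeMsg` as the reduce function in map-reduce. The `aggregateMessages` operator returns a `VertexRDD[Msg]` containing the aggregated message (of type `Msg`) destined for each vertex. Vertices that did not receive a message are not included in the returned `VertexRDD`.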
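As a rough sketch of the map and reduce roles described above, the following computes the average salary of each vertex's followers. It is illustrative only: the graph `salaryGraph` and its values are hypothetical, the local `SparkContext` setup is an assumption, and passing `TripletFields.Src` is an optional hint that `sendMsg` reads only the source attribute.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

// Assumption: a local SparkContext; getOrCreate reuses one if it already exists.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("aggregate-messages-sketch"))

// Hypothetical toy graph: the vertex attribute is a salary,
// and an edge src -> dst means "src follows dst".
val salaries: RDD[(VertexId, Int)] = sc.parallelize(Seq(
  (1L, 100), (2L, 200), (3L, 300), (4L, 400)))
val follows: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
val salaryGraph: Graph[Int, Int] = Graph(salaries, follows)

// sendMsg (map): every edge sends (1, source salary) to its destination vertex.
// mergeMsg (reduce): add up the counts and salaries that arrive at the same vertex.
val followerStats: VertexRDD[(Int, Long)] = salaryGraph.aggregateMessages[(Int, Long)](
  ctx => ctx.sendToDst((1, ctx.srcAttr.toLong)),
  (a, b) => (a._1 + b._1, a._2 + b._2),
  TripletFields.Src)

// Divide total salary by follower count; vertices without followers are absent.
val avgFollowerSalary: Map[VertexId, Double] = followerStats.collect.map {
  case (id, (count, total)) => id -> total.toDouble / count
}.toMap
```

Vertices 1 and 2 have no incoming edges, so they receive no message and do not appear in `avgFollowerSalary`.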

In [68]:
val maxSalaryFollower: VertexRDD[(String,Int)] = userGraph.aggregateMessages[(String, Int)](
  // Map step: send the source vertex's name and salary to the destination vertex
  triplet => {
      triplet.sendToDst((triplet.srcAttr.name, triplet.srcAttr.salary))
  },
  // Reduce step: keep the follower with the highest salary
  (a, b) => if (a._2 > b._2) a else b
)

maxSalaryFollower.collect.foreach(println(_))
println

userGraph.vertices.leftJoin(maxSalaryFollower) {
  (id, user, optMaxSalaryFollower) => optMaxSalaryFollower match {
    case None => s"${user.name} does not have any followers."
    case Some((name, salary)) => s"${name} is the follower of ${user.name}."
  }
}.collect.foreach{ case (id, str) => println(str) }

(4,(Amber,2000000))
(11,(Amber,2000000))
(1,(Amber,2000000))
(6,(Amber,2000000))
(3,(Amber,2000000))
(7,(Amber,2000000))
(9,(RC,5000000))
(8,(Amber,2000000))
(10,(Amber,2000000))
(2,(Amber,2000000))

Amber is the follower of Miles.
Amber is the follower of Scala.
Amber is the follower of RC.
Amber is the follower of Bgg.
Amber is the follower of Roger.
Amber is the follower of Alex.
RC is the follower of Cathay.
Amber is the follower of Vickie.
Java does not have any followers.
Amber is the follower of Python.
Amber does not have any followers.
Amber is the follower of TH.


maxSalaryFollower: VertexRDD[(String, Int)] = VertexRDDImpl[322] at RDD at VertexRDD.scala:57