# Transforming and Shaping Up Graphs to Your Needs

First of all, I sincerely apologize that I cannot present the chapter I was assigned in person. Even so, working through the key parts in the exercise code below should help a great deal.

#### Goals of this chapter

• Use property operators to modify vertex or edge properties

• Use structural operators to modify the shape of a graph

• Join additional RDD collections with a property graph

# Transforming the vertex and edge attributes

map is a core transformation in Spark.
The Graph class defines three map operators:

class Graph[VD, ED] {

 def mapVertices[VD2](mapFun: (VertexId, VD) => VD2): Graph[VD2, ED]

 def mapEdges[ED2](mapFun: Edge[ED] => ED2): Graph[VD, ED2]

 def mapTriplets[ED2](mapFun: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}

In [31]:
import org.apache.spark.rdd._

In [1]:
import org.apache.spark.graphx._

In [2]:
case class Person(first: String, last: String, age: Int)
case class Link( relationship: String, duration: Float)

For the exercise, create PersonData and EdgeData.

In [3]:
val PersonData = sc.parallelize( Array( (1L, ("f1","l1",1)), (2L, ("f2","l2",2)), (3L, ("f3","l3",3)) ) )
val EdgeData = sc.parallelize( Array( (1L, 2L, ("r1",1.0)), (2L, 3L, ("r2",2.0)) ) )


Map the raw data into the shape the Graph constructor expects: (VertexId, Person) pairs for the vertices and Edge[Link] objects for the edges.

In [4]:

val VertexRDD = PersonData.map( s => ( s._1, Person(s._2._1, s._2._2, s._2._3) ) )
val EdgeRDD = EdgeData.map( s => Edge( s._1, s._2, Link(s._3._1, s._3._2.toFloat) ) )

In [5]:
val inputGraph: Graph[Person,Link] = Graph(VertexRDD, EdgeRDD)

In [6]:
val outputGraph: Graph[String, Link] =
inputGraph.mapVertices((_, person) => person.first + person.last)

mapEdges passes the whole Edge[Link] object to the function, not just its attribute, so the relationship string has to be read through link.attr:

In [29]:
val outputGraph: Graph[Person, String] =
inputGraph.mapEdges(link => link.attr.relationship)

In [26]:
val outputGraph: Graph[Person, (Double, Double)] =
inputGraph.mapTriplets(t => (t.srcAttr.age - t.attr.duration,
t.dstAttr.age - t.attr.duration))

In [60]:
outputGraph.vertices.collect.foreach(println)

(1,Person(f1,l1,1))
(2,Person(f2,l2,2))
(3,Person(f3,l3,3))
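
The vertex listing above is unchanged because the map operators on edges and triplets only rewrite edge attributes. To see what mapTriplets actually produced, look at the edges instead. The following is a minimal sketch that rebuilds the same toy graph so it can run on its own; it assumes a local SparkContext obtained through SparkContext.getOrCreate (an already running context is reused), and the names vertexRDD, edgeRDD, tripletGraph and newEdges are introduced only for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the running context if there is one; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("mapTriplets-sketch"))

case class Person(first: String, last: String, age: Int)
case class Link(relationship: String, duration: Float)

// Same toy data as the cells above, built directly as Person and Link values.
val vertexRDD = sc.parallelize(Seq(
  (1L, Person("f1", "l1", 1)), (2L, Person("f2", "l2", 2)), (3L, Person("f3", "l3", 3))))
val edgeRDD = sc.parallelize(Seq(
  Edge(1L, 2L, Link("r1", 1.0f)), Edge(2L, 3L, Link("r2", 2.0f))))
val inputGraph: Graph[Person, Link] = Graph(vertexRDD, edgeRDD)

// age is an Int and duration a Float, so each difference is widened explicitly.
val tripletGraph: Graph[Person, (Double, Double)] =
  inputGraph.mapTriplets(t =>
    ((t.srcAttr.age - t.attr.duration).toDouble,
     (t.dstAttr.age - t.attr.duration).toDouble))

val newEdges = tripletGraph.edges.collect.sortBy(_.srcId)
newEdges.foreach(println)
// Edge(1,2,(0.0,1.0))
// Edge(2,3,(0.0,1.0))
```

Each edge now carries a pair of numbers derived from both endpoints, which is something mapEdges cannot do because it never sees the vertex attributes.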


# Modifying graph structures

The GraphX library provides methods that change the structure of a graph.

class Graph[VD, ED] {
 def reverse: Graph[VD, ED]

 def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
 vpred: (VertexId, VD) => Boolean): Graph[VD, ED]

 def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]

 def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
}
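
Of the four operators listed, groupEdges has one catch worth seeing in code: it only merges parallel edges that sit in the same partition, so the graph is normally repartitioned with partitionBy first. The following is a minimal sketch on a made-up multigraph (the vertex labels and weights are hypothetical and not part of the chapter's data), again assuming a local SparkContext from SparkContext.getOrCreate.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the running context if there is one; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("groupEdges-sketch"))

// Hypothetical toy multigraph: two parallel edges 1 -> 2 and one edge 2 -> 3.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 10), Edge(1L, 2L, 5), Edge(2L, 3L, 1)))
val multigraph = Graph(vertices, edges)

// Co-locate edges that share source and destination, then sum their weights.
val merged = multigraph
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .groupEdges(_ + _)

val mergedEdges = merged.edges.collect.sortBy(_.srcId)
mergedEdges.foreach(println)
// Edge(1,2,15)
// Edge(2,3,1)
```

Without the partitionBy step the two 1 -> 2 edges could land in different partitions and stay unmerged, which is easy to miss on small test data.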

# subgraph

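subgraph keeps only the vertices that satisfy the vertex predicate and the edges that satisfy the edge predicate and connect two kept vertices. This makes it useful for dropping parts of the graph that are not properly connected.

First create the vertices and edges.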

In [32]:
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
                       (4L, ("peter", "student"))))

In [33]:
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
                       Edge(4L, 0L, "student"),   Edge(5L, 0L, "colleague")))

Build a graph that assigns defaultUser to any vertex that an edge references but that has no entry in users (here, vertex 0).

In [34]:
val defaultUser = ("John Doe", "Missing")

In [35]:
val graph = Graph(users, relationships, defaultUser)

A triplet literally means a set of three; here it stands for source vertex, edge, destination vertex.

In [36]:
graph.triplets.map(
    triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
  ).collect.foreach(println(_))

Use subgraph to exclude the vertices whose second attribute is Missing.

In [37]:
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")

In [38]:
validGraph.vertices.collect.foreach(println(_))

(2,(istoica,prof))
(3,(rxin,student))
(4,(peter,student))
(5,(franklin,prof))
(7,(jgonzal,postdoc))


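Reading each triplet as vertex, edge, vertex: for the edge from vertex 3 to vertex 7, the code prints the first attribute of vertex 3 (rxin), then " is the ", the attribute of the edge, " of ", and the first attribute of the destination vertex 7. Applying the same rule to every remaining triplet in the graph gives the output below.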

In [39]:
validGraph.triplets.map(
    triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
  ).collect.foreach(println(_))

rxin is the collab of jgonzal
franklin is the advisor of rxin
istoica is the colleague of franklin
franklin is the pi of jgonzal


# mask

In [41]:
val ccGraph = graph.connectedComponents()

In [3]:
ccGraph.edges.collect.foreach(println(_))

Edge(3,7,collab)
Edge(5,3,advisor)
Edge(2,5,colleague)
Edge(5,7,pi)
Edge(4,0,student)
Edge(5,0,colleague)


In [42]:
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")

In [6]:
validGraph.edges.collect.foreach(println(_))

Edge(3,7,collab)
Edge(5,3,advisor)
Edge(2,5,colleague)
Edge(5,7,pi)


In [7]:
validGraph.vertices.collect.foreach(println(_))

(2,(istoica,prof))
(3,(rxin,student))
(4,(peter,student))
(5,(franklin,prof))
(7,(jgonzal,postdoc))


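mask restricts ccGraph to the vertices and edges that also exist in validGraph, so the connected-components result is returned only for the valid part of the graph.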

In [8]:
val validCCGraph = ccGraph.mask(validGraph)

In [10]:
validCCGraph.edges.collect.foreach(println(_))

Edge(3,7,collab)
Edge(5,3,advisor)
Edge(2,5,colleague)
Edge(5,7,pi)
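
The edge listing does not show the component labels themselves, which live on the vertices. The following is a minimal sketch that rebuilds the same users graph so it can run on its own and prints the vertices of the masked graph; it assumes a local SparkContext from SparkContext.getOrCreate, and the name components is introduced only for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the running context if there is one; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("mask-sketch"))

val users = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
  (4L, ("peter", "student"))))
val relationships = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
  Edge(4L, 0L, "student"), Edge(5L, 0L, "colleague")))
val graph = Graph(users, relationships, ("John Doe", "Missing"))

// Components are computed on the full graph, including the default vertex 0.
val ccGraph = graph.connectedComponents()
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
val validCCGraph = ccGraph.mask(validGraph)

// Each vertex attribute is now the lowest vertex id in its component.
val components = validCCGraph.vertices.collect.sortBy(_._1)
components.foreach(println)
// (2,0)
// (3,0)
// (4,0)
// (5,0)
// (7,0)
```

Every vertex is labelled 0 because the components were computed before the missing vertex 0 was removed. Vertex 4, for instance, has no edges left in validGraph but still carries the label from the full graph. If the labels should reflect only the valid part, running connectedComponents on validGraph directly would be the alternative.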


# Example - Hollywood movie graph

Let's work through the example from Chapter 4.

In [11]:
 val actors: RDD[(VertexId, String)] = sc.parallelize(List(
 (1L, "George Clooney"),(2L, "Julia Stiles"),
 (3L, "Will Smith"), (4L, "Matt Damon"),
 (5L, "Salma Hayek")))

In [12]:
val movies: RDD[Edge[String]] = sc.parallelize(List(
 Edge(1L,4L,"Ocean's Eleven"),
 Edge(2L, 4L, "Bourne Ultimatum"),
 Edge(3L, 5L, "Wild Wild West"),
 Edge(1L, 5L, "From Dusk Till Dawn"),
 Edge(3L, 4L, "The Legend of Bagger Vance"))
)

In [13]:
 val movieGraph = Graph(actors, movies)

In [14]:
 movieGraph.vertices.collect.foreach(println(_))

(1,George Clooney)
(2,Julia Stiles)
(3,Will Smith)
(4,Matt Damon)
(5,Salma Hayek)


In [15]:
movieGraph.edges.collect.foreach(println(_))

Edge(1,4,Ocean's Eleven)
Edge(2,4,Bourne Ultimatum)
Edge(3,5,Wild Wild West)
Edge(1,5,From Dusk Till Dawn)
Edge(3,4,The Legend of Bagger Vance)


The cells above build movieGraph from these actors and movies.

In [19]:
 movieGraph.triplets.collect.foreach(t => println(
t.srcAttr + " & " + t.dstAttr + " appeared in " + t.attr))

George Clooney & Matt Damon appeared in Ocean's Eleven
Julia Stiles & Matt Damon appeared in Bourne Ultimatum
Will Smith & Salma Hayek appeared in Wild Wild West
George Clooney & Salma Hayek appeared in From Dusk Till Dawn
Will Smith & Matt Damon appeared in The Legend of Bagger Vance


In [20]:
 movieGraph.vertices.collect.foreach(println(_))

(1,George Clooney)
(2,Julia Stiles)
(3,Will Smith)
(4,Matt Damon)
(5,Salma Hayek)


To practice joinVertices, create an RDD of biographies holding each actor's birth name and hometown.

In [21]:
 case class Biography(birthname: String, hometown: String)

In [24]:
 val bio: RDD[(VertexId, Biography)] = sc.parallelize(List(
 (2, Biography("Julia O'Hara Stiles", "NY City, NY, USA")),
 (3, Biography("Willard Christopher Smith Jr.", "Philadelphia, PA,USA")),
 (4, Biography("Matthew Paige Damon", "Boston, MA, USA")),
 (5, Biography("Salma Valgarma Hayek-Jimenez", "Coatzacoalcos, Veracruz,Mexico")),
 (6, Biography("José Antonio Domínguez Banderas", "Málaga, Andalucía,Spain")),
 (7, Biography("Paul William Walker IV", "Glendale, CA, USA"))))

In [25]:
 def appendHometown(id: VertexId, name: String, bio: Biography):
String = name + ":"+ bio.hometown

Run joinVertices with the bio data, using appendHometown to merge each matching entry into the vertex.

In [26]:
 val movieJoinedGraph =
movieGraph.joinVertices(bio)(appendHometown) 

In [28]:
 movieJoinedGraph.vertices.collect.foreach(println) 

(1,George Clooney)
(2,Julia Stiles:NY City, NY, USA)
(3,Will Smith:Philadelphia, PA,USA)
(4,Matt Damon:Boston, MA, USA)
(5,Salma Hayek:Coatzacoalcos, Veracruz,Mexico)


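## outerJoinVertices

The outer join used here is a left outer join: every entry on the left side is kept, and data from the right side is attached wherever a match exists. In this case movieGraph is the left side and the bio RDD is joined onto it. You can confirm this from George Clooney, whose biography comes back as None.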

In [29]:
 val movieOuterJoinedGraph =
movieGraph.outerJoinVertices(bio)((_,name, bio) => (name,bio)) 

In [31]:
 movieOuterJoinedGraph.vertices.collect.foreach(println)

(1,(George Clooney,None))
(2,(Julia Stiles,Some(Biography(Julia O'Hara Stiles,NY City, NY, USA))))
(3,(Will Smith,Some(Biography(Willard Christopher Smith Jr.,Philadelphia, PA,USA))))
(4,(Matt Damon,Some(Biography(Matthew Paige Damon,Boston, MA, USA))))
(5,(Salma Hayek,Some(Biography(Salma Valgarma Hayek-Jimenez,Coatzacoalcos, Veracruz,Mexico))))


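The cells below tidy up the outer-join result so it is easier to read, first by replacing a missing biography with a default value and then by mapping everything into an Actor case class.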

In [32]:
 val movieOuterJoinedGraph = movieGraph.outerJoinVertices(bio)((_,
name, bio) =>
(name,bio.getOrElse(Biography("NA","NA")))) 

In [34]:
 movieOuterJoinedGraph.vertices.collect.foreach(println)

(1,(George Clooney,Biography(NA,NA)))
(2,(Julia Stiles,Biography(Julia O'Hara Stiles,NY City, NY, USA)))
(3,(Will Smith,Biography(Willard Christopher Smith Jr.,Philadelphia, PA,USA)))
(4,(Matt Damon,Biography(Matthew Paige Damon,Boston, MA, USA)))
(5,(Salma Hayek,Biography(Salma Valgarma Hayek-Jimenez,Coatzacoalcos, Veracruz,Mexico)))


In [35]:
 case class Actor(name: String, birthname: String, hometown:
String) 

In [36]:
 val movieOuterJoinedGraph = movieGraph.outerJoinVertices(bio)((_,
name, b) => b match {
 case Some(bio) => Actor(name, bio.birthname, bio.hometown)
 case None => Actor(name, "", "")
 })

In [38]:
 movieOuterJoinedGraph.vertices.collect.foreach(println)

(1,Actor(George Clooney,,))
(2,Actor(Julia Stiles,Julia O'Hara Stiles,NY City, NY, USA))
(3,Actor(Will Smith,Willard Christopher Smith Jr.,Philadelphia, PA,USA))
(4,Actor(Matt Damon,Matthew Paige Damon,Boston, MA, USA))
(5,Actor(Salma Hayek,Salma Valgarma Hayek-Jimenez,Coatzacoalcos, Veracruz,Mexico))


# Data operations on VertexRDD and EdgeRDD

An example using mapValues, which transforms each vertex value while leaving the vertex IDs untouched.

In [39]:
 val actorsBio = movieJoinedGraph.vertices

In [41]:
 actorsBio.collect.foreach(println)

(1,George Clooney)
(2,Julia Stiles:NY City, NY, USA)
(3,Will Smith:Philadelphia, PA,USA)
(4,Matt Damon:Boston, MA, USA)
(5,Salma Hayek:Coatzacoalcos, Veracruz,Mexico)


In [46]:
 actorsBio.mapValues(s => s.split(':')(0)).collect.foreach(println)

(1,George Clooney)
(2,Julia Stiles)
(3,Will Smith)
(4,Matt Damon)
(5,Salma Hayek)


In [47]:
 actorsBio.mapValues((vid,s) => s.split(':')(0)).collect.foreach(println)

(1,George Clooney)
(2,Julia Stiles)
(3,Will Smith)
(4,Matt Damon)
(5,Salma Hayek)


## Joining VertexRDDs

This part walks through innerJoin and leftJoin with examples.

In [48]:
 val actors = movieGraph.vertices

innerJoin returns only the entries whose keys are found on both sides.

In [51]:
 actors.innerJoin(bio)((vid, name, b) => name + " is from " +
b.hometown).collect.foreach(println)

(2,Julia Stiles is from NY City, NY, USA)
(3,Will Smith is from Philadelphia, PA,USA)
(4,Matt Damon is from Boston, MA, USA)
(5,Salma Hayek is from Coatzacoalcos, Veracruz,Mexico)


leftJoin performs a left outer join of bio onto actors, so entries in actors without a match still appear in the result.

In [52]:
 actors.leftJoin(bio)((vid, name, b) => b match {
 case Some(bio) => name + " is from " + bio.hometown
 case None => name + "\'s hometown is unknown"
}).collect.foreach(println)

(1,George Clooney's hometown is unknown)
(2,Julia Stiles is from NY City, NY, USA)
(3,Will Smith is from Philadelphia, PA,USA)
(4,Matt Damon is from Boston, MA, USA)
(5,Salma Hayek is from Coatzacoalcos, Veracruz,Mexico)


# Reversing edge directions

Here we use the reverse operator to flip the direction of the edges.

In [53]:
 val movies = movieGraph.edges

In [57]:
 movies.collect.foreach(println)

Edge(1,4,Ocean's Eleven)
Edge(2,4,Bourne Ultimatum)
Edge(3,5,Wild Wild West)
Edge(1,5,From Dusk Till Dawn)
Edge(3,4,The Legend of Bagger Vance)


In [55]:
 val bidirectedGraph = Graph(actors, movies union
 movies.reverse)

In [59]:
bidirectedGraph.edges.collect.foreach(println)

Edge(1,4,Ocean's Eleven)
Edge(4,1,Ocean's Eleven)
Edge(2,4,Bourne Ultimatum)
Edge(4,2,Bourne Ultimatum)
Edge(3,5,Wild Wild West)
Edge(5,3,Wild Wild West)
Edge(1,5,From Dusk Till Dawn)
Edge(5,1,From Dusk Till Dawn)
Edge(3,4,The Legend of Bagger Vance)
Edge(4,3,The Legend of Bagger Vance)
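
The cells above call reverse on an EdgeRDD and union the result with the original edges. Graph also exposes reverse directly, as listed among the structural operators earlier, which flips every edge and leaves the vertex and edge attributes alone. The following is a minimal sketch that rebuilds the movie graph so it can run on its own, assuming a local SparkContext from SparkContext.getOrCreate; reversedGraph and reversedPairs are names introduced only for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

// Reuses the running context if there is one; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("reverse-sketch"))

val actors = sc.parallelize(Seq(
  (1L, "George Clooney"), (2L, "Julia Stiles"), (3L, "Will Smith"),
  (4L, "Matt Damon"), (5L, "Salma Hayek")))
val movies = sc.parallelize(Seq(
  Edge(1L, 4L, "Ocean's Eleven"), Edge(2L, 4L, "Bourne Ultimatum"),
  Edge(3L, 5L, "Wild Wild West"), Edge(1L, 5L, "From Dusk Till Dawn"),
  Edge(3L, 4L, "The Legend of Bagger Vance")))
val movieGraph = Graph(actors, movies)

// Graph.reverse swaps source and destination on every edge.
val reversedGraph = movieGraph.reverse

val reversedPairs = reversedGraph.edges.collect.map(e => (e.srcId, e.dstId)).sorted
reversedPairs.foreach(println)
// (4,1)
// (4,2)
// (4,3)
// (5,1)
// (5,3)
```

Unlike the union above, this returns a graph with the same number of edges as the original, only pointing the other way.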


# Example - from food network to flavor pairing

This was meant to be the last exercise. Unfortunately, I defined ingredients and compounds on top of the existing FNNode and tried running it, but it did not work. Any help with this exercise would be appreciated.
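
The cells in this section reference ingredients, compounds, links, pairIngredients and flavorWeightedNetwork, none of which are defined in this notebook. One way the exercise could be made to run end to end is sketched here on made-up toy data. Everything in it is an assumption for illustration rather than the chapter's actual code or dataset: the FNNode, Ingredient and Compound classes, the three ingredients and two compounds, the body of pairIngredients, and the use of partitionBy followed by groupEdges to build flavorWeightedNetwork.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Reuses the running context if there is one; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("flavor-network-sketch"))

// Assumed node types; the notebook's real FNNode definition is not shown.
class FNNode(val name: String) extends Serializable
case class Ingredient(override val name: String, category: String) extends FNNode(name)
case class Compound(override val name: String, cas: String) extends FNNode(name)

// Made-up toy data. Both RDDs are typed as FNNode so that ++ type-checks.
val ingredients: RDD[(VertexId, FNNode)] = sc.parallelize(Seq[(VertexId, FNNode)](
  (1L, Ingredient("tomato", "vegetable")),
  (2L, Ingredient("basil", "herb")),
  (3L, Ingredient("cheese", "dairy"))))
val compounds: RDD[(VertexId, FNNode)] = sc.parallelize(Seq[(VertexId, FNNode)](
  (101L, Compound("compound-a", "cas-a")),
  (102L, Compound("compound-b", "cas-b"))))
// Each edge points from an ingredient to a compound it contains.
val links: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 101L, 1), Edge(2L, 101L, 1),
  Edge(1L, 102L, 1), Edge(2L, 102L, 1), Edge(3L, 102L, 1)))

val foodNetwork = Graph(ingredients ++ compounds, links)

// For every compound, the ids of the ingredients that point at it.
val similarIngr: RDD[(VertexId, Array[VertexId])] =
  foodNetwork.collectNeighborIds(EdgeDirection.In)

// Hypothetical helper: one weight-1 edge per unordered pair of ingredients.
def pairIngredients(ingPerComp: (VertexId, Array[VertexId])): Seq[Edge[Int]] =
  (for {
    x <- ingPerComp._2
    y <- ingPerComp._2
    if x < y
  } yield Edge(x, y, 1)).toSeq

val flavorPairsRDD: RDD[Edge[Int]] = similarIngr.flatMap(pairIngredients)
val flavorNetwork = Graph(ingredients, flavorPairsRDD).cache()

// Merge parallel edges so each weight counts the shared compounds.
val flavorWeightedNetwork = flavorNetwork
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .groupEdges(_ + _)

// Copy the fields out of each triplet before collecting, then rank by weight.
val ranked = flavorWeightedNetwork.triplets
  .map(t => (t.srcAttr.name, t.dstAttr.name, t.attr))
  .collect
  .sortBy(r => (-r._3, r._1, r._2))
ranked.foreach { case (a, b, n) => println(a + " and " + b + " share " + n + " compounds.") }
// tomato and basil share 2 compounds.
// basil and cheese share 1 compounds.
// tomato and cheese share 1 compounds.
```

If the original attempt failed at ingredients ++ compounds, one possible cause is that the two RDDs had different element types, such as RDD[(VertexId, Ingredient)] and RDD[(VertexId, Compound)]. RDD is invariant in its element type, so both sides need to be declared as RDD[(VertexId, FNNode)]. Another possible cause is a base class that is not Serializable. Both are guesses, since the failing code and error message are not shown.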

In [None]:
 val nodes = ingredients ++ compounds

In [None]:
 val foodNetwork = Graph(nodes, links)

In [None]:
 val similarIngr: RDD[(VertexId, Array[VertexId])] =
foodNetwork.collectNeighborIds(EdgeDirection.In)

In [None]:
 val flavorPairsRDD: RDD[Edge[Int]] = similarIngr flatMap
pairIngredients

In [None]:
 val flavorNetwork = Graph(ingredients, flavorPairsRDD).cache

In [None]:
 flavorNetwork.triplets.take(20).foreach(println)

In [None]:
 flavorWeightedNetwork.triplets.
sortBy(t => t.attr, false).take(20).
foreach(t => println(t.srcAttr.name + " and " + t.dstAttr.name + " share " + t.attr + " compounds."))