# Getting Started with Spark and GraphX

Apache Spark Graph Processing - chapter 1 (review) and chapter 2

Data files:  
https://www.packtpub.com/books/content/support/21578  
(chapters 1 and 2)  

https://www.dropbox.com/s/v3iprxv5fjxrn8o/data.zip

All data files used in this presentation are referenced as "./data/filename".

### Prerequisite reading

http://www.slideshare.net/sanghoonlee982/spark-overview-20141106  
Spark overview, Sanghoon Lee (SK C&C), Spark User Group meetup, 2014-11-06

http://www.slideshare.net/sangwookimme/graphx  
Spark User Group tech talk (Sangwoo Kim, Hyangmin Jeong) - GraphX

### No analysis environment for studying Spark yet? Let's start with Docker!

docker run -d -p 9999:8888 -e GRANT_SUDO=yes --name psy_spark jupyter/all-spark-notebook

Jupyter Notebook listens on port 8888 by default. Map it to whichever host port you want to use.
I mapped it to 9999 because another container of mine was already using port 8888.

Open Jupyter Notebook in a browser at your server IP or Docker IP, on port 9999.

### Review of the last session: ch1 (let's present as early as possible :) )

## Building a tiny social network

### Import the GraphX and RDD modules

In [1]:
import org.apache.spark.graphx._

In [2]:
import org.apache.spark.rdd.RDD

### Loading the data

    @ dataset : two CSV files   
    ./data/people.csv  
    ./data/links.csv  

In [3]:
val people = sc.textFile("./data/people.csv")

In [4]:
people: org.apache.spark.rdd.RDD[String]

MapPartitionsRDD[2] at textFile at <console>:18

In [5]:
val links = sc.textFile("./data/links.csv")

In [6]:
links : org.apache.spark.rdd.RDD[String]

MapPartitionsRDD[4] at textFile at <console>:18

### The property graph

### Transforming RDDs to VertexRDD and EdgeRDD

### Define the Person class used to build the graph

In [7]:
case class Person(name: String, age: Int)

In [8]:
val peopleRDD: RDD[(VertexId, Person)] = people map { line =>
    val row = line split ','
    (row(0).toInt, Person(row(1), row(2).toInt))
}

In [9]:
type Connection = String

In [10]:
val linksRDD: RDD[Edge[Connection]] = links map {line =>
         val row = line split ','
         Edge(row(0).toInt, row(1).toInt, row(2))
       }

In [11]:
val tinySocial: Graph[Person, Connection] =
       Graph(peopleRDD, linksRDD)

### Introducing graph operations

In [13]:
tinySocial.vertices.collect()

In [14]:
tinySocial.edges.collect()

Array(Edge(1,2,friend), Edge(1,3,sister), Edge(2,4,brother), Edge(3,2,boss), Edge(4,5,client), Edge(1,9,friend), Edge(6,7,cousin), Edge(7,9,coworker), Edge(8,9,father))

In [15]:
val profLinks: List[Connection] = List("coworker", "boss", "employee","client", "supplier")

In [16]:
val profNetwork =
   tinySocial.edges.filter { case Edge(_, _, link) => profLinks.contains(link) }

for {
   Edge(src, dst, link) <- profNetwork.collect()
   srcName = peopleRDD.filter { case (id, person) => id == src }.first()._2.name
   dstName = peopleRDD.filter { case (id, person) => id == dst }.first()._2.name
} println(srcName + " is a " + link + " of " + dstName)

Charlie is a boss of Bob
Dave is a client of Eve
George is a coworker of Ivy


An equivalent expression:

```    
    tinySocial.subgraph(profLinks contains _.attr).  
        triplets.foreach(t => println(t.srcAttr.name + " is a " + t.attr + " of " + t.dstAttr.name))  
```

Result - triplet

```
    Triplet view   
    EdgeTriplet -- 3-tuple  
    ((VertexId, Person),(VertexId, Person),(Connection))  
```

# Chapter 2: Building and Exploring Graphs

- map their components to vertices or nodes  
- map the interactions between the individual components to edges or links  
- how graphs are stored and represented in GraphX  
- language of graph theory, and the basic characteristics of graphs  

### In this chapter

    Let's load data in various formats and build several kinds of graphs

• Load data and build Spark graphs in many ways   
• Use the join operator to mix external data into existing graphs  
• Build bipartite graphs and multigraphs  
• Explore graphs and compute their basic statistics  

## Network datasets : real-world datasets

    The previous chapter used a toy example; this chapter works with three real datasets!

* e-mail communication networks
* food flavor network
* social ego networks

## Dataset 1: The communication network

email communication graph : history of e-mails

The original dataset was released by William Cohen at CMU and can be downloaded from https://www.cs.cmu.edu/~./enron/. A detailed description of the complete dataset was given by Klimt and Yang, 2004. A cleaner version of the dataset, which we use here, is provided by Leskovec et al., 2009, and can be obtained from https://snap.stanford.edu/data/email-Enron.html.

The network is a canonical example of a directed graph, as each e-mail links a source node to a destination node.

The underlying corpus is a database of e-mails generated by 158 employees of the Enron Corporation.

## Dataset 2: Flavor networks

The ingredient-compound network was introduced in the following paper.

Ingredient-compound network, introduced by Ahn et al., 2011  (http://yongyeol.com/)  
http://yongyeol.com/papers/ahn-flavornet-2011.pdf  
http://www.nature.com/articles/srep00196  

The flavor network is also derived from the ingredient-compound network.

The flavor network can also help food scientists or amateur cooks create new recipes.   
The datasets that we will use consist of ingredient-compound data and the recipes collected from http://www.epicurious.com/, allrecipes.com, and http://www.menupan.com/.   
The datasets are available at http://yongyeol.com/2011/12/15/paper-flavor-network.html.  

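The two node types are food ingredients and chemical compounds.
* ingredient-compound network: an ingredient is linked to a compound when the compound is present in the ingredient. (Covered in this chapter.)
* flavor network: a pair of ingredients is linked when the two share at least one chemical compound. (Covered in chapter 4.)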

![](http://www.nature.com/article-assets/npg/srep/2011/111215/srep00196/images_hires/w926/srep00196-f1.jpg)

![](http://www.nature.com/article-assets/npg/srep/2011/111215/srep00196/images_hires/m685/srep00196-f2.jpg)

## Dataset 3: Social ego networks

collection of social ego networks from Google+  
collected by (McAuley and Leskovec, 2012)

The dataset includes the user profiles, their circles, and their ego networks and can be downloaded from Stanford's SNAP project website at http://snap.stanford.edu/data/egonets-Gplus.html

# Graph builders

GraphX provides four functions for building a property graph.

## 1. The Graph factory method

The first one is the Graph factory method that we have already seen in the previous chapter. It is defined in the apply method of the companion object called Graph, which is as follows:

    def apply[VD, ED](
            vertices: RDD[(VertexId, VD)],
            edges: RDD[Edge[ED]],
            defaultVertexAttr: VD = null)
            : Graph[VD, ED]

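It takes two RDD collections, RDD[(VertexId, VD)] and RDD[Edge[ED]], as parameters and builds a Graph[VD, ED] that uses them as its vertices and edges, respectively. Vertices that are referenced by an edge but missing from the vertex RDD receive defaultVertexAttr.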

## 2. edgeListFile

Another common case is a raw dataset that describes only edges. For this case GraphX provides the GraphLoader.edgeListFile function, which is defined in GraphLoader.

    def edgeListFile(
             sc: SparkContext,
             path: String,
             canonicalOrientation: Boolean = false,
             minEdgePartitions: Int = 1)
             : Graph[Int, Int]

It takes the path to a file containing a list of edges, where each line represents one edge of the graph as two integers: a source ID and a destination ID.
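
To make the file format concrete, here is a minimal plain-Scala sketch (no Spark required) of how such an edge-list file is interpreted. It assumes the documented GraphLoader behaviour: lines starting with `#` are skipped, fields are separated by whitespace, and every edge attribute is set to 1. The names `sampleLines` and `parseEdgeList` are hypothetical and exist only for this illustration.

```scala
// Hypothetical sample of an edge-list file, one string per line.
val sampleLines: Seq[String] = Seq(
  "# FromNodeId\tToNodeId", // comment lines start with '#'
  "0\t1",
  "1\t0",
  "3 2"
)

// Returns (srcId, dstId, attr) triples; every edge attribute is 1.
def parseEdgeList(lines: Seq[String], canonicalOrientation: Boolean = false): Seq[(Long, Long, Int)] = {
  lines.
    filter(line => line.nonEmpty && !line.startsWith("#")).
    map { line =>
      val fields = line.split("\\s+")
      val src = fields(0).toLong
      val dst = fields(1).toLong
      // With canonicalOrientation = true, edges are flipped so that srcId < dstId.
      if (canonicalOrientation && src > dst) (dst, src, 1) else (src, dst, 1)
    }
}

val edgesAsRead = parseEdgeList(sampleLines)
val edgesCanonical = parseEdgeList(sampleLines, canonicalOrientation = true)
```

With the default setting, both directions of a bidirectional pair such as `0 -> 1` and `1 -> 0` are kept as separate directed edges.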

## 3. fromEdges

Similar to GraphLoader.edgeListFile, the third function, Graph.fromEdges, builds a graph from an RDD[Edge[ED]]. The vertices are created from the vertex IDs referenced in the edge RDD, and each one receives defaultValue as its attribute.

    def fromEdges[VD, ED](
             edges: RDD[Edge[ED]],
             defaultValue: VD)
       : Graph[VD, ED]
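
The way vertices are derived from edges can be sketched in plain Scala, without Spark. `SimpleEdge` and `verticesFromEdges` are hypothetical stand-ins for illustration only, not GraphX types.

```scala
// Hypothetical stand-in for org.apache.spark.graphx.Edge.
case class SimpleEdge[ED](srcId: Long, dstId: Long, attr: ED)

// Every ID that appears as a source or destination becomes a vertex
// carrying the same default attribute.
def verticesFromEdges[VD, ED](edges: Seq[SimpleEdge[ED]], defaultValue: VD): Map[Long, VD] = {
  edges.
    flatMap(e => Seq(e.srcId, e.dstId)).
    distinct.
    map(id => id -> defaultValue).
    toMap
}

val sampleEdges = Seq(SimpleEdge(1L, 2L, "friend"), SimpleEdge(2L, 3L, "boss"))
val derivedVertices = verticesFromEdges(sampleEdges, defaultValue = 1)
```

Vertex 2 appears in both edges but is created only once.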

## 4. fromEdgeTuples

The last graph builder function, Graph.fromEdgeTuples, builds a graph from nothing but an RDD of edge tuples, that is, an RDD[(VertexId, VertexId)].

    def fromEdgeTuples[VD](
             rawEdges: RDD[(VertexId, VertexId)],
             defaultValue: VD,
             uniqueEdges: Option[PartitionStrategy] = None)
       : Graph[VD, Int]
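
The effect of the uniqueEdges parameter can be sketched in plain Scala. By default every edge gets the attribute 1 and duplicates stay separate (a multigraph); when a partition strategy is supplied, identical edges are merged and their attributes summed. `edgesFromTuples` and its `mergeDuplicates` flag are hypothetical names that only mimic this behaviour.

```scala
// Hypothetical helper: returns (srcId, dstId, attr) triples.
def edgesFromTuples(rawEdges: Seq[(Long, Long)], mergeDuplicates: Boolean): Seq[(Long, Long, Int)] = {
  if (mergeDuplicates) {
    // Identical (src, dst) pairs collapse into one edge whose attribute is their count.
    rawEdges.
      groupBy(identity).
      map { case ((src, dst), group) => (src, dst, group.size) }.
      toSeq.
      sorted
  } else {
    // Each tuple becomes its own edge with attribute 1.
    rawEdges.map { case (src, dst) => (src, dst, 1) }
  }
}

val rawEdges = Seq((1L, 2L), (1L, 2L), (2L, 3L))
val multigraphEdges = edgesFromTuples(rawEdges, mergeDuplicates = false)
val mergedEdges = edgesFromTuples(rawEdges, mergeDuplicates = true)
```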

# Building graphs

Now let's build the following three kinds of graphs!  
1. a directed email communication network  
2. a bipartite graph of ingredient-compound connections  
3. a multigraph using the previous graph builders.  

# 1: the Enron email network

### Building directed graphs

In [24]:
import org.apache.spark.graphx._
import org.apache.spark.rdd._

Load the dataset. The file is an adjacency list of the e-mail communications between employees.

Pass the file path to the GraphLoader.edgeListFile method.

In [78]:
val emailGraph = GraphLoader.edgeListFile(sc, "./data/emailEnron.txt")

In [77]:
emailGraph

org.apache.spark.graphx.impl.GraphImpl@614894ac

The GraphLoader.edgeListFile method always returns a graph object whose vertex and edge attributes are of type Int. Let's check the first five vertices and edges of the graph with take(5).

In [79]:
emailGraph.vertices.take(5)

Array((18624,1), (32196,1), (32432,1), (9166,1), (7608,1))

In [80]:
emailGraph.edges.take(5)

Array(Edge(0,1,1), Edge(1,0,1), Edge(1,2,1), Edge(1,3,1), Edge(1,4,1))

In GraphX, every edge must be directed. To represent an undirected or bidirectional graph, each connected pair can therefore be linked in both directions. We can verify that node 19021 has both incoming and outgoing links. First, collect the destination nodes that 19021 communicates with.

In [28]:
emailGraph.edges.filter(_.srcId == 19021).map(_.dstId).collect()

Array(696, 4232, 6811, 8315, 26007)

The same nodes are also the source nodes of the incoming edges of 19021.

In [29]:
emailGraph.edges.filter(_.dstId == 19021).map(_.srcId).collect()

Array(696, 4232, 6811, 8315, 26007)

### To follow each example through to the end, the order differs slightly from the book!
From page 30

### Computing the degrees of the network nodes

### In-degree and out-degree of the Enron email network

For the Enron email network, we can confirm that there are about ten times more links than nodes.

In [30]:
emailGraph.numEdges

367662

In [31]:
emailGraph.numVertices

36692

Because the email graph is bi-directed, each employee's in-degree and out-degree are exactly the same; the average degrees are consistent with this.

In [32]:
emailGraph.inDegrees.map(_._2).sum / emailGraph.numVertices

10.020222391802028

In [33]:
emailGraph.outDegrees.map(_._2).sum / emailGraph.numVertices

10.020222391802028
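
Both averages can be reproduced with plain arithmetic, because every directed edge contributes exactly one in-degree and one out-degree. The sketch below only reuses the two counts printed above.

```scala
// Counts reported by emailGraph.numEdges and emailGraph.numVertices above.
val numEdges = 367662L
val numVertices = 36692L

// Sum of in-degrees = sum of out-degrees = number of edges.
val averageDegree = numEdges.toDouble / numVertices
```

Note that the two averages coincide for any directed graph, so on their own they do not prove that the graph is bi-directed. The per-node check on 19021 is the more direct evidence.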

To find the person who e-mailed the largest number of people, we can define and use the following max function.

In [34]:
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
     if (a._2 > b._2) a else b
}

In [35]:
emailGraph.outDegrees.reduce(max)

(5038,1383)

This person probably acts as a hub. Similarly, a min function would find the person with the fewest contacts.  
To look for isolated groups:

In [36]:
emailGraph.outDegrees.filter(_._2 <= 1).count

11211

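Many employees receive e-mails from only one person, probably their boss or the human resources department. (The filter uses out-degree, which equals in-degree in this bi-directed graph.)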

# 2. Flavor network : food ingredient-compound network.  
#### Datasets: ingr_info.tsv, comp_info.tsv, ingr_comp.tsv  

### Building a bipartite graph

In some cases it is useful to view a system as a bipartite graph. A bipartite graph has two sets of nodes. Nodes within the same set cannot be connected to each other; a node can only be paired with nodes in the other set. The ingredient-compound network is one example.

    food ingredient and chemical compound are the two node sets
    * ingredient-compound network: an ingredient and a compound are linked when the compound is present in the ingredient.
    * flavor network: a pair of ingredients is linked when the two share a chemical compound.

#### This chapter builds the ingredient-compound network; chapter 4 builds the flavor network from the ingredient-compound network.

In [37]:
import scala.io.Source

The first file contains information about the food ingredients, and the second about the compounds.

In [87]:
Source.fromFile("./data/ingr_info.tsv").getLines().
      take(7).foreach(println)

# id	ingredient name	category
0	magnolia_tripetala	flower
1	calyptranthes_parriculata	plant
2	chamaecyparis_pisifera_oil	plant derivative
3	mackerel	fish/seafood
4	mimusops_elengi_flower	flower
5	hyssop	herb


In [88]:
Source.fromFile("./data/comp_info.tsv").getLines().
take(7).foreach(println)

# id	Compound name	CAS number
0	jasmone	488-10-8
1	5-methylhexanoic_acid	628-46-6
2	l-glutamine	56-85-9
3	1-methyl-3-methoxy-4-isopropylbenzene	1076-56-8
4	methyl-3-phenylpropionate	103-25-3
5	3-mercapto-2-methylpentan-1-ol_(racemic)	227456-27-1


The third file contains the adjacency list between ingredients and compounds.

In [89]:
Source.fromFile("./data/ingr_comp.tsv").getLines().
take(7).foreach(println)

# ingredient id	compound id
1392	906
1259	861
1079	673
22	906
103	906
1005	906


To build the bipartite graph, we define two case classes, Ingredient and Compound, and use Scala inheritance so that both are children of an FNNode class.

In [90]:
class FNNode(val name: String)

In [91]:
case class Ingredient(override val name: String, category: String) extends FNNode(name)

In [92]:
case class Compound(override val name: String, cas: String) extends FNNode(name)

Next, we load the Ingredient and Compound objects into RDD[(VertexId, FNNode)] collections. This part requires some data wrangling.

In [93]:
val ingredients: RDD[(VertexId, FNNode)] =
sc.textFile("./data/ingr_info.tsv").
     filter(! _.startsWith("#")).
     map {line =>
            val row = line split '\t'
            (row(0).toInt, Ingredient(row(1), row(2)))
         }

In [94]:
val compounds: RDD[(VertexId, FNNode)] =
   sc.textFile("./data/comp_info.tsv").
         filter(! _.startsWith("#")).
         map {line =>
                val row = line split '\t'
                (10000L + row(0).toInt, Compound(row(1), row(2)))
             }

In [95]:
val links: RDD[Edge[Int]] =
     sc.textFile("./data/ingr_comp.tsv").
        filter(! _.startsWith("#")).
        map {line =>
           val row = line split '\t'
           Edge(row(0).toInt, 10000L + row(1).toInt, 1)
        }

Concatenate the two sets of nodes into a single RDD and pass it, together with the links RDD, to the Graph() factory method.

In [96]:
val nodes = ingredients ++ compounds

In [97]:
val foodNetwork = Graph(nodes, links)
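
The 10000L offset added to the compound IDs in the cells above matters because both files number their rows from 0, while a graph needs one unique ID per vertex. The plain-Scala sketch below uses a few hypothetical IDs taken from the file previews. It assumes that all ingredient IDs stay below 10000, which the previews suggest but do not prove.

```scala
// Hypothetical samples: both files start counting at 0.
val ingredientIds = Seq(0L, 1L, 2L)
val compoundIds = Seq(0L, 1L, 2L)

// Shift compound IDs into a separate range, as the compounds and links cells do.
val offset = 10000L
val shiftedCompoundIds = compoundIds.map(_ + offset)

val clashesWithoutOffset = ingredientIds.intersect(compoundIds)
val clashesWithOffset = ingredientIds.intersect(shiftedCompoundIds)
```

Without the shift, ingredient 0 and compound 0 would collapse into a single vertex once the two RDDs are concatenated.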

### Now let's explore the ingredient-compound graph.

In [98]:
def showTriplet(t: EdgeTriplet[FNNode,Int]): String = "The ingredient " ++ t.srcAttr.name ++ " contains " ++ t.dstAttr.name

## Q1: Errors start here at foodNetwork, five cells in total. Show us the power of collective intelligence!!! by sejin

In [99]:
foodNetwork.triplets.take(5).foreach(showTriplet _ andThen println _)

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 2 in stage 81.0 failed 1 times, most recent failure: Lost task 2.0 in stage 81.0 (TID 114, localhost): java.io.InvalidClassException: $line191.$read$$iwC$$iwC$Ingredient; no valid constructor
	at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
	at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:768)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1772)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
	at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
	at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
	at org.apache.spark.serializer.DeserializationStream$$anon$
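
A likely cause of the `InvalidClassException ... no valid constructor` above: Java serialization requires the closest non-serializable superclass of a Serializable class to have an accessible no-argument constructor. The case classes Ingredient and Compound are Serializable, but their parent FNNode is not, and it only has a one-argument constructor. A candidate fix, sketched below in plain Scala without Spark, is to make FNNode extend Serializable. The `roundTrip` helper is hypothetical and only imitates what Spark's Java serializer does with each object.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Candidate fix: the parent class is Serializable as well.
class FNNode(val name: String) extends Serializable
case class Ingredient(override val name: String, category: String) extends FNNode(name)
case class Compound(override val name: String, cas: String) extends FNNode(name)

// Hypothetical helper: write an object with Java serialization and read it back.
def roundTrip[T <: AnyRef](value: T): T = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(value)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  try in.readObject().asInstanceOf[T] finally in.close()
}

val restored = roundTrip(Ingredient("black_tea", "plant derivative"))
```

In the notebook, the cells that define the case classes, the RDDs, and foodNetwork would need to be re-run after changing FNNode, since they still refer to the old class definition.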

### Computing the degrees of the network nodes

### Degrees in the bipartite food network

For the bipartite ingredient-compound graph, we can find the food with the largest number of compounds, or the compound that occurs most frequently in foods.

In [55]:
foodNetwork.outDegrees.reduce(max)

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 3 in stage 46.0 failed 1 times, most recent failure: Lost task 3.0 in stage 46.0 (TID 67, localhost): java.io.InvalidClassException: $line84.$read$$iwC$$iwC$Ingredient; no valid constructor
	at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
	at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:768)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1772)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
	at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
	at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
	at org.apache.spark.serializer.DeserializationStream$$anon$2.

res: (org.apache.spark.graphx.VertexId, Int) = (908,239)

In [56]:
foodNetwork.vertices.filter(_._1 == 908).collect()

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 3 in stage 51.0 failed 1 times, most recent failure: Lost task 3.0 in stage 51.0 (TID 71, localhost): java.io.InvalidClassException: $line84.$read$$iwC$$iwC$Ingredient; no valid constructor
	at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
	at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:768)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1772)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
	at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
	at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
	at org.apache.spark.serializer.DeserializationStream$$anon$2.

res: Array[(org.apache.spark.graphx.VertexId, FNNode)] =
Array((908,Ingredient(black_tea,plant derivative)))

In [57]:
foodNetwork.inDegrees.reduce(max)

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 0 in stage 54.0 failed 1 times, most recent failure: Lost task 0.0 in stage 54.0 (TID 72, localhost): java.io.InvalidClassException: $line84.$read$$iwC$$iwC$Ingredient; no valid constructor
	at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
	at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:768)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1772)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
	at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
	at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
	at org.apache.spark.serializer.DeserializationStream$$anon$2.

res: (org.apache.spark.graphx.VertexId, Int) = (10292,299)

In [58]:
foodNetwork.vertices.filter(_._1 == 10292).collect()

Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 0 in stage 59.0 failed 1 times, most recent failure: Lost task 0.0 in stage 59.0 (TID 76, localhost): java.io.InvalidClassException: $line84.$read$$iwC$$iwC$Ingredient; no valid constructor
	at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150)
	at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:768)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1772)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
	at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
	at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
	at org.apache.spark.serializer.DeserializationStream$$anon$2.

res: Array[(org.apache.spark.graphx.VertexId, FNNode)] =
Array((10292,Compound(1-octanol,111-87-5)))

The food with the largest number of compounds is black tea, and the most frequently occurring compound is 1-octanol.

### The code from here on runs normally

# 3. Building a weighted social ego network

• ego.edges: These are the directed edges in the ego network. The ego node does not appear in this list, but it is assumed that it follows every node ID that appears in the file.  
• ego.feat: These are the features for each of the nodes that appear in the edge file.  
• ego.featnames: These are the names of the feature dimensions. A feature is 1 if the user has this property in their profile, and 0 otherwise.  

In [59]:
import org.apache.spark.graphx._
import org.apache.spark.rdd._
import breeze.linalg.SparseVector
import scala.io.Source
import scala.math.abs

In [60]:
type Feature = breeze.linalg.SparseVector[Int]

In [61]:
val featureMap: Map[Long, Feature] =
     Source.fromFile("./data/ego.feat").
        getLines().
        map{line =>
        val row = line split ' '
        val key = abs(row.head.hashCode.toLong)
        val feat = SparseVector(row.tail.map(_.toInt))
        (key, feat)
        }.toMap


In [62]:
val edges: RDD[Edge[Int]] =
     sc.textFile("./data/ego.edges").
        map {line =>
           val row = line split ' '
           val srcId = abs(row(0).hashCode.toLong)
           val dstId = abs(row(1).hashCode.toLong)
           val srcFeat = featureMap(srcId)
           val dstFeat = featureMap(dstId)
           val numCommonFeats = srcFeat dot dstFeat
           Edge(srcId, dstId, numCommonFeats)
}
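
Since every feature is either 0 or 1, the dot product used for numCommonFeats simply counts the features that both users have. A minimal plain-Scala sketch with dense arrays instead of breeze vectors follows; `commonFeatures`, `featA`, and `featB` are hypothetical names.

```scala
// Hypothetical helper: dot product of two equally long 0/1 feature vectors.
def commonFeatures(a: Array[Int], b: Array[Int]): Int = {
  a.zip(b).map { case (x, y) => x * y }.sum
}

val featA = Array(1, 0, 1, 1, 0)
val featB = Array(1, 1, 0, 1, 0)

// Only positions 0 and 3 are set in both vectors.
val sharedCount = commonFeatures(featA, featB)
```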

In [63]:
val egoNetwork: Graph[Int,Int] = Graph.fromEdges(edges, 1)

In [64]:
egoNetwork.edges.filter(_.attr == 3).count()

1852

In [65]:
egoNetwork.edges.filter(_.attr == 2).count()

9353

In [66]:
egoNetwork.edges.filter(_.attr == 1).count()

107934

### Computing the degrees of the network nodes

### Degree histogram of the social ego networks

Similarly, we can compute the degrees of the connections in the ego network. Let's look at the maximum and minimum degrees in the network:

In [67]:
egoNetwork.degrees.reduce(max)

(1643293729,1084)

In [70]:
egoNetwork.degrees.reduce(min)

Name: Compile Error
Message: <console>:50: error: not found: value min
              egoNetwork.degrees.reduce(min)
                                        ^
StackTrace: 

#### Q2: max works, so why does min raise an error...

(550756674,1)
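
A likely answer to Q2: max works only because it was defined by hand in cell In [34] of the Enron section, whereas no min function was ever defined in this notebook, hence `not found: value min`. A minimal sketch of the missing function, mirroring max, is shown below. The `VertexId` alias and `sampleDegrees` are included only so that the block runs without Spark.

```scala
// Same alias that GraphX defines; omit this line in the notebook.
type VertexId = Long

// Keeps the (vertexId, degree) pair with the smaller degree.
def min(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 <= b._2) a else b
}

// Hypothetical (vertexId, degree) pairs standing in for egoNetwork.degrees.
val sampleDegrees: Seq[(VertexId, Int)] = Seq((10L, 3), (20L, 1), (30L, 7))
val lowest = sampleDegrees.reduce(min)
```

With this definition in scope, `egoNetwork.degrees.reduce(min)` should compile in the notebook.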

In [69]:
egoNetwork.degrees.
     map(t => (t._2,t._1)).
     groupByKey.map(t => (t._1,t._2.size)).
     sortBy(_._1).collect()

Array((1,15), (2,19), (3,12), (4,17), (5,11), (6,19), (7,14), (8,9), (9,8), (10,10), (11,1), (12,9), (13,6), (14,7), (15,8), (16,6), (17,5), (18,5), (19,7), (20,6), (21,8), (22,5), (23,8), (24,1), (25,2), (26,5), (27,8), (28,4), (29,6), (30,7), (31,5), (32,10), (33,6), (34,10), (35,5), (36,9), (37,7), (38,8), (39,5), (40,4), (41,3), (42,1), (43,3), (44,5), (45,7), (46,6), (47,3), (48,6), (49,1), (50,9), (51,5), (52,8), (53,8), (54,4), (55,2), (56,5), (57,7), (58,4), (59,8), (60,9), (61,12), (62,5), (63,15), (64,5), (65,7), (66,6), (67,9), (68,4), (69,5), (70,4), (71,7), (72,9), (73,10), (74,2), (75,6), (76,7), (77,10), (78,7), (79,9), (80,5), (81,3), (82,4), (83,7), (84,7), (85,4), (86,6), (87,6), (88,10), (89,4), (90,6), (91,3), (92,4), (93,7), (94,4), (95,6)...
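
Each pair in the output above reads as (degree, number of nodes with that degree), so (1,15) means that 15 nodes have degree 1. The same grouping can be sketched on a plain Scala collection; the `degrees` sample is hypothetical.

```scala
// Hypothetical (vertexId, degree) pairs standing in for egoNetwork.degrees.
val degrees: Seq[(Long, Int)] = Seq((1L, 2), (2L, 1), (3L, 2), (4L, 3), (5L, 2))

// Swap to (degree, vertexId), group by degree, count the group sizes, sort by degree.
val histogram: Seq[(Int, Int)] =
  degrees.
    map { case (vertexId, degree) => (degree, vertexId) }.
    groupBy(_._1).
    map { case (degree, group) => (degree, group.size) }.
    toSeq.
    sortBy(_._1)
```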

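In this chapter, we studied several ways to build graphs in Spark,   
using three real datasets from online social networks, food science, and e-mail communications.   
Building a graph takes effort in data preparation and wrangling, but 
GraphX provides several graph builder functions to match the graph representation we want and the shape of the dataset, which sets it apart from other graph-processing frameworks. We also looked at basic statistics and graph properties, which are useful for characterizing a graph's structure and understanding its representation. 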

In the next chapter, we go deeper into graph analysis: we use data visualization tools and cover new graph-theory concepts and algorithms such as connectedness, triangle counting, and PageRank!