# **Exercise 54 - GraphFrames**

Input:


- The text file vertexes.csv
    - It contains the vertexes of a graph
    

- Each vertex is characterized by
    - id (string): user identifier
    - name (string): user name
    - age (integer): user age



- The text file edges.csv
    - It contains the edges of a graph
    

- Each edge is characterized by
    - src (string): source vertex
    - dst (string): destination vertex
    - linktype (string): “follow” or “friend”
    
   
Output:


- The pairs of users Ux, Uy such that
    - Ux is a friend of Uy (link “friend” from Ux to Uy)
    - Uy is not a friend of Ux (no link “friend” from Uy to Ux)


- One pair Ux,Uy per line


- Format: idUx, idUy


- Use the CSV format to store the result

In [16]:
from graphframes import GraphFrame

inputPathVertexes = "/data/students/bigdata-01QYD/ex_data/Ex54/data/vertexes.csv"
inputPathEdges = "/data/students/bigdata-01QYD/ex_data/Ex54/data/edges.csv"
outputPath = "resOut_ex54/"

In [2]:
# Read the content of vertexes.csv (header row present, column types inferred)
vDF = spark.read.load(inputPathVertexes,
                      format="csv",
                      header=True,
                      inferSchema=True)

vDF.printSchema()
vDF.show()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

+---+-----+---+
| id| name|age|
+---+-----+---+
| u1|Alice| 34|
| u2|  Bob| 36|
| u3| John| 30|
| u4|David| 29|
| u5| Paul| 32|
| u6| Adel| 36|
| u7| Eddy| 60|
+---+-----+---+



In [5]:
# Read the content of edges.csv (header row present, column types inferred)
eDF = spark.read.load(inputPathEdges,
                      format="csv",
                      header=True,
                      inferSchema=True)

eDF.printSchema()
eDF.show()

root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- linktype: string (nullable = true)

+---+---+--------+
|src|dst|linktype|
+---+---+--------+
| u1| u2|  friend|
| u1| u5|  friend|
| u2| u3|  follow|
| u3| u2|  follow|
| u4| u1|  friend|
| u4| u5|  friend|
| u5| u1|  friend|
| u5| u4|  friend|
| u5| u6|  follow|
| u6| u3|  follow|
| u7| u6|  follow|
+---+---+--------+



In [6]:
# Keep only the "friend" edges: "follow" links are irrelevant for this exercise
filteredEdges = eDF.filter("linktype = 'friend'")
filteredEdges.show()

+---+---+--------+
|src|dst|linktype|
+---+---+--------+
| u1| u2|  friend|
| u1| u5|  friend|
| u4| u1|  friend|
| u4| u5|  friend|
| u5| u1|  friend|
| u5| u4|  friend|
+---+---+--------+



In [7]:
# Create the input graph
g = GraphFrame(vDF, filteredEdges)

In [9]:
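# We look for pairs (userX, userY) such that:
# - there is a "friend" edge userX -> userY
# - there is NO "friend" edge userY -> userX

# The graph contains only "friend" edges, so anonymous edges ([]) are enough.
# The "!" prefix negates a term: the edge usery -> userx must not exist.
selectedPaths = g.find("(userx)-[]->(usery);!(usery)-[]->(userx)")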

In [10]:
selectedPaths.printSchema()
selectedPaths.show()

root
 |-- userx: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: integer (nullable = true)
 |-- usery: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: integer (nullable = true)

+---------------+---------------+
|          userx|          usery|
+---------------+---------------+
|[u4, David, 29]|[u1, Alice, 34]|
|[u1, Alice, 34]|  [u2, Bob, 36]|
+---------------+---------------+
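
As a cross-check of the motif result, the same condition can be sketched in plain Python without Spark. This is only an illustrative sketch: the edge list is copied by hand from the `filteredEdges` output above, and the function name `one_way_friend_pairs` is hypothetical, not part of GraphFrames.

```python
# Friend edges copied by hand from the filteredEdges table shown earlier.
friend_edges = [
    ("u1", "u2"), ("u1", "u5"),
    ("u4", "u1"), ("u4", "u5"),
    ("u5", "u1"), ("u5", "u4"),
]


def one_way_friend_pairs(edges):
    """Return the sorted (src, dst) pairs whose reverse edge (dst, src) is absent."""
    edge_set = set(edges)
    return sorted(
        (src, dst) for src, dst in edge_set if (dst, src) not in edge_set
    )


pairs = one_way_friend_pairs(friend_edges)
print(pairs)  # [('u1', 'u2'), ('u4', 'u1')]
```

The two pairs agree with the rows returned by `g.find(...)`: u1 -> u2 and u4 -> u1 are the only friend links without a link in the opposite direction.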



In [12]:
# Select only the ids of the two users and rename the columns
SelectedPairsDF = selectedPaths\
.selectExpr("userx.id as IdFriend", "usery.id as IdNotFriend")

In [13]:
SelectedPairsDF.show()

+--------+-----------+
|IdFriend|IdNotFriend|
+--------+-----------+
|      u4|         u1|
|      u1|         u2|
+--------+-----------+



In [17]:
# Save the result in CSV format (one pair per line, with a header row)
SelectedPairsDF.write.csv(outputPath, header=True)
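
Spark writes the result as a directory of part files, and with `header=True` each part file starts with the column names. The plain-Python sketch below only mimics the expected text content for the two pairs shown above, with and without the header line. The helper `to_csv_text` is hypothetical, and the row order inside the real part files is not guaranteed to match.

```python
import csv
import io

# Pairs copied from the SelectedPairsDF output shown above.
pairs = [("u4", "u1"), ("u1", "u2")]


def to_csv_text(rows, header=None):
    """Serialize rows as CSV text, optionally preceded by a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


with_header = to_csv_text(pairs, header=("IdFriend", "IdNotFriend"))
without_header = to_csv_text(pairs)
print(with_header)
```

If the output must contain only `idUx,idUy` lines, dropping the `header=True` argument from the write call is one option, since the CSV writer does not emit a header by default.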