# ***Exercise 57 - GraphFrames***

Input:

- The textual file vertexes.csv
    - It contains the vertexes of a graph
    

- Each vertex is characterized by
    - id (string): user identifier
    - name (string): user name
    - age (integer): user age

- The textual file edges.csv
    - It contains the edges of a graph
    

- Each edge is characterized by
    - src (string): source vertex
    - dst (string): destination vertex
    - linktype (string): “follow” or “friend”

Output:

- Select the users who can reach user u1 in fewer than 3 hops (i.e., by following at most two edges)
    - Do not consider u1 itself
    

- For each of the selected users, store in the output folder the user's name and the minimum number of hops needed to reach user u1
    - One user per line
    - Format: user name, #hops to user u1
    

- Use the CSV format to store the result

In [1]:
# GraphFrame builds a graph from a vertex DataFrame and an edge DataFrame
from graphframes import GraphFrame

inputPathVertexes = "/data/students/bigdata-01QYD/ex_data/Ex57/data/vertexes.csv"
inputPathEdges = "/data/students/bigdata-01QYD/ex_data/Ex57/data/edges.csv"
outputPath = "resOut_ex57/"

In [2]:
# Read the content of vertexes.csv (header row present, column types inferred)
vDF = spark.read.load(inputPathVertexes,
                      format="csv",
                      header=True,
                      inferSchema=True)

vDF.printSchema()
vDF.show()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

+---+-----+---+
| id| name|age|
+---+-----+---+
| u1|Alice| 34|
| u2|  Bob| 36|
| u3| John| 30|
| u4|David| 29|
| u5| Paul| 32|
| u6| Adel| 36|
| u7| Eddy| 60|
+---+-----+---+



In [3]:
# Read the content of edges.csv (header row present, column types inferred)
eDF = spark.read.load(inputPathEdges,
                      format="csv",
                      header=True,
                      inferSchema=True)

eDF.printSchema()
eDF.show()

root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- linktype: string (nullable = true)

+---+---+--------+
|src|dst|linktype|
+---+---+--------+
| u1| u2|  friend|
| u1| u4|  friend|
| u1| u5|  friend|
| u2| u1|  friend|
| u2| u3|  follow|
| u3| u2|  follow|
| u4| u1|  friend|
| u4| u5|  friend|
| u5| u1|  friend|
| u5| u4|  friend|
| u5| u6|  follow|
| u6| u3|  follow|
+---+---+--------+



In [4]:
# Build the graph: vDF must have an "id" column, eDF "src" and "dst" columns
g = GraphFrame(vDF, eDF)

In [7]:
# For each vertex, length (in edges) of the shortest directed path to landmark u1
shortestPath = g.shortestPaths(["u1"])
shortestPath.show()

+---+-----+---+---------+
| id| name|age|distances|
+---+-----+---+---------+
| u6| Adel| 36|[u1 -> 3]|
| u3| John| 30|[u1 -> 2]|
| u2|  Bob| 36|[u1 -> 1]|
| u4|David| 29|[u1 -> 1]|
| u5| Paul| 32|[u1 -> 1]|
| u1|Alice| 34|[u1 -> 0]|
| u7| Eddy| 60|       []|
+---+-----+---+---------+
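
As a sanity check on the distances above, the same hop counts can be reproduced without Spark by running a breadth-first search from u1 over the reversed edges. The sketch below is not part of the original solution: the `hops_to` helper is hypothetical, and the hard-coded edge list is copied by hand from the `eDF.show()` output.

```python
from collections import defaultdict, deque

# Edge list copied from the eDF.show() output (linktype omitted)
EDGES = [
    ("u1", "u2"), ("u1", "u4"), ("u1", "u5"),
    ("u2", "u1"), ("u2", "u3"), ("u3", "u2"),
    ("u4", "u1"), ("u4", "u5"), ("u5", "u1"),
    ("u5", "u4"), ("u5", "u6"), ("u6", "u3"),
]


def hops_to(target, edges):
    """Return {vertex: minimum number of directed edges from vertex to target}.

    Hypothetical helper: BFS from the target over reversed edges, so that
    stepping from dst back to src corresponds to src reaching dst.
    Vertices that cannot reach the target are absent from the result.
    """
    predecessors = defaultdict(list)
    for src, dst in edges:
        predecessors[dst].append(src)

    dist = {target: 0}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for prev in predecessors[current]:
            if prev not in dist:
                dist[prev] = dist[current] + 1
                queue.append(prev)
    return dist


distances = hops_to("u1", EDGES)
# Users other than u1 that reach u1 in fewer than 3 hops
selected = {v: d for v, d in distances.items() if 0 < d < 3}
print(sorted(selected.items()))
# prints [('u2', 1), ('u3', 2), ('u4', 1), ('u5', 1)]
```

u7 has no edges, so it never enters the result, which mirrors the empty `distances` map that `shortestPaths` reports for it.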



In [8]:
# Keep distances 1 and 2: "> 0" drops u1 itself; unreachable users yield NULL and are dropped too
spLess3 = shortestPath.filter("distances.u1 < 3 AND distances.u1 > 0")

In [9]:
spLess3.show()

+---+-----+---+---------+
| id| name|age|distances|
+---+-----+---+---------+
| u3| John| 30|[u1 -> 2]|
| u2|  Bob| 36|[u1 -> 1]|
| u4|David| 29|[u1 -> 1]|
| u5| Paul| 32|[u1 -> 1]|
+---+-----+---+---------+
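
The exercise also asks for the result to be stored in CSV format, one user per line, with the user name and the hop count. A minimal sketch of that last step follows, assuming `spLess3` and `outputPath` as defined in the cells above. The `save_hops` helper, the `numHops` column name, and the decision to write a header row are assumptions and do not come from the original notebook.

```python
def save_hops(df, output_path, landmark="u1", header=True):
    """Write (name, hops to landmark) as CSV, one user per line.

    Hypothetical helper. `df` is expected to be a Spark DataFrame with a
    `name` column and a `distances` map column, such as spLess3.
    """
    # Extract the hop count for the landmark from the distances map
    result = df.selectExpr("name", f"distances['{landmark}'] AS numHops")
    # Spark writes a folder of part files at output_path
    result.write.csv(output_path, header=header)
    return result
```

In the notebook this would be invoked as `save_hops(spLess3, outputPath)`. With the data shown, the output folder should hold four data rows (John with 2 hops; Bob, David, and Paul with 1 hop each), in no guaranteed order.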

