# **Exercise 55 - GraphFrames**

Input:

- The textual file vertexes.csv
    - It contains the vertexes of a graph

- Each vertex is characterized by
    - id (string): vertex identifier
    - entityName (string): “user” or “topic”
    - name (string): name of the entity


- The textual file edges.csv
    - It contains the edges of a graph

- Each edge is characterized by
    - src (string): source vertex
    - dst (string): destination vertex
    - linktype (string): “expertOf”, “follow”, or “correlated” (the sample data also contains “like”)


Output:

- The followed topics for each user

- One pair (user name, followed topic) per line

- Format: username, followed topic

- Use the CSV format to store the result

In [2]:
from graphframes import GraphFrame

inputPathVertexes = "/data/students/bigdata-01QYD/ex_data/Ex55/data/vertexes.csv"
inputPathEdges = "/data/students/bigdata-01QYD/ex_data/Ex55/data/edges.csv"
outputPath = "resOut_ex55/"

In [3]:
# Read the content of vertexes.csv
vDF = spark.read.load(inputPathVertexes,
                      format="csv",
                      header=True,
                      inferSchema=True)

vDF.printSchema()
vDF.show()

root
 |-- id: string (nullable = true)
 |-- entityName: string (nullable = true)
 |-- name: string (nullable = true)

+---+----------+--------+
| id|entityName|    name|
+---+----------+--------+
| V1|      user|   Paolo|
| V2|     topic|     SQL|
| V3|      user|   David|
| V4|     topic|Big Data|
| V5|      user|    John|
+---+----------+--------+



In [4]:
# Read the content of edges.csv
eDF = spark.read.load(inputPathEdges,
                      format="csv",
                      header=True,
                      inferSchema=True)

eDF.printSchema()
eDF.show()

root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- linktype: string (nullable = true)

+---+---+----------+
|src|dst|  linktype|
+---+---+----------+
| V1| V2|      like|
| V1| V3|    follow|
| V1| V4|    follow|
| V3| V2|    follow|
| V3| V4|    follow|
| V5| V2|  expertOf|
| V2| V4|correlated|
| V4| V2|correlated|
+---+---+----------+



In [5]:
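# We look for paths of the form:
# - user -> follow -> topic
# Keep only the "follow" edges here; the entity types of the two
# vertices are checked later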

filteredEdges = eDF.filter("linktype = 'follow'")
filteredEdges.show()

+---+---+--------+
|src|dst|linktype|
+---+---+--------+
| V1| V3|  follow|
| V1| V4|  follow|
| V3| V2|  follow|
| V3| V4|  follow|
+---+---+--------+



In [7]:
# Create the input graph from all the vertices and only the "follow" edges
g = GraphFrame(vDF, filteredEdges)

In [14]:
# Find every pair of vertices connected by one edge
# (only "follow" edges are left in the graph)
selectedPaths = g.find("(v1)-[]->(v2)")

In [16]:
selectedPaths.show(truncate=False)

+-----------------+---------------------+
|v1               |v2                   |
+-----------------+---------------------+
|[V1, user, Paolo]|[V3, user, David]    |
|[V1, user, Paolo]|[V4, topic, Big Data]|
|[V3, user, David]|[V2, topic, SQL]     |
|[V3, user, David]|[V4, topic, Big Data]|
+-----------------+---------------------+
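To make the result above easier to read, here is a minimal plain-Python sketch of what the motif `(v1)-[]->(v2)` computes: one pair of vertex records per edge. It is not how GraphFrames implements motif finding. The helper `match_single_edge` and the in-memory data structures are hypothetical and only mirror the sample tables shown in this notebook.

```python
# Sample data copied from the tables above: id -> (id, entityName, name)
vertices = {
    "V1": ("V1", "user", "Paolo"),
    "V2": ("V2", "topic", "SQL"),
    "V3": ("V3", "user", "David"),
    "V4": ("V4", "topic", "Big Data"),
    "V5": ("V5", "user", "John"),
}

# Only the "follow" edges, as (src, dst) pairs
follow_edges = [("V1", "V3"), ("V1", "V4"), ("V3", "V2"), ("V3", "V4")]


def match_single_edge(vertices, edges):
    """Hypothetical helper: return one (v1, v2) pair of vertex records per edge."""
    return [
        (vertices[src], vertices[dst])
        for src, dst in edges
        if src in vertices and dst in vertices
    ]


pairs = match_single_edge(vertices, follow_edges)
for v1, v2 in pairs:
    print(v1, "->", v2)
```

Note that the pair (Paolo, David) is still present: a user can follow another user, which is why a filter on the entity type of both vertices is still needed.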



In [17]:
# Keep only the paths that go from a user to a topic
selectedPairsDF = selectedPaths\
.filter("v1.entityName='user' AND v2.entityName='topic'")

In [18]:
selectedPairsDF.show()

+-----------------+--------------------+
|               v1|                  v2|
+-----------------+--------------------+
|[V1, user, Paolo]|[V4, topic, Big D...|
|[V3, user, David]|    [V2, topic, SQL]|
|[V3, user, David]|[V4, topic, Big D...|
+-----------------+--------------------+



In [19]:
# Select the name of the user and the name of the followed topic
userTopicDF = selectedPairsDF\
.selectExpr("v1.name AS username", "v2.name AS topic")
userTopicDF.show(truncate=False)

+--------+--------+
|username|topic   |
+--------+--------+
|Paolo   |Big Data|
|David   |SQL     |
|David   |Big Data|
+--------+--------+
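The last two steps (the entity-type filter and the projection on the names) can be cross-checked with a short plain-Python sketch. It assumes the four pairs printed by `selectedPaths.show()` above. The variable names `selected_paths` and `user_topic` are hypothetical and not part of the notebook.

```python
# (v1, v2) pairs copied from the selectedPaths output above;
# each record is (id, entityName, name)
selected_paths = [
    (("V1", "user", "Paolo"), ("V3", "user", "David")),
    (("V1", "user", "Paolo"), ("V4", "topic", "Big Data")),
    (("V3", "user", "David"), ("V2", "topic", "SQL")),
    (("V3", "user", "David"), ("V4", "topic", "Big Data")),
]

# Keep user -> topic pairs only, then project the two names
user_topic = [
    (v1[2], v2[2])
    for v1, v2 in selected_paths
    if v1[1] == "user" and v2[1] == "topic"
]

print(user_topic)
```

The three remaining pairs match the rows of `userTopicDF` shown above.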



In [20]:
# Store the result in CSV format, one (username, topic) pair per line
userTopicDF.write.csv(outputPath, header=True)
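Spark writes the result as a folder (`resOut_ex55/`) containing one or more part files, and the order of the rows across them is not guaranteed. As a rough illustration of the expected content for the sample data, the sketch below formats the three result rows with Python's standard `csv` module. The helper `to_csv_text` is hypothetical and does not read the Spark output.

```python
import csv
import io

# Result rows copied from the userTopicDF output above
rows = [("Paolo", "Big Data"), ("David", "SQL"), ("David", "Big Data")]


def to_csv_text(rows, header=("username", "topic")):
    """Hypothetical helper: format a header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)
```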