# **Exercise 56 - GraphFrames**


Input:

- The textual file vertexes.csv
    - It contains the vertexes of a graph


- Each vertex is characterized by
    - id (string): vertex identifier
    - entityName (string): “user” or “topic”
    - name (string): name of the entity


- The textual file edges.csv
    - It contains the edges of a graph


- Each edge is characterized by
    - src (string): source vertex
    - dst (string): destination vertex
    - linktype (string): “expertOf” or “follow” or “correlated”


Output:

- The names of the users who follow a topic correlated to the “Big Data” topic

- One user name per line

- Format: username

- Use the CSV format to store the result

In [2]:
from graphframes import GraphFrame

inputPathVertexes = "/data/students/bigdata-01QYD/ex_data/Ex56/data/vertexes.csv"
inputPathEdges = "/data/students/bigdata-01QYD/ex_data/Ex56/data/edges.csv"
outputPath = "resOut_ex56/"

In [3]:
# Read the content of vertexes.csv
vDF = spark.read.load(
    inputPathVertexes,
    format="csv",
    header=True,
    inferSchema=True,
)

vDF.printSchema()
vDF.show()

root
 |-- id: string (nullable = true)
 |-- entityName: string (nullable = true)
 |-- name: string (nullable = true)

+---+----------+--------+
| id|entityName|    name|
+---+----------+--------+
| V1|      user|   Paolo|
| V2|     topic|     SQL|
| V3|      user|   David|
| V4|     topic|Big Data|
| V5|      user|    John|
+---+----------+--------+



In [4]:
# Read the content of edges.csv
eDF = spark.read.load(
    inputPathEdges,
    format="csv",
    header=True,
    inferSchema=True,
)

eDF.printSchema()
eDF.show()

root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- linktype: string (nullable = true)

+---+---+----------+
|src|dst|  linktype|
+---+---+----------+
| V1| V2|      like|
| V1| V3|    follow|
| V1| V4|    follow|
| V3| V2|    follow|
| V3| V4|    follow|
| V5| V2|  expertOf|
| V2| V4|correlated|
| V4| V2|correlated|
+---+---+----------+



In [7]:
# Keep only the edge types that can appear in the pattern we look for:
# - user -> follow -> topic -> correlated -> topic

filteredEdges = eDF.filter("linktype = 'follow' OR linktype = 'correlated'")
filteredEdges.show()

+---+---+----------+
|src|dst|  linktype|
+---+---+----------+
| V1| V3|    follow|
| V1| V4|    follow|
| V3| V2|    follow|
| V3| V4|    follow|
| V2| V4|correlated|
| V4| V2|correlated|
+---+---+----------+



In [19]:
# We look for:
# - user -> follow -> topic -> correlated -> topic

# Create the input graph
g = GraphFrame(vDF, filteredEdges)
paths = g.find("(v1)-[e1]->(v2);(v2)-[e2]->(v3)")

In [20]:
paths.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                  v1|                  e1|                  v2|                  e2|                  v3|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|   [V1, user, Paolo]|    [V1, V3, follow]|   [V3, user, David]|    [V3, V4, follow]|[V4, topic, Big D...|
|   [V1, user, Paolo]|    [V1, V3, follow]|   [V3, user, David]|    [V3, V2, follow]|    [V2, topic, SQL]|
|   [V1, user, Paolo]|    [V1, V4, follow]|[V4, topic, Big D...|[V4, V2, correlated]|    [V2, topic, SQL]|
|   [V3, user, David]|    [V3, V2, follow]|    [V2, topic, SQL]|[V2, V4, correlated]|[V4, topic, Big D...|
|   [V3, user, David]|    [V3, V4, follow]|[V4, topic, Big D...|[V4, V2, correlated]|    [V2, topic, SQL]|
|    [V2, topic, SQL]|[V2, V4, correlated]|[V4, topic, Big D...|[V4, V2, correlated]|    [V2, topic, SQL]|
|[V4, topic, Big D...|[V4, V2, correl

In [21]:
# Keep only the paths user -> follow -> topic -> correlated -> "Big Data" topic
selectedPath = paths.filter("""v1.entityName='user'
                                 AND e1.linktype='follow'
                                 AND v2.entityName='topic'
                                 AND e2.linktype='correlated'
                                 AND v3.entityName='topic'
                                 AND v3.name='Big Data' """)
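
To make the motif and the filter easier to check by hand, the following is a minimal plain-Python sketch that replays them on the sample rows printed above. It does not use Spark or GraphFrames; the names `vertices`, `edges`, `two_hop_paths` and `is_match` are hypothetical and exist only for this illustration, and the data is copied from the tables shown in the previous cells.

```python
# Plain-Python replay of the motif and filter on the sample rows shown above.
# Illustration only: all names here are hypothetical, nothing uses Spark.
vertices = {
    "V1": ("user", "Paolo"),
    "V2": ("topic", "SQL"),
    "V3": ("user", "David"),
    "V4": ("topic", "Big Data"),
    "V5": ("user", "John"),
}

# Edges left after keeping only 'follow' and 'correlated' links.
edges = [
    ("V1", "V3", "follow"),
    ("V1", "V4", "follow"),
    ("V3", "V2", "follow"),
    ("V3", "V4", "follow"),
    ("V2", "V4", "correlated"),
    ("V4", "V2", "correlated"),
]

# (v1)-[e1]->(v2);(v2)-[e2]->(v3): e1 must end where e2 starts.
two_hop_paths = [(e1, e2) for e1 in edges for e2 in edges if e1[1] == e2[0]]


def is_match(e1, e2):
    """Same conditions as the filter applied to the motif result."""
    v1_type, _ = vertices[e1[0]]
    v2_type, _ = vertices[e1[1]]
    v3_type, v3_name = vertices[e2[1]]
    return (
        v1_type == "user"
        and e1[2] == "follow"
        and v2_type == "topic"
        and e2[2] == "correlated"
        and v3_type == "topic"
        and v3_name == "Big Data"
    )


usernames = [vertices[e1[0]][1] for e1, e2 in two_hop_paths if is_match(e1, e2)]
print(len(two_hop_paths), usernames)
```

On these rows the join produces seven candidate paths and only David's path satisfies every condition. Paolo follows "Big Data" directly, which is not the same as following a topic correlated to it, so he is excluded.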

In [22]:
selectedPath.show(truncate=False)

+-----------------+----------------+----------------+--------------------+---------------------+
|v1               |e1              |v2              |e2                  |v3                   |
+-----------------+----------------+----------------+--------------------+---------------------+
|[V3, user, David]|[V3, V2, follow]|[V2, topic, SQL]|[V2, V4, correlated]|[V4, topic, Big Data]|
+-----------------+----------------+----------------+--------------------+---------------------+



In [24]:
userDF = selectedPath.selectExpr("v1.name AS username")

userDF.show()

+--------+
|username|
+--------+
|   David|
+--------+

