# Exercise 52 - GraphFrame

Input:
- The text file vertexes.csv: it contains the vertexes of a graph
- Each vertex is characterized by
    - id (string): user identifier
    - name (string): user name
    - age (integer): user age


- The text file edges.csv: it contains the edges of a graph
- Each edge is characterized by
    - src (string): source vertex
    - dst (string): destination vertex
    - linktype (string): “follow” or “friend”


Output:
- For each user with at least one follower, store the number of followers in the output folder
    - One user per line
    - Format: user id, number of followers
- Use the CSV format to store the result

In [1]:
vertexPath = "/data/students/bigdata-01QYD/ex_data/Ex52/data/vertexes.csv"
edgePath = "/data/students/bigdata-01QYD/ex_data/Ex52/data/edges.csv"
outputPath = "res_out_ex52/"

In [2]:
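from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Create a Spark session (reuses the active one if it already exists)
spark = SparkSession.builder.getOrCreate()

# Read the vertexes and the edges; the first line of each file is a header
vDF = spark.read.load(vertexPath,
                      format='csv',
                      header=True,
                      inferSchema=True)
eDF = spark.read.load(edgePath,
                      format='csv',
                      header=True,
                      inferSchema=True)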

In [3]:
vDF.show()
eDF.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
| u1|Alice| 34|
| u2|  Bob| 36|
| u3| John| 30|
| u4|David| 29|
| u5| Paul| 32|
| u6| Adel| 36|
| u7| Eddy| 60|
+---+-----+---+

+---+---+--------+
|src|dst|linktype|
+---+---+--------+
| u1| u2|  friend|
| u1| u4|  friend|
| u1| u5|  friend|
| u2| u1|  friend|
| u2| u3|  follow|
| u3| u2|  follow|
| u4| u1|  friend|
| u4| u5|  friend|
| u5| u1|  friend|
| u5| u4|  friend|
| u5| u6|  follow|
| u6| u3|  follow|
| u7| u6|  follow|
+---+---+--------+



In [4]:
# Only "follow" links count as followers, so discard all "friend" edges
edgesDF = eDF.filter("linktype = 'follow'")
edgesDF.show()

+---+---+--------+
|src|dst|linktype|
+---+---+--------+
| u2| u3|  follow|
| u3| u2|  follow|
| u5| u6|  follow|
| u6| u3|  follow|
| u7| u6|  follow|
+---+---+--------+



In [5]:
# Build the graph from all the users and the "follow" edges only
g = GraphFrame(vDF, edgesDF)

In [9]:
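# Count the number of followers for each user.
# A "follow" edge src -> dst means src follows dst, so followers = in-degree.
# inDegrees omits vertexes with no incoming edges (users with no followers).
userNumFollowersDF = g.inDegrees \
    .withColumnRenamed("inDegree", "numFollowers")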

In [10]:
userNumFollowersDF.printSchema()
userNumFollowersDF.show()

root
 |-- id: string (nullable = true)
 |-- numFollowers: integer (nullable = false)

+---+------------+
| id|numFollowers|
+---+------------+
| u3|           2|
| u6|           2|
| u2|           1|
+---+------------+
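
As a cross-check on the table above, the same counts can be reproduced without Spark. The following is a minimal pure-Python sketch, not part of the original solution: it hardcodes the sample edges shown earlier (the names `edges` and `num_followers` are introduced here only for illustration) and counts how often each user appears as the destination of a follow edge, which is what the in-degree measures.

```python
from collections import Counter

# Sample edges copied from the eDF.show() output: (src, dst, linktype)
edges = [
    ("u1", "u2", "friend"), ("u1", "u4", "friend"), ("u1", "u5", "friend"),
    ("u2", "u1", "friend"), ("u2", "u3", "follow"), ("u3", "u2", "follow"),
    ("u4", "u1", "friend"), ("u4", "u5", "friend"), ("u5", "u1", "friend"),
    ("u5", "u4", "friend"), ("u5", "u6", "follow"), ("u6", "u3", "follow"),
    ("u7", "u6", "follow"),
]

# Keep only the follow edges and count the occurrences of each destination
num_followers = Counter(dst for _, dst, linktype in edges if linktype == "follow")

for user_id, count in sorted(num_followers.items()):
    print(user_id, count)
```

Users that never appear as the destination of a follow edge (u1, u4, u5, u7) have no entry in the counter, just as they have no row in the Spark result.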



In [12]:
# Save the result in the output folder
userNumFollowersDF.write.csv(outputPath)
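
Spark writes the result as a folder of part files rather than a single file, and by default `write.csv` emits no header line, so each line has the form `user id,number of followers`. The sketch below is a plain-Python illustration of that line format, not the Spark writer itself. The `rows` list is copied from the result table shown earlier, and the sorting is there only to make the illustration deterministic, since Spark does not guarantee the order of the rows.

```python
import csv
import io

# Rows copied from the userNumFollowersDF.show() output
rows = [("u3", 2), ("u6", 2), ("u2", 1)]

# Write the rows in CSV format to an in-memory buffer, one user per line
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(sorted(rows))

csv_text = buffer.getvalue()
print(csv_text, end="")
```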