# Graph processing using GraphFrames

In this notebook you will construct a graph from the answers and users datasets and use the GraphFrames library to run some algorithms on it.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, count, greatest, least

import os
from IPython.display import Image

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Graph processing I')
    .config("spark.jars.packages", "graphframes:graphframes:0.8.4-spark3.5-s_2.12")
    .getOrCreate()
)

In [None]:
from graphframes import GraphFrame

In [None]:
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[:-3])

answers_input_path = os.path.join(project_path, 'data/answers')

users_input_path = os.path.join(project_path, 'data/users')

image_path = os.path.join(project_path, 'data/images/graphframes.png')

# Task

Create a graph from users and answers. The users will be represented as nodes in the graph, and two users will be connected by an edge if they answered the same question (see the image below).

On the graph, run the following algorithms:
* [Label Propagation](https://en.wikipedia.org/wiki/Label_propagation_algorithm) to find some communities / clusters of users
* [PageRank](https://en.wikipedia.org/wiki/PageRank) to find important nodes in the graph 

Note:
* consider taking only a [sample](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sampleBy.html#pyspark.sql.DataFrame.sampleBy) of answers to reduce the size of the graph if you run in local mode
* also check the user guide for [GraphFrames](https://graphframes.github.io/graphframes/docs/_site/user-guide.html)

In [None]:
Image(image_path, width=480)

#### Read the data:

In [None]:
# your code here:

# answers is the main dataset used for the graph

# we will also need users for metadata:

#### Create vertices:

Hint:
* select `user_id`
* deduplicate
* rename the column to `id` (GraphFrames expects the vertex DataFrame to have an `id` column)
* you may keep additional columns as metadata (joined from users)
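
The steps in this hint can be illustrated without Spark. The following is a plain-Python sketch on made-up data, not the notebook solution: the rows and the `display_name` field are hypothetical. In the notebook the same logic would be expressed with DataFrame operations such as `select`, `distinct`, `withColumnRenamed` and `join`.

```python
# Toy answers as (question_id, user_id) pairs. Hypothetical data.
answers = [(1, "a"), (1, "b"), (2, "a"), (2, "c"), (3, "b")]

# Toy users metadata: user_id -> display name. Hypothetical field.
users = {"a": "Alice", "b": "Bob", "c": "Carol", "d": "Dave"}

# Select user_id and deduplicate (a set keeps each user once).
user_ids = sorted({user_id for _, user_id in answers})

# Rename to `id` and attach metadata looked up from users.
vertices = [{"id": uid, "display_name": users.get(uid)} for uid in user_ids]

for vertex in vertices:
    print(vertex)
```

User `d` never answered a question, so it does not become a vertex in this sketch.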

In [None]:
# your code here:


#### Create edges:

Hint:
* do a self-join of answers on the `question_id` column
* filter out records where `user_id` from the left side is the same as `user_id` from the right side
* rename the `user_id` columns to `src` / `dst`

Example:
* when we do a self-join of the following data (one question answered by two users `a` and `b`):\
question_id  user_id \
1 &nbsp;&nbsp;&nbsp;&nbsp;a\
1 &nbsp;&nbsp;&nbsp;&nbsp;b
* we will get: \
a &nbsp;&nbsp; 1 &nbsp;&nbsp;a \
a &nbsp;&nbsp; 1 &nbsp;&nbsp;b \
b &nbsp;&nbsp; 1 &nbsp;&nbsp;a \
b &nbsp;&nbsp; 1 &nbsp;&nbsp;b
* we need to remove the rows where a node is joined with itself: `a-1-a` and `b-1-b`
* we also need to remove the duplicate rows created by the join: `a-1-b` is the same edge as `b-1-a`
    * these functions will be helpful:
        * [greatest](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.greatest.html#pyspark.sql.functions.greatest)
        * [least](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.least.html#pyspark.sql.functions.least)
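
The example above can be traced in plain Python. This is only a sketch of the logic on made-up rows, with `min` / `max` standing in for Spark's `least` / `greatest`; it is not the DataFrame solution itself. It assumes that `question_id` is kept in the deduplication key, so the same pair of users meeting on two different questions still yields two edge rows.

```python
from itertools import product

# Toy answers as (question_id, user_id) pairs. Hypothetical data.
answers = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c")]

# Self-join on question_id: pair every row with every row of the same question.
joined = [
    (left_user, q_left, right_user)
    for (q_left, left_user), (q_right, right_user) in product(answers, answers)
    if q_left == q_right
]

# Remove rows where a user is joined with itself (a-1-a, b-1-b, ...).
no_self = [(left, q, right) for left, q, right in joined if left != right]

# Normalise each pair so that src <= dst, then deduplicate per question.
# a-1-b and b-1-a both become ("a", "b", 1).
edges = sorted({(min(left, right), max(left, right), q) for left, q, right in no_self})

for src, dst, question_id in edges:
    print(src, dst, question_id)
```

Note that after this normalisation `src` is always the smaller id. GraphFrames treats edges as directed, so this ordering is a convention of the sketch rather than a statement about who answered first.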

In [None]:
# your code here:


#### Create the graph:

Hint:
* use `GraphFrame(vertices, edges)`

In [None]:
# your code here:


#### See some properties of the graph:

Hint:
* count number of edges
* count number of vertices

In [None]:
# your code here:


#### Find frequent edges

Hint:
* group by edge (edge is defined by two cols: `src`, `dst`) and count how many times the edge is in the graph
* order by the count in descending order
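
As a rough illustration of the grouping and ordering, here is a plain-Python sketch on a made-up edge list, where `Counter` plays the role of a group-by on (`src`, `dst`) followed by a count. The edge rows are hypothetical.

```python
from collections import Counter

# Toy edges as (src, dst) pairs, one row per shared question. Hypothetical data.
edges = [
    ("a", "b"), ("a", "b"), ("a", "c"),
    ("b", "c"), ("a", "b"), ("b", "c"),
]

# Group by (src, dst) and count how often each edge occurs.
edge_counts = Counter(edges)

# Order by count in descending order; ties are broken by the edge itself.
frequent = sorted(edge_counts.items(), key=lambda item: (-item[1], item[0]))

for (src, dst), n in frequent:
    print(src, dst, n)
```

A high count means the two users answered many of the same questions.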

In [None]:
# your code here:


#### Find communities

Hint:
* use [labelPropagation](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.labelPropagation)
* see how many users are in each community
    * group by `label` and count
* see what users are in a given community
    * filter on `label` col
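
To get a feel for what label propagation computes, the following plain-Python sketch runs a simplified synchronous variant on a made-up graph of two triangles joined by one bridge edge. It is an illustration under assumptions, not the GraphFrames implementation: edges are treated as undirected, ties are broken by taking the smallest label, and the function name `label_propagation` is hypothetical.

```python
from collections import Counter, defaultdict


def label_propagation(edges, max_iter=5):
    """Simplified synchronous label propagation on an undirected graph."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    # Every node starts in its own community, labelled by its own id.
    labels = {node: node for node in neighbours}

    for _ in range(max_iter):
        new_labels = {}
        for node, nbrs in neighbours.items():
            # Each neighbour votes with its current label.
            votes = Counter(labels[n] for n in nbrs)
            top = max(votes.values())
            # Assumption: break ties deterministically with the smallest label.
            new_labels[node] = min(label for label, c in votes.items() if c == top)
        if new_labels == labels:
            break  # nothing changed, so the labels are stable
        labels = new_labels
    return labels


# Two triangles (a, b, c) and (d, e, f) joined by the bridge c-d. Hypothetical data.
edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]
labels = label_propagation(edges)

# "Group by label and count": number of users per community.
community_sizes = Counter(labels.values())
print(labels)
print(community_sizes)
```

The label values are only identifiers; what matters is which nodes end up sharing one. Real label propagation is not guaranteed to converge, which is why the number of iterations is capped.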

In [None]:
# your code here:


In [None]:
# your code here:


In [None]:
# your code here:


#### Compute PageRank

Hint:
* use the [pageRank](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.pageRank) method
* order the vertices by `pagerank` in descending order
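
For intuition about the scores, here is a minimal power-iteration sketch of PageRank in plain Python on a made-up graph. It rests on several assumptions: every edge is treated as undirected (both directions are added), the damping factor is 0.85, and scores are normalised to sum to 1. The GraphFrames method works on directed edges and may scale its scores differently, so the numbers are not comparable; the function name `pagerank` here is hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank, treating each edge as undirected."""
    links = {}
    for src, dst in edges:
        links.setdefault(src, set()).add(dst)
        links.setdefault(dst, set()).add(src)

    n = len(links)
    ranks = {node: 1.0 / n for node in links}

    for _ in range(iterations):
        # Every node receives a small baseline share (the "random jump").
        new_ranks = {node: (1.0 - damping) / n for node in links}
        # Each node splits its damped rank evenly among its neighbours.
        for node, targets in links.items():
            share = damping * ranks[node] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Toy graph: "hub" is connected to everyone, a and b also know each other.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
ranks = pagerank(edges)

# Order the vertices by rank, highest first.
ranking = sorted(ranks, key=ranks.get, reverse=True)
print(ranking)
```

In this sketch the best-connected node comes out on top, which matches the intuition that PageRank highlights users who share questions with many other users.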

In [None]:
# your code here:


In [None]:
# your code here:


In [None]:
spark.stop()