# Graph processing using GraphFrames

In this notebook you will construct a graph from the answers and users datasets and use the GraphFrames library to run some algorithms on it.

In [None]:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, desc, count, greatest, least

import networkx as nx
import matplotlib.pyplot as plt

import os
from IPython.display import Image
from sklearn.preprocessing import LabelEncoder

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Graph processing')
    .config('spark.jars.packages', 'graphframes:graphframes:0.8.4-spark3.5-s_2.12')
    .getOrCreate()
)

In [None]:
from graphframes import *

In [None]:
base_path = os.getcwd()

project_path = ('/').join(base_path.split('/')[0:-3]) 

answers_input_path = os.path.join(project_path, 'data/answers')

users_input_path = os.path.join(project_path, 'data/users')

image_path = os.path.join(project_path, 'data/images/graphframes.png')

# Task

Create a graph from users and answers. The users will be represented as nodes in the graph, and two users will be connected by an edge if they answered the same question (see the image below).

On the graph, run the following algorithms:
* [Label Propagation](https://en.wikipedia.org/wiki/Label_propagation_algorithm) to find some communities / clusters of users
* [PageRank](https://en.wikipedia.org/wiki/PageRank) to find important nodes in the graph 

Note:
* consider taking only a [sample](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sampleBy.html#pyspark.sql.DataFrame.sampleBy) of the answers to reduce the size of the graph if you run in local mode
* also check the user guide for [GraphFrames](https://graphframes.github.io/graphframes/docs/_site/user-guide.html)

In [None]:
Image(image_path, width=480)

In [None]:
# answers is the main dataset used for the graph

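answersDF = (
    spark
    .read
    .option('path', answers_input_path)
    .load()
    # keyword arguments make sure 0.5 is used as the fraction and 24 as the seed
    .sample(withReplacement=False, fraction=0.5, seed=24)
    .cache()
)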

#### Create vertices:

Hint:
* select user_id
* deduplicate
* rename the column to `id`
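
The three hints can be illustrated on toy data in plain Python. This is only a sketch of the logic, not the Spark solution; the `toy_answers` rows are made up for illustration.

```python
# Hypothetical toy rows standing in for the answers dataset
toy_answers = [
    {'question_id': 1, 'user_id': 'a'},
    {'question_id': 1, 'user_id': 'b'},
    {'question_id': 2, 'user_id': 'a'},
]

# Select user_id, deduplicate with a set, and expose the value under the key `id`
toy_vertices = [
    {'id': user_id}
    for user_id in sorted({row['user_id'] for row in toy_answers})
]

print(toy_vertices)
```

The rename matters because GraphFrames expects the vertex DataFrame to contain a column named `id`.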

In [None]:
# your code here:



#### Create edges:

Hint:
* do self-join of answers on `question_id` column
* filter out records where user_id from left side is the same as from right side
* rename the `user_id` columns to `src` / `dst`

Example:
* when we do a self-join of the following data (one question answered by two users `a` and `b`):\
question_id  user_id \
1 &nbsp;&nbsp;&nbsp;&nbsp;a\
1 &nbsp;&nbsp;&nbsp;&nbsp;b
* we will get: \
a &nbsp;&nbsp; 1 &nbsp;&nbsp;a \
a &nbsp;&nbsp; 1 &nbsp;&nbsp;b \
b &nbsp;&nbsp; 1 &nbsp;&nbsp;a \
b &nbsp;&nbsp; 1 &nbsp;&nbsp;b
* we need to remove the rows where a node is joined with itself: `a-1-a` and `b-1-b`
* we also need to remove the duplicate rows created by the join: `a-1-b` is the same edge as `b-1-a`
* for now, also keep each edge only once: if users `a` and `b` both answered questions 1 and 2, then `a-1-b` and `a-2-b` are the same edge and we keep only one of them (you could compute a weight for such edges, which could be useful for some algorithms, but let's skip it for now)
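
The self-join and the clean-up steps can be traced on toy data in plain Python. This is a sketch of the logic only, with made-up `toy_answers`; in Spark, the `least` and `greatest` functions imported at the top of the notebook could play the role that `min` and `max` play here.

```python
# Toy answers as (question_id, user_id):
# question 1 is answered by a and b, question 2 by a, b and c
toy_answers = [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (2, 'c')]

# Self-join on question_id: pair every answer with every answer to the same question
joined = [
    (left_user, question, right_user)
    for question, left_user in toy_answers
    for other_question, right_user in toy_answers
    if question == other_question
]

# Drop self-pairs, order each pair so (a, b) and (b, a) coincide,
# and let the set keep every edge only once across questions
toy_edges = sorted({
    (min(src, dst), max(src, dst))
    for src, _, dst in joined
    if src != dst
})

print(toy_edges)
```

Ordering the two user ids inside each pair is one way to make `a-1-b` and `b-1-a` identical, so that a plain deduplication removes both kinds of duplicates at once.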

In [None]:
# your code here:



In [None]:
edgesDF.show(n=5)

#### Create the graph:

Hint:
* use GraphFrame(vertices, edges) 

In [None]:
# your code here:



#### See some properties of the graph:

Hint:
* count number of edges
* count number of vertices

In [None]:
# your code here:



#### Find communities

Hint:
* use [labelPropagation](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.labelPropagation)
* see how many users are in each community
  * group by `label` and count
* see what users are in a given community
  * filter on `label` col
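
To get a feeling for what label propagation does, here is a simplified plain-Python sketch on a toy graph. It is not the GraphFrames implementation: the function `label_propagation`, the synchronous update, and the rule that ties go to the smallest label are assumptions made for illustration.

```python
from collections import Counter


def label_propagation(vertices, edges, max_iter=5):
    """Simplified synchronous label propagation; ties go to the smallest label."""
    neighbors = {v: set() for v in vertices}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    labels = {v: v for v in vertices}  # every vertex starts in its own community
    for _ in range(max_iter):
        new_labels = {}
        for v in vertices:
            if not neighbors[v]:
                new_labels[v] = labels[v]  # an isolated vertex keeps its label
                continue
            # adopt the most frequent label among the neighbors
            votes = Counter(labels[n] for n in neighbors[v])
            best = max(votes.values())
            new_labels[v] = min(label for label, c in votes.items() if c == best)
        labels = new_labels
    return labels


# Toy graph: two disjoint triangles
toy_vertices = [0, 1, 2, 3, 4, 5]
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]

labels = label_propagation(toy_vertices, toy_edges)

# "group by label and count"
community_sizes = Counter(labels.values())
# "filter on label"
members_of_0 = sorted(v for v, label in labels.items() if label == 0)

print(community_sizes, members_of_0)
```

Because every vertex only looks at its neighbors, the result depends on the number of iterations and on how ties are broken, so different runs or implementations can produce different communities.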

In [None]:
# your code here:



#### Compute PageRank

Hint:
* use [pageRank](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.pageRank) method
* order the vertices by `pagerank`
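
The idea behind PageRank can be sketched in plain Python with a few power iterations on a toy directed graph. The function `pagerank`, the damping factor of 0.85, and the normalization to a total of 1 are assumptions of this sketch; the scores returned by GraphFrames may be scaled differently.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on directed (src, dst) edges; ranks sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # rank held by nodes without out-links is spread evenly over all nodes
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Toy graph: b, c and d all point to a, and a points back to b
toy_edges = [('b', 'a'), ('c', 'a'), ('d', 'a'), ('a', 'b')]
ranks = pagerank(toy_edges)

# "order the vertices by pagerank"
ranking = sorted(ranks.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Node `a` ends up on top because it receives links from all other nodes, which is the kind of "important node" the task asks you to find.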

In [None]:
# your code here:



### Let's visualize some of the communities

We will filter for communities of size 30, which gives us a small sample of the data that is convenient to visualize. We will convert the sample to a pandas DataFrame and use the `networkx` library to plot it.

Notice that not all members of the same community need to be connected. You may also want to check the [connectedComponents](https://graphframes.io/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.connectedComponents) algorithm to discover subgraphs in which every node can be reached from every other node.
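
For comparison with communities, connected components can be sketched in plain Python with a breadth-first search. The function `connected_components` and the toy graph are made up for illustration and do not reflect how GraphFrames computes components.

```python
from collections import deque


def connected_components(vertices, edges):
    """Map every vertex to the first vertex found in its component."""
    neighbors = {v: set() for v in vertices}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    component = {}
    for start in vertices:
        if start in component:
            continue
        # breadth-first search from an unvisited vertex marks its whole component
        component[start] = start
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for n in neighbors[v]:
                if n not in component:
                    component[n] = start
                    queue.append(n)
    return component


toy_vertices = ['a', 'b', 'c', 'd', 'e']
toy_edges = [('a', 'b'), ('b', 'c'), ('d', 'e')]

components = connected_components(toy_vertices, toy_edges)
print(components)
```

Unlike a community label, a component id guarantees that a path exists between any two vertices that share it.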

In [None]:
sampled_users = (
    communities
    .withColumn('n', count('*').over(Window.partitionBy('label')))
    .filter(col('n') == 30)
    .select('id', 'label')
)

sampled_edges = (
    usersGraph.edges
    .join(sampled_users.select(col('id').alias('src')), 'src')
    .join(sampled_users.select(col('id').alias('dst')), 'dst')
)

vertices_with_labels = sampled_users.toPandas()
edges = sampled_edges.toPandas()

In [None]:
G = nx.Graph()
for _, row in vertices_with_labels.iterrows():
    G.add_node(row['id'], label=row['label'])

for _, row in edges.iterrows():
    G.add_edge(row['src'], row['dst'])

node_labels = nx.get_node_attributes(G, 'label')

print('Removing isolated nodes:', len(list(nx.isolates(G))))
G.remove_nodes_from(list(nx.isolates(G)))

# Encode community labels to color indices
label_encoder = LabelEncoder()
label_encoder.fit(list(node_labels.values()))

# Look up the encoded label of every remaining node (same order as G.nodes)
node_colors = label_encoder.transform([node_labels[node] for node in G.nodes()]).tolist()

In [None]:
# Draw
pos = nx.spring_layout(G, seed=12, k=0.9, iterations=200)

plt.figure(figsize=(9, 6))
nx.draw(G, pos, node_color=node_colors, cmap=plt.cm.Set3, with_labels=False, node_size=50)
plt.title('Label Propagation Clusters')
plt.show()

### Let's visualize the user with the highest PageRank

In [None]:
top_user_row = pr.vertices.orderBy(col('pagerank').desc()).limit(1).collect()[0]
top_user_id = top_user_row['id']
top_user_pagerank = top_user_row['pagerank']

# Get all edges where this user is involved (as src or dst)
connected_edges = pr.edges.filter((col('src') == top_user_id) | (col('dst') == top_user_id))

# Get all connected user IDs
connected_user_ids_df = (
    connected_edges.select('src')
    .union(connected_edges.select('dst'))
    .distinct()
    .withColumnRenamed('src', 'id')
)
visual_users = pr.vertices.join(connected_user_ids_df, 'id')

# Collect to Pandas
nodes_pd = visual_users.toPandas()
edges_pd = connected_edges.toPandas()

In [None]:
# Build graph
G = nx.Graph()

# Create a dict for fast lookup
pagerank_dict = {int(row['id']): row['pagerank'] for _, row in nodes_pd.iterrows()}

# Add nodes with pagerank (if available)
for node_id in pagerank_dict:
    G.add_node(node_id, pagerank=pagerank_dict[node_id])

# Add edges only if both src and dst are in pagerank_dict
for _, row in edges_pd.iterrows():
    if row['src'] in pagerank_dict and row['dst'] in pagerank_dict:
        G.add_edge(row['src'], row['dst'])

# Color and size based on pagerank
node_colors = []
node_sizes = []
for node in G.nodes():
    if node == top_user_id:
        node_colors.append('red')
    else:
        node_colors.append('skyblue')
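    # use a separate name so the `pr` graph from the PageRank step is not overwritten
    node_pagerank = pagerank_dict.get(node, 0)
    node_sizes.append(50 * node_pagerank)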

In [None]:
# Draw
pos = nx.spring_layout(G, seed=42)
plt.figure(figsize=(12, 9))
nx.draw_networkx_nodes(G, pos, node_color=node_colors, node_size=node_sizes, alpha=0.9)
nx.draw_networkx_edges(G, pos, alpha=0.5, width=1)
nx.draw_networkx_labels(G, pos, font_size=8)

plt.title(f'Top User ({top_user_id}) and Their Connected Users')
plt.axis('off')
plt.show()

In [None]:
spark.stop()