# **Graph Analytics in Spark**
Graphs are a natural way of describing relationships. Practical examples of analytics over graphs:
- Ranking web pages (Google PageRank)
- Detecting groups of friends
- Determining the importance of infrastructure in electrical networks
- ...

## **GraphX**
GraphX is Spark's RDD-based library for graph processing. It is written in Scala and is a core part of Spark. It exposes a low-level interface built on RDDs that is very powerful, and many applications and libraries are built on top of it. However, it is not easy to use or optimize, and there is no Python version of its APIs.

If you want to work with graphs in Python, you can use **GraphFrames**, which is not officially part of Spark but is commonly used.

## **Building and Querying graphs with GraphFrame**
First, define the vertexes and edges of the graph. Both are represented as records inside DataFrames with specifically named columns:
- One DataFrame for the definition of the vertexes of the graph
- One DataFrame for the definition of the edges of the graph

The DataFrames that are used to represent **nodes/vertexes**:
- Contain one record per vertex
- Must contain a column named "id" that stores unique vertex IDs
- Can contain other columns that are used to characterize vertexes

The DataFrames that are used to represent **edges**:
- Contain one record per edge
- Must contain two columns "src" and "dst" storing source vertex IDs and destination vertex IDs of edges
- Can contain other columns that are used to characterize edges

Create a graph of type graphframes.graphframe.GraphFrame by invoking the constructor GraphFrame(v,e):
- v: the DataFrame containing the definition of the vertexes
- e: the DataFrame containing the definition of the edges

**NOTE:** Graphs in GraphFrames are directed. To represent an undirected graph, create a second edge DataFrame with "src" and "dst" swapped and combine it with the original one, so that every edge exists in both directions.

In [1]:
from graphframes import GraphFrame

# `spark` is assumed to be the SparkSession made available by the notebook

# Vertex DataFrame: one record per vertex, with the mandatory "id" column
v = spark.createDataFrame(
    [
        ("u1", "Alice", 34),
        ("u2", "Bob", 36),
        ("u3", "Charlie", 30),
        ("u4", "David", 29),
        ("u5", "Esther", 32),
        ("u6", "Fanny", 36),
        ("u7", "Gabby", 60),
    ],
    ["id", "name", "age"],
)

# Edge DataFrame: one record per edge, with the mandatory "src" and "dst" columns
e = spark.createDataFrame(
    [
        ("u1", "u2", "friend"),
        ("u2", "u3", "follow"),
        ("u3", "u2", "follow"),
        ("u6", "u3", "follow"),
        ("u5", "u6", "follow"),
        ("u5", "u4", "friend"),
        ("u4", "u1", "friend"),
        ("u1", "u5", "friend"),
    ],
    ["src", "dst", "relationship"],
)

# Create the graph
g = GraphFrame(v, e)

In [2]:
g.vertices.show()

+---+-------+---+
| id|   name|age|
+---+-------+---+
| u1|  Alice| 34|
| u2|    Bob| 36|
| u3|Charlie| 30|
| u4|  David| 29|
| u5| Esther| 32|
| u6|  Fanny| 36|
| u7|  Gabby| 60|
+---+-------+---+



In [3]:
g.edges.show()

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| u1| u2|      friend|
| u2| u3|      follow|
| u3| u2|      follow|
| u6| u3|      follow|
| u5| u6|      follow|
| u5| u4|      friend|
| u4| u1|      friend|
| u1| u5|      friend|
+---+---+------------+



In undirected graphs the edges indicate a two-way relationship (each edge can be traversed in both directions):
- In other graph libraries, such as NetworkX, you can call to_undirected() to create an undirected copy of the graph
- Unfortunately GraphFrames does not support it
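
One way to work around this is to add the reverse of every edge before building the edge DataFrame. The following is a minimal plain-Python sketch of that idea on a few of the sample edges; `add_reverse_edges` is a hypothetical helper name, not part of GraphFrames.

```python
# A few directed edges in the (src, dst, relationship) layout used in this section
edges = [
    ("u1", "u2", "friend"),
    ("u2", "u3", "follow"),
    ("u3", "u2", "follow"),
]


def add_reverse_edges(edge_rows):
    """Return the edges plus the reverse of each one, skipping duplicates."""
    seen = set()
    result = []
    for src, dst, rel in edge_rows:
        for row in ((src, dst, rel), (dst, src, rel)):
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result


undirected_edges = add_reverse_edges(edges)
for row in undirected_edges:
    print(row)
```

The resulting list can be passed to spark.createDataFrame with the column names ["src", "dst", "relationship"], exactly like the edge list used to build the graph earlier. The u2/u3 pair is already connected in both directions, so no duplicate rows are added for it.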

As with RDDs and DataFrames, you can cache a GraphFrame:
- This is convenient when the same (complex) graph, obtained through (multiple) transformations, is used several times in the same application
- Simply invoke cache() on the GraphFrame you want to cache

## **Querying the Graph**
Some specific methods are provided to execute queries on graphs
- filterVertices(condition)
- filterEdges(condition)
- dropIsolatedVertices()

The returned result is the filtered version of the input graph.

### **filterVertices(condition)**
The condition is an SQL-like expression on the values of the attributes of the vertexes
- E.g., "age>35"

Selects only the vertexes for which the specified condition is satisfied and returns a new graph with only the subset of selected vertexes.

### **filterEdges(condition)**
The condition is an SQL-like expression on the values of the attributes of the edges
- E.g., "relationship='friend'"

Selects only the edges for which the specified condition is satisfied and returns a new graph with only the subset of selected edges.

### **dropIsolatedVertices()**
Drops the vertexes that are not connected with any other node and returns a new graph without the dropped nodes.

In [4]:
selectedUsersandFriendRelGraph = g\
.filterVertices("age>=29 AND age<=50")\
.filterEdges("relationship='friend'")\
.dropIsolatedVertices()

In [7]:
g.vertices.show()
selectedUsersandFriendRelGraph.vertices.show()

+---+-------+---+
| id|   name|age|
+---+-------+---+
| u1|  Alice| 34|
| u2|    Bob| 36|
| u3|Charlie| 30|
| u4|  David| 29|
| u5| Esther| 32|
| u6|  Fanny| 36|
| u7|  Gabby| 60|
+---+-------+---+

+---+------+---+
| id|  name|age|
+---+------+---+
| u4| David| 29|
| u5|Esther| 32|
| u1| Alice| 34|
| u2|   Bob| 36|
+---+------+---+



In [8]:
g.edges.show()
selectedUsersandFriendRelGraph.edges.show()

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| u1| u2|      friend|
| u2| u3|      follow|
| u3| u2|      follow|
| u6| u3|      follow|
| u5| u6|      follow|
| u5| u4|      friend|
| u4| u1|      friend|
| u1| u5|      friend|
+---+---+------------+

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
| u5| u4|      friend|
| u1| u5|      friend|
| u4| u1|      friend|
| u1| u2|      friend|
+---+---+------------+
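
The filtered result above can be cross-checked without Spark. The sketch below mimics the three steps on plain Python lists. The helper names are hypothetical, and it assumes that removing a vertex also removes the edges that touch it (this makes no difference here, because u7 has no edges).

```python
vertices = [
    ("u1", "Alice", 34), ("u2", "Bob", 36), ("u3", "Charlie", 30),
    ("u4", "David", 29), ("u5", "Esther", 32), ("u6", "Fanny", 36),
    ("u7", "Gabby", 60),
]
edges = [
    ("u1", "u2", "friend"), ("u2", "u3", "follow"), ("u3", "u2", "follow"),
    ("u6", "u3", "follow"), ("u5", "u6", "follow"), ("u5", "u4", "friend"),
    ("u4", "u1", "friend"), ("u1", "u5", "friend"),
]


def filter_vertices(vs, es, keep):
    """Keep the vertexes satisfying `keep` and the edges between kept vertexes."""
    kept = [v for v in vs if keep(v)]
    ids = {v[0] for v in kept}
    return kept, [e for e in es if e[0] in ids and e[1] in ids]


def filter_edges(vs, es, keep):
    """Keep only the edges satisfying `keep`; vertexes are left untouched."""
    return vs, [e for e in es if keep(e)]


def drop_isolated_vertices(vs, es):
    """Drop the vertexes that are not an endpoint of any remaining edge."""
    connected = {e[0] for e in es} | {e[1] for e in es}
    return [v for v in vs if v[0] in connected], es


vs1, es1 = filter_vertices(vertices, edges, lambda v: 29 <= v[2] <= 50)
vs2, es2 = filter_edges(vs1, es1, lambda e: e[2] == "friend")
vs3, es3 = drop_isolated_vertices(vs2, es2)

print(sorted(v[0] for v in vs3))  # ['u1', 'u2', 'u4', 'u5']
print(len(es3))                   # 4
```

u3 and u6 pass the age filter but are dropped in the last step, because only "follow" edges touch them and those edges were removed by the second step.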



Given a GraphFrame, we can easily access its vertexes and edges:
- **g.vertices** returns the DataFrame associated with the vertexes of the input graph
- **g.edges** returns the DataFrame associated with the edges of the input graph

All the standard DataFrame transformations/actions are also available for the DataFrames that store vertexes and edges.

In [9]:
# Count how many vertexes and edges the graph has
print("Number of vertexes: ",g.vertices.count())
print("Number of edges: ",g.edges.count())

# Print on the standard output the smallest value of age
# (i.e., the age of the youngest user in the graph)
g.vertices.agg({"age":"min"}).show()

# Print on the standard output
# the number of "follow" edges in the graph.
numFollows = g.edges.filter("relationship = 'follow' ").count()

print(numFollows)

Number of vertexes:  7
Number of edges:  8
+--------+
|min(age)|
+--------+
|      29|
+--------+

4


## **Motif finding**
Motif finding refers to searching for structural patterns in graphs. A simple Domain-Specific Language (DSL) is
used to specify the structure of the patterns we are interested in.

The basic unit of a pattern is a connection between vertexes
- (v1)-[e1]->(v2)

Edges are denoted by square brackets and vertexes by round brackets. **Patterns are chains of basic units**.

It is acceptable to omit names for vertices or edges in patterns when not needed.

A basic unit (an edge between two vertexes) can be negated to indicate that the edge should not be present in the graph:
- (v1)-[]->(v2); !(v2)-[]->(v1)

The **find(motif)** method of GraphFrame is used to select motifs. **find()** returns a DataFrame of all the paths matching the structural motif/pattern.

Applying this pattern: **(v1)-[e1]->(v2); (v2)-[e2]->(v1)**

Content of the returned DataFrame:

    +--------------------+--------------------+--------------------+---------------------+
    |         v1         |         e1         |          v2        |           e2        |
    +--------------------+--------------------+--------------------+---------------------+
    | [u2, Bob, 36]      | [u2, u3, follow]   | [u3, Charlie, 30]  | [u3, u2, follow]    |
    | [u3, Charlie, 30]  | [u3, u2, follow]   | [u2, Bob, 36]      | [u2, u3, follow]    |
    +--------------------+--------------------+--------------------+---------------------+
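
To make the semantics of this pattern and of its negated variant concrete, here is a plain-Python sketch on the sample edge list. It only illustrates what the two patterns select, not how find() is implemented.

```python
# (src, dst) pairs of the sample graph used in this section
edges = [
    ("u1", "u2"), ("u2", "u3"), ("u3", "u2"), ("u6", "u3"),
    ("u5", "u6"), ("u5", "u4"), ("u4", "u1"), ("u1", "u5"),
]
edge_set = set(edges)

# (v1)-[e1]->(v2); (v2)-[e2]->(v1): edges whose reverse edge also exists
reciprocal = [(s, d) for s, d in edges if (d, s) in edge_set]

# (v1)-[]->(v2); !(v2)-[]->(v1): edges whose reverse edge does not exist
one_way = [(s, d) for s, d in edges if (d, s) not in edge_set]

print(reciprocal)    # [('u2', 'u3'), ('u3', 'u2')]
print(len(one_way))  # 6
```

The two reciprocal matches correspond to the two rows of the table above: the same pair of vertexes is reported once for each choice of v1.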

In [10]:
# Retrieve the motifs associated with the pattern
# vertex -> edge -> vertex -> edge -> back to the first vertex
# (pairs of vertexes connected in both directions)
reciprocalMotifs = g.find("(v1)-[e1]->(v2); (v2)-[e2]->(v1)")

# Retrieve the motifs associated with the pattern
# vertex -> edge -> vertex -> edge -> vertex
# NOTE: "friend" and "follow" are only names given to the two edges;
# they do not constrain the value of the relationship attribute
motifs = g.find("(v1)-[friend]->(v2); (v2)-[follow]->(v3)")

# Filter the motifs (the content of the motifs DataFrame)
# Select only the ones matching the pattern
# vertex -> friend -> vertex -> follow -> vertex
motifsFriendFollow = motifs\
.filter("friend.relationship='friend' AND follow.relationship='follow'")

In [12]:
motifs.show()
motifsFriendFollow.show()

+-----------------+----------------+-----------------+----------------+-----------------+
|               v1|          friend|               v2|          follow|               v3|
+-----------------+----------------+-----------------+----------------+-----------------+
|  [u1, Alice, 34]|[u1, u2, friend]|    [u2, Bob, 36]|[u2, u3, follow]|[u3, Charlie, 30]|
|[u3, Charlie, 30]|[u3, u2, follow]|    [u2, Bob, 36]|[u2, u3, follow]|[u3, Charlie, 30]|
| [u5, Esther, 32]|[u5, u6, follow]|  [u6, Fanny, 36]|[u6, u3, follow]|[u3, Charlie, 30]|
|  [u1, Alice, 34]|[u1, u5, friend]| [u5, Esther, 32]|[u5, u4, friend]|  [u4, David, 29]|
|  [u4, David, 29]|[u4, u1, friend]|  [u1, Alice, 34]|[u1, u5, friend]| [u5, Esther, 32]|
| [u5, Esther, 32]|[u5, u4, friend]|  [u4, David, 29]|[u4, u1, friend]|  [u1, Alice, 34]|
|  [u1, Alice, 34]|[u1, u5, friend]| [u5, Esther, 32]|[u5, u6, follow]|  [u6, Fanny, 36]|
|  [u4, David, 29]|[u4, u1, friend]|  [u1, Alice, 34]|[u1, u2, friend]|    [u2, Bob, 36]|
|    [u2, 
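
The two-edge chain pattern and the friend/follow filter can also be traced by hand. The following plain-Python sketch enumerates the chains on the sample edge list. It illustrates the semantics only and is not the GraphFrames implementation.

```python
# (src, dst, relationship) triples of the sample graph used in this section
edges = [
    ("u1", "u2", "friend"), ("u2", "u3", "follow"), ("u3", "u2", "follow"),
    ("u6", "u3", "follow"), ("u5", "u6", "follow"), ("u5", "u4", "friend"),
    ("u4", "u1", "friend"), ("u1", "u5", "friend"),
]

# (v1)-[friend]->(v2); (v2)-[follow]->(v3): every pair of edges where the
# destination of the first edge is the source of the second one
chains = [(e1, e2) for e1 in edges for e2 in edges if e1[1] == e2[0]]

# Keep only the chains made of a "friend" edge followed by a "follow" edge
friend_follow = [
    (e1[0], e1[1], e2[1])
    for e1, e2 in chains
    if e1[2] == "friend" and e2[2] == "follow"
]

print(len(chains))    # 10
print(friend_follow)  # [('u1', 'u2', 'u3'), ('u1', 'u5', 'u6')]
```

Before filtering, the chains include pairs such as follow/follow and friend/friend, because the edge names in the pattern are only labels. Only the filter on the relationship attribute restricts the matches to the two friend/follow chains.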

## **Basic Statistics**
Given the input graph, compute
- **Degree** of each vertex
- **inDegree** of each vertex
- **outDegree** of each vertex

In [2]:
# Retrieve the DataFrame with the information about the degree of
# each vertex
vertexesDegreesDF = g.degrees

# Retrieve the DataFrame with the information about the in-degree of
# each vertex
vertexesInDegreesDF = g.inDegrees

# Retrieve the DataFrame with the information about the out-degree of
# each vertex
vertexesOutDegreesDF = g.outDegrees

In [3]:
vertexesDegreesDF.show()
vertexesInDegreesDF.show()
vertexesOutDegreesDF.show()

+---+------+
| id|degree|
+---+------+
| u3|     3|
| u5|     3|
| u4|     2|
| u1|     3|
| u6|     2|
| u2|     3|
+---+------+

+---+--------+
| id|inDegree|
+---+--------+
| u3|       2|
| u4|       1|
| u5|       1|
| u1|       1|
| u6|       1|
| u2|       2|
+---+--------+

+---+---------+
| id|outDegree|
+---+---------+
| u3|        1|
| u5|        2|
| u4|        1|
| u1|        2|
| u6|        1|
| u2|        1|
+---+---------+
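
The three tables above can be reproduced by counting edge endpoints. This is a minimal plain-Python sketch using collections.Counter on the sample edge list, shown only to make the definitions concrete.

```python
from collections import Counter

# (src, dst) pairs of the sample graph used in this section
edges = [
    ("u1", "u2"), ("u2", "u3"), ("u3", "u2"), ("u6", "u3"),
    ("u5", "u6"), ("u5", "u4"), ("u4", "u1"), ("u1", "u5"),
]

# outDegree: number of edges leaving a vertex
out_degree = Counter(src for src, _ in edges)
# inDegree: number of edges entering a vertex
in_degree = Counter(dst for _, dst in edges)
# degree: inDegree + outDegree
degree = in_degree + out_degree

print(sorted(degree.items()))
# [('u1', 3), ('u2', 3), ('u3', 3), ('u4', 2), ('u5', 3), ('u6', 2)]
print("u7" in degree)  # False
```

As in the tables above, u7 does not appear at all: it is not an endpoint of any edge, so there is nothing to count for it.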



In [4]:
# Retrieve the DataFrame with the information about the in-degree of
# each vertex
vertexesInDegreesDF = g.inDegrees

# Select only the vertexes with an in-degree value >=2
selectedVertexesDF = vertexesInDegreesDF.filter("inDegree>=2")

# Select only the content of Column id
selectedVertexesIDsDF = selectedVertexesDF.select("id")

In [5]:
selectedVertexesDF.show()
selectedVertexesIDsDF.show()

+---+--------+
| id|inDegree|
+---+--------+
| u3|       2|
| u2|       2|
+---+--------+

+---+
| id|
+---+
| u3|
| u2|
+---+

