# Introduction to GraphFrames

Start by initializing a SparkSession:

In [1]:
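import os
from pyspark import SQLContext, SparkContext, SparkConf
from pyspark.sql import SparkSession  # needed for SparkSession.builder below

appName = "rdd programming guide"
master = "local[1]"
conf = SparkConf().setAppName(appName).setMaster(master)
# Legacy entry points, kept for reference:
#sc = SparkContext.getOrCreate(conf=conf)
#sqlContext = SQLContext.getOrCreate(sc)
# Despite its name, sc is a SparkSession here, not a SparkContext.
sc = SparkSession.builder.config(conf=conf).getOrCreate()
sqlContext = sc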

23/12/12 22:54:33 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Load a JSON file into a DataFrame:

In [2]:
spark_home = os.environ.get('SPARK_HOME')
df = sqlContext.read.json(spark_home + "/examples/src/main/resources/people.json")

In [3]:
display(df)

DataFrame[age: bigint, name: string]

In [4]:
type(df)

pyspark.sql.dataframe.DataFrame

In [5]:
df.show(n=4)

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



Create a vertex DataFrame with a unique ID column "id":

In [6]:
v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])

In [7]:
v.show(n=4)

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  a|  Alice| 34|
|  b|    Bob| 36|
|  c|Charlie| 30|
+---+-------+---+




Create an edge DataFrame with "src" and "dst" columns:

In [8]:
e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])

In [9]:
e.show(n=4)

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  a|  b|      friend|
|  b|  c|      follow|
|  c|  b|      follow|
+---+---+------------+



Create a GraphFrame:

In [10]:
from graphframes import *
g = GraphFrame(v, e)

Query: get in-degree of each vertex

In [11]:
g.inDegrees.show()

+---+--------+
| id|inDegree|
+---+--------+
|  b|       2|
|  c|       1|
+---+--------+



Query: count the number of follow connections in the graph

In [12]:
g.edges.filter("relationship = 'follow'").count()

2

Run PageRank algorithm and show results

In [13]:
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()

+---+------------------+
| id|          pagerank|
+---+------------------+
|  c|1.8994109890559092|
|  b|1.0905890109440908|
|  a|              0.01|
+---+------------------+
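
To make the ranks above easier to interpret, here is a plain-Python sketch of an iterative PageRank on the same three-vertex graph. It assumes an un-normalized update rule: every rank starts at 1.0 and, on each iteration, becomes `reset_probability + (1 - reset_probability) * (sum of incoming rank shares)`. That rule is an assumption about what the library computes internally, not something this notebook states, although on this small graph it gives values that agree with the table above to several decimal places. The name `pagerank_sketch` is made up for illustration.

```python
from collections import Counter

# Vertices and (src, dst) edges of the three-vertex example graph.
vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "b")]


def pagerank_sketch(vertices, edges, reset_probability, max_iter):
    """Iterative PageRank under the assumed un-normalized update rule."""
    out_degree = Counter(src for src, _dst in edges)
    ranks = {v: 1.0 for v in vertices}  # assumed starting value
    for _ in range(max_iter):
        # Each vertex splits its current rank evenly across its out-edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        # Synchronous update: all new ranks are computed from the old ones.
        ranks = {
            v: reset_probability + (1.0 - reset_probability) * incoming[v]
            for v in vertices
        }
    return ranks


ranks = pagerank_sketch(vertices, edges, reset_probability=0.01, max_iter=20)
for vertex_id, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(vertex_id, rank)
```

Vertex `a` has no incoming edges, so under this rule its rank is exactly the reset probability. With such a small reset probability, the ranks of `b` and `c` swing back and forth along the two-edge cycle between them, which is why they are still far apart after 20 iterations.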



## Creating GraphFrames

Users can create GraphFrames from vertex and edge DataFrames:
* _Vertex DataFrame_: A vertex DataFrame should contain a special column named "id" which specifies unique IDs for each vertex in the graph.
* _Edge DataFrame_: An edge DataFrame should contain two special columns: "src" (source vertex ID of edge) and "dst" (destination vertex ID of edge).

Both DataFrames can have arbitrary other columns. Those columns can represent vertex and edge attributes.
A GraphFrame can also be constructed from a single DataFrame containing edge information. The vertices will be inferred from the sources and destinations of the edges.
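
The idea of inferring vertices from edges can be illustrated without Spark. The sketch below is plain Python, not the GraphFrames API: it collects every ID that appears as a source or destination in a list of edge rows. The helper name `infer_vertex_ids` is hypothetical.

```python
# Edge rows as (src, dst, relationship), copied from the first example.
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]


def infer_vertex_ids(edge_rows):
    """Return the sorted set of IDs seen as a source or a destination."""
    ids = set()
    for src, dst, _relationship in edge_rows:
        ids.add(src)
        ids.add(dst)
    return sorted(ids)


inferred = infer_vertex_ids(edges)
print(inferred)
```

Vertices inferred this way carry only an ID. Attributes such as `name` and `age` cannot be recovered from the edge rows, and a vertex with no edges at all would be missing.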

The following example demonstrates how to create a GraphFrame from vertex and edge DataFrames.

Vertex DataFrame

In [14]:
v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
], ["id", "name", "age"])

Edge DataFrame

In [15]:
e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

Create GraphFrame

In [16]:
g = GraphFrame(v, e)

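The GraphFrames package ships a similar example graph. As the outputs in the next section show, the packaged version used here does not include vertex `g` or the edge from `a` to `e`: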

In [17]:
from graphframes.examples import Graphs
g = Graphs(sqlContext).friends()

## Basic graph and DataFrame queries

GraphFrames provide several simple graph queries, such as node degree.

Also, since GraphFrames represent graphs as pairs of vertex and edge DataFrames, it is easy to make powerful queries directly on the vertex and edge DataFrames. Those DataFrames are available as the `vertices` and `edges` fields of the GraphFrame.

In [18]:
from graphframes.examples import Graphs
g = Graphs(sqlContext).friends()  # Get example graph

Display the vertex and edge DataFrames

In [19]:
g.vertices.show()

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  a|  Alice| 34|
|  b|    Bob| 36|
|  c|Charlie| 30|
|  d|  David| 29|
|  e| Esther| 32|
|  f|  Fanny| 36|
+---+-------+---+



In [20]:
g.edges.show()

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  a|  b|      friend|
|  b|  c|      follow|
|  c|  b|      follow|
|  f|  c|      follow|
|  e|  f|      follow|
|  e|  d|      friend|
|  d|  a|      friend|
+---+---+------------+



Get a DataFrame with columns "id" and "inDegree" (in-degree)

In [21]:
vertexInDegrees = g.inDegrees

In [22]:
vertexInDegrees.show()

+---+--------+
| id|inDegree|
+---+--------+
|  b|       2|
|  c|       2|
|  f|       1|
|  d|       1|
|  a|       1|
+---+--------+
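
As a cross-check on the table above, in-degree is simply the number of edges whose destination is a given vertex. The following plain-Python sketch (no Spark involved) counts destinations over the seven edges printed by `g.edges.show()`. The variable names are illustrative only.

```python
from collections import Counter

# Edge rows as (src, dst, relationship), copied from the edge table above.
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
]

# Count how many times each vertex ID appears as a destination.
in_degrees = Counter(dst for _src, dst, _relationship in edges)

for vertex_id, degree in sorted(in_degrees.items()):
    print(vertex_id, degree)
```

Vertex `e` never appears as a destination, so it has no entry in the counter, which is consistent with its absence from the `inDegrees` output.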



Find the youngest user's age in the graph.
This queries the vertex DataFrame.

In [23]:
g.vertices.groupBy().min("age").show()

+--------+
|min(age)|
+--------+
|      29|
+--------+



Count the number of "follows" in the graph.
This queries the edge DataFrame

In [24]:
numFollows = g.edges.filter("relationship = 'follow'").count()

In [25]:
print(f"numFollows={numFollows}")

numFollows=4


## Motif finding

Motif finding refers to searching for structural patterns in a graph.

GraphFrame motif finding uses a simple Domain-Specific Language (DSL) for expressing structural queries. For example, `graph.find("(a)-[e]->(b); (b)-[e2]->(a)")` will search for pairs of vertices `a,b` connected by edges in both directions. It will return a `DataFrame` of all such structures in the graph, with columns for each of the named elements (vertices or edges) in the motif. In this case, the returned columns will be `"a, b, e, e2"`.

DSL for expressing structural patterns:

* The basic unit of a pattern is an edge. For example, `"(a)-[e]->(b)"` expresses an edge `e` from vertex `a` to vertex `b`. Note that vertices are denoted by parentheses `(a)`, while edges are denoted by square brackets `[e]`.
* A pattern is expressed as a union of edges. Edge patterns can be joined with semicolons. Motif `"(a)-[e]->(b); (b)-[e2]->(c)"` specifies two edges from `a` to `b` to `c`.
* Within a pattern, names can be assigned to vertices and edges. For example, `"(a)-[e]->(b)"` has three named elements: vertices `a`, `b`, and edge `e`. These names serve two purposes:
     * The names can identify common elements among edges. For example, `"(a)-[e]->(b); (b)-[e2]->(c)"` specifies that the same vertex is the destination of edge `e` and source of edge `e2`.
     * The names are used as column names in the result `DataFrame`. If a motif contains named vertex `a`, then the result `DataFrame` will contain a column `"a"` which is a `StructType` with sub-fields equivalent to the schema (columns) of `GraphFrame.vertices`.
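
To illustrate how shared names tie edge patterns together, the sketch below evaluates the motif `"(a)-[e]->(b); (b)-[e2]->(a)"` by hand in plain Python over the edges of the friends example graph. It is a conceptual model only, not how GraphFrames implements `find`. The function name `find_bidirectional_pairs` is hypothetical, and each match stores bare vertex IDs where the real result would hold full vertex structs.

```python
# Edge rows as (src, dst, relationship) from the friends example graph.
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
]


def find_bidirectional_pairs(edge_rows):
    """Match (a)-[e]->(b); (b)-[e2]->(a) by comparing every pair of edges."""
    matches = []
    for e in edge_rows:
        for e2 in edge_rows:
            # The name `a` must be both e's source and e2's destination,
            # and the name `b` must be both e's destination and e2's source.
            if e[0] == e2[1] and e[1] == e2[0]:
                matches.append({"a": e[0], "b": e[1], "e": e, "e2": e2})
    return matches


matches = find_bidirectional_pairs(edges)
for row in matches:
    print(row)
```

Only `b` and `c` are connected in both directions, and the pattern matches once for each assignment of the names, so the sketch yields two rows: one with `a = b, b = c` and one with `a = c, b = b`.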
