# GraphFrames User Guide (Python)

This notebook demonstrates examples from the [GraphFrames User Guide](https://graphframes.github.io/graphframes/docs/_site/user-guide.html).

The GraphFrames package is available from [Spark Packages](http://spark-packages.org/package/graphframes/graphframes).

#### Import the following library from the Maven repository

<code>
  graphframes:graphframes:0.8.1-spark2.4-s_2.11
</code>

#### Start a new cluster with the following runtime

<code>
  6.4 (includes Apache Spark 2.4.5, Scala 2.11)
</code>

In [0]:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *

## Creating GraphFrames

Users can create GraphFrames from vertex and edge DataFrames.

* Vertex DataFrame: A vertex DataFrame should contain a special column named "id" which specifies unique IDs for each vertex in the graph.
* Edge DataFrame: An edge DataFrame should contain two special columns: "src" (source vertex ID of edge) and "dst" (destination vertex ID of edge).

Both DataFrames can have arbitrary other columns. Those columns can represent vertex and edge attributes.
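
As a rough illustration of the two requirements above, the following plain-Python sketch checks rows before they would be turned into DataFrames. The helper name `check_graph_inputs` is hypothetical and not part of GraphFrames. The rule that every "src" and "dst" must match a known "id" is an assumption about what a well-formed graph looks like, not something stated in this guide.

```python
def check_graph_inputs(vertex_rows, edge_rows):
    """Return a list of problems found in (id, ...) and (src, dst, ...) rows."""
    problems = []
    ids = [row[0] for row in vertex_rows]
    # The "id" column is expected to hold unique values.
    if len(ids) != len(set(ids)):
        problems.append("duplicate vertex ids")
    known = set(ids)
    # Assumption: each edge endpoint should refer to an existing vertex id.
    for src, dst, *_ in edge_rows:
        for endpoint in (src, dst):
            if endpoint not in known:
                problems.append(
                    f"edge ({src}, {dst}) references unknown vertex {endpoint}"
                )
    return problems


vertex_rows = [("a", "Alice", 34), ("b", "Bob", 36)]
edge_rows = [("a", "b", "follow"), ("b", "z", "follow")]
print(check_graph_inputs(vertex_rows, edge_rows))
```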

Create the vertices first:

In [0]:
vertices = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)], ["id", "name", "age"])

And then some edges:

In [0]:
edges = sqlContext.createDataFrame([
  ("a", "b", "follow"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "follow"),
  ("d", "a", "follow"),
  ("a", "e", "follow")
], ["src", "dst", "relationship"])

Let's create a graph from these vertices and these edges:

In [0]:
g = GraphFrame(vertices, edges)
print(g)

In [0]:
# This example graph also comes with the GraphFrames package.
from graphframes.examples import Graphs
same_g = Graphs(sqlContext).friends()
print(same_g)

## Basic graph and DataFrame queries

GraphFrames provide several simple graph queries, such as node degree.

Also, since GraphFrames represent graphs as pairs of vertex and edge DataFrames, it is easy to run powerful queries directly on the vertex and edge DataFrames. Those DataFrames are available as the vertices and edges fields of the GraphFrame.

In [0]:
display(g.vertices)

id,name,age
a,Alice,34
b,Bob,36
c,Charlie,30
d,David,29
e,Esther,32
f,Fanny,36
g,Gabby,60


In [0]:
display(g.edges)

src,dst,relationship
a,b,follow
b,c,follow
c,b,follow
f,c,follow
e,f,follow
e,d,follow
d,a,follow
a,e,follow


### In Degrees and Out Degrees

The incoming degree of the vertices:

In [0]:
display(g.inDegrees)

id,inDegree
f,1
e,1
d,1
c,2
b,2
a,1


The outgoing degree of the vertices:

In [0]:
display(g.outDegrees)

id,outDegree
f,1
e,2
d,1
c,1
b,1
a,2


The degree of the vertices:

In [0]:
display(g.degrees)

id,degree
f,2
e,3
d,2
c,3
b,3
a,3
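
To make the three degree tables easier to check by hand, here is a minimal plain-Python sketch that recomputes them from the edge list defined earlier. It only mirrors the counting; it is not how GraphFrames computes degrees on a cluster.

```python
from collections import Counter

# (src, dst) pairs copied from the edges defined earlier in this notebook.
edge_pairs = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]

out_degree = Counter(src for src, _dst in edge_pairs)  # edges leaving a vertex
in_degree = Counter(dst for _src, dst in edge_pairs)   # edges arriving at a vertex
degree = in_degree + out_degree                        # total edges touching a vertex

for vertex in sorted(degree):
    print(vertex, in_degree[vertex], out_degree[vertex], degree[vertex])
```

Vertex "g" has no edges, so it does not appear in any of the counters, which matches its absence from the outputs above.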


You can run queries directly on the vertices DataFrame. For example, we can find the age of the youngest person in the graph:
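
The cell for this query is not included here. The sketch below computes the same minimum on a plain-Python copy of the vertex rows, so it can be run without a cluster. The DataFrame call in the trailing comment is an assumed equivalent and is not executed.

```python
# Plain-Python copy of the vertex rows defined earlier in this notebook.
people = [
    ("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30),
    ("d", "David", 29), ("e", "Esther", 32), ("f", "Fanny", 36),
    ("g", "Gabby", 60),
]

# Minimum over the "age" column.
youngest_age = min(age for _id, _name, age in people)
print(youngest_age)

# Assumed DataFrame equivalent (not run here):
#   display(g.vertices.groupBy().min("age"))
```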

Likewise, you can run queries on the edges DataFrame. For example, let's count the number of 'follow' relationships in the graph:
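
The cell for this count is also not included here. As a stand-in, this sketch counts the matching rows in a plain-Python copy of the edge rows. The DataFrame call in the trailing comment is an assumed equivalent and is not executed.

```python
# Plain-Python copy of the edge rows defined earlier in this notebook.
edge_rows = [
    ("a", "b", "follow"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "follow"),
    ("d", "a", "follow"), ("a", "e", "follow"),
]

# Count the rows whose "relationship" column equals "follow".
num_follows = sum(1 for _src, _dst, rel in edge_rows if rel == "follow")
print(num_follows)

# Assumed DataFrame equivalent (not run here):
#   g.edges.filter("relationship = 'follow'").count()
```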

### Applying filters to graphs

In [0]:
# Keep only the vertices whose age is greater than 30.
older_than_30 = g.vertices.filter("age > 30")
older_than_30.show()

## Standard graph algorithms

GraphFrames comes with a number of standard graph algorithms built in:
* Breadth-first search (BFS)
* PageRank (regular and personalized)
* Shortest paths

## Breadth-first search (BFS)

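The first search below finds the shortest path from vertex "a" (Alice) to vertex "d" (David). The second searches from "Esther" for users of age > 32.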
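
For readers who want to see the idea without Spark, here is a minimal plain-Python sketch of a level-by-level breadth-first search over this notebook's edge list. The function name `bfs_paths` and its `max_len` parameter are hypothetical; this is not the GraphFrames implementation. It is run on the two searches used in this section's cells: from "a" to "d", and from Esther to any vertex with age > 32.

```python
def bfs_paths(edge_pairs, starts, is_target, max_len=10):
    """Return every shortest path (as a vertex list) from a start to a target."""
    adjacency = {}
    for src, dst in edge_pairs:
        adjacency.setdefault(src, []).append(dst)
    frontier = [[s] for s in starts]
    visited = set(starts)
    for _ in range(max_len + 1):
        # Stop at the first level that contains at least one target.
        hits = [path for path in frontier if is_target(path[-1])]
        if hits:
            return hits
        next_frontier, newly_seen = [], set()
        for path in frontier:
            for neighbor in adjacency.get(path[-1], []):
                if neighbor not in visited:
                    next_frontier.append(path + [neighbor])
                    newly_seen.add(neighbor)
        visited |= newly_seen
        frontier = next_frontier
        if not frontier:
            break
    return []


ages = {"a": 34, "b": 36, "c": 30, "d": 29, "e": 32, "f": 36, "g": 60}
edge_pairs = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]

a_to_d = bfs_paths(edge_pairs, ["a"], lambda v: v == "d")
esther_to_older = bfs_paths(edge_pairs, ["e"], lambda v: ages[v] > 32)
print(a_to_d, esther_to_older)
```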

In [0]:
paths = g.bfs("id = 'a'", "id = 'd'")
display(paths)

from,e0,v1,e1,to
"List(a, Alice, 34)","List(a, e, follow)","List(e, Esther, 32)","List(e, d, follow)","List(d, David, 29)"


In [0]:
paths = g.bfs("name = 'Esther'", "age > 32")
display(paths)

from,e0,to
"List(e, Esther, 32)","List(e, f, follow)","List(f, Fanny, 36)"


## Shortest paths

Computes shortest paths to the given set of landmark vertices, where landmarks are specified by vertex ID.

In [0]:
results = g.shortestPaths(landmarks=["a"])
display(results)

id,name,age,distances
g,Gabby,60,Map()
b,Bob,36,Map()
e,Esther,32,Map(a -> 2)
a,Alice,34,Map(a -> 0)
f,Fanny,36,Map()
d,David,29,Map(a -> 1)
c,Charlie,30,Map()
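
The distances above can be cross-checked with a short plain-Python sketch that walks the edges backwards from the landmark and counts hops. The function name `hops_to_landmark` is hypothetical, and the sketch assumes unweighted edges; it is not the GraphFrames implementation.

```python
from collections import deque


def hops_to_landmark(edge_pairs, landmark):
    """Map each vertex that can reach `landmark` to its hop count."""
    # Index edges by destination so the search can step backwards.
    predecessors = {}
    for src, dst in edge_pairs:
        predecessors.setdefault(dst, []).append(src)
    distances = {landmark: 0}
    queue = deque([landmark])
    while queue:
        current = queue.popleft()
        for previous in predecessors.get(current, []):
            if previous not in distances:
                distances[previous] = distances[current] + 1
                queue.append(previous)
    return distances


edge_pairs = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]
distances_to_a = hops_to_landmark(edge_pairs, "a")
print(distances_to_a)
```

Vertices missing from the result, such as "b", "c", "f" and "g", have no directed path to "a", which corresponds to the empty maps in the output above.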


## PageRank

Identify important vertices in a graph based on connections.

In [0]:
results = g.pageRank(resetProbability=0.15, maxIter=10)
display(results.vertices)

id,name,age,pagerank
g,Gabby,60,0.1707317073170731
b,Bob,36,2.7025217677349773
e,Esther,32,0.3613490987992571
a,Alice,34,0.4485115093698443
f,Fanny,36,0.3250491054969424
d,David,29,0.3250491054969424
c,Charlie,30,2.6667877057849627


## Subgraphs

GraphFrames provides APIs for building subgraphs by filtering on edges and vertices. These filters can be composed. For example, the following subgraph keeps only people older than 30 and then drops anyone left without a connection to another person older than 30.

In [0]:
g2 = g.filterVertices("age > 30").dropIsolatedVertices()

In [0]:
display(g2.vertices)

id,name,age
f,Fanny,36
e,Esther,32
b,Bob,36
a,Alice,34


In [0]:
display(g2.edges)

src,dst,relationship
e,f,follow
a,e,follow
a,b,follow
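
The same two steps can be sketched in plain Python to confirm the vertices and edges shown above. This is only an illustration of the filtering logic on in-memory data. In particular, dropping edges whose endpoints were filtered out is an assumption about how the vertex filter behaves, made because it matches the output above.

```python
ages = {"a": 34, "b": 36, "c": 30, "d": 29, "e": 32, "f": 36, "g": 60}
edge_pairs = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]

# Step 1: keep vertices with age > 30, and only edges between kept vertices.
kept = {v for v, age in ages.items() if age > 30}
kept_edges = [(s, d) for s, d in edge_pairs if s in kept and d in kept]

# Step 2: drop vertices that no remaining edge touches.
connected = {v for edge in kept_edges for v in edge}
sub_vertices = kept & connected

print(sorted(sub_vertices), sorted(kept_edges))
```

Gabby ("g") passes the age filter but has no edges, so the second step removes her.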


## Motif finding

Using motifs you can build more complex relationships involving edges and vertices. The following cell finds the pairs of vertices with edges in both directions between them. The result is a DataFrame, in which the column names are given by the motif keys.

Check out the [GraphFrames User Guide](http://graphframes.github.io/user-guide.html#motif-finding) for more details on the API.
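
The motif used in this section, `(x)-[r1]->(y); (y)-[r2]->(x)`, asks for every ordered pair of vertices joined by an edge in each direction. A minimal plain-Python sketch of that pattern, together with the age filter applied later in this section, might look like this. It illustrates the semantics on in-memory data only and is not how GraphFrames evaluates motifs.

```python
ages = {"a": 34, "b": 36, "c": 30, "d": 29, "e": 32, "f": 36, "g": 60}
edge_pairs = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e"),
]

# An (x, y) pair matches when both x -> y and y -> x exist.
edge_set = set(edge_pairs)
reciprocal = sorted((x, y) for x, y in edge_set if (y, x) in edge_set)

# Keep the pairs in which at least one endpoint is older than 30.
one_older = [(x, y) for x, y in reciprocal if ages[x] > 30 or ages[y] > 30]

print(reciprocal, one_older)
```

Each reciprocal relationship shows up twice, once per ordering of x and y, which is why the motif output in this section has two rows for Bob and Charlie.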

In [0]:
# Search for pairs of vertices with edges in both directions between them.
motifs = g.find("(x)-[r1]->(y); (y)-[r2]->(x)")
display(motifs)

x,r1,y,r2
"List(c, Charlie, 30)","List(c, b, follow)","List(b, Bob, 36)","List(b, c, follow)"
"List(b, Bob, 36)","List(b, c, follow)","List(c, Charlie, 30)","List(c, b, follow)"


Since the result is a DataFrame, more complex queries can be built on top of the motif. Let us find all the reciprocal relationships in which at least one person is older than 30:

In [0]:
filtered = motifs.filter("y.age > 30 or x.age > 30")
display(filtered)

x,r1,y,r2
"List(c, Charlie, 30)","List(c, b, follow)","List(b, Bob, 36)","List(b, c, follow)"
"List(b, Bob, 36)","List(b, c, follow)","List(c, Charlie, 30)","List(c, b, follow)"


