In [None]:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *


In [None]:
vertices = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)], ["id", "name", "age"])

In [None]:
edges = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

In [None]:
g = GraphFrame(vertices, edges)
print(g)

In [None]:
from graphframes.examples import Graphs
same_g = Graphs(sqlContext).friends()
print(same_g)

# Basic graph and DataFrame queries

GraphFrames provide several simple graph queries, such as node degree.

GraphFrames represent graphs as pairs of vertex and edge DataFrames, which lets us query the vertex and edge DataFrames directly. Those DataFrames are available as the `vertices` and `edges` fields of the GraphFrame.


In [None]:
g.vertices.show()

In [None]:
vertices.show()

In [None]:
g.edges.show()

The incoming degree of the vertices:
    

In [None]:
g.inDegrees.show()

The outgoing degree of the vertices:

In [None]:
g.outDegrees.show()

The total degree (incoming plus outgoing) of the vertices:

In [None]:
g.degrees.show()
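As a cross-check on the three degree queries, the same counts can be derived in plain Python from the edge list. This is only an illustrative sketch, not how GraphFrames computes them; `edge_rows`, `in_degree`, `out_degree`, and `degree` are names introduced here for the example. Note that vertex `g` has no edges, so it does not appear in any of the counts.

```python
from collections import Counter

# Same rows as the `edges` DataFrame above: (src, dst, relationship).
edge_rows = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
    ("a", "e", "friend"),
]

# In-degree counts how often a vertex is a destination,
# out-degree how often it is a source.
in_degree = Counter(dst for _, dst, _ in edge_rows)
out_degree = Counter(src for src, _, _ in edge_rows)

# The total degree is the sum of both counts.
degree = in_degree + out_degree

for vertex in sorted(degree):
    print(vertex, in_degree[vertex], out_degree[vertex], degree[vertex])
```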

We can also run queries directly on the `vertices` DataFrame. For example, we can find the age of the youngest person in the graph:

In [None]:
youngest = g.vertices.groupBy().min("age")
youngest.show()

Likewise, we can query the `edges` DataFrame. For example, let's count the number of _follow_ relationships in the graph:

In [None]:
numFollows = g.edges.filter("relationship = 'follow'").count()
print("The number of follow edges is", numFollows)

# Motif finding

Motifs let us express more complex structural patterns involving edges and vertices. The following cell finds the pairs of vertices with edges in both directions between them. The result is a DataFrame whose column names are given by the motif keys.

In [None]:
# search for pairs of vertices with edges in both directions between them.
motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
motifs.show()
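To make the pattern concrete, here is a plain-Python sketch of what the motif `(a)-[e]->(b); (b)-[e2]->(a)` selects on this particular graph. It is an illustration only and does not reflect how GraphFrames evaluates motifs; `edge_rows`, `edge_set`, and `reciprocal` are names made up for the example.

```python
# Same rows as the `edges` DataFrame above: (src, dst, relationship).
edge_rows = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
    ("a", "e", "friend"),
]

# Keep only the (src, dst) pairs so reverse edges can be looked up quickly.
edge_set = {(src, dst) for src, dst, _ in edge_rows}

# A pair matches when the reverse edge is also present.
reciprocal = sorted((src, dst) for src, dst in edge_set if (dst, src) in edge_set)

print(reciprocal)  # [('b', 'c'), ('c', 'b')]
```

Each reciprocal pair shows up twice, once per direction, because the motif binds `a` and `b` in both orders.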

Since the result is a `DataFrame`, more complex queries can be built on top of the motif. Let us find all the reciprocal relationships in which at least one person is older than 30:

In [None]:
filtered = motifs.filter("b.age > 30 or a.age > 30")
filtered.show()

## Stateful queries

Most motif queries are stateless and simple to express. We will now look at a more complex query that carries state along a path in the motif. Such queries can be expressed by combining `GraphFrame` motif finding with filters on the result, where the filters use sequence operations to operate over `DataFrame` columns.
For example, let us consider a chain of 4 vertices with some property defined by a sequence of functions. That is, among chains of 4 vertices `a->b->c->d`, identify the subset of chains matching this complex filter:

* initialize state on the path
* update state based on vertex a
* update state based on vertex b
* update state based on vertex c
* update state based on vertex d

If the final state matches some condition, then the filter accepts the chain. The code snippets below demonstrate this process, where we identify chains of 4 vertices such that at least 2 of the 3 edges are "friend" relationships. In this example, the state is the current count of "friend" edges; in general, it could be any `DataFrame` Column.
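The initialize/update/accept pattern can be sketched first in plain Python, independent of Spark. This is an assumption-laden illustration rather than the GraphFrames implementation: `edge_rows`, `count_friends`, and `chains` are hypothetical names, and the state here is updated once per edge, since the running value is the count of "friend" edges. `functools.reduce` plays the role of the sequence operation that threads the state along the path.

```python
from functools import reduce

# Same rows as the `edges` DataFrame above: (src, dst, relationship).
edge_rows = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
    ("a", "e", "friend"),
]


def count_friends(count, relationship):
    """Update step: add 1 to the running count for each "friend" edge."""
    return count + 1 if relationship == "friend" else count


chains = []
for s1, d1, r1 in edge_rows:
    for s2, d2, r2 in edge_rows:
        if s2 != d1:
            continue  # the second edge must start where the first ends
        for s3, d3, r3 in edge_rows:
            if s3 != d2:
                continue  # the third edge must start where the second ends
            # Initialize the state to 0, then fold it over the three edges.
            num_friends = reduce(count_friends, (r1, r2, r3), 0)
            # Accept the chain only if the final state meets the condition.
            if num_friends >= 2:
                chains.append((s1, d1, d2, d3))

for chain in sorted(chains):
    print("->".join(chain))
```

As in motif finding, vertices may repeat within a chain, so a path such as `a->e->d->a` counts as a chain of 4 vertices.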
