## DSE 230: Spark GraphFrames

* GraphFrames is a package for Apache Spark that provides DataFrame-based graphs.
* A GraphFrame represents a graph as two DataFrames: one of vertices (e.g., users) and one of edges (e.g., relationships between users).
* GraphFrames also provides graph queries and standard graph algorithms. With GraphFrames, you can search for structural patterns within graphs, find important vertices, and more.

### Resources

---

Spark DataFrame Guide: https://spark.apache.org/docs/latest/sql-programming-guide.html

PySpark API Documentation: https://spark.apache.org/docs/latest/api/python/index.html

Spark GraphFrames Guide: http://graphframes.github.io/graphframes/docs/_site/user-guide.html

### Installing GraphFrames (run from a terminal)
1. Place the graphframes-0.8.1-spark3.0-s_2.12.jar file along with launch.sh in the work directory. Copy graphframes-0.8.1-spark3.0-s_2.12.jar to /usr/spark-3.1.1/jars:

    `cp graphframes-0.8.1-spark3.0-s_2.12.jar /usr/spark-3.1.1/jars`
    
2. Run the following command so that Spark downloads the package and its dependencies (run `exit()` from the PySpark shell once startup completes):

    `pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 --jars graphframes-0.8.1-spark3.0-s_2.12.jar`
    
3. Copy all the JARs that now appear in /root/.ivy2/jars to Spark's jars directory:

    `cp /root/.ivy2/jars/* /usr/spark-3.1.1/jars`
    
4. Run the pyspark command again (run `exit()` from the PySpark shell once startup completes):

    `pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 --jars graphframes-0.8.1-spark3.0-s_2.12.jar`
    
5. Continue with the instructions in this notebook.

In [1]:
import pyspark
from pyspark.sql import SparkSession, SQLContext

#### The entry point into all functionality in Spark is the `SparkSession` class. To create a basic `SparkSession`, use `SparkSession.builder`.

Note that the GraphFrames JAR is also added to the Spark context so that `graphframes` can be imported in the notebook.

In [2]:
conf = pyspark.SparkConf().setAll([('spark.master', 'local[1]'),
                                   ('spark.app.name', 'Spark GraphFrame Demo')])
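# Add the JAR to the current Spark context; addPyFile also puts it on the
# Python path, which makes the bundled graphframes Python package importable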
pyspark.SparkContext.getOrCreate(conf).addPyFile("graphframes-0.8.1-spark3.0-s_2.12.jar")

spark = SparkSession.builder.config(conf=conf).getOrCreate()

In [3]:
# graphframes can be imported only after the JAR has been added to the context
import graphframes

#### Create vertex and edge DataFrames

In [4]:
# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])

#### Create a graph and run some queries

In [5]:
g = graphframes.GraphFrame(v, e)

In [6]:
# Query: Get in-degree of each vertex.
g.inDegrees.show()

+---+--------+
| id|inDegree|
+---+--------+
|  c|       1|
|  b|       2|
+---+--------+
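
To make the in-degree query concrete, here is a minimal plain-Python sketch of the same count over the three edges defined above. It does not use Spark; the names `edges`, `in_degrees`, and `degrees` are illustrative and not part of GraphFrames.

```python
from collections import Counter

# Same edges as the edge DataFrame `e` above: (src, dst, relationship)
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]


def in_degrees(edge_list):
    """Count incoming edges per destination vertex."""
    return dict(Counter(dst for _src, dst, _rel in edge_list))


degrees = in_degrees(edges)
print(degrees)  # {'b': 2, 'c': 1}
```

As in the table above, vertex `a` is absent because no edge points to it: only vertices with at least one incoming edge are counted.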



In [7]:
# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

2
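
The filter-and-count query can be checked by hand with a small plain-Python sketch over the same three edges. The variable names here are illustrative only. On the Spark side, a comparable per-label breakdown would presumably come from `g.edges.groupBy("relationship").count()`.

```python
from collections import Counter

# Same edges as the edge DataFrame `e` above: (src, dst, relationship)
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]

# Tally edges by relationship label, then read off the "follow" count
relationship_counts = Counter(rel for _src, _dst, rel in edges)
follow_count = relationship_counts["follow"]
print(follow_count)  # 2
```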

#### Shortest path

 - Applications of these graph algorithms
 - Number of connected components - Applications and how the algorithm works

In [8]:
from graphframes.examples import Graphs

g = Graphs(spark).friends().cache()  # Get example graph
# TODO: Show the graph using a plotting library

In [9]:
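%%timeit
# Hop counts from every vertex to landmark "a", following edge direction.
# Note: %%timeit runs the cell repeatedly, so the table is printed once per run.
results = g.shortestPaths(landmarks=["a"])
results.select("id", "distances").show()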

+---+---------+
| id|distances|
+---+---------+
|  f|       {}|
|  a| {a -> 0}|
|  e| {a -> 2}|
|  d| {a -> 1}|
|  b|       {}|
|  c|       {}|
+---+---------+
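
The distances in this table can be reproduced without Spark. The sketch below is a plain breadth-first search over reversed edges, not the distributed algorithm GraphFrames runs internally. The edge list is an assumption about what `Graphs(spark).friends()` returns (it is consistent with the output above), and `hops_to_landmark` is a hypothetical helper name.

```python
from collections import defaultdict, deque

# Assumed edges (src, dst) of the example "friends" graph; not read from Spark
friends_edges = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
    ("e", "f"), ("e", "d"), ("d", "a"),
]
vertices = ["a", "b", "c", "d", "e", "f"]


def hops_to_landmark(edge_list, landmark):
    """Return {vertex: hop count} for every vertex that can reach `landmark`."""
    # Reverse the edges so a BFS starting at the landmark walks paths backwards
    reverse_adj = defaultdict(list)
    for src, dst in edge_list:
        reverse_adj[dst].append(src)

    distances = {landmark: 0}
    queue = deque([landmark])
    while queue:
        current = queue.popleft()
        for predecessor in reverse_adj[current]:
            if predecessor not in distances:
                distances[predecessor] = distances[current] + 1
                queue.append(predecessor)
    return distances


distances = hops_to_landmark(friends_edges, "a")
for vertex_id in vertices:
    # Vertices that cannot reach the landmark get an empty map, as in the table
    print(vertex_id, {"a": distances[vertex_id]} if vertex_id in distances else {})
```

Because edge direction is respected, `b`, `c`, and `f` have no path to `a` and therefore show an empty map, while `e` reaches `a` in two hops through `d`.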

+---+---------+
| id|distances|
+-