## DSE 230 : Spark GraphFrames

* GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs.
* GraphFrames represent graphs: vertices (e.g., users) and edges (e.g., relationships between users).
* GraphFrames also provide powerful tools for running queries and standard graph algorithms. With GraphFrames, you can easily search for patterns within graphs, find important vertices, and more.

### Resources

---

Spark DataFrame Guide: https://spark.apache.org/docs/latest/sql-programming-guide.html

PySpark API Documentation: https://spark.apache.org/docs/latest/api/python/index.html

Spark GraphFrames Guide: http://graphframes.github.io/graphframes/docs/_site/user-guide.html

### Installing graphframes (Run from a terminal)
1. Open a terminal in JupyterLab and run `bash`. Then change to the work directory:

    `cd work`

2. Place the graphframes-0.8.1-spark3.0-s_2.12.jar file along with launch.sh in the work directory. Copy graphframes-0.8.1-spark3.0-s_2.12.jar to /usr/spark-3.1.1/jars:

    `cp graphframes-0.8.1-spark3.0-s_2.12.jar /usr/spark-3.1.1/jars`
    
3. Run the following command (Run `exit()` from Pyspark shell once the execution completes):

    `pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 --jars graphframes-0.8.1-spark3.0-s_2.12.jar`
    
4. Copy all the jars in /root/.ivy2/jars to Spark's jars directory:

    `cp /root/.ivy2/jars/* /usr/spark-3.1.1/jars`
    
5. Run the pyspark command again (Run `exit()` from Pyspark shell once the execution completes):

    `pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 --jars graphframes-0.8.1-spark3.0-s_2.12.jar`
    
6. Continue with the instructions in this notebook

In [None]:
import pyspark
from pyspark.sql import SparkSession, SQLContext

#### The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder:

Note that the graphframes jar is also added to the Spark context so that graphframes can be imported in the notebook.

NOTE - Checkpointing must be enabled for applications with any of the following requirements (here, it is required to run the connected components algorithm):
 * Usage of stateful transformations - If either `updateStateByKey` or `reduceByKeyAndWindow` (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
 * Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information.

You can set the checkpoint directory using `sc.setCheckpointDir(checkpointDirectoryLocation)`.

In [None]:
conf = pyspark.SparkConf().setAll([('spark.master', 'local[*]'),
                                   ('spark.app.name', 'Spark GraphFrame Demo')])
# Add jar file to current spark context
pyspark.SparkContext.getOrCreate(conf).addPyFile('graphframes-0.8.1-spark3.0-s_2.12.jar')

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Set checkpoint directory
spark.sparkContext.setCheckpointDir('checkpoints')

In [None]:
# graphframes can be imported only after the .jar has been added to the context
import graphframes

#### Create vertex and edge dataframe

In [None]:
# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([('1', 'Carter', 'Derrick', 50), 
                           ('2', 'May', 'Derrick', 26),
                           ('3', 'Mills', 'Jeff', 80),
                           ('4', 'Hood', 'Robert', 65),
                           ('5', 'Banks', 'Mike', 93),
                           ('98', 'Berg', 'Tim', 28),
                           ('99', 'Page', 'Allan', 16)],
                          ['id', 'name', 'firstname', 'age'])
# Create an Edge DataFrame with "src" and "dst" columns that refer to vertex IDs
e = spark.createDataFrame([('1', '2', 'friend'),
                           ('2', '1', 'friend'),
                           ('3', '1', 'friend'),
                           ('1', '3', 'friend'),
                           ('2', '3', 'follows'),
                           ('3', '4', 'friend'),
                           ('4', '3', 'friend'),
                           ('5', '3', 'friend'),
                           ('3', '5', 'friend'),
                           ('4', '5', 'follows'),
                           ('98', '99', 'friend'),
                           ('99', '98', 'friend')],
                          ['src', 'dst', 'type'])

#### Create a graph and run some queries

In [None]:
g = graphframes.GraphFrame(v, e)

##### Visualization

!["Graph"](graph.png)

##### Show the in-degrees of all vertices
Hint - [`GraphFrame.inDegrees`](http://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html?highlight=edges#graphframes.GraphFrame.inDegrees)

In [None]:
<<< YOUR CODE HERE >>>
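
As a cross-check for the exercise above, the in-degrees can also be computed in plain Python, without Spark. This is only a sketch for verifying your output: it copies the edge list from the `e` DataFrame, and the names `edges` and `in_degrees` are illustrative rather than part of GraphFrames.

```python
from collections import Counter

# Same (src, dst, type) edges as the notebook's `e` DataFrame.
edges = [
    ('1', '2', 'friend'), ('2', '1', 'friend'), ('3', '1', 'friend'),
    ('1', '3', 'friend'), ('2', '3', 'follows'), ('3', '4', 'friend'),
    ('4', '3', 'friend'), ('5', '3', 'friend'), ('3', '5', 'friend'),
    ('4', '5', 'follows'), ('98', '99', 'friend'), ('99', '98', 'friend'),
]

# The in-degree of a vertex is the number of edges whose dst is that vertex.
in_degrees = Counter(dst for _src, dst, _type in edges)

# Print one "id in-degree" pair per line, ordered by numeric vertex ID.
for vertex_id, degree in sorted(in_degrees.items(), key=lambda kv: int(kv[0])):
    print(vertex_id, degree)
```

Like `GraphFrame.inDegrees`, the `Counter` only contains vertices that have at least one incoming edge.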

##### Count the number of "friend" connections in the graph.
Hint - `GraphFrame.edges` is a DataFrame holding edge information. Use `filter` on the DataFrame

In [None]:
<<< YOUR CODE HERE >>>
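
The expected count can be sanity-checked in plain Python before comparing it with the DataFrame result. This is a minimal sketch that copies the edge list from the `e` DataFrame; `edges` and `friend_count` are illustrative names.

```python
# Same (src, dst, type) edges as the notebook's `e` DataFrame.
edges = [
    ('1', '2', 'friend'), ('2', '1', 'friend'), ('3', '1', 'friend'),
    ('1', '3', 'friend'), ('2', '3', 'follows'), ('3', '4', 'friend'),
    ('4', '3', 'friend'), ('5', '3', 'friend'), ('3', '5', 'friend'),
    ('4', '5', 'follows'), ('98', '99', 'friend'), ('99', '98', 'friend'),
]

# Count the edges whose type is "friend"; each direction is a separate edge.
friend_count = sum(1 for _src, _dst, edge_type in edges if edge_type == 'friend')
print(friend_count)
```

Because edges are directed, a mutual friendship such as '1' and '2' contributes two edges to the count.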

##### Find the shortest paths from '1' and assign the result to 'results'
Hint - Use [`GraphFrame.shortestPaths`](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.shortestPaths)

In [None]:
<<< YOUR CODE HERE >>>

In [None]:
results.select("id", "distances").show()
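
To check the `distances` column by hand, the hop counts can be reproduced with a breadth-first search in plain Python. This sketch assumes, based on my reading of the GraphFrames documentation, that `shortestPaths` measures the distance from each vertex to the landmark while respecting edge direction, so the search walks the edges in reverse starting at '1'. The names `landmark`, `predecessors`, and `distances` are illustrative.

```python
from collections import defaultdict, deque

# Same (src, dst, type) edges as the notebook's `e` DataFrame.
edges = [
    ('1', '2', 'friend'), ('2', '1', 'friend'), ('3', '1', 'friend'),
    ('1', '3', 'friend'), ('2', '3', 'follows'), ('3', '4', 'friend'),
    ('4', '3', 'friend'), ('5', '3', 'friend'), ('3', '5', 'friend'),
    ('4', '5', 'follows'), ('98', '99', 'friend'), ('99', '98', 'friend'),
]
landmark = '1'

# Reverse adjacency: for each vertex, the vertices that have an edge into it.
predecessors = defaultdict(list)
for src, dst, _type in edges:
    predecessors[dst].append(src)

# BFS from the landmark along reversed edges gives, for every vertex that can
# reach the landmark, the number of hops on its shortest path to the landmark.
distances = {landmark: 0}
queue = deque([landmark])
while queue:
    current = queue.popleft()
    for prev in predecessors[current]:
        if prev not in distances:
            distances[prev] = distances[current] + 1
            queue.append(prev)

print(distances)
```

Vertices '98' and '99' never appear in `distances` because no path connects them to '1'. In this particular graph the hop counts are the same whether the paths are measured to or from '1'.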

##### Find the connected components of the graph
Hint - [`GraphFrame.connectedComponents`](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.connectedComponents)

NOTE - A random number may be assigned to each component

In [None]:
<<< YOUR CODE HERE >>>
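
Since the component labels are arbitrary, it is easier to verify the grouping than the labels themselves. The sketch below groups the vertices with a small union-find in plain Python, treating edges as undirected. It is not how GraphFrames implements the algorithm; it is only a way to check which vertices should share a component. The helpers `find` and `union` and the variable names are illustrative.

```python
# Same (src, dst, type) edges and vertex IDs as the notebook's DataFrames.
edges = [
    ('1', '2', 'friend'), ('2', '1', 'friend'), ('3', '1', 'friend'),
    ('1', '3', 'friend'), ('2', '3', 'follows'), ('3', '4', 'friend'),
    ('4', '3', 'friend'), ('5', '3', 'friend'), ('3', '5', 'friend'),
    ('4', '5', 'follows'), ('98', '99', 'friend'), ('99', '98', 'friend'),
]
vertex_ids = ['1', '2', '3', '4', '5', '98', '99']

# Union-find: every vertex starts as the root of its own component.
parent = {v: v for v in vertex_ids}

def find(v):
    """Return the root of v's component, shortening the path along the way."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v

def union(a, b):
    """Merge the components that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a

# Connected components ignore edge direction, so each edge simply merges
# the components of its two endpoints.
for src, dst, _type in edges:
    union(src, dst)

# Group the vertices by the root of their component.
components = {}
for v in vertex_ids:
    components.setdefault(find(v), []).append(v)

for members in components.values():
    print(members)
```

Vertices that appear together in one printed group should share the same `component` value in the GraphFrames result.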