# Working with GraphFrames

### How to enable GraphFrames in PySpark Jupyter Notebooks
1. Download the GraphFrames jar that matches your Spark version from the GraphFrames page on Spark Packages (https://spark-packages.org/package/graphframes/graphframes), for example:  
`ubuntu@ip-172-30-2-158:/home/ubuntu$ wget http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.6.0-spark2.3-s_2.11/graphframes-0.6.0-spark2.3-s_2.11.jar`
2. Add the jar to `PYTHONPATH` in `.bashrc`:  
`SPARK_HOME="/home/ubuntu/spark-2.4.0-bin-hadoop2.7"  
ANACONDA_HOME="/home/ubuntu/anaconda3/envs/spark"  
PYSPARK_PYTHON="$ANACONDA_HOME/bin/python"  
PYSPARK_DRIVER_PYTHON="$ANACONDA_HOME/bin/python"  
PYTHONPATH="$ANACONDA_HOME/bin/python"  
export PATH="$ANACONDA_HOME/bin:$SPARK_HOME/bin:$PATH"  
export PYTHONPATH="$PYTHONPATH:/home/ubuntu/graphframes-0.6.0-spark2.3-s_2.11.jar:."  
export SPARK_HOME` 

3. Spark resolves the GraphFrames jar and its transitive dependencies from their Maven coordinates, using Ivy. Launch `spark-shell` once with the following arguments so that it downloads all of the GraphFrames dependencies:  
`ubuntu@ip-172-30-2-158:/home/ubuntu$ spark-shell --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 --jars graphframes-0.6.0-spark2.3-s_2.11.jar`
4. The downloaded jars are stored in the local Ivy cache, `/home/ubuntu/.ivy2/jars`:  
`ubuntu@ip-172-30-2-158:/home/ubuntu/.ivy2/jars$ ls
com.typesafe.scala-logging_scala-logging-api_2.11-2.1.2.jar    org.scala-lang_scala-reflect-2.11.0.jar
com.typesafe.scala-logging_scala-logging-slf4j_2.11-2.1.2.jar  org.slf4j_slf4j-api-1.7.7.jar
graphframes_graphframes-0.6.0-spark2.3-s_2.11.jar`
5. Copy all the jars in `/home/ubuntu/.ivy2/jars` to Spark's `jars` directory:  
`ubuntu@ip-172-30-2-158:/home/ubuntu/.ivy2/jars$ cp *.jar /home/ubuntu/spark-2.4.0-bin-hadoop2.7/jars`
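
Steps 4 and 5 can also be scripted. The following is a minimal sketch and not part of the original setup. `ivy_jar_name` and `copy_jars` are hypothetical helper names, and the filename pattern `<group>_<artifact>-<version>.jar` is inferred from the directory listing in step 4. The paths are the ones used above, so adjust them to your installation.

```python
import shutil
from pathlib import Path


def ivy_jar_name(coordinate: str) -> str:
    """Map 'group:artifact:version' to the jar filename seen in ~/.ivy2/jars.

    The pattern is an assumption inferred from the listing in step 4.
    """
    group, artifact, version = coordinate.split(":")
    return f"{group}_{artifact}-{version}.jar"


def copy_jars(src_dir: Path, dst_dir: Path) -> list:
    """Copy every *.jar in src_dir into dst_dir and return the copied names."""
    copied = []
    for jar in sorted(src_dir.glob("*.jar")):
        shutil.copy2(jar, dst_dir / jar.name)
        copied.append(jar.name)
    return copied


if __name__ == "__main__":
    # Paths taken from the steps above; they only exist on that machine.
    ivy_jars = Path("/home/ubuntu/.ivy2/jars")
    spark_jars = Path("/home/ubuntu/spark-2.4.0-bin-hadoop2.7/jars")
    expected = ivy_jar_name("graphframes:graphframes:0.6.0-spark2.3-s_2.11")

    if ivy_jars.is_dir() and spark_jars.is_dir():
        copied = copy_jars(ivy_jars, spark_jars)
        print("copied:", copied)
        print("GraphFrames jar present:", expected in copied)
    else:
        print("expected jar name:", expected)
```

Checking for the expected jar name after the copy is a quick way to confirm that step 3 actually downloaded GraphFrames before starting the notebook.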

In [1]:
# Only SparkSession is used in this notebook.
from pyspark.sql import SparkSession

In [2]:
spark = (SparkSession
         .builder
         .master("local[*]")
         .appName("working-with-graphframes")
         .getOrCreate())

In [3]:
spark

In [4]:
# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])

In [5]:
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])

In [6]:
# Create a GraphFrame
from graphframes import GraphFrame
g = GraphFrame(v, e)

In [7]:
# Query: Get in-degree of each vertex.
g.inDegrees.show()

+---+--------+
| id|inDegree|
+---+--------+
|  c|       1|
|  b|       2|
+---+--------+
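
To see where these numbers come from, here is a plain-Python sketch that needs no Spark. It counts, for each vertex, the edges that point at it, using the same edge list as the Edge DataFrame above. It illustrates the definition of in-degree only and is not how GraphFrames computes it.

```python
from collections import Counter

# Same edge list as the Edge DataFrame above: (src, dst, relationship).
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]

# In-degree of a vertex = number of edges whose dst is that vertex.
in_degrees = Counter(dst for _src, dst, _rel in edges)

for vertex, degree in sorted(in_degrees.items()):
    print(vertex, degree)
```

Vertex `a` has no incoming edges, so it does not appear in the count, just as it is missing from the table above.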



In [None]:
# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

2

In [None]:
# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
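
As a rough illustration of what `resetProbability` and `maxIter` control, here is a textbook-style PageRank iteration in plain Python. It is a sketch under simplifying assumptions (every vertex starts at rank 1.0 and has at least one outgoing edge), and `pagerank_sketch` is a hypothetical name. It is not the GraphFrames implementation, so the values GraphFrames reports may differ.

```python
def pagerank_sketch(vertices, edges, reset_probability=0.15, max_iter=20):
    """Run a fixed number of simple PageRank iterations.

    Assumes every vertex has at least one outgoing edge (no dangling nodes).
    """
    out_degree = {v: 0 for v in vertices}
    for src, _dst in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        # Each vertex splits its current rank evenly across its out-edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        # Blend the constant reset term with the rank received from neighbours.
        ranks = {
            v: reset_probability + (1.0 - reset_probability) * incoming[v]
            for v in vertices
        }
    return ranks


# Same graph as above, with the relationship column dropped.
ranks = pagerank_sketch(
    ["a", "b", "c"],
    [("a", "b"), ("b", "c"), ("c", "b")],
    reset_probability=0.01,
    max_iter=20,
)
for vertex in sorted(ranks):
    print(vertex, ranks[vertex])
```

In this sketch a vertex with no incoming edges, such as `a`, ends up with exactly the reset term, while `b` and `c` keep passing most of the rank back and forth between each other.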

In [None]:
spark.stop()