# Spark Learning Note - Graph Analysis
Jia Geng | gjia0214@gmail.com


## 1. Graph Analysis with Spark

Graphs are a natural way of describing relationships and many different problem sets, and Spark provides several ways of working in this analytics paradigm. Some business use cases include:
- detecting credit card fraud
- motif finding
- determining importance of papers in bibliographic networks
- page ranking

Spark has an RDD-based library for performing graph processing: `GraphX`. `GraphX` provides a low-level interface that is extremely powerful but is not easy to use or optimize.

`GraphFrames` extends `GraphX` to provide a DataFrame-level API and support for Spark's different language bindings, so that Python users can take advantage of the tool's scalability.


In [None]:
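# Shell commands that launch PySpark with the GraphFrames package.
# Note: `set VAR=value` is Windows cmd syntax; on Linux or macOS use `export VAR=value` in the shell before launching.
!set PYSPARK_DRIVER_PYTHON=jupyter
!set PYSPARK_DRIVER_PYTHON_OPTS=notebook
!pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11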

Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /home/jgeng/.ivy2/cache
The jars for the packages stored in: /home/jgeng/.ivy2/jars
:: loading settings :: url = jar:file:/home/jgeng/anaconda3/envs/ml/lib/python3.7/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-75448d96-b263-4711-9f19-5d1b1e73d4e8;1.0
	confs: [default]
	found graphframes#graphframes;0.6.0-spark2.3-s_2.11 in spark-packages
	found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
	found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
	found org.scala-lang#scala-reflect;2.11.0 in central
	found org.slf4j#slf4j-api;1.7.7 in central
:: resolution report :: resolve 185ms :: artifacts dl 5ms
	:: modules in u

In [None]:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName('Example').getOrCreate()
spark

Traceback (most recent call last):
  File "/home/jgeng/anaconda3/envs/ml/bin/jupyter", line 11, in <module>
    sys.exit(main())
  File "/home/jgeng/anaconda3/envs/ml/lib/python3.7/site-packages/jupyter_core/command.py", line 247, in main
    command = _jupyter_abspath(subcommand)
  File "/home/jgeng/anaconda3/envs/ml/lib/python3.7/site-packages/jupyter_core/command.py", line 134, in _jupyter_abspath
    'Jupyter command `{}` not found.'.format(jupyter_subcommand)
Exception: Jupyter command `jupyter-/home/jgeng/anaconda3/envs/ml/bin/find_spark_home.py` not found.
/home/jgeng/anaconda3/envs/ml/bin/pyspark: line 24: /bin/load-spark-env.sh: No such file or directory
/home/jgeng/anaconda3/envs/ml/bin/pyspark: line 77: /bin/spark-submit: No such file or directory


In [None]:
station_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/bike-data/201508_station_data.csv'
trip_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/bike-data/201508_trip_data.csv'
stations = spark.read.option('header', True).option('inferSchema', True).csv(station_path)
trips = spark.read.option('header', True).option('inferSchema', True).csv(trip_path)

In [None]:
# check data and nulls

stations.show(3)
stations.printSchema()
stations.cache()
print(stations.count())
print(stations.na.drop().count())

trips.show(3)
trips.printSchema()
trips.cache()
total = trips.count()
print(total)
print(trips.na.drop().count())

In [None]:
# Check nulls per column in the trip data.
# SQL expression strings cannot parse unquoted column names that contain
# spaces or special characters such as '#', so rename those columns first.
# `total` is the trip row count computed in the previous cell.
trips = spark.read.option('header', True).option('inferSchema', True).csv(trip_path)
for col_name in trips.columns:
    col_name_formatted = col_name.replace(' ', '_')
    col_name_formatted = col_name_formatted.replace('#', 'No')
    trips = trips.withColumnRenamed(col_name, col_name_formatted)
    n = trips.where('{} is null'.format(col_name_formatted)).count()
    print('Column {} num of nulls: {}/{}'.format(col_name_formatted, n, total))
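
The renaming rule used in the loop above can be tried without a Spark session. The sketch below is plain Python; the helper name `sanitize_column_name` and the sample header names are illustrative assumptions (the headers are guessed from the renamed columns that appear later in this note), not part of the original data loading code.

```python
def sanitize_column_name(name):
    """Replace spaces with underscores and '#' with 'No', as in the loop above."""
    return name.replace(' ', '_').replace('#', 'No')

# Hypothetical raw header names, assumed from the renamed columns shown later.
raw_columns = ['Trip ID', 'Start Station', 'Bike #', 'Zip Code']

# Map each raw name to its SQL-friendly form.
renamed = {name: sanitize_column_name(name) for name in raw_columns}
for old, new in renamed.items():
    print('{!r} -> {!r}'.format(old, new))
```

Keeping the rule in one small function makes it easy to apply the same renaming to both the trip and station data if needed.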

## 2. Build a Graph

In Spark, to build a graph we need to define the vertices and edges as DataFrames with specifically named columns. For example, to build a directed graph, we need to
- name the vertex identifier column `id`
- name the edge source and destination columns `src` and `dst`
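
To make the `id` / `src` / `dst` convention concrete, here is a minimal plain-Python sketch (no Spark required) that uses the same toy data as the GraphFrame example in this section. It checks that every edge endpoint refers to an existing vertex `id` and counts in- and out-degrees. The variable names are illustrative assumptions; this is not how GraphFrames itself is implemented.

```python
from collections import Counter

# Vertices: the first field plays the role of the `id` column.
vertices = [
    ("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30),
    ("d", "David", 29), ("e", "Esther", 32), ("f", "Fanny", 36),
    ("g", "Gabby", 60),
]

# Edges: the first two fields play the role of the `src` and `dst` columns.
edges = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"),
]

vertex_ids = {v[0] for v in vertices}

# Edges whose src or dst does not match any vertex id.
dangling = [(src, dst) for src, dst, _ in edges
            if src not in vertex_ids or dst not in vertex_ids]

# In a directed graph, each edge adds one out-degree to src and one in-degree to dst.
out_degree = Counter(src for src, _, _ in edges)
in_degree = Counter(dst for _, dst, _ in edges)

print('dangling edges:', dangling)
for vid in sorted(vertex_ids):
    print(vid, 'in:', in_degree[vid], 'out:', out_degree[vid])
```

The same kind of check is worth running on real data before building a graph, since station names in the trip data have to match the station names used as vertex ids.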


In [None]:
from graphframes import GraphFrame

vertices = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)], ["id", "name", "age"])

edges = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
print(g)

In [5]:
# rename the station name column to id so it serves as the vertex identifier
stationVertices = stations.withColumnRenamed('name', 'id').distinct()
stationVertices.show(1)

# rename the start and end station columns to src and dst for the edges
tripEdges = (trips.withColumnRenamed('Start_Station', 'src')
                  .withColumnRenamed('End_Station', 'dst'))
tripEdges.show(1)

+----------+--------------------+---------+-----------+---------+-------------+------------+
|station_id|                  id|      lat|       long|dockcount|     landmark|installation|
+----------+--------------------+---------+-----------+---------+-------------+------------+
|        46|Washington at Kea...|37.795425|-122.404767|       15|San Francisco|   8/19/2013|
+----------+--------------------+---------+-----------+---------+-------------+------------+
only showing top 1 row

+-------+--------+---------------+--------------------+--------------+---------------+--------------------+------------+-------+---------------+--------+
|Trip_ID|Duration|     Start_Date|                 src|Start_Terminal|       End_Date|                 dst|End_Terminal|Bike_No|Subscriber_Type|Zip_Code|
+-------+--------+---------------+--------------------+--------------+---------------+--------------------+------------+-------+---------------+--------+
| 913460|     765|8/31/2015 23:26|Harry Bridges P

In [6]:
from graphframes import GraphFrame

??GraphFrame
stationGraph = GraphFrame(stationVertices, tripEdges)
stationGraph.cache()

Py4JJavaError: An error occurred while calling o90.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
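
A likely cause of the `ClassNotFoundException` above is that the GraphFrames JAR was not on the JVM classpath when the Spark session started, for example because the session was created without the `--packages` option. One possible workaround, sketched below under that assumption, is to set the `PYSPARK_SUBMIT_ARGS` environment variable before the first `SparkSession` is created (in a notebook this typically means restarting the kernel first). The helper name `build_submit_args` is hypothetical, and the package coordinate is copied from the launch command earlier in this note; it has to match the installed Spark and Scala versions.

```python
import os

# Package coordinate copied from the `pyspark --packages` command earlier in this note.
GRAPHFRAMES_PACKAGE = "graphframes:graphframes:0.6.0-spark2.3-s_2.11"

def build_submit_args(package):
    """Return a PYSPARK_SUBMIT_ARGS value that asks Spark to fetch `package`."""
    # PySpark expects the value to end with the `pyspark-shell` token.
    return "--packages {} pyspark-shell".format(package)

# Must run before the JVM is launched, i.e. before SparkSession.builder...getOrCreate().
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(GRAPHFRAMES_PACKAGE)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

After this runs, creating the session as in the earlier cell should let Spark resolve the package at startup, provided the machine can reach the package repository.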
