# Working with Graphs in Spark
In this lab, you will learn some of the functionality of Spark GraphFrames, the next-generation library for working with graphs on Spark.

First, you need to import the graphframes library, which is not installed by default. The post-startup script downloaded the library (packaged as a jar), extracted it, and made it available for you to use.
 
There are still a couple of things you need to do to get this to work in this setup:

In [None]:
# import findspark and os and let findspark find all the environment variables
import findspark
import os
findspark.init()

In [None]:
# Before you create the SparkSession, you need to add a new environment variable 
# to tell pyspark where the graphframes library is
SUBMIT_ARGS = "--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

In [None]:
# Since you added some new environment variables, you want to make
# sure that the Spark configuration sees it
import pyspark
conf = pyspark.SparkConf()

In [None]:
# Create the SparkSession using the configuration
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("graphx-lab").config(conf=conf).getOrCreate()

In [None]:
# Import the graphframes library
from graphframes import *

You will be using data from the Bay Area Bike Share Portal (a service similar to Capital Bikeshare in DC).

In the following two cells, read in two CSV files located in S3:

*   s3://bigdatateaching/bike-data/station_data.csv
*   s3://bigdatateaching/bike-data/trip_data.csv

The station file contains the metadata of the bicycle stations, and the trip file contains all the bike trips.

Read in the two files:
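
A minimal sketch of the read step, assuming both files start with a header row (the later steps refer to columns by name) and that the `spark` session created above can read `s3://` paths. The helper name `read_bike_csv`, the `inferSchema=True` choice, and the variable names in the usage comment are illustrative assumptions, not part of the lab.

```python
def read_bike_csv(spark, path):
    """Read one CSV file into a DataFrame, using its first row as column names."""
    # header=True and inferSchema=True are assumptions about the file layout
    return spark.read.csv(path, header=True, inferSchema=True)


# Assumed usage in the notebook (variable names are illustrative):
# stations = read_bike_csv(spark, "s3://bigdatateaching/bike-data/station_data.csv")
# trips = read_bike_csv(spark, "s3://bigdatateaching/bike-data/trip_data.csv")
```

Calling `printSchema()` and `show(5)` on each result is a quick way to check that the column names were picked up as expected.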

Explore the datasets:

You will now modify the two DataFrames read in above to create a vertex
list and an edge list.

In the next cell, use the station data and rename the "name" column to "id" and get distinct records:

In the next cell, use the trip data and rename the "Start Station" column to "src" and the "End Station" column to "dst".
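
The two renaming steps above could be sketched as follows. The function names `make_vertices` and `make_edges` are hypothetical; the column names come from the instructions. GraphFrames expects the vertex DataFrame to have an `id` column and the edge DataFrame to have `src` and `dst` columns, which is why these particular names are used.

```python
def make_vertices(stations):
    """Rename "name" to "id" and keep only distinct station records."""
    return stations.withColumnRenamed("name", "id").distinct()


def make_edges(trips):
    """Rename the start and end station columns to "src" and "dst"."""
    return (
        trips
        .withColumnRenamed("Start Station", "src")
        .withColumnRenamed("End Station", "dst")
    )


# Assumed usage (the DataFrame variable names are illustrative):
# vertices = make_vertices(stations)
# edges = make_edges(trips)
```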

In the next cell, create a GraphFrame by passing in a vertex list and an edge list. Which of your
original datasets supplies each one?

Since you will be using the GraphFrame more than once, it is best to cache it.
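
One way to build and cache the graph is sketched below. The helper name `build_station_graph` is hypothetical; the result is called `station_graph` in the usage comment because the motif cell later in this lab uses that name. The sketch assumes the vertex DataFrame has an `id` column and the edge DataFrame has `src` and `dst` columns.

```python
def build_station_graph(vertices, edges):
    """Create a GraphFrame from a vertex list and an edge list, then cache it."""
    # Local import keeps this sketch self-contained
    from graphframes import GraphFrame

    graph = GraphFrame(vertices, edges)
    # Mark the graph's DataFrames for caching so repeated queries can reuse them
    graph.cache()
    return graph


# Assumed usage (vertices and edges are illustrative variable names):
# station_graph = build_station_graph(vertices, edges)
```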

## Graph metadata

Count the number of vertices in the graph:

Count the number of edges in the graph:
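
Both counts can be read off the DataFrames that a GraphFrame exposes as `vertices` and `edges`. A small sketch, with `graph_size` as a hypothetical helper name:

```python
def graph_size(graph):
    """Return (number of vertices, number of edges) for a GraphFrame."""
    return graph.vertices.count(), graph.edges.count()


# Assumed usage:
# n_stations, n_trips = graph_size(station_graph)
```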

# Querying the Graph
The most basic way of interacting with the graph is querying it. Since the GraphFrame is based on DataFrames, you can perform the same type of operations you would on a DataFrame.

In the next cell, show the top 10 source and destination combinations, ordered in descending order by count:
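
A possible shape for this query, written as a hypothetical helper `top_trip_pairs` that takes the graph's edge DataFrame. Grouping by `src` and `dst` and calling `count()` produces a column named `count`, which is then used for sorting.

```python
def top_trip_pairs(edges, n=10):
    """Return the n most frequent (src, dst) combinations, most frequent first."""
    return (
        edges
        .groupBy("src", "dst")
        .count()
        .orderBy("count", ascending=False)
        .limit(n)
    )


# Assumed usage:
# top_trip_pairs(station_graph.edges).show(10, truncate=False)
```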

In the next cell, show the top 10 source and destination combinations **where the source or destination
station is 'Townsend at 7th'**, ordered in descending order by count:
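
The station filter can be applied before grouping. The sketch below uses a hypothetical helper `top_trip_pairs_for_station` and builds the condition from column expressions, so station names containing quotes would not need escaping.

```python
def top_trip_pairs_for_station(edges, station, n=10):
    """Top n (src, dst) combinations in which `station` is either endpoint."""
    # Column expression: true when the trip starts or ends at the station
    touches_station = (edges["src"] == station) | (edges["dst"] == station)
    return (
        edges
        .where(touches_station)
        .groupBy("src", "dst")
        .count()
        .orderBy("count", ascending=False)
        .limit(n)
    )


# Assumed usage:
# top_trip_pairs_for_station(station_graph.edges, "Townsend at 7th").show(10, truncate=False)
```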

# Subsetting a Graph
Sometimes you need to work with a subset of a graph. The easiest way to create a subset is to create a
new graph with the vertices and edges of your subset.

In the next cell, subset the edges where the source or destination station is 'Townsend at 7th', and
create a new graph called sg1 using the original vertices and the new edge list:
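
A sketch of the subsetting step, assuming the original graph is the `station_graph` used elsewhere in this lab. The helper name `subgraph_for_station` is hypothetical. It keeps every original vertex and only the edges that touch the given station.

```python
def subgraph_for_station(graph, station):
    """Build a new GraphFrame with all vertices but only edges touching `station`."""
    # Local import keeps this sketch self-contained
    from graphframes import GraphFrame

    edges = graph.edges
    sub_edges = edges.where((edges["src"] == station) | (edges["dst"] == station))
    return GraphFrame(graph.vertices, sub_edges)


# Assumed usage:
# sg1 = subgraph_for_station(station_graph, "Townsend at 7th")
```

Because the vertex list is unchanged, `sg1` still contains stations with no remaining edges; filtering the vertices as well is a separate design choice that the instructions do not ask for.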

# Motifs
Motifs are ways of expressing structural patterns in a graph. The following cell creates a triangle motif: (a)
signifies the starting station, and [ab] represents an edge from (a) to the next station (b). The pattern
repeats for stations (b) to (c) and then from (c) back to (a):

In [None]:
motifs = station_graph.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)")

The DataFrame we get from running this query contains nested fields for vertices a, b, and c, as well as
the respective edges. We can now query it as we would any DataFrame. For example, given a certain
bike, what is the shortest trip the bike has taken from station a, to station b, to station c, and back to
station a? The following logic parses the timestamps into Spark timestamps and then compares them
to make sure that it is the same bike, traveling from station to station, and that the start
times for each trip are in the correct order.
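
The query described above might look like the sketch below. Several details are assumptions that do not appear in this lab and should be checked against the trip data: the edge columns `Start Date`, `End Date`, and `Bike #`, and the timestamp format `MM/dd/yyyy HH:mm`. The helper name `shortest_round_trip` and the output column names are hypothetical.

```python
def shortest_round_trip(motifs, date_format="MM/dd/yyyy HH:mm"):
    """Return the quickest a -> b -> c -> a loop made by a single bike."""
    # Parse each leg's start time (assumed column name and format)
    start_columns = [
        f"to_timestamp({edge}.`Start Date`, '{date_format}') AS {edge}Start"
        for edge in ("ab", "bc", "ca")
    ]
    return (
        motifs
        .selectExpr("*", *start_columns)
        # Same bike on all three legs (assumed column name `Bike #`)
        .where("ab.`Bike #` = bc.`Bike #`")
        .where("bc.`Bike #` = ca.`Bike #`")
        # Consecutive stations must differ
        .where("a.id != b.id")
        .where("b.id != c.id")
        # Legs must start in order
        .where("abStart < bcStart")
        .where("bcStart < caStart")
        .selectExpr(
            "a.id AS station_a",
            "b.id AS station_b",
            "c.id AS station_c",
            "ab.`Start Date` AS first_leg_start",
            "ca.`End Date` AS last_leg_end",
            # Seconds from the start of the first leg to the start of the last leg
            "cast(caStart AS long) - cast(abStart AS long) AS seconds",
        )
        .orderBy("seconds")
        .limit(1)
    )


# Assumed usage:
# shortest_round_trip(motifs).show(1, truncate=False)
```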

# PageRank
One of the most prolific graph algorithms is PageRank. Larry Page, cofounder of Google, created PageRank as a research project for how to rank web pages. Unfortunately, a complete explanation of
how PageRank works is outside the scope of this lab. However, to quote Wikipedia, the high-level
explanation is as follows:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to
receive more links from other websites.

PageRank generalizes quite well outside of the web domain. We can apply this right to our own data and
get a sense for important bike stations (specifically, those that receive a lot of bike traffic). In this
example, important bike stations will be assigned large PageRank values:
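
To make the idea concrete, here is a plain-Python sketch of PageRank by repeated redistribution of scores on a tiny made-up trip list. It is an illustration only and not how GraphFrames computes its result; in particular, GraphFrames may scale the scores differently. The function name, the toy station labels, and the choice to normalize scores so they sum to 1 are all assumptions of this sketch.

```python
def pagerank(edges, reset_probability=0.15, max_iter=10):
    """Iterative PageRank over a list of (src, dst) pairs; scores sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with every node equally important
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Every node receives a baseline share from the random "reset" jump
        new_rank = {n: reset_probability / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing edges spreads its score over all nodes
            targets = out_links[src] or nodes
            share = (1 - reset_probability) * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Toy data: A, B, and C form a loop, and D only ever sends bikes to C
toy_trips = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "C")]
ranks = pagerank(toy_trips)
```

In this toy graph no trip ends at `D`, so `D` only ever receives the baseline reset share and ends up with the lowest score, while stations that receive trips accumulate more.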

In the next cells, you will run the PageRank algorithm on the stations graph dataset.
* Run the PageRank algorithm with a reset probability of 0.15 and a maximum of 10 iterations
* Show the top 10 vertices based on PageRank values
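
The two steps above can be sketched with the GraphFrames `pageRank` method, which returns a new GraphFrame whose vertices carry a `pagerank` column. The helper name `top_stations_by_pagerank` is hypothetical, and the sketch assumes the graph is the `station_graph` built earlier.

```python
def top_stations_by_pagerank(graph, n=10):
    """Run PageRank and return the n vertices with the highest scores."""
    ranked = graph.pageRank(resetProbability=0.15, maxIter=10)
    return (
        ranked.vertices
        .select("id", "pagerank")
        .orderBy("pagerank", ascending=False)
        .limit(n)
    )


# Assumed usage:
# top_stations_by_pagerank(station_graph).show(10, truncate=False)
```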