# Gourmet Meals Business -- SQL + Neo4j Project (Part 3.2)

Authors: **Brodie Deb, Carolyn Dunlap, and Ethan Moody**

Date: **December 2022**

### Business Case

You have been working at AGM as a data engineer for a couple of years. As a data engineer, you have had a great working relationship with the data science team, and they have been very impressed with your work and your suggestions. The data science team has several positions approved, but has been unsuccessful at filling them.  

In light of your successful work as a data engineer, your knowledge of the company and the business, your great working relationship with the data science team, and your studying to become a data scientist, the data science team has asked you to join their team as a data scientist. Likewise, there are a couple of other data engineers in a similar situation whom they have also asked to join the data science team. 

The data science team is looking to upgrade its skill set in terms of awareness of new technologies.   

Specifically, the data science team has no experience with NoSQL databases, and they want your team of new data scientists to give a quick overview of Neo4j, MongoDB, and Redis for the entire data science team, focusing on:

- Business examples which would involve **Neo4j**, **MongoDB**, and **Redis**
- How **Neo4j**, **MongoDB**, and **Redis** can be used to solve those business examples
- Why a relational database would not be a good fit for those business examples

A couple of days after you joined the data science team, the AGM executives met with the data science team in a special off-site meeting and discussed their vision for the future of the company. They let the data science team know that for their vision to be successful, they will be relying heavily on the data science team. The data science team is very anxious to demonstrate to the executives that they have the knowledge and skills to play a key role in implementing that vision. The data science team has asked that your business case examples be directly related to the vision of the future of the company.  

The executives relayed that all proofs of concept will take place in the CA Bay Area, as the Berkeley store is the original store, has the oldest and largest customer base, etc. They have also been in talks with BART regarding public transportation.

AGM executives' vision of the future of the company includes the following:

- Adding more pickup locations
- Using public transportation to transport deliveries
    - BART could avoid gridlock traffic
    - Special cars or even special trains could be used
- Using delivery drones
- Using delivery robots
- Hybrid combinations:
    - Adding pickup locations at BART stations would have the following advantages:
        - Lots of potential customers passing through the BART stations each day
        - Allow the use of BART to transport deliveries
    - Using BART for transportation would have the following advantages:
        - Traditional delivery trucks could pick up deliveries at the BART station and:
            - Run a local delivery route
            - Deliver a truck load to another local pickup location
        - Delivery drones could pick up deliveries at the BART station and deliver them locally
        - Delivery robots could pick up deliveries at the BART station and deliver them locally
        
You and the data science team have been tasked with exploring these options using SQL, Neo4j, and external research as needed.

## Notebook Overview/Steps

### [0] Set up neo4j/postgres base code

### [1] Create neo4j database

- Node = individual BART stations
- Relationships = number of customers located between two adjacent BART stations, within approximately 5 miles of the midpoint between them

### [2] Set up algorithms for database

- Louvain modularity to define neighborhoods of stations (weighted by # customers)
- Harmonic centrality to identify most central locations per neighborhood
- MST from Berkeley to verify all stations accessible from Berkeley (should be yes)
- Betweenness as another metric by which to identify key stations

## Included Modules and Packages

In [1]:
import neo4j
import csv
import math
import numpy as np
import pandas as pd
import psycopg2
import json
import gmaps
import gmaps.geojson_geometries
from geographiclib.geodesic import Geodesic
from IPython.display import display

In [2]:
#! pip install geopy #for looking at distances

In [3]:
from geopy.distance import geodesic

In [4]:
driver = neo4j.GraphDatabase.driver(uri="neo4j://neo4j:7687", auth=("neo4j","w205"))

In [5]:
session = driver.session(database="neo4j")

## Neo4j Supporting Functions

In [6]:
def my_neo4j_wipe_out_database():
    "wipe out database by deleting all nodes and relationships"
    
    # DETACH DELETE removes each node together with any relationships attached to it
    query = "match (node) detach delete node"
    session.run(query)

In [7]:
def my_neo4j_run_query_pandas(query, **kwargs):
    "run a query and return the results in a pandas dataframe"
    
    result = session.run(query, **kwargs)
    
    df = pd.DataFrame([r.values() for r in result], columns=result.keys())
    
    return df

In [8]:
def my_neo4j_number_nodes_relationships():
    "print the number of nodes and relationships"
   
    
    query = """
        match (n) 
        return n.name as node_name, labels(n) as labels
        order by n.name
    """
    
    df = my_neo4j_run_query_pandas(query)
    
    number_nodes = df.shape[0]
    
    
    query = """
        match (n1)-[r]->(n2) 
        return n1.name as node_name_1, labels(n1) as node_1_labels, 
            type(r) as relationship_type, n2.name as node_name_2, labels(n2) as node_2_labels
        order by node_name_1, node_name_2
    """
    
    df = my_neo4j_run_query_pandas(query)
    
    number_relationships = df.shape[0]
    
    print("-------------------------")
    print("  Nodes:", number_nodes)
    print("  Relationships:", number_relationships)
    print("-------------------------")


In [9]:
def my_neo4j_nodes_relationships():
    "print all the nodes and relationships"
   
    print("-------------------------")
    print("  Nodes:")
    print("-------------------------")
    
    query = """
        match (n) 
        return n.name as node_name, labels(n) as labels
        order by n.name
    """
    
    df = my_neo4j_run_query_pandas(query)
    
    number_nodes = df.shape[0]
    
    display(df)
    
    print("-------------------------")
    print("  Relationships:")
    print("-------------------------")
    
    query = """
        match (n1)-[r]->(n2) 
        return n1.name as node_name_1, labels(n1) as node_1_labels, 
            type(r) as relationship_type, n2.name as node_name_2, labels(n2) as node_2_labels
        order by node_name_1, node_name_2
    """
    
    df = my_neo4j_run_query_pandas(query)
    
    number_relationships = df.shape[0]
    
    display(df)
    
    density = (2 * number_relationships) / (number_nodes * (number_nodes - 1))
    
    print("-------------------------")
    print("  Density:", f'{density:.1f}')
    print("-------------------------")

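As a point of comparison for the density calculation above, here is a minimal sketch of the two standard density formulas. It assumes that the `(n1)-[r]->(n2)` pattern counts each direction of a two-way `LINK` separately; the function names and the toy numbers are hypothetical and not part of the notebook.

```python
def directed_density(num_nodes, num_directed_relationships):
    """Fraction of possible ordered node pairs that are connected."""
    if num_nodes < 2:
        return 0.0  # density is undefined for fewer than 2 nodes; return 0 here
    return num_directed_relationships / (num_nodes * (num_nodes - 1))


def undirected_density(num_nodes, num_undirected_connections):
    """Fraction of possible unordered node pairs that are connected."""
    if num_nodes < 2:
        return 0.0
    return (2 * num_undirected_connections) / (num_nodes * (num_nodes - 1))


# Hypothetical toy graph: 4 stations on a line, each connection stored two-way.
toy_nodes = 4
toy_connections = 3                       # undirected station-to-station connections
toy_relationships = 2 * toy_connections   # directed relationships stored in the graph

print(directed_density(toy_nodes, toy_relationships))
print(undirected_density(toy_nodes, toy_connections))
```

Both formulas give the same value when each is fed the matching count. Feeding a directed relationship count into the undirected formula would double the result, which is worth keeping in mind when reading the density printed by the function above.
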
In [10]:
def my_neo4j_create_node(station_name):
    "create a node with label Station"
    
    query = """
    
    CREATE (:Station {name: $station_name})
    
    """
    
    session.run(query, station_name=station_name)

In [11]:
def my_neo4j_create_relationship_one_way(from_station, to_station, weight):
    "create a relationship one way between two stations with a weight"
    
    query = """
    
    MATCH (from:Station), 
          (to:Station)
    WHERE from.name = $from_station and to.name = $to_station
    CREATE (from)-[:LINK {weight: $weight}]->(to)
    
    """
    
    session.run(query, from_station=from_station, to_station=to_station, weight=weight)

In [12]:
def my_neo4j_create_relationship_two_way(from_station, to_station, weight):
    "create relationships two way between two stations with a weight"
    
    query = """
    
    MATCH (from:Station), 
          (to:Station)
    WHERE from.name = $from_station and to.name = $to_station
    CREATE (from)-[:LINK {weight: $weight}]->(to),
           (to)-[:LINK {weight: $weight}]->(from)
    
    """
    
    session.run(query, from_station=from_station, to_station=to_station, weight=weight)

## Additional Setup Code

In [13]:
# Function to run a select query and return rows in a pandas dataframe
# Note: pandas formats all numeric values from postgres as float

def my_select_query_pandas(query, rollback_before_flag, rollback_after_flag):
    "Function to run a select query and return rows in a pandas dataframe"
    
    if rollback_before_flag:
        connection.rollback()
    
    df = pd.read_sql_query(query, connection)
    
    if rollback_after_flag:
        connection.rollback()
    
    # Fix any float columns that really should be integers
    
    for column in df:
    
        if df[column].dtype == "float64":

            fraction_flag = False

            for value in df[column].values:
                
                if not np.isnan(value):
                    if value - math.floor(value) != 0:
                        fraction_flag = True

            if not fraction_flag:
                df[column] = df[column].astype('Int64')
    
    return(df)

In [14]:
# Function to read a csv file and print a set number of rows

def my_read_csv_file(file_name, limit):
    "Read the csv file and print only the first 'limit' rows"
    
    # Use a context manager so the file is closed once reading is done
    with open(file_name, "r", newline="") as csv_file:
    
        csv_data = csv.reader(csv_file)
    
        i = 0
    
        for row in csv_data:
            i += 1
            if i <= limit:
                print(row)
            
    print("\nPrinted ", min(limit, i), "lines of ", i, "total lines.")

## Supporting Functions for Distance Calculations

In [15]:
# Function to calculate a box on a map, given a point and miles

def my_calculate_box(point, miles):
    "Given a point and miles, calculate the box in form left, right, top, bottom"
    
    geod = Geodesic.WGS84

    kilometers = miles * 1.60934
    meters = kilometers * 1000

    g = geod.Direct(point[0], point[1], 270, meters)
    left = (g["lat2"], g["lon2"])

    g = geod.Direct(point[0], point[1], 90, meters)
    right = (g["lat2"], g["lon2"])

    g = geod.Direct(point[0], point[1], 0, meters)
    top = (g["lat2"], g["lon2"])

    g = geod.Direct(point[0], point[1], 180, meters)
    bottom = (g["lat2"], g["lon2"])
    
    return(left, right, top, bottom)

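To build intuition for what `my_calculate_box` returns, here is a rough, standard-library-only sketch of the same idea. It is not the geodesic calculation used above: it assumes a fixed 69 miles per degree of latitude and scales longitude by the cosine of the latitude. The function name, the constant, and the return layout are assumptions made for illustration.

```python
import math

MILES_PER_DEGREE_LATITUDE = 69.0  # rough average; an assumption for this sketch


def approximate_box(point, miles):
    """Return (min_lat, max_lat, min_lon, max_lon) around a (lat, lon) point."""
    lat, lon = point
    delta_lat = miles / MILES_PER_DEGREE_LATITUDE
    # A degree of longitude shrinks with the cosine of the latitude.
    delta_lon = miles / (MILES_PER_DEGREE_LATITUDE * math.cos(math.radians(lat)))
    return (lat - delta_lat, lat + delta_lat, lon - delta_lon, lon + delta_lon)


# Midpoint between the Dublin and West Dublin stations, 5 miles in each direction.
min_lat, max_lat, min_lon, max_lon = approximate_box((37.700694, -121.913753), 5)
print(round(min_lat, 3), round(max_lat, 3))
print(round(min_lon, 3), round(max_lon, 3))
```

The approximation should land within a few thousandths of a degree of the geodesic result at Bay Area latitudes, which is small compared with the size of a zip code. The geodesic version remains the more accurate choice.
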
In [16]:
# Function to sum up population for all zip codes within a certain distance from a station

def my_station_get_zips(box):
    "Given a box (left, right, top, bottom), pull all zip codes inside it and sum their population"
    
    connection.rollback()

    (left, right, top, bottom) = box
    
    query = "select zip, population from zip_codes "
    query += " where latitude >= " + str(bottom[0])
    query += " and latitude <= " + str(top [0])
    query += " and longitude >= " + str(left[1])
    query += " and longitude <= " + str(right[1])
    query += " order by 1 "

    cursor.execute(query)
    
    connection.rollback()
    
    rows = cursor.fetchall()
    
    total_population = 0
    
    for row in rows:
        # row[0] is the zip code (unused here); row[1] is its population
        population = row[1]
        total_population += population
        
    return total_population

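The query above is assembled by string concatenation. A possible alternative, sketched below, keeps the SQL text fixed and passes the box limits as parameters, using the same `%s` placeholder style psycopg2 accepts. The helper name `build_zip_query` and the sample box are hypothetical; the box layout is assumed to match the tuple returned by `my_calculate_box`.

```python
def build_zip_query(box):
    """Return (query, params) selecting zip codes inside a box.

    box is (left, right, top, bottom), each a (latitude, longitude) pair.
    """
    left, right, top, bottom = box
    query = (
        "select zip, population from zip_codes "
        "where latitude >= %s and latitude <= %s "
        "and longitude >= %s and longitude <= %s "
        "order by 1"
    )
    # Order must match the placeholders: south, north, west, east limits.
    params = (bottom[0], top[0], left[1], right[1])
    return query, params


# Hypothetical box, for illustration only.
sample_box = ((37.70, -122.00), (37.70, -121.82), (37.77, -121.91), (37.63, -121.91))
query, params = build_zip_query(sample_box)
print(query)
print(params)
```

With an open cursor, the pair could then be run as `cursor.execute(query, params)`, which leaves value formatting and quoting to the driver.
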
In [17]:
# Set up connection to postgres
# Note: All connection inputs below have been removed for protection
connection = psycopg2.connect(
    user = "",
    password = "",
    host = "",
    port = "",
    database = ""
)

In [18]:
cursor = connection.cursor()

## Drop SQL Tables

In [19]:
# Query drops the table housing BART station-to-station customer population data, if it already exists

connection.rollback()

query = """

drop table if exists final_table;

"""

cursor.execute(query)

connection.commit()

In [20]:
# Query drops the table housing BART station data, if it already exists

connection.rollback()

query = """

drop table if exists stations;

"""

cursor.execute(query)

connection.commit()

In [21]:
# Query drops the table housing BART lines data, if it already exists

connection.rollback()

query = """

drop table if exists lines;

"""

cursor.execute(query)

connection.commit()

In [22]:
# Query drops the table housing BART travel times data, if it already exists

connection.rollback()

query = """

drop table if exists travel_times;

"""

cursor.execute(query)

connection.commit()

## Create final_table, stations table, lines table, and travel_times table

In [23]:
# Query creates a table to house BART station pairs and customer population data from the final_table_only_clean.csv file

connection.rollback()

query = """

create table final_table (
  station_1 varchar(32),
  station_2 varchar(32),
  population integer,
  primary key (station_1, station_2)
)

;

"""

cursor.execute(query)

connection.commit()

In [24]:
# Query creates a table to house BART station data from the stations.csv file

connection.rollback()

query = """

create table stations (
  station varchar(32)
, latitude numeric(9,6)
, longitude numeric(9,6)
, transfer_time numeric(3)
, primary key (station)
)

;

"""

cursor.execute(query)

connection.commit()

In [25]:
# Query creates a table to house BART lines data from the lines.csv file

connection.rollback()

query = """

create table lines (
  line varchar(6),
  sequence numeric(2),
  station varchar(32),
  primary key (line, sequence)
);

"""

cursor.execute(query)

connection.commit()

In [26]:
# Query creates a table to house BART travel times data from the travel_times.csv file

connection.rollback()

query = """

create table travel_times (
  station_1 varchar(32),
  station_2 varchar(32),
  travel_time numeric(3),
  primary key (station_1, station_2)
);

"""

cursor.execute(query)

connection.commit()

## Load Tables

In [27]:
# Load BART station data from the stations.csv file into a database

connection.rollback()

query = """

copy stations
from '/user/projects/exercise/stations.csv' delimiter ',' NULL '' csv header;

"""

cursor.execute(query)

connection.commit()

In [28]:
# Load BART lines data from the lines.csv file into a database

connection.rollback()

query = """

copy lines
from '/user/projects/exercise/lines.csv' delimiter ',' NULL '' csv header;

"""

cursor.execute(query)

connection.commit()

In [29]:
# Load BART travel times data from the travel_times.csv file into a database

connection.rollback()

query = """

copy travel_times
from '/user/projects/exercise/travel_times.csv' delimiter ',' NULL '' csv header;

"""

cursor.execute(query)

connection.commit()

## Determine the Number of AGM Customers Between Stations

In [30]:
# Query returns the number of customers by zip code (only for zip codes present in zip_codes)

rollback_before_flag = True
rollback_after_flag = True

query = """

select
  t1_customers.zip as zip, round(count(t1_customers.customer_id),3) as num_customers

from customers as t1_customers

join zip_codes as t2_zipcodes
on t1_customers.zip = t2_zipcodes.zip

group by
  t1_customers.zip
, t2_zipcodes.zip
  
order by zip

;

"""

df_zip_cust = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

df_zip_cust

Unnamed: 0,zip,num_customers
0,33004,2
1,33009,42
2,33010,156
3,33012,127
4,33013,59
...,...,...
545,98403,2
546,98407,4
547,98416,2
548,98421,9


In [31]:
# Sanity check

df_zip_cust['num_customers'][df_zip_cust['zip']=='33002'].sum()

0

In [32]:
def my_station_get_zips_2(box):
    "Given a box (left, right, top, bottom), pull all zip codes inside it and sum their AGM customer counts"
    
    connection.rollback()

        
    (left, right, top, bottom) = box
    
    query = "select zip, population from zip_codes "
    query += " where latitude >= " + str(bottom[0])
    query += " and latitude <= " + str(top [0])
    query += " and longitude >= " + str(left[1])
    query += " and longitude <= " + str(right[1])
    query += " order by 1 "

    cursor.execute(query)
    
    connection.rollback()
    
    rows = cursor.fetchall()
    
    total_population = 0
    
    for row in rows:
        zip_c = str(row[0])
        population = df_zip_cust.loc[df_zip_cust['zip']==zip_c,'num_customers'].sum()
        total_population += population
  
    return total_population

In [33]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select
x.line,
x.sequence,
x.from_station as station_1,
z.latitude as st_1_lat,
z.longitude as st_1_long,
x.to_station as station_2,
q.latitude as st_2_lat,
q.longitude as st_2_long,
(z.latitude+q.latitude)/2 as mid_lat,
(z.longitude+q.longitude)/2 as mid_long

from (
select 
a.line,
a.sequence,
a.station as from_station, 
b.station as to_station, 
t.travel_time

from lines a
  join lines b
    on a.line = b.line and b.sequence = (a.sequence + 1)
  join travel_times t
    on (a.station = t.station_1 and b.station = t.station_2)
    or (a.station = t.station_2 and b.station = t.station_1)
order by line, from_station, to_station

) as x

join stations as z on x.from_station = z.station
join stations as q on x.to_station = q.station


order by line, sequence

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

In [34]:
df

Unnamed: 0,line,sequence,station_1,st_1_lat,st_1_long,station_2,st_2_lat,st_2_long,mid_lat,mid_long
0,blue,1,Dublin,37.701663,-121.899232,West Dublin,37.699726,-121.928273,37.700694,-121.913753
1,blue,2,West Dublin,37.699726,-121.928273,Castro Valley,37.690748,-122.075679,37.695237,-122.001976
2,blue,3,Castro Valley,37.690748,-122.075679,Bay Fair,37.697000,-122.126500,37.693874,-122.101090
3,blue,4,Bay Fair,37.697000,-122.126500,San Leandro,37.721764,-122.160684,37.709382,-122.143592
4,blue,5,San Leandro,37.721764,-122.160684,Coliseum,37.753611,-122.196944,37.737687,-122.178814
...,...,...,...,...,...,...,...,...,...,...
103,yellow,22,Balboa Park,37.721667,-122.447500,Daly City,37.706224,-122.468934,37.713946,-122.458217
104,yellow,23,Daly City,37.706224,-122.468934,Colma,37.684722,-122.466111,37.695473,-122.467523
105,yellow,24,Colma,37.684722,-122.466111,South San Francisco,37.664264,-122.444043,37.674493,-122.455077
106,yellow,25,South San Francisco,37.664264,-122.444043,San Bruno,37.638300,-122.416500,37.651282,-122.430272


In [35]:
# Half the geodesic distance between the two stations, in km (station-to-midpoint distance)
df['distance'] = df.apply(lambda x: geodesic((x['st_1_lat'], x['st_1_long']), (x['st_2_lat'],x['st_2_long'])).km/2,axis=1)

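For readers without geopy installed, the station-to-midpoint distance can be approximated with the haversine formula. This is a sketch under a spherical-Earth assumption, so it will differ slightly from the ellipsoidal `geodesic` result; the function name and radius constant are choices made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(point_1, point_2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, point_1)
    lat2, lon2 = map(math.radians, point_2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Coordinates of the first connection on the blue line.
dublin = (37.701663, -121.899232)
west_dublin = (37.699726, -121.928273)

half_distance_km = haversine_km(dublin, west_dublin) / 2
print(round(half_distance_km, 3))
```

For short hops like this one, the spherical result should sit within a few meters of the geodesic value computed in the notebook.
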
In [36]:
# Box reaching 5 miles from the midpoint in each compass direction (left, right, top, bottom)
df['coord'] = df.apply(lambda x:my_calculate_box((x['mid_lat'], x['mid_long']), 5),axis=1)

In [37]:
# Number of AGM customers in zip codes whose coordinates fall inside the box
df['population'] = df.apply(lambda x:my_station_get_zips_2(x['coord']),axis=1)

In [38]:
df

Unnamed: 0,line,sequence,station_1,st_1_lat,st_1_long,station_2,st_2_lat,st_2_long,mid_lat,mid_long,distance,coord,population
0,blue,1,Dublin,37.701663,-121.899232,West Dublin,37.699726,-121.928273,37.700694,-121.913753,1.285041,"((37.70065919732704, -122.00499706713151), (37...",26
1,blue,2,West Dublin,37.699726,-121.928273,Castro Valley,37.690748,-122.075679,37.695237,-122.001976,6.519280,"((37.69520170423249, -122.0932138788049), (37....",85
2,blue,3,Castro Valley,37.690748,-122.075679,Bay Fair,37.697000,-122.126500,37.693874,-122.101090,2.267810,"((37.69383870595695, -122.19232570869097), (37...",417
3,blue,4,Bay Fair,37.697000,-122.126500,San Leandro,37.721764,-122.160684,37.709382,-122.143592,2.039641,"((37.70934668633259, -122.2348472176549), (37....",701
4,blue,5,San Leandro,37.721764,-122.160684,Coliseum,37.753611,-122.196944,37.737687,-122.178814,2.382729,"((37.7376521504929, -122.27010395091831), (37....",1407
...,...,...,...,...,...,...,...,...,...,...,...,...,...
103,yellow,22,Balboa Park,37.721667,-122.447500,Daly City,37.706224,-122.468934,37.713946,-122.458217,1.275696,"((37.713910180556226, -122.54947781416072), (3...",1112
104,yellow,23,Daly City,37.706224,-122.468934,Colma,37.684722,-122.466111,37.695473,-122.467523,1.199737,"((37.6954377039339, -122.55876066799237), (37....",702
105,yellow,24,Colma,37.684722,-122.466111,South San Francisco,37.664264,-122.444043,37.674493,-122.455077,1.495487,"((37.674457730470955, -122.54628947294165), (3...",456
106,yellow,25,South San Francisco,37.664264,-122.444043,San Bruno,37.638300,-122.416500,37.651282,-122.430272,1.884950,"((37.6512467598127, -122.52145557664103), (37....",257


In [39]:
df = df[['line','station_1','station_2','population']]
df

Unnamed: 0,line,station_1,station_2,population
0,blue,Dublin,West Dublin,26
1,blue,West Dublin,Castro Valley,85
2,blue,Castro Valley,Bay Fair,417
3,blue,Bay Fair,San Leandro,701
4,blue,San Leandro,Coliseum,1407
...,...,...,...,...
103,yellow,Balboa Park,Daly City,1112
104,yellow,Daly City,Colma,702
105,yellow,Colma,South San Francisco,456
106,yellow,South San Francisco,San Bruno,257


## Final_table EDITS: Need to remove duplicate values

Remove duplicate values in final_table.csv

Export a clean csv

In [40]:
# Drop the line column (not relevant) and remove duplicate rows
df = df.drop(['line'], axis = 1)
df = df.drop_duplicates()
df

Unnamed: 0,station_1,station_2,population
0,Dublin,West Dublin,26
1,West Dublin,Castro Valley,85
2,Castro Valley,Bay Fair,417
3,Bay Fair,San Leandro,701
4,San Leandro,Coliseum,1407
...,...,...,...
88,Walnut Creek,Lafayette,818
89,Lafayette,Orinda,1154
90,Orinda,Rockridge,3448
91,Rockridge,MacArthur,3554


### MANUALLY DELETE 11 OVERLAPPING CONNECTIONS

- 11 overlapping connections from Lake Merritt through Fremont remain because the green and orange lines are listed in opposite directions
- Delete the rows at positions 34-44 (index labels 48-58) for these connections

In [41]:
# Check rows 34-44 to make sure they run from Lake Merritt toward Fremont, since we already have the same connections going from Fremont to Lake Merritt

df[34:45]

Unnamed: 0,station_1,station_2,population
48,Lake Merritt,Fruitvale,2426
49,Fruitvale,Coliseum,2442
50,Coliseum,San Leandro,1407
51,San Leandro,Bay Fair,701
52,Bay Fair,Hayward,371
53,Hayward,South Hayward,250
54,South Hayward,Union City,61
55,Union City,Fremont,35
56,Fremont,Warm Springs,8
57,Warm Springs,Milpitas,0


In [42]:
test = df.drop([48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58])
test

Unnamed: 0,station_1,station_2,population
0,Dublin,West Dublin,26
1,West Dublin,Castro Valley,85
2,Castro Valley,Bay Fair,417
3,Bay Fair,San Leandro,701
4,San Leandro,Coliseum,1407
5,Coliseum,Fruitvale,2442
6,Fruitvale,Lake Merritt,2426
7,Lake Merritt,West Oakland,3013
8,West Oakland,Embarcadero,2345
9,Embarcadero,Montgomery Street,1400


In [43]:
# Check shape - should have 51 connections

test.shape

(51, 3)

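The manual deletion above depends on fixed row labels. A possible alternative, sketched here in plain Python, treats each station pair as unordered so that a reversed duplicate is dropped automatically. The helper name is hypothetical, and the sample rows are a small subset of the connections shown earlier.

```python
def drop_reversed_duplicates(rows):
    """Keep the first occurrence of each unordered station pair."""
    seen = set()
    kept = []
    for station_1, station_2, population in rows:
        # frozenset ignores order, so (A, B) and (B, A) share one key
        key = frozenset((station_1, station_2))
        if key not in seen:
            seen.add(key)
            kept.append((station_1, station_2, population))
    return kept


rows = [
    ("Coliseum", "Fruitvale", 2442),
    ("Fruitvale", "Lake Merritt", 2426),
    ("Lake Merritt", "Fruitvale", 2426),   # same connection, opposite direction
    ("Fruitvale", "Coliseum", 2442),       # same connection, opposite direction
]

clean_rows = drop_reversed_duplicates(rows)
print(clean_rows)
```

This keeps whichever direction appears first, so the surviving rows would not depend on positions such as 34-44 staying stable between runs.
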
In [44]:
# Write clean.csv

test.to_csv('final_table_only_clean.csv', index = False)

In [45]:
# Read clean csv to double check

my_read_csv_file("final_table_only_clean.csv", limit = 10)

['station_1', 'station_2', 'population']
['Dublin', 'West Dublin', '26']
['West Dublin', 'Castro Valley', '85']
['Castro Valley', 'Bay Fair', '417']
['Bay Fair', 'San Leandro', '701']
['San Leandro', 'Coliseum', '1407']
['Coliseum', 'Fruitvale', '2442']
['Fruitvale', 'Lake Merritt', '2426']
['Lake Merritt', 'West Oakland', '3013']
['West Oakland', 'Embarcadero', '2345']

Printed  10 lines of  52 total lines.


## Load Clean CSV Table to SQL

Load data from the final_table_only_clean.csv file into the database table called *final_table*.

In [46]:
# Load station-pair customer data from the final_table_only_clean.csv file into a database

connection.rollback()

query = """

copy final_table
from '/user/projects/code/final_table_only_clean.csv' delimiter ',' NULL '' csv header;

"""

cursor.execute(query)

connection.commit()

## Check final_table Loaded Correctly

In [47]:
# Query everything to check

rollback_before_flag = True
rollback_after_flag = True

query = """

select *
from final_table
order by station_1, station_2

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

df

Unnamed: 0,station_1,station_2,population
0,12th Street,Lake Merritt,3049
1,12th Street,West Oakland,3043
2,16th Street Mission,24th Street Mission,1518
3,19th Street,12th Street,3079
4,24th Street Mission,Glen Park,1491
5,Antioch,Pittsburg Center,1
6,Ashby,MacArthur,3554
7,Balboa Park,Daly City,1112
8,Bay Fair,San Leandro,701
9,Berryessa,Milpitas,0


## Wipe Out Neo4j Database and Check

In [48]:
my_neo4j_wipe_out_database()

In [49]:
my_neo4j_number_nodes_relationships()

-------------------------
  Nodes: 0
  Relationships: 0
-------------------------


## Create Station Nodes

In [50]:
connection.rollback()

query = """

select station
from stations
order by station

"""

cursor.execute(query)

connection.rollback()

rows = cursor.fetchall()

for row in rows:
    
    station = row[0]
    
    my_neo4j_create_node(station)

## Verify the Number of Nodes and Relationships

In [51]:
my_neo4j_number_nodes_relationships()

-------------------------
  Nodes: 50
  Relationships: 0
-------------------------


## Create Relationships

Create relationships from the final_table, where each two-way relationship carries the number of customers located between the two stations.

In [52]:
connection.rollback()

query = """

select station_1, station_2, population
from final_table
order by station_1, station_2

"""

cursor.execute(query)

connection.rollback()

rows = cursor.fetchall()

for row in rows:
    
    station_1 = row[0]
    station_2 = row[1]
    population = row[2]
    
    my_neo4j_create_relationship_two_way(station_1, station_2, population)

## Verify the Number of Nodes and Relationships

In [53]:
my_neo4j_number_nodes_relationships()

-------------------------
  Nodes: 50
  Relationships: 102
-------------------------


In [54]:
my_neo4j_nodes_relationships()

-------------------------
  Nodes:
-------------------------


Unnamed: 0,node_name,labels
0,12th Street,[Station]
1,16th Street Mission,[Station]
2,19th Street,[Station]
3,24th Street Mission,[Station]
4,Antioch,[Station]
5,Ashby,[Station]
6,Balboa Park,[Station]
7,Bay Fair,[Station]
8,Berryessa,[Station]
9,Castro Valley,[Station]


-------------------------
  Relationships:
-------------------------


Unnamed: 0,node_name_1,node_1_labels,relationship_type,node_name_2,node_2_labels
0,12th Street,[Station],LINK,19th Street,[Station]
1,12th Street,[Station],LINK,Lake Merritt,[Station]
2,12th Street,[Station],LINK,West Oakland,[Station]
3,16th Street Mission,[Station],LINK,24th Street Mission,[Station]
4,16th Street Mission,[Station],LINK,Civic Center,[Station]
...,...,...,...,...,...
97,West Dublin,[Station],LINK,Castro Valley,[Station]
98,West Dublin,[Station],LINK,Dublin,[Station]
99,West Oakland,[Station],LINK,12th Street,[Station]
100,West Oakland,[Station],LINK,Embarcadero,[Station]


-------------------------
  Density: 0.1
-------------------------


In [55]:
## Notes: Extra relationship from Lake Merritt - double check the SQL table
## This is probably because of duplicates of station 1/2
## Future step, go back and fix in pandas - should have 49 connections between stations

## Minimum Spanning Tree (MST)

In [56]:
def my_neo4j_wipe_out_mst_relationships():
    "wipe out mst relationships"
    
    query = "match (node)-[relationship:MST]->() delete relationship"
    session.run(query)

In [57]:
my_neo4j_wipe_out_mst_relationships()

In [58]:
query = "CALL gds.graph.drop('ds_graph', false)"
session.run(query)

query = "CALL gds.graph.project('ds_graph', 'Station', 'LINK', {relationshipProperties: 'weight'})"
session.run(query)

<neo4j.work.result.Result at 0x7f63e9b3d910>

In [59]:
query = """

MATCH (n:Station {name: $source})
CALL gds.alpha.spanningTree.minimum.write('ds_graph',
                                          {startNodeId: id(n),
                                           relationshipWeightProperty: 'weight',
                                           writeProperty: 'MST',
                                           weightWriteProperty: 'writeCost'
                                          }
                                         )
YIELD preProcessingMillis, computeMillis, writeMillis, effectiveNodeCount
RETURN preProcessingMillis, computeMillis, writeMillis, effectiveNodeCount;
"""

source = "Downtown Berkeley"

my_neo4j_run_query_pandas(query, source=source)

Unnamed: 0,preProcessingMillis,computeMillis,writeMillis,effectiveNodeCount
0,0,1,7,50

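An `effectiveNodeCount` of 50 matches the number of station nodes, which is the check the MST step is meant to provide. To illustrate the idea outside Neo4j, here is a minimal pure-Python sketch of Prim's algorithm, a common way to grow a minimum spanning tree from a source node. The toy graph and function name are hypothetical, and this is not a description of how the GDS procedure is implemented.

```python
import heapq


def prim_mst(adjacency, source):
    """Return (tree_edges, reached_nodes) for a minimum spanning tree from source.

    adjacency maps each node to a dict of {neighbor: weight}.
    """
    reached = {source}
    heap = [(weight, source, neighbor) for neighbor, weight in adjacency[source].items()]
    heapq.heapify(heap)
    tree = []
    while heap:
        weight, from_node, to_node = heapq.heappop(heap)
        if to_node in reached:
            continue  # this edge would close a cycle
        reached.add(to_node)
        tree.append((from_node, to_node, weight))
        for neighbor, next_weight in adjacency[to_node].items():
            if neighbor not in reached:
                heapq.heappush(heap, (next_weight, to_node, neighbor))
    return tree, reached


# Hypothetical toy graph with two-way weighted links.
toy_graph = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 2},
    "C": {"A": 5, "B": 2, "D": 3},
    "D": {"C": 3},
}

tree, reached = prim_mst(toy_graph, "A")
print(tree)
print(len(reached) == len(toy_graph))  # True when every node is reachable
```

Comparing the number of reached nodes with the total number of nodes is the same accessibility check applied above with Downtown Berkeley as the source.
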

## Store Graph Features (Harmonic Centrality, Betweenness, and Louvain Modularity) in a SQL Table

In [60]:
connection.rollback()

query = """

drop table if exists graphy_features
;

create table graphy_features(
    node varchar(32),
    closeness numeric(5,4),
    betweenness numeric(5),
    community numeric(5)
)
;

"""

cursor.execute(query)

connection.commit()

In [61]:
def my_get_node_list():
    "get a list of nodes in the current graph"
    
    query = "match (n) return n.name as name"
    
    result = session.run(query)
    
    node_list = []
    
    for r in result:
        node_list.append(r["name"])
        
    node_list = sorted(node_list)
    
    return node_list

In [62]:
connection.rollback()

query = """

insert into graphy_features
values
(%s, 0, 0, 0)
;

"""

node_list = my_get_node_list()

for node in node_list:
    cursor.execute(query, (node,))

connection.commit()

In [63]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select * 
from graphy_features
order by node

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,node,closeness,betweenness,community
0,12th Street,0,0,0
1,16th Street Mission,0,0,0
2,19th Street,0,0,0
3,24th Street Mission,0,0,0
4,Antioch,0,0,0
5,Ashby,0,0,0
6,Balboa Park,0,0,0
7,Bay Fair,0,0,0
8,Berryessa,0,0,0
9,Castro Valley,0,0,0


## Harmonic Centrality

In [64]:
query = "CALL gds.graph.drop('ds_graph', false)"
session.run(query)

query = "CALL gds.graph.project('ds_graph', 'Station', 'LINK', {relationshipProperties: 'weight'})"
session.run(query)

<neo4j.work.result.Result at 0x7f63d445a790>

In [65]:
query = """

CALL gds.alpha.closeness.harmonic.stream('ds_graph', {})
YIELD nodeId, centrality
RETURN gds.util.asNode(nodeId).name AS name, centrality as closeness
ORDER BY centrality DESC

"""

result = session.run(query)

for r in result:
    
    query = "update graphy_features set closeness = %s where node = %s"
    
    cursor.execute(query, (r["closeness"], r["name"]))

connection.commit()

In [66]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select * 
from graphy_features
order by node

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,node,closeness,betweenness,community
0,12th Street,0.2333,0,0
1,16th Street Mission,0.1712,0,0
2,19th Street,0.2195,0,0
3,24th Street Mission,0.1663,0,0
4,Antioch,0.1069,0,0
5,Ashby,0.1986,0,0
6,Balboa Park,0.1574,0,0
7,Bay Fair,0.2093,0,0
8,Berryessa,0.1101,0,0
9,Castro Valley,0.1799,0,0

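To make the closeness column easier to interpret, here is a minimal sketch of harmonic centrality on an unweighted graph: for each node, sum the reciprocals of the shortest-path distances to every other node. Dividing by `n - 1` is an assumption that appears consistent with the scale of the values above; the toy graph and function name are hypothetical.

```python
from collections import deque


def harmonic_centrality(adjacency):
    """Return {node: normalized harmonic centrality} using hop counts as distance."""
    n = len(adjacency)
    scores = {}
    for source in adjacency:
        # Breadth-first search gives shortest hop distances from source.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        total = sum(1 / d for d in distance.values() if d > 0)
        scores[source] = total / (n - 1) if n > 1 else 0.0
    return scores


# Hypothetical toy line of three stations with two-way links.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

scores = harmonic_centrality(toy_graph)
print(scores)
```

The middle station scores highest because it is one hop from everything, which mirrors why stations near the center of the network, such as 12th Street, rank above end-of-line stations such as Antioch.
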

## Adding Harmonic Centrality Values to Neo4j Graph

In [67]:
query = """

CALL gds.alpha.closeness.harmonic.write('ds_graph', {})
YIELD nodes, writeProperty
"""

session.run(query)

<neo4j.work.result.Result at 0x7f63d445a8e0>

## Betweenness

In [68]:
query = "CALL gds.graph.drop('ds_graph', false)"
session.run(query)

query = "CALL gds.graph.project('ds_graph', 'Station', 'LINK', {relationshipProperties: 'weight'})"
session.run(query)

<neo4j.work.result.Result at 0x7f63e9b3d580>

In [69]:
query = """

CALL gds.betweenness.stream('ds_graph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score as betweenness
ORDER BY betweenness DESC

"""

result = session.run(query)

for r in result:
    
    query = "update graphy_features set betweenness = %s where node = %s"
    
    cursor.execute(query, (r["betweenness"], r["name"]))

connection.commit()

In [70]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select * 
from graphy_features
order by node

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,node,closeness,betweenness,community
0,12th Street,0.2333,1116,0
1,16th Street Mission,0.1712,720,0
2,19th Street,0.2195,1088,0
3,24th Street Mission,0.1663,656,0
4,Antioch,0.1069,0,0
5,Ashby,0.1986,440,0
6,Balboa Park,0.1574,516,0
7,Bay Fair,0.2093,822,0
8,Berryessa,0.1101,0,0
9,Castro Valley,0.1799,188,0

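Betweenness counts how often a node sits on shortest paths between other nodes. The sketch below implements Brandes' algorithm for an unweighted graph in plain Python, counting each ordered pair of endpoints separately (an assumption that matches a graph stored with two-way links). The toy graph and function name are hypothetical.

```python
from collections import deque


def betweenness(adjacency):
    """Return {node: betweenness score} over all ordered source/target pairs."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []                                   # nodes in order of discovery
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        path_count[source] = 1
        distance = {node: -1 for node in adjacency}
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if distance[neighbor] < 0:           # first time seen
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)
        # Accumulate dependencies from the farthest nodes back to the source.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    return scores


# Hypothetical toy line of four stations with two-way links.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

scores = betweenness(toy_graph)
print(scores)
```

End-of-line nodes score 0 because no shortest path passes through them, which matches the zeros reported above for terminal stations such as Antioch and Berryessa.
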

## Adding Betweenness Values to Neo4j Graph

In [71]:
query = """

CALL gds.betweenness.write('ds_graph', { writeProperty: 'betweenness' })
YIELD centralityDistribution, nodePropertiesWritten
RETURN centralityDistribution.min AS minimumScore, centralityDistribution.mean AS meanScore, nodePropertiesWritten

"""

session.run(query)

<neo4j.work.result.Result at 0x7f63d44709a0>

## Louvain Modularity

In [72]:
query = "CALL gds.graph.drop('ds_graph', false)"
session.run(query)

query = """

CALL gds.graph.project('ds_graph', 'Station', 'LINK', 
                      {relationshipProperties: 'weight'})
"""

session.run(query)

<neo4j.work.result.Result at 0x7f63e9b33ca0>

In [73]:
query = """

CALL gds.louvain.stream('ds_graph')
YIELD nodeId, communityId, intermediateCommunityIds
RETURN gds.util.asNode(nodeId).name AS name, communityId as community, intermediateCommunityIds as intermediate_community
ORDER BY community, name ASC

"""

result = session.run(query)

for r in result:
    
    query = "update graphy_features set community = %s where node = %s"
    
    cursor.execute(query, (r["community"], r["name"]))

connection.commit()

In [74]:
rollback_before_flag = True
rollback_after_flag = True

query = """

select * 
from graphy_features
order by node

"""

my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

Unnamed: 0,node,closeness,betweenness,community
0,12th Street,0.2333,1116,29
1,16th Street Mission,0.1712,720,10
2,19th Street,0.2195,1088,46
3,24th Street Mission,0.1663,656,10
4,Antioch,0.1069,0,31
5,Ashby,0.1986,440,2
6,Balboa Park,0.1574,516,14
7,Bay Fair,0.2093,822,23
8,Berryessa,0.1101,0,45
9,Castro Valley,0.1799,188,23

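Louvain groups nodes by searching for a community assignment with high modularity. As a rough illustration of the quantity being optimized, the sketch below computes the modularity of a given assignment on an undirected graph with optional weights. It does not perform the Louvain search itself; the toy graph, function name, and community labels are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition.

    edges: list of (node_1, node_2, weight) undirected connections.
    community: dict mapping node -> community label.
    """
    two_m = 2 * sum(weight for _, _, weight in edges)
    internal = defaultdict(float)   # twice the edge weight inside each community
    total = defaultdict(float)      # summed weighted degree of each community
    for node_1, node_2, weight in edges:
        total[community[node_1]] += weight
        total[community[node_2]] += weight
        if community[node_1] == community[node_2]:
            internal[community[node_1]] += 2 * weight
    return sum(internal[c] / two_m - (total[c] / two_m) ** 2 for c in total)


# Hypothetical toy graph: two triangles joined by a single bridge, unit weights.
edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
    ("c", "d", 1),
]
by_triangle = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
all_together = {node: 0 for node in by_triangle}

print(modularity(edges, by_triangle))
print(modularity(edges, all_together))
```

Splitting the graph at the bridge scores higher than lumping every node into one community, which is the kind of improvement the Louvain procedure looks for when it assigns stations to neighborhoods.
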

## Add Communities to Neo4j Graph

In [75]:
query = """

CALL gds.louvain.write('ds_graph', { writeProperty: 'community' })
YIELD communityCount, modularity, modularities

"""

session.run(query)

<neo4j.work.result.Result at 0x7f63d4478f40>

## Query Average and Standard Deviation Stats for Closeness and Betweenness

In [76]:
rollback_before_flag = True
rollback_after_flag = True

query = """

with summary as (

select avg(closeness) as avg_closeness,
       stddev(closeness) as std_closeness,
       avg(betweenness) as avg_betweenness,
       stddev(betweenness) as std_betweenness
from graphy_features

)

select a.node,
       a.closeness, b.avg_closeness, b.std_closeness,
       a.betweenness, b.avg_betweenness, b.std_betweenness,
       a.community
from graphy_features as a,
     summary as b
order by community, closeness desc

"""

df = my_select_query_pandas(query, rollback_before_flag, rollback_after_flag)

df

Unnamed: 0,node,closeness,avg_closeness,std_closeness,betweenness,avg_betweenness,std_betweenness,community
0,Ashby,0.1986,0.169662,0.033104,440,470.44,354.982561,2
1,Downtown Berkeley,0.1823,0.169662,0.033104,360,470.44,354.982561,2
2,North Berkeley,0.1694,0.169662,0.033104,276,470.44,354.982561,2
3,El Cerrito Plaza,0.157,0.169662,0.033104,188,470.44,354.982561,2
4,El Cerrito del Norte,0.1426,0.169662,0.033104,96,470.44,354.982561,2
5,Richmond,0.119,0.169662,0.033104,0,470.44,354.982561,2
6,Powell Street,0.1835,0.169662,0.033104,836,470.44,354.982561,10
7,Civic Center,0.1768,0.169662,0.033104,780,470.44,354.982561,10
8,16th Street Mission,0.1712,0.169662,0.033104,720,470.44,354.982561,10
9,24th Street Mission,0.1663,0.169662,0.033104,656,470.44,354.982561,10

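One way to use the averages and standard deviations returned above is to express each station's score as a number of standard deviations from the mean. The sketch below does this for closeness using values copied from the output table; the helper name and the choice of stations are for illustration only, and the notebook itself does not compute these z-scores.

```python
def z_score(value, mean, std):
    """Number of standard deviations that value lies above (or below) the mean."""
    if std == 0:
        return 0.0  # avoid dividing by zero when every value is identical
    return (value - mean) / std


# Summary statistics and closeness values taken from the table above.
avg_closeness = 0.169662
std_closeness = 0.033104
closeness = {"Ashby": 0.1986, "Richmond": 0.119, "Powell Street": 0.1835}

closeness_z = {
    station: z_score(value, avg_closeness, std_closeness)
    for station, value in closeness.items()
}

for station, z in sorted(closeness_z.items(), key=lambda item: item[1], reverse=True):
    print(station, round(z, 2))
```

A positive value marks a station as more central than average and a negative value as less central, which gives a scale-free way to compare closeness and betweenness side by side.
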

## Export Graph Features as CSV

In [77]:
df.to_csv("graph_features_final.csv", index = False)