# Graph Embeddings on the European Roads dataset

In this notebook we're going to generate graph embeddings on a European Roads dataset using the Neo4j Graph Data Science Library, and then explore those embeddings using Python data science tools.

Let's start by importing some libraries:

In [24]:
from neo4j import GraphDatabase
from neo4j.exceptions import ClientError
from sklearn.manifold import TSNE

import numpy as np
import altair as alt
import altair_viewer
import pandas as pd

Once we've done that we can initialise the Neo4j driver.

In [None]:
driver = GraphDatabase.driver("bolt://localhost", auth=("neo4j", "neo"))

## Importing the dataset

And now we can import the dataset:

In [39]:
with driver.session(database="foo") as session:
    try: 
        session.run("""
        CREATE CONSTRAINT ON (p:Place) ASSERT p.name IS UNIQUE;
        """)
    except ClientError as ex:
        print(ex)
        
    result = session.run("""
    USING PERIODIC COMMIT 1000
    LOAD CSV WITH HEADERS FROM "https://github.com/neo4j-examples/graph-embeddings/raw/main/data/roads.csv"
    AS row

    MERGE (origin:Place {name: row.origin_reference_place})
    SET origin.countryCode = row.origin_country_code

    MERGE (destination:Place {name: row.destination_reference_place})
    SET destination.countryCode = row.destination_country_code

    MERGE (origin)-[eroad:EROAD {number: row.road_number}]->(destination)
    SET eroad.distance = toInteger(row.distance), eroad.watercrossing = row.watercrossing;
    """)
    display(result.consume().counters)

An equivalent constraint already exists, 'Constraint( UNIQUE, :Place(name) )'.


{'properties_set': 5000}

## Random Projection

Now we're ready to generate some embeddings. We're going to run the streaming version of the Random Projection algorithm, which returns the embeddings without storing them. We need to define the following config:

* `nodeProjection` - the node labels to use for our projected graph
* `relationshipProjection` - the relationship types to use for our projected graph
* `embeddingSize` - the size of the vector/list of numbers to create for each node
* `maxIterations` - the number of iterations to run

Let's give it a try:

In [15]:
with driver.session(database="foo") as session:
    result = session.run("""
    CALL gds.alpha.randomProjection.stream({
       nodeProjection: "Place",
       relationshipProjection: {
         eroad: {
           type: "EROAD",
           orientation: "UNDIRECTED"
        }
       },
       embeddingSize: 10,
       maxIterations: 1
    })
    YIELD nodeId, embedding
    RETURN gds.util.asNode(nodeId).name AS place, embedding
    LIMIT 10
    """)
    
    embeddings_df = pd.DataFrame([dict(record) for record in result])
embeddings_df    

|  | place | embedding |
|---|---|---|
| 0 | Larne | [0.0, 0.18257418274879456, 0.0, 0.0, -0.547722... |
| 1 | Belfast | [0.18257418274879456, 0.0, -0.0912870913743972... |
| 2 | Dublin | [0.0, 0.13693062961101532, 0.13693062961101532... |
| 3 | Wexford | [0.13693062961101532, 0.41079187393188477, 0.0... |
| 4 | Rosslare | [0.13693062961101532, 0.27386125922203064, 0.0... |
| 5 | La Coruña | [0.0, 0.3651483654975891, -0.18257418274879456... |
| 6 | Pontevedra | [0.0, 0.0, 0.0, -0.5477225184440613, -0.273861... |
| 7 | Valença do Minho | [-0.27386125922203064, 0.0, 0.2738612592220306... |
| 8 | Porto | [-0.18257418274879456, 0.18257418274879456, 0.... |
| 9 | Aveiro | [0.0, -0.13693062961101532, 0.1369306296110153... |
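
To build some intuition for what a single iteration of a random-projection embedding might be doing, here is a simplified pure-Python sketch. It is an illustration of the general idea, not the GDS implementation: every node gets a sparse random vector, and each node's embedding is the average of its neighbours' vectors. The sampling scheme (entries of ±sqrt(3/size) with probability 1/6 each, otherwise 0) is an assumption; with a size of 10 the non-zero magnitude is about 0.5477, which matches a magnitude that appears in the embeddings in this notebook. The function names and the toy graph are hypothetical.

```python
import math
import random


def initial_vectors(nodes, embedding_size, seed=42):
    """Give every node a sparse random vector (assumed sampling scheme)."""
    rng = random.Random(seed)
    scale = math.sqrt(3 / embedding_size)
    # Choosing uniformly from six slots gives -scale and +scale a 1/6 chance
    # each, and 0.0 a 2/3 chance.
    slots = [-scale, 0.0, 0.0, 0.0, 0.0, scale]
    return {node: [rng.choice(slots) for _ in range(embedding_size)] for node in nodes}


def average_neighbours(adjacency, vectors):
    """One iteration: a node's embedding is the mean of its neighbours' vectors."""
    embeddings = {}
    for node, neighbours in adjacency.items():
        size = len(vectors[node])
        if not neighbours:
            # Isolated nodes have nothing to average, so they stay at zero.
            embeddings[node] = [0.0] * size
            continue
        embeddings[node] = [
            sum(vectors[n][i] for n in neighbours) / len(neighbours)
            for i in range(size)
        ]
    return embeddings


# Hypothetical undirected toy graph; these nodes and edges are invented.
toy_graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

random_vectors = initial_vectors(toy_graph, embedding_size=10)
toy_embeddings = average_neighbours(toy_graph, random_vectors)
for node, embedding in toy_embeddings.items():
    print(node, [round(value, 4) for value in embedding])
```

Averaging over neighbours means that nodes sharing many neighbours end up with similar vectors, which is the property we hope to see when we visualize the embeddings.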


So far everything looks good. Let's now store the embeddings in Neo4j, by using the write version of the algorithm:

In [109]:
with driver.session(database="foo") as session:
    result = session.run("""
    CALL gds.alpha.randomProjection.write({
       nodeProjection: "Place",
       relationshipProjection: {
         eroad: {
           type: "EROAD",
           orientation: "UNDIRECTED"
        }
       },
       embeddingSize: 10,
       maxIterations: 1,
       writeProperty: $embeddingProperty
    })
    """, {"embeddingProperty": "embeddingRandomProjection"})
    
    embeddings_df = pd.DataFrame([dict(record) for record in result])
embeddings_df    

|  | nodeCount | nodePropertiesWritten | createMillis | computeMillis | writeMillis | configuration |
|---|---|---|---|---|---|---|
| 0 | 894 | 894 | 10 | 36 | 26 | {'maxIterations': 1, 'writeConcurrency': 4, 'n... |


The embeddings will be stored in the `embeddingRandomProjection` property on each node. Let's now build a data frame that contains each place and its embedding so that we can explore them further.

In [110]:
with driver.session(database="foo") as session:
    result = session.run("""
    MATCH (p:Place)
    RETURN p.name AS place, p.embeddingRandomProjection AS embedding, p.countryCode AS country
    """)
    X = pd.DataFrame([dict(record) for record in result])
X    

|  | place | embedding | country |
|---|---|---|---|
| 0 | Larne | [-0.18257418274879456, 0.0, 0.0, 0.36514836549... | GB |
| 1 | Belfast | [-0.09128709137439728, 0.18257418274879456, -0... | GB |
| 2 | Dublin | [0.0, 0.13693062961101532, 0.27386125922203064... | IRL |
| 3 | Wexford | [0.0, -0.13693062961101532, -0.136930629611015... | IRL |
| 4 | Rosslare | [0.13693062961101532, 0.0, 0.41079190373420715... | IRL |
| ... | ... | ... | ... |
| 889 | Vólos | [0.5477225184440613, -0.5477225184440613, 0.0,... | GR |
| 890 | Trapani | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.54772251... | I |
| 891 | Merzifon | [-0.5477225184440613, 0.5477225184440613, -0.5... | TR |
| 892 | Gíthio | [0.5477225184440613, 0.0, 0.0, 0.0, 0.0, 0.0, ... | GR |
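
One simple way to start exploring embeddings like these is to compare pairs of them with cosine similarity. The sketch below is a minimal pure-Python version; the `cosine_similarity` helper and the three short vectors are made up for illustration and are not taken from the dataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up three-dimensional vectors standing in for real embeddings.
example = {
    "place_a": [0.5, 0.0, -0.5],
    "place_b": [0.4, 0.1, -0.6],
    "place_c": [-0.5, 0.0, 0.5],
}

# place_a and place_b point in nearly the same direction; place_c points the opposite way.
print(cosine_similarity(example["place_a"], example["place_b"]))
print(cosine_similarity(example["place_a"], example["place_c"]))
```

With the data frame from the previous cell, something like `cosine_similarity(X.embedding[0], X.embedding[1])` should compare the first two places.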


## Visualizing Random Projection embeddings

We can visualize our embeddings with the help of the [t-SNE algorithm](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding). 

t-SNE is a dimensionality reduction technique that can be used to reduce high-dimensional objects to 2 or 3 dimensions so that they can be better visualized. We're going to use it to create a scatterplot of our embeddings.
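
As a rough sketch of the first step t-SNE takes, the snippet below turns distances from one point into conditional probabilities with a Gaussian kernel, so that nearby points get most of the probability mass. t-SNE then looks for a low-dimensional layout whose similarities match these. This is a simplification: the real algorithm tunes the kernel width for each point from the perplexity setting, whereas here `sigma` is a fixed, assumed value, and the helper name and sample points are made up.

```python
import math


def conditional_affinities(points, i, sigma=1.0):
    """Return p(j|i) for every j != i using a Gaussian kernel centred on points[i]."""
    weights = {}
    for j, point in enumerate(points):
        if j == i:
            continue
        squared_distance = sum((a - b) ** 2 for a, b in zip(points[i], point))
        weights[j] = math.exp(-squared_distance / (2 * sigma ** 2))
    total = sum(weights.values())
    # Normalise so the probabilities over all other points sum to 1.
    return {j: weight / total for j, weight in weights.items()}


# Made-up 4-dimensional points: 0 and 1 are close together, 2 is far away.
points = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]
affinities = conditional_affinities(points, i=0)
print(affinities)
```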

The following code snippet applies t-SNE to the embeddings and then creates a data frame containing each place, its country, as well as x and y coordinates.

In [111]:
X_embedded = TSNE(n_components=2, random_state=6).fit_transform(list(X.embedding))

places = list(X.place)
df = pd.DataFrame(data = {
    "place": places,
    "country": X.country,
    "x": [value[0] for value in list(X_embedded)],
    "y": [value[1] for value in list(X_embedded)]
})
df

|  | place | country | x | y |
|---|---|---|---|---|
| 0 | Larne | GB | 18.039846 | 20.024580 |
| 1 | Belfast | GB | -3.358563 | -17.995136 |
| 2 | Dublin | IRL | -2.909632 | -7.746000 |
| 3 | Wexford | IRL | -13.913317 | 1.407290 |
| 4 | Rosslare | IRL | -19.017954 | 12.351611 |
| ... | ... | ... | ... | ... |
| 889 | Vólos | GR | 6.698317 | 7.730943 |
| 890 | Trapani | I | -8.629001 | -13.253885 |
| 891 | Merzifon | TR | -4.648191 | -31.553631 |
| 892 | Gíthio | GR | -12.572576 | -16.032948 |


We can then use the Altair visualization library to create a scatterplot of these coordinates:

In [112]:
chart = alt.Chart(df).mark_circle(size=60).encode(
    x='x',
    y='y',
    tooltip=['place', 'country']
).properties(width=700, height=400)

chart

There don't seem to be any clusters of points in our visualization. It's also hard to tell what each point represents without hovering over it. We can color each point based on its `country` property with the following code:

In [113]:
chart = alt.Chart(df).mark_circle(size=60).encode(
    x='x',
    y='y',
    color='country',
    tooltip=['place', 'country']
).properties(width=700, height=400)

chart

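## Node2vec

Next we'll repeat the same steps with the node2vec algorithm: write the embeddings to the `embeddingNode2vec` property, read them back into a data frame, reduce them to two dimensions with t-SNE, and plot the result colored by country.
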
In [114]:
with driver.session(database="foo") as session:
    result = session.run("""
    CALL gds.alpha.node2vec.write({
       nodeProjection: "Place",
       relationshipProjection: {
         eroad: {
           type: "EROAD",
           orientation: "UNDIRECTED"
        }
       },
       embeddingSize: 10,
       iterations: 10,
       writeProperty: $embeddingProperty
    })
    """, {"embeddingProperty": "embeddingNode2vec"})
    
    embeddings_df = pd.DataFrame([dict(record) for record in result])
embeddings_df    

|  | nodeCount | nodePropertiesWritten | createMillis | computeMillis | writeMillis | configuration |
|---|---|---|---|---|---|---|
| 0 | 894 | 894 | 4 | 23188 | 21 | {'initialLearningRate': 0.025, 'writeConcurren... |
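
The notebook doesn't show how node2vec works internally. As a rough illustration of the walk-generation step described in the node2vec paper, the sketch below produces one biased random walk: `p` controls how likely the walk is to return to the previous node, and `q` controls how likely it is to move further away. The function name, default values, and toy graph are all hypothetical, and this is not the GDS implementation. In the full algorithm, many such walks are fed to a skip-gram model to learn the embeddings, a step that is omitted here.

```python
import random


def biased_walk(adjacency, start, length, p=1.0, q=1.0, seed=0):
    """Generate one second-order random walk in the style of node2vec."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = adjacency[current]
        if not neighbours:
            break
        if len(walk) == 1:
            # The first step has no previous node, so choose uniformly.
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1 / p)   # step back to where we came from
            elif candidate in adjacency[previous]:
                weights.append(1.0)     # stay close to the previous node
            else:
                weights.append(1 / q)   # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Hypothetical undirected toy graph; these nodes and edges are invented.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

walk = biased_walk(toy_graph, start="A", length=6, p=0.5, q=2.0)
print(walk)
```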


In [117]:
with driver.session(database="foo") as session:
    result = session.run("""
    MATCH (p:Place)
    RETURN p.name AS place, p.embeddingNode2vec AS embedding, p.countryCode AS country
    """)
    X = pd.DataFrame([dict(record) for record in result])
X    

|  | place | embedding | country |
|---|---|---|---|
| 0 | Larne | [-1.7483952045440674, 0.3618572950363159, 0.85... | GB |
| 1 | Belfast | [-1.6908007860183716, 0.7326405048370361, 0.61... | GB |
| 2 | Dublin | [-0.9450862407684326, 1.0680564641952515, 1.38... | IRL |
| 3 | Wexford | [-1.4188686609268188, 0.5563820004463196, 1.64... | IRL |
| 4 | Rosslare | [-1.5241732597351074, 0.06164241582155228, 1.6... | IRL |
| ... | ... | ... | ... |
| 889 | Vólos | [-0.8323580622673035, 1.766160249710083, -1.53... | GR |
| 890 | Trapani | [-0.1678529977798462, -2.1658661365509033, 2.5... | I |
| 891 | Merzifon | [1.241807222366333, -0.4553961753845215, -1.54... | TR |
| 892 | Gíthio | [3.0779261589050293, 1.590602159500122, -0.011... | GR |


In [119]:
X_embedded = TSNE(n_components=2, random_state=6).fit_transform(list(X.embedding))
places = list(X.place)
df = pd.DataFrame(data = {
    "place": places,
    "country": X.country,
    "x": [value[0] for value in list(X_embedded)],
    "y": [value[1] for value in list(X_embedded)]
})
df

|  | place | country | x | y |
|---|---|---|---|---|
| 0 | Larne | GB | 13.909543 | -14.981020 |
| 1 | Belfast | GB | 14.294166 | -14.053535 |
| 2 | Dublin | IRL | 15.597778 | -12.288795 |
| 3 | Wexford | IRL | 15.964207 | -10.871155 |
| 4 | Rosslare | IRL | 16.485857 | -9.628704 |
| ... | ... | ... | ... | ... |
| 889 | Vólos | GR | -30.886980 | 17.924509 |
| 890 | Trapani | I | 0.308325 | 31.518797 |
| 891 | Merzifon | TR | -29.401266 | -20.471985 |
| 892 | Gíthio | GR | -34.427895 | 27.753803 |


In [122]:
chart = alt.Chart(df).mark_circle(size=60).encode(
    x='x',
    y='y',
    color='country',
    tooltip=['place', 'country']
).properties(width=700, height=400)

chart
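
The first five places printed in both t-SNE data frames (Larne, Belfast, Dublin, Wexford, and Rosslare) are neighbouring places on the island of Ireland, so they give us a small sanity check. The sketch below copies their x and y coordinates from the two printed data frames and compares how spread out they are under each algorithm. Five rows are far too few to draw conclusions about the whole dataset, and the helper name is made up for this example.

```python
import math
from itertools import combinations

# (x, y) t-SNE coordinates copied from the first five rows of each printed data frame.
random_projection_points = {
    "Larne": (18.039846, 20.024580),
    "Belfast": (-3.358563, -17.995136),
    "Dublin": (-2.909632, -7.746000),
    "Wexford": (-13.913317, 1.407290),
    "Rosslare": (-19.017954, 12.351611),
}
node2vec_points = {
    "Larne": (13.909543, -14.981020),
    "Belfast": (14.294166, -14.053535),
    "Dublin": (15.597778, -12.288795),
    "Wexford": (15.964207, -10.871155),
    "Rosslare": (16.485857, -9.628704),
}


def mean_pairwise_distance(points):
    """Average Euclidean distance over every pair of points."""
    distances = [math.dist(a, b) for a, b in combinations(points.values(), 2)]
    return sum(distances) / len(distances)


print("Random Projection:", mean_pairwise_distance(random_projection_points))
print("node2vec:", mean_pairwise_distance(node2vec_points))
```

For these five places the node2vec coordinates sit much closer together than the Random Projection ones, which is consistent with what the printed rows suggest at a glance.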