# Neo4j: Application of graph algorithms
COURSE: INFO9016-1 Advanced Databases

AUTHORS: 
- Emilien DE LA BRASSINNE BONARDAUX 
- Yanis GEURTS
- Eri VAN DE VYVER


## Introduction

Graph databases represent data as nodes connected by edges, which respectively model entities and their relationships. They build on the mathematical notion of a graph, which gives a more flexible representation of data than relational databases. This representation makes it possible to answer many kinds of questions efficiently with algorithms that rely on graph properties and traversal methods from graph theory.[1]

In this tutorial, Neo4j has been chosen to illustrate the usage of graph databases, since it makes graph data easy to model and query. It uses the Cypher query language, which is optimised for pattern matching and graph exploration, and combines it with ready-made graph algorithms such as PageRank and shortest path. Unlike relational databases, which rely on tables and JOIN operations, Neo4j uses a native graph structure that stores relationships directly, which makes queries across networks faster and more scalable.[2]

Neo4j is used in a wide range of real-world applications, such as fraud detection, where suspicious patterns are identified by analysing transaction graphs, and recommendation systems, where similarity and collaborative filtering are used to suggest products or content.

This tutorial uses a dataset of airport routes [3] to show how graph databases handle tasks that are more difficult for traditional databases, especially when analyzing connectivity, hubs, and optimal paths across complex networks.

## Setup

To follow this tutorial, Docker must already be installed. Once it is installed, execute these commands:

`docker pull neo4j`

`docker compose up`

This creates a Neo4j container running the database service.

Install the Neo4j Python driver and import the following libraries.

In [1]:
!pip install neo4j
from neo4j import GraphDatabase
import pandas as pd



Now that everything is set up, the Python environment just needs to connect to the Neo4j Docker container.

In [2]:
# Define Connection Parameters
URI = "bolt://localhost:7687"
AUTH = ("neo4j", "neo4j")  # Default credentials for Neo4j when no auth is set

# Establish Connection and Query Executor
try:
    driver = GraphDatabase.driver(URI, auth=AUTH)
    driver.verify_connectivity()
    print("Connection to Neo4j successful!")
except Exception as e:
    print(f"Connection failed. Error: {e}")

def run_cypher_query(query):
    """Runs a Cypher query in a write transaction and returns the records as a list of dicts."""
    with driver.session() as session:
        result = session.execute_write(lambda tx: tx.run(query).data())
        return result


def run_cypher_query_df(query, params=None):
    """
    Executes a Cypher query using the global 'driver'
    and returns the result as a pandas DataFrame.
    """
    try:
        df = driver.execute_query(
            query,
            parameters_=params or {},  # avoid a mutable default argument
            database_="neo4j",
            result_transformer_=pd.DataFrame
        )
        return df

    except Exception as e:
        print(f"Query failed: {e}")
        return pd.DataFrame()

Connection to Neo4j successful!


Let's begin by creating our graph database with the essential structure and data. We first define **constraints** and **indexes**.  
These ensure data consistency, improve performance, and prevent the creation of duplicate nodes.

- **Uniqueness constraints** guarantee that key identifiers such as airport IATA codes or country codes appear only once.
- **Indexes** accelerate point lookups on frequently accessed properties. In this case, we have an index on the airport’s location.

Creating constraints *before* loading data is important: Neo4j enforces them during the import phase and prevents storing inconsistent or duplicate nodes.

The following code creates:

- 5 uniqueness constraints  
- 1 index on the `location` property of `Airport`

In [3]:
print("\nStep 1: Creating Constraints and Indexes")

CONSTRAINTS_AND_INDEXES = [
    "CREATE CONSTRAINT airports IF NOT EXISTS FOR (a:Airport) REQUIRE a.iata IS UNIQUE",
    "CREATE CONSTRAINT cities IF NOT EXISTS FOR (c:City) REQUIRE c.name IS UNIQUE",
    "CREATE CONSTRAINT regions IF NOT EXISTS FOR (r:Region) REQUIRE r.name IS UNIQUE",
    "CREATE CONSTRAINT countries IF NOT EXISTS FOR (c:Country) REQUIRE c.code IS UNIQUE",
    "CREATE CONSTRAINT continents IF NOT EXISTS FOR (c:Continent) REQUIRE c.code IS UNIQUE",
    "CREATE INDEX locations IF NOT EXISTS FOR (air:Airport) ON (air.location)"
]

try:
    for query in CONSTRAINTS_AND_INDEXES:
        run_cypher_query(query)
    print("Constraints and indexes created successfully.")

except Exception as e:
    print(f"Failed to create constraints/indexes: {e}")


Step 1: Creating Constraints and Indexes
Constraints and indexes created successfully.


Next, we import all nodes and their complete geo-hierarchy from the `airport-node-list.csv` file.  
This step builds the structure of our graph by creating:

- **Five node labels:** `Airport`, `City`, `Region`, `Country`, and `Continent`
- **Four relationship types:** `:IN_CITY`, `:IN_REGION`, `:IN_COUNTRY`, and `:ON_CONTINENT`

Each row in the CSV is used to:

1. **Create (or merge) the corresponding nodes** for the airport and the geographic entities it belongs to (city, region, country, continent).  
2. **Link them together** using the geo-hierarchy relationships.  
3. **Assign all airport attributes** (IATA, ICAO, altitude, runways, coordinates, etc.) using the `SET` command.

This ensures that every airport is uniquely connected to its city, region, country, and continent while avoiding duplicate nodes across the dataset.
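The deduplicating behaviour of `MERGE` can be pictured with a minimal pure-Python sketch. It is only an analogy and not how Neo4j works internally: the `merge_node` helper and the two sample rows are hypothetical and are not taken from the CSV file.

```python
# Toy analogy for MERGE: look a node up by (label, key value)
# and create it only if it does not exist yet.
nodes = {}  # maps (label, key value) -> property dict


def merge_node(label, key, value):
    """Return the existing node for (label, value), creating it on first use."""
    return nodes.setdefault((label, value), {key: value})


# Hypothetical rows: two airports that share a city and a country.
rows = [
    {"iata": "AAA", "city": "Sample City", "country": "XX"},
    {"iata": "BBB", "city": "Sample City", "country": "XX"},
]

for row in rows:
    merge_node("Airport", "iata", row["iata"])
    merge_node("City", "name", row["city"])
    merge_node("Country", "code", row["country"])

airport_count = sum(1 for label, _ in nodes if label == "Airport")
city_count = sum(1 for label, _ in nodes if label == "City")
print(airport_count, city_count)
```

Both rows refer to the same city, yet only one `City` entry is created, which is the effect the loading query relies on.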

In [4]:
print("\nStep 2: Loading Nodes and Geo-Hierarchy from airport-node-list.csv")

NODE_LOADING_QUERY = """
LOAD CSV WITH HEADERS FROM 'file:///airport-node-list.csv' AS row

// Creation of the nodes 
MERGE (a:Airport {iata: row.iata})
MERGE (ci:City {name: row.city})
MERGE (r:Region {name: row.region})
MERGE (co:Country {code: row.country})
MERGE (con:Continent {name: row.continent})

// Creation of the relationship types
MERGE (a)-[:IN_CITY]->(ci)
MERGE (a)-[:IN_COUNTRY]->(co)
MERGE (ci)-[:IN_COUNTRY]->(co)
MERGE (r)-[:IN_COUNTRY]->(co)
MERGE (a)-[:IN_REGION]->(r)
MERGE (ci)-[:IN_REGION]->(r)
MERGE (a)-[:ON_CONTINENT]->(con)
MERGE (ci)-[:ON_CONTINENT]->(con)
MERGE (co)-[:ON_CONTINENT]->(con)
MERGE (r)-[:ON_CONTINENT]->(con)

// Assigning attributes
SET a.id = row.id,
    a.icao = row.icao,
    a.descr = row.descr,
    a.runways = toInteger(row.runways),
    a.longest = toInteger(row.longest),
    a.altitude = toInteger(row.altitude),
    a.latitude = toFloat(row.lat),
    a.longitude = toFloat(row.lon)
    // Removed any potential a.city property to keep city as a separate City label

RETURN count(a) AS airports_loaded
"""

results_nodes = run_cypher_query(NODE_LOADING_QUERY)
print(f"Airports and Geographies loaded. Total Airports: {results_nodes[0]['airports_loaded']}")


Step 2: Loading Nodes and Geo-Hierarchy from airport-node-list.csv
Airports and Geographies loaded. Total Airports: 3503


Finally, we import the `:HAS_ROUTE` relationships between airports.  
Each relationship connects a source airport to a destination airport and includes a `distance` property, representing the distance between the two airports.  

This distance will later be used to create **weighted graphs** for routing and network analysis.

In [5]:

print("\nStep 3: Loading Routes from iroutes-edges.csv")

ROUTE_LOADING_QUERY = """
LOAD CSV WITH HEADERS FROM 'file:///iroutes-edges.csv' AS row

MATCH (source:Airport {iata: row.src})
MATCH (target:Airport {iata: row.dest})

MERGE (source)-[r:HAS_ROUTE]->(target)
ON CREATE SET r.distance = toFloat(row.dist)

RETURN count(r) AS routes_loaded
"""

results_routes = run_cypher_query(ROUTE_LOADING_QUERY)
print(f"Routes loaded. Total HAS_ROUTE relationships: {results_routes[0]['routes_loaded']}")


Step 3: Loading Routes from iroutes-edges.csv
Routes loaded. Total HAS_ROUTE relationships: 46389


The structure of the relationships in the dataset is shown in the figure below:
![](Images/Structure.png "Structure of the dataset")


## Features of Neo4j

Neo4j is a native graph database built around the idea that relationships are as important as the data itself.  

Neo4j follows a Property Graph Model, made of:

- Nodes: Entities in the graph. Example: `(:Airport {iata:"CDG"})`
- Relationships: Directed edges connecting nodes. Example:  `(:Airport)-[:HAS_ROUTE]->(:Airport)`
- Properties: Key–value pairs assigned to nodes or relationships. Example: `a.latitude = 50.9014`
- Labels: Group nodes into types/classes. Example: `:Airport`, `:City`
- Relationship types: Classify relationship semantics. Example: `:IN_CITY`, `:HAS_ROUTE`
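As a rough mental model, the elements listed above can be sketched with two small Python dataclasses. The class names, fields, and the sample distance are illustrative assumptions and do not mirror Neo4j's internal storage or the driver's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                                      # e.g. {"Airport"}
    properties: dict = field(default_factory=dict)   # key-value pairs


@dataclass
class Relationship:
    start: Node                                      # relationships are directed
    type: str                                        # e.g. "HAS_ROUTE"
    end: Node
    properties: dict = field(default_factory=dict)


bru = Node({"Airport"}, {"iata": "BRU", "latitude": 50.9014})
cdg = Node({"Airport"}, {"iata": "CDG"})
# The distance value below is made up for illustration.
route = Relationship(bru, "HAS_ROUTE", cdg, {"distance": 100.0})
print(route.start.properties["iata"], route.type, route.end.properties["iata"])
```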

### Cypher Query Language Basics

Cypher is Neo4j’s declarative graph query language. It allows users to express complex queries in a readable, SQL-like style for connected data. Here we will show how to perform various simple queries.

#### Count nodes
First, we count the nodes in the graph: the *MATCH* clause selects all airports and *RETURN* gives their count.

In [6]:
query = """
MATCH (a:Airport)
RETURN count(a) AS number_of_airports;
"""

with driver.session() as session:
    result = run_cypher_query_df(query)
result

Unnamed: 0,0
0,3503


#### Find airport
To find a specific node, we use the *MATCH* clause and specify the attribute we are looking for. Here, we look up Brussels Airport using its IATA code "BRU".

In [7]:
query = """
MATCH (a:Airport {iata: "BRU"}) RETURN a.descr, a.runways, a.altitude, a.longest, a.latitude, a.longitude
"""

with driver.session() as session:
    result = run_cypher_query_df(query)
result

Unnamed: 0,0,1,2,3,4,5
0,Brussels Airport,3,184,11936,50.901402,4.48444


#### Find connected airports

We search for all airports that can be reached directly from Brussels Airport.

In Cypher, a relationship `r` from `a` to `b` is written `(a)-[r]->(b)`.

In [8]:
query = """
MATCH (a1:Airport {iata: "BRU"})-[:HAS_ROUTE]->(a2:Airport)
                                 RETURN a2.iata, a2.descr
"""

with driver.session() as session:
    result = run_cypher_query_df(query)
result

Unnamed: 0,0,1
0,TSF,"Venice, Treviso-Sant Angelo Airport"
1,TIA,Tirana International Airport Mother Teresa
2,BRS,Bristol International Airport
3,SVQ,Sevilla Airport
4,SPU,Split Airport
...,...,...
189,BIO,Bilbao Airport
190,FNC,"Funchal, Madeira Airport"
191,ACC,Kotoka International Airport
192,FCO,Leonardo da Vinci-Fiumicino International Airport


#### Filter on relationship
We now check which of the airports with a direct route from Brussels Airport are more than 2000 miles away.

In Cypher, filtering is done with the *WHERE* clause.

In [9]:
query = """
    MATCH (a1:Airport {iata: "BRU"})-[r:HAS_ROUTE]->(a2:Airport)
    WHERE r.distance > 2000
    RETURN a2.iata, a2.descr
"""

with driver.session() as session:
    result = run_cypher_query_df(query)
result
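Values such as the IATA code and the distance threshold can also be passed as Cypher parameters (`$name`) instead of being written into the query text. The sketch below only builds the query and its parameter map. The helper name `build_route_filter_query` is hypothetical, and running the query is left as a comment because it needs the database connection from the setup section.

```python
def build_route_filter_query(iata, min_distance):
    """Return a parameterised Cypher query and its parameter map."""
    query = """
    MATCH (a1:Airport {iata: $iata})-[r:HAS_ROUTE]->(a2:Airport)
    WHERE r.distance > $min_distance
    RETURN a2.iata, a2.descr
    """
    params = {"iata": iata, "min_distance": min_distance}
    return query, params


query, params = build_route_filter_query("BRU", 2000)
print(params)

# With the notebook's helper, the query could then be executed as:
# result = run_cypher_query_df(query, params)
```

Keeping values out of the query string means the same query text can be reused for any airport or threshold.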


#### Find airports within three flights
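We look for all airports where we can land in three flights or fewer, starting from Brussels Airport. The result can be sorted with the *ORDER BY* clause. Since the query returns distinct (airport, number of flights) pairs, an airport reachable by paths of different lengths appears once per length.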

In [10]:
query = """
    MATCH a = (start:Airport {iata:"BRU"})-[:HAS_ROUTE*1..3]->(dest:Airport)
    RETURN DISTINCT dest.iata AS airport, length(a) AS num_flights
    ORDER BY num_flights, airport;
"""

with driver.session() as session:
    result = run_cypher_query_df(query)
result

Unnamed: 0,0,1
0,ABJ,1
1,ACC,1
2,ACE,1
3,ADB,1
4,AGA,1
...,...,...
4842,ZUM,3
4843,ZVK,3
4844,ZYI,3
4845,ZYL,3
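To make the meaning of the variable-length pattern `*1..3` more concrete, here is a minimal pure-Python sketch based on breadth-first search. The route map is hypothetical and the function name `reachable_within` is an assumption of this example. Unlike the Cypher query above, it keeps only the minimum number of flights per airport.

```python
from collections import deque


def reachable_within(graph, start, max_hops):
    """Return {airport: minimum number of flights} for airports
    reachable from `start` in 1..max_hops flights."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(current, []):
            if neighbour not in hops:
                hops[neighbour] = hops[current] + 1
                queue.append(neighbour)
    del hops[start]  # the start airport itself is not a destination
    return hops


# Hypothetical directed routes, not taken from the dataset.
toy_routes = {
    "BRU": ["FCO", "ACC"],
    "FCO": ["BRU", "IST"],
    "IST": ["DXB"],
    "DXB": ["SYD"],
    "ACC": [],
}

within_three = reachable_within(toy_routes, "BRU", 3)
print(within_three)
```

In this toy map, `SYD` needs four flights and is therefore excluded.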


#### Data types in Neo4j 
Neo4j supports:
- Primitive types (Integer, Float, String, Boolean, Lists)
- Temporal types (date, datetime, duration)
- Spatial types: `Point`

The `Point` type is the mechanism implemented in Neo4j to store geographical data. It supports:

- latitude/longitude points
- cartesian points
- `point.distance()` between points
- indexing on points

[6]

First, here is how to build the `Point` value of Brussels Airport from the dataset.

In [11]:
query = """
MATCH (a:Airport {iata:"BRU"})
RETURN point({latitude:a.latitude, longitude:a.longitude}) AS geo_point;
"""

with driver.session() as session:
    result = run_cypher_query_df(query)
result

Unnamed: 0,0
0,"(4.48443984985, 50.9014015198)"


The distance between Brussels Airport and Brussels South Charleroi Airport can be computed as follows.

In [12]:
query = """
MATCH (a:Airport {iata:"BRU"}), (b:Airport {iata:"CRL"})
RETURN point.distance(
  point({latitude:a.latitude, longitude:a.longitude}),
  point({latitude:b.latitude, longitude:b.longitude})
) / 1000 AS distance_km;
"""

with driver.session() as session:
    result = run_cypher_query_df(query)
result

Unnamed: 0,0
0,49.272828
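For latitude/longitude points, the value returned by `point.distance` is a great-circle style distance in metres. The computation can be approximated in plain Python with the haversine formula, as in the sketch below. The Earth radius of 6371 km and the Charleroi coordinates are assumptions of this example (only the Brussels coordinates come from the output above), so the result is expected to be close to, but not identical to, the Neo4j value.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, h)))


bru = (50.9014015198, 4.48443984985)  # latitude, longitude from the dataset
crl = (50.4592, 4.45382)              # assumed coordinates for Charleroi

distance_km = haversine_km(*bru, *crl)
print(round(distance_km, 1))
```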


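An index can also be created on a spatial property. The statement below creates a standard index on `location`; for `Point` values, Neo4j additionally offers a dedicated `CREATE POINT INDEX` form. Note that the loading query of this tutorial stores `latitude` and `longitude` separately and does not set `location`, so that property has to be populated before the index becomes useful.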

In [13]:
query = """
CREATE INDEX airport_location IF NOT EXISTS
FOR (a:Airport) ON (a.location);
"""

with driver.session() as session:
    result = run_cypher_query_df(query)
result

Finding the nearest airports from a given place. Here we specify the place by giving the latitude and longitude.

In [14]:
query = """
WITH point({latitude:50.5859, longitude:5.56027}) AS montef
MATCH (a:Airport)
WITH a, point.distance(montef, point({latitude:a.latitude, longitude:a.longitude})) / 1000 AS dist
RETURN a.iata, dist
ORDER BY dist ASC
LIMIT 5;
"""

with driver.session() as session:
    result = run_cypher_query_df(query)
result

Unnamed: 0,0,1
0,LGG,10.061613
1,MST,39.164735
2,CRL,79.567205
3,BRU,83.525441
4,EIN,97.077585


### Other features

Neo4j offers several other important features. It is ACID compliant. It supports full-text indexing, which allows efficient search across textual properties and may be useful for matching airport descriptions or names. Finally, Neo4j has a flexible schema: you *can* define constraints, but you can also add properties at any time and do not need to predefine a full schema.

## Advantages and limitations

Neo4j’s biggest strength compared to relational databases is how it handles connected queries. In a relational database, a query that explores several layers of relationships (such as “find all airports reachable within three flights from Brussels”) requires multiple JOIN operations. As the number of JOINs increases, performance degrades, since each JOIN has to match rows across large tables and scan indexes. Neo4j, however, uses native graph storage in which relationships are stored as direct pointers between nodes. Following a relationship therefore costs roughly the same regardless of the total size of the dataset, so the cost of a multi-hop query depends on the portion of the graph actually traversed rather than on the total number of records. This is a fundamental advantage when working with network structures, and algorithms such as shortest path benefit directly from it. This is illustrated in the path-finding section of this notebook, where the A* algorithm is used to find the shortest flight path.

Another strength of Neo4j is the availability of graph algorithms such as PageRank or community detection, which are optimised to run directly inside the database engine and avoid the overhead of exporting data into external analytics tools. This makes Neo4j particularly advantageous for real-time applications such as fraud detection, recommendation engines and route optimisation, where speed is important.


Cypher allows users to express graph queries using pattern matching instead of JOINs. Many complex relational queries become simpler and more readable when expressed as graph patterns.


The main drawback is that for flat, tabular or purely transactional data, relational databases often perform better.


Finally, because Neo4j stores relationships explicitly, datasets with extremely dense connectivity can consume more storage compared to normalized relational schemas.


[4]

## Running Graph Algorithms


### Presentation of the dataset

The airport routes dataset showcases Neo4j’s real-world use case in transportation and logistics. It contains 3503 airports and 46389 flight routes. It enables getting information like shortest paths between cities and route optimisation. This dataset is well suited for exploring network connectivity, route planning and airline performance analysis. 


### Data Exploration


In [15]:
# Explore basic statistics
with driver.session() as session:
    num_nodes = session.run("MATCH (n) RETURN count(n) AS c").single()["c"]
    num_rels = session.run("MATCH ()-[r]->() RETURN count(r) AS c").single()["c"]

print(f"Number of nodes: {num_nodes}")
print(f"Number of relationships: {num_rels}")

with driver.session() as session:
    labels = session.run("CALL db.labels()")
    print("Available labels:")
    for record in labels:
        print("-", record["label"])


Number of nodes: 8627
Number of relationships: 73954
Available labels:
- Airport
- City
- Region
- Country
- Continent



In [16]:
# List the relationship types
with driver.session() as session:
    rels = session.run("CALL db.relationshipTypes()")
    print("Relationship types:")
    for record in rels:
        print("-", record["relationshipType"])

Relationship types:
- IN_CITY
- IN_COUNTRY
- IN_REGION
- ON_CONTINENT
- HAS_ROUTE



The following code shows examples of nodes with all their properties. Note that the altitude and the length of the longest runway are measured in feet.

In [17]:
# Explore some nodes

label = "Airport"   # Pick one label from Airport, City, Region, Country, Continent
with driver.session() as session:
    result = session.run(f"MATCH (n:{label}) RETURN n LIMIT 10")
    df = pd.DataFrame([dict(record["n"]) for record in result])  
df

Unnamed: 0,altitude,descr,longest,iata,latitude,icao,id,runways,longitude
0,1026,Hartsfield - Jackson Atlanta International Air...,12390,ATL,33.6367,KATL,1,5,-84.428101
1,151,Anchorage Ted Stevens,12400,ANC,61.1744,PANC,2,3,-149.996002
2,542,Austin Bergstrom International Airport,12250,AUS,30.1945,KAUS,3,2,-97.669899
3,599,Nashville International Airport,11030,BNA,36.1245,KBNA,4,4,-86.6782
4,19,Boston Logan,10083,BOS,42.3643,KBOS,5,6,-71.005203
5,143,Baltimore/Washington International Airport,10502,BWI,39.1754,KBWI,6,3,-76.668297
6,14,Ronald Reagan Washington National Airport,7169,DCA,38.8521,KDCA,7,3,-77.037697
7,607,Dallas/Fort Worth International Airport,13401,DFW,32.896801,KDFW,8,7,-97.038002
8,64,Fort Lauderdale/Hollywood International Airport,9000,FLL,26.072599,KFLL,9,2,-80.152702
9,313,Washington Dulles International Airport,11500,IAD,38.9445,KIAD,10,4,-77.455803


Here is how the relationships are represented. Distances between airports are measured in miles.

In [18]:
# Choose the relationship from IN_CITY, IN_COUNTRY, IN_REGION, ON_CONTINENT, HAS_ROUTE
# with the corresponding node labels

query = """
    MATCH (a:Airport)-[r:HAS_ROUTE]->(b:Airport)
    RETURN a.iata AS source, b.iata AS target, r.distance AS distance, type(r) AS relation
    LIMIT 10
"""

# Explore some relationships
with driver.session() as session:
    result = session.run(query)
    df_rels = pd.DataFrame([record.data() for record in result])

df_rels 

Unnamed: 0,source,target,distance,relation
0,ATL,ZRH,4677.0,HAS_ROUTE
1,ATL,COS,1181.0,HAS_ROUTE
2,ATL,ABQ,1266.0,HAS_ROUTE
3,ATL,LHR,4198.0,HAS_ROUTE
4,ATL,SFO,2133.0,HAS_ROUTE
5,ATL,XNA,588.0,HAS_ROUTE
6,ATL,MID,933.0,HAS_ROUTE
7,ATL,BSB,4178.0,HAS_ROUTE
8,ATL,MTY,1084.0,HAS_ROUTE
9,ATL,STL,484.0,HAS_ROUTE


### Using the GDS library

Typically, when one wants to use graph algorithms, the Graph Data Science library is the best option. Indeed, it provides efficiently implemented versions of graph algorithms. In addition, it offers an intuitive framework that connects with Neo4j's graph database[5]. 


In Cypher, a GDS graph algorithm is called as follows:

`CALL gds[.<tier>].<algorithm>.<execution-mode>[.<estimate>](`

`graphName: String,`

  `configuration: Map)`

where `graphName` is the name of a projected graph that keeps only the nodes, relationships, and properties needed by the algorithm.
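The naming scheme can be illustrated with a small helper that assembles a procedure name from its parts. This is only a sketch: the function `gds_procedure` is hypothetical and not part of the GDS library, and whether a given combination exists depends on the installed GDS version.

```python
def gds_procedure(algorithm, mode, tier=None, estimate=False):
    """Assemble a GDS procedure name: gds[.<tier>].<algorithm>.<mode>[.estimate]."""
    parts = ["gds"]
    if tier:
        parts.append(tier)        # optional tier
    parts += [algorithm, mode]    # e.g. "betweenness" and "stream"
    if estimate:
        parts.append("estimate")  # optional memory estimation variant
    return ".".join(parts)


def gds_call(algorithm, mode, **kwargs):
    """Build a CALL statement whose arguments are passed as Cypher parameters."""
    return f"CALL {gds_procedure(algorithm, mode, **kwargs)}($graphName, $configuration)"


print(gds_procedure("shortestPath.dijkstra", "stream"))
print(gds_call("betweenness", "stream", estimate=True))
```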

### Path Finding Algorithms

#### The Dijkstra Algorithm

For this first example, we are going to apply the Dijkstra algorithm from the GDS library to find the shortest path between two cities. Because the original dataset contains airports connected by routes, we must first build a projected graph that contains only City nodes and City–City connections. 

A city is connected to another city if at least one airport in the first city has a route to an airport in the second city. 
The Cypher projection deriving this relationship is:

`(c1:City)<-[:IN_CITY]-(a1:Airport)-[:HAS_ROUTE]->(a2:Airport)-[:IN_CITY]->(c2:City)`


Together with the `RETURN` clause, it produces weighted edges where the weight corresponds to the flight distance.

In [19]:
query1 = """   
    CALL gds.graph.exists('city_routes') YIELD exists
    WITH exists
    WHERE exists
    CALL gds.graph.drop('city_routes') YIELD graphName
    RETURN graphName
"""

query2 = """   
    CALL gds.graph.project.cypher(
        'city_routes', 
        'MATCH (c:City) RETURN id(c) AS id',
        'MATCH (c1:City)<-[:IN_CITY]-(a1:Airport)-[r:HAS_ROUTE]->(a2:Airport)-[:IN_CITY]->(c2:City) 
        RETURN id(c1) AS source, id(c2) AS target, r.distance AS dist'
    )
    YIELD graphName, nodeCount, relationshipCount
"""


with driver.session() as session:
    run_cypher_query_df(query1)
    result = run_cypher_query_df(query2)
    print(result)
    



             0     1      2
0  city_routes  3359  46389
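The derivation of city-to-city edges from airport routes can be mimicked in a few lines of Python. The airport-to-city mapping and the distances below are hypothetical sample values, and `city_edges` is a name chosen for this sketch.

```python
# Hypothetical sample data: airport -> city, and directed routes with distances.
airport_city = {
    "JFK": "New York",
    "LGA": "New York",
    "LHR": "London",
    "BRU": "Brussels",
}
routes = [
    ("JFK", "LHR", 3440.0),
    ("LHR", "BRU", 217.0),
    ("LGA", "JFK", 11.0),
]


def city_edges(airport_city, routes):
    """Turn each airport route into a (source city, target city, distance) edge."""
    return [(airport_city[src], airport_city[dst], dist) for src, dst, dist in routes]


edges = city_edges(airport_city, routes)
for edge in edges:
    print(edge)
```

As in the Cypher pattern, one edge is produced per route, so a route between two airports of the same city yields a self-loop and several routes between the same pair of cities yield parallel edges.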


Once the projected graph is created, we run Dijkstra between two distant island cities: Funafuti (Tuvalu) and Basseterre (Saint Kitts)

In [20]:
with driver.session() as session:
    result = run_cypher_query_df("""
    MATCH (source:City {name: 'Funafuti'}), (target:City {name: 'Basseterre'})
    CALL gds.shortestPath.dijkstra.stream('city_routes', {
        sourceNode: source,
        targetNodes: target,
        relationshipWeightProperty: 'dist'
    })
    YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
    RETURN
        gds.util.asNode(sourceNode).name AS sourceNodeName,
        gds.util.asNode(targetNode).name AS targetNodeName,
        totalCost,
        [nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,
        costs
    ORDER BY index
    """)
    for _, row in result.iterrows():
        for i in range(len(row)):
            print(f"{i}: {row[i]}")

0: Funafuti
1: Basseterre
2: 3796.0
3: ['Funafuti', 'Nausori', 'Auckland', 'Hamilton', 'Montreal', 'Toronto', 'Kingston', 'Santo Domingo', 'San Juan', 'Basseterre']
4: [0.0, 659.0, 1988.0, 2054.0, 2398.0, 2704.0, 2859.0, 3327.0, 3567.0, 3796.0]


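The output contains:
- 0: the source city
- 1: the target city
- 2: the total length of the shortest path (in miles)
- 3: the list of cities along the path, including the source and the target
- 4: the cumulative cost after each hop (in miles)

This approach demonstrates how Neo4j GDS can use the Dijkstra algorithm to compute shortest paths on a custom graph projection derived from the original dataset.

The result should nevertheless be read with care: the path goes through 'Hamilton' and 'Kingston', names shared by cities in different countries (for example Hamilton in New Zealand and in Canada, Kingston in Canada and in Jamaica). Because `City` nodes are merged on `name` alone, such homonymous cities likely collapse into a single node, which creates artificial shortcuts and would explain a total of only 3796 miles for a trip between the Pacific and the Caribbean.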
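For readers who want to see what happens behind the procedure call, here is a compact pure-Python sketch of Dijkstra's algorithm on a small hypothetical graph. It is a textbook version using a binary heap and is not the GDS implementation. It returns the path and the cumulative costs, similar in spirit to the `nodeIds` and `costs` columns.

```python
import heapq


def dijkstra(graph, source, target):
    """graph: {node: [(neighbour, weight), ...]} with non-negative weights.
    Returns (path, cumulative_costs), or (None, None) if target is unreachable."""
    best = {source: 0.0}
    previous = {}
    visited = set()
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry
        visited.add(node)
        if node == target:
            break
        for neighbour, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(heap, (new_cost, neighbour))
    if target not in visited:
        return None, None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, [best[n] for n in path]


# Hypothetical weighted graph.
toy_graph = {
    "A": [("B", 4.0), ("C", 1.0)],
    "B": [("D", 3.0)],
    "C": [("B", 2.0), ("D", 7.0)],
    "D": [],
}

path, costs = dijkstra(toy_graph, "A", "D")
print(path, costs)
```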

#### The A* Algorithm
The A* algorithm also computes the shortest path between two nodes in a graph with weighted relationships. The difference with Dijkstra is that it uses a heuristic to guide the search towards the target. In our case, the heuristic is based on the latitude and longitude of the airports.

Let's first create a projected graph that contains the airports and their connections.



In [21]:
# Let's now create a projected graph with only the needed nodes and relationships
with driver.session() as session:
    # Drop any previous projection; `false` avoids an error if the graph does not exist yet
    run_cypher_query_df("CALL gds.graph.drop('airport_relations', false) YIELD graphName")
    result = run_cypher_query_df("""
    CALL gds.graph.project.cypher(
        'airport_relations',
        'MATCH (a:Airport)
        RETURN id(a) AS id,
                a.latitude AS latitude, 
                a.longitude AS longitude',
        'MATCH (a1:Airport)-[r:HAS_ROUTE]->(a2:Airport)
        RETURN id(a1) AS source, id(a2) AS target, 
                r.distance AS distance'
    )
    YIELD graphName, nodeCount, relationshipCount
    """)
    print(result)



                   0     1      2
0  airport_relations  3503  46389


The next step is to run the algorithm on our graph. For the purpose of the example, we use two airports that are poorly connected. Such airports can be identified with, for example, the PageRank algorithm (not shown here). We look for the shortest path between Pago Pago and Mount Pleasant. Both are located on small islands, the first one in Oceania and the second near Argentina.

In [22]:
with driver.session() as session:
    result = run_cypher_query_df("""
    MATCH (source:Airport {iata : 'PPG'}), (target:Airport {iata: 'MPN'})
    CALL gds.shortestPath.astar.stream('airport_relations', {
        sourceNode: id(source),
        targetNode: id(target),
        latitudeProperty: 'latitude',
        longitudeProperty: 'longitude',
        relationshipWeightProperty: 'distance'
    })
    YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
    RETURN
        gds.util.asNode(sourceNode).iata AS sourceNodeName,
        gds.util.asNode(targetNode).iata AS targetNodeName,
        totalCost,
        [nodeId IN nodeIds | gds.util.asNode(nodeId).iata] AS nodeNames,
        costs
    ORDER BY index
    """)
    for _, row in result.iterrows():
        for i in range(len(row)):
            print(f"{i}: {row[i]}")



0: PPG
1: MPN
2: 12202.0
3: ['PPG', 'HNL', 'PPT', 'IPC', 'SCL', 'PUQ', 'MPN']
4: [0.0, 2610.0, 5352.0, 7990.0, 10320.0, 11674.0, 12202.0]


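The output shows that the algorithm found a path of six flights with a total length of 12,202 miles: from Pago Pago (PPG) via Honolulu (HNL), Papeete (PPT), Easter Island (IPC), Santiago (SCL) and Punta Arenas (PUQ) to Mount Pleasant (MPN).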
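The role of the heuristic can be seen in the following pure-Python sketch of A*. It is a simplified textbook version and not the GDS implementation. The coordinates are hypothetical, the Earth radius is an assumed constant, and edge weights are set to the great-circle distance so that the heuristic never overestimates the remaining cost.

```python
import heapq
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def great_circle(a, b):
    """Haversine distance in miles between two (latitude, longitude) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(min(1.0, h)))


def astar(graph, coords, source, target):
    """Return (path, total_cost); the priority is cost so far plus straight-line estimate."""
    g = {source: 0.0}
    previous = {}
    closed = set()
    heap = [(great_circle(coords[source], coords[target]), source)]
    while heap:
        _, node = heapq.heappop(heap)
        if node == target:
            path = [node]
            while path[-1] != source:
                path.append(previous[path[-1]])
            return path[::-1], g[target]
        if node in closed:
            continue
        closed.add(node)
        for neighbour, weight in graph.get(node, []):
            tentative = g[node] + weight
            if tentative < g.get(neighbour, float("inf")):
                g[neighbour] = tentative
                previous[neighbour] = node
                estimate = tentative + great_circle(coords[neighbour], coords[target])
                heapq.heappush(heap, (estimate, neighbour))
    return None, float("inf")


# Hypothetical airports given as (latitude, longitude).
coords = {"A": (0.0, 0.0), "B": (0.0, 10.0), "C": (10.0, 5.0), "D": (0.0, 20.0)}


def edge(u, v):
    return (v, great_circle(coords[u], coords[v]))


toy_graph = {
    "A": [edge("A", "B"), edge("A", "C")],
    "B": [edge("B", "D")],
    "C": [edge("C", "D")],
    "D": [],
}

path, total = astar(toy_graph, coords, "A", "D")
print(path, round(total, 1))
```

Because the detour via `C` has a larger estimated total, the search reaches the target through `B` without ever expanding `C`.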

### Centrality Algorithm

#### Betweenness Centrality
Betweenness centrality is a measure of the importance of a node in a network based on how often it appears on the shortest paths between other nodes. A high betweenness centrality indicates that an airport acts as a critical bridge, facilitating connections between many other airports.
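The definition can be made concrete with a short pure-Python sketch of Brandes' algorithm for unweighted, directed graphs. This is a textbook version and not the GDS implementation, and the hub-and-spoke graph below is hypothetical.

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm; graph: {node: [neighbours]} with every node as a key.
    Scores count ordered (source, target) pairs and are not normalised."""
    scores = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in graph}    # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                     # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                scores[w] += delta[w]
    return scores


# Hypothetical hub "H" with routes to and from three spokes.
toy_graph = {"H": ["A", "B", "C"], "A": ["H"], "B": ["H"], "C": ["H"]}
scores = betweenness(toy_graph)
print(scores)
```

Every trip between two spokes has to pass through the hub, so the hub collects the whole score while the spokes stay at zero.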

To perform the betweenness centrality analysis, we create a projected graph that includes only the Airport nodes and HAS_ROUTE relationships along with the distance property. This projection ensures that the analysis focuses only on airports.

In [23]:
query1 = """   
    CALL gds.graph.exists('airport_relations') YIELD exists
    WITH exists
    WHERE exists
    CALL gds.graph.drop('airport_relations') YIELD graphName
    RETURN graphName
"""

query2 = """   
    CALL gds.graph.project.cypher(
        'airport_relations',
        'MATCH (a:Airport) RETURN id(a) AS id, a.latitude AS latitude, a.longitude AS longitude',
        'MATCH (a1:Airport)-[r:HAS_ROUTE]->(a2:Airport)
        RETURN id(a1) AS source, id(a2) AS target, r.distance AS distance'
    )
    YIELD graphName, nodeCount, relationshipCount
"""

with driver.session() as session:
    run_cypher_query_df(query1)
    result = run_cypher_query_df(query2)
    print(result)



                   0     1      2
0  airport_relations  3503  46389


We now compute the betweenness centrality:

In [24]:
with driver.session() as session:
    result = run_cypher_query_df("""
        CALL gds.betweenness.stream('airport_relations')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).iata AS iata, gds.util.asNode(nodeId).descr AS descr, score
        ORDER BY score DESC
        LIMIT 10
    """)
    print(result)

     0                                        1              2
0  DXB              Dubai International Airport  390958.585096
1  LAX        Los Angeles International Airport  368734.104102
2  CDG                  Paris Charles de Gaulle  365259.650707
3  PEK    Beijing Capital International Airport  340393.582884
4  IST           Istanbul International Airport  339441.886007
5  ORD     Chicago O'Hare International Airport  326830.237013
6  ANC                    Anchorage Ted Stevens  298891.743319
7  FRA                        Frankfurt am Main  285270.758162
8  DFW  Dallas/Fort Worth International Airport  277537.092487
9  AMS               Amsterdam Airport Schiphol  265922.715158


This result is in line with expectations. The airports with the highest scores are Dubai, Los Angeles, Paris, Beijing and Istanbul. Istanbul and Dubai both lie at the crossroads of Asia, Europe and Africa, so many intercontinental itineraries pass through them. Paris, Beijing and Los Angeles are major gateways for their respective continents. A direct consequence is that if one of these airports were to close, the impact on global connectivity would be severe: travellers would have to take longer, less efficient routes.

Airports with a score of 0 are typically small regional airports that lie on no shortest path between other airports. This can be verified by replacing `DESC` with `ASC` in the query.

In [None]:
import plotly.express as px

top_airports = run_cypher_query_df("""
CALL gds.betweenness.stream('airport_relations')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS airport, score
WHERE score > 50000 
RETURN airport.iata AS iata, airport.latitude AS lat, airport.longitude AS lon, score
ORDER BY score DESC
""")

fig = px.scatter_geo(
    top_airports,
    lat=1,
    lon=2,
    text=0,
    size=3,
    projection='natural earth',
    title='Airports with Highest Betweenness Centrality'
)
fig.show()


In [None]:
driver.close()
print("\nNeo4j connection closed.")

## Conclusion

TO DO

## Work distribution

The work was distributed as follows:
- Emilien DE LA BRASSINNE BONARDAUX 
- Yanis GEURTS: General structure, selection of the database, setup of the tutorial, EDA, simple queries, Dijkstra, A*, Betweenness Centrality
- Eri VAN DE VYVER: General tutorial structure, Introduction, Advantages and limitations

## Declaration on Generative AI

During the preparation of this work, the authors used ChatGPT-5 and Mistral Large for grammar and spelling checks, writing-style improvements, and peer-review simulation. After using these tools/services, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.

## Bibliography 

1. Debruyne Christophe. "INFO9016-1 Advanced Databases - Chapter 4: NoSQL Part 2". University of Liège, 2025.
2. Baylor Corydon and Enzo Htet. "What Are the Different Types of Graph Algorithms & When to Use Them?", 29 May 2025, url: https://neo4j.com/blog/graph-data-science/graph-algorithms/. Accessed 14 Nov 2025.
3. Neo4j. Airport Routes Graph Example. GitHub, url: https://github.com/neo4j-graph-examples/airport-routes. Accessed 14 Nov 2025.
4. Dadashzadeh, Ali. "Neo4j vs. SQL: Unlocking the Power of Graph-Based Data Modeling." DEV Community, 19 Feb 2025, url: https://dev.to/ali_dz/neo4j-vs-sql-unlocking-the-power-of-graph-based-data-modeling-33da. Accessed 14 Nov 2025.
5. Kolla, S. (2020). Neo4j Graph Data Science (GDS) library: Advanced analytics on connected data. International Journal of Advanced Research in Engineering and Technology, 11(8), 1077-1086.
6. Neo4j, Inc. "Neo4j Graph Database Documentation. - Cypher Manual" Neo4j, https://neo4j.com/docs/cypher-manual. Accessed 21 Nov. 2025.