<a href="https://colab.research.google.com/github/Vipul251/GraphRAG-/blob/main/Rome_transport_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install neo4j


I recommend that you set up a [blank project in the Neo4j Sandbox environment](https://sandbox.neo4j.com/?usecase=blank-sandbox), but you can also use another Neo4j environment. The queries below rely on the APOC and Graph Data Science libraries.



In [None]:
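# Define the Neo4j connection (replace these values with the details of your own instance)
import pandas as pd
from neo4j import GraphDatabase

host = 'bolt://3.235.2.228:7687'
user = 'neo4j'
password = 'seats-drunks-carbon'
driver = GraphDatabase.driver(host, auth=(user, password))

def run_query(query, parameters=None):
    """Run a Cypher query and return the result as a pandas DataFrame."""
    with driver.session() as session:
        # Avoid a mutable default argument; fall back to an empty parameter map
        result = session.run(query, parameters or {})
        return pd.DataFrame([r.values() for r in result], columns=result.keys())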

# Analysis of Rome transportation system

I found an excellent dataset describing the transportation network of Rome. It is rich in detail and covers five different transportation modes: walk, bus, tram, rail, and subway.

## Constraints

In [None]:
stop_constraint_query = "CREATE CONSTRAINT ON (m:Stop) ASSERT m.id IS UNIQUE;"
run_query(stop_constraint_query)

## Import data

We will first import the nodes of the network and then import the relationships.

In [None]:
import_nodes_query = """

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/rome/network_nodes.csv" as row FIELDTERMINATOR ";"
MERGE (s:Stop{id:row.stop_I})
SET s+=apoc.map.clean(row,['stop_I'],[])

"""

run_query(import_nodes_query)

In [None]:
import_rels_query = """

UNWIND ['walk','bus','tram','rail','subway'] as mode
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/rome/network_" + mode + ".csv" as row FIELDTERMINATOR ";"
MATCH (from:Stop{id:row.from_stop_I}),(to:Stop{id:row.to_stop_I})
CALL apoc.create.relationship(from, toUpper(mode),
{distance:toInteger(row.d),duration_avg:toFloat(row.duration_avg)}, to) YIELD rel
RETURN distinct 'done'

"""

run_query(import_rels_query)

## Preprocess attributes

Walking is the only transportation mode that lacks the average duration attribute. Luckily for us, we can easily calculate it if we assume that a person walks at 5 kilometers per hour on average, or around 1.4 meters per second.

In [None]:
walking_duration_calculation = """
WITH 1.38889 as walking_speed
MATCH (:Stop)-[w:WALK]->()
SET w.duration_avg = toFloat(w.distance) / walking_speed
"""

run_query(walking_duration_calculation)
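As a quick sanity check of the arithmetic above, here is a minimal Python sketch of the same conversion. The helper name `walking_duration` is only illustrative and is not part of the notebook; the 67-meter example is the first walking hop that appears in the shortest-path results later on.

```python
# Convert the assumed walking speed from km/h to m/s.
WALKING_SPEED_KMH = 5
walking_speed = WALKING_SPEED_KMH * 1000 / 3600  # meters per second

def walking_duration(distance_m: float) -> float:
    """Seconds needed to walk distance_m meters at the assumed speed (illustrative helper)."""
    return distance_m / walking_speed

print(round(walking_speed, 5))         # 1.38889
print(round(walking_duration(67), 1))  # 48.2
```

A 67-meter walk therefore takes roughly 48 seconds, or about 0.8 minutes.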

# Graph algorithms

Now that the graph is prepared, we can start the graph algorithms pipeline by projecting the graph stored in Neo4j into an in-memory graph. We load the graph with five relationship types and two relationship attributes. The algorithms can use either of these two attributes as the relationship weight.

## Load the graph

In [None]:
project_graph = """
CALL gds.graph.project('rome','Stop',
    ['BUS','RAIL','SUBWAY','TRAM','WALK'],
    {
       relationshipProperties:{
          distance:{
             property:'distance'
          },
          duration:{
             property:'duration_avg'
          }
       }
    })
"""

run_query(project_graph)

| | nodeProjection | relationshipProjection | graphName | nodeCount | relationshipCount | projectMillis |
|---|---|---|---|---|---|---|
| 0 | {'Stop': {'label': 'Stop', 'properties': {}}} | {'SUBWAY': {'orientation': 'NATURAL', 'aggrega... | rome | 7869 | 144838 | 470 |


## PageRank

To start the analysis, let’s find the most graphfamous™ stops in the tram transportation network using the PageRank algorithm.

In [None]:
import pandas as pd

def read_query(query):
    with driver.session() as session:
        result = session.run(query)
        return pd.DataFrame([r.values() for r in result], columns=result.keys())


In [None]:
# Pagerank on a single relationship type
pagerank_single_rel = """
CALL gds.pageRank.stream('rome', {relationshipTypes:['TRAM']})
YIELD nodeId, score
WITH nodeId, score
ORDER BY score DESC LIMIT 5
RETURN gds.util.asNode(nodeId).name as name, score
"""

read_query(pagerank_single_rel)

| | name | score |
|---|---|---|
| 0 | LABICANO/PORTA MAGGIORE | 2.078166 |
| 1 | PRENESTINA/TOR DE' SCHIAVI | 1.722198 |
| 2 | TRASTEVERE/MIN. P.ISTRUZIONE | 1.690693 |
| 3 | PRENESTINA/OLEVANO ROMANO | 1.608055 |
| 4 | TRASTEVERE/BERNARD. DA FELTRE | 1.578368 |
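
To build some intuition for what the PageRank score measures, here is a minimal pure-Python sketch of the textbook power-iteration formulation on a made-up toy network. It is not how the Graph Data Science library implements the algorithm, and its scores are normalized to sum to 1, so the numbers are not comparable with the table above. The damping factor of 0.85 and the 20 iterations are assumptions chosen for this example.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Textbook power-iteration PageRank over directed (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for source, _ in edges:
        out_degree[source] += 1
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(score[n] for n in nodes if out_degree[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_score = {n: base for n in nodes}
        # Every node passes an equal share of its rank along each outgoing link.
        for source, target in edges:
            new_score[target] += damping * score[source] / out_degree[source]
        score = new_score
    return score

# Hypothetical toy network: three stops feed into a hub, which links back to two of them.
toy_edges = [("A", "HUB"), ("B", "HUB"), ("C", "HUB"), ("HUB", "A"), ("HUB", "B")]
scores = pagerank(toy_edges)
print(max(scores, key=scores.get))  # HUB
```

The stop that many other stops lead to ends up with the highest score, which is the sense in which a stop is graphfamous™.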


The graph loader supports loading many relationship types, and so do the algorithms. In this example, we search for the most graphfamous™ stops in the combined network of buses, trams, and rails.

In [None]:
# Pagerank on multiple relationship types
pagerank_multi_rel = """
CALL gds.pageRank.stream('rome', {relationshipTypes:['TRAM','RAIL','BUS']})
YIELD nodeId, score
WITH nodeId, score
ORDER BY score DESC LIMIT 5
RETURN gds.util.asNode(nodeId).name as name, score
"""

read_query(pagerank_multi_rel)

| | name | score |
|---|---|---|
| 0 | LGT SASSIA/S. SPIRITO (H) | 5.293274 |
| 1 | TUSCOLANA/ROCCELLA JONICA | 4.573954 |
| 2 | LAURENTINA/DOUHET | 4.093072 |
| 3 | PETROSELLI | 4.072084 |
| 4 | ANAGNINA/CASALE FERRANTI | 4.010664 |


## Connected components algorithm
A graph algorithms pipeline can also be part of a batch processing job, where you load the graph in memory, run a couple of algorithms, write the results back to Neo4j, and unload the in-memory graph. Let’s run the connected components algorithm separately on each of the transportation mode networks and write back the results.

In [None]:
# Connected components writeback
connected_components_query = """
UNWIND ["BUS","RAIL","SUBWAY","TRAM","WALK"] as mode
CALL gds.wcc.write('rome', {writeProperty:toLower(mode) + "_component", relationshipTypes:[mode]})
YIELD computeMillis
RETURN distinct 'done'
"""
run_query(connected_components_query)

| | 'done' |
|---|---|
| 0 | done |
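
For readers who have not met the algorithm before, the idea behind weakly connected components can be sketched in a few lines of pure Python with a union-find structure. This is only an illustration on hypothetical stop ids, not the library's implementation.

```python
def weakly_connected_components(nodes, edges):
    """Label each node with a component id using union-find; edge direction is ignored."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merge the two components, keeping the smaller id as the label.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {n: find(n) for n in nodes}

# Hypothetical stop ids: 1-2-3 form one line, 4-5 another, and 6 has no connections.
components = weakly_connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (3, 2), (4, 5)])
print(components)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
```

Note that node 6 has no connections and still receives its own component id. If the library likewise labels every projected node regardless of the relationship type filter, which is an assumption worth checking against its documentation, stops that a given mode does not serve would show up as single-member components.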


Explore the connected components in the SUBWAY network.


In [None]:
# Explore subway network components
explore_subway_component_query = """
MATCH (s:Stop)
WHERE s.subway_component IS NOT NULL
RETURN s.subway_component as component,
       collect(s.name)[..3] as example_members,
       count(*) as size
ORDER BY size DESC
LIMIT 10
"""

read_query(explore_subway_component_query)

| | component | example_members | size |
|---|---|---|---|
| 0 | 7748 | [ANAGNINA, FURIO CAMILLO, PONTE LUNGO] | 27 |
| 1 | 7721 | [BATTISTINI, BARBERINI, REPUBBLICA] | 27 |
| 2 | 7801 | [LAURENTINA, COLOSSEO, CAVOUR] | 26 |
| 3 | 7775 | [REBIBBIA, CASTRO PRETORIO, TERMINI] | 26 |
| 4 | 7827 | [PANTANO, GRANITI, FINOCCHIO] | 21 |
| 5 | 7848 | [PANTANO, GRANITI, FINOCCHIO] | 21 |
| 6 | 6 | [Villa Bonelli] | 1 |
| 7 | 5 | [Muratella] | 1 |
| 8 | 2 | [La Storta] | 1 |
| 9 | 9 | [Roma Trastevere] | 1 |


These results look odd. I have never been to Rome, but I highly doubt that the SUBWAY network is split into this many disconnected components. Even from the results alone, you might wonder why components 7848 and 7827 have the same members.
Your component ids will likely be different, so make sure to use the right ones.

It is hard to see, but there are stops in the network that share the same name. While the names of the stops might be the same, the stop ids are not, so the stops are treated as separate nodes. We can guess that this is a single line served in both directions, with one stop on each side of the road. As the stops for the two directions are a walking distance apart, this dataset differentiates between them.
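
One way to make such duplicates visible is to group stop ids by name. The sketch below does this in plain Python on a handful of made-up `(id, name)` pairs; the ids are hypothetical placeholders, and in practice the pairs would come from a query over the `Stop` nodes.

```python
from collections import defaultdict

# Hypothetical (id, name) pairs standing in for Stop nodes.
stops = [
    ("s1", "PANTANO"),
    ("s2", "PANTANO"),
    ("s3", "LAURENTINA"),
    ("s4", "GRANITI"),
    ("s5", "GRANITI"),
]

# Collect every stop id under the name it carries.
ids_by_name = defaultdict(list)
for stop_id, name in stops:
    ids_by_name[name].append(stop_id)

# Keep only the names used by more than one stop.
duplicated = {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
print(duplicated)  # {'PANTANO': ['s1', 's2'], 'GRANITI': ['s4', 's5']}
```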

## Shortest paths algorithms
Shortest paths are a use case where you would want to keep the projected graph in memory all the time. Imagine we are building an application that helps us find the shortest or fastest path between two points in Rome. We don’t want to project the graph for every query, but rather keep the projected graph in memory permanently.
We can search for the shortest path that traverses only a specific relationship type, which in our case means a specific transportation mode.

In [None]:
# Shortest path using single relationship type
shortest_path_single_rel_query = """
MATCH (start:Stop{name:'Parco Leonardo'}),(end:Stop{name:'Roma Trastevere'})
CALL gds.shortestPath.dijkstra.stream('rome',{sourceNode: start, targetNode: end,
  relationshipWeightProperty:'distance',relationshipTypes:['RAIL']})
YIELD nodeIds,costs
UNWIND range(0, size(nodeIds) - 1) AS index
RETURN index, gds.util.asNode(nodeIds[index]).name as name, costs[index] as meters
"""
read_query(shortest_path_single_rel_query)

| | index | name | meters |
|---|---|---|---|
| 0 | 0 | Parco Leonardo | 0.0 |
| 1 | 1 | Fiera di Roma | 2217.0 |
| 2 | 2 | Ponte Galeria | 4537.0 |
| 3 | 3 | Muratella | 9886.0 |
| 4 | 4 | Magliana | 12020.0 |
| 5 | 5 | Villa Bonelli | 14529.0 |
| 6 | 6 | Roma Trastevere | 17403.0 |


The problem with using only the RAIL network is that most of the stops are not in the RAIL network. To be able to find the shortest path between any pair of stops in our network, we have to allow the algorithm to traverse the WALK relationships as well.

In [None]:
# Shortest path using multi relationship types
shortest_path_multi_rel_query = """
MATCH (start:Stop{name:'LABICANO/PORTA MAGGIORE'}),(end:Stop{name:'TARDINI'})
CALL gds.shortestPath.dijkstra.stream('rome',{sourceNode: start, targetNode: end,
  relationshipWeightProperty:'distance',relationshipTypes:['RAIL', 'WALK']})
YIELD nodeIds,costs
UNWIND range(0, size(nodeIds) - 1) AS index
RETURN index, gds.util.asNode(nodeIds[index]).name as name, costs[index] as meters
"""

read_query(shortest_path_multi_rel_query)

| | index | name | meters |
|---|---|---|---|
| 0 | 0 | LABICANO/PORTA MAGGIORE | 0.0 |
| 1 | 1 | PORTA MAGGIORE | 67.0 |
| 2 | 2 | Termini Laziali | 1002.0 |
| 3 | 3 | Roma San Pietro | 5296.0 |
| 4 | 4 | GREGORIO VII/STAZ. S. PIETRO (FS) | 5676.0 |
| 5 | 5 | AURELIA/PAOLO III | 6239.0 |
| 6 | 6 | VALLE AURELIA (MA) | 6727.0 |
| 7 | 7 | PATETTA/D'AMELIO | 7420.0 |
| 8 | 8 | TARDINI | 7906.0 |


If you remember, we stored two relationship attributes in the in-memory graph. Let’s now use the duration attribute as the weight.

In [None]:
# Shortest path by duration
shortest_path_duration_query = """
MATCH (start:Stop{name:'LABICANO/PORTA MAGGIORE'}),(end:Stop{name:'TARDINI'})
CALL gds.shortestPath.dijkstra.stream('rome',{sourceNode: start, targetNode: end,
  relationshipWeightProperty:'duration',relationshipTypes:['RAIL', 'WALK']})
YIELD nodeIds,costs
UNWIND range(0, size(nodeIds) - 1) AS index
RETURN index, gds.util.asNode(nodeIds[index]).name as name, toFloat(costs[index]) / 60 as minutes
"""
read_query(shortest_path_duration_query)

| | index | name | minutes |
|---|---|---|---|
| 0 | 0 | LABICANO/PORTA MAGGIORE | 0.0 |
| 1 | 1 | PORTA MAGGIORE | 0.803999 |
| 2 | 2 | S. BIBIANA | 2.670666 |
| 3 | 3 | TERMINI LAZIALI | 4.653999 |
| 4 | 4 | Termini Laziali | 4.797999 |
| 5 | 5 | Roma San Pietro | 17.297999 |
| 6 | 6 | Valle Aurelia | 21.321529 |
| 7 | 7 | STAMPINI | 28.773523 |
| 8 | 8 | TARDINI | 36.969516 |
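
The three queries in this section differ only in the relationship types they allow and in the weight property they minimize. The following pure-Python sketch of Dijkstra's algorithm on a made-up four-stop network shows how those two choices change the result. It is an illustration under assumed data, not the library's implementation; the function signature and the toy network are hypothetical.

```python
import heapq

def dijkstra(edges, source, target, weight, allowed_types):
    """Return [(stop, cumulative_cost), ...] for the cheapest path, or None if unreachable."""
    graph = {}
    for start, end, rel_type, props in edges:
        if rel_type in allowed_types:  # mimic the relationship type filter
            graph.setdefault(start, []).append((end, props[weight]))
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best[node]:
            continue  # stale queue entry, a cheaper route was already found
        for neighbour, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    if target not in best:
        return None
    # Walk back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return [(stop, best[stop]) for stop in reversed(path)]

# Hypothetical toy network: (from, to, type, properties); distance in meters, duration in seconds.
toy = [
    ("A", "B", "RAIL", {"distance": 1000, "duration": 300}),
    ("B", "C", "RAIL", {"distance": 1000, "duration": 300}),
    ("A", "C", "WALK", {"distance": 1500, "duration": 1080}),
    ("C", "D", "WALK", {"distance": 200, "duration": 144}),
]
print(dijkstra(toy, "A", "D", "distance", {"RAIL"}))
# None
print(dijkstra(toy, "A", "D", "distance", {"RAIL", "WALK"}))
# [('A', 0.0), ('C', 1500.0), ('D', 1700.0)]
print(dijkstra(toy, "A", "D", "duration", {"RAIL", "WALK"}))
# [('A', 0.0), ('B', 300.0), ('C', 600.0), ('D', 744.0)]
```

With RAIL alone, stop D cannot be reached at all. Once WALK is allowed, the shortest route by distance walks straight from A to C, while the fastest route by duration takes the rail through B instead.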
