# All Shortest Paths for metal use case

#### Unweighted shortest paths
This demo finds all shortest paths for every pair of nodes from a group of source nodes (S) to a group of target nodes (T).

Given source S and target T in the example below, the shortest paths have 3 hops (2 nodes in between), including S->1->2->T and S->1->3->T.   
<img align='left' src="img/shortest_paths.png" width='500'> 
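
To make the definition concrete, here is a minimal plain-Python sketch that enumerates all shortest paths on the toy graph from the figure. The adjacency dictionary is read off the two paths named above. The function name `all_shortest_paths` and the variable `toy_graph` are illustrative assumptions, not part of `lifelike_gds`.

```python
from collections import deque


def all_shortest_paths(adj, source, target):
    """Return every shortest path from source to target in an unweighted directed graph."""
    # Breadth-first search that records, for each node, its distance from the
    # source and every predecessor that lies on some shortest path to it.
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                preds[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                preds[nxt].append(node)
    if target not in dist:
        return []

    # Follow predecessor links back from the target to rebuild the paths.
    def build(node):
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in build(p)]

    return build(target)


# Toy graph (assumed from the two example paths S->1->2->T and S->1->3->T).
toy_graph = {
    "S": ["1"],
    "1": ["2", "3"],
    "2": ["T"],
    "3": ["T"],
}
paths = all_shortest_paths(toy_graph, "S", "T")
for path in paths:
    print("->".join(path))
```

NetworkX offers `all_shortest_paths` for the same task. The version above only illustrates the idea and is not necessarily how `ShortestPathTrace` computes its traces.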

### Steps for creating shortest paths traces
1. Connect to arango database
2. Get input nodes (sources and targets)
3. Load the whole network graph from arango to memory and create a networkx graph
4. Add all shortest paths for each pair of nodes from sources to targets
5. Generate sankey graph file

In [1]:
# install lifelike_gds package if not already installed (e.g. running in Google Colab)
import importlib.util  # import the submodule explicitly; `import importlib` alone does not guarantee it is loaded

if importlib.util.find_spec('lifelike_gds') is None:
  !pip install git+https://github.com/SBRG/GDS-Public

# provide the path to the notebook folder in the github repository in case the notebook is run in Google Colab
github_path = 'SBRG/GDS-Public/main/notebooks/CfB_Workshop/metal'

In [2]:
import os
import pandas as pd
import warnings
from pathlib import PurePosixPath

In [3]:
from lifelike_gds.arango_network.biocyc import *
from lifelike_gds.arango_network.shortest_paths_trace import ShortestPathTrace



In [4]:
warnings.filterwarnings('ignore')

In [5]:
input_dir = PurePosixPath('input')
output_dir = PurePosixPath('output')
# gds database name
db_name = 'ecocyc-25'
# gds database version, free text, that can be used to describe the graph
db_version = 'ecocyc 25.5'

### 1. Connect to arango database
If you use a BioCyc database (e.g. EcoCyc, HumanCyc), use the class `BiocycDB`.  
If you use the Reactome database, use the class `ReactomeDB`. 

In [6]:
# set database uri, username and password. 
# dbname is the Arango database name for the running Arango instance. It is read from the
# ARANGO_DATABASE environment variable and falls back to db_name when that variable is unset.
dbname = os.getenv('ARANGO_DATABASE', db_name)

database = BiocycDB(dbname)

### 2. Read input files and get source and target nodes from arango

In [7]:
metal_file_path = input_dir / 'metals.csv'
if os.path.isfile(metal_file_path):
  metal_file_ref = metal_file_path
else:
  # if the file does not exist locally, pull it from GitHub
  metal_file_ref = f'https://raw.githubusercontent.com/{github_path}/{metal_file_path}'

In [8]:
df_metal = pd.read_csv(metal_file_ref)
print(df_metal)

   name biocyc_id
0  Zn2+      ZN+2
1  Fe2+      FE+2
2  Fe3+      FE+3


In [9]:
metals = [n for n in df_metal['biocyc_id']]
metal_nodes = database.get_nodes_by_attr(metals, 'biocyc_id')
print(f"metals: {metals}, nodes: {len(metal_nodes)}")

metals: ['ZN+2', 'FE+2', 'FE+3'], nodes: 3


In [10]:
metab_file_path = input_dir / 'biomass_precursors.csv'
if os.path.isfile(metab_file_path):
  metab_file_ref = metab_file_path
else:
  # if the file does not exist locally, pull it from GitHub
  metab_file_ref = f'https://raw.githubusercontent.com/{github_path}/{metab_file_path}'

In [11]:
df_metab = pd.read_csv(metab_file_ref)
metabs = [n for n in df_metab['biocyc_id']]
metab_nodes = database.get_nodes_by_attr(metabs, 'biocyc_id')
print(f"metabolites: {len(metabs)}, nodes: {len(metab_nodes)}")

metabolites: 12, nodes: 12
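
The two input files above are located with the same local-or-GitHub fallback. A small helper could express that pattern once. This is only a sketch: the function name `resolve_input` is hypothetical, and the URL layout is copied from the cells above.

```python
import os
from pathlib import PurePosixPath


def resolve_input(file_path, github_path):
    """Return the local path if the file exists, otherwise a raw GitHub URL for it."""
    if os.path.isfile(file_path):
        return file_path
    # Fall back to the copy in the GitHub repository (same URL layout as the cells above).
    return f"https://raw.githubusercontent.com/{github_path}/{file_path}"


github_path = "SBRG/GDS-Public/main/notebooks/CfB_Workshop/metal"
input_dir = PurePosixPath("input")
metal_file_ref = resolve_input(input_dir / "metals.csv", github_path)
metab_file_ref = resolve_input(input_dir / "biomass_precursors.csv", github_path)
```

Either return value can be passed to `pd.read_csv`, which accepts both local paths and URLs.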


### 3. Create a trace graph object, and create networkx graph from arango data

In [12]:
tracegraph = ShortestPathTrace(Biocyc(database))
# set the output directory where the excel and graph files will be written
tracegraph.datadir = output_dir
# initiate tracegraph by loading graph data from arango
# a networkx graph is created here.  
tracegraph.init_default_graph()

INFO: MultiDirectedGraph with 33428 nodes and 37886 edges


### 4. Add shortest paths traces, and generate sankey graph file
First generate a shortest paths graph from each metal (Zn2+, Fe2+ and Fe3+) to the biomass precursor metabolites, then generate a combined shortest paths graph from all metals to the biomass precursor metabolites.

#### Generate shortest paths from metals to metabolites
- Make a copy of the clean graph to work on
- Set the source and target node sets
- Add shortest paths from each metal to the metabolites
- Add shortest paths from all metals to the metabolites
- Generate the sankey graph file

In [13]:
# create a copy of the graph so that the original graph is clean (without traces)
tracegraph.graph = tracegraph.orig_graph.copy()
nodes1 = database.get_nodes_by_attr(['ppGpp'], 'name', 'Compound')
nodes2 = database.get_nodes_by_attr(['ORNDECARBOX-RXN'], 'eid')
print(len(nodes1), len(nodes2))


1 1


In [14]:
source = tracegraph.set_node_set_for_node(nodes1[0])
target = tracegraph.set_node_set_for_node(nodes2[0])
# add graph description
tracegraph.add_graph_description(f'database: {db_name}\n')
tracegraph.add_shortest_paths(source, target)
# tracegraph.write_to_sankey_file(f"shortestpaths_from{source}_to_{target}.graph")

INFO: add Shortest paths from ppGpp to Ornithine decarboxylase-RXN: 1 paths


True

In [15]:
# create a copy of the graph so that the original graph is clean (without traces)
tracegraph.graph = tracegraph.orig_graph.copy()

# Set source and target node sets
SOURCE_SET = 'metals'
TARGET_SET = 'biomass_metabolites'
tracegraph.set_node_set_from_arango_nodes(metal_nodes, SOURCE_SET, SOURCE_SET)
tracegraph.set_node_set_from_arango_nodes(metab_nodes, TARGET_SET, TARGET_SET)

# add graph description
tracegraph.add_graph_description(f'database: {db_name}\n')

# add traces from each metal (Zn2+, Fe2+, Fe3+) to metabolites
for node in metal_nodes:
    node_key = tracegraph.set_node_set_for_node(node)
    tracegraph.add_shortest_paths(node_key, TARGET_SET, sources_as_query=False)
    
# add shortest paths from all metals to metabolites
tracegraph.add_shortest_paths(SOURCE_SET, TARGET_SET)

INFO: add Shortest paths from Zn2+ to biomass_metabolites: 62 paths
INFO: add Shortest paths from Fe2+ to biomass_metabolites: 33 paths
INFO: add Shortest paths from Fe3+ to biomass_metabolites: 33 paths
INFO: add Shortest paths from metals to biomass_metabolites: 128 paths


True

In [16]:
# export sankey file (json file). Make sure that the file name ends with .graph. Otherwise, Lifelike cannot recognize it as a graph file
graphfile = "Shortest_Paths_from_Zn_Fe_to_biomass_metabs.graph"
tracegraph.write_to_sankey_file(graphfile)

INFO: clean graph: number of graph nodes decreased from 33428 to 156
INFO: writing output/Shortest_Paths_from_Zn_Fe_to_biomass_metabs


## (Optional) Shortest paths +N
This demo finds all shortest paths and shortest paths +N for every pair of nodes from a group of source nodes (S) to a group of target nodes (T).

In the example below, given source S and target T, the shortest paths have 3 hops (2 nodes in between), including S->1->2->T and S->1->3->T.   
__Shortest paths+1__ include the shortest paths and paths with one more node (3 nodes in between): S->1->2->4->T, S->1->3->5->T and S->6->7->3->T.  
__Shortest paths+2__ include the shortest paths and paths with one or two more nodes (up to 4 nodes in between). Paths with 4 nodes in between are S->6->7->3->5->T and S->8->9->10->11->T.

<table><tr>
    <td><img src="img/shortest_paths.png"> </td>
    <td><img src="img/shortest_paths_plus1.png"> </td>
    <td><img src="img/shortest_paths_plus2.png"> </td>
</tr></table>
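
The +N definition can be checked on the toy graph with a short plain-Python sketch. It finds the shortest hop count by breadth-first search, then enumerates every simple path with at most that many hops plus N. The edge list is read off the paths named above, assuming each listed path ends at T. The names `shortest_distance`, `shortest_paths_plus_n` and `toy_graph` are illustrative and not part of `lifelike_gds`.

```python
from collections import deque


def shortest_distance(adj, source, target):
    """Return the hop count of a shortest path, or None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def shortest_paths_plus_n(adj, source, target, n=0):
    """Return all simple paths with at most (shortest hop count + n) hops."""
    shortest = shortest_distance(adj, source, target)
    if shortest is None:
        return []
    max_hops = shortest + n
    found = []

    def extend(path):
        node = path[-1]
        if node == target:
            found.append(list(path))
            return
        if len(path) - 1 == max_hops:
            return  # hop budget used up without reaching the target
        for nxt in adj.get(node, []):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                path.append(nxt)
                extend(path)
                path.pop()

    extend([source])
    return found


# Toy graph assumed from the example paths in the text.
toy_graph = {
    "S": ["1", "6", "8"],
    "1": ["2", "3"],
    "2": ["T", "4"],
    "3": ["T", "5"],
    "4": ["T"],
    "5": ["T"],
    "6": ["7"],
    "7": ["3"],
    "8": ["9"],
    "9": ["10"],
    "10": ["11"],
    "11": ["T"],
}
for n in (0, 1, 2):
    print(n, len(shortest_paths_plus_n(toy_graph, "S", "T", n)))
```

On this toy graph the counts are 2, 5 and 7 paths for N = 0, 1 and 2, which matches the paths listed in the text. Exhaustive enumeration like this grows quickly with N on large graphs, so it is meant only to illustrate the definition.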

The steps for running shortest paths+N are the same as for running shortest paths, except that you need one additional parameter, __shortest_paths_plus_n__. By default, the parameter is set to 0.  
For shortest paths+3, we set the parameter to 3:
```
tracegraph.add_shortest_paths(SOURCE_SET, TARGET_SET, shortest_paths_plus_n=3)
```

In [17]:
# create a copy of the graph so that the original graph is clean (without traces)
tracegraph.graph = tracegraph.orig_graph.copy()

# Set source and target node sets
SOURCE_SET = 'metals'
TARGET_SET = 'biomass_metabolites'
tracegraph.set_node_set_from_arango_nodes(metal_nodes, SOURCE_SET, SOURCE_SET)
tracegraph.set_node_set_from_arango_nodes(metab_nodes, TARGET_SET, TARGET_SET)

# add graph description
tracegraph.add_graph_description(f'database: {db_name}\n')
    
# add shortest paths from all metals to metabolites
tracegraph.add_shortest_paths(SOURCE_SET, TARGET_SET, shortest_paths_plus_n=3)

# export sankey file (json file). Make sure that the file name ends with .graph. 
graphfile = "Shortest_Paths+3_from_Zn_Fe_to_biomass_metabs.graph"
tracegraph.write_to_sankey_file(graphfile)

INFO: add Shortest paths from metals to biomass_metabolites: 128 paths


NetworkXNotImplemented: not implemented for multigraph type
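
NetworkX raises `NetworkXNotImplemented` with this message when an algorithm that does not support multigraphs is called on one, and the graph loaded in step 3 is reported as a `MultiDirectedGraph`. Which call inside `lifelike_gds` triggers the error, and whether the package has a supported way around it, is not shown in this notebook. As a generic illustration only, the sketch below collapses parallel edges into a simple directed adjacency mapping. The function name `collapse_parallel_edges` and the edge labels are hypothetical.

```python
def collapse_parallel_edges(edges):
    """Build a simple directed adjacency mapping from (source, target, key) edges."""
    adj = {}
    for source, target, _key in edges:
        # A set keeps one entry per (source, target) pair, however many parallel edges exist.
        adj.setdefault(source, set()).add(target)
        adj.setdefault(target, set())
    return {node: sorted(targets) for node, targets in adj.items()}


# Hypothetical multigraph edges: two parallel edges from S to 1 with different labels.
multi_edges = [
    ("S", "1", "label_a"),
    ("S", "1", "label_b"),
    ("1", "T", "label_c"),
]
simple_adj = collapse_parallel_edges(multi_edges)
print(simple_adj)
```

Collapsing discards the edge keys and their attributes, so it only suits steps that need path topology rather than edge details.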