# All Shortest Paths for purF-curli


This demo finds all shortest paths from the gene purF to a list of curli-related genes.

#### Library:  NetworkX
NetworkX is a Python package for the creation, manipulation, and study of the structure,
dynamics, and functions of complex networks.  

#### Unweighted shortest paths
Given source S and target T in the example below, the shortest paths have 3 hops (2 nodes in between), including S->1->2->T and S->1->3->T.   
<img align='left' src="img/shortest_paths.png" width='500'>    
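
The toy example above can be reproduced with NetworkX directly. This is a minimal sketch: the edge list is reconstructed from the two paths named in the text, so it is an assumption and the figure may contain additional edges.

```python
import networkx as nx

# Toy graph reconstructed from the two paths named in the text
# (assumption: the figure may contain more edges than these).
G = nx.DiGraph()
G.add_edges_from([
    ("S", "1"),
    ("1", "2"),
    ("1", "3"),
    ("2", "T"),
    ("3", "T"),
])

# all_shortest_paths yields every path with the minimum number of hops
# when no edge weight is given (unweighted search).
paths = sorted(nx.all_shortest_paths(G, source="S", target="T"))
for path in paths:
    print(" -> ".join(path), f"({len(path) - 1} hops)")
```

Both paths share the same hop count, which is why both are reported rather than a single arbitrary one.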


### Steps for creating shortest-path traces
1. Connect to the ArangoDB database
2. Get the input nodes (sources and targets)
3. Load the whole network graph from ArangoDB into memory and create a NetworkX graph
4. Add all shortest paths for each source-target pair, and generate a Sankey graph file
5. Optionally, add shortest paths to the graph and generate a Cytoscape JSON file

In [5]:
import os
import sys
root = os.getcwd().split('/notebooks/')[0]
sys.path.append(os.path.join(root, 'src'))

In [6]:
from lifelike_gds.arango_network.biocyc import *
from lifelike_gds.arango_network.shortest_paths_trace import ShortestPathTrace
import pandas as pd

In [7]:
import warnings
warnings.filterwarnings('ignore')

In [8]:
input_dir = 'input'
output_dir = 'output'
os.makedirs(output_dir, mode=0o777, exist_ok=True)
# gds database name
db_name = 'ecocyc'
# gds database version, free text, that can be used to describe the graph
db_version = 'ecocyc 25.5'

### 1. Connect to the ArangoDB database
For BioCyc databases (e.g. EcoCyc, HumanCyc), use the BiocycDB class.  
For the Reactome database, use the ReactomeDB class.

In [9]:
# Set the database URI, username and password.
# dbname is the ArangoDB database name for the running instance; here it falls back to db_name when ARANGO_DATABASE is not set.
dbname = os.getenv('ARANGO_DATABASE', db_name)

database = BiocycDB(dbname)

### 2. Get source and target nodes from the ArangoDB database

In [10]:
# source gene
gene_name = 'purF'
source_nodes = database.get_nodes_by_attr([gene_name], 'name', 'Gene')
print(len(source_nodes))
purF_node = source_nodes[0]

1


In [11]:
# Curli genes (CSG genes)
csg_file = 'csg_genes.csv'
df = pd.read_csv(os.path.join(input_dir, csg_file))
df

Unnamed: 0,name,biocyc_id
0,csgB,G6547
1,csgF,G6544
2,csgA,EG11489
3,csgC,G6548
4,csgE,G6545
5,csgG,G6543
6,csgD,G6546


In [12]:
curli_genes = df['biocyc_id'].tolist()
curli_nodes = database.get_nodes_by_attr(curli_genes, 'biocyc_id')
print(f"curli genes: {len(curli_genes)}, nodes: {len(curli_nodes)}")

curli genes: 7, nodes: 7
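
Matching counts do not by themselves prove that every requested ID was found, so it can help to compare the IDs explicitly. The sketch below assumes each returned node is dict-like and carries a `biocyc_id` key; the actual return type of `get_nodes_by_attr` is not shown here, and `find_missing_ids` is a hypothetical helper, not part of lifelike_gds.

```python
def find_missing_ids(requested_ids, nodes, key="biocyc_id"):
    """Return the requested IDs that have no matching node, in input order.

    Assumption: each node is a dict-like record exposing `key`.
    """
    found = {node.get(key) for node in nodes}
    return [i for i in requested_ids if i not in found]


# Hypothetical records for illustration only
requested = ["G6547", "G6544", "EG11489"]
nodes = [
    {"biocyc_id": "G6547", "name": "csgB"},
    {"biocyc_id": "EG11489", "name": "csgA"},
]

missing = find_missing_ids(requested, nodes)
print(missing)
```

An empty list means every requested ID resolved to at least one node.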


### 3. Create a trace graph object and build a NetworkX graph from the ArangoDB data

In [13]:
tracegraph = ShortestPathTrace(Biocyc(database))
# set the output directory where the Excel and graph files will be written
tracegraph.datadir = output_dir
# initialize tracegraph by loading the graph data from ArangoDB;
# a NetworkX graph is created here
tracegraph.init_default_graph()

INFO:root:MultiDirectedGraph with 33428 nodes and 37886 edges
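
To illustrate what loading a network into memory can look like, here is a minimal sketch that builds a directed multigraph with plain NetworkX. The record layout (`id`, `source`, `target`, `type`) and the helper `build_graph` are assumptions made for illustration; how `init_default_graph()` actually queries ArangoDB and constructs its graph is not shown in this notebook.

```python
import networkx as nx

# Hypothetical node and edge records, standing in for database query results
node_records = [
    {"id": "n1", "name": "node A", "label": "Gene"},
    {"id": "n2", "name": "node B", "label": "Protein"},
    {"id": "n3", "name": "node C", "label": "Reaction"},
]
edge_records = [
    {"source": "n1", "target": "n2", "type": "REL_A"},
    {"source": "n2", "target": "n3", "type": "REL_B"},
    {"source": "n2", "target": "n3", "type": "REL_C"},  # parallel edge
]


def build_graph(node_records, edge_records):
    """Build a directed multigraph; parallel edges between a pair are kept."""
    G = nx.MultiDiGraph()
    for rec in node_records:
        attrs = {k: v for k, v in rec.items() if k != "id"}
        G.add_node(rec["id"], **attrs)
    for rec in edge_records:
        G.add_edge(rec["source"], rec["target"], type=rec["type"])
    return G


G = build_graph(node_records, edge_records)
print(G.number_of_nodes(), G.number_of_edges())
```

A multigraph is used here because two nodes can be linked by more than one relationship, and a simple `DiGraph` would collapse those into a single edge.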


### 4. Add shortest-path traces and generate the Sankey graph file

#### Generate shortest paths from purF to curli genes
- Make a copy of the clean graph to work on
- Set the source and target node sets
- Add shortest paths from the purF gene to the curli genes
- Generate the Sankey graph file. Make sure the file name ends with .graph so that Lifelike can display it properly.
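
The core of these steps can be sketched in plain NetworkX: collect all shortest paths from one source to each target, then keep only the nodes that lie on some path. This is an illustration under stated assumptions, not the lifelike_gds implementation; `shortest_path_subgraph` is a hypothetical helper, and the library may treat edge direction, filtering, or trace bookkeeping differently.

```python
import networkx as nx


def shortest_path_subgraph(G, source, targets):
    """Collect all shortest paths from source to each target and prune G to them."""
    paths_by_target = {}
    keep = set()
    for target in targets:
        try:
            paths = list(nx.all_shortest_paths(G, source, target))
        except nx.NetworkXNoPath:
            paths = []  # target is unreachable from source
        paths_by_target[target] = paths
        for path in paths:
            keep.update(path)
    # copy() detaches the subgraph view, so the original graph stays untouched
    return paths_by_target, G.subgraph(keep).copy()


# Small made-up graph for illustration
G = nx.MultiDiGraph()
G.add_edges_from([
    ("src", "a"), ("a", "t1"),
    ("src", "b"), ("b", "t1"),
    ("a", "t2"),
    ("x", "y"),     # not on any trace, so it is pruned
    ("t3", "src"),  # t3 cannot be reached from src in this direction
])

paths, sub = shortest_path_subgraph(G, "src", ["t1", "t2", "t3"])
total = sum(len(p) for p in paths.values())
print(f"{total} paths, nodes reduced from {G.number_of_nodes()} to {sub.number_of_nodes()}")
```

Pruning to the nodes on the traces is what keeps the exported graph small enough to display, even when the full network has tens of thousands of nodes.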

In [14]:
# work on a copy so that the original graph stays clean (without traces)
tracegraph.graph = tracegraph.orig_graph.copy()

# add graph description
tracegraph.add_graph_description(f'database: {db_version}\n')

# Set source node set
source_set = tracegraph.set_node_set_for_node(purF_node)

# Set target node set
target_set = 'curli_genes'
tracegraph.set_node_set_from_arango_nodes(curli_nodes, target_set, target_set)

# add shortest paths. 
# The 'sources_as_query' parameter defaults to True. For one-to-many traces, set it to False so that the traces to each target are displayed in a different color.
tracegraph.add_shortest_paths(source_set, target_set, sources_as_query=False)

# Export sankey file
graphfile = "Shortest_Paths_from_purF_to_curli_genes.graph"
tracegraph.write_to_sankey_file(graphfile)

INFO:root:add Shortest paths from purF to curli_genes: 26 paths
INFO:root:clean graph: number of graph nodes decreased from 33428 to 52
INFO:root:writing output/Shortest_Paths_from_purF_to_curli_genes
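
The optional Cytoscape export (step 5) is not run in this notebook. As a rough sketch of what such an export produces, the example below uses NetworkX's built-in `cytoscape_data` function instead of the lifelike_gds writer, whose method name is not shown here. The node names and the edge type are placeholders.

```python
import json

import networkx as nx

# Placeholder graph for illustration only
G = nx.MultiDiGraph()
G.add_node("n1", name="source gene")
G.add_node("n2", name="target gene")
G.add_edge("n1", "n2", type="placeholder_link")

# cytoscape_data returns a dict in Cytoscape.js JSON layout:
# {"elements": {"nodes": [...], "edges": [...]}, ...}
data = nx.cytoscape_data(G)
text = json.dumps(data, indent=2)

print(len(data["elements"]["nodes"]), "nodes,",
      len(data["elements"]["edges"]), "edges")
```

The resulting string can be written to a `.json` file and loaded into Cytoscape or Cytoscape.js.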
