# All Shortest Paths for curli use case

#### Library: NetworkX
NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics,
and functions of complex networks.  

#### Unweighted shortest paths
This demo finds all shortest paths for every pair of nodes from a group of source nodes (S) to a group of target nodes (T). The edges are unweighted, so path length is counted in hops.

For the source S and target T in the example below, the shortest paths have 3 hops (2 intermediate nodes); they include S->1->2->T and S->1->3->T.   
<img align='left' src="img/shortest_paths.png" width='500'> 
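
As a minimal sketch of this example in plain NetworkX, `nx.all_shortest_paths` returns every path that ties for the minimum hop count. The edge list is reconstructed from the two paths named above, plus one hypothetical longer detour added only to show that it is not returned; none of it is read from the database.

```python
import networkx as nx

# Toy directed graph for the S -> T example; node names are illustrative.
G = nx.DiGraph()
G.add_edges_from([
    ("S", "1"), ("1", "2"), ("1", "3"), ("2", "T"), ("3", "T"),
    # Hypothetical longer detour, included only to show it is excluded.
    ("S", "4"), ("4", "5"), ("5", "6"), ("6", "T"),
])

# all_shortest_paths yields every minimum-length path from source to target.
paths = sorted(nx.all_shortest_paths(G, source="S", target="T"))
for path in paths:
    print(" -> ".join(path), f"({len(path) - 1} hops)")
```

Both 3-hop paths are listed, while the 4-hop detour through nodes 4, 5 and 6 is left out.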

### Steps for creating shortest-path traces
1. Connect to the ArangoDB database
2. Get the input nodes (sources and targets)
3. Load the whole network graph from ArangoDB into memory and create a NetworkX graph
4. Add all shortest paths for each source-target pair and generate the Sankey graph file
5. Optionally, export the shortest-path traces to a Cytoscape JSON file

In [1]:
import os

In [2]:
from lifelike_gds.arango_network.biocyc import *
from lifelike_gds.arango_network.shortest_paths_trace import ShortestPathTrace
import pandas as pd
import networkx as nx

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
input_dir = 'input'
output_dir = 'output'
os.makedirs(output_dir, mode=0o777, exist_ok=True)
# GDS database name
db_name = 'ecocyc-25'
# GDS database version: free text used to describe the graph
db_version = 'ecocyc 25.5'

### 1. Connect to the ArangoDB database
For BioCyc databases (e.g. EcoCyc, HumanCyc), use the `BiocycDB` class.  
For the Reactome database, use the `ReactomeDB` class. 

In [5]:
# dbname is the name of the database on the running ArangoDB instance.
# It is read from the ARANGO_DATABASE environment variable and falls back to db_name if that is unset.
dbname = os.getenv('ARANGO_DATABASE', db_name)

database = BiocycDB(dbname)

### Set the data directories for input and output data
The input and output directories (`input_dir` and `output_dir`, set above) can point to any location in your file system.


### 2. Read input files and get source and target nodes from ArangoDB

In [6]:
# Curli phenotype 1 knockout genes
pheno1_file = 'curli_genes_pheno_1.csv'

df1 = pd.read_csv(os.path.join(input_dir, pheno1_file))
df1.head()

Unnamed: 0,name,biocyc_id
0,purA,EG10790
1,guaB,EG10421
2,purD,EG10792
3,purH,EG10795
4,purC,EG10791


In [7]:
pheno1_genes = df1['biocyc_id'].tolist()
pheno1_nodes = database.get_nodes_by_attr(pheno1_genes, 'biocyc_id')
print(f"Phenotype 1 genes: {len(pheno1_genes)}, nodes: {len(pheno1_nodes)}")

Phenotype 1 genes: 35, nodes: 35


In [8]:
# Curli genes (CSG genes)
csg_file = 'csg_genes.csv'

df2 = pd.read_csv(os.path.join(input_dir, csg_file))
curli_genes = df2['biocyc_id'].tolist()
curli_nodes = database.get_nodes_by_attr(curli_genes, 'biocyc_id')
print(f"curli genes: {len(curli_genes)}, nodes: {len(curli_nodes)}")

curli genes: 7, nodes: 7


### 3. Create a trace graph object and build a NetworkX graph from the ArangoDB data

In [9]:
tracegraph = ShortestPathTrace(Biocyc(database))
# set the output directory where the Excel and graph files will be written
tracegraph.datadir = output_dir
# initialize tracegraph by loading graph data from ArangoDB;
# a NetworkX graph is created here
tracegraph.init_default_graph()

INFO: MultiDirectedGraph with 33428 nodes and 37886 edges
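
For readers unfamiliar with how a document-store graph maps onto NetworkX, here is a rough sketch. It does not reproduce what `init_default_graph()` does internally. The node and edge records are made-up examples, shaped only on the ArangoDB convention that edge documents carry `_from` and `_to` fields; the collection name `n` and the relationship labels are hypothetical. A `MultiDiGraph` is used because it allows parallel directed edges between the same pair of nodes.

```python
import networkx as nx

# Hypothetical node and edge records; names and labels are illustrative
# and not taken from the database.
nodes = [
    {"_id": "n/1", "name": "geneA"},
    {"_id": "n/2", "name": "proteinA"},
    {"_id": "n/3", "name": "reactionX"},
]
edges = [
    {"_from": "n/1", "_to": "n/2", "label": "ENCODES"},
    {"_from": "n/2", "_to": "n/3", "label": "CATALYZES"},
    {"_from": "n/2", "_to": "n/3", "label": "REGULATES"},  # parallel edge
]

G = nx.MultiDiGraph()
for node in nodes:
    G.add_node(node["_id"], name=node["name"])
for edge in edges:
    G.add_edge(edge["_from"], edge["_to"], label=edge["label"])

print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
```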


### 4. Add shortest-path traces and generate the Sankey graph file

#### Generate shortest paths from phenotype 1 knockout genes to curli genes
- Make a copy of the clean graph to work on
- Set the source and target node sets
- Add shortest paths from phenotype 1 knockout genes to curli genes
- Generate the Sankey graph file. Make sure that the file name ends with .graph so that Lifelike can display it properly.
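
The pairwise search in these steps can be approximated in plain NetworkX as below. This is a sketch under assumptions, not the `ShortestPathTrace.add_shortest_paths` implementation: the helper name `all_pairs_shortest_paths` and the toy graph are hypothetical. Pairs with no directed path between them are collected separately instead of raising an error.

```python
import networkx as nx

def all_pairs_shortest_paths(graph, sources, targets):
    """Collect all shortest paths for every (source, target) pair.

    Returns (paths, unreachable): a dict keyed by pair, and a list of
    pairs that have no directed path between them.
    """
    paths, unreachable = {}, []
    for s in sources:
        for t in targets:
            if s == t:
                continue
            try:
                # sorted() consumes the generator inside the try block,
                # so NetworkXNoPath is caught here.
                paths[(s, t)] = sorted(nx.all_shortest_paths(graph, s, t))
            except nx.NetworkXNoPath:
                unreachable.append((s, t))
    return paths, unreachable

# Toy graph: T2 has no incoming edges, so no source can reach it.
G = nx.DiGraph([("S1", "a"), ("a", "T1"), ("S1", "b"), ("b", "T1"), ("S2", "a")])
G.add_node("T2")

paths, unreachable = all_pairs_shortest_paths(G, ["S1", "S2"], ["T1", "T2"])
for pair, found in paths.items():
    print(pair, found)
print("unreachable:", unreachable)
```

Keeping unreachable pairs in a list makes it easy to report them once per pair rather than once per search.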

In [10]:
# work on a copy of the graph so that the original graph stays clean (without traces)
tracegraph.graph = tracegraph.orig_graph.copy()

# add graph description
tracegraph.add_graph_description(f'database: {db_version}\n')

# Set source and target node sets
SOURCE_SET = 'phenotype1_genes'
TARGET_SET = 'curli_genes'
tracegraph.set_node_set_from_arango_nodes(pheno1_nodes, SOURCE_SET, SOURCE_SET)
tracegraph.set_node_set_from_arango_nodes(curli_nodes, TARGET_SET, TARGET_SET)

# add shortest paths from phenotype 1 genes to curli genes
tracegraph.add_shortest_paths(SOURCE_SET, TARGET_SET)

# Export sankey file
graphfile = "Shortest_Paths_from_phenotype1_genes_to_curli_genes.graph"
tracegraph.write_to_sankey_file(graphfile)

ERROR: Target 1606 cannot be reachedfrom given sources
ERROR: Target 1611 cannot be reachedfrom given sources
ERROR: Target 10734 cannot be reachedfrom given sources
ERROR: Target 1616 cannot be reachedfrom given sources
ERROR: Target 10480 cannot be reachedfrom given sources
ERROR: Target 16050 cannot be reachedfrom given sources
ERROR: Target 1528 cannot be reachedfrom given sources
...

#### Create a Cytoscape JSON file (optional)
The traces can also be exported in Cytoscape JSON format so that Cytoscape users can import the graph for viewing and analysis. The JSON file carries no style or layout information, so users need to set up the layout and styles themselves in the Cytoscape app.

Make sure that there is only __ONE trace__ in the trace graph. Otherwise, the traces will be mixed together. 

In [11]:
# flag each node as 'start' or 'end' so that users can use this property to highlight the start and end nodes
tracegraph.set_nodes_flag(SOURCE_SET, 'start')
tracegraph.set_nodes_flag(TARGET_SET, 'end')
# write network to json file
tracegraph.write_cytoscape_json('Shortest_paths_from_pheno1_genes_to_curli_genes.json')


INFO: clean graph: number of graph nodes decreased from 319 to 319
INFO: All traces already have their group defined.
INFO: writing output/Shortest_paths_from_pheno1_genes_to_curli_genes
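
For comparison, NetworkX itself can produce Cytoscape-style JSON through `nx.cytoscape_data`. The sketch below is not what `write_cytoscape_json` does internally; the toy graph and the attribute name `flag` are assumptions chosen for illustration. It shows how a start/end marker stored as a node attribute ends up in each node's `data` entry, where it can later be used for styling.

```python
import json
import networkx as nx

# Toy trace: one start node, one intermediate node, one end node.
G = nx.DiGraph([("S", "1"), ("1", "T")])

# Hypothetical 'flag' attribute marking the start and end nodes.
nx.set_node_attributes(G, {"S": "start", "T": "end"}, name="flag")

# Convert to a Cytoscape.js-style dict; node attributes land under 'data'.
cyjs = nx.cytoscape_data(G)
flags = {n["data"]["id"]: n["data"].get("flag") for n in cyjs["elements"]["nodes"]}

# Serialize to a JSON string (write it to a file to import it elsewhere).
text = json.dumps(cyjs, indent=2)
print(flags)
```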
