# Imodulon Radiate Analysis
#### Select imodulon from the given list

In [1]:
imodulons = ['ArgR', 'Histidine', 'Isoleucine', 'Leucine', 'MetJ', 'Tryptophan', 'Tyr']
selected_imodulon = 'Tyr'

### Definition:
Given a list of source nodes (e.g. genes or metabolites), find the nodes that are influenced by the source nodes and the nodes that influence the source nodes.

The algorithm we are using is called __PageRank__. 

The __PageRank__ algorithm measures the importance of each node within the graph, based on the number of incoming relationships and the importance of the corresponding source nodes.  

__Personalized PageRank__ is a variation of PageRank which is biased towards a set of source nodes.
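
To make the idea concrete, here is a minimal pure-python sketch of personalized pagerank by power iteration. It is an illustration only, not the implementation used by lifelike_gds. The function name `personalized_pagerank`, the toy graph, and the damping factor of 0.85 are assumptions chosen for this example.

```python
def personalized_pagerank(edges, sources, alpha=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank that always restarts at the source nodes.

    edges maps each node to the list of nodes it points to.
    (Hypothetical helper for illustration; not part of lifelike_gds.)
    """
    sources = set(sources)
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    # Restart distribution: uniform over the source nodes, zero elsewhere.
    restart = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    rank = dict(restart)

    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is sent back to the sources.
        dangling = sum(rank[n] for n in nodes if not edges.get(n))
        new_rank = {n: (1 - alpha + alpha * dangling) * restart[n] for n in nodes}
        # Every other node splits its rank evenly among its targets.
        for node, targets in edges.items():
            for target in targets:
                new_rank[target] += alpha * rank[node] / len(targets)
        if sum(abs(new_rank[n] - rank[n]) for n in nodes) < tol:
            return new_rank
        rank = new_rank
    return rank


# Toy graph: d -> a -> b -> c, with 'a' as the only source node.
toy_edges = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
scores = personalized_pagerank(toy_edges, ["a"])
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

In the toy graph, `d` points to the source `a` but cannot be reached from it, so it receives no score. `b` and `c` are downstream of `a`, and their scores shrink with distance from the source.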

<img src="img/personalized_pagerank.png" width="400" height="200" />

### Steps to run Radiate Analysis
1. Connect to the arango database
2. Find the input nodes (source nodes) in the arango database
3. Load the whole network graph from arango into memory and create a networkx graph. NetworkX is a python network library.
4. Perform radiate analysis
    - Run the personalized pagerank algorithm from the source nodes to get a pagerank value for each node that the source nodes can reach (forward direction). These are the nodes influenced by the source nodes.
    - Run personalized reverse pagerank from the source nodes to get a reverse pagerank value for each node that can reach the source nodes. These are the nodes that influence the source nodes.
    - Export the pagerank and reverse pagerank values to an __excel file__
5. The user analyzes the pagerank values (sorting, filtering, etc.) and selects the rows of interest
6. Create radiate traces for the selected nodes
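
The two directions in step 4 can be pictured with a small pure-python sketch. It assumes that the reverse analysis amounts to following edges backwards, i.e. running the same procedure on a graph with every edge flipped. That is an assumption about the approach, not a statement about the lifelike_gds internals. The helper names and the toy graph are made up for this example.

```python
from collections import deque


def reverse_edges(edges):
    """Return a copy of the graph with every edge direction flipped."""
    reversed_graph = {node: [] for node in edges}
    for node, targets in edges.items():
        for target in targets:
            reversed_graph.setdefault(target, []).append(node)
    return reversed_graph


def reachable(edges, sources):
    """Breadth-first search: nodes reachable from any source (sources excluded)."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen - set(sources)


# Toy graph: regulator -> gene -> protein -> reaction
toy_edges = {
    "regulator": ["gene"],
    "gene": ["protein"],
    "protein": ["reaction"],
    "reaction": [],
}
sources = ["gene"]

# Forward direction: nodes the sources can influence.
downstream = reachable(toy_edges, sources)
# Reverse direction: nodes that can influence the sources.
upstream = reachable(reverse_edges(toy_edges), sources)

print("downstream:", sorted(downstream))
print("upstream:", sorted(upstream))
```

Only nodes in `downstream` can receive a forward pagerank value, and only nodes in `upstream` can receive a reverse pagerank value.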

In [2]:
import os

In [3]:
from lifelike_gds.arango_network.biocyc import *
from lifelike_gds.arango_network.radiate_trace import RadiateTrace
from lifelike_gds.arango_network.shortest_paths_trace import ShortestPathTrace
from lifelike_gds.arango_network.trace_graph_utils import *
import pandas as pd
import networkx as nx
import os

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
input_dir = 'input'
output_dir = 'output'
os.makedirs(output_dir, mode=0o777, exist_ok=True)
# gds database name
db_name = 'ecocyc-25'
# gds database version, free text, that can be used to describe the graph
db_version = 'ecocyc 25.5'

## 1. Connect to arango database.
For BioCyc databases (e.g. EcoCyc, HumanCyc), use the class BiocycDB.  
For the Reactome database, use the class ReactomeDB. 

In [6]:
# dbname is the arango database name for the running arango instance.
# It is read from the ARANGO_DATABASE environment variable and falls back to db_name if the variable is not set.
dbname = os.getenv('ARANGO_DATABASE', db_name)

database = BiocycDB(dbname)
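
The fallback behaviour of `os.getenv` used above can be checked in isolation. The sketch below is only an illustration: `resolve_db_name` is a hypothetical helper and `my-graph` is a made-up database name.

```python
import os


def resolve_db_name(default_name, env_var="ARANGO_DATABASE"):
    """Return the value of env_var if it is set, otherwise default_name."""
    return os.getenv(env_var, default_name)


# Remember any real setting so the example leaves the environment unchanged.
previous = os.environ.pop("ARANGO_DATABASE", None)

# Variable not set: the notebook default is used.
name_without_env = resolve_db_name("ecocyc-25")

# Variable set: it overrides the default ('my-graph' is a made-up name).
os.environ["ARANGO_DATABASE"] = "my-graph"
name_with_env = resolve_db_name("ecocyc-25")

# Restore the original environment.
if previous is None:
    del os.environ["ARANGO_DATABASE"]
else:
    os.environ["ARANGO_DATABASE"] = previous

print(name_without_env, name_with_env)
```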

## 2. Find input nodes (source nodes) in arango database

#### Read selected imodulon genes, and find the gene nodes in arango database.

In [7]:
# get source genes 
source_file = f'{selected_imodulon}_gene_table.csv'
df = pd.read_csv(os.path.join(input_dir, source_file))
print(len(df))
df.head()

10


| Unnamed: 0 | locus | gene_weight | gene_name | gene_product | cog | operon | regulator | TyrR | link |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b2601 | 0.509933 | aroF | 3-deoxy-7-phosphoheptulonate synthase%2C Tyr-s... | Amino acid transport and metabolism | aroF-tyrA | RpoD,SoxR,TyrR,Nac | True | https://ecocyc.org/gene?orgid=ECOLI&id=EG10078 |
| 1 | b2600 | 0.433451 | tyrA | fused chorismate mutase/prephenate dehydrogenase | Amino acid transport and metabolism | aroF-tyrA | RpoD,SoxR,TyrR | True | https://ecocyc.org/gene?orgid=ECOLI&id=EG11039 |
| 2 | b0388 | 0.172402 | aroL | shikimate kinase 2 | Nucleotide transport and metabolism | aroL-yaiA-aroM | RpoD,TyrR,TrpR | True | https://ecocyc.org/gene?orgid=ECOLI&id=EG10082 |
| 3 | b0112 | 0.168767 | aroP | aromatic amino acid:H(+) symporter AroP | Intracellular trafficking, secretion, and vesi... | aroP | RpoD,Cra,TyrR,ArgR,Fnr,GlaR | True | https://ecocyc.org/gene?orgid=ECOLI&id=EG10084 |
| 4 | b1907 | 0.150986 | tyrP | tyrosine:H(+) symporter | Amino acid transport and metabolism | tyrP | RpoD,IHF,HU,TyrR,ppGpp,Lrp | True | https://ecocyc.org/gene?orgid=ECOLI&id=EG11041 |


In [8]:
source_genes = df['gene_name'].tolist()
source_nodes = database.get_nodes_by_attr(source_genes, 'name', 'Gene')
print(f"source genes: {len(source_genes)}, nodes: {len(source_nodes)}")

source genes: 10, nodes: 16


## 3. Load the whole network graph from arango to memory and create a networkx graph

Create a RadiateTrace instance.  
RadiateTrace is a subclass of TraceGraphNx. TraceGraphNx has a property __graph__, which is a networkx graph. Once the graph has been created from the data in the arango graph database, all algorithms and traces can be run with the python networkx library.

In [9]:
tracegraph = RadiateTrace(Biocyc(database))
# set the output directory where the excel and graph files will be written
tracegraph.datadir = output_dir
# initialize tracegraph by loading graph data from arango
tracegraph.init_default_graph()

INFO: MultiDirectedGraph with 33428 nodes and 37886 edges


## 4. Perform Radiate Analysis
Run personalized pagerank analysis and export values to excel file.

Pagerank analysis is performed on a networkx graph, which consists of a set of nodes and a set of edges. 

#### Set node set for source nodes
A node set is a list of node ids with a set name and description.

In [10]:
# node set name
SOURCE_SET = f'{selected_imodulon}_genes'
# node set description
source_desc = f'{selected_imodulon} genes'
# add the node set to graph
tracegraph.set_node_set_from_arango_nodes(source_nodes, name=SOURCE_SET, desc=source_desc)

#### Call export_pagerank_data
The method export_pagerank_data in RadiateTrace runs several steps to generate the excel file.

Parameters:
- sources: the name of the node set that contains the source nodes
- direction: __forward__ runs pagerank, __reverse__ runs reverse pagerank, and __both__ (the default) runs both forward and reverse pagerank
- num_nodes: the number of top-ranked pagerank or reverse pagerank nodes that are written to the excel file. The default is 2000.

In the exported excel file, there is also a column named nReach (or rev_nReach), indicating how many source nodes can be reached by the node in the row.  

The method writes an excel file with two tabs, one for pageranks and one for reverse pageranks.
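
The following pure-python sketch illustrates how an nReach-style count and a num_nodes cutoff could be combined. It assumes that, in the forward direction, nReach counts the source nodes connected to the node in a row. All helper names, the toy graph, and the score values are made up for this example and do not come from lifelike_gds.

```python
from collections import deque


def reachable_from(edges, start):
    """Breadth-first search: all nodes reachable from start (start excluded)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    seen.discard(start)
    return seen


def count_reaching_sources(edges, sources):
    """For every node, count how many source nodes have a path to it."""
    n_reach = {}
    for source in sources:
        for node in reachable_from(edges, source):
            n_reach[node] = n_reach.get(node, 0) + 1
    return n_reach


def top_rows(scores, n_reach, num_nodes):
    """Keep the num_nodes highest-scoring nodes as (node, score, nReach) rows."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:num_nodes]
    return [(node, scores[node], n_reach.get(node, 0)) for node in ranked]


# Toy graph with two source genes g1 and g2.
toy_edges = {"g1": ["p1"], "g2": ["p1", "p2"], "p1": ["r1"], "p2": [], "r1": []}
# Made-up pagerank values, used only to demonstrate the sorting and cutoff.
toy_scores = {"p1": 0.30, "r1": 0.20, "p2": 0.10}

n_reach = count_reaching_sources(toy_edges, ["g1", "g2"])
rows = top_rows(toy_scores, n_reach, num_nodes=2)
for row in rows:
    print(row)
```

A node with a high score and a high count is connected to many of the source nodes, which is often a useful filter when choosing rows in step 5.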

In [11]:
# get a copy of the original graph, including source node set
tracegraph.graph = tracegraph.orig_graph.copy()

outfile_name = f"Radiate_analysis_for_{SOURCE_SET}.xlsx"
tracegraph.export_pagerank_data(SOURCE_SET, outfile_name, direction='both', num_nodes=4000)

INFO: set pagerank and num reach for Tyr_genes
INFO: export top 4000 pagerank data into output/Radiate_analysis_for_Tyr_genes.xlsx


## 5. Analyze the pagerank output file (excel), and select interesting rows for further analysis

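Suggestion:   
Add a column 'select' to each tab and set it to 1 for the rows you want to trace. The next step reads the edited file from the input directory under the name Radiate_analysis_for_{SOURCE_SET}_select.xlsx.   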

## 6. Create radiate traces for the selected nodes

#### Read the manually selected top-ranked nodes from the previously generated pagerank excel file
We read the 'select' column to get the selected rows. The excel file has two tabs: the selected nodes for pageranks are read from the 'pageranks' tab, and the selected nodes for reverse pageranks are read from the 'reverse pageranks' tab.

In [12]:
rankfile = f"Radiate_analysis_for_{SOURCE_SET}_select.xlsx"
df_pagerank = pd.read_excel(os.path.join(input_dir, rankfile), sheet_name='pageranks', usecols=['eid', 'select'])
df_rev_pagerank = pd.read_excel(os.path.join(input_dir, rankfile), sheet_name='reverse pageranks', usecols=['eid', 'select'])

#### Get selected nodes for forward traces

In [13]:
df_select = df_pagerank[df_pagerank['select']==1]
selected = df_select['eid'].tolist()
selected_nodes = database.get_nodes_by_attr(selected, 'eid')
print(f"selected: {selected}. length {len(selected_nodes)}")

selected: ['COMPLETE-ARO-PWY', 'ALL-CHORISMATE-PWY', 'ANTHRANSYN-CPLX', 'TRANS-RXN-77', 'TYR', 'ANTHRANSYN-RXN', 'TYRP-MONOMER', 'AROP-MONOMER', 'TYRB-MONOMER', 'EG11093-MONOMER', 'AROL-MONOMER', 'AROF-MONOMER', 'CHORISMUTPREPHENDEHYDROG-MONOMER', 'ANTHRANSYNCOMPI-MONOMER', 'ANTHRANSYNCOMPII-MONOMER', 'EG12446-MONOMER', 'DAHPSYN-RXN', 'ARO-PWY', 'TYRB-DIMER', 'SHIKIMATE-KINASE-RXN', 'AROF-CPLX', 'CHORISMUTPREPHENDEHYDROG-CPLX']. length 22


#### Get selected nodes for reverse traces

In [14]:
df_rev_select = df_rev_pagerank[df_rev_pagerank['select']==1]
rev_selected = df_rev_select['eid'].tolist()
rev_selected_nodes = database.get_nodes_by_attr(rev_selected, 'eid')
print('rev_selected', rev_selected, 'len:', len(rev_selected))

rev_selected ['TU00067', 'TU00008', 'MONOMER0-162', 'CPLX-125', 'G7072-MONOMER', 'TU00087', 'PD00353', 'G7072', 'TU00011', 'TU0-42568'] len: 10


#### Get the original trace graph

In [15]:
# get a copy of the original graph, including source node set
tracegraph.graph = tracegraph.orig_graph.copy()

#### Set node sets for selected nodes

In [16]:
# set selected node set
SELECTED_SET = 'top_pagerank_nodes'
tracegraph.set_node_set_from_arango_nodes(selected_nodes, SELECTED_SET, 'selected top pagerank nodes')

# set rev_selected node set
REV_SELECTED_SET = 'top_rev_pagerank_nodes'
tracegraph.set_node_set_from_arango_nodes(rev_selected_nodes, REV_SELECTED_SET, 'selected top rev pagerank nodes')

#### Add traces and write to graph file for visualization
Add radiate traces using selected nodes

In [None]:
# set pagerank and reverse pagerank
pagerank_prop = 'pagerank'
tracegraph.set_pagerank(SOURCE_SET, pagerank_prop)
rev_pagerank_prop = 'rev_pagerank'
tracegraph.set_pagerank(SOURCE_SET, rev_pagerank_prop, reverse=True)

# add graph description
tracegraph.add_graph_description(f'Database: {db_version}\n')

# add one trace from the source genes to each selected node
tracegraph.add_traces_from_sources_to_each_selected_nodes(selected_nodes, SOURCE_SET, weighted_prop=pagerank_prop)

# add a single trace from the source genes to all selected nodes
tracegraph.add_trace_from_sources_to_all_selected_nodes(SELECTED_SET, SOURCE_SET, weighted_prop=pagerank_prop)

# add one trace from each reverse-selected node to the SOURCE_SET genes
tracegraph.add_traces_from_each_selected_nodes_to_targets(rev_selected_nodes,
                                                           SOURCE_SET, weighted_prop=rev_pagerank_prop)

# add a single trace from all reverse-selected nodes to the SOURCE_SET genes
tracegraph.add_trace_from_all_selected_nodes_to_targets(REV_SELECTED_SET, SOURCE_SET, weighted_prop=rev_pagerank_prop)

# write all traces into one graph file
graph_file = f'Radiate_traces_for_{SOURCE_SET}.graph'
tracegraph.write_to_sankey_file(graph_file)

INFO: Adding trace network Tyr_genes to superpathway of chorismate metabolism #1
ERROR: Target 27809 cannot be reachedfrom given sources
ERROR: Target 27809 cannot be reachedfrom given sources
ERROR: Target 27809 cannot be reachedfrom given sources
ERROR: Target 27809 cannot be reachedfrom given sources
ERROR: Target 27809 cannot be reachedfrom given sources
ERROR: Target 27809 cannot be reachedfrom given sources
INFO: Adding trace network Tyr_genes to ANTHRANSYN-CPLX #2
ERROR: Target 26863 cannot be reachedfrom given sources
ERROR: Target 26863 cannot be reachedfrom given sources
ERROR: Target 26863 cannot be reachedfrom given sources
ERROR: Target 26863 cannot be reachedfrom given sources
ERROR: Target 26863 cannot be reachedfrom given sources
ERROR: Target 26863 cannot be reachedfrom given sources
INFO: Adding trace network Tyr_genes to Anthranilate synthase-RXN #3
ERROR: Target 15649 cannot be reachedfrom given sources
ERROR: Target 15649 cannot be reachedfrom given sources
ERROR: 