# Radiate Analysis with weighted starting nodes 
#### Strain crash RNA-Seq data use case

Radiate analysis is based on personalized pagerank. In regular radiate analysis, every source node has a weight of 1. In weighted radiate analysis, each source node is given its own weight. The example below uses RNA-Seq log2(fold change) values as the starting weights for personalized pagerank. Raw fold changes can be used instead; the differences between weights will be more dramatic, but the top-ranked nodes will be similar.
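To make the difference between the two weighting choices concrete, the sketch below normalizes both sets of weights so that they sum to 1, which is also what NetworkX's pagerank does with a personalization dictionary. The five fold changes are the first rows of the input file loaded later in this notebook; the helper name `normalize` is only illustrative.

```python
import math

# fold_change_abs of the first five genes in the input file (gene name -> value)
fold_changes = {
    "ykgO": 803.056686,
    "ykgM": 212.209128,
    "zinT": 200.048421,
    "ykgR": 34.310786,
    "znuA": 14.640203,
}

def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}

# share of the starting weight each gene receives under the two schemes
raw_share = normalize(fold_changes)
log2_share = normalize({g: math.log2(fc) for g, fc in fold_changes.items()})

for gene in fold_changes:
    print(f"{gene}: raw={raw_share[gene]:.3f}  log2={log2_share[gene]:.3f}")
```

With raw fold changes, ykgO alone takes well over half of the starting weight among these five genes; with log2 values the weights are spread much more evenly.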

### Steps to run Radiate Analysis
1. Connect to arango database
2. Find input nodes (source nodes) in arango database
3. Load the whole network graph from arango into memory and create a networkx graph. NetworkX is a Python network library.
4. Perform radiate analysis
    - Run the personalized pagerank algorithm using source nodes __with RNA-Seq data as starting weight__ to get pagerank values for each node that the source nodes can reach (forward direction). Those are the nodes influenced by the source nodes.
    - Run personalized reverse pagerank using source nodes __with RNA-Seq data as starting weight__ to get the reverse pagerank values for each node that can reach the source nodes.
    - Export the pagerank and reverse pagerank values to an __Excel file__.
5. The user analyzes the pagerank values (sorting, filtering, etc.) and selects the rows of interest.
6. Create radiate traces for selected nodes
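
Step 4 can be illustrated with a small pure-Python sketch of personalized pagerank on a toy graph. This is not the lifelike_gds implementation: the function `personalized_pagerank`, the node names, and the weights are hypothetical, and the sketch assumes that reverse pagerank means running the same algorithm on the graph with all edges flipped. In practice NetworkX's `pagerank` accepts the weights through its `personalization` argument.

```python
def personalized_pagerank(edges, personalization, alpha=0.85, iterations=100):
    """Power-iteration personalized pagerank on a directed edge list.

    Teleport jumps, and the rank held by nodes without outgoing edges,
    are redistributed according to the normalized personalization weights.
    """
    nodes = sorted({n for edge in edges for n in edge} | set(personalization))
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    total = sum(personalization.values())
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - alpha + alpha * dangling) * teleport[n] for n in nodes}
        for n in nodes:
            for dst in out_links[n]:
                new_rank[dst] += alpha * rank[n] / len(out_links[n])
        rank = new_rank
    return rank

# hypothetical toy network: two source genes, their products, and a regulator
edges = [
    ("geneA", "protA"),
    ("geneB", "protB"),
    ("protA", "complex"),
    ("protB", "complex"),
    ("regulator", "geneA"),
]
weights = {"geneA": 3.0, "geneB": 1.0}  # made-up starting weights

forward = personalized_pagerank(edges, weights)
reverse = personalized_pagerank([(dst, src) for src, dst in edges], weights)
```

In the forward run the heavier source pushes more rank to its downstream node (protA outranks protB), while the regulator only receives rank in the reverse run, because it can reach a source but cannot be reached from one.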

In [1]:
import os
import sys
root = os.getcwd().split('/notebooks/')[0]
sys.path.append(os.path.join(root, 'src'))

In [2]:
from lifelike_gds.arango_network.biocyc import *
from lifelike_gds.arango_network.radiate_trace import RadiateTrace
import pandas as pd

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
input_dir = 'input'
output_dir = 'output'
# gds database name
db_name = 'ecocyc-secondaries'
# gds database version, free text, that can be used to describe the graph
db_version = 'ecocyc 25.5'

## 1. Connect to arango database.
For BioCyc databases (e.g. EcoCyc, HumanCyc), use the class BiocycDB.  
For the Reactome database, use the class ReactomeDB.

In [5]:
# set database uri, username and password. 
# dbname is the name of the database in the running arango instance. It is read from the ARANGO_DATABASE environment variable and defaults to db_name
dbname = os.getenv('ARANGO_DATABASE', db_name)

database = BiocycDB(dbname)

## 2. Find input nodes (source nodes) in arango database

#### Read input files to get all updown genes from strain crash as sources

In [6]:
file = "crash_updown_genes_rnaseq_foldchange.csv"
df = pd.read_csv(os.path.join(input_dir, file))
df.head()

| Unnamed: 0 | gene_name | biocyc_id | fold_change | fold_change_abs | log2(fold_change_abs) | comment |
|---|---|---|---|---|---|---|
| 0 | ykgO | G0-10434 | 803.056686 | 803.056686 | 9.649358 | |
| 1 | ykgM | G6167 | 212.209128 | 212.209128 | 7.729343 | |
| 2 | zinT | G7061 | 200.048421 | 200.048421 | 7.644205 | |
| 3 | ykgR | G0-10654 | 34.310786 | 34.310786 | 5.10059 | |
| 4 | znuA | G7017 | 14.640203 | 14.640203 | 3.871864 | |


#### Find list of source nodes in arango

In [7]:
updown_genes = df['biocyc_id'].tolist()
gene_nodes = database.get_nodes_by_attr(updown_genes, 'biocyc_id')
print(len(updown_genes), len(gene_nodes))

246 246
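
In this run both counts match, so every gene in the input file was found. If the counts ever differ, a small helper like the one below can list the ids that had no matching node. It is a sketch that assumes the returned nodes are dict-like documents with a `biocyc_id` field, as the later cells suggest; the function name and the sample documents are hypothetical.

```python
def find_missing_ids(requested_ids, nodes, attr="biocyc_id"):
    """Return the requested ids that have no matching node, in input order."""
    found = {node.get(attr) for node in nodes}
    return [rid for rid in requested_ids if rid not in found]

# hypothetical stand-ins for the documents returned by the database
requested = ["G0-10434", "G6167", "G7061"]
nodes = [
    {"_key": "1", "biocyc_id": "G0-10434"},
    {"_key": "2", "biocyc_id": "G7061"},
]

missing = find_missing_ids(requested, nodes)
print(missing)
```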


#### Get weight for gene nodes as a dictionary
You can use either 'fold_change_abs' or 'log2(fold_change_abs)' as weight

In [8]:
# gene_fc = {row['biocyc_id']:row['fold_change_abs'] for index, row in df.iterrows()}
gene_weights = {row['biocyc_id']:row['log2(fold_change_abs)'] for index, row in df.iterrows()}
print('gene_weight', len(gene_weights))

# get node id and biocyc_id map
node_map = { int(n['_key']): n.get('biocyc_id') for n in gene_nodes }

# get node weight
node_weights = {node_id: gene_weights[biocyc_id] for node_id, biocyc_id in node_map.items()}
print(len(node_weights))
#node_weights

gene_weight 246
246


{14989: 1.832387311,
 14990: 1.642888302,
 14991: 1.154251809,
 14995: 1.559260355,
 15010: 1.524939481,
 15029: 1.20547639,
 15038: 1.380688764,
 15058: 2.594444942,
 15065: 1.906823024,
 15074: 1.584778497,
 15080: 1.672291056,
 15118: 2.558594144,
 15124: 3.239788966,
 15153: 1.550834967,
 15156: 1.335914232,
 15158: 2.374190286,
 15165: 1.191258118,
 15198: 1.388867254,
 15232: 1.256120913,
 15237: 2.35943519,
 15247: 1.388933083,
 15262: 1.126126236,
 15267: 2.255137009,
 15277: 1.3860626,
 15334: 2.968035425,
 15335: 2.454948459,
 15371: 2.299349487,
 15379: 1.312213367,
 15388: 1.115313088,
 15392: 1.209726533,
 15401: 3.25478778,
 15406: 1.966218109,
 15427: 1.603485108,
 15428: 1.402814288,
 15461: 1.76878051,
 15470: 1.211878807,
 15502: 2.851398878,
 15506: 1.219294755,
 15538: 1.215075993,
 15545: 2.313006797,
 15567: 1.090039696,
 15592: 1.261727335,
 15627: 1.312978579,
 15633: 1.076648301,
 15649: 1.380615274,
 15670: 1.153476028,
 15711: 2.043819048,
 15713: 1.136552216
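
Because the weights are used as a starting distribution, it is worth checking them before running pagerank. log2(fold_change_abs) is zero when the absolute fold change is 1 and negative when it is below 1, and missing values in the input file become NaN. The sketch below separates usable weights from problematic ones; the function name and the three edge-case entries are made up for illustration.

```python
import math

def check_weights(weights):
    """Split weights into usable (finite and > 0) and problematic entries."""
    usable, problems = {}, {}
    for node_id, weight in weights.items():
        if isinstance(weight, (int, float)) and math.isfinite(weight) and weight > 0:
            usable[node_id] = weight
        else:
            problems[node_id] = weight
    return usable, problems

# the first two entries come from the output above; the last three are made-up edge cases
example = {14989: 1.832387311, 14990: 1.642888302, 1: 0.0, 2: -0.5, 3: float("nan")}
usable, problems = check_weights(example)
print(len(usable), sorted(problems))
```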

## 3. Load the whole network graph from arango to memory and create a networkx graph

Create a RadiateTrace instance.  
RadiateTrace is a subclass of TraceGraphNx. TraceGraphNx has a property __graph__, which is a networkx graph. Once the graph has been built from the data in the arango graph database, all algorithms and traces run in memory using the Python networkx library.

In [9]:
tracegraph = RadiateTrace(Biocyc(database))
# set the output directory where the excel and graph files will be written
tracegraph.datadir = output_dir
# initiate tracegraph by loading graph data from arango
tracegraph.init_default_graph()

INFO: MultiDirectedGraph with 33428 nodes and 37417 edges


## 4. Perform Weighted Radiate Analysis
Run personalized pagerank analysis and export the values to an Excel file. __Remember to pass the source node_weights dict as the parameter 'sources_personalization'.__

Pagerank analysis is performed on the networkx graph, which consists of a set of nodes and a set of edges.

#### Set node set for source nodes
A node set is a list of node ids with a set name and description.

In [10]:
# node set name
SOURCE_SET = 'updown_genes'
# node set description
source_desc = 'Crash updown genes'
# add the node set to graph
tracegraph.set_node_set_from_arango_nodes(gene_nodes, name=SOURCE_SET, desc=source_desc)

#### Call export_pagerank_data
The method export_pagerank_data in RadiateTrace runs several steps to generate the Excel file. The output file name is passed as the second argument.  
Parameters:
- sources: the node set name for the source nodes
- __sources_personalization__: the dictionary of source node weights; default is None
- __exclude_sources__: default is True. The example below sets it to False so that the initial weights of the starting nodes are visible.
- direction: default is __both__. If forward, run pagerank; if reverse, run reverse pagerank; if both, run both forward and reverse pagerank.
- num_nodes: the number of top pagerank or reverse pagerank nodes written to the Excel file. The default is 2000.

The exported Excel file also contains a column named nReach (or rev_nReach), indicating how many source nodes can be reached by the node in the row.

The method writes an Excel file with two tabs, one for pageranks and one for reverse pageranks.
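
The reach counts can be pictured with a breadth-first search from each source. The sketch below is one plausible reading, not the library's definition: it counts, for each node, the sources with a directed path to it, and obtains the reverse count by running the same function on the flipped edges. The function name and the toy graph are hypothetical.

```python
from collections import deque

def count_reaching_sources(edges, sources):
    """For each node, count the sources that have a directed path to it."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    counts = {}
    for source in sources:
        # breadth-first search over everything this source can reach
        seen = {source}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in successors.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        for node in seen - {source}:
            counts[node] = counts.get(node, 0) + 1
    return counts

# hypothetical toy network
edges = [
    ("geneA", "protA"),
    ("geneB", "protB"),
    ("protA", "complex"),
    ("protB", "complex"),
    ("regulator", "geneA"),
]
sources = ["geneA", "geneB"]

n_reach = count_reaching_sources(edges, sources)
rev_n_reach = count_reaching_sources([(dst, src) for src, dst in edges], sources)
```

Here the complex is downstream of both sources, so its forward count is 2, while the regulator appears only in the reverse counts.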

In [11]:
tracegraph.graph = tracegraph.orig_graph.copy()

outfile_name = f"RNA_Seq_log2_weighted_radiate_analysis_for_{SOURCE_SET}.xlsx"
tracegraph.export_pagerank_data(SOURCE_SET, outfile_name, 
                                sources_personalization=node_weights,  
                                exclude_sources=False,
                                direction='both', num_nodes=4000)

INFO: set pagerank and num reach for updown_genes
INFO: export top 4000 pagerank data into output/RNA_Seq_log2_weighted_radiate_analysis_for_updown_genes.xlsx


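## 6. Create Traces
After reviewing the exported Excel file (step 5), mark the rows of interest with 1 in the select column and save the file with the suffix _select.xlsx in the input directory. The cells below read the selections from that file.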

In [12]:
tracegraph.graph = tracegraph.orig_graph.copy()

In [13]:
rankfile = f"RNA_Seq_log2_weighted_radiate_analysis_for_{SOURCE_SET}_select.xlsx"
df_pagerank = pd.read_excel(os.path.join(input_dir, rankfile), sheet_name='pageranks', usecols=['eid', 'select'])
df_rev_pagerank = pd.read_excel(os.path.join(input_dir, rankfile), sheet_name='reverse pageranks', usecols=['eid', 'select'])

In [14]:
df_select = df_pagerank[df_pagerank['select']==1]
selected = df_select['eid'].tolist()
selected_nodes = database.get_nodes_by_attr(selected, 'eid')
print(f"selected: {selected}. length {len(selected_nodes)}")

selected: ['CPLX0-7452', 'RXN-8638', 'ABC-26-CPLX', 'ZN+2']. length 4


In [15]:
df_rev_select = df_rev_pagerank[df_rev_pagerank['select']==1]
rev_selected = df_rev_select['eid'].tolist()
rev_selected_nodes = database.get_nodes_by_attr(rev_selected, 'eid')
print('rev_selected', rev_selected)

rev_selected ['G7072', 'CPLX0-7680', 'ZN+2']
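
A trace can only be built for a selected node that is actually connected to the source set; otherwise the trace step logs that the target cannot be reached. The sketch below shows the underlying idea with a multi-source breadth-first search that returns one shortest path or None. It is an illustration under assumed names, not the algorithm RadiateTrace uses; the function and the toy graph are hypothetical.

```python
from collections import deque

def shortest_path_from_sources(edges, sources, target):
    """Return one shortest directed path from any source to target, or None."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    # parent pointers double as the visited set; sources have no parent
    parent = {source: None for source in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in successors.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# hypothetical toy network
edges = [
    ("geneA", "protA"),
    ("geneB", "protB"),
    ("protA", "complex"),
    ("protB", "complex"),
    ("regulator", "geneA"),
]
sources = ["geneA", "geneB"]

path_to_complex = shortest_path_from_sources(edges, sources, "complex")
path_to_regulator = shortest_path_from_sources(edges, sources, "regulator")
print(path_to_complex, path_to_regulator)
```

For reverse traces the same check applies with the edges flipped, since those traces run from the selected node toward the sources.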


In [16]:
# set selected node set
SELECTED_SET = 'top_pagerank_nodes'
tracegraph.set_node_set_from_arango_nodes(selected_nodes, SELECTED_SET, 'selected top pagerank nodes')

# set rev_selected node set
REV_SELECTED_SET = 'top_rev_pagerank_nodes'
tracegraph.set_node_set_from_arango_nodes(rev_selected_nodes, REV_SELECTED_SET, 'selected top rev pagerank nodes')

In [17]:
# set pagerank and reverse pagerank
pagerank_prop = 'pagerank'
tracegraph.set_pagerank(SOURCE_SET, pagerank_prop, personalization=node_weights)
rev_pagerank_prop = 'rev_pagerank'
tracegraph.set_pagerank(SOURCE_SET, rev_pagerank_prop, reverse=True, personalization=node_weights)

# add graph description
tracegraph.add_graph_description(f'Database: {db_version}\n')

# add traces from source genes to each selected node
tracegraph.add_traces_from_sources_to_each_selected_nodes(selected_nodes, SOURCE_SET, weighted_prop=pagerank_prop)

# add traces from source genes to all selected nodes
tracegraph.add_trace_from_sources_to_all_selected_nodes(SELECTED_SET, SOURCE_SET, weighted_prop=pagerank_prop)

# add traces from each reverse-selected node to SOURCE_SET genes
tracegraph.add_traces_from_each_selected_nodes_to_targets(rev_selected_nodes,
                                                           SOURCE_SET, weighted_prop=rev_pagerank_prop)

# add traces from all reverse-selected nodes to SOURCE_SET
tracegraph.add_trace_from_all_selected_nodes_to_targets(REV_SELECTED_SET, SOURCE_SET, weighted_prop=rev_pagerank_prop)

# write all traces into one graph file
graph_file = f'RNA_Seq_log2_weighted_Radiate_traces_for_{SOURCE_SET}.graph'
tracegraph.write_to_sankey_file(graph_file)

INFO: Adding trace network updown_genes to glycine betaine ABC transporter #1
ERROR: Target 31788 cannot be reachedfrom given sources
ERROR: Target 31788 cannot