# Intersection Analysis

#### Definition:
Given a list of source nodes and a list of target nodes, find the nodes that are potentially most important in connecting the sources to the targets. 

#### Approaches:
1. Run personalized pagerank using the source nodes. The pagerank value is interpreted as the probability that a node is influenced by the source nodes. 
2. Run personalized reverse pagerank using the target nodes. The reverse pagerank value is interpreted as the probability that a node influences the target nodes.
3. Use a probability formula to combine the two into an intersection pagerank, which represents the probability that a node is both influenced by the source nodes and influences the target nodes.
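
The three approaches can be illustrated with a small, dependency-free power iteration. This is a minimal sketch and not the lifelike_gds implementation: the toy edge list, the function name `personalized_pagerank`, the damping factor of 0.85, and especially the use of a plain product as the "probability formula" are assumptions made for illustration.

```python
def personalized_pagerank(edges, seeds, alpha=0.85, iterations=100):
    """Power iteration for PageRank that restarts only at the seed nodes."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    # Restart distribution: uniform over the seeds, zero everywhere else.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = alpha * rank[n] / len(out_links[n])
                for dst in out_links[n]:
                    new_rank[dst] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for s in nodes:
                    new_rank[s] += alpha * rank[n] * restart[s]
        rank = new_rank
    return rank


# Toy graph: "x" is a dead end off the source, "y" feeds the target from outside.
edges = [("s", "a"), ("a", "m"), ("m", "b"), ("b", "t"), ("s", "x"), ("y", "t")]

forward = personalized_pagerank(edges, {"s"})
# Reverse pagerank: same algorithm on the graph with every edge flipped.
backward = personalized_pagerank([(dst, src) for src, dst in edges], {"t"})
# Assumed combination: treat the two values as independent and multiply them.
intersection = {n: forward[n] * backward[n] for n in forward}
```

In this toy graph, `x` is reachable from the source but cannot reach the target, and `y` can reach the target but is not reachable from the source, so both end up with an intersection value of zero, while `a`, `m` and `b` keep a positive value.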

<img src="img/intersection.png" width="400" height="200" />


### Steps to run Intersection Analysis
1. Connect to arango database
2. Find input nodes (source nodes and target nodes) in arango database
3. Load the whole network graph from arango into memory and create a networkx graph. NetworkX is a Python network analysis library.
4. Perform intersection analysis
    - Run the personalized pagerank algorithm using the source nodes to get pagerank values for each node that the source nodes can reach (forward direction). These are the nodes influenced by the source nodes
    - Run personalized reverse pagerank using the target nodes to get the reverse pagerank values for each node that can reach the target nodes
    - Calculate the intersection pageranks based on source pageranks and target reverse pageranks
    - Export the pagerank values into __excel file__
5. The user analyzes the pagerank values (sorting, filtering, etc.) and selects the rows of interest
6. Create intersection traces for selected nodes
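
The forward and reverse runs in step 4 can also be sketched with plain NetworkX calls. This is an illustration under stated assumptions, not what lifelike_gds does internally: the toy graph and node names are made up, and multiplying the two values is an assumed stand-in for the intersection formula. Only `nx.pagerank` with its `personalization` argument and `DiGraph.reverse` are standard NetworkX API.

```python
import networkx as nx

# Toy directed graph (hypothetical node names).
graph = nx.DiGraph()
graph.add_edges_from(
    [("g1", "a"), ("g2", "a"), ("a", "m"), ("m", "b"), ("b", "t1"), ("y", "t1")]
)
sources = ["g1", "g2"]
targets = ["t1"]

# Forward run: the random walk restarts only at the source nodes.
forward = nx.pagerank(graph, personalization={n: 1 for n in sources})

# Reverse run: same algorithm on the edge-reversed graph, restarting at targets.
reverse = nx.pagerank(
    graph.reverse(copy=True), personalization={n: 1 for n in targets}
)

# Assumed combination of the two values (a simple product).
combined = {n: forward[n] * reverse[n] for n in graph}
ranked = sorted(combined, key=combined.get, reverse=True)
```

Nodes that the sources cannot reach, such as `y` here, receive a forward value of zero and therefore drop to the bottom of `ranked`.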

In [1]:
# install lifelike_gds package if not already installed (e.g. running in Google Colab)
import importlib.util  # find_spec lives in the importlib.util submodule

if importlib.util.find_spec('lifelike_gds') is None:
  !pip install git+https://github.com/SBRG/GDS-Public

# provide the path to the notebook folder in the github repository in case the notebook is run in Google Colab
github_path = 'SBRG/GDS-Public/main/notebooks/CfB_Workshop/curli'

In [2]:
import os
import pandas as pd
import warnings
from pathlib import PurePosixPath

In [3]:
from lifelike_gds.arango_network.biocyc import *
from lifelike_gds.arango_network.radiate_trace import RadiateTrace



In [4]:
warnings.filterwarnings('ignore')

In [5]:
input_dir = PurePosixPath('input')
output_dir = PurePosixPath('output')
os.makedirs(output_dir, mode=0o777, exist_ok=True)
# gds database name
db_name = 'ecocyc-25'
# gds database version, free text, that can be used to describe the graph
db_version = 'ecocyc 25.5'

## 1. Connect to arango database.
If you use a BioCyc database (e.g. EcoCyc, HumanCyc), use the class BiocycDB.  
If you use the Reactome database, use the class ReactomeDB. 

In [6]:
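# set the database name (the connection uri, username and password are not set in this cell).
# dbname is the arango database name for the running arango instance; it is read from the
# ARANGO_DATABASE environment variable and falls back to db_name defined above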
dbname = os.getenv('ARANGO_DATABASE', db_name)

database = BiocycDB(dbname)

## 2. Find input nodes (source and target nodes) in arango database

#### Read Curli phenotype 1 genes as sources, and Curli genes as targets

In [7]:
pheno1_file_path = input_dir / 'curli_genes_pheno_1.csv'
if os.path.isfile(pheno1_file_path):
  pheno1_file_ref = pheno1_file_path
else:
  # if the file does not exist locally, pull it from github
  pheno1_file_ref = f'https://raw.githubusercontent.com/{github_path}/{pheno1_file_path}'
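
The local-file-or-GitHub fallback above is repeated for every input file in this notebook. A small helper could factor it out. The sketch below is only a suggestion: `resolve_input` is a hypothetical name and is not part of lifelike_gds.

```python
import os
from pathlib import PurePosixPath


def resolve_input(relative_path, github_path):
    """Return the local path if the file exists, else its raw GitHub URL."""
    if os.path.isfile(relative_path):
        return str(relative_path)
    return f"https://raw.githubusercontent.com/{github_path}/{relative_path}"


# Example with a file name that is assumed not to exist locally.
example_ref = resolve_input(
    PurePosixPath("input") / "no_such_file_for_demo.csv",
    "SBRG/GDS-Public/main/notebooks/CfB_Workshop/curli",
)
```

Both return values can be passed to `pd.read_csv`, which accepts local paths as well as URLs.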

In [8]:
# Curli phenotype 1 genes
df1 = pd.read_csv(pheno1_file_ref)
pheno1_genes = [n for n in df1['biocyc_id']]
pheno1_nodes = database.get_nodes_by_attr(pheno1_genes, 'biocyc_id')
print(f"Phenotype 1 genes: {len(pheno1_genes)}, nodes: {len(pheno1_nodes)}")

Phenotype 1 genes: 35, nodes: 35


In [9]:
csg_file_path = input_dir / 'csg_genes.csv'
if os.path.isfile(csg_file_path):
  csg_file_ref = csg_file_path
else:
  # if the file does not exist locally, pull it from github
  csg_file_ref = f'https://raw.githubusercontent.com/{github_path}/{csg_file_path}'

In [10]:
# Curli genes (CSG genes)
df3 = pd.read_csv(csg_file_ref)
curli_genes = [n for n in df3['biocyc_id']]
curli_nodes = database.get_nodes_by_attr(curli_genes, 'biocyc_id')
print(f"curli genes: {len(curli_genes)}, nodes: {len(curli_nodes)}")

curli genes: 7, nodes: 7


## 3. Load the whole network graph from arango to memory and create a networkx graph

Create a RadiateTrace instance.  
RadiateTrace is a subclass of TraceGraphNx. TraceGraphNx has a property __graph__, which is a networkx graph. Once the graph has been built from the data in the arango graph database, all algorithms and traces run in memory using the Python networkx library.
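
To make the idea of an in-memory networkx graph concrete, the sketch below builds a small `nx.MultiDiGraph` from node and edge records. It is not how `init_default_graph()` works internally, which this notebook does not show. The records, ids and labels are hypothetical; only the `_id`, `_from` and `_to` field names follow the usual ArangoDB document layout.

```python
import networkx as nx

# Hypothetical records standing in for documents fetched from arango.
node_docs = [
    {"_id": "nodes/1", "name": "gene A"},
    {"_id": "nodes/2", "name": "protein A"},
    {"_id": "nodes/3", "name": "reaction R"},
]
edge_docs = [
    {"_from": "nodes/1", "_to": "nodes/2", "label": "ENCODES"},
    {"_from": "nodes/2", "_to": "nodes/3", "label": "CATALYZES"},
    {"_from": "nodes/2", "_to": "nodes/3", "label": "CONSUMED_BY"},
]

# A MultiDiGraph keeps direction and allows parallel edges between two nodes.
graph = nx.MultiDiGraph()
for doc in node_docs:
    graph.add_node(doc["_id"], name=doc["name"])
for doc in edge_docs:
    graph.add_edge(doc["_from"], doc["_to"], label=doc["label"])
```

A multigraph is used here because two nodes can be linked by more than one relationship, as with the two edges from `nodes/2` to `nodes/3`.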

In [11]:
tracegraph = RadiateTrace(Biocyc(database))
# set the output directory where the excel and graph files will be written
tracegraph.datadir = output_dir
# initiate tracegraph by loading graph data from arango
tracegraph.init_default_graph()

INFO: MultiDirectedGraph with 33428 nodes and 37886 edges


## 4. Perform intersection analysis
Run the intersection pagerank analysis and export the values to an excel file.

The pagerank analysis is performed on the networkx graph, which consists of a set of nodes and a set of edges. 

#### Set node sets for sources and targets.  
A node set is a list of node ids with a name and description.
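
As a mental model, a node set can be pictured as a small named record. The dataclass below is a hypothetical illustration of that idea, with made-up node ids; it is not the structure lifelike_gds uses internally.

```python
from dataclasses import dataclass, field


@dataclass
class NodeSet:
    """Hypothetical container: a named, described list of node ids."""
    name: str
    description: str
    node_ids: list = field(default_factory=list)


# Registry keyed by set name, so later steps can refer to a set by its name.
node_sets = {}
example_set = NodeSet("pheno1_genes", "phenotype_1 genes", [101, 102, 103])
node_sets[example_set.name] = example_set
```

Referring to sets by name is what allows later calls to take `SOURCE_SET` and `TARGET_SET` as plain strings.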

We set one source node set and one target node set, then perform the intersection analysis from the source node set to the target node set.

In [12]:
# Set source and target node sets
SOURCE_SET = 'pheno1_genes'
TARGET_SET = 'curli_genes'

tracegraph.set_node_set_from_arango_nodes(pheno1_nodes, SOURCE_SET, 'phenotype_1 genes')
tracegraph.set_node_set_from_arango_nodes(curli_nodes, TARGET_SET, 'curli genes')

#### Call export_intersection_pageranks
The method export_intersection_pageranks() performs the following steps:
1. Run personalized pagerank using source nodes
2. Run personalized reverse pagerank using target nodes
3. Calculate intersection pagerank based on source pagerank and target reverse pagerank
4. Write values into excel file
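
Steps 3 and 4 amount to combining two score tables, ranking the result, and writing it out. The sketch below shows that shape with the standard library only. Several things are assumptions: the product used as the intersection value, the reading of `num_nodes` as "keep the top N rows", the column names, and CSV output instead of excel. The helper name `rank_rows` is hypothetical.

```python
import csv
import io


def rank_rows(forward, reverse, num_nodes):
    """Combine two score dicts and keep the top rows by intersection value."""
    rows = []
    for node, pagerank in forward.items():
        rev_pagerank = reverse.get(node, 0.0)
        rows.append({
            "node": node,
            "pagerank": pagerank,
            "rev_pagerank": rev_pagerank,
            # Assumed formula: product of the two values.
            "intersection": pagerank * rev_pagerank,
        })
    rows.sort(key=lambda row: row["intersection"], reverse=True)
    return rows[:num_nodes]


# Made-up scores for three nodes.
top_rows = rank_rows(
    forward={"a": 0.3, "m": 0.2, "x": 0.1},
    reverse={"a": 0.1, "m": 0.4},
    num_nodes=2,
)

# Write the ranked rows to an in-memory CSV buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(top_rows[0]))
writer.writeheader()
writer.writerows(top_rows)
csv_text = buffer.getvalue()
```

Sorting by the combined value before truncating means a node with a high score in only one direction, such as `x`, does not crowd out nodes that score in both.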

In [13]:
# reset the working graph to a clean copy of the original graph
tracegraph.graph = tracegraph.orig_graph.copy()

filename = f"Intersection_analysis_for_{SOURCE_SET}_and_{TARGET_SET}.xlsx"
tracegraph.export_intersection_pageranks(filename, SOURCE_SET, TARGET_SET, num_nodes=3000)

INFO: set pagerank and num reach for pheno1_genes
INFO: set pagerank and num reach for curli_genes


export intersection pagerank to file  output/Intersection_analysis_for_pheno1_genes_and_curli_genes.xlsx


## 5. Analyze the pagerank output file (excel), and select interesting rows for further analysis

#### Suggestion:   
Add a column named 'select', set it to 1 for each top pagerank row you want to keep, then save the file 

## 6. Create intersection traces for the selected rows

#### Read manually selected top ranked nodes from the previously generated pagerank excel file
We will read the column 'select' to get the selected rows (for intersection pagerank selection)

In [14]:
intersect_pagerank_select_file_path = input_dir / f"Intersection_analysis_for_{SOURCE_SET}_and_{TARGET_SET}_select.xlsx"
if os.path.isfile(intersect_pagerank_select_file_path):
  intersect_pagerank_select_file_ref = intersect_pagerank_select_file_path
else:
  # if the file does not exist locally, pull it from github
  intersect_pagerank_select_file_ref = f'https://raw.githubusercontent.com/{github_path}/{intersect_pagerank_select_file_path}'

In [15]:
df = pd.read_excel(intersect_pagerank_select_file_ref, usecols=['eid', 'select'])
df = df[df['select']==1]
df

| | eid | select |
|---|---|---|
| 0 | CPLX0-226 | 1.0 |
| 1 | CAMP | 1.0 |
| 18 | EG30063 | 1.0 |


In [16]:
# convert the column to a plain list (avoids shadowing the built-in id)
selected_eids = df['eid'].tolist()
selected_nodes = database.get_nodes_by_attr(selected_eids, 'eid')

#### Run pageranks again using a clean copy of the original graph

In [17]:
tracegraph.graph = tracegraph.orig_graph.copy()

pr = 'pagerank'
rev_pr = 'rev_pagerank'
tracegraph.set_pagerank(SOURCE_SET, pagerank_prop=pr)
tracegraph.set_pagerank(TARGET_SET, pagerank_prop=rev_pr, reverse=True)

#### Add traces from sources to the intersection node, and intersection node to targets
We will add traces from the source nodes to each selected intersection node, and traces from each selected intersection node to the targets.
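
The idea of tracing through an intersection node can be sketched with a breadth-first search. This assumes, for illustration only, that a trace can be approximated by one shortest path; the actual trace construction in lifelike_gds may differ. The toy edges and the function name `shortest_path` are hypothetical.

```python
from collections import deque


def shortest_path(edges, starts, goal):
    """Breadth-first search from any start node to goal; None if unreachable."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    parent = {start: None for start in starts}
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


edges = [("g1", "a"), ("a", "hub"), ("g2", "hub"),
         ("hub", "b"), ("b", "t1"), ("g3", "z")]

into_hub = shortest_path(edges, ["g1", "g2"], "hub")   # sources to selected node
out_of_hub = shortest_path(edges, ["hub"], "t1")       # selected node to target
unreachable = shortest_path(edges, ["g3"], "hub")      # no path exists
```

When a source has no directed path to the selected node, as with `g3` here, the search returns `None`, and no trace can be drawn for that pair.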

In [18]:
# write traces in one file
tracegraph.add_traces_from_sources_to_each_selected_nodes(selected_nodes, SOURCE_SET, weighted_prop=pr)
tracegraph.add_traces_from_each_selected_nodes_to_targets(selected_nodes, TARGET_SET, weighted_prop=rev_pr)
tracegraph.write_to_sankey_file(f"Intersection_traces_from_{SOURCE_SET}_to_{TARGET_SET}.graph")

INFO: Adding trace network pheno1_genes to cyclic-AMP #1
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
ERROR: Target 32691 cannot be reachedfrom given sources
INFO: Adding trace network pheno1_genes to CRP-