DRaWR - Discriminative Random Walk with Restart

Charles Blatti and Saurabh Sinha

KnowEnG BD2K Center of Excellence

University of Illinois Urbana-Champaign

Motivation

Analysis of co-expressed gene sets typically involves testing for enrichment of different “properties” such as biological processes, pathways, transcription factor binding, etc., one property at a time. This approach ignores any known relationships among the properties or genes themselves. Previous work has sought to exploit these relationships by building biological networks that combine multiple types of gene-gene or gene-property relationships, and performing network analysis to identify other genes and properties most relevant to a given gene set. However, these existing network-based method often collapse information about individual property annotations to create simplified, homogeneous networks.

We present DRaWR, a network-based method for ranking genes or properties related to a given gene set. Such related genes or properties are identified from among the nodes of a large, heterogeneous network of biological information. Our method involves a random walk with restarts, performed on an initial network with multiple node and edge types, preserving more of the original, specific property information than current methods that operate on homogeneous networks. In this first stage of our algorithm, we find the properties that are the most relevant to the given gene set and extract a subnetwork of the original network, comprising only the relevant properties. We then rerank genes by their similarity to the given gene set, based on a second random walk with restarts, performed on the above subnetwork.

Installation

Clone this repository

git clone https://github.com/cblatti3/DRaWR.git

Enter the repository and install packages into a local library

cd ./DRaWR/
mkdir library
R CMD INSTALL -l ./library packages/Matrix_1.2-2.tar.gz \
    packages/KernSmooth_2.23-15.tar.gz \
    packages/bitops_1.0-6.tar.gz \
    packages/caTools_1.17.1.tar.gz \
    packages/gtools_3.5.0.tar.gz \
    packages/gdata_2.17.0.tar.gz \
    packages/gplots_2.17.0.tar.gz \
    packages/ROCR_1.0-7.tar.gz
R CMD INSTALL -l ./library packages/DRaWR/

To use the 5 species networks

gunzip networks/5ins_cdhmw.names.edge.gz
gunzip networks/5sp_adhiw.names.edge.gz

Arguments

Method Arguments

relating to input files

possetfile (string): location of file containing relative location of gene sets to test.
unifile (string): location of file listing gene universe.
networkfile (string): location of file containing network contents.

relating to output files

outdir (string): prefix of location of file to write performance results (optionally prediction results).
writepreds (boolean): write predictions out to a file. Default is FALSE.
nfolds (int): number of folds for cross validation, Default is 1, no cross-validation.

relating to network processing

undirected (bool): boolean to make network undirected. Default is TRUE.
unweighted (bool): boolean to make network unweighted. Default is FALSE.
normalize (string): "type" or "none" normalization method. Default is 'type'.
property_types (vector): vector containing ALL names of property->gene (PG) edge_types. May contain PG edge_types that are not in the given network, but should not contain gene<->gene edge_types. Default is c("allen_brain_atlas", "chip_binding", "gene_ontology", "motif_u5", "pfam_domain", "T1", "T2").
st2keep (int): number of property nodes to keep in second stage for each property type. To skip the second stage of DRaWR can set st2keep to 0. Default is 1.

relating to random walks

restarts (vector): vector of restart values to test. Default is c(0.7).
maxiters (int): maximum number of allowable iterations. Default is 50.
thresh (float): threshold for L1 norm convergence. Default is 0.0001.

Input files

Network Edge File

The network edge file is expected to be a 4 column, tab separated file. Each row represents and edge in the network and contains the values for: *col1: node1 (string) *col2: node2 (string) *col3: edge_weight (float) *col4: edge_type (string)

For edges between a property node and a gene node, the property node must be listed as node 1. The order of the nodes does not matter for gene-gene edges. The edge weights must be positive with large values meaning stronger relationships. The second stage of DRaWR is intended to limit the property nodes from many different property->gene (PG) edge_types. The number of gene<->gene (GG) edge_types should be limited. By default the network edge file will be converted to an undirected, normalized by edge_type adjacency matrix with the edge weights taken as the maximum if there is repetition.

Example network edge file:

G1	G6	0.76	typeGG
G2	G1	0.41	typeGG
G2	G7	0.73	typeGG
G4	G5	0.89	typeGG
P1	G2	0.57	typePG.1
P1	G6	0.30	typePG.1
P2	G1	0.07	typePG.1
P2	G4	0.89	typePG.1
P3	G3	0.66	typePG.1
P3	G6	0.80	typePG.1
P4	G4	0.24	typePG.2
P4	G1	0.74	typePG.2
P5	G5	0.20	typePG.2
P5	G2	0.95	typePG.2
P5	G7	0.92	typePG.2

Gene Set File / Gene Universe File

The query gene set files or gene universe files should one node name listed on each row. If the gene nodes have distinct weights, a second column (separated by a tab) can be added. Node weights must be positive with larger values meaning stronger evidence.

Example gene set file:

G3  2.5
G4  1.3
G5  4.0

The gene universe file must contain the appropriate universe of genes that the genes in the query gene sets may be chosen from. Typically, this will be all of the genes on the species of interest.

Gene Set List

This file facilitates the examination of multiple gene sets with the identical settings of DRaWR. The file must contain the location of a gene set file relative to itself on each line

Example gene set list:

../gene_sets/dmel/2270_procephalic_ectoderm_anlage_in_statu_nascendi.names.txt
../gene_sets/dmel/4227_ventral_ectoderm_primordium.names.txt
../gene_sets/dmel/5155_ventral_epidermis_primordium.names.txt
../gene_sets/dmel/6069_embryonic_dorsal_epidermis.names.txt

Output Files

File of results statistics

This file ending in '.stats' is produced to analyzing the performance results.

For every:

stage ("stage1") of DRaWR
fold ("iter") of cross validation
RWR restart parameter ("restart") setting
query gene set ("posset")

this file outputs on every line the settings of the DRaWR run as well as the:

AUROC evaluation metric ("aucval")
number of RWR interations ("rwr_iters")
number of training examples ("ntrain")

File of predictions

This file ending in '.rwr' contains information about each node in the network after running DRaWR. This file is only produced if writepreds = TRUE. There is one row for each node in the network, and the columns are:

"node": the name of that node
"type": the type of that node, -1 if node is a gene node
"universe": 1 if node in gene universe set, 0 otherwise
"baseline": probability of being in the node from converged baseline RWR
"train": 1 if node in the gene query set training set, 0 otherwise
"test": 1 if node in the gene query set test set, 0 otherwise
"stage1": probability of being in the node from converged stage1 RWR
"diff": difference between "stage1" and "baseline"
"keep": 1 if the property node was kept for the stage two RWR, 0 otherwise
"stage2": probability of being in the node from converged stage2 RWR

The file ending in '.base' contains the first 4 columns of the '.rwr' file and is used to accelerate the convergence of the RWRs.

Running

In R environment

Load libraries

.libPaths("library")
library(DRaWR)

Simple example no cross validation

DRaWR(possetfile = "packages/DRaWR/data/sample_inputs/test.setlist",
    unifile = "packages/DRaWR/data/sample_inputs/test.uni",
    networkfile = "packages/DRaWR/data/sample_inputs/test.edge",
    outdir = "packages/DRaWR/data/sample_outputs/results_",
    restarts = c(.3), nfolds = 1, st2keep = 1,
    undirected = TRUE, unweighted = FALSE, normalize = "type",
    maxiters = 50, thresh = 0.000001,
    property_types = c("T1", "T2"), writepreds = 1)

12 fly gene sets with no cross validation

DRaWR(possetfile = "setlists/dmel_possets12.list.txt",
    unifile = "gene_sets/universes/dmel.ids.uni.txt",
    networkfile = "networks/dmel_cdhmw.names.edge",
    outdir = "sample_results/",
    restarts = c(.3), nfolds = 1, st2keep = 50,
    undirected = TRUE, unweighted = FALSE, normalize = "type",
    maxiters = 50, thresh = 0.0001,
    property_types = c("chip_binding", "motif_u5", "pfam_domain"), writepreds = 1)

12 fly gene sets with 4fold cross validation

DRaWR(possetfile = "setlists/dmel_possets12.list.txt",
    unifile = "gene_sets/universes/dmel.ids.uni.txt",
    networkfile = "networks/dmel_cdhmw.names.edge",
    outdir = "sample_results/",
    restarts = c(.3), nfolds = 4, st2keep = 50,
    undirected = TRUE, unweighted = FALSE, normalize = "type",
    maxiters = 50, thresh = 0.0001,
    property_types = c("chip_binding", "motif_u5", "pfam_domain"), writepreds = 0)

Combined aggression set on 5sp aggression network without cross validation

DRaWR(possetfile = "setlists/3sps.fdr.1.list.txt",
    unifile = "gene_sets/universes/3sps_ids.uni.txt",
    networkfile = "networks/5sp_adhiw.names.edge",
    outdir = "sample_results/",
    restarts = c(.3), nfolds = 1, st2keep = 50,
    undirected = TRUE, unweighted = FALSE, normalize = "type",
    maxiters = 50, thresh = 0.0001,
    property_types = c("allen_brain_atlas", "gene_ontology","pfam_domain"), writepreds = 0)

From command line

Simple example no cross validation

Rscript -e ".libPaths(\"library\"); library(\"DRaWR\"); \
    DRaWR(possetfile = \"packages/DRaWR/data/sample_inputs/test.setlist\", \
    unifile = \"packages/DRaWR/data/sample_inputs/test.uni\", \
    networkfile = \"packages/DRaWR/data/sample_inputs/test.edge\", \
    outdir = \"packages/DRaWR/data/sample_outputs/cli_results_\", \
    restarts = c(.3), nfolds = 1, st2keep = 1, \
    undirected = TRUE, unweighted = FALSE, normalize = \"type\", \
    maxiters = 50, thresh = 0.000001, \
    property_types = c(\"T1\", \"T2\"), writepreds = 1) "

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
gene_sets		gene_sets
images		images
networks		networks
packages		packages
sample_results		sample_results
setlists		setlists
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DRaWR - Discriminative Random Walk with Restart

Charles Blatti and Saurabh Sinha

KnowEnG BD2K Center of Excellence

University of Illinois Urbana-Champaign

Motivation

Installation

Arguments

Method Arguments

relating to input files

relating to output files

relating to network processing

relating to random walks

Input files

Network Edge File

Gene Set File / Gene Universe File

Gene Set List

Output Files

File of results statistics

File of predictions

Running

In R environment

From command line

About

Releases

Packages

Languages

License

cblatti3/DRaWR

Folders and files

Latest commit

History

Repository files navigation

DRaWR - Discriminative Random Walk with Restart

Charles Blatti and Saurabh Sinha

KnowEnG BD2K Center of Excellence

University of Illinois Urbana-Champaign

Motivation

Installation

Arguments

Method Arguments

relating to input files

relating to output files

relating to network processing

relating to random walks

Input files

Network Edge File

Gene Set File / Gene Universe File

Gene Set List

Output Files

File of results statistics

File of predictions

Running

In R environment

From command line

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages