This repository contains the code for the Gene Oracle project. Gene Oracle is an ongoing research effort to discover biomarker genes using gene expression data. Gene Oracle identifies gene sets which provide the most predictive power, based on how well they classify samples in a gene expression dataset.
For more information, refer to the paper: Uncovering biomarker genes with enriched classification potential from Hallmark gene sets
All of Gene Oracle's dependencies can be installed via Anaconda. On a shared system (such as a university research cluster), it is recommended that you install everything in an Anaconda environment:
conda env create -f environment.yml
You must then "activate" your environment in order to use it:
conda activate gene-oracle
# use gene-oracle
conda deactivate
After that, simply clone this repository to use Gene Oracle.
git clone https://github.com/SystemsGenetics/gene-oracle.git
# run the example
cd gene-oracle
scripts/run-example.sh
Gene Oracle consists of two phases: (1) gene set analysis and (2) gene subset analysis. Each phase is carried out by scripts that are run in sequence. The easiest way to learn how to run these scripts, and what input and output data they involve, is to run the example script shown above. It demonstrates how to run Gene Oracle on synthetic input data generated by make-inputs.py.
Gene Oracle can use any classifier provided by scikit-learn, as well as a custom neural network (implemented in TensorFlow), to evaluate gene sets. Several classifiers are defined with sensible default parameters in models.json. Consult the scikit-learn documentation to see the list of parameters for each classifier. The example run script uses a linear model, which is one of the simplest classifiers available. Other models such as the neural network or random forest may perform better but will take longer to train.
NOTE: A GPU is required only when using the mlp-tf model with tensorflow-gpu. A GPU might not provide a significant speedup over a multicore CPU when training many small neural networks.
Gene Oracle takes three primary inputs: (1) a gene expression matrix (GEM), (2) a list of sample labels, and (3) a list of gene sets. These inputs are described below.
The gene expression matrix should be a plaintext file with rows being samples and columns being genes (features). Values in each row should be separated by tabs.
Gene1 Gene2 Gene3 Gene4
Sample1 0.523 0.991 0.421 0.829
Sample2 8.891 7.673 3.333 9.103
Sample3 4.444 5.551 6.102 0.013
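As a rough illustration of the layout described above, the following sketch parses a tab-separated GEM using only the Python standard library. The `load_gem` function is hypothetical and is not part of Gene Oracle. It assumes the first row holds gene names (with or without a leading empty cell) and every later row starts with a sample name.

```python
import csv
import io


def load_gem(handle):
    """Read a tab-separated GEM: a header row of gene names, then one row per sample."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumption: the header may carry an empty leading cell above the sample names.
    genes = [name for name in header if name]
    samples, rows = [], []
    for record in reader:
        if not record:
            continue  # skip blank lines
        values = [float(x) for x in record[1:]]
        if len(values) != len(genes):
            raise ValueError(
                f"{record[0]}: expected {len(genes)} values, got {len(values)}"
            )
        samples.append(record[0])
        rows.append(values)
    return samples, genes, rows


# The example matrix from this section, with explicit tab separators.
example = (
    "\tGene1\tGene2\tGene3\tGene4\n"
    "Sample1\t0.523\t0.991\t0.421\t0.829\n"
    "Sample2\t8.891\t7.673\t3.333\t9.103\n"
    "Sample3\t4.444\t5.551\t6.102\t0.013\n"
)
samples, genes, rows = load_gem(io.StringIO(example))
```

A row with the wrong number of values raises an error, which is a quick way to catch a GEM that is space-separated or transposed.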
For large GEM files, it is recommended that you convert the GEM to numpy format using convert.py from the GEMprep repo, as Gene Oracle can load this binary format much more quickly than the plaintext format. The convert.py script can also transpose your GEM if it is arranged the wrong way:
bin/convert.py GEM.emx.txt GEM.emx.npy --transpose
This example will create three files: GEM.emx.npy, GEM.emx.rownames.txt, and GEM.emx.colnames.txt. The latter two files contain the row names and column names, respectively. Make sure that the rows are samples and the columns are genes!
The label file should contain a label for each sample, corresponding to something such as a condition or phenotype state for the sample. This file should contain two columns, the first being the sample names and the second being the labels. Values in each row should be separated by tabs.
Sample1 Label1
Sample2 Label2
Sample3 Label3
Sample4 Label4
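The sketch below reads a label file in the two-column format shown above and reports any GEM samples that have no label. Both `load_labels` and `missing_labels` are hypothetical helpers for checking your inputs; they are not functions provided by Gene Oracle.

```python
import io


def load_labels(handle):
    """Read 'sample<TAB>label' lines into a dict keyed by sample name."""
    labels = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 tab-separated fields, got {len(fields)}"
            )
        sample, label = fields
        labels[sample] = label
    return labels


def missing_labels(gem_samples, labels):
    """Return the GEM sample names that have no entry in the label file."""
    return [sample for sample in gem_samples if sample not in labels]


label_text = "Sample1\tLabel1\nSample2\tLabel2\nSample3\tLabel3\nSample4\tLabel4\n"
labels = load_labels(io.StringIO(label_text))

# "Sample5" is a made-up name used to show what a mismatch looks like.
unlabeled = missing_labels(["Sample1", "Sample2", "Sample5"], labels)
```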
The gene set list should contain the name and genes for a gene set on each line, similar to the GMT format. The gene names should be identical to those used in the GEM file. Values on each row should be separated by tabs.
GeneSet1 Gene1 Gene2 Gene3
GeneSet2 Gene2 Gene4 Gene5 Gene6
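To make the format concrete, here is a minimal parsing sketch, assuming each line is a gene set name followed by its genes with no description column (the standard GMT format has one; the format shown above does not). The helper names are hypothetical. The second function lists genes that do not appear in the GEM, which is a common cause of silently smaller gene sets.

```python
import io


def load_gene_sets(handle):
    """Read 'name<TAB>gene1<TAB>gene2...' lines into a dict of gene lists."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        gene_sets[fields[0]] = [gene for gene in fields[1:] if gene]
    return gene_sets


def genes_not_in_gem(gene_sets, gem_genes):
    """For each gene set, list the genes that are absent from the GEM columns."""
    known = set(gem_genes)
    return {
        name: [gene for gene in genes if gene not in known]
        for name, genes in gene_sets.items()
    }


gene_set_text = "GeneSet1\tGene1\tGene2\tGene3\nGeneSet2\tGene2\tGene4\tGene5\tGene6\n"
gene_sets = load_gene_sets(io.StringIO(gene_set_text))

# Column names taken from the GEM example earlier in this document.
missing = genes_not_in_gem(gene_sets, ["Gene1", "Gene2", "Gene3", "Gene4"])
```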
The script phase1-evaluate.py takes a list of gene sets and evaluates each gene set by training and evaluating a classifier on the input dataset with only the genes in the set. This script can also evaluate the entire set of genes in the input dataset, as well as random gene sets.
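The idea of scoring a gene set by classifying with only its genes can be sketched as follows. This is not the code in phase1-evaluate.py: it swaps in a tiny nearest-centroid classifier written in plain Python in place of the scikit-learn and TensorFlow models, uses a simple positional train/test split, and runs on made-up data. All function names are hypothetical.

```python
def select_columns(rows, genes, gene_set):
    """Keep only the GEM columns whose gene names are in the gene set."""
    index = {gene: i for i, gene in enumerate(genes)}
    columns = [index[gene] for gene in gene_set if gene in index]
    return [[row[c] for c in columns] for row in rows]


def centroid(points):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit one centroid per label, then score predictions on the test samples."""
    by_label = {}
    for x, y in zip(train_x, train_y):
        by_label.setdefault(y, []).append(x)
    centroids = {y: centroid(xs) for y, xs in by_label.items()}

    def predict(x):
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    hits = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def evaluate_gene_set(rows, labels, genes, gene_set, n_train):
    """Score a gene set using the first n_train samples for training."""
    x = select_columns(rows, genes, gene_set)
    return nearest_centroid_accuracy(
        x[:n_train], labels[:n_train], x[n_train:], labels[n_train:]
    )


# Toy data: Gene1 separates the two labels, Gene3 does not.
genes = ["Gene1", "Gene2", "Gene3"]
rows = [
    [0.1, 5.0, 3.0],
    [0.2, 1.0, 3.2],
    [0.9, 5.2, 3.1],
    [1.0, 1.1, 3.3],
    [0.0, 3.0, 3.0],  # test sample, label A
    [1.1, 3.0, 3.0],  # test sample, label B
]
labels = ["A", "A", "B", "B", "A", "B"]

informative_score = evaluate_gene_set(rows, labels, genes, ["Gene1"], n_train=4)
uninformative_score = evaluate_gene_set(rows, labels, genes, ["Gene3"], n_train=4)
```

On this toy data the set containing the informative gene scores higher than the set containing only the uninformative one, which is the signal phase 1 looks for at the gene set level.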
The script phase1-select.py takes evaluation results for gene sets and compares them to results for random sets of equal size. It uses Student's t-test to determine the statistical significance of a gene set's score as compared to a null distribution for the given set size. Larger gene sets tend to yield higher classification accuracies, so the t-test is used to eliminate this bias when selecting gene sets for subset analysis.
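The comparison against random sets can be illustrated with a two-sample Student's t statistic (pooled variance), computed here with the standard library only. This is a sketch under assumptions: the text above does not say whether phase1-select.py uses a one-sample or two-sample test, and the scores below are made up. Turning the statistic into a p-value requires the t-distribution, which a library routine such as `scipy.stats.ttest_ind` provides.

```python
import math
import statistics


def student_t_statistic(scores, null_scores):
    """Two-sample Student's t statistic using the pooled sample variance."""
    n1, n2 = len(scores), len(null_scores)
    mean1, mean2 = statistics.mean(scores), statistics.mean(null_scores)
    var1, var2 = statistics.variance(scores), statistics.variance(null_scores)
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled * (1 / n1 + 1 / n2))


# Hypothetical accuracies from repeated runs of one gene set...
gene_set_scores = [0.90, 0.92, 0.91, 0.93]
# ...and from random gene sets of the same size (the null distribution).
random_set_scores = [0.80, 0.82, 0.79, 0.81, 0.83]

t_stat = student_t_statistic(gene_set_scores, random_set_scores)
```

A large positive statistic means the gene set outperforms random sets of the same size by more than run-to-run variation would explain, which is why the comparison is made per set size.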
The script phase2-evaluate.py takes a list of gene sets and evaluates subsets of each gene set in order to determine the most salient genes in the gene set. This script can also analyze random gene sets in the same manner.
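A minimal sketch of subset evaluation is shown below, assuming exhaustive enumeration of all subsets of a fixed size. The text above does not say how phase2-evaluate.py chooses subsets, and exhaustive enumeration grows combinatorially, so treat this as an illustration only. The `toy_score` function and its weights are made up and stand in for training a classifier on each subset.

```python
from itertools import combinations


def enumerate_subsets(gene_set, size):
    """List every subset of the gene set with the given number of genes."""
    return [list(subset) for subset in combinations(gene_set, size)]


def rank_subsets(gene_set, size, score):
    """Score every subset of the given size and sort from best to worst."""
    scored = [(score(subset), subset) for subset in enumerate_subsets(gene_set, size)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical per-gene contributions, used only to make the example runnable.
toy_weights = {"Gene2": 0.1, "Gene4": 0.4, "Gene5": 0.3, "Gene6": 0.05}


def toy_score(subset):
    """Stand-in for a classifier's accuracy on the subset."""
    return sum(toy_weights[gene] for gene in subset)


ranked = rank_subsets(["Gene2", "Gene4", "Gene5", "Gene6"], 2, toy_score)
best_subset = ranked[0][1]
```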
The script phase2-select.py takes evaluation results for the subsets selected by the previous script, measures the saliency of each gene by how frequently it appears across those subsets, and separates "candidate" genes from "non-candidate" genes according to a threshold.
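The frequency-and-threshold step can be sketched as follows. The function names, the example subsets, and the 0.5 threshold are all hypothetical; the actual threshold and frequency definition used by phase2-select.py may differ.

```python
from collections import Counter


def gene_frequencies(subsets):
    """Fraction of subsets in which each gene appears."""
    counts = Counter(gene for subset in subsets for gene in set(subset))
    return {gene: count / len(subsets) for gene, count in counts.items()}


def split_candidates(frequencies, threshold):
    """Separate genes at or above the frequency threshold from the rest."""
    candidates = sorted(g for g, f in frequencies.items() if f >= threshold)
    non_candidates = sorted(g for g, f in frequencies.items() if f < threshold)
    return candidates, non_candidates


# Made-up subsets, standing in for those selected during phase 2 evaluation.
selected_subsets = [
    ["Gene2", "Gene4"],
    ["Gene4", "Gene5"],
    ["Gene4", "Gene6"],
    ["Gene2", "Gene5"],
]
frequencies = gene_frequencies(selected_subsets)
candidates, non_candidates = split_candidates(frequencies, threshold=0.5)
```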
This repository also provides a Nextflow pipeline for running Gene Oracle. All you need are Nextflow, Docker, and nvidia-docker. On HPC systems, you can use Singularity in lieu of Docker. If for some reason you can't use either container software, you will have to install Gene Oracle and its dependencies on your local machine.
The nextflow pipeline assumes you have your input data arranged as follows:
input/
{dataset1}.emx.txt
{dataset1}.labels.txt
{dataset2}.emx.txt
{dataset2}.labels.txt
...
{genesets1}.genesets.txt
{genesets2}.genesets.txt
...
This way, you can place as many gene set lists and datasets in the input directory as you like, and the pipeline will process all of them in a single run.
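Before launching a run, it can help to confirm that the input directory follows this naming scheme. The sketch below is a hypothetical pre-flight check, not part of the pipeline: it pairs each `.emx.txt` file with its `.labels.txt` file and lists the `.genesets.txt` files. The demo file names are made up.

```python
import tempfile
from pathlib import Path


def discover_inputs(input_dir):
    """Pair each GEM with its label file and collect the gene set lists."""
    input_dir = Path(input_dir)
    datasets = {}
    for gem in sorted(input_dir.glob("*.emx.txt")):
        name = gem.name[: -len(".emx.txt")]
        labels = input_dir / f"{name}.labels.txt"
        if not labels.exists():
            raise FileNotFoundError(
                f"missing label file for dataset '{name}': {labels}"
            )
        datasets[name] = (gem, labels)
    gene_sets = sorted(input_dir.glob("*.genesets.txt"))
    return datasets, gene_sets


# Demo on a throwaway directory populated with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for filename in (
        "a.emx.txt",
        "a.labels.txt",
        "b.emx.txt",
        "b.labels.txt",
        "hallmark.genesets.txt",
    ):
        (Path(tmp) / filename).touch()
    datasets, gene_sets = discover_inputs(tmp)
    dataset_names = sorted(datasets)
    gene_set_names = [path.name for path in gene_sets]
```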
Here is a basic usage example:
nextflow run systemsgenetics/gene-oracle -profile <conda|docker|singularity>
This example will download this pipeline to your machine and use the default nextflow.config in this repo. It will assume that you have Gene Oracle installed natively, and it will process all input files in the input directory, saving all output files to the output directory, as defined in nextflow.config.
You can also create your own nextflow.config file; Nextflow will check for a config file in your current directory before defaulting to the config file in this repo. You will most likely need to customize this config file, as it provides options such as which experiments to run, how many chunks to use where applicable, and various other command-line parameters for Gene Oracle. The config file also allows you to define your own "profiles" for running this pipeline in different environments. Consult the Nextflow documentation for more information on what environments are supported.
You can resume a failed run with the -resume flag. Consult the Nextflow documentation for more information on this and other options.
You can run this pipeline, as well as any other Nextflow pipeline, on a Kubernetes cluster with minimal effort. Consult the kube-runner repo for a command-line approach and Nextflow-API for a browser-based approach.