GitHub - davidaknowles/tf_net: A custom conv net for the DREAM ENCODE challenge

A custom CNN for the DREAM ENCODE challenge

This is a pretty standard convolutional neural net on genomic sequence with the following features added:

Normalized per-base DNase I cut counts for the + and - strands are concatenated onto the one-hot encoding of the sequence, giving a [sequence context] x 6 input matrix.
Gene expression PCs are included as features, allowing the model to interpolate between cell types.
A three-class ordinal likelihood is used for the Unbound/Ambiguous/Bound labels.
The forward sequence and its reverse complement are analyzed simultaneously.
The negative set is down-sampled to speed up training, and the likelihood is weighted to account for this.
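As a rough illustration of the first and fourth features, here is a minimal NumPy sketch of how a [sequence context] x 6 input could be assembled and reverse-complemented. The channel order (A, C, G, T, + strand, - strand), the function names and the toy cut values are all assumptions made for this example; the actual encoding lives in one_hot.pyx and tf_net.py and may differ.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order for this sketch


def one_hot(seq):
    """Return a len(seq) x 4 one-hot matrix; unknown bases (e.g. N) stay all-zero."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out


def build_input(seq, cuts_plus, cuts_minus):
    """Concatenate the one-hot sequence with the + and - strand cut tracks."""
    cuts = np.stack([cuts_plus, cuts_minus], axis=1).astype(np.float32)
    return np.concatenate([one_hot(seq), cuts], axis=1)


def reverse_complement(x):
    """Flip positions, swap A<->T and C<->G, and swap the two strand tracks."""
    return x[::-1, [3, 2, 1, 0, 5, 4]]


# Toy values standing in for already-normalized cut counts.
x = build_input("ACGTN", [0.1, 0.0, 0.3, 0.0, 0.2], [0.0, 0.5, 0.0, 0.0, 0.1])
print(x.shape)  # (5, 6)
```

Swapping the strand tracks in reverse_complement is a design choice of this sketch: a cut on the + strand of the forward sequence is a cut on the - strand when the region is read from the other direction.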

From the round 2 leaderboard you can see performance is highly competitive for some TFs (e.g. MAX https://www.synapse.org/#!Synapse:syn6131484/wiki/402503) and less so for others (e.g. REST https://www.synapse.org/#!Synapse:syn6131484/wiki/402505).

The repo is intended to be fully self-contained (apart from dependencies on the synapseclient, pysam and pyDNase Python packages), including programmatic download of challenge data, pre-processing, model fitting, prediction and submission.

METHODS.ipynb goes through the math for the ordinal likelihood, negative set downsampling and forward/RC model.
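For readers who want a feel for the idea before opening the notebook, below is a minimal sketch of a three-class ordinal likelihood with weighted negatives. It uses a standard cumulative-logit (ordered logit) form, which is an assumption here and not necessarily the exact formulation in METHODS.ipynb. The cut points, the keep fraction and all function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ordinal_probs(score, cut_low=-1.0, cut_high=1.0):
    """Probabilities of (Unbound, Ambiguous, Bound) given a real-valued score."""
    score = np.asarray(score, dtype=float)
    p_le_unbound = sigmoid(cut_low - score)      # P(label <= Unbound)
    p_le_ambiguous = sigmoid(cut_high - score)   # P(label <= Ambiguous)
    return np.stack(
        [p_le_unbound, p_le_ambiguous - p_le_unbound, 1.0 - p_le_ambiguous],
        axis=-1,
    )


def weighted_nll(score, labels, weights):
    """Negative log-likelihood with a per-example weight."""
    probs = ordinal_probs(score)
    picked = probs[np.arange(len(labels)), labels]
    return -np.sum(weights * np.log(picked))


# Labels: 0 = Unbound, 1 = Ambiguous, 2 = Bound.
labels = np.array([0, 0, 1, 2])
scores = np.array([-2.0, -0.5, 0.0, 3.0])

# If only a fraction of Unbound examples is kept, up-weighting each kept one
# by 1 / keep_fraction keeps the total likelihood on the original scale.
keep_fraction = 0.1  # hypothetical value
weights = np.where(labels == 0, 1.0 / keep_fraction, 1.0)
print(weighted_nll(scores, labels, weights))
```

A higher score shifts probability mass from Unbound through Ambiguous to Bound, which is what makes the likelihood ordinal instead of treating the three labels as unrelated classes.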

Installation

You'll need the following Python packages: pysam, pyDNase, scikit-learn (for performance metrics), synapseclient (for downloading the data and submitting), numpy, scipy and theano.
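A quick way to see which of these are missing before starting a long run is a small check like the one below. This helper is not part of the repo, and the import names are assumptions (in particular, scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names assumed for the packages listed above.
REQUIRED = ("pysam", "pyDNase", "sklearn", "synapseclient", "numpy", "scipy", "theano")


def missing_packages(names=REQUIRED):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```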

Usage

The script run_all.sh will in principle run all of the steps below for you. Realistically you'll want to train each TF model (and probably do the DNase pre-processing) on a cluster, since this is time-consuming (around 10 hours).

Set a data location, e.g. by adding something like the following to your ~/.bash_profile:

export DREAM_ENCODE_DATADIR=/myscratchspace/dream_encode/
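On the Python side, a script can pick this variable up from the environment. The sketch below shows one way to do so with an early, explicit failure when it is unset. The helper names and the sub-path used in the example are hypothetical; how the repo's own scripts read the variable is not shown here.

```python
import os


def data_dir(env=os.environ):
    """Return the challenge data directory, failing early if the variable is unset."""
    path = env.get("DREAM_ENCODE_DATADIR")
    if not path:
        raise RuntimeError("Set DREAM_ENCODE_DATADIR before running the pipeline")
    return path


def data_path(*parts, env=os.environ):
    """Join path components onto the data directory."""
    return os.path.join(data_dir(env), *parts)


# Example with an explicit mapping instead of the real environment.
example_env = {"DREAM_ENCODE_DATADIR": "/myscratchspace/dream_encode/"}
print(data_path("some_subdir", "some_file.npz", env=example_env))
```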

Download the challenge data using download_challenge_data.py. Note that you'll need to set your Synapse email/password in that script.
[optional] Calculate gene expression PCs using gene_expression_pca.R. The output file, ge_pca.txt, is included, so you don't strictly need to rerun this. If you do want to run it yourself you'll need the R packages irlba and foreach.
Calculate DNase I cut counts using the get_DNase_cuts.py script. This converts the DNase I BAM files into an efficient numpy representation of cut counts saved in .npz files. The BAM files first need indexing (e.g. using samtools); index.sh will do this for you.
Train models for each TF using train.py. This script also outputs the leaderboard and final submissions.
Submit to Synapse using submit.py. Note that you'll need to set up a folder in Synapse to use for this and set its id in the script.
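To make the DNase step more concrete, the sketch below shows the general idea of turning cut positions into per-base, per-strand count arrays and storing them in a compressed .npz file. It deliberately skips BAM parsing (which get_DNase_cuts.py does with pysam/pyDNase) and starts from toy position arrays. The function name, array keys, file name and dtype are assumptions for illustration, as is the convention for which base of a read counts as the cut.

```python
import os
import tempfile

import numpy as np


def cut_counts(cut_positions, is_reverse, chrom_length):
    """Count cuts per base, separately for the + and - strand."""
    pos = np.asarray(cut_positions, dtype=np.int64)
    rev = np.asarray(is_reverse, dtype=bool)
    plus = np.bincount(pos[~rev], minlength=chrom_length)
    minus = np.bincount(pos[rev], minlength=chrom_length)
    return plus.astype(np.int32), minus.astype(np.int32)


# Toy data: four cuts on a 10-base "chromosome".
plus, minus = cut_counts([2, 2, 5, 7], [False, False, True, False], chrom_length=10)

with tempfile.TemporaryDirectory() as tmp:
    out_file = os.path.join(tmp, "toy_cuts.npz")  # hypothetical file name
    np.savez_compressed(out_file, plus=plus, minus=minus)
    with np.load(out_file) as saved:
        loaded_plus = saved["plus"]
        loaded_minus = saved["minus"]

print(loaded_plus)
print(loaded_minus)
```

Cut-count tracks are mostly zeros, so the compressed .npz form is much smaller than a dense text dump and loads directly into the arrays a model needs.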

Repository files:

METHODS.ipynb
README.md
cell_types.txt
double_net.py
download_challenge_data.py
ge_pca.txt
gene_expression_pca.R
get_DNase_cuts.py
index.sh
one_hot.pyx
read_cuts.py
run_all.sh
submit.py
tf_net.py
train.py
train_leaderboard_final.py
train_leaderboard_final.txt
utils.py
