In [1]:
import glob, os, subprocess
from IPython.display import IFrame

## Precursors

To train a model, you first need to convert your sequences and targets into the input TFRecords format. Check out my tutorials for how to do that; they're linked from the [main page](../README.md).

For this tutorial, grab a small example dataset that I constructed here with 10% of the training sequences and only three heart targets (aorta, artery, and pulmonic valve).

In [2]:
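if len(glob.glob('data/heart_l131k/tfrecords/*.tfr')) == 0:
    os.makedirs('data', exist_ok=True)
    subprocess.call('curl -o data/heart_l131k.tgz https://storage.googleapis.com/basenji_tutorial_data/heart_l131k.tgz', shell=True)
    # Extract into data/ so the files land in data/heart_l131k, the path checked above and used below.
    subprocess.call('tar -xzvf data/heart_l131k.tgz -C data', shell=True)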

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 71.2M  100 71.2M    0     0  16.1M      0  0:00:04  0:00:04 --:--:-- 17.3M
x heart_l131k/
x heart_l131k/contigs.bed
x heart_l131k/statistics.json
x heart_l131k/targets.txt
x heart_l131k/sequences.bed
x heart_l131k/tfrecords/
x heart_l131k/tfrecords/train-0.tfr
x heart_l131k/tfrecords/train-1.tfr
x heart_l131k/tfrecords/train-3.tfr
x heart_l131k/tfrecords/train-2.tfr
x heart_l131k/tfrecords/train-5.tfr
x heart_l131k/tfrecords/test-0.tfr
x heart_l131k/tfrecords/train-4.tfr
x heart_l131k/tfrecords/valid-0.tfr
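
Before training, it can help to confirm what was unpacked. The following is a minimal sketch, not part of Basenji: `summarize_dataset` is a hypothetical helper that counts TFRecord shards per split from the file names and prints whatever *statistics.json* contains, without assuming any particular keys in that file.

```python
import glob
import json
import os


def summarize_dataset(data_dir):
    """Count TFRecord shards per split and load statistics.json if present."""
    shards = {}
    for path in sorted(glob.glob(os.path.join(data_dir, 'tfrecords', '*.tfr'))):
        # File names look like train-0.tfr, valid-0.tfr, test-0.tfr.
        split = os.path.basename(path).split('-')[0]
        shards[split] = shards.get(split, 0) + 1

    stats = {}
    stats_file = os.path.join(data_dir, 'statistics.json')
    if os.path.isfile(stats_file):
        with open(stats_file) as stats_open:
            stats = json.load(stats_open)
    return shards, stats


shards, stats = summarize_dataset('data/heart_l131k')
print('shards per split:', shards)
for key, value in stats.items():
    print(key, value)
```

Given the archive listing above, this should report six train shards and one each for valid and test. If the directory is missing, both results are simply empty.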


## Train

Next, you need to decide what sort of architecture to use, which you specify in a parameters file. The parameter grammar probably needs work; my goal was to let hyperparameter searches write the parameters to file so that I could run parallel training jobs to explore the hyperparameter space. I included an example set of parameters that works well with this data in models/params_small.json.
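
To illustrate the idea of a search writing parameters to file, here is a minimal sketch that writes one JSON file per value of a single hyperparameter. `write_param_variants` is a hypothetical helper, and the `train`/`learning_rate` keys are placeholders for illustration only; check models/params_small.json for the keys Basenji actually reads.

```python
import copy
import json
import os
import tempfile


def write_param_variants(base_params, section, name, values, out_dir):
    """Write one JSON params file per value of a single hyperparameter."""
    os.makedirs(out_dir, exist_ok=True)
    param_files = []
    for i, value in enumerate(values):
        # Copy so that each variant starts from the untouched base parameters.
        params = copy.deepcopy(base_params)
        params.setdefault(section, {})[name] = value
        param_file = os.path.join(out_dir, 'params_%d.json' % i)
        with open(param_file, 'w') as param_open:
            json.dump(params, param_open, indent=4)
        param_files.append(param_file)
    return param_files


# Hypothetical base parameters; the real keys live in the params JSON file.
base_params = {'train': {'learning_rate': 0.1}, 'model': {}}
search_dir = tempfile.mkdtemp()  # scratch directory for the demo
param_files = write_param_variants(
    base_params, 'train', 'learning_rate', [0.1, 0.01, 0.001], search_dir)
print(param_files)
```

Each resulting file could then be handed to its own training job.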

Then, run [basenji_train.py](https://github.com/calico/basenji/blob/master/bin/basenji_train.py) to train a model. The program will offer training feedback via stdout and write the model output files to the directory given by the *-o* option.

The most relevant options here are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| -o | models/heart | Directory to save training logs and model checkpoints. |
| params_file | models/params_small.json | JSON file specifying the model architecture and optimization parameters. |
| data_dir | data/heart_l131k | Data directory containing the training, validation, and test datasets as generated by [basenji_data.py](https://github.com/calico/basenji/blob/master/bin/basenji_data.py) |
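
If you would rather launch training from Python than from a shell line, the option and arguments in the table can be assembled as below. This is a sketch only: `build_train_command` and `missing_inputs` are hypothetical helpers, and the only flag used is the *-o* option from the table.

```python
import os
import shlex


def build_train_command(out_dir, params_file, data_dir):
    """Assemble the basenji_train.py call from the table's option and arguments."""
    return ['basenji_train.py', '-o', out_dir, params_file, data_dir]


def missing_inputs(params_file, data_dir):
    """Return the input paths that do not exist, so a long run can fail early."""
    missing = []
    if not os.path.isfile(params_file):
        missing.append(params_file)
    if not os.path.isdir(data_dir):
        missing.append(data_dir)
    return missing


cmd = build_train_command('models/heart', 'models/params_small.json', 'data/heart_l131k')
print(shlex.join(cmd))
print('missing inputs:', missing_inputs('models/params_small.json', 'data/heart_l131k'))
```

Passing `cmd` to `subprocess.call` would start the job; checking the inputs first avoids discovering a wrong path only after the job has been queued.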

If you want to train, run the following line. Depending on your hardware, it may require several hours.

In [8]:
! basenji_train.py -o models/heart models/params_small.json data/heart_l131k

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 sequence (InputLayer)       [(None, 131072, 4)]          0         []                            
                                                                                                  
 stochastic_reverse_complem  ((None, 131072, 4),          0         ['sequence[0][0]']            
 ent (StochasticReverseComp   ())                                                                 
 lement)                                                                                          
                                                                                                  
 stochastic_shift (Stochast  (None, 131072, 4)            0         ['stochastic_reverse_complemen
 icShift)                                                           t[0][0]']               

## Test

Alternatively, you can just download a trained model.

In [4]:
if not os.path.isdir('models/heart'):
    os.mkdir('models/heart')
if not os.path.isfile('models/heart/model_best.h5'):
    subprocess.call('curl -o models/heart/model_best.h5 https://storage.googleapis.com/basenji_tutorial_data/model_best.h5', shell=True)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1157k  100 1157k    0     0  3381k      0 --:--:-- --:--:-- --:--:-- 3374k


models/heart/model_best.h5 is now the saved model file to provide to other programs.

To further benchmark the accuracy (e.g. computing significant "peak" accuracy), use [basenji_test.py](https://github.com/calico/basenji/blob/master/bin/basenji_test.py).

The most relevant options here are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| --ai | 0,1,2 | Make accuracy scatter plots for targets 0, 1, and 2. |
| -o | output/heart_test | Output directory. |
| --rc | | Average the forward and reverse complement to form an ensemble predictor. |
| --shifts | 1,0,-1 | Average sequence shifts to form an ensemble predictor. |
| params_file | models/params_small.json | JSON file specifying the model architecture and optimization parameters. |
| model_file | models/heart/model_best.h5 | Trained saved model parameters. |
| data_dir | data/heart_l131k | Data directory containing the test input and output datasets as generated by [basenji_data.py](https://github.com/calico/basenji/blob/master/bin/basenji_data.py) |
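
The same table can be turned into a command programmatically, which is handy when testing several models. The sketch below uses only the options listed above; `build_test_command` is a hypothetical helper, not something Basenji provides.

```python
import shlex


def build_test_command(params_file, model_file, data_dir, out_dir,
                       target_indexes=None, rc=False, shifts=None):
    """Assemble a basenji_test.py call from the options in the table."""
    cmd = ['basenji_test.py']
    if target_indexes:
        # Targets to draw accuracy scatter plots for, e.g. 0,1,2.
        cmd += ['--ai', ','.join(str(ti) for ti in target_indexes)]
    cmd += ['-o', out_dir]
    if rc:
        cmd.append('--rc')
    if shifts:
        cmd += ['--shifts', ','.join(str(shift) for shift in shifts)]
    # Positional arguments come last, in the order shown in the table.
    cmd += [params_file, model_file, data_dir]
    return cmd


test_cmd = build_test_command(
    'models/params_small.json', 'models/heart/model_best.h5', 'data/heart_l131k',
    'output/heart_test', target_indexes=[0, 1, 2], rc=True, shifts=[1, 0, -1])
print(shlex.join(test_cmd))
```

The printed line should match the shell command used in this tutorial, apart from the quotes around the shifts value, which the shell removes anyway.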

In [10]:
! basenji_test.py --ai 0,1,2 -o output/heart_test --rc --shifts "1,0,-1" models/params_small.json models/heart/model_best.h5 data/heart_l131k

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 sequence (InputLayer)       [(None, 131072, 4)]          0         []                            
                                                                                                  
 stochastic_reverse_complem  ((None, 131072, 4),          0         ['sequence[0][0]']            
 ent (StochasticReverseComp   ())                                                                 
 lement)                                                                                          
                                                                                                  
 stochastic_shift (Stochast  (None, 131072, 4)            0         ['stochastic_reverse_complemen
 icShift)                                                           t[0][0]']               

*output/heart_test/acc.txt* is a table specifying the Pearson correlation and R2 for each dataset.

In [11]:
! cat output/heart_test/acc.txt

index	pearsonr	r2	identifier	description
0	0.50047	0.23747	CNhs11760	aorta
1	0.64530	0.40714	CNhs12843	artery
2	0.49690	0.24133	CNhs12856	pulmonic_valve
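
To work with these numbers rather than just read them, the table can be parsed with the standard library. This sketch assumes the file is tab-separated with the header shown above, and it embeds the three rows from this run so that it runs without the file; `read_accuracy` is a hypothetical helper.

```python
import csv
import io

# The table printed above, embedded so the example runs on its own.
ACC_TEXT = """index\tpearsonr\tr2\tidentifier\tdescription
0\t0.50047\t0.23747\tCNhs11760\taorta
1\t0.64530\t0.40714\tCNhs12843\tartery
2\t0.49690\t0.24133\tCNhs12856\tpulmonic_valve
"""


def read_accuracy(handle):
    """Parse an acc.txt-style table into a list of dicts with numeric metrics."""
    rows = []
    for row in csv.DictReader(handle, delimiter='\t'):
        rows.append({
            'index': int(row['index']),
            'pearsonr': float(row['pearsonr']),
            'r2': float(row['r2']),
            'identifier': row['identifier'],
            'description': row['description'],
        })
    return rows


rows = read_accuracy(io.StringIO(ACC_TEXT))
best = max(rows, key=lambda row: row['pearsonr'])
mean_pearsonr = sum(row['pearsonr'] for row in rows) / len(rows)
print('best target:', best['description'])
print('mean pearsonr: %.5f' % mean_pearsonr)
```

To read the real file instead, pass `open('output/heart_test/acc.txt')` in place of the `io.StringIO` object.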


The directories *pr*, *roc*, *violin*, and *scatter* in *output/heart_test* contain plots for the targets indexed by 0, 1, and 2 as specified by the --ai option above.

For example:

In [12]:
IFrame('output/heart_test/pr/t0.pdf', width=600, height=500)
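
To see which plots were written before opening one, you could list the files in each plot directory. This is a sketch that assumes every plot is a PDF, as *pr/t0.pdf* is; `list_plots` is a hypothetical helper.

```python
import glob
import os


def list_plots(test_dir, plot_dirs=('pr', 'roc', 'violin', 'scatter')):
    """Map each plot directory name to the sorted PDF files found inside it."""
    plots = {}
    for plot_dir in plot_dirs:
        pattern = os.path.join(test_dir, plot_dir, '*.pdf')
        plots[plot_dir] = sorted(glob.glob(pattern))
    return plots


for plot_dir, files in list_plots('output/heart_test').items():
    print(plot_dir, len(files), files[:3])
```

Any of the returned paths can be passed to `IFrame` in the same way as the example above.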