To train a model, you first need to convert your sequences and targets into the input HDF5 format. Check out my tutorials for how to do that; they're linked from the [main page](../README.md).

For this tutorial, grab a small example HDF5 that I constructed here with 10% of the training sequences and only GM12878 targets for various DNase-seq, ChIP-seq, and CAGE experiments.

In [2]:
import os, subprocess

if not os.path.isfile('data/heart_l131k.h5'):
    subprocess.call('curl -o data/heart_l131k.h5 https://storage.googleapis.com/basenji_tutorial_data/heart_l131k.h5', shell=True)
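
The same download-if-missing pattern can also be written without shelling out to curl. Here is a minimal sketch that uses only the standard library; `download_if_missing` is a hypothetical helper name of mine, not something Basenji provides.

```python
import os
import urllib.request


def download_if_missing(url, path):
    """Download url to path unless path already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper for illustration; not part of Basenji.)
    """
    if os.path.isfile(path):
        return False
    # Create the parent directory (e.g. data/) if it does not exist yet.
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    urllib.request.urlretrieve(url, path)
    return True
```

Calling `download_if_missing('https://storage.googleapis.com/basenji_tutorial_data/heart_l131k.h5', 'data/heart_l131k.h5')` should then behave like the cell above, with the existence check and the download target guaranteed to be the same path.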

Next, you need to decide what sort of architecture to use. The architecture and optimization settings are read from a parameter file. Its grammar probably needs work; my goal was to let hyperparameter searches write their parameters to file so that I could run parallel training jobs to explore the hyperparameter space. I included an example set of parameters that works well with this data in models/params_small.txt.
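
To make the hyperparameter search idea concrete, here is a minimal sketch of a grid search that writes one parameter file per configuration. The file layout (one tab-separated name/value pair per line) is an assumption on my part, so check models/params_small.txt for the real grammar. The parameter names are taken from the parameter dictionary printed later in this tutorial; the candidate values and the helper names are made up for illustration.

```python
import itertools
import os


def format_params(params):
    """Render a dict as one 'name<TAB>value' pair per line.

    ASSUMPTION: this layout is a guess; compare with models/params_small.txt.
    """
    return "".join(f"{name}\t{value}\n" for name, value in sorted(params.items()))


def grid(search_space):
    """Yield one dict per combination of the candidate values."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[n] for n in names)):
        yield dict(zip(names, values))


def write_param_files(search_space, out_dir):
    """Write one parameter file per grid point and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, params in enumerate(grid(search_space)):
        path = os.path.join(out_dir, f"params_{i}.txt")
        with open(path, "w") as handle:
            handle.write(format_params(params))
        paths.append(path)
    return paths


# Hypothetical search space: two learning rates times two dropout rates.
space = {"learning_rate": [0.001, 0.002], "cnn_dropout": [0.05, 0.1]}
configs = list(grid(space))
print(len(configs), "configurations")
```

Each file returned by `write_param_files(space, 'models/search')` could then be handed to its own training job.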

Then, run [basenji_train.py](https://github.com/calico/basenji/blob/master/bin/basenji_train.py) to train a model. The program reports training progress on stdout and writes its logs and model checkpoints to the directory given by the *--logdir* option.

The most relevant options here are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| --augment_rc | | Process even-numbered epochs as forward, odd-numbered as reverse complemented. |
| --ensemble_rc | | Average forward and reverse complemented predictions on validation set. |
| --augment_shifts | "1,0,-1" | Rotate epochs over small sequence shifts. |
| --logdir | models/heart | Directory to save training logs and model checkpoints. |
| --params | models/params_small.txt | Table of parameters to set up the model architecture and optimization. |
| --data | data/heart_l131k.h5 | HDF5 file containing the training and validation input and output datasets as generated by [basenji_hdf5_single.py](https://github.com/calico/basenji/blob/master/bin/basenji_hdf5_single.py). |

If you want to train, uncomment the following line and run it. Depending on your hardware, it may require several hours.

In [None]:
# ! basenji_train.py --augment_rc --ensemble_rc --augment_shifts "1,0,-1" --logdir models/heart --params models/params_small.txt --data data/heart_l131k.h5
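
If you launch several training jobs from Python, for example one per parameter file, it can help to assemble the command as an argument list rather than a shell string. Here is a minimal sketch that uses only the options from the table above; `build_train_command` is a hypothetical helper, not part of Basenji.

```python
import shlex


def build_train_command(options, flags=()):
    """Assemble a basenji_train.py argument list.

    flags are value-less switches; options maps option names to values.
    (Hypothetical helper for illustration.)
    """
    cmd = ["basenji_train.py"]
    for flag in flags:
        cmd.append(f"--{flag}")
    for name, value in options.items():
        cmd.extend([f"--{name}", str(value)])
    return cmd


cmd = build_train_command(
    {
        "augment_shifts": "1,0,-1",
        "logdir": "models/heart",
        "params": "models/params_small.txt",
        "data": "data/heart_l131k.h5",
    },
    flags=("augment_rc", "ensemble_rc"),
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the job without going through a shell, so the shift list needs no extra quoting.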

Alternatively, you can just download a trained model.

In [3]:
if not os.path.isdir('models/heart'):
    os.mkdir('models/heart')
if not os.path.isfile('models/heart/model_best.tf.meta'):
    subprocess.call('curl -o models/heart/model_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.index', shell=True)
    subprocess.call('curl -o models/heart/model_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.meta', shell=True)
    subprocess.call('curl -o models/heart/model_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_best.tf.data-00000-of-00001', shell=True)

models/heart/model_best.tf is now the prefix of your saved model. Pass it to other programs wherever they expect a model file.
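
Note that models/heart/model_best.tf is a prefix shared by the three downloaded files, not a file itself. Here is a minimal sketch for checking that all three are in place before running other programs; the suffix list simply mirrors the downloads above, and `missing_checkpoint_files` is a hypothetical helper.

```python
import os

# Suffixes of the three files downloaded above.
CHECKPOINT_SUFFIXES = (".index", ".meta", ".data-00000-of-00001")


def missing_checkpoint_files(prefix, suffixes=CHECKPOINT_SUFFIXES):
    """Return the files expected for this prefix that are not on disk."""
    return [prefix + s for s in suffixes if not os.path.isfile(prefix + s)]


missing = missing_checkpoint_files("models/heart/model_best.tf")
if missing:
    print("Missing checkpoint files:", ", ".join(missing))
```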

To further benchmark the accuracy (e.g. computing significant "peak" accuracy), use [basenji_test.py](https://github.com/calico/basenji/blob/master/bin/basenji_test.py).

The most relevant options here are:

| Option/Argument | Value | Note |
|:---|:---|:---|
| --rc | | Average the forward and reverse complement to form prediction. |
| -o | output/heart_test | Output directory. |
| --ai | 0,1,2 | Make accuracy scatter plots for targets 0, 1, and 2. |
| --ti | 0,1,2 | Make BigWig tracks for targets 0, 1, and 2. |
| -t | data/heart.bed | BED file describing sequence regions for BigWig track output. |
| --peaks | | Compute peak classification accuracy (AUROC and AUPRC), written to peaks.txt. |
| params_file | models/params_small.txt | Table of parameters to set up the model architecture and optimization. |
| model_file | models/heart/model_best.tf | Trained saved model prefix. |
| data_file | data/heart_l131k.h5 | HDF5 file containing the test input and output datasets as generated by [basenji_hdf5_single.py](https://github.com/calico/basenji/blob/master/bin/basenji_hdf5_single.py). |

In [6]:
! basenji_test.py --rc -o output/heart_test --ai 0,1,2 -t data/heart.bed --ti 0,1,2 --peaks models/params_small.txt models/heart/model_best.tf data/heart_l131k.h5

  from ._conv import register_converters as _register_converters
  return f(*args, **kwds)
{'optimizer': 'adam', 'num_targets': 3, 'cnn_dense': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0], 'adam_beta1': 0.97, 'target_pool': 128, 'cnn_dilation': [1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 1], 'batch_buffer': 4096, 'adam_beta2': 0.98, 'learning_rate': 0.002, 'cnn_filters': [128, 128, 192, 256, 256, 32, 32, 32, 32, 32, 32, 384], 'cnn_filter_sizes': [20, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 1], 'cnn_dropout': 0.1, 'batch_size': 4, 'loss': 'poisson', 'link': 'softplus', 'cnn_pool': [2, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0]}
Targets pooled by 128 to length 1024
Convolution w/ 3 384x1 filters to final targets
Model building time 11s
2018-05-15 14:23:46.495192: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-05-15 14:23:50.633997: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at save_restore_v2_ops

*output/heart_test/acc.txt* is a table specifying the loss function value, R2, R2 after log2, and Spearman correlation for each dataset.

In [14]:
! cat output/heart_test/acc.txt

   0  2.61342  0.06309  0.25821  0.21923  ENCSR000EJD_3_1
   1  2.21471  0.13869  0.42051  0.31831  ENCSR000EMT_2_1
   2  1.38732  0.14672  0.40900  0.32414  ENCSR000EMT_1_1
   3  2.77317  0.07130  0.27155  0.24821  ENCSR000EJD_1_1
   4  2.37138  0.09846  0.33517  0.31370  ENCSR000EJD_2_1
   5  1.70094  0.34714  0.63973  0.43597  ENCSR057BWO_2_1
   6  1.19770  0.08201  0.30734  0.31978  ENCSR000AKE_1_1
   7  0.87382  0.05702  0.26991  0.29605  ENCSR000AKF_2_1
   8  1.02482  0.22666  0.51480  0.34158  ENCSR000AOV_2_1
   9  0.82986  0.04125  0.20892  0.23112  ENCSR000AKI_2_1
  10  2.34713  0.37184  0.70600  0.45844  ENCSR000AKA_2_1
  11  0.99584  0.02339  0.17299  0.20299  ENCSR000AOX_2_1
  12  1.11231  0.12795  0.36861  0.41651  ENCSR000DRW_1_1
  13  1.22254  0.24231  0.54471  0.50531  ENCSR000AOW_1_1
  14  1.15009  0.02822  0.21957  0.23801  ENCSR000AKD_1_1
  15  0.95330  0.09574  0.33044  0.35003  ENCSR000AKE_2_1
  16  1.08106  0.13214  0.37501  0.42033  ENCSR000DRW_2_1
  17  1.37785 
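
To summarize this table in Python rather than reading it by eye, here is a minimal parsing sketch. It uses the first three rows shown above as inline sample data. I am assuming the first column is the target index and the last is the target label; the four numeric columns in between follow the order given in the description of acc.txt.

```python
from statistics import mean

# First three rows of the acc.txt output shown above.
ACC_SAMPLE = """\
   0  2.61342  0.06309  0.25821  0.21923  ENCSR000EJD_3_1
   1  2.21471  0.13869  0.42051  0.31831  ENCSR000EMT_2_1
   2  1.38732  0.14672  0.40900  0.32414  ENCSR000EMT_1_1
"""


def parse_acc(text):
    """Parse acc.txt-style text into a list of dicts, one per target."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        rows.append({
            "index": int(fields[0]),   # assumed: target index
            "loss": float(fields[1]),
            "r2": float(fields[2]),
            "log_r2": float(fields[3]),
            "spearman": float(fields[4]),
            "label": fields[5],        # assumed: target label
        })
    return rows


acc_rows = parse_acc(ACC_SAMPLE)
best = max(acc_rows, key=lambda r: r["spearman"])
mean_r2 = mean(r["r2"] for r in acc_rows)
print("best Spearman:", best["label"], "mean R2:", round(mean_r2, 5))
```

For the real file, pass `open('output/heart_test/acc.txt').read()` to `parse_acc` instead of the sample string.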

*output/heart_test/peaks.txt* is a table specifying the number of peaks called, AUROC, and AUPRC for each dataset.

In [18]:
! cat output/heart_test/peaks.txt

   0     627  0.62060  0.21187
   1     194  0.73505  0.22926
   2     124  0.80325  0.24490
   3     867  0.65384  0.28815
   4     644  0.68224  0.27852
   5     267  0.77985  0.33800
   6     343  0.73019  0.17062
   7     191  0.75358  0.11675
   8     143  0.76431  0.28520
   9       3  0.68574  0.00182
  10     350  0.77412  0.37582
  11     184  0.60722  0.05175
  12     295  0.78291  0.19798
  13     324  0.88868  0.40744
  14     130  0.64108  0.06255
  15     289  0.77005  0.14116
  16     273  0.78886  0.18852
  17     201  0.82412  0.37455
  18     116  0.82394  0.39929
  19      98  0.64931  0.02932
  20     189  0.84032  0.33949
  21     108  0.84197  0.43734
  22      95  0.96172  0.57619
  23     104  0.95983  0.58563
  24     145  0.73452  0.07188
  25     182  0.81640  0.30829
  26      55  0.65197  0.03704
  27      94  0.82064  0.30024
  28     202  0.72931  0.08968
  29     117  0.73809  0.05621
  30     468  0.70399  0.22005
  31     318  0.88671  0.39525
  32    
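
Some targets have very few called peaks (target 9 has only 3), so their AUROC and AUPRC rest on a handful of positive examples and deserve extra caution. Here is a minimal sketch that flags such targets, using three rows from the output above as sample data. I am assuming the first column is the target index, and the cutoff of 50 peaks is an arbitrary choice for illustration.

```python
# Rows 8-10 of the peaks.txt output shown above.
PEAKS_SAMPLE = """\
   8     143  0.76431  0.28520
   9       3  0.68574  0.00182
  10     350  0.77412  0.37582
"""


def parse_peaks(text):
    """Parse peaks.txt-style text into a list of dicts, one per target."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        rows.append({
            "index": int(fields[0]),  # assumed: target index
            "peaks": int(fields[1]),
            "auroc": float(fields[2]),
            "auprc": float(fields[3]),
        })
    return rows


def few_peaks(rows, min_peaks=50):
    """Return the rows with fewer than min_peaks called peaks."""
    return [r for r in rows if r["peaks"] < min_peaks]


peak_rows = parse_peaks(PEAKS_SAMPLE)
low = few_peaks(peak_rows)
print("targets with few peaks:", [r["index"] for r in low])
```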

The directories *pr*, *roc*, *violin*, and *scatter* in *output/heart_test* contain plots for the targets indexed by 0, 1, and 2, as specified by the --ai option above.

E.g.

In [17]:
from IPython.display import IFrame
IFrame('output/heart_test/pr/t0.pdf', width=600, height=500)